Title: test_jy_internal deadlock on Windows
Type: Severity: normal
Components: Core Versions: Jython 2.7
Status: closed Resolution: duplicate
Dependencies: Superseder:
Assigned To: Nosy List: darjus, jamesmudd, stefan.richthofer
Priority: Keywords: patch

Created on 2017-03-06.22:13:06 by jamesmudd, last changed 2017-04-14.18:52:26 by stefan.richthofer.

File name Uploaded Description Edit Remove
jstack_output.txt jamesmudd, 2017-03-06.22:12:18 jstack output
deadlock_patch.diff jamesmudd, 2017-03-09.17:40:53
msg11195 (view) Author: James Mudd (jamesmudd) Date: 2017-03-06.22:12:18
I have just observed another test deadlock in test_jy_internal, this time on Windows. Here is the abridged jstack output.

Found one Java-level deadlock:
  waiting for ownable synchronizer 0x00000000eab642c8, (a java.util.concurrent.locks.ReentrantLock$NonfairSync),
  which is held by "MainThread"
  waiting to lock monitor 0x000000001e423238 (object 0x00000000eb578e28, a java.lang.Class),
  which is held by "Finalizer"

The full output is attached.
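The pattern in the jstack output above is a classic lock-ordering deadlock: MainThread holds a ReentrantLock and waits for a monitor, while the Finalizer thread holds that monitor and waits for the lock. A minimal hypothetical sketch of that interleaving (not Jython's actual code; tryLock with a timeout is used here so the demo reports the conflict instead of hanging):

```java
import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.ReentrantLock;

// Hypothetical reproduction of the lock-ordering pattern in the jstack
// output: one thread holds a ReentrantLock and then wants a monitor;
// the other holds the monitor and wants the lock.
public class OrderingDemo {
    static final ReentrantLock lock = new ReentrantLock();
    static final Object monitor = new Object();

    public static void main(String[] args) throws InterruptedException {
        Thread finalizerLike = new Thread(() -> {
            synchronized (monitor) {           // takes the monitor first
                try {
                    // MainThread still owns the lock, so this times out.
                    if (!lock.tryLock(200, TimeUnit.MILLISECONDS)) {
                        System.out.println("would deadlock: lock held elsewhere");
                    } else {
                        lock.unlock();
                    }
                } catch (InterruptedException ignored) {}
            }
        });
        lock.lock();                           // main takes the lock first
        try {
            finalizerLike.start();
            finalizerLike.join();
        } finally {
            lock.unlock();
        }
    }
}
```

With a plain lock() instead of tryLock(), the two threads would block forever, which is the state jstack reported.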
msg11197 (view) Author: Stefan Richthofer (stefan.richthofer) Date: 2017-03-07.01:39:07
One side note: looking at the jstack output, it seems your local file paths appear for .py files (see #2535). I thought this had been fixed as of . Do you have an idea why this fails in your case?

Regarding the deadlock, I suspect it is kind of a hidden deadlock, see

Please run

jstack -l

next time. The -l option includes more info about "ownable synchronizers".

Also, it is maybe related to
(probably all our recently observed deadlocks are; they all involve fromClass, createType and such). The main problem with analyzing this is that the message

waiting for ownable synchronizer 0x00000000eab642c8, (a java.util.concurrent.locks.ReentrantLock$NonfairSync),
  which is held by "MainThread"

doesn't tell where and how MainThread got this lock. (Maybe it's a read-lock on what PyType.getClassToType() returns.) The link above explains that read-locks don't show up in jstack output, so this is likely involved.
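On the point about read-locks: a ReentrantReadWriteLock read lock only records a hold count, not an owning thread, so jstack has no owner to attribute it to. A small sketch illustrating this with the standard java.util.concurrent API (not Jython code):

```java
import java.util.concurrent.locks.ReentrantReadWriteLock;

// A read lock tracks only a count of holders, not which thread holds it,
// which is why jstack -l cannot attribute read locks to an owner.
public class ReadLockVisibility {
    public static void main(String[] args) {
        ReentrantReadWriteLock rw = new ReentrantReadWriteLock();
        rw.readLock().lock();
        System.out.println("read holds by current thread: " + rw.getReadHoldCount());
        // The write lock, by contrast, does record its owner.
        System.out.println("write locked by current thread: " + rw.isWriteLockedByCurrentThread());
        rw.readLock().unlock();
    }
}
```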
msg11198 (view) Author: James Mudd (jamesmudd) Date: 2017-03-07.09:14:47
@Stefan Thanks for the comments. I did have the jstack -l output but didn't save it, sorry. As for the file paths, I will have a look at why.

About the deadlocks: are these likely to affect real uses, or do we just see them in the tests? If we are seeing them at all, I guess that's an issue, but I wondered about real uses. After a 5 minute glance at it looks like a good idea, but as always with synchronization it's all about the details. I will try to take a better look at that as well. Maybe I can recreate the situation with the debugger. I'm not sure if it's a read-lock here; jstack can actually see the deadlock, so I'm hoping it's a bit clearer. I think I had another test hang, probably with a deadlock, and jstack couldn't find that one, so maybe there is a read-lock issue as well. Always tricky to debug!
msg11205 (view) Author: Darjus Loktevic (darjus) Date: 2017-03-08.05:53:56
Hey James, can you try this: and let me know if it makes the issue go away? I know it's not that easy to repeat, but would appreciate if you did a bunch of runs.
msg11209 (view) Author: Stefan Richthofer (stefan.richthofer) Date: 2017-03-08.23:25:22
I looked a bit deeper into this and other recent deadlock issues.
I will focus on test_logging from now on, because that one is easiest to reproduce (it occurs most frequently).

Anyway, all involve guava data structures (for caching) and ReentrantLock. Under some rare conditions it looks like a guava data structure does not immediately release the lock when the thread finishes the call. In this issue it happens in PyType.fromClass, right before the sync-block. Under no circumstances should a thread attempting to enter the sync-block be owning a lock on LazyClassToTypeHolder.classToType; all calls on it should be done and unlocked. But in our deadlock scenario exactly that must be the case.

- Do we somehow misuse guava?
- Or is it a bug in guava?
- Or what do I overlook?

I am going to build my own guava fork for debugging purposes, so I can access the internal locks. However, I will do this for test_logging for better reproducibility; there the issue is with locking on org.python.modules.sre.SRE_STATE$CACHE.cache. It looks like the same cause: MainThread leaves a method of the cache without the lock being properly released.
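The invariant described above (no thread should still own the cache's lock when it reaches the sync-block) can be checked defensively with ReentrantLock.isHeldByCurrentThread(). A hypothetical sketch; the names cacheLock and enterSyncBlock are illustrative, not Jython's actual code:

```java
import java.util.concurrent.locks.ReentrantLock;

// Hypothetical defensive check for the invariant: a thread about to
// enter the sync-block should not still own the cache's ReentrantLock.
public class InvariantCheck {
    static final ReentrantLock cacheLock = new ReentrantLock();

    static void enterSyncBlock() {
        // In the deadlock scenario described above, this check would fire.
        if (cacheLock.isHeldByCurrentThread()) {
            throw new IllegalStateException("cache lock leaked into sync-block");
        }
        System.out.println("invariant holds");
    }

    public static void main(String[] args) {
        enterSyncBlock();                  // fine: lock not held
        cacheLock.lock();
        try {
            try {
                enterSyncBlock();          // simulated leak: lock still held
            } catch (IllegalStateException e) {
                System.out.println("detected: " + e.getMessage());
            }
        } finally {
            cacheLock.unlock();
        }
    }
}
```

Such an assertion, placed before the sync-block in a debug build, would pinpoint which call path leaks the lock.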
msg11212 (view) Author: James Mudd (jamesmudd) Date: 2017-03-09.17:40:53
Using the idea suggested by @Darjus, I removed the synchronized block at the start of PyType.fromClass, line 1544. Then I ran 1000 repeats of test_logging (the test which most readily deadlocks) and it completed OK without any issues. So I would suggest that seems to fix the problem. The patch is attached; maybe others could try this.

@Darjus I didn't make the change suggested to . Could you suggest how that might help?
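For reference, removing a synchronized block around cache population usually relies on the map itself handling concurrent insertion; a common idiom is ConcurrentMap.putIfAbsent, where the first writer wins and racing callers all see the same value. A hypothetical sketch under that assumption (this is not the attached patch, nor Jython's actual fromClass code):

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

// Hypothetical sketch: instead of a synchronized block guarding cache
// population, rely on putIfAbsent so concurrent callers race safely.
public class LockFreeCache {
    static final ConcurrentMap<Class<?>, String> classToType = new ConcurrentHashMap<>();

    static String fromClass(Class<?> c) {
        String existing = classToType.get(c);
        if (existing != null) {
            return existing;                       // fast path: already cached
        }
        String created = "type:" + c.getSimpleName(); // stand-in for type creation
        String raced = classToType.putIfAbsent(c, created);
        return raced != null ? raced : created;    // first writer wins
    }

    public static void main(String[] args) {
        System.out.println(fromClass(String.class));
        System.out.println(fromClass(String.class)); // served from cache
    }
}
```

The trade-off is that type creation may run more than once under a race, with all but one result discarded, which matters if creation has side effects.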
msg11214 (view) Author: Stefan Richthofer (stefan.richthofer) Date: 2017-03-10.17:28:04
I am not convinced that this fixes it; the deadlocks may just become much less likely. AFAICT the deadlock in test_logging is not related to fromClass, but to ReentrantLock and (maybe) guava. That's the common factor between these.

- note that Darjus' experimental fix leaves fromClass non-thread-safe for testing purposes
- for me, test_logging also always passed so far when run individually hundreds of times in a shell-script loop; it only deadlocks sometimes in the context of the test suite
- to reproduce it in a "convenient" way, run the test suite frequently only up to test_logging, e.g. by adjusting Lib/test/; set tests = tests[:275] in line 377.
msg11215 (view) Author: James Mudd (jamesmudd) Date: 2017-03-10.18:25:28
@Stefan I am also not convinced this fixes it. Today I ran the same 1000 iterations with the synchronized block in place and it also passed. So running test_logging individually does not seem to trigger this, as you note. I will try the approach you suggest.
msg11271 (view) Author: Stefan Richthofer (stefan.richthofer) Date: 2017-03-25.04:22:34
I think that ultimately also explains this issue.
msg11309 (view) Author: Stefan Richthofer (stefan.richthofer) Date: 2017-04-14.18:52:26
This issue most likely has a common cause with #2536.
#2536 was (empirically) solved as a consequence of
(higher default stack size / -Xss)
A detailed fix of #2536 is of low priority.

Closing this issue as duplicate of #2536.

Reopen if this issue pops up again even after
Date User Action Args
2017-04-14 18:52:26 stefan.richthofer set status: open -> closed
resolution: duplicate
messages: + msg11309
2017-03-25 04:22:34 stefan.richthofer set messages: + msg11271
2017-03-10 18:25:28 jamesmudd set messages: + msg11215
2017-03-10 17:28:05 stefan.richthofer set messages: + msg11214
2017-03-09 17:40:54 jamesmudd set files: + deadlock_patch.diff
keywords: + patch
messages: + msg11212
2017-03-08 23:25:23 stefan.richthofer set messages: + msg11209
2017-03-08 05:53:56 darjus set messages: + msg11205
2017-03-07 09:14:48 jamesmudd set messages: + msg11198
2017-03-07 01:39:08 stefan.richthofer set nosy: + darjus, stefan.richthofer
messages: + msg11197
2017-03-06 22:13:07 jamesmudd create