Created on 2016-03-18.00:08:39 by darjus, last changed 2017-11-04.20:31:48 by jeff.allen.
|msg10823 (view)||Author: Jim Baker (zyasoft)||Date: 2016-03-18.02:43:55|
Fixed as of https://hg.python.org/jython/rev/cffe5b824a21
|msg10828 (view)||Author: Darjus Loktevic (darjus)||Date: 2016-03-19.07:58:49|
Mar 19, 2016 6:56:12 PM io.netty.channel.ChannelInitializer exceptionCaught WARNING: Failed to initialize a channel. Closing: [id: 0xabe1bf32, /127.0.0.1:63059 => /127.0.0.1:63054] Traceback (most recent call last): File "/Users/darjus/Documents/jython/dist/Lib/_socket.py", line 639, in initChannel child._ensure_post_connect() File "/Users/darjus/Documents/jython/dist/Lib/_socket.py", line 1421, in _ensure_post_connect self._post_connect() File "/Users/darjus/Documents/jython/dist/Lib/_socket.py", line 877, in _post_connect self.channel.pipeline().addLast(self.python_inbound_handler) SystemError: __getattribute__ not found on type DefaultChannelPipeline Looks like there's a bug in the new synchronization code. Found it while running some unrelated stuff.
|msg10854 (view)||Author: Stefan Richthofer (stefan.richthofer)||Date: 2016-05-08.18:35:52|
Hey. Is there a way to reproduce this issue (on a fast system)? Does someone have jstack output available for the deadlock?
|msg10906 (view)||Author: Jim Baker (zyasoft)||Date: 2016-08-24.20:10:11|
|msg11296 (view)||Author: James Mudd (jamesmudd)||Date: 2017-04-04.22:09:28|
Not sure if its actually related to this problem or not as its closed but a CI build just experienced something related to this. See: https://travis-ci.org/jythontools/jython/jobs/218647190
|msg11297 (view)||Author: Jim Baker (zyasoft)||Date: 2017-04-04.23:24:50|
It is in fact a bug, a classic object publication/data race to be precise. The problem is that it is possible for a Java type to be published by one thread in classToType, as held by PyType#LazyClassToTypeHolder; but for another thread to simultaneously see/lookup this type for the first time. We used to synchronize on this map, but that caused deadlocks (the original bug here). Then we introduced polling with the bug fix (https://hg.python.org/jython/rev/f3458ef83e08), but we realized it was a hack, and linked it to this bug. Probably the right way to fix this bug is to use a scheme actually seen in Lib/_sockets.py - use associated condition variables to mark when the publication is finished. This should be close to the current code: continue polling on the existence of __getattribute__, but wait for the specific publication. So we can start with a ConcurrentMap<Class<?>, Condition>, publish it from one thread the specific condition variable using #putIfAbsent, and cv wait for it in https://github.com/jythontools/jython/blob/master/src/org/python/core/Deriveds.java#L51
|msg11524 (view)||Author: Jeff Allen (jeff.allen)||Date: 2017-08-05.12:19:50|
After backing out (on a scratch repo only, so far) the sets https://hg.python.org/jython/rev/cffe5b824a21 and https://hg.python.org/jython/rev/f3458ef83e08, in the interests of #2609, I ought to find I've re-instated this issue. Is there a sure-fire way to reproduce it? However, I'm not sure the original problem these change sets sought to solve is what we think ... The synchronisation on PyType looks ok to me, even though the two locks seem to have got entangled, according to the informative thread dump from Darjus. In that, "Jython-Netty-Child-60" is doing a reasonable thing. It has entered the monitor on PyType.class and is patiently waiting here: https://hg.python.org/jython/file/1a14d360dc8d/src/org/python/core/PyType.java#l1412, to acquire the lock on the class_to_type map. "MainThread" meanwhile is waiting to enter the monitor on PyType.class (good), but is holding the lock on on the class_to_type map (huh?). This ought not to be possible, since the monitor protects access to the class_to_type map. It looks like somehow, "MainThread" has previously exited the monitor still holding the lock. We've a nice explanation of how that can happen in #2536. ---  Actually, there *is* one place class_to_type map escapes synchronisation, and that's here: https://hg.python.org/jython/file/1a14d360dc8d/src/org/python/core/PyType.java#l1473 I think that's an oversight. And it's not involved here.
|msg11525 (view)||Author: Jim Baker (zyasoft)||Date: 2017-08-05.19:34:56|
I agree with Jeff's analysis. Clearly the necessary cycle of waiting to get a deadlock was caused by what #2536 identified, finally blocks for releasing internal locks not being called due to out of memory errors. (This scenario violated the contract that we assumed through the usage of ConcurrentHashMap.) Other than that issue, we have had no reports of other deadlocks in this specific code; and I also do not see any deadlock path in the use of class_to_type. We should also synchronize fromClassSkippingInners. This change restores correctness. Next, let's look at our caching, and its corresponding performance. Using weak keys/weak values in class_to_type means that when working with Java objects (calling methods, accessing attributes), Python code that does not maintain a strong reference to the Java class will cause the Jython runtime to constantly rebuild these entries in class_to_type. Here are the ways that such strong references would be established: 1. exposedTypes (so Jython's internals) https://hg.python.org/jython/file/1a14d360dc8d/src/org/python/core/PyType.java#l104 2. imports of Java classes into a Python namespace 3. references to `type(java_obj)` 4. subclassing of a Java class by a Python class References to Java *objects*, including objects from factory methods; indirect construction, such as `java.util.HashMap()`; and callbacks (Netty, re/json/strptime) would not introduce such references to Python types for Java *classes*. But calling methods, using attributes, or otherwise using getting the Python type on the Java object, would require this entry in class_to_type to be added, potentially to be potentially quickly discarded by generational GC. Computing these entries is relatively expensive, especially if a given class C has inner classes (and this is a general type graph expansion). See also the analysis in https://bugs.openjdk.java.net/browse/JDK-6389107 (scenario 3); our mapping in class_to_type is a common pattern seen in Java systems. What if instead of current one-level cache, we had a two-level cache? Level 1 uses weak key/weak value for the class/type entries (C, T); (C, T) entries that are expired from Level 1 are moved to Level 2; Level 2 uses an expiration time so a ClassLoader could be unloaded in say some reasonable amount of time. To implement: 1. Level 1 cache is a direct replacment of class_to_type with CacheBuilder weak keys/weak values/removal listener. Also we will not attempt to expose it as a Map going forward; adjust class_to_type usage accordingly. 2. Level 2 cache is weak keys/strong values/expires after write. We can tune this expiration as was done with the regex cache and the python.sre.cachespec option, https://github.com/jythontools/jython/blob/master/src/org/python/modules/sre/SRE_STATE.java#L1253). 3. For removalListener: (C, T) entries that are removed from Level 1 will use removalListener to get placed in Level 2 (safe because Level 1 is concurrent; and obviously hard refs would prevent it from being removed while under construction). 4. Try both caches when looking up the type T for class C. Because synchronized, there's no visible inconsistency; if no entry for C because it was removed, simply recompute T'. There is a small window of time that it could be in the process of being moved to Level 2, but this does not affect correctness; (C, T) will be eventually expired from Level 2 and (C, T') is a valid repllacement for it (because there can be no hard refs to T outside the level 2 cache itself). Lastly, https://docs.oracle.com/javase/7/docs/api/java/lang/ClassValue.html looks potentially useful as an alternative for publishing PyType objects into a cache. But we need to address two key questions: 1. Whether ClassValue#computeValue would work properly in the re-entrant case where the class graph from a given class C is explored, as is done for inners, possibly referring back to C. This is why we could not get the publication of the type in the map done after the init (as one would usually want to do!) in https://github.com/jythontools/jython/blob/master/src/org/python/core/PyType.java#L1515 (putIfAbsent); class C would be referenced, and it would attempt to build again (and stack overflow). No solution I tried could break this problem. 2. When to call ClassValue#removeValue ! We still have to keep track of the cache with respect to PyType.
|msg11646 (view)||Author: Jeff Allen (jeff.allen)||Date: 2017-11-04.20:31:47|
I'd like to believe we fixed this with #2609 in https://hg.python.org/jython/rev/b03972313073, and that the reworking of PyType in https://hg.python.org/jython/rev/55756721fa02 has addressed concerns people might have about needless locking. The critical similarity is a type not having an attribute you know it does, and the explanation is that you looked before we finished constructing it. Can anyone produce evidence to the contrary?
|2017-11-04 20:31:48||jeff.allen||set||status: open -> pending|
messages: + msg11646
|2017-11-04 15:17:50||amoebam||set||nosy: + amoebam|
|2017-08-05 19:35:04||zyasoft||set||milestone: Jython 2.7.1 -> Jython 2.7.2|
|2017-08-05 19:34:58||zyasoft||set||messages: + msg11525|
messages: + msg11524
|2017-04-04 23:25:59||zyasoft||set||title: PyType.fromClass deadlocks on slow systems (circleci for example) -> PyType.fromClass publication bug on slow systems (travis, circleci)|
|2017-04-04 23:25:12||zyasoft||set||priority: immediate -> urgent|
assignee: darjus -> (no value)
status: closed -> open
resolution: fixed -> accepted
|2017-04-04 23:24:51||zyasoft||set||messages: + msg11297|
messages: + msg11296
|2016-08-24 20:10:11||zyasoft||set||status: open -> closed|
messages: + msg10906
messages: + msg10854
|2016-03-19 07:58:50||darjus||set||status: pending -> open|
messages: + msg10828
|2016-03-18 02:43:55||zyasoft||set||status: open -> pending|
messages: + msg10823
nosy: + zyasoft