Issue2487

classification

Title:	PyType.fromClass publication bug on slow systems (travis, circleci)
Type:		Severity:	critical
Components:	Core	Versions:	Jython 2.7
		Milestone:	Jython 2.7.2

process

Status:	closed	Resolution:	fixed
Dependencies:		Superseder:
Assigned To:	jeff.allen	Nosy List:	amoebam, darjus, gbach, jamesmudd, jeff.allen, stefan.richthofer, zyasoft
Priority:	urgent	Keywords:

Created on 2016-03-18.00:08:39 by darjus, last changed 2020-02-01.09:16:21 by jeff.allen.

Files
File name	Uploaded	Description	Edit	Remove
thread-dump-2462.txt	darjus, 2016-03-18.00:08:36
issue2487.tar.gz	gbach, 2019-11-18.14:35:58	files to reproduce error

Messages
msg10823 (view)	Author: Jim Baker (zyasoft)	Date: 2016-03-18.02:43:55
Fixed as of https://hg.python.org/jython/rev/cffe5b824a21
msg10828 (view)	Author: Darjus Loktevic (darjus)	Date: 2016-03-19.07:58:49
Mar 19, 2016 6:56:12 PM io.netty.channel.ChannelInitializer exceptionCaught WARNING: Failed to initialize a channel. Closing: [id: 0xabe1bf32, /127.0.0.1:63059 => /127.0.0.1:63054] Traceback (most recent call last): File "/Users/darjus/Documents/jython/dist/Lib/_socket.py", line 639, in initChannel child._ensure_post_connect() File "/Users/darjus/Documents/jython/dist/Lib/_socket.py", line 1421, in _ensure_post_connect self._post_connect() File "/Users/darjus/Documents/jython/dist/Lib/_socket.py", line 877, in _post_connect self.channel.pipeline().addLast(self.python_inbound_handler) SystemError: __getattribute__ not found on type DefaultChannelPipeline Looks like there's a bug in the new synchronization code. Found it while running some unrelated stuff.
msg10854 (view)	Author: Stefan Richthofer (stefan.richthofer)	Date: 2016-05-08.18:35:52
Hey. Is there a way to reproduce this issue (on a fast system)? Does someone have jstack output available for the deadlock?
msg10906 (view)	Author: Jim Baker (zyasoft)	Date: 2016-08-24.20:10:11
Fixed by https://hg.python.org/jython/rev/f3458ef83e08
msg11296 (view)	Author: James Mudd (jamesmudd)	Date: 2017-04-04.22:09:28
Not sure if its actually related to this problem or not as its closed but a CI build just experienced something related to this. See: https://travis-ci.org/jythontools/jython/jobs/218647190
msg11297 (view)	Author: Jim Baker (zyasoft)	Date: 2017-04-04.23:24:50
It is in fact a bug, a classic object publication/data race to be precise. The problem is that it is possible for a Java type to be published by one thread in classToType, as held by PyType#LazyClassToTypeHolder; but for another thread to simultaneously see/lookup this type for the first time. We used to synchronize on this map, but that caused deadlocks (the original bug here). Then we introduced polling with the bug fix (https://hg.python.org/jython/rev/f3458ef83e08), but we realized it was a hack, and linked it to this bug. Probably the right way to fix this bug is to use a scheme actually seen in Lib/_sockets.py - use associated condition variables to mark when the publication is finished. This should be close to the current code: continue polling on the existence of __getattribute__, but wait for the specific publication. So we can start with a ConcurrentMap<Class<?>, Condition>, publish it from one thread the specific condition variable using #putIfAbsent, and cv wait for it in https://github.com/jythontools/jython/blob/master/src/org/python/core/Deriveds.java#L51
msg11524 (view)	Author: Jeff Allen (jeff.allen)	Date: 2017-08-05.12:19:50
After backing out (on a scratch repo only, so far) the sets https://hg.python.org/jython/rev/cffe5b824a21 and https://hg.python.org/jython/rev/f3458ef83e08, in the interests of #2609, I ought to find I've re-instated this issue. Is there a sure-fire way to reproduce it? However, I'm not sure the original problem these change sets sought to solve is what we think ... The synchronisation on PyType looks ok[1] to me, even though the two locks seem to have got entangled, according to the informative thread dump from Darjus. In that, "Jython-Netty-Child-60" is doing a reasonable thing. It has entered the monitor on PyType.class and is patiently waiting here: https://hg.python.org/jython/file/1a14d360dc8d/src/org/python/core/PyType.java#l1412, to acquire the lock on the class_to_type map. "MainThread" meanwhile is waiting to enter the monitor on PyType.class (good), but is holding the lock on on the class_to_type map (huh?). This ought not to be possible, since the monitor protects access to the class_to_type map. It looks like somehow, "MainThread" has previously exited the monitor still holding the lock. We've a nice explanation of how that can happen in #2536. --- [1] Actually, there is one place class_to_type map escapes synchronisation, and that's here: https://hg.python.org/jython/file/1a14d360dc8d/src/org/python/core/PyType.java#l1473 I think that's an oversight. And it's not involved here.
msg11525 (view)	Author: Jim Baker (zyasoft)	Date: 2017-08-05.19:34:56
I agree with Jeff's analysis. Clearly the necessary cycle of waiting to get a deadlock was caused by what #2536 identified, finally blocks for releasing internal locks not being called due to out of memory errors. (This scenario violated the contract that we assumed through the usage of ConcurrentHashMap.) Other than that issue, we have had no reports of other deadlocks in this specific code; and I also do not see any deadlock path in the use of class_to_type. We should also synchronize fromClassSkippingInners. This change restores correctness. Next, let's look at our caching, and its corresponding performance. Using weak keys/weak values in class_to_type means that when working with Java objects (calling methods, accessing attributes), Python code that does not maintain a strong reference to the Java class will cause the Jython runtime to constantly rebuild these entries in class_to_type. Here are the ways that such strong references would be established: 1. exposedTypes (so Jython's internals) https://hg.python.org/jython/file/1a14d360dc8d/src/org/python/core/PyType.java#l104 2. imports of Java classes into a Python namespace 3. references to `type(java_obj)` 4. subclassing of a Java class by a Python class References to Java objects, including objects from factory methods; indirect construction, such as `java.util.HashMap()`; and callbacks (Netty, re/json/strptime) would not introduce such references to Python types for Java classes. But calling methods, using attributes, or otherwise using getting the Python type on the Java object, would require this entry in class_to_type to be added, potentially to be potentially quickly discarded by generational GC. Computing these entries is relatively expensive, especially if a given class C has inner classes (and this is a general type graph expansion). See also the analysis in https://bugs.openjdk.java.net/browse/JDK-6389107 (scenario 3); our mapping in class_to_type is a common pattern seen in Java systems. What if instead of current one-level cache, we had a two-level cache? Level 1 uses weak key/weak value for the class/type entries (C, T); (C, T) entries that are expired from Level 1 are moved to Level 2; Level 2 uses an expiration time so a ClassLoader could be unloaded in say some reasonable amount of time. To implement: 1. Level 1 cache is a direct replacment of class_to_type with CacheBuilder weak keys/weak values/removal listener. Also we will not attempt to expose it as a Map going forward; adjust class_to_type usage accordingly. 2. Level 2 cache is weak keys/strong values/expires after write. We can tune this expiration as was done with the regex cache and the python.sre.cachespec option, https://github.com/jythontools/jython/blob/master/src/org/python/modules/sre/SRE_STATE.java#L1253). 3. For removalListener: (C, T) entries that are removed from Level 1 will use removalListener to get placed in Level 2 (safe because Level 1 is concurrent; and obviously hard refs would prevent it from being removed while under construction). 4. Try both caches when looking up the type T for class C. Because synchronized, there's no visible inconsistency; if no entry for C because it was removed, simply recompute T'. There is a small window of time that it could be in the process of being moved to Level 2, but this does not affect correctness; (C, T) will be eventually expired from Level 2 and (C, T') is a valid repllacement for it (because there can be no hard refs to T outside the level 2 cache itself). Lastly, https://docs.oracle.com/javase/7/docs/api/java/lang/ClassValue.html looks potentially useful as an alternative for publishing PyType objects into a cache. But we need to address two key questions: 1. Whether ClassValue#computeValue would work properly in the re-entrant case where the class graph from a given class C is explored, as is done for inners, possibly referring back to C. This is why we could not get the publication of the type in the map done after the init (as one would usually want to do!) in https://github.com/jythontools/jython/blob/master/src/org/python/core/PyType.java#L1515 (putIfAbsent); class C would be referenced, and it would attempt to build again (and stack overflow). No solution I tried could break this problem. 2. When to call ClassValue#removeValue ! We still have to keep track of the cache with respect to PyType.
msg11646 (view)	Author: Jeff Allen (jeff.allen)	Date: 2017-11-04.20:31:47
I'd like to believe we fixed this with #2609 in https://hg.python.org/jython/rev/b03972313073, and that the reworking of PyType in https://hg.python.org/jython/rev/55756721fa02 has addressed concerns people might have about needless locking. The critical similarity is a type not having an attribute you know it does, and the explanation is that you looked before we finished constructing it. Can anyone produce evidence to the contrary?
msg12770 (view)	Author: Gunter (gbach)	Date: 2019-11-13.13:50:07
Sorry to inform you that the bug showed up today on a jython2.7.1/java1.8 installation. It is a tomcat-based webapplication. The error shows up for a specific time-span (50sec for me) for many or all requests and then disappears. Right now it is always at the same line of code: class_name = layout_dom.attributes.getNamedItem("class").getValue() layout_dom beeing a org.w3c.dom.Element Is there maybe a work-around? Like using layout_dom.getAttribute("class") thus avoiding the "magic" attribute-access to "attributes"? Thx Gunter
msg12772 (view)	Author: Jeff Allen (jeff.allen)	Date: 2019-11-13.18:47:31
This issue (fix) is old, but 2.7.1 is older. We would be really interested if you could test 2.7.2b2 on your system, where we like to believe it is fixed.
msg12776 (view)	Author: Gunter (gbach)	Date: 2019-11-18.14:35:58
I made a short script (test.py) that is able to reproduce this error. Attached the needed files. In test.py are 2 links to get the needed jar-files (I reproduced the error with the help of an older version of SAXParser - that is the context where the error occurs in my production system) I created also two shell-scripts to start the tests. You must adapt the paths but this should be straight forward. test271 yields the error "SystemError: __getattribute__ not found on type null. See http://bugs.jython.org/issue2487 for details." after a few seconds to minutes. test272b2 yields another one "AttributeError: 'NoneType' object has no attribute 'getValue'" (but maybe the root-cause is the old issue) after several hours. Interestingly enough the error either occurs "en masse" or does not at all. Also if you do not send stdout somewhere else (e.g. /dev/null(), the error seems not to occur in this context. Happy hunting!
msg12838 (view)	Author: Jeff Allen (jeff.allen)	Date: 2019-12-14.14:42:04
Thanks for taking so much care to reproduce it and trying 2.7.2b2. I'll take a look. I think test271 "__getattribute__ not found on type null" sounds like the original bug (here or #2609), or rather sounds like 2.7.1 with the first attempted fix. The symptom was use of a type while functions were still being filled in. That's fixed. test272b2 gives "AttributeError: 'NoneType' object has no attribute 'getValue'", which I think means that getNamedItem("class") returned null (as None), which iot does to signify there is no attribute called "class". So the question is why that would suddenly happen in one thread among thousands doing identical work. It may be a race but it does not indicate publication of an incomplete type, so I've opened #2848.
msg12848 (view)	Author: Gunter (gbach)	Date: 2019-12-16.20:14:30
Hi Jeff, Thank YOU taking care of jython! Cheers, Gunter Am 14.12.2019 um 15:42 schrieb Jeff Allen: > > Jeff Allen <ja.py@farowl.co.uk> added the comment: > > Thanks for taking so much care to reproduce it and trying 2.7.2b2. I'll take a look. > > I think test271 "__getattribute__ not found on type null" sounds like the original bug (here or #2609), or rather sounds like 2.7.1 with the first attempted fix. The symptom was use of a type while functions were still being filled in. That's fixed. > > test272b2 gives "AttributeError: 'NoneType' object has no attribute 'getValue'", which I think means that getNamedItem("class") returned null (as None), which iot does to signify there is no attribute called "class". So the question is why that would suddenly happen in one thread among thousands doing identical work. > > It may be a race but it does not indicate publication of an incomplete type, so I've opened #2848. > > _______________________________________ > Jython tracker <report@bugs.jython.org> > <https://bugs.jython.org/issue2487> > _______________________________________ >
msg12964 (view)	Author: Jeff Allen (jeff.allen)	Date: 2020-02-01.09:16:21
Stays closed, see work under #2834 (and #2609).

History
Date	User	Action	Args
2020-02-01 09:16:21	jeff.allen	set	messages: + msg12964
2019-12-16 20:14:31	gbach	set	messages: + msg12848
2019-12-14 14:42:04	jeff.allen	set	messages: + msg12838
2019-11-18 14:35:58	gbach	set	files: + issue2487.tar.gz messages: + msg12776
2019-11-13 18:47:31	jeff.allen	set	messages: + msg12772
2019-11-13 13:50:07	gbach	set	nosy: + gbach messages: + msg12770
2018-11-04 14:57:01	jeff.allen	set	status: pending -> closed resolution: accepted -> fixed
2017-11-04 20:31:48	jeff.allen	set	status: open -> pending assignee: jeff.allen messages: + msg11646
2017-11-04 15:17:50	amoebam	set	nosy: + amoebam
2017-08-05 19:35:04	zyasoft	set	milestone: Jython 2.7.1 -> Jython 2.7.2
2017-08-05 19:34:58	zyasoft	set	messages: + msg11525
2017-08-05 12:19:52	jeff.allen	set	nosy: + jeff.allen messages: + msg11524
2017-04-04 23:25:59	zyasoft	set	title: PyType.fromClass deadlocks on slow systems (circleci for example) -> PyType.fromClass publication bug on slow systems (travis, circleci)
2017-04-04 23:25:12	zyasoft	set	priority: immediate -> urgent assignee: darjus -> (no value) status: closed -> open resolution: fixed -> accepted
2017-04-04 23:24:51	zyasoft	set	messages: + msg11297
2017-04-04 22:09:28	jamesmudd	set	nosy: + jamesmudd messages: + msg11296
2016-08-24 20:10:11	zyasoft	set	status: open -> closed messages: + msg10906
2016-05-08 18:35:53	stefan.richthofer	set	nosy: + stefan.richthofer messages: + msg10854
2016-03-19 07:58:50	darjus	set	status: pending -> open messages: + msg10828
2016-03-18 02:43:55	zyasoft	set	status: open -> pending resolution: fixed messages: + msg10823 nosy: + zyasoft
2016-03-18 00:08:39	darjus	create