Issue2487

classification
Title: PyType.fromClass publication bug on slow systems (travis, circleci)
Type: Severity: critical
Components: Core Versions: Jython 2.7
Milestone: Jython 2.7.2
process
Status: closed Resolution: fixed
Dependencies: Superseder:
Assigned To: jeff.allen Nosy List: amoebam, darjus, gbach, jamesmudd, jeff.allen, stefan.richthofer, zyasoft
Priority: urgent Keywords:

Created on 2016-03-18.00:08:39 by darjus, last changed 2020-02-01.09:16:21 by jeff.allen.

Files
File name Uploaded Description Edit Remove
thread-dump-2462.txt darjus, 2016-03-18.00:08:36
issue2487.tar.gz gbach, 2019-11-18.14:35:58 files to reproduce error
Messages
msg10823 (view) Author: Jim Baker (zyasoft) Date: 2016-03-18.02:43:55
Fixed as of https://hg.python.org/jython/rev/cffe5b824a21
msg10828 (view) Author: Darjus Loktevic (darjus) Date: 2016-03-19.07:58:49
Mar 19, 2016 6:56:12 PM io.netty.channel.ChannelInitializer exceptionCaught
WARNING: Failed to initialize a channel. Closing: [id: 0xabe1bf32, /127.0.0.1:63059 => /127.0.0.1:63054]
Traceback (most recent call last):
  File "/Users/darjus/Documents/jython/dist/Lib/_socket.py", line 639, in initChannel
    child._ensure_post_connect()
  File "/Users/darjus/Documents/jython/dist/Lib/_socket.py", line 1421, in _ensure_post_connect
    self._post_connect()
  File "/Users/darjus/Documents/jython/dist/Lib/_socket.py", line 877, in _post_connect
    self.channel.pipeline().addLast(self.python_inbound_handler)
SystemError: __getattribute__ not found on type DefaultChannelPipeline


Looks like there's a bug in the new synchronization code. Found it while running some unrelated stuff.
msg10854 (view) Author: Stefan Richthofer (stefan.richthofer) Date: 2016-05-08.18:35:52
Hey. Is there a way to reproduce this issue (on a fast system)?
Does someone have jstack output available for the deadlock?
msg10906 (view) Author: Jim Baker (zyasoft) Date: 2016-08-24.20:10:11
Fixed by https://hg.python.org/jython/rev/f3458ef83e08
msg11296 (view) Author: James Mudd (jamesmudd) Date: 2017-04-04.22:09:28
Not sure if its actually related to this problem or not as its closed but a CI build just experienced something related to this. See: https://travis-ci.org/jythontools/jython/jobs/218647190
msg11297 (view) Author: Jim Baker (zyasoft) Date: 2017-04-04.23:24:50
It is in fact a bug, a classic object publication/data race to be precise. The problem is that it is possible for a Java type to be published by one thread in classToType, as held by PyType#LazyClassToTypeHolder; but for another thread to simultaneously see/lookup this type for the first time.

We used to synchronize on this map, but that caused deadlocks (the original bug here). Then we introduced polling with the bug fix (https://hg.python.org/jython/rev/f3458ef83e08), but we realized it was a hack, and linked it to this bug.

Probably the right way to fix this bug is to use a scheme actually seen in Lib/_sockets.py - use associated condition variables to mark when the publication is finished. This should be close to the current code: continue polling on the existence of __getattribute__, but wait for the specific publication. So we can start with a ConcurrentMap<Class<?>, Condition>, publish it from one thread the specific condition variable using #putIfAbsent, and cv wait for it in https://github.com/jythontools/jython/blob/master/src/org/python/core/Deriveds.java#L51
msg11524 (view) Author: Jeff Allen (jeff.allen) Date: 2017-08-05.12:19:50
After backing out (on a scratch repo only, so far) the sets https://hg.python.org/jython/rev/cffe5b824a21 and https://hg.python.org/jython/rev/f3458ef83e08, in the interests of #2609, I ought to find I've re-instated this issue.

Is there a sure-fire way to reproduce it?

However, I'm not sure the original problem these change sets sought to solve is what we think ... The synchronisation on PyType looks ok[1] to me, even though the two locks seem to have got entangled, according to the informative thread dump from Darjus.

In that, "Jython-Netty-Child-60" is doing a reasonable thing. It has entered the monitor on PyType.class and is patiently waiting here:
https://hg.python.org/jython/file/1a14d360dc8d/src/org/python/core/PyType.java#l1412,
to acquire the lock on the class_to_type map.

"MainThread" meanwhile is waiting to enter the monitor on PyType.class (good), but is holding the lock on on the class_to_type map (huh?). This ought not to be possible, since the monitor protects access to the class_to_type map. It looks like somehow, "MainThread" has previously exited the monitor still holding the lock.

We've a nice explanation of how that can happen in #2536.


---

[1] Actually, there *is* one place class_to_type map escapes synchronisation, and that's here:
https://hg.python.org/jython/file/1a14d360dc8d/src/org/python/core/PyType.java#l1473
I think that's an oversight. And it's not involved here.
msg11525 (view) Author: Jim Baker (zyasoft) Date: 2017-08-05.19:34:56
I agree with Jeff's analysis. Clearly the necessary cycle of waiting
to get a deadlock was caused by what #2536 identified, finally blocks
for releasing internal locks not being called due to out of memory
errors. (This scenario violated the contract that we assumed through
the usage of ConcurrentHashMap.) Other than that issue, we have had no
reports of other deadlocks in this specific code; and I also do not
see any deadlock path in the use of class_to_type. We should also
synchronize fromClassSkippingInners.

This change restores correctness.

Next, let's look at our caching, and its corresponding
performance. Using weak keys/weak values in class_to_type means that
when working with Java objects (calling methods, accessing
attributes), Python code that does not maintain a strong reference to
the Java class will cause the Jython runtime to constantly rebuild
these entries in class_to_type. Here are the ways that such strong
references would be established:

1. exposedTypes (so Jython's internals)
https://hg.python.org/jython/file/1a14d360dc8d/src/org/python/core/PyType.java#l104

2. imports of Java classes into a Python namespace

3. references to `type(java_obj)`

4. subclassing of a Java class by a Python class

References to Java *objects*, including objects from factory methods;
indirect construction, such as `java.util.HashMap()`; and callbacks
(Netty, re/json/strptime) would not introduce such references to
Python types for Java *classes*. But calling methods, using
attributes, or otherwise using getting the Python type on the Java
object, would require this entry in class_to_type to be added,
potentially to be potentially quickly discarded by generational
GC. Computing these entries is relatively expensive, especially if a
given class C has inner classes (and this is a general type graph
expansion).

See also the analysis in https://bugs.openjdk.java.net/browse/JDK-6389107
(scenario 3); our mapping in class_to_type is a common pattern seen in
Java systems.

What if instead of current one-level cache, we had a two-level cache?
Level 1 uses weak key/weak value for the class/type entries (C, T);
(C, T) entries that are expired from Level 1 are moved to Level 2;
Level 2 uses an expiration time so a ClassLoader could be unloaded in
say some reasonable amount of time.

To implement:

1. Level 1 cache is a direct replacment of class_to_type with
   CacheBuilder weak keys/weak values/removal listener. Also we will
   not attempt to expose it as a Map going forward; adjust
   class_to_type usage accordingly.

2. Level 2 cache is weak keys/strong values/expires after write. We
   can tune this expiration as was done with the regex cache and the
   python.sre.cachespec option,
   https://github.com/jythontools/jython/blob/master/src/org/python/modules/sre/SRE_STATE.java#L1253).

3. For removalListener: (C, T) entries that are removed from Level 1
   will use removalListener to get placed in Level 2 (safe because
   Level 1 is concurrent; and obviously hard refs would prevent it
   from being removed while under construction).

4. Try both caches when looking up the type T for class C. Because
   synchronized, there's no visible inconsistency; if no entry for C
   because it was removed, simply recompute T'. There is a small
   window of time that it could be in the process of being moved to
   Level 2, but this does not affect correctness; (C, T) will be
   eventually expired from Level 2 and (C, T') is a valid repllacement
   for it (because there can be no hard refs to T outside the level 2
   cache itself).

Lastly, https://docs.oracle.com/javase/7/docs/api/java/lang/ClassValue.html
looks potentially useful as an alternative for publishing PyType
objects into a cache. But we need to address two key questions:

1. Whether ClassValue#computeValue would work properly in the
   re-entrant case where the class graph from a given class C is
   explored, as is done for inners, possibly referring back to C. This
   is why we could not get the publication of the type in the map done
   after the init (as one would usually want to do!) in
   https://github.com/jythontools/jython/blob/master/src/org/python/core/PyType.java#L1515
   (putIfAbsent); class C would be referenced, and it would attempt to
   build again (and stack overflow). No solution I tried could break
   this problem.

2. When to call ClassValue#removeValue ! We still have to keep track
   of the cache with respect to PyType.
msg11646 (view) Author: Jeff Allen (jeff.allen) Date: 2017-11-04.20:31:47
I'd like to believe we fixed this with #2609 in https://hg.python.org/jython/rev/b03972313073, and that the reworking of PyType in https://hg.python.org/jython/rev/55756721fa02 has addressed concerns people might have about needless locking.

The critical similarity is a type not having an attribute you know it does, and the explanation is that you looked before we finished constructing it.

Can anyone produce evidence to the contrary?
msg12770 (view) Author: Gunter (gbach) Date: 2019-11-13.13:50:07
Sorry to inform you that the bug showed up today on a jython2.7.1/java1.8 installation.
It is a tomcat-based webapplication. The error shows up for a specific time-span (50sec for me) for many or all requests and then disappears. Right now it is always at the same line of code:

        class_name = layout_dom.attributes.getNamedItem("class").getValue()

layout_dom beeing a org.w3c.dom.Element


Is there maybe a work-around? Like using 

layout_dom.getAttribute("class")

thus avoiding the "magic" attribute-access to "attributes"?

Thx
Gunter
msg12772 (view) Author: Jeff Allen (jeff.allen) Date: 2019-11-13.18:47:31
This issue (fix) is old, but 2.7.1 is older.

We would be really interested if you could test 2.7.2b2 on your system, where we like to believe it is fixed.
msg12776 (view) Author: Gunter (gbach) Date: 2019-11-18.14:35:58
I made a short script (test.py) that is able to reproduce this error.
Attached the needed files.
In test.py are 2 links to get the needed jar-files (I reproduced the error with the help of an older version of SAXParser - that is the context where the error occurs in my production system)
I created also two shell-scripts to start the tests.
You must adapt the paths but this should be straight forward.

test271 yields the error "SystemError: __getattribute__ not found on type null. See http://bugs.jython.org/issue2487 for details." after a few seconds to minutes.
test272b2 yields another one "AttributeError: 'NoneType' object has no attribute 'getValue'" (but maybe the root-cause is the old issue) after several hours.
Interestingly enough the error either occurs "en masse" or does not at all.
Also if you do not send stdout somewhere else (e.g. /dev/null(), the error seems not to occur in this context.

Happy hunting!
msg12838 (view) Author: Jeff Allen (jeff.allen) Date: 2019-12-14.14:42:04
Thanks for taking so much care to reproduce it and trying 2.7.2b2. I'll take a look.

I think test271 "__getattribute__ not found on type null" sounds like the original bug (here or #2609), or rather sounds like 2.7.1 with the first attempted fix. The symptom was use of a type while functions were still being filled in. That's fixed. 

test272b2 gives "AttributeError: 'NoneType' object has no attribute 'getValue'", which I think means that getNamedItem("class") returned null (as None), which iot does to signify there is no attribute called "class". So the question is why that would suddenly happen in one thread among thousands doing identical work.

It may be a race but it does not indicate publication of an incomplete type, so I've opened #2848.
msg12848 (view) Author: Gunter (gbach) Date: 2019-12-16.20:14:30
Hi Jeff,

Thank YOU taking care of jython!

Cheers,
Gunter

Am 14.12.2019 um 15:42 schrieb Jeff Allen:
> 
> Jeff Allen <ja.py@farowl.co.uk> added the comment:
> 
> Thanks for taking so much care to reproduce it and trying 2.7.2b2. I'll take a look.
> 
> I think test271 "__getattribute__ not found on type null" sounds like the original bug (here or #2609), or rather sounds like 2.7.1 with the first attempted fix. The symptom was use of a type while functions were still being filled in. That's fixed. 
> 
> test272b2 gives "AttributeError: 'NoneType' object has no attribute 'getValue'", which I think means that getNamedItem("class") returned null (as None), which iot does to signify there is no attribute called "class". So the question is why that would suddenly happen in one thread among thousands doing identical work.
> 
> It may be a race but it does not indicate publication of an incomplete type, so I've opened #2848.
> 
> _______________________________________
> Jython tracker <report@bugs.jython.org>
> <https://bugs.jython.org/issue2487>
> _______________________________________
>
msg12964 (view) Author: Jeff Allen (jeff.allen) Date: 2020-02-01.09:16:21
Stays closed, see work under #2834 (and #2609).
History
Date User Action Args
2020-02-01 09:16:21jeff.allensetmessages: + msg12964
2019-12-16 20:14:31gbachsetmessages: + msg12848
2019-12-14 14:42:04jeff.allensetmessages: + msg12838
2019-11-18 14:35:58gbachsetfiles: + issue2487.tar.gz
messages: + msg12776
2019-11-13 18:47:31jeff.allensetmessages: + msg12772
2019-11-13 13:50:07gbachsetnosy: + gbach
messages: + msg12770
2018-11-04 14:57:01jeff.allensetstatus: pending -> closed
resolution: accepted -> fixed
2017-11-04 20:31:48jeff.allensetstatus: open -> pending
assignee: jeff.allen
messages: + msg11646
2017-11-04 15:17:50amoebamsetnosy: + amoebam
2017-08-05 19:35:04zyasoftsetmilestone: Jython 2.7.1 -> Jython 2.7.2
2017-08-05 19:34:58zyasoftsetmessages: + msg11525
2017-08-05 12:19:52jeff.allensetnosy: + jeff.allen
messages: + msg11524
2017-04-04 23:25:59zyasoftsettitle: PyType.fromClass deadlocks on slow systems (circleci for example) -> PyType.fromClass publication bug on slow systems (travis, circleci)
2017-04-04 23:25:12zyasoftsetpriority: immediate -> urgent
assignee: darjus -> (no value)
status: closed -> open
resolution: fixed -> accepted
2017-04-04 23:24:51zyasoftsetmessages: + msg11297
2017-04-04 22:09:28jamesmuddsetnosy: + jamesmudd
messages: + msg11296
2016-08-24 20:10:11zyasoftsetstatus: open -> closed
messages: + msg10906
2016-05-08 18:35:53stefan.richthofersetnosy: + stefan.richthofer
messages: + msg10854
2016-03-19 07:58:50darjussetstatus: pending -> open
messages: + msg10828
2016-03-18 02:43:55zyasoftsetstatus: open -> pending
resolution: fixed
messages: + msg10823
nosy: + zyasoft
2016-03-18 00:08:39darjuscreate