Issue2536

classification
Title: deadlocks in regrtests due to StackOverflowError in finally block
Type: Severity: minor
Components: Core Versions:
Milestone:
process
Status: open Resolution: remind
Dependencies: Superseder:
Assigned To: Nosy List: jamesmudd, jeff.allen, stefan.richthofer, zyasoft
Priority: low Keywords:

Created on 2016-12-01.14:43:27 by stefan.richthofer, last changed 2017-05-24.17:33:19 by stefan.richthofer.

Files
File name Uploaded Description Edit Remove
test_logging_hangs.txt stefan.richthofer, 2016-12-01.14:43:25 full jstack-output while test_logging deadlocks.
test_logging_undetected_deadlock.txt jamesmudd, 2017-03-09.08:11:39 undetected deadlock
Test.java stefan.richthofer, 2017-03-27.16:46:45 How to bypass finally-blocks
Test2.java stefan.richthofer, 2017-03-30.10:51:11 How to bypass finally-blocks ... actually
test_pythoninterpreter_jy_hangs.txt stefan.richthofer, 2017-05-04.21:45:40 Also test_pythoninterpreter_jy can be target of this issue.
test_select_hanging.txt stefan.richthofer, 2017-05-19.00:51:39
test_java_integration_deadlock.txt jamesmudd, 2017-05-23.22:00:43
Messages
msg10996 (view) Author: Stefan Richthofer (stefan.richthofer) Date: 2016-12-01.14:43:25
Running on Linux Mint 18, 64 bit, with Java 8, I observe a deadlock in test_logging.py from time to time. It only occurs when running with 'ant regrtest'; the test itself gives no output or error, regrtest just hangs indefinitely.

jstack reports a deadlock (full output attached):


Found one Java-level deadlock:
=============================
"Thread-262":
  waiting for ownable synchronizer 0x00000000e06df030, (a java.util.concurrent.locks.ReentrantLock$NonfairSync),
  which is held by "MainThread"
"MainThread":
  waiting for ownable synchronizer 0x00000000e0787518, (a java.util.concurrent.locks.ReentrantLock$NonfairSync),
  which is held by "Thread-262"

Java stack information for the threads listed above:
===================================================
"Thread-262":
	at sun.misc.Unsafe.park(Native Method)
	- parking to wait for  <0x00000000e06df030> (a java.util.concurrent.locks.ReentrantLock$NonfairSync)
	at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
	at java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:836)
	at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireQueued(AbstractQueuedSynchronizer.java:870)
	at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquire(AbstractQueuedSynchronizer.java:1199)
	at java.util.concurrent.locks.ReentrantLock$NonfairSync.lock(ReentrantLock.java:209)
	at java.util.concurrent.locks.ReentrantLock.lock(ReentrantLock.java:285)
	at com.google.common.cache.LocalCache$Segment.lockedGetOrLoad(LocalCache.java:2228)
	at com.google.common.cache.LocalCache$Segment.get(LocalCache.java:2208)
	at com.google.common.cache.LocalCache.get(LocalCache.java:4053)
	at com.google.common.cache.LocalCache.getOrLoad(LocalCache.java:4057)
	at com.google.common.cache.LocalCache$LocalLoadingCache.get(LocalCache.java:4986)
	at com.google.common.cache.LocalCache$LocalLoadingCache.getUnchecked(LocalCache.java:4992)
	at org.python.modules.sre.SRE_STATE$CACHE.get(SRE_STATE.java:1290)
	at org.python.modules.sre.SRE_STATE$CACHE.access$000(SRE_STATE.java:1253)
	at org.python.modules.sre.SRE_STATE.<init>(SRE_STATE.java:1295)
	at org.python.modules.sre.PatternObject.match(PatternObject.java:80)
	at org.python.modules.sre.PatternObject$match_exposer.__call__(Unknown Source)
	at org.python.core.PyObject.__call__(PyObject.java:480)
	at org.python.core.PyObject.__call__(PyObject.java:484)
	at logging.config$py.valid_ident$10(logging/config.py:275)
	at logging.config$py.call_function(logging/config.py)
	at org.python.core.PyTableCode.call(PyTableCode.java:167)
	at org.python.core.PyBaseCode.call(PyBaseCode.java:139)
	at org.python.core.PyFunction.__call__(PyFunction.java:413)
	at logging.config$py.configure_handler$33(logging/config.py:732)
	at logging.config$py.call_function(logging/config.py)
	at org.python.core.PyTableCode.call(PyTableCode.java:167)
	at org.python.core.PyBaseCode.call(PyBaseCode.java:154)
	at org.python.core.PyFunction.__call__(PyFunction.java:423)
	at org.python.core.PyMethod.__call__(PyMethod.java:141)
	at logging.config$py.configure$29(logging/config.py:640)
	at logging.config$py.call_function(logging/config.py)
	at org.python.core.PyTableCode.call(PyTableCode.java:167)
	at org.python.core.PyBaseCode.call(PyBaseCode.java:139)
	at org.python.core.PyFunction.__call__(PyFunction.java:413)
	at org.python.core.PyMethod.__call__(PyMethod.java:126)
	at logging.config$py.dictConfig$38(logging/config.py:777)
	at logging.config$py.call_function(logging/config.py)
	at org.python.core.PyTableCode.call(PyTableCode.java:167)
	at org.python.core.PyBaseCode.call(PyBaseCode.java:139)
	at org.python.core.PyFunction.__call__(PyFunction.java:413)
	at logging.config$py.handle$41(logging/config.py:840)
	at logging.config$py.call_function(logging/config.py)
	at org.python.core.PyTableCode.call(PyTableCode.java:167)
	at org.python.core.PyBaseCode.call(PyBaseCode.java:139)
	at org.python.core.PyFunction.__call__(PyFunction.java:413)
	at org.python.core.PyMethod.__call__(PyMethod.java:126)
	at SocketServer$py.__init__$47(SocketServer.py:659)
	at SocketServer$py.call_function(SocketServer.py)
	at org.python.core.PyTableCode.call(PyTableCode.java:167)
	at org.python.core.PyBaseCode.call(PyBaseCode.java:308)
	at org.python.core.PyBaseCode.call(PyBaseCode.java:199)
	at org.python.core.PyFunction.__call__(PyFunction.java:482)
	at org.python.core.PyFunction.__call__(PyFunction.java:476)
	at org.python.core.PyInstance.__init__(PyInstance.java:124)
	at org.python.core.PyClass.__call__(PyClass.java:193)
	at org.python.core.PyObject.__call__(PyObject.java:515)
	at org.python.core.PyObject.__call__(PyObject.java:519)
	at SocketServer$py.finish_request$13(SocketServer.py:334)
	at SocketServer$py.call_function(SocketServer.py)
	at org.python.core.PyTableCode.call(PyTableCode.java:167)
	at org.python.core.PyBaseCode.call(PyBaseCode.java:171)
	at org.python.core.PyFunction.__call__(PyFunction.java:434)
	at org.python.core.PyMethod.__call__(PyMethod.java:156)
	at SocketServer$py.process_request_thread$36(SocketServer.py:605)
	at SocketServer$py.call_function(SocketServer.py)
	at org.python.core.PyTableCode.call(PyTableCode.java:167)
	at org.python.core.PyBaseCode.call(PyBaseCode.java:308)
	at org.python.core.PyBaseCode.call(PyBaseCode.java:199)
	at org.python.core.PyFunction.__call__(PyFunction.java:482)
	at org.python.core.PyMethod.instancemethod___call__(PyMethod.java:237)
	at org.python.core.PyMethod.__call__(PyMethod.java:228)
	at org.python.core.PyMethod.__call__(PyMethod.java:223)
	at org.python.core.PyObject._callextra(PyObject.java:620)
	at threading$py.run$35(threading.py:213)
	at threading$py.call_function(threading.py)
	at org.python.core.PyTableCode.call(PyTableCode.java:167)
	at org.python.core.PyBaseCode.call(PyBaseCode.java:139)
	at org.python.core.PyFunction.__call__(PyFunction.java:413)
	at org.python.core.PyMethod.__call__(PyMethod.java:126)
	at threading$py._Thread__bootstrap$36(threading.py:266)
	at threading$py.call_function(threading.py)
	at org.python.core.PyTableCode.call(PyTableCode.java:167)
	at org.python.core.PyBaseCode.call(PyBaseCode.java:308)
	at org.python.core.PyBaseCode.call(PyBaseCode.java:199)
	at org.python.core.PyFunction.__call__(PyFunction.java:482)
	at org.python.core.PyMethod.instancemethod___call__(PyMethod.java:237)
	at org.python.core.PyMethod.__call__(PyMethod.java:228)
	at org.python.core.PyMethod.__call__(PyMethod.java:218)
	at org.python.core.PyMethod.__call__(PyMethod.java:213)
	at org.python.core.FunctionThread.run(FunctionThread.java:25)
"MainThread":
	at sun.misc.Unsafe.park(Native Method)
	- parking to wait for  <0x00000000e0787518> (a java.util.concurrent.locks.ReentrantLock$NonfairSync)
	at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
	at java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:836)
	at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireQueued(AbstractQueuedSynchronizer.java:870)
	at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquire(AbstractQueuedSynchronizer.java:1199)
	at java.util.concurrent.locks.ReentrantLock$NonfairSync.lock(ReentrantLock.java:209)
	at java.util.concurrent.locks.ReentrantLock.lock(ReentrantLock.java:285)
	at org.python.modules._threading.RLock.RLock_acquire(RLock.java:42)
	at org.python.modules._threading.RLock$RLock_acquire_exposer.__call__(Unknown Source)
	at org.python.core.PyObject.__call__(PyObject.java:468)
	at logging$py._acquireLock$6(logging/__init__.py:219)
	at logging$py.call_function(logging/__init__.py)
	at org.python.core.PyTableCode.call(PyTableCode.java:167)
	at org.python.core.PyBaseCode.call(PyBaseCode.java:125)
	at org.python.core.PyFunction.__call__(PyFunction.java:403)
	at logging.config$py.stopListening$48(logging/config.py:906)
	at logging.config$py.call_function(logging/config.py)
	at org.python.core.PyTableCode.call(PyTableCode.java:167)
	at org.python.core.PyBaseCode.call(PyBaseCode.java:125)
	at org.python.core.PyFunction.__call__(PyFunction.java:403)
	at test.test_logging$py.setup_via_listener$84(test/test_logging.py:1809)
	at test.test_logging$py.call_function(test/test_logging.py)
	at org.python.core.PyTableCode.call(PyTableCode.java:167)
	at org.python.core.PyBaseCode.call(PyBaseCode.java:154)
	at org.python.core.PyFunction.__call__(PyFunction.java:423)
	at org.python.core.PyMethod.__call__(PyMethod.java:141)
	at test.test_logging$py.test_listen_config_10_ok$85(test/test_logging.py:1825)
	at test.test_logging$py.call_function(test/test_logging.py)
	at org.python.core.PyTableCode.call(PyTableCode.java:167)
	at org.python.core.PyBaseCode.call(PyBaseCode.java:139)
	at org.python.core.PyFunction.__call__(PyFunction.java:413)
	at org.python.core.PyMethod.__call__(PyMethod.java:126)
	at unittest.case$py.run$35(unittest/case.py:377)
	at unittest.case$py.call_function(unittest/case.py)
	at org.python.core.PyTableCode.call(PyTableCode.java:167)
	at org.python.core.PyBaseCode.call(PyBaseCode.java:308)
	at org.python.core.PyBaseCode.call(PyBaseCode.java:199)
	at org.python.core.PyFunction.__call__(PyFunction.java:482)
	at org.python.core.PyMethod.instancemethod___call__(PyMethod.java:237)
	at org.python.core.PyMethod.__call__(PyMethod.java:228)
	at org.python.core.PyMethod.__call__(PyMethod.java:223)
	at org.python.core.PyObject._callextra(PyObject.java:620)
	at unittest.case$py.__call__$37(unittest/case.py:396)
	at unittest.case$py.call_function(unittest/case.py)
	at org.python.core.PyTableCode.call(PyTableCode.java:167)
	at org.python.core.PyBaseCode.call(PyBaseCode.java:308)
	at org.python.core.PyBaseCode.call(PyBaseCode.java:199)
	at org.python.core.PyFunction.__call__(PyFunction.java:482)
	at org.python.core.PyMethod.instancemethod___call__(PyMethod.java:237)
	at org.python.core.PyMethod.__call__(PyMethod.java:228)
	at org.python.core.PyMethod.__call__(PyMethod.java:223)
	at org.python.core.PyObjectDerived.__call__(PyObjectDerived.java:1030)
	at org.python.core.PyObject.__call__(PyObject.java:480)
	at org.python.core.PyObject.__call__(PyObject.java:484)
	at unittest.suite$py.run$16(unittest/suite.py:116)
	at unittest.suite$py.call_function(unittest/suite.py)
	at org.python.core.PyTableCode.call(PyTableCode.java:167)
	at org.python.core.PyBaseCode.call(PyBaseCode.java:308)
	at org.python.core.PyBaseCode.call(PyBaseCode.java:199)
	at org.python.core.PyFunction.__call__(PyFunction.java:482)
	at org.python.core.PyMethod.instancemethod___call__(PyMethod.java:237)
	at org.python.core.PyMethod.__call__(PyMethod.java:228)
	at org.python.core.PyMethod.__call__(PyMethod.java:223)
	at org.python.core.PyObject._callextra(PyObject.java:620)
	at unittest.suite$py.__call__$13(unittest/suite.py:70)
	at unittest.suite$py.call_function(unittest/suite.py)
	at org.python.core.PyTableCode.call(PyTableCode.java:167)
	at org.python.core.PyBaseCode.call(PyBaseCode.java:308)
	at org.python.core.PyBaseCode.call(PyBaseCode.java:199)
	at org.python.core.PyFunction.__call__(PyFunction.java:482)
	at org.python.core.PyMethod.instancemethod___call__(PyMethod.java:237)
	at org.python.core.PyMethod.__call__(PyMethod.java:228)
	at org.python.core.PyMethod.__call__(PyMethod.java:223)
	at org.python.core.PyObjectDerived.__call__(PyObjectDerived.java:1030)
	at org.python.core.PyObject.__call__(PyObject.java:480)
	at org.python.core.PyObject.__call__(PyObject.java:484)
	at unittest.suite$py.run$16(unittest/suite.py:116)
	at unittest.suite$py.call_function(unittest/suite.py)
	at org.python.core.PyTableCode.call(PyTableCode.java:167)
	at org.python.core.PyBaseCode.call(PyBaseCode.java:308)
	at org.python.core.PyBaseCode.call(PyBaseCode.java:199)
	at org.python.core.PyFunction.__call__(PyFunction.java:482)
	at org.python.core.PyMethod.instancemethod___call__(PyMethod.java:237)
	at org.python.core.PyMethod.__call__(PyMethod.java:228)
	at org.python.core.PyMethod.__call__(PyMethod.java:223)
	at org.python.core.PyObject._callextra(PyObject.java:620)
	at unittest.suite$py.__call__$13(unittest/suite.py:70)
	at unittest.suite$py.call_function(unittest/suite.py)
	at org.python.core.PyTableCode.call(PyTableCode.java:167)
	at org.python.core.PyBaseCode.call(PyBaseCode.java:308)
	at org.python.core.PyBaseCode.call(PyBaseCode.java:199)
	at org.python.core.PyFunction.__call__(PyFunction.java:482)
	at org.python.core.PyMethod.instancemethod___call__(PyMethod.java:237)
	at org.python.core.PyMethod.__call__(PyMethod.java:228)
	at org.python.core.PyMethod.__call__(PyMethod.java:223)
	at org.python.core.PyObjectDerived.__call__(PyObjectDerived.java:1030)
	at org.python.core.PyObject.__call__(PyObject.java:480)
	at org.python.core.PyObject.__call__(PyObject.java:484)
	at test.junit_xml$py.run$3(test/junit_xml.py:27)
	at test.junit_xml$py.call_function(test/junit_xml.py)
	at org.python.core.PyTableCode.call(PyTableCode.java:167)
	at org.python.core.PyBaseCode.call(PyBaseCode.java:154)
	at org.python.core.PyFunction.__call__(PyFunction.java:423)
	at org.python.core.PyMethod.__call__(PyMethod.java:141)
	at test.test_support$py._run_suite$102(test/test_support.py:1303)
	at test.test_support$py.call_function(test/test_support.py)
	at org.python.core.PyTableCode.call(PyTableCode.java:167)
	at org.python.core.PyBaseCode.call(PyBaseCode.java:308)
	at org.python.core.PyBaseCode.call(PyBaseCode.java:132)
	at org.python.core.PyFunction.__call__(PyFunction.java:413)
	at test.test_support$py.run_unittest$103(test/test_support.py:1320)
	at test.test_support$py.call_function(test/test_support.py)
	at org.python.core.PyTableCode.call(PyTableCode.java:167)
	at org.python.core.PyBaseCode.call(PyBaseCode.java:308)
	at org.python.core.PyFunction.function___call__(PyFunction.java:471)
	at org.python.core.PyFunction.__call__(PyFunction.java:466)
	at org.python.core.PyFunction.__call__(PyFunction.java:456)
	at test.test_logging$py.test_main$96(test/test_logging.py:1927)
	at test.test_logging$py.call_function(test/test_logging.py)
	at org.python.core.PyTableCode.call(PyTableCode.java:167)
	at org.python.core.PyBaseCode.call(PyBaseCode.java:308)
	at org.python.core.PyFunction.function___call__(PyFunction.java:471)
	at org.python.core.PyFunction.__call__(PyFunction.java:466)
	at org.python.core.PyFunction.__call__(PyFunction.java:461)
	at org.python.core.PyObject._callextra(PyObject.java:620)
	at test.test_support$py.inner$83(test/test_support.py:1095)
	at test.test_support$py.call_function(test/test_support.py)
	at org.python.core.PyTableCode.call(PyTableCode.java:167)
	at org.python.core.PyBaseCode.call(PyBaseCode.java:308)
	at org.python.core.PyBaseCode.call(PyBaseCode.java:119)
	at org.python.core.PyFunction.__call__(PyFunction.java:403)
	at org.python.pycode._pyx0.runtest_inner$5(test/regrtest.py:676)
	at org.python.pycode._pyx0.call_function(test/regrtest.py)
	at org.python.core.PyTableCode.call(PyTableCode.java:167)
	at org.python.core.PyBaseCode.call(PyBaseCode.java:308)
	at org.python.core.PyFunction.function___call__(PyFunction.java:471)
	at org.python.core.PyFunction.__call__(PyFunction.java:466)
	at org.python.core.PyFunction.__call__(PyFunction.java:456)
	at org.python.pycode._pyx0.runtest$4(test/regrtest.py:566)
	at org.python.pycode._pyx0.call_function(test/regrtest.py)
	at org.python.core.PyTableCode.call(PyTableCode.java:167)
	at org.python.core.PyBaseCode.call(PyBaseCode.java:308)
	at org.python.core.PyFunction.function___call__(PyFunction.java:471)
	at org.python.core.PyFunction.__call__(PyFunction.java:466)
	at org.python.core.PyFunction.__call__(PyFunction.java:456)
	at org.python.pycode._pyx0.main$2(test/regrtest.py:503)
	at org.python.pycode._pyx0.call_function(test/regrtest.py)
	at org.python.core.PyTableCode.call(PyTableCode.java:167)
	at org.python.core.PyBaseCode.call(PyBaseCode.java:308)
	at org.python.core.PyBaseCode.call(PyBaseCode.java:119)
	at org.python.core.PyFunction.__call__(PyFunction.java:403)
	at org.python.pycode._pyx0.f$0(test/regrtest.py:1558)
	at org.python.pycode._pyx0.call_function(test/regrtest.py)
	at org.python.core.PyTableCode.call(PyTableCode.java:167)
	at org.python.core.PyCode.call(PyCode.java:18)
	at org.python.core.Py.runCode(Py.java:1401)
	at org.python.util.PythonInterpreter.execfile(PythonInterpreter.java:296)
	at org.python.util.jython.run(jython.java:362)
	at org.python.util.jython.main(jython.java:142)

Found 1 deadlock.
msg11145 (view) Author: James Mudd (jamesmudd) Date: 2017-02-28.21:26:21
I can confirm I have also just seen this happen on Ubuntu 16.04 64 bit with Java 1.8.0_121
msg11162 (view) Author: Stefan Richthofer (stefan.richthofer) Date: 2017-03-03.12:32:10
Also observed it on Windows 10 right now.
msg11210 (view) Author: James Mudd (jamesmudd) Date: 2017-03-09.08:11:39
Have just seen test_logging deadlock again but slightly different. Using master, which includes reverting commit 41d8823 to back out the synchronization in _socket.py and ssl.py.

Since this was reverted it does seem to be better and now I see a deadlock jstack can't detect. Maybe a read lock? jstack -l output attached.
msg11234 (view) Author: Stefan Richthofer (stefan.richthofer) Date: 2017-03-15.23:41:54
After more study of this issue I found out that somehow test_json is preparing the ground for this deadlock. It sometimes causes guava not to release a certain lock on a cache, which later leads to this deadlock. test_scanstring in particular seems to cause this; I am not yet sure how this can happen.

However it would be helpful if others could confirm the following:

Exclude test_json from regrtests (e.g. insert tests.remove('test_json') into Lib/test/regrtest.py at line 377. There is surely a more proper way to exclude a certain test, feel free to use that one!)

Then, test_logging should not deadlock any more.
msg11236 (view) Author: Jim Baker (zyasoft) Date: 2017-03-16.00:11:17
Very interesting debugging detail!

Note that the json module has Java acceleration support in org.python.modules._json, but otherwise has pure Python implementation; this acceleration can be readily turned off by removing this entry: https://github.com/jythontools/jython/blob/master/src/org/python/modules/Setup.java#L38

So we can see if it's this acceleration causing the problem or not.
msg11270 (view) Author: Stefan Richthofer (stefan.richthofer) Date: 2017-03-25.04:15:54
I think I finally understood the cause of this issue - it is crazily subtle. Good news is that it was already (accidentally) fixed as a side-effect of https://hg.python.org/jython/rev/17e40de9a541 (I wondered why it never occurred in recent trunk-versions).
Bad news is that some things silently go awfully wrong and I don't see a proper way to detect this on Jython-side.

So, the whole story of this bug... here we go.

guava uses reentrant locks to secure all sorts of access to data structures. It uses finally blocks to unlock such locks. Note that - as locks are reentrant - it can occur that finally-enclosed calls are nested.

This bug originates in finally-blocks not being executed; thus some locks fail to unlock and later cause deadlocks. While this can happen in various scenarios, it typically involves guava and the bypassed unlock typically happens in test_json (i.e. in org.python.modules.sre.SRE_STATE$CACHE, which is backed by com.google.common.cache.LocalCache).

But how can a finally-block be bypassed?

https://docs.oracle.com/javase/tutorial/essential/exceptions/finally.html
tells us it can be caused by thread interruption or System.exit, but I verified that neither is the case here. The document fails to mention that an OutOfMemoryError or StackOverflowError can also cause an enclosing finally block to be skipped:

http://stackoverflow.com/questions/22969000/finally-block-will-be-executed-in-case-of-outofmemoryerror
http://stackoverflow.com/questions/7753443/behaviour-of-jvm-during-out-of-memory-error-list-s-new-arrayliststring
http://stackoverflow.com/questions/17836120/stack-overflow-error-handling-in-finally-block

Normally that doesn't matter much, since everything blows up anyway. However, recall that in guava, calls enclosed by finally blocks are sometimes nested. In that case an outer handler can catch the error (and is often able to, because memory may be freed while the error propagates up the stack) and then resume normal execution - with the slight detail that the inner finally block was bypassed and the lock count is now wrong. The nasty thing here is that no error is displayed - things go wrong silently. Also, it is somewhat random how much of the stack is skipped before normal execution resumes - sometimes even two unlocks are dropped at a time.
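
To make the shape concrete, here is a contrived minimal sketch (made-up names, not guava code) of a reentrant lock released in nested finally blocks inside a deep recursion, with the error eventually swallowed at a shallower frame - exactly the kind of structure that can leave the lock held:

import java.util.concurrent.locks.ReentrantLock;

public class NestedFinallySketch {
    static final ReentrantLock lock = new ReentrantLock();

    static void step(int depth) {
        lock.lock();                 // reentrant acquire, one hold per frame
        try {
            step(depth + 1);         // eventually throws StackOverflowError
        } finally {
            lock.unlock();           // runs while the stack is still nearly full; a further
                                     // StackOverflowError raised here can skip the release
        }
    }

    public static void main(String[] args) {
        try {
            step(0);
        } catch (StackOverflowError e) {
            // "Normal" execution resumes here, like Jython's Throwable handling does.
        }
        System.out.println("hold count after recovery: " + lock.getHoldCount());
        // Anything other than 0 means unlocks were bypassed; another thread calling
        // lock.lock() would now block forever - the deadlock observed in this issue.
    }
}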

I observed that test_json just happens to yield very deep call stacks, probably because it is a nested test suite. Extending xss (https://hg.python.org/jython/rev/17e40de9a541) somewhat solves it; however, the issue only shifts to deeper stacks. My concern is that the OutOfMemoryError or StackOverflowError is not displayed properly; just a (seemingly) inexplicable deadlock occurs later on (and I can tell it takes a lot of time and thought to discover the actual cause).

I suppose this phenomenon causes all of the recently observed deadlocks (e.g. in #2565 an unlock is lost on PyType$LazyClassToTypeHolder.classToType, which is backed by com.google.common.collect.MapMakerInternalMap) and I don't see a way to detect it properly in Jython. Increasing xss is the best thing we can do on the Jython side so far.
guava could maybe detect bypassed finally blocks by explicitly catching OutOfMemoryError/StackOverflowError and issuing a warning or enforcing a hard exit (or double-checking lock consistency once proper execution eventually resumes).
Actually, the JVM itself should not resume execution once an OutOfMemoryError or StackOverflowError has been thrown. These errors should IMHO explicitly propagate through the finally blocks in the entire stack rather than bypassing only some of them in non-deterministic fashion.

So I am likely going to file issues regarding this scenario against guava and Oracle/the JVM.

Opinions?
msg11273 (view) Author: Jeff Allen (jeff.allen) Date: 2017-03-25.11:21:26
Impressive analysis Stefan. Not much I can add.

I suppose the finally clauses fail to operate because they need to allocate objects? It's almost as if a rule of successful finally clauses is that they should have all their object allocation done ahead of time. If, once you run out of memory, all contracts enforced by finally are unreliable, then the condition should not be handled too locally but should propagate out to where all affected objects are dereferenced. But how can anyone know where that is?

Only one thought: even after it is (mis-)handled, is there any chance of spotting that we were, for a time, out of memory?
msg11274 (view) Author: James Mudd (jamesmudd) Date: 2017-03-25.14:54:08
@Stefan Really good work, that's a really interesting read.

That also explains why I haven't seen this happen for a while which is good.

I'm still not exactly clear what happens to the OutOfMemoryError or StackOverflowError. If this error gets thrown but not caught, you should go through each finally block down the stack until the JVM exits? If one finally block tries to create new objects (which it seems calling unlock on a ReentrantLock might), then that finally should throw and you drop out into the next enclosing finally (which may do the same), and so on, but the Throwable itself shouldn't disappear unless it's caught? So for "normal" execution to resume the Throwable must be caught? I can't see anywhere in Guava where that could happen; there are a few places in Jython but none look very likely to me. How does normal execution resume?

Do you have any way to reproduce this?

I'm really glad someone is looking at this one, I tried to reproduce it for a while and eventually gave up.
msg11277 (view) Author: Stefan Richthofer (stefan.richthofer) Date: 2017-03-27.16:46:45
>I'm still not exactly clear what happens to the OutOfMemoryError or StackOverflowError

I think it's StackOverflowError, but I suppose under heavy memory load the former one could cause a similar phenomenon.

Actually I wasn't able to detect the Error explicitly "in the field", but the attached Test.java can provoke the explained behavior. (You'll have to run it several times, but it runs fast and I observe the issue every 5th run or so, so it's still easy to reproduce.)

Additionally, the fact that the xss value affects the likelihood of the issue supports the StackOverflowError thesis.

If you run Test.java often enough you will observe something like this:

...
recursion 4607
recursion 4608
recursion 4609
recursion 4610
recursion 4611java.lang.Stacjava.lang.StackOverflowErrorjava.lang.StackOverflowErrorjava.lang.StackOverflowError
4608
10619136
10628355
continue normal execution...

If the checksums (10619136, 10628355) differ, some finally blocks were bypassed; note that the difference is 4609+4610, which belong to the frames that were also dropped by the catch: it printed 4608. The construct as a whole nevertheless resumes normal execution.

I am currently not able to pin down how this program structure is exactly embedded in guava, but I am convinced it is present there in manifold ways (which is why we observed various slightly different deadlocks).
msg11279 (view) Author: James Mudd (jamesmudd) Date: 2017-03-27.22:14:41
@Stefan I'm now also convinced this is happening although I still can't figure out exactly where the StackOverflowError is going. With the debugger connected and an exception breakpoint on StackOverflowError, run

james@james-ubuntu:~/git/jython/dist$ java -agentlib:jdwp=transport=dt_socket,server=y,suspend=y,address=1045 -Xss1024k -jar jython.jar Lib/test/test_json.py

I can reliably get test_json to pause as the exception is thrown, usually in test_endless_recursion, which sounds reasonable, and after disconnecting the test runs to completion, so normal execution has resumed somehow...

Unfortunately Eclipse seems to have serious issues getting the stack and letting me step through it, probably due to the state the JVM is in; the debugger just freezes, which is strange. I wonder if there is some other way...
msg11280 (view) Author: James Mudd (jamesmudd) Date: 2017-03-29.22:15:26
So having spent a bit more time on this, I think Stefan is right that the deadlocks are caused by skipped finally blocks. IntelliJ's debugger seems to have no issues pausing on StackOverflowError and watching the execution, so if you want to see what happens I recommend trying IntelliJ.

So what happens is: at some point a StackOverflowError is thrown; some finally blocks are executed but some are skipped when another StackOverflowError is thrown during the finally. Eventually this is caught in Jython code, at line 168 in PyTableCode, which calls Py.JavaError(t); that in turn explicitly handles the StackOverflowError, returning a Py.RuntimeError("maximum recursion depth exceeded (Java StackOverflowError)"). The issue occurs because this is exactly what these recursion tests are expecting to happen. They see the RuntimeError, the test passes, and normal execution resumes (e.g. line 108 in /Lib/json/tests/test_recursion.py - this is a good test to run, it always reproduces this issue). So the good thing is this works as expected, I guess, except for the part where finally blocks were skipped. So if there is a way to handle this in Java (i.e. somehow ensure all the locks are unlocked), then we could do that by hooking the Py.JavaError(Throwable t) method, which handles OutOfMemoryError and StackOverflowError explicitly.
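
A hedged sketch of that control flow in plain Java (stand-in names only; the real path is the PyTableCode.call / Py.JavaError / Py.RuntimeError chain described above):

public class RecursionLimitSketch {
    static class PyRuntimeErrorStandIn extends RuntimeException {
        PyRuntimeErrorStandIn(String msg) { super(msg); }
    }

    // Stand-in for the interpreter frame boundary that catches Throwable.
    static void interpreterFrame(Runnable body) {
        try {
            body.run();
        } catch (Throwable t) {
            if (t instanceof StackOverflowError) {
                // The error is translated into the Python-level error the tests expect.
                throw new PyRuntimeErrorStandIn(
                        "maximum recursion depth exceeded (Java StackOverflowError)");
            }
            throw new RuntimeException(t);
        }
    }

    static void recurse() {
        recurse();
    }

    public static void main(String[] args) {
        try {
            interpreterFrame(RecursionLimitSketch::recurse);
        } catch (PyRuntimeErrorStandIn expected) {
            // The recursion tests expect exactly this, so the test "passes" and normal
            // execution resumes, even though finally blocks may have been skipped
            // while the StackOverflowError unwound the stack.
            System.out.println(expected.getMessage());
        }
    }
}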

The thing I don't understand is how increasing Xss has any effect on this (although I observe that it does); even with Xss5m the StackOverflowError still happens, as this is an infinite recursion test. So you would think the unlocking issue would still happen, but it doesn't? I can only assume the stack is so deep that, by the time you make it back to a lock, the finally is executed successfully.

Maybe this is useful, if nothing else I enjoyed seeing how this happens.
msg11281 (view) Author: Stefan Richthofer (stefan.richthofer) Date: 2017-03-30.08:22:00
I learned that my minimal example Test.java is invalid. The finally blocks are actually always called there; only the System.out.println output was sometimes swallowed. If one prints the checksum along with the "continue normal execution..." output line, it works like it should.

James, could you name the class and method where the StackOverflow happens? And also class and method of the finally block that raises another StackOverflow?
msg11282 (view) Author: Stefan Richthofer (stefan.richthofer) Date: 2017-03-30.09:15:57
Okay, with James' hints I was able to write a new minimal example showing how finally blocks can be bypassed ... actually. See Test2.java

The trick is as follows: after catching a StackOverflowError, the finally block performs a method call. Even if the called method is empty, this can sometimes directly trigger another StackOverflowError, because the stack is already close to exhaustion. (In the example I let the dummy method call itself twice so the stack expands slightly more, which makes the issue occur essentially every time.)

You will find that it outputs something like

Start...
continue normal execution... 7924

which means that some resource would not have been successfully cleaned up. (Otherwise it would be "continue normal execution... 0")
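
For reference, a sketch of the same construction (not the attached Test2.java itself; the counter and the dummy method are illustrative stand-ins for a real acquire/release pair):

public class FinallyBypassSketch {
    static int pending = 0;                // resources acquired but not yet released

    static void recurse() {
        pending++;                          // "acquire" (think lock.lock())
        try {
            recurse();                      // overflow the stack
        } catch (StackOverflowError e) {
            // swallowed here, with the stack still almost completely exhausted
        } finally {
            dummy(2);                       // tiny extra call; near exhaustion this can itself
            pending--;                      // overflow, so this "release" is skipped
        }
    }

    static void dummy(int n) {              // burns a couple of extra frames
        if (n > 0) {
            dummy(n - 1);
        }
    }

    public static void main(String[] args) {
        try {
            recurse();
        } catch (StackOverflowError e) {
            // by now enough stack has been reclaimed to continue
        }
        System.out.println("continue normal execution... " + pending);
        // 0 means every finally completed; anything else counts bypassed finally blocks.
    }
}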

I guess that an increased xss somehow lets the JVM manage stack memory in larger blocks, so it is more resilient if the finally block requires additional stack; however this is a pure guess.

The real question is:
Why is the StackOverflowError that apparently happens in the finally block not thrown further? AFAIK exceptions raised in catch or finally blocks should propagate up the stack, e.g. see http://stackoverflow.com/questions/3779285/exception-thrown-in-catch-and-finally-clause

So this behavior is definitely somehow fishy.
msg11283 (view) Author: James Mudd (jamesmudd) Date: 2017-03-30.11:07:23
@Stefan I'm at work at the moment so will try to post the classes and methods tonight, but just a few comments:

1. The new example is exactly right. I haven't tried it, but getting a StackOverflowError to occur in the finally is very easy in this situation; as you say, the stack is very nearly exhausted already, so any method call could cause it.

2. Once you throw in the finally, I believe the new exception replaces the existing one. I will check this tonight, but I remember investigating it once before for another reason. In this case it will be a new StackOverflowError, but the stack it contains will be smaller, as you have already dropped a stack frame, maybe more; I think this is the source of a lot of the confusion around this. In this situation the stack trace you see in any exception will be only a partial stack. If anyone knows I'm wrong please correct this, because I think this is a key point which we should be clear on.

3. What makes you think the StackOverflowError thrown in the finally doesn't propagate up the stack? Looking at your example, I would suggest that only the last catch needs to succeed in catching the StackOverflowError and you will see nothing, as every time a new StackOverflowError is thrown the old one is lost?

Disclaimer: I might be wrong on any of this (if I am, please correct me); I didn't even run the example, it's just from looking at the code, but I thought it might be useful for discussion.

Also, I don't know if it is relevant here, but throwing an exception over another suppresses the first; there is now a Java way to recover these, see Throwable.getSuppressed, but I'm not sure if it will apply in this case.
msg11284 (view) Author: Stefan Richthofer (stefan.richthofer) Date: 2017-03-30.12:16:11
> What makes you think the StackOverflowError thrown in the finally doesn't propagate up the stack?

It also occurred to me that an earlier call of recursion would handle the subsequent StackOverflowError, but if you add an output to the catch-block (not present in the example) you will see that only a single StackOverflowError is raised.

I also read that an error or exception thrown in a catch or finally block can replace the original one, but IMO the initial StackOverflowError has already been handled by the catch once the finally block starts. In either case *something* should be visible in higher catch/finally blocks, but nothing shows up (even if you catch all Error or Throwable instead of just StackOverflowError).
msg11286 (view) Author: Stefan Richthofer (stefan.richthofer) Date: 2017-04-02.13:38:28
I just found out that the stack is so exhausted that even the System.out.println() causes a subsequent StackOverflowError, which prevents the expected output. So it's not a JVM bug, though it is very inconvenient behavior: the JVM should reserve some extra stack memory to guarantee that debug output, close and unlock operations can pass even in a catch or finally block right after a StackOverflowError.

So, how shall we deal with this situation?
On one hand, this scenario is pathological and we maybe shouldn't spend too much effort on it.
On the other hand, having potential for deadlocks in the test suite is an unbearable situation. However, just dropping test_endless_recursion(self) is maybe not what we want either (though it is the easiest short-term solution, especially with the 2.7.1 release in mind). Suggestions?
msg11287 (view) Author: James Mudd (jamesmudd) Date: 2017-04-02.15:10:44
Sorry I haven't got back to you on where the exceptions occur, I haven't really had any time to look at it.

About the test suite having the potential for deadlock: I agree this is bad, however is this actually the case? With the increased Xss setting I have never observed these deadlocks, whereas before they would happen frequently (a bit vague, I know). So while I can't explain why increasing Xss should affect this issue, it seems to be the case. Has anyone seen these deadlocks occur since Xss was increased?

On the JVM: I don't know if offering extra memory to allow catch or finally to complete is plausible - where would you stop? There is nothing stopping someone allocating huge new objects in a finally (although I think that would be considered bad practice). I think the issue here is caused because Jython catches Throwable and then attempts to continue; this is risky, but needed here to produce the correct behaviour a Python application would expect. I think it's best said here https://docs.oracle.com/javase/7/docs/api/java/lang/Error.html: "An Error is a subclass of Throwable that indicates serious problems that a reasonable application should not try to catch" and "these errors are abnormal conditions that should never occur".

Finally, I don't think this issue should hold back the 2.7.1 release; essentially it only occurs because tests deliberately perform infinite recursion to see what happens. This should not be the case in any normal operation.
msg11288 (view) Author: Jim Baker (zyasoft) Date: 2017-04-02.15:18:50
Per https://hg.python.org/cpython/rev/a21f5af476cb (part of http://bugs.python.org/issue12051, which added test_endless_recursion), we most likely want to make defensive use of ThreadState#enterCall, so we can reduce the expected amount of "available" stack.
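
For concreteness, a hedged sketch of what such a defensive check could look like in an accelerated recursive helper (enterCall is named above; the exitCall counterpart and the exact behaviour at the recursion limit are my assumptions about the ThreadState API):

import org.python.core.Py;
import org.python.core.ThreadState;

final class DefensiveRecursionSketch {
    // Illustrative only: walk a nested structure the way an accelerated module might.
    static Object scanNested(Object node) {
        ThreadState ts = Py.getThreadState();
        ts.enterCall();      // assumed to raise Python's RuntimeError ("maximum recursion
                             // depth exceeded") well before the Java stack overflows
        try {
            // ... recursive work on the nested structure goes here ...
            return node;
        } finally {
            ts.exitCall();   // assumed counterpart, keeping the tracked depth balanced
        }
    }
}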

Note that prior to 2.7.0 we did not have a Java accelerator (a port of _json.c), so this problem never occurred, because the code path was pure Python (which always checks on every call). In general, I would have expected the call to the default function to cause enough checking, but apparently not.

There's a subtle interaction between https://docs.python.org/2/library/sys.html#sys.setrecursionlimit (by default 1000, https://github.com/jythontools/jython/blob/master/src/org/python/core/PySystemState.java#L167), the actual memory for the stack (set by -Xss), and how much Java code is actually executed. It looks like we have been exploring this subtlety during the course of this bug.

I don't think there's much more we can ask Java itself to do here; much like CPython, we have to code even more defensively.
msg11289 (view) Author: Stefan Richthofer (stefan.richthofer) Date: 2017-04-02.16:20:50
Sure, this is not a release blocker (any more). AFAIK, release is only blocked by #2570 as of this writing.

> On the JVM, I don't know if offering extra memory to allow catch or finally to complete is plausible, where would you stop

Just a limited selection of very typical actions like System.out.println, printStackTrace, the unlock methods from java.util.concurrent.locks and the close methods of built-in Java classes should be guaranteed to pass, if called directly. If your application goes the unusual way of catching an Error, you would probably do it for debugging or for cleanup of very critical resources, e.g. ones that can later cause the JVM to freeze. Anyway, this is only hypothetical talk, as it will most likely not trigger a JVM improvement.

Regarding xss: I am glad this seems to be a proper workaround; however, as long as we don't understand why it works, this issue cannot be closed. But I'll set it to low priority.

Jim's suggestion regarding ThreadState#enterCall sounds promising, but I guess, doing the required experiments and detail work to explore the mentioned subtleties would be (yet another) serious workload.
msg11292 (view) Author: Jim Baker (zyasoft) Date: 2017-04-02.22:15:59
Somehow I "removed" instead of "replied" to Stefan's comment. Oops! Presumably thinking of a different UI. Reposting:

Stefan Richthofer added the comment:

"One" more thing:

- We probably should re-enable test_logging in the test-suite again.

- If there are no concerns, I'd like to close all other deadlock issues related to this phenomenon (Resolution: duplicate) and add a final note with link to this issue.

- If there are no concerns, I'd change the title of this issue to something like "occasional freezes due to StackOverflowError in finally block"

(I will do this stuff as my time allows)
msg11293 (view) Author: Jim Baker (zyasoft) Date: 2017-04-02.22:19:28
Agreed about this issue now being low in its priority. A quick perusal suggests that check_circular=False is not used for json encoding; here's the top example which points out it's a bad idea:
https://github.com/Yelp/mrjob/issues/1006
msg11310 (view) Author: Stefan Richthofer (stefan.richthofer) Date: 2017-04-14.18:58:34
Closed #2537 and #2565 as duplicates of this, because they are most likely caused by the same phenomenon.
msg11347 (view) Author: Stefan Richthofer (stefan.richthofer) Date: 2017-05-04.21:45:40
I just observed test_pythoninterpreter_jy hanging, which smells like another manifestation of this issue (current trunk version, i.e. with increased xss). See the attached jstack -- "Thread-261" is waiting on a guava-internal reentrant lock without apparent reason. (The jstack is not very readable due to #2584; I'd appreciate getting that one fixed!)

So this proves that increased xss does not actually fix this issue, it just makes it less likely.

_I recommend to exclude test_endless_recursion from regular test suite._

Testing this kind of edge case potentially leaves the JVM in an insane state, as the JVM provides insufficient guarantees after catching a StackOverflowError. Note that stuff like this is the exact reason why it is strictly recommended not to catch Errors (in contrast to Exceptions). I suppose in the context of test_endless_recursion it's actually okay, because we just replace the Error by a Python-style one. However, that Python-style error in turn should not be caught, i.e. the interpreter/the JVM should not continue regular execution (other than for printing errors, dumping a stack trace etc.) because of the potentially insane state.

This IMO renders test_endless_recursion an unacceptable member of the test suite. (We could run it in isolation from time to time, or move it to the end of the suite.)
msg11348 (view) Author: Jim Baker (zyasoft) Date: 2017-05-04.21:54:28
Stefan, thanks for this investigation. Makes perfect sense, there's a limit to what we can recover from, given JVM guarantees, or lack thereof.

We should just remove this specific test from being run, just like other destabilizing tests. In particular, some of the networking tests exhibit the same problems: they run fine, but cause future tests to fail. They are specifically excluded as a result.
msg11351 (view) Author: James Mudd (jamesmudd) Date: 2017-05-05.07:17:05
@Stefan Yes, I think we have probably seen a build failure on Travis due to this deadlock in test_pythoninterpreter_jy as well: https://travis-ci.org/jythontools/jython/jobs/228908723
It's good to know it's not fixed by the increased Xss - that was very puzzling to me!

I think excluding tests that deliberately attempt to cause this situation, like test_endless_recursion, is probably a reasonable solution. Maybe we could just have one test at the end of the suite which does this, just to check the error is correctly propagated back to the Python code.

I wonder if this could be solved... Maybe when we catch the StackOverflowError we could remove the Guava maps which might contain still-locked locks and replace them with new instances. It is my feeling these are "only" caches to improve Jython's performance, so while emptying them is generally bad, if it avoids this error it might be worth considering? It's an exceptional case anyway, so degrading performance might be an acceptable trade-off? However I do feel we are working against the JVM on this; they really don't want you to try to recover from these situations and we might just be getting into more trouble.
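Roughly what I mean, as a sketch (invented names, illustrative only; this covers the pure-cache case, not the class-to-type mappings):

import com.google.common.cache.CacheBuilder;
import com.google.common.cache.CacheLoader;
import com.google.common.cache.LoadingCache;

// Hypothetical "throw the cache away and rebuild it" idea; not Jython code.
public class ResettableCacheSketch {
    private volatile LoadingCache<String, String> cache = build();

    private static LoadingCache<String, String> build() {
        return CacheBuilder.newBuilder()
                .maximumSize(1024)
                .build(new CacheLoader<String, String>() {
                    @Override
                    public String load(String key) {
                        return expensiveCompute(key);   // stand-in for e.g. compiling a pattern
                    }
                });
    }

    String get(String key) {
        return cache.getUnchecked(key);
    }

    // Called from wherever the StackOverflowError is ultimately handled (e.g. the
    // Py.JavaError hook mentioned earlier): the old instance's internal locks may be
    // inconsistent after a bypassed finally, so replace it rather than repair it.
    void resetAfterStackOverflow() {
        cache = build();
    }

    private static String expensiveCompute(String key) {
        return key;
    }
}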
msg11361 (view) Author: Jim Baker (zyasoft) Date: 2017-05-07.16:07:17
So we have two distinct ways we use Guava collections. The first is strictly for caching, and this comes up in the regex implementation (the python.sre.* registry keys control it). Code that uses regexes is not impacted by changes to this caching, except for performance.
 
The other is in building mappings between Java and Python, such as Java class to Python type maps. Both kinds (trying to use a different term!) of resources are garbage collectable, but during the *lifetime* of the usage of the resource on either end, Python or Java, we need to maintain this mapping and it cannot be thrown away, at least not without observable behavioral changes to using code. Such changes would impact correctness, not just performance.
msg11379 (view) Author: Stefan Richthofer (stefan.richthofer) Date: 2017-05-19.00:51:38
Just observed another one; it's the first time for test_select as far as I remember (see the attached file test_select_hanging.txt).

@James Mudd:
I think to solve this we would need to be able to tell guava to reset all locks when we recover from a caught stack overflow. But these are buried deep in guava internals. And even then it might not be feasible due to complex, subtle stuff I might currently be overlooking.
msg11396 (view) Author: Stefan Richthofer (stefan.richthofer) Date: 2017-05-22.23:02:30
Just observed test_select hanging the same way again, although I currently have test_endless_recursion decorated with an unconditional @unittest.skip.
Maybe there's another place that can cause this (StackOverflow)? I will try to exclude test_json completely for a while. Or test_select may hang due to a distinct phenomenon, but it smells very much like this issue...
msg11399 (view) Author: James Mudd (jamesmudd) Date: 2017-05-23.22:00:43
I think I have just seen test_java_integration hang with this again. jstack -l attached.

This is a little messy. Could we replace the guava maps with a ConcurrentHashMap with weak references to the keys and values? That might avoid this issue? I might try to spend some time looking at this code; I would like to understand better how it works.
msg11400 (view) Author: Stefan Richthofer (stefan.richthofer) Date: 2017-05-23.22:14:02
ConcurrentHashMap also uses locks internally and could fall into the same insane state when a StackOverflowError occurs.
I don't think there is a setup that can be robust against StackOverflowError interrupting execution at an arbitrary point.
msg11401 (view) Author: Stefan Richthofer (stefan.richthofer) Date: 2017-05-23.23:17:03
Just observed the same deadlock you saw in test_java_integration.
And that with test_json completely excluded, see https://github.com/Stewori/jython/commit/b3eea96272b8e9dc229c0239b189163f388a7424.

Is there some other endless-recursion spot in the test suite? It must be something before test_java_integration. Or something else where a JVM error (likely StackOverflowError or OutOfMemoryError) is caught?
IMO we should avoid catching Errors for a while (i.e. exclude all tests that do so) and observe whether this reliably avoids these deadlocks.
msg11402 (view) Author: Stefan Richthofer (stefan.richthofer) Date: 2017-05-24.16:37:06
Okay, I investigated a bit more and found that test_isinstance yields a StackOverflowError too. All in all, we'd have to exclude the following tests:

test_json:
From Lib/json/tests/test_recursion.py:
test_highly_nested_objects_decoding
test_highly_nested_objects_encoding
test_endless_recursion


test_isinstance:
test_subclass_recursion_limit
test_isinstance_recursion_limit

I will @skip these and push the change soon. After that, no more deadlocks of this kind should happen in regrtests. If we continue to see such stuff, we didn't understand this issue as well as we thought...

Secondly:
For Jython 2.7.2/Jython 3 we should consider using a Jython-specific Guava fork. We can maintain it in a manner that allows us to steadily pull updates from Guava trunk. We can also remove all classes that Jython does not actually need - guava.x.jar is > 2.5 MB and I suspect the stuff we actually use amounts to about 180 kB or so. As a side effect this would thin the Jython dist a bit (also a frequent request, IIRC).
Then we could add a public way to reset locks. Whenever a Throwable that is not an Exception is caught in PyTableCode.call, we would reset the locks of certain relevant caches/dicts in Jython.
msg11403 (view) Author: Stefan Richthofer (stefan.richthofer) Date: 2017-05-24.17:33:18
See https://hg.python.org/jython/rev/2b06bc95594d

Now we shouldn't observe any more deadlocks in regrtest. Hopefully it's done with that for now. Still, keep your eyes open: after this point, any further deadlocks are even more alarming and are likely to be distinct issues.
History
Date User Action Args
2017-05-24 17:33:19  stefan.richthofer  set  messages: + msg11403
2017-05-24 16:37:07  stefan.richthofer  set  messages: + msg11402
2017-05-23 23:17:03  stefan.richthofer  set  messages: + msg11401
2017-05-23 22:14:02  stefan.richthofer  set  messages: + msg11400
2017-05-23 22:00:45  jamesmudd  set  files: + test_java_integration_deadlock.txt; messages: + msg11399
2017-05-22 23:02:30  stefan.richthofer  set  messages: + msg11396
2017-05-19 00:51:42  stefan.richthofer  set  files: + test_select_hanging.txt; messages: + msg11379
2017-05-07 16:07:18  zyasoft  set  messages: + msg11361
2017-05-05 07:17:06  jamesmudd  set  messages: + msg11351
2017-05-04 21:54:28  zyasoft  set  messages: + msg11348
2017-05-04 21:45:42  stefan.richthofer  set  files: + test_pythoninterpreter_jy_hangs.txt; messages: + msg11347
2017-04-14 18:58:34  stefan.richthofer  set  messages: + msg11310; title: test_logging deadlocks during regrtest -> deadlocks in regrtests due to StackOverflowError in finally block
2017-04-02 22:19:28  zyasoft  set  messages: + msg11293
2017-04-02 22:15:59  zyasoft  set  messages: + msg11292
2017-04-02 22:13:39  zyasoft  set  messages: - msg11290
2017-04-02 16:34:48  stefan.richthofer  set  messages: + msg11290
2017-04-02 16:20:51  stefan.richthofer  set  priority: low; resolution: remind; messages: + msg11289; severity: normal -> minor
2017-04-02 15:18:50  zyasoft  set  messages: + msg11288
2017-04-02 15:10:45  jamesmudd  set  messages: + msg11287
2017-04-02 13:38:29  stefan.richthofer  set  messages: + msg11286
2017-03-30 12:16:11  stefan.richthofer  set  messages: + msg11284
2017-03-30 11:07:23  jamesmudd  set  messages: + msg11283
2017-03-30 10:51:11  stefan.richthofer  set  files: + Test2.java
2017-03-30 10:50:48  stefan.richthofer  set  files: - Test2.java
2017-03-30 09:15:58  stefan.richthofer  set  files: + Test2.java; messages: + msg11282
2017-03-30 08:22:01  stefan.richthofer  set  messages: + msg11281
2017-03-29 22:15:28  jamesmudd  set  messages: + msg11280
2017-03-27 22:14:42  jamesmudd  set  messages: + msg11279
2017-03-27 16:46:46  stefan.richthofer  set  files: + Test.java; messages: + msg11277
2017-03-25 14:54:08  jamesmudd  set  messages: + msg11274
2017-03-25 11:21:26  jeff.allen  set  nosy: + jeff.allen; messages: + msg11273
2017-03-25 04:15:56  stefan.richthofer  set  messages: + msg11270
2017-03-16 00:11:17  zyasoft  set  messages: + msg11236
2017-03-15 23:41:55  stefan.richthofer  set  messages: + msg11234
2017-03-09 08:11:41  jamesmudd  set  files: + test_logging_undetected_deadlock.txt; messages: + msg11210
2017-03-03 12:32:10  stefan.richthofer  set  messages: + msg11162
2017-02-28 21:26:21  jamesmudd  set  nosy: + jamesmudd; messages: + msg11145
2016-12-01 14:43:27  stefan.richthofer  create