Issue2536

classification
Title: deadlocks in regrtests due to StackOverflowError in finally block
Type: Severity: minor
Components: Core Versions:
Milestone:
process
Status: open Resolution: remind
Dependencies: Superseder:
Assigned To: Nosy List: jamesmudd, jeff.allen, stefan.richthofer, zyasoft
Priority: low Keywords:

Created on 2016-12-01.14:43:27 by stefan.richthofer, last changed 2017-04-14.18:58:34 by stefan.richthofer.

Files
File name Uploaded Description
test_logging_hangs.txt stefan.richthofer, 2016-12-01.14:43:25 full jstack-output while test_logging deadlocks.
test_logging_undetected_deadlock.txt jamesmudd, 2017-03-09.08:11:39 undetected deadlock
Test.java stefan.richthofer, 2017-03-27.16:46:45 How to bypass finally-blocks
Test2.java stefan.richthofer, 2017-03-30.10:51:11 How to bypass finally-blocks ... actually
Messages
msg10996 (view) Author: Stefan Richthofer (stefan.richthofer) Date: 2016-12-01.14:43:25
Running on Linux Mint 18, 64 bit with Java 8, I observe a deadlock in test_logging.py from time to time. It only occurs when running with 'ant regrtest'; the test itself gives no output or error, regrtest just hangs indefinitely.

jstack reports a deadlock (full output attached):


Found one Java-level deadlock:
=============================
"Thread-262":
  waiting for ownable synchronizer 0x00000000e06df030, (a java.util.concurrent.locks.ReentrantLock$NonfairSync),
  which is held by "MainThread"
"MainThread":
  waiting for ownable synchronizer 0x00000000e0787518, (a java.util.concurrent.locks.ReentrantLock$NonfairSync),
  which is held by "Thread-262"

Java stack information for the threads listed above:
===================================================
"Thread-262":
	at sun.misc.Unsafe.park(Native Method)
	- parking to wait for  <0x00000000e06df030> (a java.util.concurrent.locks.ReentrantLock$NonfairSync)
	at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
	at java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:836)
	at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireQueued(AbstractQueuedSynchronizer.java:870)
	at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquire(AbstractQueuedSynchronizer.java:1199)
	at java.util.concurrent.locks.ReentrantLock$NonfairSync.lock(ReentrantLock.java:209)
	at java.util.concurrent.locks.ReentrantLock.lock(ReentrantLock.java:285)
	at com.google.common.cache.LocalCache$Segment.lockedGetOrLoad(LocalCache.java:2228)
	at com.google.common.cache.LocalCache$Segment.get(LocalCache.java:2208)
	at com.google.common.cache.LocalCache.get(LocalCache.java:4053)
	at com.google.common.cache.LocalCache.getOrLoad(LocalCache.java:4057)
	at com.google.common.cache.LocalCache$LocalLoadingCache.get(LocalCache.java:4986)
	at com.google.common.cache.LocalCache$LocalLoadingCache.getUnchecked(LocalCache.java:4992)
	at org.python.modules.sre.SRE_STATE$CACHE.get(SRE_STATE.java:1290)
	at org.python.modules.sre.SRE_STATE$CACHE.access$000(SRE_STATE.java:1253)
	at org.python.modules.sre.SRE_STATE.<init>(SRE_STATE.java:1295)
	at org.python.modules.sre.PatternObject.match(PatternObject.java:80)
	at org.python.modules.sre.PatternObject$match_exposer.__call__(Unknown Source)
	at org.python.core.PyObject.__call__(PyObject.java:480)
	at org.python.core.PyObject.__call__(PyObject.java:484)
	at logging.config$py.valid_ident$10(logging/config.py:275)
	at logging.config$py.call_function(logging/config.py)
	at org.python.core.PyTableCode.call(PyTableCode.java:167)
	at org.python.core.PyBaseCode.call(PyBaseCode.java:139)
	at org.python.core.PyFunction.__call__(PyFunction.java:413)
	at logging.config$py.configure_handler$33(logging/config.py:732)
	at logging.config$py.call_function(logging/config.py)
	at org.python.core.PyTableCode.call(PyTableCode.java:167)
	at org.python.core.PyBaseCode.call(PyBaseCode.java:154)
	at org.python.core.PyFunction.__call__(PyFunction.java:423)
	at org.python.core.PyMethod.__call__(PyMethod.java:141)
	at logging.config$py.configure$29(logging/config.py:640)
	at logging.config$py.call_function(logging/config.py)
	at org.python.core.PyTableCode.call(PyTableCode.java:167)
	at org.python.core.PyBaseCode.call(PyBaseCode.java:139)
	at org.python.core.PyFunction.__call__(PyFunction.java:413)
	at org.python.core.PyMethod.__call__(PyMethod.java:126)
	at logging.config$py.dictConfig$38(logging/config.py:777)
	at logging.config$py.call_function(logging/config.py)
	at org.python.core.PyTableCode.call(PyTableCode.java:167)
	at org.python.core.PyBaseCode.call(PyBaseCode.java:139)
	at org.python.core.PyFunction.__call__(PyFunction.java:413)
	at logging.config$py.handle$41(logging/config.py:840)
	at logging.config$py.call_function(logging/config.py)
	at org.python.core.PyTableCode.call(PyTableCode.java:167)
	at org.python.core.PyBaseCode.call(PyBaseCode.java:139)
	at org.python.core.PyFunction.__call__(PyFunction.java:413)
	at org.python.core.PyMethod.__call__(PyMethod.java:126)
	at SocketServer$py.__init__$47(SocketServer.py:659)
	at SocketServer$py.call_function(SocketServer.py)
	at org.python.core.PyTableCode.call(PyTableCode.java:167)
	at org.python.core.PyBaseCode.call(PyBaseCode.java:308)
	at org.python.core.PyBaseCode.call(PyBaseCode.java:199)
	at org.python.core.PyFunction.__call__(PyFunction.java:482)
	at org.python.core.PyFunction.__call__(PyFunction.java:476)
	at org.python.core.PyInstance.__init__(PyInstance.java:124)
	at org.python.core.PyClass.__call__(PyClass.java:193)
	at org.python.core.PyObject.__call__(PyObject.java:515)
	at org.python.core.PyObject.__call__(PyObject.java:519)
	at SocketServer$py.finish_request$13(SocketServer.py:334)
	at SocketServer$py.call_function(SocketServer.py)
	at org.python.core.PyTableCode.call(PyTableCode.java:167)
	at org.python.core.PyBaseCode.call(PyBaseCode.java:171)
	at org.python.core.PyFunction.__call__(PyFunction.java:434)
	at org.python.core.PyMethod.__call__(PyMethod.java:156)
	at SocketServer$py.process_request_thread$36(SocketServer.py:605)
	at SocketServer$py.call_function(SocketServer.py)
	at org.python.core.PyTableCode.call(PyTableCode.java:167)
	at org.python.core.PyBaseCode.call(PyBaseCode.java:308)
	at org.python.core.PyBaseCode.call(PyBaseCode.java:199)
	at org.python.core.PyFunction.__call__(PyFunction.java:482)
	at org.python.core.PyMethod.instancemethod___call__(PyMethod.java:237)
	at org.python.core.PyMethod.__call__(PyMethod.java:228)
	at org.python.core.PyMethod.__call__(PyMethod.java:223)
	at org.python.core.PyObject._callextra(PyObject.java:620)
	at threading$py.run$35(threading.py:213)
	at threading$py.call_function(threading.py)
	at org.python.core.PyTableCode.call(PyTableCode.java:167)
	at org.python.core.PyBaseCode.call(PyBaseCode.java:139)
	at org.python.core.PyFunction.__call__(PyFunction.java:413)
	at org.python.core.PyMethod.__call__(PyMethod.java:126)
	at threading$py._Thread__bootstrap$36(threading.py:266)
	at threading$py.call_function(threading.py)
	at org.python.core.PyTableCode.call(PyTableCode.java:167)
	at org.python.core.PyBaseCode.call(PyBaseCode.java:308)
	at org.python.core.PyBaseCode.call(PyBaseCode.java:199)
	at org.python.core.PyFunction.__call__(PyFunction.java:482)
	at org.python.core.PyMethod.instancemethod___call__(PyMethod.java:237)
	at org.python.core.PyMethod.__call__(PyMethod.java:228)
	at org.python.core.PyMethod.__call__(PyMethod.java:218)
	at org.python.core.PyMethod.__call__(PyMethod.java:213)
	at org.python.core.FunctionThread.run(FunctionThread.java:25)
"MainThread":
	at sun.misc.Unsafe.park(Native Method)
	- parking to wait for  <0x00000000e0787518> (a java.util.concurrent.locks.ReentrantLock$NonfairSync)
	at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
	at java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:836)
	at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireQueued(AbstractQueuedSynchronizer.java:870)
	at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquire(AbstractQueuedSynchronizer.java:1199)
	at java.util.concurrent.locks.ReentrantLock$NonfairSync.lock(ReentrantLock.java:209)
	at java.util.concurrent.locks.ReentrantLock.lock(ReentrantLock.java:285)
	at org.python.modules._threading.RLock.RLock_acquire(RLock.java:42)
	at org.python.modules._threading.RLock$RLock_acquire_exposer.__call__(Unknown Source)
	at org.python.core.PyObject.__call__(PyObject.java:468)
	at logging$py._acquireLock$6(logging/__init__.py:219)
	at logging$py.call_function(logging/__init__.py)
	at org.python.core.PyTableCode.call(PyTableCode.java:167)
	at org.python.core.PyBaseCode.call(PyBaseCode.java:125)
	at org.python.core.PyFunction.__call__(PyFunction.java:403)
	at logging.config$py.stopListening$48(logging/config.py:906)
	at logging.config$py.call_function(logging/config.py)
	at org.python.core.PyTableCode.call(PyTableCode.java:167)
	at org.python.core.PyBaseCode.call(PyBaseCode.java:125)
	at org.python.core.PyFunction.__call__(PyFunction.java:403)
	at test.test_logging$py.setup_via_listener$84(test/test_logging.py:1809)
	at test.test_logging$py.call_function(test/test_logging.py)
	at org.python.core.PyTableCode.call(PyTableCode.java:167)
	at org.python.core.PyBaseCode.call(PyBaseCode.java:154)
	at org.python.core.PyFunction.__call__(PyFunction.java:423)
	at org.python.core.PyMethod.__call__(PyMethod.java:141)
	at test.test_logging$py.test_listen_config_10_ok$85(test/test_logging.py:1825)
	at test.test_logging$py.call_function(test/test_logging.py)
	at org.python.core.PyTableCode.call(PyTableCode.java:167)
	at org.python.core.PyBaseCode.call(PyBaseCode.java:139)
	at org.python.core.PyFunction.__call__(PyFunction.java:413)
	at org.python.core.PyMethod.__call__(PyMethod.java:126)
	at unittest.case$py.run$35(unittest/case.py:377)
	at unittest.case$py.call_function(unittest/case.py)
	at org.python.core.PyTableCode.call(PyTableCode.java:167)
	at org.python.core.PyBaseCode.call(PyBaseCode.java:308)
	at org.python.core.PyBaseCode.call(PyBaseCode.java:199)
	at org.python.core.PyFunction.__call__(PyFunction.java:482)
	at org.python.core.PyMethod.instancemethod___call__(PyMethod.java:237)
	at org.python.core.PyMethod.__call__(PyMethod.java:228)
	at org.python.core.PyMethod.__call__(PyMethod.java:223)
	at org.python.core.PyObject._callextra(PyObject.java:620)
	at unittest.case$py.__call__$37(unittest/case.py:396)
	at unittest.case$py.call_function(unittest/case.py)
	at org.python.core.PyTableCode.call(PyTableCode.java:167)
	at org.python.core.PyBaseCode.call(PyBaseCode.java:308)
	at org.python.core.PyBaseCode.call(PyBaseCode.java:199)
	at org.python.core.PyFunction.__call__(PyFunction.java:482)
	at org.python.core.PyMethod.instancemethod___call__(PyMethod.java:237)
	at org.python.core.PyMethod.__call__(PyMethod.java:228)
	at org.python.core.PyMethod.__call__(PyMethod.java:223)
	at org.python.core.PyObjectDerived.__call__(PyObjectDerived.java:1030)
	at org.python.core.PyObject.__call__(PyObject.java:480)
	at org.python.core.PyObject.__call__(PyObject.java:484)
	at unittest.suite$py.run$16(unittest/suite.py:116)
	at unittest.suite$py.call_function(unittest/suite.py)
	at org.python.core.PyTableCode.call(PyTableCode.java:167)
	at org.python.core.PyBaseCode.call(PyBaseCode.java:308)
	at org.python.core.PyBaseCode.call(PyBaseCode.java:199)
	at org.python.core.PyFunction.__call__(PyFunction.java:482)
	at org.python.core.PyMethod.instancemethod___call__(PyMethod.java:237)
	at org.python.core.PyMethod.__call__(PyMethod.java:228)
	at org.python.core.PyMethod.__call__(PyMethod.java:223)
	at org.python.core.PyObject._callextra(PyObject.java:620)
	at unittest.suite$py.__call__$13(unittest/suite.py:70)
	at unittest.suite$py.call_function(unittest/suite.py)
	at org.python.core.PyTableCode.call(PyTableCode.java:167)
	at org.python.core.PyBaseCode.call(PyBaseCode.java:308)
	at org.python.core.PyBaseCode.call(PyBaseCode.java:199)
	at org.python.core.PyFunction.__call__(PyFunction.java:482)
	at org.python.core.PyMethod.instancemethod___call__(PyMethod.java:237)
	at org.python.core.PyMethod.__call__(PyMethod.java:228)
	at org.python.core.PyMethod.__call__(PyMethod.java:223)
	at org.python.core.PyObjectDerived.__call__(PyObjectDerived.java:1030)
	at org.python.core.PyObject.__call__(PyObject.java:480)
	at org.python.core.PyObject.__call__(PyObject.java:484)
	at unittest.suite$py.run$16(unittest/suite.py:116)
	at unittest.suite$py.call_function(unittest/suite.py)
	at org.python.core.PyTableCode.call(PyTableCode.java:167)
	at org.python.core.PyBaseCode.call(PyBaseCode.java:308)
	at org.python.core.PyBaseCode.call(PyBaseCode.java:199)
	at org.python.core.PyFunction.__call__(PyFunction.java:482)
	at org.python.core.PyMethod.instancemethod___call__(PyMethod.java:237)
	at org.python.core.PyMethod.__call__(PyMethod.java:228)
	at org.python.core.PyMethod.__call__(PyMethod.java:223)
	at org.python.core.PyObject._callextra(PyObject.java:620)
	at unittest.suite$py.__call__$13(unittest/suite.py:70)
	at unittest.suite$py.call_function(unittest/suite.py)
	at org.python.core.PyTableCode.call(PyTableCode.java:167)
	at org.python.core.PyBaseCode.call(PyBaseCode.java:308)
	at org.python.core.PyBaseCode.call(PyBaseCode.java:199)
	at org.python.core.PyFunction.__call__(PyFunction.java:482)
	at org.python.core.PyMethod.instancemethod___call__(PyMethod.java:237)
	at org.python.core.PyMethod.__call__(PyMethod.java:228)
	at org.python.core.PyMethod.__call__(PyMethod.java:223)
	at org.python.core.PyObjectDerived.__call__(PyObjectDerived.java:1030)
	at org.python.core.PyObject.__call__(PyObject.java:480)
	at org.python.core.PyObject.__call__(PyObject.java:484)
	at test.junit_xml$py.run$3(test/junit_xml.py:27)
	at test.junit_xml$py.call_function(test/junit_xml.py)
	at org.python.core.PyTableCode.call(PyTableCode.java:167)
	at org.python.core.PyBaseCode.call(PyBaseCode.java:154)
	at org.python.core.PyFunction.__call__(PyFunction.java:423)
	at org.python.core.PyMethod.__call__(PyMethod.java:141)
	at test.test_support$py._run_suite$102(test/test_support.py:1303)
	at test.test_support$py.call_function(test/test_support.py)
	at org.python.core.PyTableCode.call(PyTableCode.java:167)
	at org.python.core.PyBaseCode.call(PyBaseCode.java:308)
	at org.python.core.PyBaseCode.call(PyBaseCode.java:132)
	at org.python.core.PyFunction.__call__(PyFunction.java:413)
	at test.test_support$py.run_unittest$103(test/test_support.py:1320)
	at test.test_support$py.call_function(test/test_support.py)
	at org.python.core.PyTableCode.call(PyTableCode.java:167)
	at org.python.core.PyBaseCode.call(PyBaseCode.java:308)
	at org.python.core.PyFunction.function___call__(PyFunction.java:471)
	at org.python.core.PyFunction.__call__(PyFunction.java:466)
	at org.python.core.PyFunction.__call__(PyFunction.java:456)
	at test.test_logging$py.test_main$96(test/test_logging.py:1927)
	at test.test_logging$py.call_function(test/test_logging.py)
	at org.python.core.PyTableCode.call(PyTableCode.java:167)
	at org.python.core.PyBaseCode.call(PyBaseCode.java:308)
	at org.python.core.PyFunction.function___call__(PyFunction.java:471)
	at org.python.core.PyFunction.__call__(PyFunction.java:466)
	at org.python.core.PyFunction.__call__(PyFunction.java:461)
	at org.python.core.PyObject._callextra(PyObject.java:620)
	at test.test_support$py.inner$83(test/test_support.py:1095)
	at test.test_support$py.call_function(test/test_support.py)
	at org.python.core.PyTableCode.call(PyTableCode.java:167)
	at org.python.core.PyBaseCode.call(PyBaseCode.java:308)
	at org.python.core.PyBaseCode.call(PyBaseCode.java:119)
	at org.python.core.PyFunction.__call__(PyFunction.java:403)
	at org.python.pycode._pyx0.runtest_inner$5(test/regrtest.py:676)
	at org.python.pycode._pyx0.call_function(test/regrtest.py)
	at org.python.core.PyTableCode.call(PyTableCode.java:167)
	at org.python.core.PyBaseCode.call(PyBaseCode.java:308)
	at org.python.core.PyFunction.function___call__(PyFunction.java:471)
	at org.python.core.PyFunction.__call__(PyFunction.java:466)
	at org.python.core.PyFunction.__call__(PyFunction.java:456)
	at org.python.pycode._pyx0.runtest$4(test/regrtest.py:566)
	at org.python.pycode._pyx0.call_function(test/regrtest.py)
	at org.python.core.PyTableCode.call(PyTableCode.java:167)
	at org.python.core.PyBaseCode.call(PyBaseCode.java:308)
	at org.python.core.PyFunction.function___call__(PyFunction.java:471)
	at org.python.core.PyFunction.__call__(PyFunction.java:466)
	at org.python.core.PyFunction.__call__(PyFunction.java:456)
	at org.python.pycode._pyx0.main$2(test/regrtest.py:503)
	at org.python.pycode._pyx0.call_function(test/regrtest.py)
	at org.python.core.PyTableCode.call(PyTableCode.java:167)
	at org.python.core.PyBaseCode.call(PyBaseCode.java:308)
	at org.python.core.PyBaseCode.call(PyBaseCode.java:119)
	at org.python.core.PyFunction.__call__(PyFunction.java:403)
	at org.python.pycode._pyx0.f$0(test/regrtest.py:1558)
	at org.python.pycode._pyx0.call_function(test/regrtest.py)
	at org.python.core.PyTableCode.call(PyTableCode.java:167)
	at org.python.core.PyCode.call(PyCode.java:18)
	at org.python.core.Py.runCode(Py.java:1401)
	at org.python.util.PythonInterpreter.execfile(PythonInterpreter.java:296)
	at org.python.util.jython.run(jython.java:362)
	at org.python.util.jython.main(jython.java:142)

Found 1 deadlock.
msg11145 (view) Author: James Mudd (jamesmudd) Date: 2017-02-28.21:26:21
I can confirm I have also just seen this happen on Ubuntu 16.04 64 bit with Java 1.8.0_121
msg11162 (view) Author: Stefan Richthofer (stefan.richthofer) Date: 2017-03-03.12:32:10
Also observed it on Windows 10 right now.
msg11210 (view) Author: James Mudd (jamesmudd) Date: 2017-03-09.08:11:39
Have just seen test_logging deadlock again, but slightly differently. Using master, which includes reverting commit 41d8823 to back out the synchronization in _socket.py and ssl.py.

Since this was reverted it does seem to be better, but now I see a deadlock that jstack can't detect. Maybe a read lock? jstack -l output attached.
msg11234 (view) Author: Stefan Richthofer (stefan.richthofer) Date: 2017-03-15.23:41:54
After more study of this issue I found out that test_json is somehow preparing the ground for this deadlock. It sometimes causes guava not to release a certain lock on a cache, which later leads to the deadlock. test_scanstring in particular seems to cause this; I am not yet sure how this can happen.

However it would be helpful if others could confirm the following:

Exclude test_json from regrtests (e.g. insert tests.remove('test_json') into Lib/test/regrtest.py at line 377; there is surely a more proper way to exclude a certain test, feel free to use that instead).

Then, test_logging should not deadlock any more.
msg11236 (view) Author: Jim Baker (zyasoft) Date: 2017-03-16.00:11:17
Very interesting debugging detail!

Note that the json module has Java acceleration support in org.python.modules._json, but otherwise has a pure Python implementation; this acceleration can be readily turned off by removing this entry: https://github.com/jythontools/jython/blob/master/src/org/python/modules/Setup.java#L38

So we can see if it's this acceleration causing the problem or not.
msg11270 (view) Author: Stefan Richthofer (stefan.richthofer) Date: 2017-03-25.04:15:54
I think I have finally understood the cause of this issue - it is crazily subtle. The good news is that it was already (accidentally) fixed as a side effect of https://hg.python.org/jython/rev/17e40de9a541 (I wondered why it never occurred in recent trunk versions).
The bad news is that some things silently go awfully wrong and I don't see a proper way to detect this on the Jython side.

So, the whole story of this bug... here we go.

guava uses reentrant locks to guard all sorts of access to its data structures, and unlocks them in finally blocks. Note that - since the locks are reentrant - finally-enclosed calls can be nested.

This bug originates in finally blocks not being executed; thus some locks fail to unlock and later cause deadlocks. While this can happen in various scenarios, it typically involves guava, and the bypassed unlock usually happens in test_json (i.e. in org.python.modules.sre.SRE_STATE$CACHE, which is backed by com.google.common.cache.LocalCache).

But how can a finally-block be bypassed?

https://docs.oracle.com/javase/tutorial/essential/exceptions/finally.html
tells us this can be caused by thread interruption or System.exit, but I verified that neither is the case here. The document fails to mention that OutOfMemoryError and StackOverflowError can terminate the JVM without an enclosing finally block being executed:

http://stackoverflow.com/questions/22969000/finally-block-will-be-executed-in-case-of-outofmemoryerror
http://stackoverflow.com/questions/7753443/behaviour-of-jvm-during-out-of-memory-error-list-s-new-arrayliststring
http://stackoverflow.com/questions/17836120/stack-overflow-error-handling-in-finally-block

Normally that doesn't matter much, since everything blows up anyway. However, recall that in guava, calls enclosed by finally blocks are sometimes nested. In that case an enclosing catch can catch the error (and is often able to, because as the error propagates up the stack, stack space gets freed) and then resume normal execution - with the slight detail that the inner finally block was bypassed and the lock count is now wrong. The nasty thing here is that no error is displayed - things go silently wrong. Also, it is somewhat random how much of the stack is skipped before normal execution resumes - sometimes even two unlocks are dropped at a time.
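
Roughly, the structure looks like this (a sketch with hypothetical names, not guava's actual code; it only illustrates the pattern described above):

import java.util.concurrent.locks.ReentrantLock;

// If a StackOverflowError strikes in or below innerLoad's finally, that unlock is
// skipped; a catch far up the stack then resumes normal execution while the lock's
// hold count is still > 0, and the next thread that needs the lock deadlocks.
public class BypassedUnlockSketch {
    private final ReentrantLock lock = new ReentrantLock();

    Object outerGet(Object key) {
        lock.lock();
        try {
            return innerLoad(key);
        } finally {
            lock.unlock();              // outer unlock
        }
    }

    private Object innerLoad(Object key) {
        lock.lock();                    // reentrant acquire, hold count is now 2
        try {
            return computeDeeply(key);  // deep call stack, e.g. recursive decoding
        } finally {
            lock.unlock();              // skipped if this frame overflows again
        }
    }

    private Object computeDeeply(Object key) { return key; }

    Object caller(Object key) {
        try {
            return outerGet(key);
        } catch (StackOverflowError e) {
            return null;                // "normal" execution resumes, lock possibly still held
        }
    }
}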

I observed that test_json just happens to produce very deep call stacks, probably because it is a nested test suite. Increasing Xss (https://hg.python.org/jython/rev/17e40de9a541) somewhat solves it; however, the issue only shifts to deeper stacks. My concern is that the OutOfMemoryError or StackOverflowError is not displayed properly; just a (seemingly) inexplicable deadlock occurs later on (and I can tell you, it takes a lot of time and thought to discover the actual cause).

I suppose this phenomenon causes all of the recently observed deadlocks (e.g. in #2565 an unlock is lost on PyType$LazyClassToTypeHolder.classToType, which is backed by com.google.common.collect.MapMakerInternalMap), and I don't see a way to detect it properly in Jython. Increasing Xss is the best thing we can do on the Jython side so far.
guava could maybe detect bypassed finally blocks by explicitly catching OutOfMemoryError/StackOverflowError and issuing a warning or forcing a hard exit (or by double-checking lock consistency once proper execution eventually resumes).
Actually, the JVM itself should not resume execution once an OutOfMemoryError or StackOverflowError has been thrown. IMHO these errors should explicitly propagate through the finally blocks of the entire stack rather than bypassing only some of them in a non-deterministic fashion.

So I am likely going to file issues regarding this scenario with guava and with Oracle/the JVM.

Opinions?
msg11273 (view) Author: Jeff Allen (jeff.allen) Date: 2017-03-25.11:21:26
Impressive analysis Stefan. Not much I can add.

I suppose the finally clauses fail to operate because they need to allocate objects? It's almost as if a rule for successful finally clauses were that they should have all their object allocation done ahead of time. If, once you run out of memory, all contracts enforced by finally are unreliable, then the condition should not be handled too locally but should propagate out to where all affected objects are dereferenced. But how can anyone know where that is?

Only one thought: even after it is (mis-)handled, is there any chance of spotting that we were, for a time, out of memory?
msg11274 (view) Author: James Mudd (jamesmudd) Date: 2017-03-25.14:54:08
@Stefan Really good work, that's a really interesting read.

That also explains why I haven't seen this happen for a while which is good.

I'm still not exactly clear what happens to the OutOfMemoryError or StackOverflowError. If this error gets thrown, but not caught, you should go through each finally block down the stack until the JVM exits? If one finally block tries to create new objects (which it seems like calling unlock on a ReentrantLock might), then that finally should throw and you will drop out into the next enclosing finally (which may do the same), and so on, but the Throwable itself shouldn't disappear unless it's caught? So for "normal" execution to resume the Throwable must be caught? I can't see anywhere in Guava where that could happen; there are a few places in Jython but none look very likely to me. How does normal execution resume?

Do you have any way to reproduce this?

I'm really glad someone is looking at this one, I tried to reproduce it for a while and eventually gave up.
msg11277 (view) Author: Stefan Richthofer (stefan.richthofer) Date: 2017-03-27.16:46:45
>I'm still not exactly clear what happens to the OutOfMemoryError or StackOverflowError

I think it's StackOverflowError, but I suppose under heavy memory load the former one could cause a similar phenomenon.

Actually I wasn't able to detect the Error explicitly "in the field", but the attached Test.java can provoke the explained behavior (you'll have to run it several times, but it runs fast and I observe the issue every 5th run or so, so it's still easy to reproduce).

Additionally, the fact that the Xss value can affect the likelihood of the issue supports the StackOverflowError thesis.

If you run Test.java often enough you will observe something like this:

...
recursion 4607
recursion 4608
recursion 4609
recursion 4610
recursion 4611java.lang.Stacjava.lang.StackOverflowErrorjava.lang.StackOverflowErrorjava.lang.StackOverflowError
4608
10619136
10628355
continue normal execution...

If the checksums (10619136, 10628355) differ, some finally blocks were bypassed; note that the difference is 4609+4610, which belong to the frames that were also dropped by the catch: it printed 4608. The construct as a whole nevertheless resumes normal execution.

I am currently not able to pin down exactly how this program structure is embedded in guava, but I am convinced it is present there in many forms (which is why we observed various slightly different deadlocks).
msg11279 (view) Author: James Mudd (jamesmudd) Date: 2017-03-27.22:14:41
@Stefan I'm now also convinced this is happening, although I still can't figure out exactly where the StackOverflowError is going. With the debugger connected and an exception breakpoint on StackOverflowError, run

james@james-ubuntu:~/git/jython/dist$ java -agentlib:jdwp=transport=dt_socket,server=y,suspend=y,address=1045 -Xss1024k -jar jython.jar Lib/test/test_json.py

I can reliably get test_json to pause as the exception is thrown, usually in test_endless_recursion, which sounds reasonable; after disconnecting, the test runs to completion, so normal execution has resumed somehow...

Unfortunately Eclipse seems to have serious issues getting the stack and allowing me to step through it, probably due to the state the JVM is in; the debugger just freezes, which is strange. I wonder if there is some other way...
msg11280 (view) Author: James Mudd (jamesmudd) Date: 2017-03-29.22:15:26
So, having spent a bit more time on this, I think Stefan is right that the deadlocks are caused by skipped finally blocks. IntelliJ's debugger has no issues pausing on StackOverflowError and watching the execution, so if you want to see what happens I recommend trying IntelliJ.

So what happens is: at some point a StackOverflowError is thrown; some finally blocks are executed, but some are skipped when another StackOverflowError is thrown during the finally. Eventually this is caught in Jython code, at line 168 in PyTableCode, which calls Py.JavaError(t); that in turn explicitly handles the StackOverflowError by returning a Py.RuntimeError("maximum recursion depth exceeded (Java StackOverflowError)"). The issue occurs because this is exactly what these recursion tests expect to happen. They see the RuntimeError, the test passes, and normal execution resumes (e.g. line 108 in /Lib/json/tests/test_recursion.py - this is a good test to run, it always reproduces this issue). So the good thing is this works as expected, I guess, except for the part where finally blocks were skipped. If there is a way to handle this in Java (i.e. somehow ensure all the locks are unlocked), we could do it by hooking the Py.JavaError(Throwable t) method, which already handles OutOfMemoryError and StackOverflowError explicitly.
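
The shape of that conversion is roughly the following (a sketch with placeholder names, not Jython's actual PyTableCode/Py.JavaError source): the Error from the exhausted recursion becomes the Python-level error the recursion tests expect, so the test passes and execution resumes, even though finally blocks further down may already have been skipped.

public class ConversionSketch {
    // turn a Java Error/Throwable into the error surfaced at the Python level
    static RuntimeException toPythonError(Throwable t) {
        if (t instanceof StackOverflowError) {
            return new RuntimeException("maximum recursion depth exceeded (Java StackOverflowError)");
        }
        return new RuntimeException(t);
    }

    static Object callWithConversion(java.util.concurrent.Callable<Object> body) {
        try {
            return body.call();
        } catch (Throwable t) {
            throw toPythonError(t);   // callers see a "normal" error and carry on
        }
    }
}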

The thing I don't understand is how increasing Xss has any effect on this (although I observe that it does); even with Xss5m the StackOverflowError still happens, as this is an infinite recursion test. So you would think the unlocking issue would still happen, but it doesn't? I can only assume the stack is so deep that by the time you make it back to a lock, the finally can execute successfully.

Maybe this is useful; if nothing else, I enjoyed seeing how this happens.
msg11281 (view) Author: Stefan Richthofer (stefan.richthofer) Date: 2017-03-30.08:22:00
I have learned that my minimal example Test.java is wrong. The finally blocks are actually always called there; only the System.out.println was sometimes swallowed. If one prints the checksum along with the "continue normal execution..." output line, it works as it should.

James, could you name the class and method where the StackOverflow happens? And also class and method of the finally block that raises another StackOverflow?
msg11282 (view) Author: Stefan Richthofer (stefan.richthofer) Date: 2017-03-30.09:15:57
Okay, with James' hints I was able to write a new minimal example showing how finally blocks can be bypassed ... actually. See Test2.java

The trick is as follows: after catching a StackOverflowError, the finally block performs a method call. Even if the called method is empty, this can sometimes directly trigger another StackOverflowError, because the stack is already close to exhaustion. (In the example I let the dummy method call itself twice so the stack expands slightly more, which makes the issue occur virtually every time.)

You will find that it outputs something like

Start...
continue normal execution... 7924

which means that some resource would not have been successfully cleaned up. (Otherwise it would be "continue normal execution... 0")
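
The mechanism can be sketched like this (a minimal sketch, not the attached Test2.java; all names are made up):

public class FinallyBypassSketch {
    static int leakedCleanups = 0;

    static void recurse() {
        leakedCleanups++;             // "resource acquired"
        try {
            recurse();
        } catch (StackOverflowError e) {
            // stack is almost exhausted at this point
        } finally {
            touch();                  // may throw a *new* StackOverflowError ...
            leakedCleanups--;         // ... in which case this cleanup never runs
        }
    }

    static void touch() { touch2(); } // expands the stack slightly more
    static void touch2() { }

    public static void main(String[] args) {
        try {
            recurse();
        } catch (StackOverflowError e) {
            // safety net; typically the Error is consumed somewhere inside the recursion
        }
        // greater than zero whenever at least one finally block was bypassed
        System.out.println("leaked cleanups: " + leakedCleanups);
    }
}

Whenever touch() overflows inside a finally, the matching cleanup in that frame is skipped, so the printed count ends up greater than zero even though execution resumes normally.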

I guess that an increased Xss somehow lets the JVM manage stack memory in larger blocks, so it is more resilient when the finally block requires additional stack; however, this is pure guesswork.

The real question is:
Why is the StackOverflowError that (apparently) happens in the finally block not thrown? AFAIK exceptions raised in catch or finally blocks should propagate up the stack; see e.g. http://stackoverflow.com/questions/3779285/exception-thrown-in-catch-and-finally-clause

So this behavior is definitely somewhat fishy.
msg11283 (view) Author: James Mudd (jamesmudd) Date: 2017-03-30.11:07:23
@Stefan I'm at work at the moment so I will try to post the classes and methods tonight, but just a few comments:

1. The new example is exactly right. I haven't tried it, but getting a StackOverflowError to occur in the finally is very easy in this situation; as you say, the stack is very nearly exhausted already, so any method call could cause it.

2. Once you throw in the finally, I believe the new exception replaces the existing one (a small demonstration follows at the end of this message); I will check this tonight, but I remember investigating it once before for another reason. In this case it will be a new StackOverflowError, but the stack it contains will be smaller, as you have already dropped a stack frame, maybe more. I think this is the source of a lot of the confusion around this: in this situation the stack trace you see in any exception will be only a partial stack. If anyone knows I'm wrong please correct this, because I think this is a key point we should be clear on.

3. What makes you think the StackOverflowError thrown in the finally doesn't propagate up the stack? Looking at your example, I would suggest that only the last catch needs to succeed in catching the StackOverflowError for you to see nothing, as every time a new StackOverflowError is thrown the old one is lost?

Disclaimer: I might be wrong on any of this (if I am, please correct me); I didn't even run the example, this is just from looking at the code, but I thought it might be useful for discussion.

Also, I don't know if it is relevant here, but throwing an exception over another suppresses the first; there is now a Java way to recover these (see Throwable.getSuppressed), but I'm not sure whether it applies in this case.
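
For point 2, here is a small generic demonstration (not from the attached files) that an exception thrown in a finally block replaces the one already propagating:

public class FinallyReplacesDemo {
    public static void main(String[] args) {
        try {
            try {
                throw new RuntimeException("original");
            } finally {
                throw new IllegalStateException("thrown in finally");
            }
        } catch (Throwable t) {
            System.out.println(t);    // prints the IllegalStateException; "original" is lost
        }
    }
}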
msg11284 (view) Author: Stefan Richthofer (stefan.richthofer) Date: 2017-03-30.12:16:11
> What makes you think the StackOverflowError thrown in the finally doesn't propagate up the stack?

It also occurred to me that an earlier call of recursion might handle the subsequent StackOverflowError, but if you add an output to the catch block (not present in the example) you will see that only a single StackOverflowError is raised.

I have also read that an error or exception thrown in a catch or finally block can replace the original one, but IMO the initial StackOverflowError has already been handled by the catch once the finally block starts. In either case *something* should be visible in higher catch/finally blocks, but nothing shows up (even if you catch all Error or Throwable instead of just StackOverflowError).
msg11286 (view) Author: Stefan Richthofer (stefan.richthofer) Date: 2017-04-02.13:38:28
I just found out that the stack is so exhausted that even the System.out.println() causes a subsequent StackOverflowError, which prevents the expected output. So it's not a JVM bug, though it is very inconvenient behavior: the JVM should have some extra stack memory to guarantee that debug output, close, and unlock operations can pass even in a catch or finally block right after a StackOverflowError.

So, how shall we deal with this situation?
On one hand, this scenario is pathological and maybe we shouldn't spend too much effort on it.
On the other hand, having the potential for deadlocks in the test suite is an unbearable situation. However, just dropping test_endless_recursion(self) is maybe not what we want either (though it is the easiest short-term solution, especially with the 2.7.1 release in mind). Suggestions?
msg11287 (view) Author: James Mudd (jamesmudd) Date: 2017-04-02.15:10:44
Sorry I haven't got back to you on where the exceptions occur; I haven't really had any time to look at it.

About the test suite having the potential for deadlock: I agree this is bad, but is it actually still the case? With the increased Xss setting I have never observed these deadlocks, whereas before they would happen frequently (a bit vague, I know). So while I can't explain why increasing Xss should affect this issue, it seems to be the case. Has anyone seen these deadlocks occur since Xss was increased?

On the JVM: I don't know if offering extra memory to allow a catch or finally to complete is plausible; where would you stop? There is nothing stopping someone allocating huge new objects in a finally (although I think that would be considered bad practice). I think the issue here is caused by Jython catching Throwable and then attempting to continue; this is risky, but needed here to produce the correct behaviour a Python application would expect. I think it's best said here https://docs.oracle.com/javase/7/docs/api/java/lang/Error.html: "An Error is a subclass of Throwable that indicates serious problems that a reasonable application should not try to catch" and "these errors are abnormal conditions that should never occur".

Finally, I don't think this should be an issue that holds back the 2.7.1 release; essentially this only occurs because tests are deliberately performing infinite recursion to see what happens. This should not be the case in any normal operation.
msg11288 (view) Author: Jim Baker (zyasoft) Date: 2017-04-02.15:18:50
Per https://hg.python.org/cpython/rev/a21f5af476cb (part of http://bugs.python.org/issue12051, which added test_endless_recursion), we most likely want to make defensive use of ThreadState#enterCall, so we can reduce the expected amount of "available" stack.

Note that prior to 2.7.0 we did not have a Java accelerator (a port of _json.c), so this problem never occurred because everything was in Python code (which always checks on every call). In general, I would have expected the call to the default function to cause enough checking, but apparently not.

There's a subtle interaction between https://docs.python.org/2/library/sys.html#sys.setrecursionlimit (by default 1000, https://github.com/jythontools/jython/blob/master/src/org/python/core/PySystemState.java#L167), the actual memory for the stack (set by -Xss), and how much Java code is actually executed. It looks like we have been exploring this subtlety during the course of this bug.

I don't think there's much more we can ask Java itself to do here; much like CPython, we have to code even more defensively.
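
The defensive idea can be sketched independently of Jython's internals; the class below is a hypothetical illustration (Jython's real mechanism is ThreadState#enterCall, as mentioned above), counting call depth explicitly so the recursion limit trips well before the Java stack is exhausted and finally blocks still have headroom to run:

public final class RecursionGuardSketch {
    private static final int LIMIT = 1000;  // cf. sys.getrecursionlimit()
    private static final ThreadLocal<Integer> DEPTH = ThreadLocal.withInitial(() -> 0);

    public static void enterCall() {
        int d = DEPTH.get() + 1;
        if (d > LIMIT) {
            // trips before the JVM runs out of stack, so cleanup code still has room
            throw new RuntimeException("maximum recursion depth exceeded");
        }
        DEPTH.set(d);
    }

    public static void exitCall() {
        DEPTH.set(DEPTH.get() - 1);
    }
}

A Java-accelerated recursive path would bracket each Python-level call with enterCall()/exitCall(), so the limit is enforced even when no Python frames are created.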
msg11289 (view) Author: Stefan Richthofer (stefan.richthofer) Date: 2017-04-02.16:20:50
Sure, this is not a release blocker (any more). AFAIK, release is only blocked by #2570 as of this writing.

> On the JVM, I don't know if offering extra memory to allow catch or finally to complete is plausible, where would you stop

Just a limited selection of very typical actions, like System.out.println, printStackTrace, the unlock methods from java.util.concurrent.locks, and the close methods of built-in Java classes, should be guaranteed to pass if called directly. If your application goes the unusual way of catching an Error, you probably do it for debugging or for cleanup of very critical resources, e.g. ones that can later cause the JVM to freeze. Anyway, this is only hypothetical talk, as it will most likely not trigger a JVM improvement.

Regarding Xss: I am glad this seems to be a proper workaround; however, as long as we don't understand why it works, this issue cannot be closed. But I'll set it to low priority.

Jim's suggestion regarding ThreadState#enterCall sounds promising, but I guess doing the required experiments and detail work to explore the mentioned subtleties would be (yet another) serious workload.
msg11292 (view) Author: Jim Baker (zyasoft) Date: 2017-04-02.22:15:59
Somehow I "removed" instead of "replied" to Stefan's comment. Oops! Presumably thinking of a different UI. Reposting:

Stefan Richthofer added the comment:

"One" more thing:

- We should probably re-enable test_logging in the test suite.

- If there are no concerns, I'd like to close all other deadlock issues related to this phenomenon (Resolution: duplicate) and add a final note with a link to this issue.

- If there are no concerns, I'd change the title of this issue to something like "occasional freezes due to StackOverflowError in finally block"

(I will do this stuff as my time allows)
msg11293 (view) Author: Jim Baker (zyasoft) Date: 2017-04-02.22:19:28
Agreed that this issue is now low priority. A quick perusal suggests that check_circular=False is not used for json encoding; here's the top example, which points out it's a bad idea:
https://github.com/Yelp/mrjob/issues/1006
msg11310 (view) Author: Stefan Richthofer (stefan.richthofer) Date: 2017-04-14.18:58:34
Closed #2537 and #2565 as duplicates of this, because they are most likely caused by the same phenomenon.
History
Date User Action Args
2017-04-14 18:58:34stefan.richthofersetmessages: + msg11310
title: test_logging deadlocks during regrtest -> deadlocks in regrtests due to StackOverflowError in finally block
2017-04-02 22:19:28zyasoftsetmessages: + msg11293
2017-04-02 22:15:59zyasoftsetmessages: + msg11292
2017-04-02 22:13:39zyasoftsetmessages: - msg11290
2017-04-02 16:34:48stefan.richthofersetmessages: + msg11290
2017-04-02 16:20:51stefan.richthofersetpriority: low
resolution: remind
messages: + msg11289
severity: normal -> minor
2017-04-02 15:18:50zyasoftsetmessages: + msg11288
2017-04-02 15:10:45jamesmuddsetmessages: + msg11287
2017-04-02 13:38:29stefan.richthofersetmessages: + msg11286
2017-03-30 12:16:11stefan.richthofersetmessages: + msg11284
2017-03-30 11:07:23jamesmuddsetmessages: + msg11283
2017-03-30 10:51:11stefan.richthofersetfiles: + Test2.java
2017-03-30 10:50:48stefan.richthofersetfiles: - Test2.java
2017-03-30 09:15:58stefan.richthofersetfiles: + Test2.java
messages: + msg11282
2017-03-30 08:22:01stefan.richthofersetmessages: + msg11281
2017-03-29 22:15:28jamesmuddsetmessages: + msg11280
2017-03-27 22:14:42jamesmuddsetmessages: + msg11279
2017-03-27 16:46:46stefan.richthofersetfiles: + Test.java
messages: + msg11277
2017-03-25 14:54:08jamesmuddsetmessages: + msg11274
2017-03-25 11:21:26jeff.allensetnosy: + jeff.allen
messages: + msg11273
2017-03-25 04:15:56stefan.richthofersetmessages: + msg11270
2017-03-16 00:11:17zyasoftsetmessages: + msg11236
2017-03-15 23:41:55stefan.richthofersetmessages: + msg11234
2017-03-09 08:11:41jamesmuddsetfiles: + test_logging_undetected_deadlock.txt
messages: + msg11210
2017-03-03 12:32:10stefan.richthofersetmessages: + msg11162
2017-02-28 21:26:21jamesmuddsetnosy: + jamesmudd
messages: + msg11145
2016-12-01 14:43:27stefan.richthofercreate