Issue 2536

classification
Title: test_logging deadlocks during regrtest
Type:
Severity: normal
Components: Core
Versions:
Milestone:
process
Status: open
Resolution:
Dependencies:
Superseder:
Assigned To:
Nosy List: jamesmudd, jeff.allen, stefan.richthofer, zyasoft
Priority:
Keywords:

Created on 2016-12-01.14:43:27 by stefan.richthofer, last changed 2017-03-30.12:16:11 by stefan.richthofer.

Files
File name Uploaded Description
test_logging_hangs.txt stefan.richthofer, 2016-12-01.14:43:25 full jstack-output while test_logging deadlocks.
test_logging_undetected_deadlock.txt jamesmudd, 2017-03-09.08:11:39 undetected deadlock
Test.java stefan.richthofer, 2017-03-27.16:46:45 How to bypass finally-blocks
Test2.java stefan.richthofer, 2017-03-30.10:51:11 How to bypass finally-blocks ... actually
Messages
msg10996 Author: Stefan Richthofer (stefan.richthofer) Date: 2016-12-01.14:43:25
Running on Linux Mint 18 (64-bit) with Java 8, I observe a deadlock in test_logging.py from time to time. It only occurs when running with 'ant regrtest'; the test itself gives no output or error, regrtest just hangs indefinitely.

jstack reports a deadlock (full output attached):


Found one Java-level deadlock:
=============================
"Thread-262":
  waiting for ownable synchronizer 0x00000000e06df030, (a java.util.concurrent.locks.ReentrantLock$NonfairSync),
  which is held by "MainThread"
"MainThread":
  waiting for ownable synchronizer 0x00000000e0787518, (a java.util.concurrent.locks.ReentrantLock$NonfairSync),
  which is held by "Thread-262"

Java stack information for the threads listed above:
===================================================
"Thread-262":
	at sun.misc.Unsafe.park(Native Method)
	- parking to wait for  <0x00000000e06df030> (a java.util.concurrent.locks.ReentrantLock$NonfairSync)
	at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
	at java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:836)
	at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireQueued(AbstractQueuedSynchronizer.java:870)
	at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquire(AbstractQueuedSynchronizer.java:1199)
	at java.util.concurrent.locks.ReentrantLock$NonfairSync.lock(ReentrantLock.java:209)
	at java.util.concurrent.locks.ReentrantLock.lock(ReentrantLock.java:285)
	at com.google.common.cache.LocalCache$Segment.lockedGetOrLoad(LocalCache.java:2228)
	at com.google.common.cache.LocalCache$Segment.get(LocalCache.java:2208)
	at com.google.common.cache.LocalCache.get(LocalCache.java:4053)
	at com.google.common.cache.LocalCache.getOrLoad(LocalCache.java:4057)
	at com.google.common.cache.LocalCache$LocalLoadingCache.get(LocalCache.java:4986)
	at com.google.common.cache.LocalCache$LocalLoadingCache.getUnchecked(LocalCache.java:4992)
	at org.python.modules.sre.SRE_STATE$CACHE.get(SRE_STATE.java:1290)
	at org.python.modules.sre.SRE_STATE$CACHE.access$000(SRE_STATE.java:1253)
	at org.python.modules.sre.SRE_STATE.<init>(SRE_STATE.java:1295)
	at org.python.modules.sre.PatternObject.match(PatternObject.java:80)
	at org.python.modules.sre.PatternObject$match_exposer.__call__(Unknown Source)
	at org.python.core.PyObject.__call__(PyObject.java:480)
	at org.python.core.PyObject.__call__(PyObject.java:484)
	at logging.config$py.valid_ident$10(logging/config.py:275)
	at logging.config$py.call_function(logging/config.py)
	at org.python.core.PyTableCode.call(PyTableCode.java:167)
	at org.python.core.PyBaseCode.call(PyBaseCode.java:139)
	at org.python.core.PyFunction.__call__(PyFunction.java:413)
	at logging.config$py.configure_handler$33(logging/config.py:732)
	at logging.config$py.call_function(logging/config.py)
	at org.python.core.PyTableCode.call(PyTableCode.java:167)
	at org.python.core.PyBaseCode.call(PyBaseCode.java:154)
	at org.python.core.PyFunction.__call__(PyFunction.java:423)
	at org.python.core.PyMethod.__call__(PyMethod.java:141)
	at logging.config$py.configure$29(logging/config.py:640)
	at logging.config$py.call_function(logging/config.py)
	at org.python.core.PyTableCode.call(PyTableCode.java:167)
	at org.python.core.PyBaseCode.call(PyBaseCode.java:139)
	at org.python.core.PyFunction.__call__(PyFunction.java:413)
	at org.python.core.PyMethod.__call__(PyMethod.java:126)
	at logging.config$py.dictConfig$38(logging/config.py:777)
	at logging.config$py.call_function(logging/config.py)
	at org.python.core.PyTableCode.call(PyTableCode.java:167)
	at org.python.core.PyBaseCode.call(PyBaseCode.java:139)
	at org.python.core.PyFunction.__call__(PyFunction.java:413)
	at logging.config$py.handle$41(logging/config.py:840)
	at logging.config$py.call_function(logging/config.py)
	at org.python.core.PyTableCode.call(PyTableCode.java:167)
	at org.python.core.PyBaseCode.call(PyBaseCode.java:139)
	at org.python.core.PyFunction.__call__(PyFunction.java:413)
	at org.python.core.PyMethod.__call__(PyMethod.java:126)
	at SocketServer$py.__init__$47(SocketServer.py:659)
	at SocketServer$py.call_function(SocketServer.py)
	at org.python.core.PyTableCode.call(PyTableCode.java:167)
	at org.python.core.PyBaseCode.call(PyBaseCode.java:308)
	at org.python.core.PyBaseCode.call(PyBaseCode.java:199)
	at org.python.core.PyFunction.__call__(PyFunction.java:482)
	at org.python.core.PyFunction.__call__(PyFunction.java:476)
	at org.python.core.PyInstance.__init__(PyInstance.java:124)
	at org.python.core.PyClass.__call__(PyClass.java:193)
	at org.python.core.PyObject.__call__(PyObject.java:515)
	at org.python.core.PyObject.__call__(PyObject.java:519)
	at SocketServer$py.finish_request$13(SocketServer.py:334)
	at SocketServer$py.call_function(SocketServer.py)
	at org.python.core.PyTableCode.call(PyTableCode.java:167)
	at org.python.core.PyBaseCode.call(PyBaseCode.java:171)
	at org.python.core.PyFunction.__call__(PyFunction.java:434)
	at org.python.core.PyMethod.__call__(PyMethod.java:156)
	at SocketServer$py.process_request_thread$36(SocketServer.py:605)
	at SocketServer$py.call_function(SocketServer.py)
	at org.python.core.PyTableCode.call(PyTableCode.java:167)
	at org.python.core.PyBaseCode.call(PyBaseCode.java:308)
	at org.python.core.PyBaseCode.call(PyBaseCode.java:199)
	at org.python.core.PyFunction.__call__(PyFunction.java:482)
	at org.python.core.PyMethod.instancemethod___call__(PyMethod.java:237)
	at org.python.core.PyMethod.__call__(PyMethod.java:228)
	at org.python.core.PyMethod.__call__(PyMethod.java:223)
	at org.python.core.PyObject._callextra(PyObject.java:620)
	at threading$py.run$35(threading.py:213)
	at threading$py.call_function(threading.py)
	at org.python.core.PyTableCode.call(PyTableCode.java:167)
	at org.python.core.PyBaseCode.call(PyBaseCode.java:139)
	at org.python.core.PyFunction.__call__(PyFunction.java:413)
	at org.python.core.PyMethod.__call__(PyMethod.java:126)
	at threading$py._Thread__bootstrap$36(threading.py:266)
	at threading$py.call_function(threading.py)
	at org.python.core.PyTableCode.call(PyTableCode.java:167)
	at org.python.core.PyBaseCode.call(PyBaseCode.java:308)
	at org.python.core.PyBaseCode.call(PyBaseCode.java:199)
	at org.python.core.PyFunction.__call__(PyFunction.java:482)
	at org.python.core.PyMethod.instancemethod___call__(PyMethod.java:237)
	at org.python.core.PyMethod.__call__(PyMethod.java:228)
	at org.python.core.PyMethod.__call__(PyMethod.java:218)
	at org.python.core.PyMethod.__call__(PyMethod.java:213)
	at org.python.core.FunctionThread.run(FunctionThread.java:25)
"MainThread":
	at sun.misc.Unsafe.park(Native Method)
	- parking to wait for  <0x00000000e0787518> (a java.util.concurrent.locks.ReentrantLock$NonfairSync)
	at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
	at java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:836)
	at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireQueued(AbstractQueuedSynchronizer.java:870)
	at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquire(AbstractQueuedSynchronizer.java:1199)
	at java.util.concurrent.locks.ReentrantLock$NonfairSync.lock(ReentrantLock.java:209)
	at java.util.concurrent.locks.ReentrantLock.lock(ReentrantLock.java:285)
	at org.python.modules._threading.RLock.RLock_acquire(RLock.java:42)
	at org.python.modules._threading.RLock$RLock_acquire_exposer.__call__(Unknown Source)
	at org.python.core.PyObject.__call__(PyObject.java:468)
	at logging$py._acquireLock$6(logging/__init__.py:219)
	at logging$py.call_function(logging/__init__.py)
	at org.python.core.PyTableCode.call(PyTableCode.java:167)
	at org.python.core.PyBaseCode.call(PyBaseCode.java:125)
	at org.python.core.PyFunction.__call__(PyFunction.java:403)
	at logging.config$py.stopListening$48(logging/config.py:906)
	at logging.config$py.call_function(logging/config.py)
	at org.python.core.PyTableCode.call(PyTableCode.java:167)
	at org.python.core.PyBaseCode.call(PyBaseCode.java:125)
	at org.python.core.PyFunction.__call__(PyFunction.java:403)
	at test.test_logging$py.setup_via_listener$84(test/test_logging.py:1809)
	at test.test_logging$py.call_function(test/test_logging.py)
	at org.python.core.PyTableCode.call(PyTableCode.java:167)
	at org.python.core.PyBaseCode.call(PyBaseCode.java:154)
	at org.python.core.PyFunction.__call__(PyFunction.java:423)
	at org.python.core.PyMethod.__call__(PyMethod.java:141)
	at test.test_logging$py.test_listen_config_10_ok$85(test/test_logging.py:1825)
	at test.test_logging$py.call_function(test/test_logging.py)
	at org.python.core.PyTableCode.call(PyTableCode.java:167)
	at org.python.core.PyBaseCode.call(PyBaseCode.java:139)
	at org.python.core.PyFunction.__call__(PyFunction.java:413)
	at org.python.core.PyMethod.__call__(PyMethod.java:126)
	at unittest.case$py.run$35(unittest/case.py:377)
	at unittest.case$py.call_function(unittest/case.py)
	at org.python.core.PyTableCode.call(PyTableCode.java:167)
	at org.python.core.PyBaseCode.call(PyBaseCode.java:308)
	at org.python.core.PyBaseCode.call(PyBaseCode.java:199)
	at org.python.core.PyFunction.__call__(PyFunction.java:482)
	at org.python.core.PyMethod.instancemethod___call__(PyMethod.java:237)
	at org.python.core.PyMethod.__call__(PyMethod.java:228)
	at org.python.core.PyMethod.__call__(PyMethod.java:223)
	at org.python.core.PyObject._callextra(PyObject.java:620)
	at unittest.case$py.__call__$37(unittest/case.py:396)
	at unittest.case$py.call_function(unittest/case.py)
	at org.python.core.PyTableCode.call(PyTableCode.java:167)
	at org.python.core.PyBaseCode.call(PyBaseCode.java:308)
	at org.python.core.PyBaseCode.call(PyBaseCode.java:199)
	at org.python.core.PyFunction.__call__(PyFunction.java:482)
	at org.python.core.PyMethod.instancemethod___call__(PyMethod.java:237)
	at org.python.core.PyMethod.__call__(PyMethod.java:228)
	at org.python.core.PyMethod.__call__(PyMethod.java:223)
	at org.python.core.PyObjectDerived.__call__(PyObjectDerived.java:1030)
	at org.python.core.PyObject.__call__(PyObject.java:480)
	at org.python.core.PyObject.__call__(PyObject.java:484)
	at unittest.suite$py.run$16(unittest/suite.py:116)
	at unittest.suite$py.call_function(unittest/suite.py)
	at org.python.core.PyTableCode.call(PyTableCode.java:167)
	at org.python.core.PyBaseCode.call(PyBaseCode.java:308)
	at org.python.core.PyBaseCode.call(PyBaseCode.java:199)
	at org.python.core.PyFunction.__call__(PyFunction.java:482)
	at org.python.core.PyMethod.instancemethod___call__(PyMethod.java:237)
	at org.python.core.PyMethod.__call__(PyMethod.java:228)
	at org.python.core.PyMethod.__call__(PyMethod.java:223)
	at org.python.core.PyObject._callextra(PyObject.java:620)
	at unittest.suite$py.__call__$13(unittest/suite.py:70)
	at unittest.suite$py.call_function(unittest/suite.py)
	at org.python.core.PyTableCode.call(PyTableCode.java:167)
	at org.python.core.PyBaseCode.call(PyBaseCode.java:308)
	at org.python.core.PyBaseCode.call(PyBaseCode.java:199)
	at org.python.core.PyFunction.__call__(PyFunction.java:482)
	at org.python.core.PyMethod.instancemethod___call__(PyMethod.java:237)
	at org.python.core.PyMethod.__call__(PyMethod.java:228)
	at org.python.core.PyMethod.__call__(PyMethod.java:223)
	at org.python.core.PyObjectDerived.__call__(PyObjectDerived.java:1030)
	at org.python.core.PyObject.__call__(PyObject.java:480)
	at org.python.core.PyObject.__call__(PyObject.java:484)
	at unittest.suite$py.run$16(unittest/suite.py:116)
	at unittest.suite$py.call_function(unittest/suite.py)
	at org.python.core.PyTableCode.call(PyTableCode.java:167)
	at org.python.core.PyBaseCode.call(PyBaseCode.java:308)
	at org.python.core.PyBaseCode.call(PyBaseCode.java:199)
	at org.python.core.PyFunction.__call__(PyFunction.java:482)
	at org.python.core.PyMethod.instancemethod___call__(PyMethod.java:237)
	at org.python.core.PyMethod.__call__(PyMethod.java:228)
	at org.python.core.PyMethod.__call__(PyMethod.java:223)
	at org.python.core.PyObject._callextra(PyObject.java:620)
	at unittest.suite$py.__call__$13(unittest/suite.py:70)
	at unittest.suite$py.call_function(unittest/suite.py)
	at org.python.core.PyTableCode.call(PyTableCode.java:167)
	at org.python.core.PyBaseCode.call(PyBaseCode.java:308)
	at org.python.core.PyBaseCode.call(PyBaseCode.java:199)
	at org.python.core.PyFunction.__call__(PyFunction.java:482)
	at org.python.core.PyMethod.instancemethod___call__(PyMethod.java:237)
	at org.python.core.PyMethod.__call__(PyMethod.java:228)
	at org.python.core.PyMethod.__call__(PyMethod.java:223)
	at org.python.core.PyObjectDerived.__call__(PyObjectDerived.java:1030)
	at org.python.core.PyObject.__call__(PyObject.java:480)
	at org.python.core.PyObject.__call__(PyObject.java:484)
	at test.junit_xml$py.run$3(test/junit_xml.py:27)
	at test.junit_xml$py.call_function(test/junit_xml.py)
	at org.python.core.PyTableCode.call(PyTableCode.java:167)
	at org.python.core.PyBaseCode.call(PyBaseCode.java:154)
	at org.python.core.PyFunction.__call__(PyFunction.java:423)
	at org.python.core.PyMethod.__call__(PyMethod.java:141)
	at test.test_support$py._run_suite$102(test/test_support.py:1303)
	at test.test_support$py.call_function(test/test_support.py)
	at org.python.core.PyTableCode.call(PyTableCode.java:167)
	at org.python.core.PyBaseCode.call(PyBaseCode.java:308)
	at org.python.core.PyBaseCode.call(PyBaseCode.java:132)
	at org.python.core.PyFunction.__call__(PyFunction.java:413)
	at test.test_support$py.run_unittest$103(test/test_support.py:1320)
	at test.test_support$py.call_function(test/test_support.py)
	at org.python.core.PyTableCode.call(PyTableCode.java:167)
	at org.python.core.PyBaseCode.call(PyBaseCode.java:308)
	at org.python.core.PyFunction.function___call__(PyFunction.java:471)
	at org.python.core.PyFunction.__call__(PyFunction.java:466)
	at org.python.core.PyFunction.__call__(PyFunction.java:456)
	at test.test_logging$py.test_main$96(test/test_logging.py:1927)
	at test.test_logging$py.call_function(test/test_logging.py)
	at org.python.core.PyTableCode.call(PyTableCode.java:167)
	at org.python.core.PyBaseCode.call(PyBaseCode.java:308)
	at org.python.core.PyFunction.function___call__(PyFunction.java:471)
	at org.python.core.PyFunction.__call__(PyFunction.java:466)
	at org.python.core.PyFunction.__call__(PyFunction.java:461)
	at org.python.core.PyObject._callextra(PyObject.java:620)
	at test.test_support$py.inner$83(test/test_support.py:1095)
	at test.test_support$py.call_function(test/test_support.py)
	at org.python.core.PyTableCode.call(PyTableCode.java:167)
	at org.python.core.PyBaseCode.call(PyBaseCode.java:308)
	at org.python.core.PyBaseCode.call(PyBaseCode.java:119)
	at org.python.core.PyFunction.__call__(PyFunction.java:403)
	at org.python.pycode._pyx0.runtest_inner$5(test/regrtest.py:676)
	at org.python.pycode._pyx0.call_function(test/regrtest.py)
	at org.python.core.PyTableCode.call(PyTableCode.java:167)
	at org.python.core.PyBaseCode.call(PyBaseCode.java:308)
	at org.python.core.PyFunction.function___call__(PyFunction.java:471)
	at org.python.core.PyFunction.__call__(PyFunction.java:466)
	at org.python.core.PyFunction.__call__(PyFunction.java:456)
	at org.python.pycode._pyx0.runtest$4(test/regrtest.py:566)
	at org.python.pycode._pyx0.call_function(test/regrtest.py)
	at org.python.core.PyTableCode.call(PyTableCode.java:167)
	at org.python.core.PyBaseCode.call(PyBaseCode.java:308)
	at org.python.core.PyFunction.function___call__(PyFunction.java:471)
	at org.python.core.PyFunction.__call__(PyFunction.java:466)
	at org.python.core.PyFunction.__call__(PyFunction.java:456)
	at org.python.pycode._pyx0.main$2(test/regrtest.py:503)
	at org.python.pycode._pyx0.call_function(test/regrtest.py)
	at org.python.core.PyTableCode.call(PyTableCode.java:167)
	at org.python.core.PyBaseCode.call(PyBaseCode.java:308)
	at org.python.core.PyBaseCode.call(PyBaseCode.java:119)
	at org.python.core.PyFunction.__call__(PyFunction.java:403)
	at org.python.pycode._pyx0.f$0(test/regrtest.py:1558)
	at org.python.pycode._pyx0.call_function(test/regrtest.py)
	at org.python.core.PyTableCode.call(PyTableCode.java:167)
	at org.python.core.PyCode.call(PyCode.java:18)
	at org.python.core.Py.runCode(Py.java:1401)
	at org.python.util.PythonInterpreter.execfile(PythonInterpreter.java:296)
	at org.python.util.jython.run(jython.java:362)
	at org.python.util.jython.main(jython.java:142)

Found 1 deadlock.
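
For reference, the shape of this report can be reproduced standalone. A minimal Java sketch (illustrative only; this is not the actual Jython/guava code path, and all names are made up) in which two threads take two ReentrantLocks in opposite order produces exactly this kind of jstack-detectable cycle:

    import java.util.concurrent.locks.ReentrantLock;

    public class DeadlockDemo {
        static final ReentrantLock lockA = new ReentrantLock();
        static final ReentrantLock lockB = new ReentrantLock();

        public static void main(String[] args) {
            Thread worker = new Thread(() -> {
                lockA.lock();      // worker holds A ...
                pause();
                lockB.lock();      // ... and waits for B
            }, "Thread-262");
            worker.start();
            lockB.lock();          // main holds B ...
            pause();
            lockA.lock();          // ... and waits for A; jstack now reports the cycle
        }

        static void pause() {
            try { Thread.sleep(200); } catch (InterruptedException ignored) { }
        }
    }
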
msg11145 Author: James Mudd (jamesmudd) Date: 2017-02-28.21:26:21
I can confirm I have also just seen this happen on Ubuntu 16.04 (64-bit) with Java 1.8.0_121.
msg11162 Author: Stefan Richthofer (stefan.richthofer) Date: 2017-03-03.12:32:10
Also observed it on Windows 10 right now.
msg11210 Author: James Mudd (jamesmudd) Date: 2017-03-09.08:11:39
Have just seen test_logging deadlock again, but slightly differently. I'm using master, which includes reverting commit 41d8823 to back out the synchronization in _socket.py and ssl.py.

Since this was reverted it does seem to be better, and now I see a deadlock jstack can't detect. Maybe a read lock? jstack -l output attached.
msg11234 Author: Stefan Richthofer (stefan.richthofer) Date: 2017-03-15.23:41:54
After more study of this issue I found out that somehow test_json is preparing the ground for this deadlock. It sometimes causes guava not to release a certain lock on a cache, which later leads to this deadlock. Especially test_scanstring seems to cause this; I am not yet sure how this can happen.

However it would be helpful if others could confirm the following:

Exclude test_json from regrtest (e.g. insert tests.remove('test_json') into Lib/test/regrtest.py at line 377; there surely is a more proper way to exclude a certain test, feel free to use that one!)

Then, test_logging should not deadlock any more.
msg11236 Author: Jim Baker (zyasoft) Date: 2017-03-16.00:11:17
Very interesting debugging detail!

Note that the json module has Java acceleration support in org.python.modules._json, but otherwise has a pure Python implementation; this acceleration can be readily turned off by removing this entry: https://github.com/jythontools/jython/blob/master/src/org/python/modules/Setup.java#L38

So we can see if it's this acceleration causing the problem or not.
msg11270 Author: Stefan Richthofer (stefan.richthofer) Date: 2017-03-25.04:15:54
I think I have finally understood the cause of this issue - it is crazily subtle. The good news is that it was already (accidentally) fixed as a side effect of https://hg.python.org/jython/rev/17e40de9a541 (I wondered why it never occurred in recent trunk versions).
The bad news is that some things silently go awfully wrong, and I don't see a proper way to detect this on the Jython side.

So, the whole story of this bug... here we go.

Guava uses reentrant locks to guard all sorts of access to its data structures, and it uses finally blocks to unlock those locks. Note that - as the locks are reentrant - finally-enclosed calls can be nested.

This bug originates in finally-blocks not being executed; thus some locks fail to unlock and later cause deadlocks. While this can happen in various scenarios, it typically involves guava and the bypassed unlock typically happens in test_json (i.e. in org.python.modules.sre.SRE_STATE$CACHE, which is backed by com.google.common.cache.LocalCache).
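
The idiom at risk looks roughly like this (a minimal sketch of the pattern, not actual guava source; the method name merely echoes the LocalCache frame in the trace above):

    import java.util.concurrent.Callable;
    import java.util.concurrent.locks.ReentrantLock;

    final class SegmentSketch {
        private final ReentrantLock lock = new ReentrantLock();

        Object lockedGetOrLoad(Callable<Object> loader) throws Exception {
            lock.lock();
            try {
                return loader.call();  // may recurse deeply into user code
            } finally {
                lock.unlock();         // if this line is ever skipped, the segment
            }                          // stays locked and the next acquirer hangs
        }
    }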

But how can a finally-block be bypassed?

https://docs.oracle.com/javase/tutorial/essential/exceptions/finally.html
tells us it can be caused by thread interruption or System.exit, but I ascertained that neither is the case here. The document fails to mention that OutOfMemoryError and StackOverflowError can also terminate the JVM without executing an enclosing finally block:

http://stackoverflow.com/questions/22969000/finally-block-will-be-executed-in-case-of-outofmemoryerror
http://stackoverflow.com/questions/7753443/behaviour-of-jvm-during-out-of-memory-error-list-s-new-arrayliststring
http://stackoverflow.com/questions/17836120/stack-overflow-error-handling-in-finally-block

Normally that doesn't matter much, since everything blows up anyway. However, recall that in guava, calls enclosed by finally blocks are sometimes nested. In that case an outer catch block can intercept the error (and is often able to, because while the error propagates up the stack, memory may get freed) and then resume normal execution - with the slight detail that an inner finally block was bypassed and the lock count was messed up. The nasty thing here is that no error is displayed - things go silently wrong. Also, it is somewhat random how much of the stack is skipped before normal execution resumes - sometimes it even drops two unlocks at a time.

I observed that test_json just happens to yield very deep call stacks, probably because it is a nested test suite. Extending Xss (https://hg.python.org/jython/rev/17e40de9a541) somewhat solves it; however, the issue only shifts to deeper stacks. My concern is that the OutOfMemoryError or StackOverflowError is not displayed properly; just a (seemingly) inexplicable deadlock occurs later on (and I can tell, it takes a lot of time and thought to discover the actual cause).

I suppose this phenomenon causes all of the recently observed deadlocks (e.g. in #2565 an unlock is lost on PyType$LazyClassToTypeHolder.classToType, which is backed by com.google.common.collect.MapMakerInternalMap), and I don't see a way to detect it properly in Jython. Increasing Xss is the best thing we can do on the Jython side so far.
Guava could maybe detect bypassed finally blocks by explicitly catching OutOfMemoryError/StackOverflowError and then issuing a warning or forcing a hard exit (or by double-checking lock consistency once normal execution eventually resumes).
Actually, the JVM itself should not resume execution once an OutOfMemoryError or StackOverflowError has been thrown. IMHO these errors should explicitly propagate through the finally blocks of the entire stack rather than bypassing only some of them in a non-deterministic fashion.

So I am likely going to file issues regarding this scenario with guava and with Oracle/the JVM.

Opinions?
msg11273 Author: Jeff Allen (jeff.allen) Date: 2017-03-25.11:21:26
Impressive analysis, Stefan. Not much I can add.

I suppose the finally clauses fail to operate because they need to allocate objects? It's almost as if a rule of successful finally clauses were that they should have all their object allocation done ahead of time. If, once you run out of memory, all contracts enforced by finally are unreliable, the condition should not be handled too locally, but should propagate out to where all affected objects are dereferenced. But how can anyone know where that is?

Only one thought: even after it is (mis-)handled, is there any chance of spotting that we were, for a time, out of memory?
msg11274 Author: James Mudd (jamesmudd) Date: 2017-03-25.14:54:08
@Stefan Really good work; that's a really interesting read.

That also explains why I haven't seen this happen for a while, which is good.

I'm still not exactly clear what happens to the OutOfMemoryError or StackOverflowError. If this error gets thrown but not caught, you should go through each finally block down the stack until the JVM exits? If one finally block tries to create new objects (which it seems like calling unlock on a ReentrantLock might), then that finally should throw, and you will drop out into the next enclosing finally (which may do the same) and so on, but the Throwable itself shouldn't disappear unless it's caught? So for "normal" execution to resume, the Throwable must be caught? I can't see anywhere in Guava where that could happen; there are a few places in Jython, but none look very likely to me. How does normal execution resume?

Do you have any way to reproduce this?

I'm really glad someone is looking at this one, I tried to reproduce it for a while and eventually gave up.
msg11277 Author: Stefan Richthofer (stefan.richthofer) Date: 2017-03-27.16:46:45
>I'm still not exactly clear what happens to the OutOfMemoryError or StackOverflowError

I think it's StackOverflowError, but I suppose that under heavy memory load the former (OutOfMemoryError) could cause a similar phenomenon.

Actually I wasn't able to detect the Error explicitly "in the field", but the attached Test.java can provoke the explained behavior (you'll have to run it several times, but it runs fast and I observe the issue every fifth run or so, so it's still easy to reproduce).

Additionally, the fact that the Xss value can affect the likelihood of the issue supports the StackOverflowError thesis.

If you run Test.java often enough you will observe something like this:

...
recursion 4607
recursion 4608
recursion 4609
recursion 4610
recursion 4611java.lang.Stacjava.lang.StackOverflowErrorjava.lang.StackOverflowErrorjava.lang.StackOverflowError
4608
10619136
10628355
continue normal execution...

If the checksums (10619136, 10628355) differ, some finally blocks were bypassed; note that the difference is 4609+4610, which belongs to the frames that were also dropped by the catch: it printed 4608. The construct as a whole, however, still resumes normal execution.

I am currently not able to pin down how exactly this program structure is embedded in guava, but I am convinced it is present there in manifold ways (which is why we observed various slightly different deadlocks).
msg11279 Author: James Mudd (jamesmudd) Date: 2017-03-27.22:14:41
@Stefan I'm now also convinced this is happening, although I still can't figure out exactly where the StackOverflowError is going. With the debugger connected and an exception breakpoint on StackOverflowError, running

james@james-ubuntu:~/git/jython/dist$ java -agentlib:jdwp=transport=dt_socket,server=y,suspend=y,address=1045 -Xss1024k -jar jython.jar Lib/test/test_json.py

I can reliably get test_json to pause as the exception is thrown, usually in test_endless_recursion, which sounds reasonable, and after disconnecting, the test runs to completion, so normal execution has resumed somehow...

Unfortunately Eclipse seems to have serious issues retrieving the stack and letting me step through it, probably due to the state the JVM is in; the debugger just freezes, which is strange. I wonder if there is some other way...
msg11280 Author: James Mudd (jamesmudd) Date: 2017-03-29.22:15:26
So, having spent a bit more time on this, I think Stefan is right that the deadlocks are caused by skipped finally blocks. IntelliJ's debugger seems to have no issues pausing on StackOverflowError and watching the execution, so if you want to see what happens I recommend trying IntelliJ.

So what happens is: at some point a StackOverflowError is thrown; some finally blocks are executed, but some are skipped when another StackOverflowError is thrown during the finally. Eventually this is caught in Jython code, at line 168 in PyTableCode, which calls Py.JavaError(t); that method explicitly handles the StackOverflowError, returning a Py.RuntimeError("maximum recursion depth exceeded (Java StackOverflowError)"). The issue occurs because this is exactly what these recursion tests are expecting to happen. So they see the RuntimeError, the test passes, and normal execution resumes (e.g. line 108 in /Lib/json/tests/test_recursion.py - this is a good test to run, as it always reproduces this issue). So the good thing is that this works as expected, I guess, except for the part where finally blocks were skipped. If there is a way to handle this in Java (i.e. somehow ensure all the locks are unlocked), we could do it by hooking the Py.JavaError(Throwable t) method, which handles OutOfMemoryError and StackOverflowError explicitly.
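
A hedged sketch of that idea (this is not Jython's actual Py.JavaError source; auditLockState is a hypothetical helper, shown only to mark where such a hook could live):

    public final class JavaErrorHookSketch {
        // Mirrors the mapping described above: a StackOverflowError becomes
        // a Python-level "maximum recursion depth exceeded" RuntimeError.
        static RuntimeException javaError(Throwable t) {
            if (t instanceof StackOverflowError) {
                auditLockState();  // hypothetical hook: verify/repair shared locks
                return new RuntimeException(
                        "maximum recursion depth exceeded (Java StackOverflowError)");
            }
            return new RuntimeException(t);
        }

        static void auditLockState() {
            // Hypothetical: e.g. check that cache ReentrantLocks are not still
            // held by the current thread, and release or at least report them.
        }
    }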

The thing I don't understand is how increasing Xss has any effect on this (although I observe it does); even with Xss5m the StackOverflowError still happens, as this is an infinite recursion test. So you would think the unlocking issue would still happen, but it doesn't? I can only assume the stack is so deep that, by the time you make it back to a lock, the finally will be executed successfully.

Maybe this is useful; if nothing else, I enjoyed seeing how this happens.
msg11281 Author: Stefan Richthofer (stefan.richthofer) Date: 2017-03-30.08:22:00
I learned that my minimal example Test.java is wrong. The finally blocks are actually always called there; only the System.out.println was sometimes swallowed. If one prints the checksum along with the "continue normal execution..." output line, it works as it should.

James, could you name the class and method where the StackOverflowError happens? And also the class and method of the finally block that raises another StackOverflowError?
msg11282 Author: Stefan Richthofer (stefan.richthofer) Date: 2017-03-30.09:15:57
Okay, with James' hints I was able to write a new minimal example showing how finally blocks can be bypassed ... actually. See Test2.java.

The trick is as follows: after catching a StackOverflowError, the finally block performs a method call. Even if the called method is empty, this can sometimes directly trigger another StackOverflowError, because the stack is already near exhaustion. (In the example I let the dummy method call itself twice so the stack expands slightly more, which makes the issue occur literally always.)

You will find that it outputs something like

Start...
continue normal execution... 7924

which means that some resource would not have been successfully cleaned up. (Otherwise it would be "continue normal execution... 0")
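
For readers without the attachment, here is a hedged reconstruction of the pattern Test2.java demonstrates (approximate and inherently nondeterministic; the attached file is the authoritative version):

    public class FinallyBypassSketch {
        static int pendingCleanups = 0;  // +1 on "acquire", -1 in the finally

        static void recurse(int depth) {
            pendingCleanups++;
            try {
                recurse(depth + 1);
            } catch (StackOverflowError e) {
                // the first overflow is caught here, very close to exhaustion
            } finally {
                grow();              // even a tiny call can re-overflow here
                pendingCleanups--;   // skipped whenever grow() blew the stack
            }
        }

        static void grow() { grow2(); }  // short call chain, deepens the stack a bit
        static void grow2() { }

        public static void main(String[] args) {
            System.out.println("Start...");
            try {
                recurse(0);
            } catch (StackOverflowError e) {
                // a later overflow escaping some finally block may land here
            }
            // non-zero output means some finally bodies were partially skipped
            System.out.println("continue normal execution... " + pendingCleanups);
        }
    }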

I guess that an increased Xss somehow lets the JVM manage stack memory in larger blocks, so it is more resilient if the finally block requires additional stack; however, this is a pure guess.

The real question is:
Why is the StackOverflowError that (apparently) happens in the finally block not thrown? AFAIK exceptions raised in catch or finally blocks should propagate up the stack, e.g. see http://stackoverflow.com/questions/3779285/exception-thrown-in-catch-and-finally-clause

So this behavior is definitely somehow fishy.
msg11283 Author: James Mudd (jamesmudd) Date: 2017-03-30.11:07:23
@Stefan I'm at work at the moment, so I will try to post the classes and methods tonight, but just a few comments:

1. The new example is exactly right. I haven't tried it, but getting a StackOverflowError to occur in the finally is very easy in this situation; as you say, the stack is very nearly exhausted already, so any method call could cause it.

2. Once you throw in the finally, I believe the new exception replaces the existing one; I will check this tonight, but I remember investigating it once before for another reason (a small check demonstrating this appears at the end of this message). In this case it will be a new StackOverflowError, but the stack it contains will be smaller, as you have already dropped a stack frame, maybe more; I think this is the source of a lot of the confusion around this. In this situation the stack trace you see in any exception will be only a partial stack. If anyone knows I'm wrong please correct this, because I think this is a key point which we should be clear on.

3. What makes you think the StackOverflowError thrown in the finally doesn't propagate up the stack? Looking at your example, I would suggest that only the last catch needs to succeed in catching the StackOverflowError and you will see nothing, as every time a new StackOverflowError is thrown the old one is lost?

Disclaimer: I might be wrong on any of this (if I am, please correct me). I didn't even run the example; it's just from looking at the code, but I thought it might be useful for discussion.

Also, I don't know if it's relevant here, but throwing an exception over another suppresses the first; there is now a Java way to recover these (see Throwable.getSuppressed), but I'm not sure it will apply in this case.
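
A small self-contained check of point 2 and the getSuppressed question: in a plain try/finally, an exception thrown in the finally simply replaces the one in flight; the original is lost, not suppressed (getSuppressed is only populated by try-with-resources or an explicit addSuppressed):

    public class FinallyReplacesDemo {
        public static void main(String[] args) {
            try {
                try {
                    throw new IllegalStateException("original");
                } finally {
                    throw new RuntimeException("from finally");  // replaces "original"
                }
            } catch (RuntimeException e) {
                System.out.println(e.getMessage());            // prints: from finally
                System.out.println(e.getSuppressed().length);  // prints: 0
            }
        }
    }
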
msg11284 Author: Stefan Richthofer (stefan.richthofer) Date: 2017-03-30.12:16:11
> What makes you think the StackOverflowError thrown in the finally doesn't propagate up the stack?

It also occurred to me that an earlier call of recursion would handle the subsequent StackOverflowError, but if you add an output to the catch block (not present in the example) you will see that only a single StackOverflowError is raised.

I also read that an error or exception thrown in a catch or finally block can replace the original one, but IMO the initial StackOverflowError has already been handled by the catch by the time the finally block starts. In either case *something* should be visible in higher catch/finally blocks, but nothing shows up (even if you catch all Error or Throwable instead of just StackOverflowError).
History
Date                 User               Action  Args
2017-03-30 12:16:11  stefan.richthofer  set     messages: + msg11284
2017-03-30 11:07:23  jamesmudd          set     messages: + msg11283
2017-03-30 10:51:11  stefan.richthofer  set     files: + Test2.java
2017-03-30 10:50:48  stefan.richthofer  set     files: - Test2.java
2017-03-30 09:15:58  stefan.richthofer  set     files: + Test2.java; messages: + msg11282
2017-03-30 08:22:01  stefan.richthofer  set     messages: + msg11281
2017-03-29 22:15:28  jamesmudd          set     messages: + msg11280
2017-03-27 22:14:42  jamesmudd          set     messages: + msg11279
2017-03-27 16:46:46  stefan.richthofer  set     files: + Test.java; messages: + msg11277
2017-03-25 14:54:08  jamesmudd          set     messages: + msg11274
2017-03-25 11:21:26  jeff.allen         set     nosy: + jeff.allen; messages: + msg11273
2017-03-25 04:15:56  stefan.richthofer  set     messages: + msg11270
2017-03-16 00:11:17  zyasoft            set     messages: + msg11236
2017-03-15 23:41:55  stefan.richthofer  set     messages: + msg11234
2017-03-09 08:11:41  jamesmudd          set     files: + test_logging_undetected_deadlock.txt; messages: + msg11210
2017-03-03 12:32:10  stefan.richthofer  set     messages: + msg11162
2017-02-28 21:26:21  jamesmudd          set     nosy: + jamesmudd; messages: + msg11145
2016-12-01 14:43:27  stefan.richthofer  create