Title: missing synchronized causes deadlocks in org.python.core.IdImpl
Type: crash Severity: normal
Components: Core Versions: 2.5.1, 2.5.0
Status: closed Resolution: fixed
Dependencies: Superseder:
Assigned To: zyasoft Nosy List: akruis, zyasoft
Priority: Keywords: patch

Created on 2010-04-26.09:25:49 by akruis, last changed 2010-04-27.22:29:40 by zyasoft.

File name Uploaded Description Edit Remove
IdImpl.diff akruis, 2010-04-26.09:25:49 Patch for this issue: diff against rev 7046
thread_dumps.txt akruis, 2010-04-27.07:12:10
msg5748 (view) Author: Anselm Kruis (akruis) Date: 2010-04-26.09:58:42
The current implementation of org.python.core.IdImpl is not fully thread save, because the #java_obj_id(Object) method is not synchronized. IdImpl uses a java.util.HashMap to store ID-values for objects. The API-doc for java.util.HashMap requires to synchronize concurrent access to HashMap.

IdImpl.java_obj_id(Object) method is indirectly called by thread.get_ident().
(thread.get_ident() -> org.python.core.Py.java_obj_id(Object) -> org.python.core.IdImpl.java_obj_id(Object)

Django uses thread.get_ident() in its transaction management code. I observed two deadlocks on a multiprocessor server, that were apparently caused by the IdImpl#isMap HashMap being corrupted. I can provide the Java thread dumps if anybody is interested.

The attached patch fixes this issue.
msg5750 (view) Author: Jim Baker (zyasoft) Date: 2010-04-26.15:30:13
Thanks for the simple patch. Could you submit a test case as well?
msg5752 (view) Author: Anselm Kruis (akruis) Date: 2010-04-27.07:12:10
Unfortunately, the bug is fairly hard to trigger. I couldn't reproduce it on my personal development system. You probably need: several threads, memory pressure to trigger garbage collection. At least this are the conditions on my pre-production system. I observed the deadlock twice within about 8-millions transactions.

I have attached the thread dumps of both cases. Common to both cases is, that the "RUNNABLE" thread consumes 100% CPU, doesn't perform any system calls and does not return from java.util.HashMap.get / java.util.HashMap.removeEntryForKey. Both methods traverse a linked list and this became cyclic.
msg5754 (view) Author: Jim Baker (zyasoft) Date: 2010-04-27.22:29:40
Fixed in r7048

Thanks for the debug output. The infinite loop you observed on a hash bucket is perhaps the classical bug now seen in thread safety problems :)

Fortunately it's also easy to fix.

I'm going to mark this bug closed since the specific bug is basically not possible to recreate with unit tests (at least with our resources) and more importantly it's a well-known problem with a robust solution.
Date User Action Args
2010-04-27 22:29:40zyasoftsetstatus: open -> closed
resolution: accepted -> fixed
messages: + msg5754
2010-04-27 07:12:14akruissetfiles: + thread_dumps.txt
messages: + msg5752
2010-04-26 15:30:26zyasoftsetassignee: zyasoft
resolution: accepted
2010-04-26 15:30:14zyasoftsetnosy: + zyasoft
messages: + msg5750
2010-04-26 09:58:42akruissetmessages: + msg5748
2010-04-26 09:25:49akruiscreate