Issue2337

classification
Title: StackOverflowError in gc.collect
Type: Severity: normal
Components: Versions: Jython 2.7
Milestone:
process
Status: closed Resolution: fixed
Dependencies: Superseder:
Assigned To: stefan.richthofer Nosy List: jmadden, stefan.richthofer
Priority: normal Keywords:

Created on 2015-04-23.15:24:48 by jmadden, last changed 2015-05-19.22:09:49 by zyasoft.

Messages
msg9932 (view) Author: Jason Madden (jmadden) Date: 2015-04-23.15:24:47
I came across this while running the `zodbshootout` to get an idea of Jython's performance under ZODB with high-concurrency. There were two threads running, each of which calls `gc.collect` at the beginning of a certain phase of the benchmark (roughly halfway through).

  Traceback (most recent call last):
  File "/jython2.7rc3/Lib/site-packages/zodbshootout/fork.py", line 66, in run
    res = self.func(self.param, self.sync)
  File "/jython2.7rc3/Lib/site-packages/zodbshootout/main.py", line 233, in read
    return self.read_test(db_factory, n, sync)
  File "/jython2.7rc3/Lib/site-packages/zodbshootout/main.py", line 177, in read_test
    conn.cacheMinimize()
  File "//ZODB/src/ZODB/Connection.py", line 272, in cacheMinimize
    connection._cache.minimize()
  File "//persistent/persistent/picklecache.py", line 281, in full_sweep
    return self._sweep(0)
  File "//persistent/persistent/picklecache.py", line 75, in locked
    return f(self, *args, **kwargs)
  File "//persistent/persistent/picklecache.py", line 420, in _sweep
    gc.collect()
  RuntimeError: maximum recursion depth exceeded (Java StackOverflowError)

Even setting -Xss to 16M still overflowed.

ZODB (or more specifically, persistent) does use the `gc.monitorObject` API as well as `weakref.WeakValueDictionary`: the values in the dictionary are monitored because we need a more deterministic cleanup of them then standard Java GC semantics offer.

I tried the tip from #2258 (org.python.core.Options.showJavaExceptions = True) to get the Java stack, but that didn't produce additional output.
msg9937 (view) Author: Stefan Richthofer (stefan.richthofer) Date: 2015-04-23.17:39:01
Could you provide a (preferably minimal) code-sample that produces the error? To fix it reliably, we will need to pin it in a unit-test.
msg9939 (view) Author: Jason Madden (jmadden) Date: 2015-04-23.21:10:26
I haven't been able to reproduce it, but I did get the stack trace. The interesting part:

	at org.python.modules._collections.PyDeque.traverseNode(PyDeque.java:714)
	at org.python.modules._collections.PyDeque.traverseNode(PyDeque.java:714)
	at org.python.modules._collections.PyDeque.traverseNode(PyDeque.java:714)
	at org.python.modules._collections.PyDeque.traverseNode(PyDeque.java:714)
	at org.python.modules._collections.PyDeque.traverseNode(PyDeque.java:714)

...which probably explains why I haven't been able to reproduce it, I was focusing on the WeakValueDictionary and gc.monitorObject, not a deque. I'll see if I can figure out how the application is using a deque.
msg9940 (view) Author: Jason Madden (jmadden) Date: 2015-04-23.22:12:44
Here's a reproducible example:

  >>> import gc
  >>> import weakref
  >>> from Queue import Queue
  >>> class X(object):
  ...    def __init__(self, q):
  ...         self.q = q
  ...
  >>> x = X(Queue())
  >>> gc.monitorObject(x)
  >>> gc.collect()
  Traceback (most recent call last):
    File "<stdin>", line 1, in <module>
  RuntimeError: maximum recursion depth exceeded (Java StackOverflowError)

Here's where the infinite recursion from the earlier stack trace begins:

	at org.python.modules._collections.PyDeque.traverse(PyDeque.java:725)
	at org.python.modules.gc.traverse(gc.java:2795)
	at org.python.modules.gc.removeNonCyclicWeakRefs(gc.java:2531)
	at org.python.modules.gc.syncCollect(gc.java:1889)
	at org.python.modules.gc.collect_intern(gc.java:1794)
	at org.python.modules.gc.collect(gc.java:1722)
msg9941 (view) Author: Stefan Richthofer (stefan.richthofer) Date: 2015-04-23.22:55:41
The thing with PyDequeue makes sense. I actually contradicted my own warnings there and traversed java-objects. I thought I had done it in a safe way, but obviously I was wrong. Try the following fix:

Replace lines 680 onwards of PyDequeue.java with the following code (i.e. such that traverseNode won't be used and replaced by gc.traverseByReflection):


        /* Traverseproc implementation */
        @Override
        public int traverse(Visitproc visit, Object arg) {
            int retVal = super.traverse(visit, arg);
            if (retVal != 0) {
                return retVal;
            }
            //return lastReturned == null ? 0 : traverseNode(lastReturned, visit, arg);
            return lastReturned == null ? 0 :
                    gc.traverseByReflection(lastReturned, visit, arg)
        }

        @Override
        public boolean refersDirectlyTo(PyObject ob) throws UnsupportedOperationException {
            if (ob == null) {
                return false;
            } else if (super.refersDirectlyTo(ob)) {
                return true;
            } else {
                throw new UnsupportedOperationException();
            }
        }
    }


    /* Traverseproc implementation */
//    private static int traverseNode(Node node, Visitproc visit, Object arg) {
//        int retVal;
//        if (node.data != null) {
//            retVal = visit.visit(node.data, arg);
//            if (retVal != 0) {
//                return retVal;
//            }
//        }
//        if (node.left != null) {
//            retVal = traverseNode(node.left, visit, arg);
//            if (retVal != 0) {
//                return retVal;
//            }
//        }
//        return node.right == null ? 0 : traverseNode(node.right, visit, arg);
//    }

    @Override
    public int traverse(Visitproc visit, Object arg) {
        //return header == null ? 0 : traverseNode(header, visit, arg);
        return header == null ? 0 : gc.traverseByReflection(header, visit, arg);
    }

    @Override
    public boolean refersDirectlyTo(PyObject ob) throws UnsupportedOperationException {
        throw new UnsupportedOperationException();
    }
}




--------------

I left removes out-commented so you can better see what was actually changed. gc.traverseByReflection should be safe I suppose. If this cures the bug, we can merge it in, but you must confirm this first, as you are currently the only person able to reproduce this error.
msg9942 (view) Author: Stefan Richthofer (stefan.richthofer) Date: 2015-04-23.22:59:00
Okay, sorry I overlooked that you already posted a reproducible example. I will try it when I find time. However feel encouraged to try my proposed fix too, so maybe it just works...
msg9944 (view) Author: Stefan Richthofer (stefan.richthofer) Date: 2015-04-23.23:27:05
Okay, I just confirmed that this change in PyDequeue fixes your example-code. Can you confirm that it also fixes your more complex original scenario?
msg9945 (view) Author: Jason Madden (jmadden) Date: 2015-04-23.23:36:13
I just finished testing a (simpler?) change that corrects both my test case and the original issue; basically the idea is to detect the circular nature of the Node structure and stop when you loop around:

diff -r 95d09ba14aa7 src/org/python/modules/_collections/PyDeque.java
--- a/src/org/python/modules/_collections/PyDeque.java	Wed Apr 15 18:34:17 2015 -0400
+++ b/src/org/python/modules/_collections/PyDeque.java	Thu Apr 23 18:33:08 2015 -0500
@@ -702,7 +702,7 @@


     /* Traverseproc implementation */
-    private static int traverseNode(Node node, Visitproc visit, Object arg) {
+    private int traverseNode(Node node, Visitproc visit, Object arg) {
         int retVal;
         if (node.data != null) {
             retVal = visit.visit(node.data, arg);
@@ -710,13 +710,13 @@
                 return retVal;
             }
         }
-        if (node.left != null) {
+        if (node.left != null && node.left != header) {
             retVal = traverseNode(node.left, visit, arg);
             if (retVal != 0) {
                 return retVal;
             }
         }
-        return node.right == null ? 0 : traverseNode(node.right, visit, arg);
+        return node.right == null || node.right == header ? 0 : traverseNode(node.right, visit, arg);
     }

     @Override
msg9946 (view) Author: Stefan Richthofer (stefan.richthofer) Date: 2015-04-24.00:26:17
Until now I actually did not look into the detailes of PyDequeue and just wrote traverseprocs to catch all non-static fields somehow.
However. Since it appears to be a doubly linked, circular list-structure, simply watching out for header is probably more efficient than using general purpose reflection-based traversal. I would add some improvement to your variant though. The iteration should be done in a non-recursive way to avoid bothering Java's stack unnecessarily:

    @Override
    public int traverse(Visitproc visit, Object arg) {
        if (header == null) {
            return 0;
        }
        int retVal = 0;
        if (header.data != null) {
            retVal = visit.visit(header.data, arg);
            if (retVal != 0) {
                return retVal;
            }
        }
        Node tmp = header.right;
        while (tmp != header) {
            if (tmp.data != null) {
                retVal = visit.visit(tmp.data, arg);
                if (retVal != 0) {
                    return retVal;
                }
            }
            tmp = tmp.right;
        }
        return retVal;
    }

This seems to work, but I'd appreciate if you could review this iteration to make sure I did not mess up anything.
msg9951 (view) Author: Jason Madden (jmadden) Date: 2015-04-24.13:14:54
I like that approach much better. It works in both the simple and complex tests, with one slight modification: A separate `traverseNode` function is called by an active iterator (the inner class PyDequeIter), to traverse beginning with its current position, so that still needs to exist (well, and a null check). I'm not sure if this is an optimization (since eventually it may wind up traversing the full ring anyway) or has semantic implications. 

This is the code I've been testing with (the last example just split into two functions for the sake of PyDequeIter plus a null check):

    private int traverseNode(Node node, Visitproc visit, Object arg) {
        int retVal = 0;
        while (node != header && node != null) {
            if (node.data != null) {
                retVal = visit.visit(node.data, arg);
                if (retVal != 0) {
                    return retVal;
                }
            }
            node = node.right;
        }
        return retVal;
    }

    @Override
    public int traverse(Visitproc visit, Object arg) {
        int retVal = 0;
        if (header.data != null) {
            retVal = visit.visit(header.data, arg);
            if (retVal != 0) {
                return retVal;
            }
        }
        return traverseNode(header.right, visit, arg);
    }
msg9955 (view) Author: Stefan Richthofer (stefan.richthofer) Date: 2015-04-24.17:02:36
Right, it should be fixed for PyDequeIter too. Some remarks:

- In PyDequeIter, traverseNode should not stop at header, but rotate around the entire circle until .lastReturned is reached again (in case lastReturned != header initially)

- Checking node.right for null during iteration should not be necessary as PyDeque always keeps the list a circle, having never null-links

- Unless a multithreaded scenario removes a node and nulls its .right while traverse-iteration traverses it

- But the right way to prevent this is to make traverseNode synchronized anyway, isn't it?

- I reasoned whether this might produce subtle deadlocks, but I suppose it should not.

Proposed code:


    private class PyDequeIter extends PyIterator {
        ...
        ...

        /* Traverseproc implementation */
        @Override
        public int traverse(Visitproc visit, Object arg) {
            int retVal = super.traverse(visit, arg);
            if (retVal != 0) {
                return retVal;
            }
            return traverseNode(lastReturned, Visitproc visit, Object arg);
        }

        @Override
        public boolean refersDirectlyTo(PyObject ob) throws UnsupportedOperationException {
            if (ob == null) {
                return false;
            } else if (super.refersDirectlyTo(ob)) {
                return true;
            } else {
                throw new UnsupportedOperationException();
            }
        }
    }


    /* Traverseproc implementation */
    private synchronized int traverseNode(Node node, Visitproc visit, Object arg) {
        if (node == null) {
            return 0;
        }
        int retVal = 0;
        if (node.data != null) {
            retVal = visit.visit(node.data, arg);
            if (retVal != 0) {
                return retVal;
            }
        }
        Node tmp = node.right;
        while (tmp != node) {
            if (tmp.data != null) {
                retVal = visit.visit(tmp.data, arg);
                if (retVal != 0) {
                    return retVal;
                }
            }
            tmp = tmp.right;
        }
        return retVal;
    }

    @Override
    public int traverse(Visitproc visit, Object arg) {
        return traverseNode(header, visit, arg);
    }
 ...
msg9957 (view) Author: Jason Madden (jmadden) Date: 2015-04-24.17:26:53
If `PyDequeIter` needs to be sure to traverse all the way around the ring, then starting with `lastReturned` is really only a small optimization for the (rare?) case that nodes ahead of it return 0 but it and nodes behind it don't. Right?

So in that case, why not simply have its traverse method just call PyDeque's implementation?

        @Override
        public int traverse(Visitproc visit, Object arg) {
            int retVal = super.traverse(visit, arg);
            if (retVal != 0) {
                return retVal;
            }
            return PyDeque.this.traverse(visit, arg);
        }

That can potentially slightly simplify `traverseNode` again, or just move it all into `traverse`.

I think I agree with you about the null checks, so long as the lock is held. I don't know what to think about deadlocks, though; I don't see any obvious ones but I don't know enough about how/when this gets called.
msg10046 (view) Author: Stefan Richthofer (stefan.richthofer) Date: 2015-05-11.12:51:04
So, finally I found time to write a unittest about this and merge-in the bugfix.

@Jason
I adopted your last suggestion for this.


This issue is fixed now as of commit changeset: 7705:376c6130d9e8
History
Date User Action Args
2015-05-19 22:09:49zyasoftsetstatus: pending -> closed
2015-05-11 12:51:59stefan.richthofersetstatus: open -> pending
resolution: fixed
2015-05-11 12:51:05stefan.richthofersetmessages: + msg10046
2015-04-24 17:26:53jmaddensetmessages: + msg9957
2015-04-24 17:02:36stefan.richthofersetmessages: + msg9955
2015-04-24 13:14:55jmaddensetmessages: + msg9951
2015-04-24 00:26:18stefan.richthofersetmessages: + msg9946
2015-04-23 23:36:13jmaddensetmessages: + msg9945
2015-04-23 23:27:05stefan.richthofersetmessages: + msg9944
2015-04-23 23:02:21stefan.richthofersetassignee: stefan.richthofer
2015-04-23 22:59:00stefan.richthofersetmessages: + msg9942
2015-04-23 22:55:42stefan.richthofersetmessages: + msg9941
2015-04-23 22:12:45jmaddensetmessages: + msg9940
2015-04-23 21:10:26jmaddensetmessages: + msg9939
2015-04-23 17:39:01stefan.richthofersetmessages: + msg9937
2015-04-23 15:39:17zyasoftsetpriority: normal
nosy: + stefan.richthofer
2015-04-23 15:24:48jmaddencreate