Issue1660

classification
Title: threading module memory leak
Type: crash Severity: major
Components: Core Versions: 2.5.1, 2.5.2b1
process
Status: closed Resolution: fixed
Dependencies: Superseder:
Assigned To: zyasoft Nosy List: amak, gdoutch, treaves, zyasoft
Priority: high Keywords: patch

Created on 2010-09-30.08:18:48 by gdoutch, last changed 2011-06-16.21:03:36 by treaves.

Files
File name Uploaded Description Edit Remove
socketleak.py gdoutch, 2010-09-30.08:18:43
unnamed gdoutch, 2010-10-01.21:18:57
socket_concurrency.patch amak, 2010-10-04.21:04:53 A patch which attempts synchronization in a couple of areas to address this memory leak.
unnamed gdoutch, 2010-10-05.15:37:15
Messages
msg6110 (view) Author: Gareth Doutch (gdoutch) Date: 2010-09-30.08:18:42
The attached script is modified slightly from the example given in:
http://docs.python.org/library/socketserver.html

to make it run "forever". However, a memory leak occurs and eventually we see:

Exception in thread "RMI TCP Connection(idle)" java.lang.OutOfMemoryError: Java heap space
Exception in thread Thread:Traceback (most recent call last):
Exception in thread "MainThread" java.lang.OutOfMemoryError: Java heap space
  File "C:\jython2.5.2b2\Lib\threading.py", line 179, in _Thread__bootstrap

I am running 2.5.2b2, however this behaviour is observed on 2.5.1 also. Running Sun JDK on Win32 (XP).
msg6115 (view) Author: Jim Baker (zyasoft) Date: 2010-10-01.17:19:54
How long is eventually?

On my OS X box, there *may* be a slow general trend to increasing memory, but it looks I'm going to have run this for quite a while to really see if that's the case or not. (I'm using jconsole to watch heap memory.)
msg6117 (view) Author: Alan Kennedy (amak) Date: 2010-10-01.19:36:58
The attached code should not run. Instead, it should crash. On Windows, it does so with a socket.error, "Address already in use".

The reason for this is because of the deferred way that socket objects are created in jython. Java has different objects for Client (java.net.Socket+java.nio.channels.SocketChannel) and Server (java.net.ServerSocket+java.nio.channels.ServerSocketChannel) objects. Jython cannot know whether to create a client or serever socket until some client or server specific behaviour is observed. For clients, this is a connect() call, and for servers, this is a listen() call. When a connect() is made, jython then knows to create a java.net.Socket+java.nio.channels.SocketChannel to implement the socket. Similarly, when a listen() call is made, jython then knows to create a java.net.ServerSocket+java.nio.channels.ServerSocketChannel to implement the socket. Any attempt to get information about a socket before either a connect() or listen() call is made, i.e. before the implementation socket is created, will fail, although attempts are made to return already set configuration values, if possible.

But the default TCPServer code from SocketServer.py is not aware of this subtlety of jython sockets. The following methods illustrate the issue (comments snipped to make the code smaller).

class TCPServer(BaseServer):

    def __init__(self, server_address, RequestHandlerClass):
        BaseServer.__init__(self, server_address, RequestHandlerClass)
        self.socket = socket.socket(self.address_family,
                                    self.socket_type)
        self.server_bind()
        self.server_activate()

    def server_bind(self):
        if self.allow_reuse_address:
            self.socket.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
        self.socket.bind(self.server_address)
        self.server_address = self.socket.getsockname()

    def server_activate(self):
        self.socket.listen(self.request_queue_size)

Note that the server_address is set before after the bind() call, but before the listen() call. Meaning self.socket.getsockname() will return 0 for the port number, which was the requested value. The actual ephemeral port number wll not be assigned until after the listen() call.

So, in your attached code, the (host, port) tuple ("localhost", 0) will be returned for server_address, which is not a valid socket address for the client to connect to. As I mentioned, on Windows this crashes with an "Address already in use" exception.

So, in order to get your code working, I had to assign an explicit port number, e.g. 8888. Then your code runs as expected.

Notes:

1. The above is really a bug in the jython implementation of SocketServer.TCPServer, which we took directly from cpython without modification. Our implementation needs to be aware of the deferred creation issues mentioned above. I'm surprised if it runs on any platform, i.e. no platform should permit an attempt to connect port 0.

2. Once the code is running, by specifying an explicit port number, it does seem to slowly consume memory. In 30 mins of running, it seems to have consumed about 15M of memory. I do not know why this is yet, and will be investigating.
msg6118 (view) Author: Gareth Doutch (gdoutch) Date: 2010-10-01.21:18:58
Alan, you got me.

number, e.g. 8888. Then your code runs as expected.

That is the kind of thing I did,sorry I didn't paste it in (I thought it
might have been something I was doing wrong.)
However, if you limit the memory using -Xmx to something like 40m you would
see an out of memory error sooner.
My applications that use it tend to run out of memory after about a day
using 128m memory limit.

> I do not know why this is yet, and will be investigating.

Let me know if and when you would like me to test anything out!

Regards,

Gareth

On 1 October 2010 20:36, Alan Kennedy <report@bugs.jython.org> wrote:

>
> Alan Kennedy <jython-dev@xhaus.com> added the comment:
>
> The attached code should not run. Instead, it should crash. On Windows, it
> does so with a socket.error, "Address already in use".
>
> The reason for this is because of the deferred way that socket objects are
> created in jython. Java has different objects for Client
> (java.net.Socket+java.nio.channels.SocketChannel) and Server
> (java.net.ServerSocket+java.nio.channels.ServerSocketChannel) objects.
> Jython cannot know whether to create a client or serever socket until some
> client or server specific behaviour is observed. For clients, this is a
> connect() call, and for servers, this is a listen() call. When a connect()
> is made, jython then knows to create a
> java.net.Socket+java.nio.channels.SocketChannel to implement the socket.
> Similarly, when a listen() call is made, jython then knows to create a
> java.net.ServerSocket+java.nio.channels.ServerSocketChannel to implement the
> socket. Any attempt to get information about a socket before either a
> connect() or listen() call is made, i.e. before the implementation socket is
> created, will fail, although attempts are made to return already set
> configuration values, if possible.
>
> But the default TCPServer code from SocketServer.py is not aware of this
> subtlety of jython sockets. The following methods illustrate the issue
> (comments snipped to make the code smaller).
>
> class TCPServer(BaseServer):
>
>    def __init__(self, server_address, RequestHandlerClass):
>        BaseServer.__init__(self, server_address, RequestHandlerClass)
>        self.socket = socket.socket(self.address_family,
>                                    self.socket_type)
>        self.server_bind()
>        self.server_activate()
>
>    def server_bind(self):
>        if self.allow_reuse_address:
>            self.socket.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR,
> 1)
>        self.socket.bind(self.server_address)
>        self.server_address = self.socket.getsockname()
>
>    def server_activate(self):
>        self.socket.listen(self.request_queue_size)
>
> Note that the server_address is set before after the bind() call, but
> before the listen() call. Meaning self.socket.getsockname() will return 0
> for the port number, which was the requested value. The actual ephemeral
> port number wll not be assigned until after the listen() call.
>
> So, in your attached code, the (host, port) tuple ("localhost", 0) will be
> returned for server_address, which is not a valid socket address for the
> client to connect to. As I mentioned, on Windows this crashes with an
> "Address already in use" exception.
>
> So, in order to get your code working, I had to assign an explicit port
> number, e.g. 8888. Then your code runs as expected.
>
> Notes:
>
> 1. The above is really a bug in the jython implementation of
> SocketServer.TCPServer, which we took directly from cpython without
> modification. Our implementation needs to be aware of the deferred creation
> issues mentioned above. I'm surprised if it runs on any platform, i.e. no
> platform should permit an attempt to connect port 0.
>
> 2. Once the code is running, by specifying an explicit port number, it does
> seem to slowly consume memory. In 30 mins of running, it seems to have
> consumed about 15M of memory. I do not know why this is yet, and will be
> investigating.
>
> ----------
> nosy: +amak
>
> _______________________________________
> Jython tracker <report@bugs.jython.org>
> <http://bugs.jython.org/issue1660>
> _______________________________________
>
msg6123 (view) Author: Jim Baker (zyasoft) Date: 2010-10-03.16:31:43
(First, I made the same change to not use port 0, since it didn't work on OS X. It's irrelevant to the memory leak regardless.)

SocketServer.py is directly pulled in from CPython. It's clear there's a programming flaw here that has been fixed in subsequent versions of CPython, when comparing 2.5 vs 2.7. So we should have this pattern of try-finally:

class BaseRequestHandler:
    def __init__(self, request, client_address, server):
        self.request = request
        self.client_address = client_address
        self.server = server
        self.setup()
        try:
            self.handle()
        finally:
            self.finish()

In 2.5, this is confused, so we have:

    def __init__(self, request, client_address, server):
        self.request = request
        self.client_address = client_address
        self.server = server
        try:
            self.setup()
            self.handle()
            self.finish()
        finally:
            sys.exc_traceback = None    # Help garbage collection

In general that last bit would be unnecessary in Jython, although it wouldn't hurt either.

However, it still seems like we have slow upwards trend in memory consumption. So we may have similar leaks occurring elsewhere in the socket support. More analysis needed!
msg6124 (view) Author: Jim Baker (zyasoft) Date: 2010-10-03.16:41:27
Ahh, this looks very interesting in socket.py, because it's not going to be threadsafe. We need to either synchronize the ref counting, or probably better yet just use an AtomicInteger. This definitely could explain the slow leak. I would work on it now, but I'm about to go to the zoo, so posting it here for now:

    def close(self):
        try:
            if self._sock:
                self.flush()
        finally:
            if self._sock:
                if isinstance(self._sock, _nonblocking_api_mixin):
                    self._sock.reference_count -= 1
                    if not self._sock.reference_count or self._close:
                        self._sock.close()
                elif self._close:
                    self._sock.close()
            self._sock = None
msg6125 (view) Author: Alan Kennedy (amak) Date: 2010-10-03.19:48:51
OK, I've fixed the ephemeral port bug in SocketServer, as well as making the try .. finally  fix that Jim pointed out. Checked in at r7131.

Will look at the reference count issue later.
msg6130 (view) Author: Jim Baker (zyasoft) Date: 2010-10-04.01:14:26
Python sockets should be thread safe, at least according to this bug http://bugs.python.org/issue1544279

I think that's as definitive as it ever gets.
msg6142 (view) Author: Alan Kennedy (amak) Date: 2010-10-04.21:04:53
OK, there is definitely a memory leak going on here. I'm seeing approx 4 megabytes being used per 1000 requests.

Although it is the case that methods on an individual socket tend to only be called from a single thread, that is not guaranteed, and sockets should be thread-safe, just in case.

I tried the using an AtomicInteger for the reference_count, as Jim suggested, but wasn't hopeful that it would solve the problem. My reasoning was that the code cited, i.e. the one without synchronization, is from the _fileobject wrapper class, which is not used in this circumstance. The close() method on socket objects *is* protected by a lock, i.e. _sock.close_lock. 

Nonetheless, I made the change, to see if it would fix the error: it didn't.

Next up, I looked at the _active_sockets set, which looks like a prime candidate for concurrency problems, because A: it is a module global and B: the add(), remove() and copy() operations carried out on it are not atomic (on jython, AFAIK). So I wrapped a synchronized.apply_synchronized() call around those operations, in the hope that this would cure the problem: it didn't :-(

In the interest of completeness, I attach a patch against trunk for the solutions I tried.
msg6143 (view) Author: Alan Kennedy (amak) Date: 2010-10-04.21:08:18
One thing to note when I implemented the reference_count as an AtomicInteger, I found that I was running into the following error, which may be Windows specific.

java.net.SocketException: No buffer space available (maximum connections reached?): connect

I have no explanation for why this error appears after proper synchonization has been introduced. I should point out that I am not the author of any of the code relating to reference_counts or the _active_sockets set.
msg6144 (view) Author: Alan Kennedy (amak) Date: 2010-10-04.21:11:53
With the amount of memory consumed growing by approx 4K per request, I'm beginning to suspect that this may be caused by some reference cycle that is preventing garbage collection, perhaps in the IO layer.

To detect something like this will require running the example code under some form of memory profiler. Which I don't have time for now, unfortunately.
msg6145 (view) Author: Jim Baker (zyasoft) Date: 2010-10-04.22:47:09
Jython's set is threadsafe, including under add, remove, and copy (although copy doesn't have any guarantees on what specifically will be copied through the underlying iteration). This is because of the underlying usage of ConcurrentHashMap.

However, any composite operation, much like += 1, will not be threadsafe, unless it's locked or using an appropriate primitive that supports it. Or if it's used in a thread confined fashion, of course it doesn't matter.

Regardless casual inspection of _active_sockets looks fine.

Next step is to do the memory analysis. I'll try to look into this tonight.
msg6146 (view) Author: Gareth Doutch (gdoutch) Date: 2010-10-05.15:37:16
I use VisualVM to watch my memory consumption. Free and Open Source, only
requires to be run on the JDK.
https://visualvm.dev.java.net/

Cheers,

Gareth

On 4 October 2010 22:11, Alan Kennedy <report@bugs.jython.org> wrote:

>
> Alan Kennedy <jython-dev@xhaus.com> added the comment:
>
> With the amount of memory consumed growing by approx 4K per request, I'm
> beginning to suspect that this may be caused by some reference cycle that is
> preventing garbage collection, perhaps in the IO layer.
>
> To detect something like this will require running the example code under
> some form of memory profiler. Which I don't have time for now,
> unfortunately.
>
> _______________________________________
> Jython tracker <report@bugs.jython.org>
> <http://bugs.jython.org/issue1660>
> _______________________________________
>
msg6147 (view) Author: Jim Baker (zyasoft) Date: 2010-10-05.18:31:53
I like jvisualvm, definitely a good tool. Based on the recommendation by Charlie Nutter from his blog, I'm now trying the Eclipse Memory Analyzer Tool (mat), which is available as a separate download. (I would like to do that, since I currently use IntelliJ, and I very much like it as my IDE.)

So here's what I have found using the leak suspects report:

Run with a 1000 connections or so, get this

One instance of "com.google.common.collect.CustomConcurrentHashMap" loaded by "sun.misc.Launcher$AppClassLoader @ 0x1089ef270" occupies 2,574,736 (25.03%) bytes. The memory is accumulated in one instance of "com.google.common.collect.CustomConcurrentHashMap$Segment[]" loaded by "sun.misc.Launcher$AppClassLoader @ 0x1089ef270".

For ~10000 connections, this is:

One instance of "com.google.common.collect.CustomConcurrentHashMap" loaded by "sun.misc.Launcher$AppClassLoader @ 0x1092d27e0" occupies 36,690,640 (82.85%) bytes. The memory is accumulated in one instance of "com.google.common.collect.CustomConcurrentHashMap$Segment[]" loaded by "sun.misc.Launcher$AppClassLoader @ 0x1092d27e0".

This the canonical map PyType#class_to_type. Now there's nothing wrong with it holding classes, but it should not be increasing in size. Watching jvisualvm, class loading stabilizes to approx 4000 classes for the duration of the run. So I think the most logical conclusion for now is that a class (presumably one of the socket classes) is somehow getting attached with this 4K chunk/connection that Alan observed.

As usual, more analysis needed, but I think we may have something we might have something we can work on now.
msg6148 (view) Author: Jim Baker (zyasoft) Date: 2010-10-05.20:20:36
Fixed in r7135

The actual problem is not in socket support, but in threads. In my previous change here, I was wrong about the Guava map in use, it's actually the canonical map org.python.modules._threading._jthread_to_pythread. The solution is to have it both weak keys AND weak values, similar to what we saw with class_to_type.

I can now run Gareth's example code in excess of 200,000 times (without the time.sleep of course, I have no patience :) for that), which is the point that the usual heap allocation of 512M would be exhausted on my laptop.

So we need to consider how to test this in regrtest. Probably the best bet is to use the SocketServer code in a subprocess with a minimal heap, run it for the equivalent of 10s or so, and verify that it exits without error.
msg6149 (view) Author: Jim Baker (zyasoft) Date: 2010-10-06.22:01:10
Test added with r7137
msg6554 (view) Author: Timothy Reaves (treaves) Date: 2011-06-16.21:03:36
Is issue 1761 a duplicate of this, or a regression?
History
Date User Action Args
2011-06-16 21:03:36treavessetnosy: + treaves
messages: + msg6554
2010-10-06 22:01:12zyasoftsetstatus: pending -> closed
messages: + msg6149
2010-10-05 20:20:37zyasoftsetstatus: open -> pending
title: Socket memory leak -> threading module memory leak
messages: + msg6148
priority: high
assignee: zyasoft
resolution: fixed
2010-10-05 18:31:55zyasoftsetmessages: + msg6147
2010-10-05 15:37:17gdoutchsetfiles: + unnamed
messages: + msg6146
2010-10-04 22:47:09zyasoftsetmessages: + msg6145
2010-10-04 21:12:07amaksetseverity: normal -> major
2010-10-04 21:11:53amaksetmessages: + msg6144
2010-10-04 21:08:18amaksetmessages: + msg6143
2010-10-04 21:04:55amaksetfiles: + socket_concurrency.patch
keywords: + patch
messages: + msg6142
2010-10-04 01:14:26zyasoftsetmessages: + msg6130
2010-10-03 19:48:51amaksetmessages: + msg6125
2010-10-03 16:41:28zyasoftsetmessages: + msg6124
2010-10-03 16:31:45zyasoftsetmessages: + msg6123
2010-10-01 21:18:58gdoutchsetfiles: + unnamed
messages: + msg6118
2010-10-01 19:36:59amaksetnosy: + amak
messages: + msg6117
2010-10-01 17:19:55zyasoftsetnosy: + zyasoft
messages: + msg6115
2010-09-30 08:18:48gdoutchcreate