Issue1291

classification
Title: select() crashes with IOException
Type: Severity: normal
Components: Core, Library Versions: 2.2.2, 2.5b1
Milestone:
process
Status: closed Resolution: fixed
Dependencies: Superseder:
Assigned To: amak Nosy List: amak, billiejoex, irmen, kfitch42, zyasoft
Priority: high Keywords:

Created on 2009-03-28.00:47:36 by irmen, last changed 2009-04-19.13:38:06 by amak.

Files
File name Uploaded Description Edit Remove
server.py irmen, 2009-03-28.00:47:35 server program that crashes
client.py irmen, 2009-03-28.00:47:52 client program to run with server
server.java amak, 2009-04-05.13:24:40 A server written in pure java that illustrates the bug.
client.java amak, 2009-04-05.13:25:25 A client written in pure java that is used to make the server crash.
Messages
msg4365 (view) Author: Irmen de Jong (irmen) Date: 2009-03-28.00:47:35
on a certain use of select(), jython crashes with an IOException:
Exception in thread Thread:Traceback (most recent call last):
  File "F:\jython25b3\Lib\threading.py", line 247, in _Thread__bootstrap
    self.run()
  File "F:\jython25b3\Lib\threading.py", line 238, in run
    self._target(*self._args, **self._kwargs)
  File "server.py", line 23, in handle
    rsock,wsock,esock=_selectfunction([csock],[],[],2)  # <-- causes the
crash
  File "F:\jython25b3\Lib\select.py", line 213, in cpython_compatible_select
    return native_select(read_fd_list, write_fd_list, outofband_fd_list,
timeout)
  File "F:\jython25b3\Lib\select.py", line 171, in native_select
    pobj = poll()
  File "F:\jython25b3\Lib\select.py", line 63, in __init__
    self.selector = java.nio.channels.Selector.open()
IOException: java.io.IOException: Unable to establish loopback connection

I am using the cpython_compatible_select function from jython's select
module.

See attached server.py and client.py that show the problem (the
server.py will crash after a short while)
It seems to be related to using
msg4366 (view) Author: Irmen de Jong (irmen) Date: 2009-03-28.00:49:55
...It seems to be related to using the select inside a thread, or on a
client socket instead of a server socket. 
If you remove the select in the handle() function in the server, it runs
fine.
msg4367 (view) Author: Alan Kennedy (amak) Date: 2009-03-28.18:15:43
Hi Irmen,

Thanks for the detailed report, and the code that reproduces the bug. I
can reliably recreate this bug now.

More soon.
msg4370 (view) Author: Alan Kennedy (amak) Date: 2009-03-28.21:38:15
OK, this is not necessarily a problem with our jython implementation:
see attached server.java, which is the direct equivalent of the jython
server.py that Irmen provided: it generates the same exception, with no
jython involved.

I'm reasonably sure that this is a Windows specific thing as well, but I
need to check on a few different platforms.

I think this may be caused by our method using
java.nio.channels.Selector, which is really meant to be created once and
used for the whole duration of the server.

The jython select.select is implemented using Selector, and creates and
destroys a Selector every single time the select.select function is called.

The best way to work around this problem would be to use our select.poll
object, in the way it is intended to be used, i.e. create one for the
whole duration of the server.

I'm going to report a bug Sun about this, but I'm not optimistic that it
will be fixed (or even looked at, going on experience from previous
similar java bugs).

So we may just have to document this as a known bug, and live with it.

I want to do some more experimentation though.
msg4371 (view) Author: Irmen de Jong (irmen) Date: 2009-03-28.22:25:59
Ok in light of this, I will commit the workaround (remove the select on
the client socket) in Pyro. I want it to work with Jython now, and not
having to wait for a jdk fix ;-)
msg4372 (view) Author: Alan Kennedy (amak) Date: 2009-03-28.23:43:27
This is alomost certainly a resource issue, on windows java.

I had a suspicion that this issue is nothing to do with threading, so I
rewrote the server to not spawn threads, but to process everything in a
single thread. The "single_threaded_server.java" file is attached.

I also modified the client to count the number of iterations it went
through (counting_client.py attached).

The exception occurs at (almost) exactly the same point each time, after
between 3945 and 3955 iterations on my Windows Server 2003 box, with the
following exception, which seems to indicate that buffer space has been
exhausted.

Exception: java.io.IOException: Unable to establish loopback connection
Exception in thread "main" java.io.IOException: Unable to establish
loopback connection
        at sun.nio.ch.PipeImpl$Initializer.run(PipeImpl.java:106)
        at java.security.AccessController.doPrivileged(Native Method)
        at sun.nio.ch.PipeImpl.<init>(PipeImpl.java:122)
        at
sun.nio.ch.SelectorProviderImpl.openPipe(SelectorProviderImpl.java:27)
        at java.nio.channels.Pipe.open(Pipe.java:133)
        at
sun.nio.ch.WindowsSelectorImpl.<init>(WindowsSelectorImpl.java:106)
        at
sun.nio.ch.WindowsSelectorProvider.openSelector(WindowsSelectorProvider.java:26)
        at java.nio.channels.Selector.open(Selector.java:209)
        at single_threaded_server.main(single_threaded_server.java:69)
Caused by: java.net.SocketException: No buffer space available (maximum
connections reached?): connect
        at sun.nio.ch.Net.connect(Native Method)
        at sun.nio.ch.SocketChannelImpl.connect(SocketChannelImpl.java:464)
        at java.nio.channels.SocketChannel.open(SocketChannel.java:146)
        at sun.nio.ch.PipeImpl$Initializer.run(PipeImpl.java:78)
        ... 8 more
msg4373 (view) Author: Alan Kennedy (amak) Date: 2009-03-28.23:44:18
Uploading the counting client.
msg4374 (view) Author: Alan Kennedy (amak) Date: 2009-03-28.23:45:26
I'd be interested in reports of how many iterations before failure on
other versions of windows, e.g. vista, XP, 2008, etc.
msg4375 (view) Author: Alan Kennedy (amak) Date: 2009-03-28.23:45:52
Lastly, I'd like to know if anyone can reproduce this issue on any other
platform besides windows.
msg4377 (view) Author: Irmen de Jong (irmen) Date: 2009-03-29.00:05:46
Tried the single threaded server and counting client on my Mac (OSX 10.4).  
Server crashes with java.io.IOException: Too many open files   after the 
client reached about 3400.
msg4378 (view) Author: Alan Kennedy (amak) Date: 2009-03-29.15:59:47
I was surprised by the MacOS result.

So I took a closer look at the java code I posted, and realised that
there was a glaring omission: the selector's in that server code were
never closed.

So I updated the code to properly close all the selectors: the new
"closing_single_threaded_server.java" is attached.

And the code still reliably reproduces the bug. Except that the reported
exception is different now. The new code reports "Address already in
use: connect"; traceback given below.

In an effort to try and find ways to prevent the bug from happening, I
also added to two options, as follows

1. Cancel keys. Cancelling a SelectionKey removes it from the "interest
set" of a Selector. I added this option to cancel keys to see if it made
any difference to the exception. It didn't.

2. Reselect. In the early days of java.nio, there was some community
advice to do another select on a Selector before closing it, after
cancelling all the keys, to ensure that all SelectionKey cancellation
would have taken effect before a close operation was attempted. (this
was never official advice from Sun however). See the javadocs for more
information about closing selectors.

http://java.sun.com/j2se/1.5.0/docs/api/java/nio/channels/Selector.html

But neither of the options makes any difference. Cancelling the keys
makes no difference. Reselecting makes no difference. The IOException
still occurs in all cases.

As before, I'd be interested to see the behaviour of the
"closing_single_threaded_server" on different platforms.

Here is the new traceback.

Exception in thread "main" java.io.IOException: Unable to establish
loopback connection
        at sun.nio.ch.PipeImpl$Initializer.run(PipeImpl.java:106)
        at java.security.AccessController.doPrivileged(Native Method)
        at sun.nio.ch.PipeImpl.<init>(PipeImpl.java:122)
        at
sun.nio.ch.SelectorProviderImpl.openPipe(SelectorProviderImpl.java:27)
        at java.nio.channels.Pipe.open(Pipe.java:133)
        at
sun.nio.ch.WindowsSelectorImpl.<init>(WindowsSelectorImpl.java:106)
        at
sun.nio.ch.WindowsSelectorProvider.openSelector(WindowsSelectorProvider.java:26)
        at java.nio.channels.Selector.open(Selector.java:209)
        at
closing_single_threaded_server$handler.process(closing_single_threaded_server.java:40)
        at
closing_single_threaded_server.main(closing_single_threaded_server.java:90)
Caused by: java.net.BindException: Address already in use: connect
        at sun.nio.ch.Net.connect(Native Method)
        at sun.nio.ch.SocketChannelImpl.connect(SocketChannelImpl.java:464)
        at java.nio.channels.SocketChannel.open(SocketChannel.java:146)
        at sun.nio.ch.PipeImpl$Initializer.run(PipeImpl.java:78)
        ... 9 more
msg4431 (view) Author: Jim Baker (zyasoft) Date: 2009-04-04.02:52:37
This *may* be a blocker for RC, so marking as high
msg4438 (view) Author: Alan Kennedy (amak) Date: 2009-04-04.15:22:10
This looks very likely to be a java bug, on windows. If it is, then we
can't fix it.

So we can't let it hold up the 2.5 release; we can only report it to
Sun, and document it as a "known issue", and recommend practices to
workaround it.

I need to investigate if it occurs on various JVMs versions. If it is
fixed on Java 6, for example, then we can at least recommend to users
who encounter the bug to use the bug free one.
msg4440 (view) Author: Alan Kennedy (amak) Date: 2009-04-05.13:24:40
Uploading pure java version of the server.
msg4441 (view) Author: Alan Kennedy (amak) Date: 2009-04-05.13:25:25
Uploading java version of the client.
msg4442 (view) Author: Alan Kennedy (amak) Date: 2009-04-05.13:32:36
Using the attached pure java server and client, I can reliably reproduce
the bug on JDK 1.5.0_17, on my Windows Server 2003 box.

After approx 8000 iterations of the client, the server crashes with the
following exception.

Exception in thread "main" java.io.IOException: Unable to establish
loopback connection
        at sun.nio.ch.PipeImpl$Initializer.run(PipeImpl.java:106)
        at java.security.AccessController.doPrivileged(Native Method)
        at sun.nio.ch.PipeImpl.<init>(PipeImpl.java:122)
        at
sun.nio.ch.SelectorProviderImpl.openPipe(SelectorProviderImpl.java:27)
        at java.nio.channels.Pipe.open(Pipe.java:133)
        at
sun.nio.ch.WindowsSelectorImpl.<init>(WindowsSelectorImpl.java:106)
        at
sun.nio.ch.WindowsSelectorProvider.openSelector(WindowsSelectorProvider.java:26)
        at java.nio.channels.Selector.open(Selector.java:209)
        at server$handler.process(server.java:42)
        at server.main(server.java:91)
Caused by: java.net.BindException: Address already in use: connect
        at sun.nio.ch.Net.connect(Native Method)
        at sun.nio.ch.SocketChannelImpl.connect(SocketChannelImpl.java:464)
        at java.nio.channels.SocketChannel.open(SocketChannel.java:146)
        at sun.nio.ch.PipeImpl$Initializer.run(PipeImpl.java:78)
        ... 9 more

After the crash, when I issue the command "netstat -ano", I get
thousands of sockets, in the state "TIME_WAIT", and pid of 0:

  Proto  Local Address          Foreign Address        State           PID
  TCP    127.0.0.1:1040         127.0.0.1:1052         TIME_WAIT       0
  TCP    127.0.0.1:1042         127.0.0.1:1044         TIME_WAIT       0
  TCP    127.0.0.1:1044         127.0.0.1:1050         TIME_WAIT       0
  TCP    127.0.0.1:1047         127.0.0.1:1050         TIME_WAIT       0
  TCP    127.0.0.1:1048         127.0.0.1:1050         TIME_WAIT       0
  TCP    127.0.0.1:1050         127.0.0.1:1052         TIME_WAIT       0
  TCP    127.0.0.1:1051         127.0.0.1:1052         TIME_WAIT       0
  TCP    127.0.0.1:1052         127.0.0.1:1076         TIME_WAIT       0
  TCP    127.0.0.1:1055         127.0.0.1:1059         TIME_WAIT       0
  TCP    127.0.0.1:1057         127.0.0.1:1059         TIME_WAIT       0
msg4443 (view) Author: Alan Kennedy (amak) Date: 2009-04-05.13:35:52
Using the attached server and client, I can reliably reproduce the bug
on jdk 1.6.0_13.

The symptoms are exactly the same as on JDK 1.5, an "Unable to establish
loopback connection", caused by "Address already in use: connect".

Again, a huge list of sockets in the TIME_WAIT state with a PID of 0 is
listed by "netstat -ano".
msg4445 (view) Author: Alan Kennedy (amak) Date: 2009-04-05.13:58:10
OK, I've reported this bug in java to Sun.

Though I'm not hopeful; they completely ignored my bug report the last
time I submitted one, with full source code to reliably reproduce.

http://bugs.jython.org/issue1018
msg4446 (view) Author: Alan Kennedy (amak) Date: 2009-04-05.14:39:45
I have documented this on the socket module wiki as a "known issue".

http://wiki.python.org/jython/NewSocketModule#Multipleuseofselect.selectresultinanExceptiononWindows
msg4449 (view) Author: Giampaolo Rodola' (billiejoex) Date: 2009-04-05.19:07:58
> The best workaround in this case 
> is to use select.poll() objects 
> instead of the select.select() function. 

This is incorrect since poll() doesn't exist on Windows.
msg4450 (view) Author: Alan Kennedy (amak) Date: 2009-04-05.20:18:18
[Giampaolo]
> This is incorrect since poll() doesn't exist on Windows.

Perhaps the poster has misunderstood. java.nio.channels.Selector is the
only class available on the JVM that permits multiplexing sockets.

java.nio.channels.Selector, the java object, is currently defective on
Windows, because of the bug illustrated.

Jython software using this class will be similarly affected on Windows,
i.e. the jython select module.

There may be a deeper relationship between the facts that 

A. java.nio.channels.Selector has problems on Windows
B. Cpython's select module on Windows does not have a poll object.

But in terms of jython implementation, we're restricted to java classes,
unless we want to go to JNI, which could a very complex undertaking,
that the cpython folks avoided.

I'm beginning to regret supporting the select.select function. It's an
unnatural way to use the java APIs. It seem like a neat solution to the
problem at the time, but I now think goes against the grain of the
intention of the java APIs, especially given the common use cases that
are appearing from cpython.

I think now perhaps that it would have been better to simply support
select.poll objects, to make it clear to the user that select semantics
on jython are different from cpython, and have to be thought of on java
terms.
msg4451 (view) Author: Giampaolo Rodola' (billiejoex) Date: 2009-04-05.23:14:54
> There may be a deeper relationship between the facts that 
> A. java.nio.channels.Selector has problems on Windows
> B. Cpython's select module on Windows does not have a poll object.

It's not that cPython avoided to add Windows poll() support.
poll() system call is *UNIX only*.

> I'm beginning to regret supporting the select.select function. 

Having select() support is good for Windows users since it's often the
only chance they have for doing socket multiplexing.
msg4454 (view) Author: Alan Kennedy (amak) Date: 2009-04-06.09:33:42
[Giampaolo]
> It's not that cPython avoided to add Windows poll() support.
> poll() system call is *UNIX only*.

The poster is confused. The statement "poll() system call is *UNIX
only*" applies to cpython, *not* to jython. Jython *has* a poll object
on Windows, and it works.

[Giampaolo]
> Having select() support is good for Windows users since it's often the
> only chance they have for doing socket multiplexing.

The poster has not been reading what I've been writing. The statement
above applies only to cpython, not to jython.

By necessity, jython must use java.nio.channels.Selector to implement
the select function. But this usage results in an "unnatural" usage of
the java API, and gives rise to the problem being discussed in this thread.

I refer you to the code for the actual implementation of the jython
select module.

http://fisheye3.atlassian.com/browse/~raw,r=trunk/jython/trunk/jython/Lib/select.py

If the poster has better implementation options, patches are welcome.
Further discussion based on confusing cpython with jython will serve no
purpose.

http://www.jython.org/Project/userfaq.html
msg4485 (view) Author: Kevin Fitch (kfitch42) Date: 2009-04-08.12:35:41
From what I can tell, the underlying problem here is that there is an
implicit leak in the java Selector object related to lookback
connections, that it uses for whatever reason. So perhaps the best
workaround would be to introduce our own 'leak' that will cover up the
java one :)

What if we had a poll pool (say that 10 times fast :)? The naive
implementation would make sure we never created more poll objects (and
hence selectors) than the maximum number of simultaneous calls to
select, instead of creating as many as there are calls to select.

The only outstanding question would  be: when do we clean the pool?
Perhaps it would be good enough to just let things accumulate. In
'normal' applications (at least in my definition of normal) we would
only end up with a small handful of poll objects that live forever in
the pool.

A quick and dirty pool implementation (I haven't run/tested this yet):
class _selectPollPool():
    _pollPool=[]
    _mtx = threading.mutex()
    
    @staticmethod
    def getPollObject():
        with _mtx:
            if _pollPool:
                return _pollPool.pop()
            else:
                return poll()

    def putPollObject(pobj)
        pobj.close()
        with _mtx:
            _pollPool.append(pobj)

We could then just replace the 'pobj = poll()' line in native_select
with 'pobj = _selectPollPool.getPollObject()' and the pobj.close() in
the finally block with _selectPollPool.putPollObject(pobj)
msg4490 (view) Author: Kevin Fitch (kfitch42) Date: 2009-04-08.20:27:50
I have been unable to reproduce this bug under Linux (using the java or
the jython code). It is definitely Windows (and may OSX) specific.
msg4491 (view) Author: Giampaolo Rodola' (billiejoex) Date: 2009-04-09.02:13:40
Sorry for my lack of Jython experience.
As far as I can tell I've just tried to run pyftpdlib [1] test suite by
using select.poll() instead of select.select() but I get the same
"Unable to establish loopback connection" error (Windows XP SP3, Jython
2.5b3).
msg4495 (view) Author: Alan Kennedy (amak) Date: 2009-04-09.12:16:17
[Kevin]
> From what I can tell, the underlying problem here is that there is an
> implicit leak in the java Selector object related to lookback
> connections, that it uses for whatever reason. So perhaps the best
> workaround would be to introduce our own 'leak' that will cover up the
> java one :)
> 
> What if we had a poll pool (say that 10 times fast :)? The naive
> implementation would make sure we never created more poll objects (and
> hence selectors) than the maximum number of simultaneous calls to
> select, instead of creating as many as there are calls to select.

That's a (great idea)*10 :-)

I'm not 100% convinced it will work, because we won't be able to close()
the poll objects in the pool, which might create problems with
outstanding references to SocketChannels. This is because they cannot be
re-opened: there is no open() method on Selectors: they can only be
created with the static factory method
java.nio.channels.Selector.open(), which returns them already opened.

http://java.sun.com/j2se/1.5.0/docs/api/java/nio/channels/Selector.html#open()

But it's the best idea proposed so far to solve this problem.

I'm working on a patch now; more later today.
msg4497 (view) Author: Kevin Fitch (kfitch42) Date: 2009-04-09.19:15:30
I can't seem to reproduce this. I tried out the jython (using 2.5b3)
version on Win XP Pro SP3 (32-bit) with Sun java 1.6.0_04 I gave up when
it got to around 250000 iterations.

The java one is still running, but is at 69000 right now, with no probs.
msg4500 (view) Author: Alan Kennedy (amak) Date: 2009-04-11.11:15:31
I've checked in an implementation of a cache for poll objects, so that
poll objects/Selectors are re-used as much as possible. It is currently
only used on MS Windows.

It seems stable enough. I ran 6 of the counting clients against it
yesterday for a few hours; they accumulated over 7,000,000 iterations
between them, with no problems at all.

So, please can everyone who has been having problems with this bug try
out the new code, which was checked in at r6208.

http://fisheye3.atlassian.com/browse/jython/trunk/jython/Lib/select.py
msg4528 (view) Author: Alan Kennedy (amak) Date: 2009-04-17.16:28:30
Ping.

Has anyone been able to reproduce the bug with the new patched version
of the select module?

If not, I'll consider this bug fixed.
msg4530 (view) Author: Giampaolo Rodola' (billiejoex) Date: 2009-04-17.17:39:25
I tried pyftpdlib test suite on Windows by using both select() and
poll() and I don't get any "IOException: java.io.IOException:" exception
now.
msg4540 (view) Author: Alan Kennedy (amak) Date: 2009-04-19.13:38:06
Two successful reports that this fix as eliminated the exception, one on
this bug (from billiejoex) and one from kfitch42 on this bug

http://bugs.jython.org/issue1304

Therefore, closing this bug.

Thanks to Kevin Fitch for the suggestion of the eventual solution.
History
Date User Action Args
2009-04-19 13:38:06amaksetstatus: open -> closed
resolution: fixed
messages: + msg4540
2009-04-17 17:39:25billiejoexsetmessages: + msg4530
2009-04-17 16:28:30amaksetmessages: + msg4528
2009-04-11 11:15:32amaksetmessages: + msg4500
2009-04-09 19:15:31kfitch42setmessages: + msg4497
2009-04-09 12:16:19amaksetmessages: + msg4495
2009-04-09 02:13:41billiejoexsetmessages: + msg4491
2009-04-08 20:27:51kfitch42setmessages: + msg4490
2009-04-08 12:35:41kfitch42setmessages: + msg4485
2009-04-07 21:42:18kfitch42setnosy: + kfitch42
2009-04-06 09:33:43amaksetmessages: + msg4454
2009-04-05 23:14:55billiejoexsetmessages: + msg4451
2009-04-05 20:18:19amaksetmessages: + msg4450
2009-04-05 19:07:59billiejoexsetmessages: + msg4449
2009-04-05 14:39:45amaksetmessages: + msg4446
2009-04-05 13:58:10amaksetmessages: + msg4445
2009-04-05 13:35:52amaksetmessages: + msg4443
2009-04-05 13:32:37amaksetmessages: + msg4442
2009-04-05 13:25:25amaksetfiles: + client.java
messages: + msg4441
2009-04-05 13:24:41amaksetfiles: + server.java
messages: + msg4440
2009-04-05 13:22:22amaksetfiles: - closing_single_threaded_server.java
2009-04-05 13:22:16amaksetfiles: - counting_client.py
2009-04-05 13:22:10amaksetfiles: - single_threaded_server.java
2009-04-05 13:22:04amaksetfiles: - server.java
2009-04-04 15:22:10amaksetmessages: + msg4438
2009-04-04 02:52:37zyasoftsetpriority: high
nosy: + zyasoft
messages: + msg4431
2009-04-03 16:38:41billiejoexsetnosy: + billiejoex
2009-03-29 15:59:48amaksetfiles: + closing_single_threaded_server.java
messages: + msg4378
2009-03-29 00:05:46irmensetmessages: + msg4377
2009-03-28 23:45:52amaksetmessages: + msg4375
2009-03-28 23:45:26amaksetmessages: + msg4374
2009-03-28 23:44:18amaksetfiles: + counting_client.py
messages: + msg4373
2009-03-28 23:43:29amaksetfiles: + single_threaded_server.java
messages: + msg4372
2009-03-28 22:25:59irmensetmessages: + msg4371
2009-03-28 21:38:17amaksetfiles: + server.java
messages: + msg4370
2009-03-28 18:15:44amaksetassignee: amak
messages: + msg4367
nosy: + amak
2009-03-28 00:49:56irmensetmessages: + msg4366
2009-03-28 00:47:52irmensetfiles: + client.py
2009-03-28 00:47:36irmencreate