Issue1744567

classification
Title: Simultaneous read & write on socket FileWrapper causes hang
Type: Severity: normal
Components: Library Versions:
Milestone:
process
Status: closed Resolution: fixed
Dependencies: Superseder:
Assigned To: amak Nosy List: amak, cgroves, pjenvey, rluse
Priority: normal Keywords:

Created on 2007-06-28.02:58:57 by rluse, last changed 2007-11-17.00:09:03 by pjenvey.

Files
File name Uploaded Description Edit Remove
Server.py rluse, 2007-06-28.02:58:57
socket_fileobject-r3478.diff pjenvey, 2007-09-17.20:37:25 integration of socket._fileobject from CPython 2.5.1
socket_fileobject3-r3591.diff pjenvey, 2007-10-14.18:59:34 socket._fileobject again, plus _socketobject and _closedsocket and fixes
socket_fileobject4-r3591.diff pjenvey, 2007-10-15.07:32:48 w/ locking guard
Messages
msg1659 (view) Author: Bob Luse (rluse) Date: 2007-06-28.02:58:57
Hi,

I have a case where code that works fine in 2.2B2 has problems in 2.2RC1.  I've included 3 modules and you should be able to recreate it fairly easily.  Just run the Server, and then run the Client GUI.  They are setup to run on the same machine, you will have to make sure the ports match up.  When I run this code on 2.2b2, I can hit the send button and data gets sent to the server and a response comes back.  When I run it on 2.2 RC2, the GUI hangs and the message doesn't appear to be sent.  I suspect this has to do with the thread that I am using in the Client (which ironically I did a lot of work to put in because Jython didn't have Select!)   

Let me know if you have any questions.

Bob
msg1660 (view) Author: Alan Kennedy (amak) Date: 2007-06-28.18:52:37
OK, I can recreate the problem.

However, I can't get your GUI client running: I get this error message.

"""
java.lang.Error: Do not use org.python.proxies.__main__$TestRemoteConnectionGUI$0.setLayout() use org.python.proxies.__main__$TestRemoteConnectionGUI$0.getContentPane().setLayout() instead
"""

I'll show how I create the problem in another message.

msg1661 (view) Author: Alan Kennedy (amak) Date: 2007-06-28.18:58:56
I use this script to recreate the problem.

#-=-=-=-=--=-=
from Client import Client

class MyCallback:

    def readyReceived(self):
        print "Callback received ready notification"

if __name__ == "__main__":
    cli = Client('localhost', 9000, MyCallback())
    cli.send()
#-=-=-=-=-=-=-=-=

When I use this script, the code hangs on trying to send to the socket.

However, the bug seems to be with the object returned from the makefile method. When I replace your Client.send() method with this method, which sends directly on the socket rather than through the madefile

#-=-=-=-=-=-=-=-=
    def send(self):
        print 'Enter Client send()'
#        self.fc.write("Send" + "\r\n")
        self.connection.send("Send" + "\r\n")
        print 'Exit Client send()'
#-=-=-=-=-=-=-=-=

The org.python.core.PyFile object returned from socket.makefile() seems to be hanging on the write call.

I'm looking into what that might hang now; more soon.
msg1662 (view) Author: Alan Kennedy (amak) Date: 2007-06-28.19:44:24
OK, I'm reasonably confident that this is a java bug, related to the simultaneous use of input and output streams on a socket created by java.nio. See these bug reports for further info

http://bugs.sun.com/bugdatabase/view_bug.do?bug_id=4509080
http://bugs.sun.com/bugdatabase/view_bug.do?bug_id=4774871

The problem is that the read and write methods of the InputStream and OutputStream returned by the java.nio.channels.Channels class are protected by a lock, which means that asocket cannot be read and written to at the same time by different threads. Which is exactly what your code is trying to do: ironic that your workaround for select breaks when select becomes available!

I think I can code around this bug, by creating PyFiles that read and write from Channels instead of [In|Out]putStreams, but I need to do more research to verify that.

In the meantime, you have a few possible solutions:

1. Use select! That's what its for; now that we've got it, why not use it :-)
2. Don't try to write and read simultaneously on the same PyFile returned from socket.makefile(). For example, insert a delay (you can see this by making your IncomingMessageHandler.run() method sleep for a second or two before it reads from the socket).
3. Directly use socket send and recv methods to talk to the socket; they are not affected by this locking problem.

Thanks for taking the time report this one. Your code made it very easy to find the problem, which could otherwise have been very difficult to track down.
msg1663 (view) Author: Bob Luse (rluse) Date: 2007-06-28.20:54:09
Hi,

I've switched back to using send and recv and it appears to be working.  I had just switched to using the makefile very recently.  

I am also having a problem using select, but I am not sure that it is not user error.  You probably do not want to use this forum for my questions, so I will send them on the dev. 

Also, was there a problem with my GUI that I sent to run the tests?  It was just standard swing.  
msg1664 (view) Author: Bob Luse (rluse) Date: 2007-06-28.22:41:21
OK - I have retrofitted the change to use the sockets directly on my entire app and everything seems to be fine.  If you noticed, the server module that I sent subclasses StreamRequestHandler from SocketServer which uses makefile.  I had to switch this to use BaseRequestHandler so that I could use the sockets directly on the server as well.  The problem on the server side only manifested itself after a few sends.  I am assuming that this is because the StreamRequestHandler code is better than mine, but I really dont know why it doesn't happen right away like it does on my client.  You may want to look at that as well.

Thanks for the quick help and I appreciate that you realized the usefulness of my example .  It took me considerable amount of time to isolate it to that level.  I realize that you are implementing a major change and also how difficult it can be to find problems like this sometimes.  

As for my GUI, the getContentPane requirement was removed from Java I think when 1.5 was introduced.  So, the message that you got would make sense if you are using 1.4.x.  It is easy to change.  For future reference I think if you would change 

         self.layout = awt.FlowLayout()   to
         self.getContentPane().layout = awt.FlowLayout()

that it would work fine.
msg1665 (view) Author: Alan Kennedy (amak) Date: 2007-07-01.19:23:52
Although we now know that this is not a jython bug per se, but caused by an underlying java bug, I am not going to close this bug until there is a way to implement makefile such that it is not affected by the java bug.

Meantime, I have documented this as a "Known Issue" on the socket wiki page; that page also lists existing workarounds.

http://wiki.python.org/jython/NewSocketModule
msg1666 (view) Author: Philip Jenvey (pjenvey) Date: 2007-09-17.20:37:25
The simpler solution for this is to not use a PyFile, which currently requires the use of Input/Outputstreams that cause this issue. CPython actually uses a file-like wrapper for makefile that wraps the socket object itself, and it's implemented in pure-python code: socket._fileobject. That way the file object continues using the nio channels for i/o that won't deadlock

I've attached a patch that uses the pure python socket._fileobject from CPython 2.5.1 (which also does buffering).

I haven't committed it because there's a small issue -- the parent socket passed to the _fileobject will disable itself when closed, which in turn disables the file object. That behavior breaks the testCloseSocketDoesNotCloseFile test. I added a quick hack to fix this, just to test that this solution works: I copy.copy the socket object passed to the fileobject. All the tests pass with that change, and the Client/Server test in this bug no longer deadlocks, but copy.copy there is probably evil and Alan might have a better way of managing the parent socket vs file object here

The other benefit of this change is it removes a usage of constructing a PyFile with java objects, which requires a nasty hack to the gexpose template for Pyfile's __new__. There's only a few more of these
File Added: socket_fileobject-r3478.diff
msg1667 (view) Author: Philip Jenvey (pjenvey) Date: 2007-10-13.22:53:11
Ok, I've implemented a better solution than doing a copy.copy (socket_fileobject2-r3591.diff):

o replace usage of PyFile in socket.makefile with the pure python file-like
socket._fileobject from CPython 2.4.4. fixes deadlocks with makefile objects,
#1744567

o kill FileWrapper and move the file vs socket close() handling to the pure
python socket._socketobject and socket._closedsocket wrappers, both adapted
from CPython 2.4.4

o move file_count down into _nonblocking_api_mixin so file objects' close()
works for UDP sockets. add a test for this

o kill socket.SocketTypes now that socket._socketobject always wraps both
_tcpsocket and _udpsocket

I think I'll wait for Alan to review this before committing
File Added: socket_fileobject2-r3591.diff
msg1668 (view) Author: Philip Jenvey (pjenvey) Date: 2007-10-13.23:17:17
File Added: socket_fileobject2-r3591.diff
msg1669 (view) Author: Philip Jenvey (pjenvey) Date: 2007-10-14.04:52:08
File Added: socket_fileobject2-r3591.diff
msg1670 (view) Author: Philip Jenvey (pjenvey) Date: 2007-10-14.18:59:34
i just refactored it a little to do actual reference counting instead of having a file_count and has_socketobject vars
File Added: socket_fileobject3-r3591.diff
msg1671 (view) Author: Charlie Groves (cgroves) Date: 2007-10-14.20:46:49
It looks like it'd be possible to get a socket object that's been closed in _socketobject.__init__ if multiple threads are dealing with the same socket.  If one thread closes the socket after another thread calls accept but before that thread calls __init__,  the reference_count won't be incremented until after it's been closed.  

I think making all the reference_count operations happen in a synchronized block before the actual reference is given out should fix this.  From what I can see, that would be incrementing the reference_count when the socket is acquired through accept and duplicated in dup, and then decrementing it in close.
msg1672 (view) Author: Philip Jenvey (pjenvey) Date: 2007-10-15.07:28:09
it's not a problem with accept because it creates a new socket object, but there's definitely a race with dup and even makefile. I've added locking for this in the latest patch.

_fileobject.close() has a similar race, but the worst thing that can happen with it is the underlying socket and its streams having close() called more than once. Subsequent close() calls are noops (and synchronized) anyway. The same thing was possible with the original FileWrapper closer

I also fixed a problem with makefile and _socketobject, they weren't handling _closedsocket situations correctly (don't assume they should always reference_count++). Added tests for that, and some basic tests for dup() and close() with dup()
File Added: socket_fileobject4-r3591.diff
msg1673 (view) Author: Philip Jenvey (pjenvey) Date: 2007-10-15.07:32:48
File Added: socket_fileobject4-r3591.diff
msg1674 (view) Author: Philip Jenvey (pjenvey) Date: 2007-11-06.20:34:24
One thing I missed in this patch is the new _socketobject doesn't provide a getchannel method like the _nonblocking_apis were providing before.

select still works because it falls back to fileno. getchannel can be added to the _socketmethods list to preserve the old behavior though
msg1675 (view) Author: Philip Jenvey (pjenvey) Date: 2007-11-10.02:43:46
This is fixed on trunk. I'm aiming to merge it to the 2.2 branch before its next release
msg1676 (view) Author: Philip Jenvey (pjenvey) Date: 2007-11-10.02:44:11
(trunk as of r3662)
msg1677 (view) Author: Philip Jenvey (pjenvey) Date: 2007-11-17.00:09:03
merged to the 2.2.x branch in r3688. the fix will be in included in Jython 2.2.2.
History
Date User Action Args
2007-06-28 02:58:57rlusecreate