Issue 2550: SO_REUSEADDR doesn't have the same semantics on Windows as on Unix

Created on 2008-04-04 15:57 by trent, last changed 2022-04-11 14:56 by admin. This issue is now closed.

Files
File name	Uploaded	Description
test_socket.py.patch	trent, 2008-04-04 15:57	Patch to trunk/Lib/test/test_socket.py
trunk.2550.patch	trent, 2008-04-06 21:24
trunk.2550-2.patch	trent, 2008-04-08 11:49

Messages (12)
msg64933 - (view)	Author: Trent Nelson (trent) *	Date: 2008-04-04 15:57
Background: I came across this issue when trying to track down why test_asynchat would periodically wedge python processes on the Windows buildbots, to the point that they wouldn't even respond to SIGKILL (or ctrl-c on the console). What I found after a bit of digging is that Windows doesn't raise EADDRINUSE socket.errors when you bind() two sockets to identical host/ports IFF SO_REUSEADDR has been set as a socket option.

Decided to brighten up my tube journey into work this morning by reading the Gospel's take on the situation. As per the 'SO_REUSEADDR and SO_REUSEPORT Socket Options' section in chapter 7.5 of Stevens' UNIX Network Programming Volume 1 (2nd Ed):

"With TCP, we are never able to start multiple servers that bind the same IP address and same port: a completely duplicate binding. That is, we cannot start one server that binds 198.69.10.2 port 80 and start another that also binds 198.69.10.2 port 80, even if we set the SO_REUSEADDR socket option for the second server."

So, it seems at least Windows isn't adhering to this, at least on XP and Server 2008 with 2.5-2.6. I've patched test_socket.py to explicitly test for this situation -- as expected, it passes on Unix (tested on FreeBSD in particular), and fails on Windows. I'd like to commit this to trunk to see if any of the buildbots for different platforms match the behaviour of Windows.
msg65050 - (view)	Author: Trent Nelson (trent) *	Date: 2008-04-06 21:20
[Updating the issue with relevant mailing list conversation]

Interesting results! I committed the patch to test_socket.py in r62152. I was expecting all other platforms except for Windows to behave consistently (i.e. pass). That is, given the following:

    import socket
    host = '127.0.0.1'
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    sock.bind((host, 0))
    port = sock.getsockname()[1]
    sock.close()
    del sock

    sock1 = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    sock1.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    sock1.bind((host, port))
    sock2 = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    sock2.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    sock2.bind((host, port))
    ^^^^

....the second bind should fail with EADDRINUSE, at least according to the 'SO_REUSEADDR and SO_REUSEPORT Socket Options' section in chapter 7.5 of Stevens' UNIX Network Programming Volume 1 (2nd Ed):

"With TCP, we are never able to start multiple servers that bind the same IP address and same port: a completely duplicate binding. That is, we cannot start one server that binds 198.69.10.2 port 80 and start another that also binds 198.69.10.2 port 80, even if we set the SO_REUSEADDR socket option for the second server."

The results: both Windows and Linux fail the patched test; none of the buildbots for either platform encountered an EADDRINUSE socket.error after the second bind(). FreeBSD, OS X, Solaris and Tru64 pass the test -- EADDRINUSE is raised on the second bind. (Interesting that all the ones that passed have a BSD lineage.)

I've just reverted the test in r62156 as planned.

The real issue now is that there are tests that are calling test_support.bind_port() with the assumption that the port returned by this method is 'unbound', when in fact, the current implementation can't guarantee this:

    def bind_port(sock, host='', preferred_port=54321):
        for port in [preferred_port, 9907, 10243, 32999, 0]:
            try:
                sock.bind((host, port))
                if port == 0:
                    port = sock.getsockname()[1]
                return port
            except socket.error, (err, msg):
                if err != errno.EADDRINUSE:
                    raise
                print >>sys.__stderr__, \
                    ' WARNING: failed to listen on port %d, trying another' % port

This logic is only correct for platforms other than Windows and Linux. I haven't looked into all the networking test cases that rely on bind_port(), but I would think an implementation such as this would be much more reliable than what we've got for returning an unused port:

    def bind_port(sock, host='127.0.0.1', *args):
        s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
        s.bind((host, 0))
        port = s.getsockname()[1]
        s.close()
        del s

        sock.bind((host, port))
        return port

Actually, FWIW, I just ran a full regrtest.py against trunk on Win32 with this change in place and all the tests still pass.

Thoughts?

Trent.
msg65051 - (view)	Author: Trent Nelson (trent) *	Date: 2008-04-06 21:21
[Updating issue with mailing list discussion; Jean-Paul's reply]

On Fri, 4 Apr 2008 13:24:49 -0700, Trent Nelson <tnelson@onresolve.com> wrote:
>Interesting results! I committed the patch to test_socket.py in r62152. I was expecting all other platforms except for Windows to behave consistently (i.e. pass). That is, given the following:
>
>    import socket
>    host = '127.0.0.1'
>    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
>    sock.bind((host, 0))
>    port = sock.getsockname()[1]
>    sock.close()
>    del sock
>
>    sock1 = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
>    sock1.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
>    sock1.bind((host, port))
>    sock2 = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
>    sock2.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
>    sock2.bind((host, port))
>    ^^^^
>
>....the second bind should fail with EADDRINUSE, at least according to the 'SO_REUSEADDR and SO_REUSEPORT Socket Options' section in chapter 7.5 of Stevens' UNIX Network Programming Volume 1 (2nd Ed):
>
>"With TCP, we are never able to start multiple servers that bind
> the same IP address and same port: a completely duplicate binding.
> That is, we cannot start one server that binds 198.69.10.2 port 80
> and start another that also binds 198.69.10.2 port 80, even if we
> set the SO_REUSEADDR socket option for the second server."
>
>The results: both Windows and Linux fail the patched test; none of the buildbots for either platform encountered an EADDRINUSE socket.error after the second bind(). FreeBSD, OS X, Solaris and Tru64 pass the test -- EADDRINUSE is raised on the second bind. (Interesting that all the ones that passed have a BSD lineage.)

Notice that the quoted text explains that you cannot start multiple servers that etc. Since you didn't call listen on either socket, it's arguable that you didn't start any servers, so there should be no surprise regarding the behavior. Try adding listen calls at various places in the example and you'll see something different happen.

FWIW, AIUI, SO_REUSEADDR behaves just as described in the above quote on Linux/BSD/UNIX/etc. On Windows, however, that option actually means something quite different. It means that the address should be stolen from any process which happens to be using it at the moment.

There is another option, SO_EXCLUSIVEADDRUSE, only on Windows I think, which, AIUI, makes it impossible for another process to steal the port using SO_REUSEADDR.

Hope this helps,

Jean-Paul
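
The experiment suggested above (adding listen() calls to the example) can be sketched as follows. This is a minimal illustration written for Python 3, unlike the Python 2 snippets in the thread; the helper name try_duplicate_bind and its parameters are made up for this sketch. It deliberately prints whatever the platform does instead of asserting a particular result, because the outcome with SO_REUSEADDR set is exactly what differs between systems.

```python
import errno
import socket


def try_duplicate_bind(listen_first, reuse=True, host="127.0.0.1"):
    """Bind two TCP sockets to the same (host, port).

    Returns None if the second bind() succeeds, otherwise the errno
    carried by the OSError it raised.
    """
    sock1 = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    sock2 = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    try:
        if reuse:
            sock1.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
            sock2.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
        sock1.bind((host, 0))  # port 0: let the OS pick an ephemeral port
        port = sock1.getsockname()[1]
        if listen_first:
            sock1.listen(1)  # now sock1 really is a "server"
        try:
            sock2.bind((host, port))
        except OSError as exc:
            return exc.errno
        return None
    finally:
        sock1.close()
        sock2.close()


if __name__ == "__main__":
    for listen_first in (False, True):
        result = try_duplicate_bind(listen_first)
        if result is None:
            outcome = "succeeded"
        else:
            outcome = "failed with %s" % errno.errorcode.get(result, result)
        print("listen() before second bind: %s -> second bind %s"
              % (listen_first, outcome))
```

Passing reuse=False gives a baseline without SO_REUSEADDR on either socket, which is useful for comparing against the two SO_REUSEADDR runs on a given platform.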
msg65052 - (view)	Author: Trent Nelson (trent) *	Date: 2008-04-06 21:21
[Updating issue with mailing list discussion; my reply to Jean-Paul]

> >"With TCP, we are never able to start multiple servers that bind
> > the same IP address and same port: a completely duplicate binding.
> > That is, we cannot start one server that binds 198.69.10.2 port 80
> > and start another that also binds 198.69.10.2 port 80, even if we
> > set the SO_REUSEADDR socket option for the second server."

> Notice that the quoted text explains that you cannot start multiple
> servers that etc. Since you didn't call listen on either socket, it's
> arguable that you didn't start any servers, so there should be no
> surprise regarding the behavior. Try adding listen calls at various
> places in the example and you'll see something different happen.

I agree in principle, Stevens says nothing about what happens if you do try and bind two sockets on two identical host/port addresses. Even so, test_support.bind_port() makes an assumption that bind() will raise EADDRINUSE if the port is not available, which, as has been demonstrated, won't be the case on Windows or Linux.

> FWIW, AIUI, SO_REUSEADDR behaves just as described in the above quote
> on Linux/BSD/UNIX/etc. On Windows, however, that option actually means
> something quite different. It means that the address should be stolen
> from any process which happens to be using it at the moment.

Probably explains why the python process wedges when this happens on Windows...

> There is another option, SO_EXCLUSIVEADDRUSE, only on Windows I think,
> which, AIUI, makes it impossible for another process to steal the port
> using SO_REUSEADDR.

Nod, if SO_EXCLUSIVEADDRUSE is used instead in the code I posted, Windows raises EADDRINUSE on the second bind().

I don't have access to any Linux boxes at the moment, so I can't test what sort of error is raised with the example I posted if listen() and accept() are called on the two sockets bound to identical addresses. Can anyone else shed some light on this? I'd be interested in knowing if the process wedges on Linux as badly as it does on Windows (to the point where it's not respecting ctrl-c or sigkill).

Trent.
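
The SO_EXCLUSIVEADDRUSE variant mentioned above can be sketched portably as below. This is an illustration in Python 3, not code from any of the attached patches; make_exclusive_socket and second_bind_errno are hypothetical names. The option is looked up with getattr() because the socket module only defines it on Windows; on other platforms the sketch simply leaves SO_REUSEADDR unset.

```python
import errno
import socket


def make_exclusive_socket():
    """Create a TCP socket that should refuse to share its address.

    On Windows, SO_EXCLUSIVEADDRUSE is set.  Elsewhere the constant does
    not exist, so the socket keeps its defaults (no SO_REUSEADDR).
    """
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    exclusive = getattr(socket, "SO_EXCLUSIVEADDRUSE", None)
    if exclusive is not None:
        sock.setsockopt(socket.SOL_SOCKET, exclusive, 1)
    return sock


def second_bind_errno(host="127.0.0.1"):
    """Bind two such sockets to one address.

    Returns the errno of the second bind() failure, or None if the
    second bind() unexpectedly succeeded.
    """
    first = make_exclusive_socket()
    second = make_exclusive_socket()
    try:
        first.bind((host, 0))  # OS-assigned ephemeral port
        port = first.getsockname()[1]
        try:
            second.bind((host, port))
        except OSError as exc:
            return exc.errno
        return None
    finally:
        first.close()
        second.close()


if __name__ == "__main__":
    result = second_bind_errno()
    print("second bind:", "succeeded" if result is None
          else errno.errorcode.get(result, result))
```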
msg65054 - (view)	Author: Trent Nelson (trent) *	Date: 2008-04-06 21:24
I've attached another patch that fixes test_support.bind_port() as well as a bunch of files that used that method. The new implementation always uses an ephemeral port in order to elicit an unused port for subsequent binding. Tested on Windows 32-bit & x64 and FreeBSD 6.2. Would like to apply sooner rather than later unless anyone has any objections, as it'll stop my two Windows buildbots, which are on the same machine, from both hanging if they test asynchat at the same time (which happens more often than you'd think).
msg65055 - (view)	Author: Neal Norwitz (nnorwitz) *	Date: 2008-04-06 22:04
Trent, go ahead and try this out. We should definitely be moving in this direction. So I'd rather fix the problem than keep suffering with the current problems of not being able to run the test suite concurrently. I think bind_port might be documented, so you should update the docs if so. Also, please add a Misc/NEWS entry.
msg65075 - (view)	Author: Guido van Rossum (gvanrossum) *	Date: 2008-04-07 15:10
I don't like that the patch changes the API of a function in test_support (in particular changing the return type; adding optional arguments is not a problem). This could trip up 3rd party users of this API. I recommend creating a new API bind_host_and_port() (or whatever you'd like to name it) and implementing the original API in terms of the new one. (You can even add a warning if you think the original API is always unsafe.)
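
The compatibility approach recommended above can be sketched roughly as follows, in Python 3. Everything here is an assumption made for illustration: the thread does not show what bind_host_and_port() would return, so the sketch guesses a (host, port) tuple, and it guesses that the old preferred_port argument would be accepted but ignored. The warning category and message are likewise invented.

```python
import socket
import warnings


def bind_host_and_port(sock, host="127.0.0.1"):
    """Hypothetical new API: bind to an OS-assigned port.

    Returns a (host, port) tuple (an assumed return type).
    """
    sock.bind((host, 0))
    return host, sock.getsockname()[1]


def bind_port(sock, host="", preferred_port=54321):
    """Original API, kept so existing callers do not break.

    Same signature and same return type (just the port number) as
    before.  preferred_port is accepted but ignored in this sketch.
    """
    warnings.warn(
        "bind_port() is kept for compatibility; "
        "consider bind_host_and_port()",
        DeprecationWarning,
        stacklevel=2,
    )
    _, port = bind_host_and_port(sock, host)
    return port
```

Third-party code that calls bind_port(sock) and expects an integer keeps working, while new tests can move to the new function at their own pace.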
msg65077 - (view)	Author: Trent Nelson (trent) *	Date: 2008-04-07 16:03
To be honest, I wasn't really happy either with having to return HOST, it's somewhat redundant given that all these tests should be binding against localhost. What about something like this for bind_port():

    def bind_port(sock, host=''):
        """Bind the socket to a free port and return the port number.
        Relies on ephemeral ports in order to ensure we are using an
        unbound port. This is important as many tests may be running
        simultaneously, especially in a buildbot environment."""

        # Use a temporary socket object to ensure we're not
        # affected by any socket options that have already
        # been set on the 'sock' object we're passed.
        tempsock = socket.socket(sock.family, sock.type)
        tempsock.bind((host, 0))
        port = tempsock.getsockname()[1]
        tempsock.close()
        del tempsock

        sock.bind((host, port))
        return port

The tests would then look something like:

    HOST = 'localhost'
    PORT = None

    class Foo(TestCase):
        def setUp(self):
            sock = socket.socket()

            global PORT
            PORT = test_support.bind_port(sock, HOST)

So, the return value is the port bound to, no change there, but we're abolishing preferred_port as an optional argument, which is important, IMO, as none of these tests should be stipulating which port they want to listen on. That's actually the root of this entire problem.
msg65078 - (view)	Author: Guido van Rossum (gvanrossum) *	Date: 2008-04-07 17:20
Thanks, that's much better (though I'm not the authority on all details of this patch).
msg65155 - (view)	Author: Trent Nelson (trent) *	Date: 2008-04-08 11:49
Invested quite a few cycles on this issue last night. The more time I spent on it, the more I became convinced that every single test working with sockets should be changed in one fell swoop in order to facilitate (virtually unlimited) parallel test execution without fear of port conflicts.

I've attached a second patch, trunk.2550-2.patch, which is my progress so far on doing just this. The main changes can be expressed by the following two points:

a) do whatever it takes in network-oriented tests to ensure unique ports are obtained (relying on the bind_port() and find_unused_port() methods exposed by test_support)

b) never, ever, ever set SO_REUSEADDR on a socket from a test; because we're putting so much effort into obtaining a unique port, this should never be necessary -- in the rare cases that our attempts to obtain a unique port fail, then we absolutely should fail with EADDRINUSE, as the ability to obtain a unique port for the duration of a client/server test is an invariant that we must be able to depend upon. If the invariant is broken, fail immediately (don't mask the problem with SO_REUSEADDR).

With this patch applied, I can spawn a handful of Python processes and run the entire test suite (without -r, ensuring all tests are run in the same order, which should encourage port conflicts (if there were any)) without any errors. Doing that now is completely and utterly impossible.

Well, almost without error: all the I/O related tests that try and open @test fail.

I believe there's still outstanding work to do with this patch with regards to how the intricacies of SO_REUSEADDR and SO_EXCLUSIVEADDRUSE should be handled in the rest of the stdlib. I'm still thinking about the best approach for this. However, the patch as it currently stands is still quite substantial so I wanted to get it out sooner rather than later for review. (I'll forward this to python-dev@ to try and encourage more eyes from people with far more network-fu than I.)
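
Points (a) and (b) above can be sketched as a pair of helpers. This is a Python 3 illustration under stated assumptions, not the code from trunk.2550-2.patch: the function names come from the message, but the bodies, the host default, and the choice of ValueError are guesses. Note also that this bind_port() binds the caller's socket straight to port 0 instead of probing with a temporary socket as in the earlier proposal.

```python
import socket


def find_unused_port(family=socket.AF_INET, socktype=socket.SOCK_STREAM,
                     host="127.0.0.1"):
    """Return a port number that was unused a moment ago.

    A temporary socket is bound to port 0 so the OS picks an ephemeral
    port, which is read back before the socket is closed.  Another
    process may still claim the port afterwards, so bind_port() is the
    safer choice when the caller already owns the socket.
    """
    with socket.socket(family, socktype) as temp:
        temp.bind((host, 0))
        return temp.getsockname()[1]


def bind_port(sock, host="127.0.0.1"):
    """Bind sock to an OS-assigned port and return the port number.

    Following point (b), a socket that already has SO_REUSEADDR set is
    rejected instead of being bound (ValueError is an assumed choice).
    """
    if sock.getsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR):
        raise ValueError("tests should not set SO_REUSEADDR on sockets")
    sock.bind((host, 0))
    return sock.getsockname()[1]


if __name__ == "__main__":
    server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    port = bind_port(server)
    server.listen(1)
    print("server bound to port", port)
    server.close()
```

Rejecting SO_REUSEADDR up front makes a broken invariant visible at bind time, which matches the "fail immediately" intent described in the message.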
msg65224 - (view)	Author: Trent Nelson (trent) *	Date: 2008-04-08 23:48
Committed updates to relevant network-oriented tests, as well as test_support changes discussed, in r62234.
msg104365 - (view)	Author: Antoine Pitrou (pitrou) *	Date: 2010-04-27 21:26
This is now fixed, right? Personal experience as well as buildbot behaviour seems to show that parallel test execution (either through -j, or by running several test suites at the same time) works ok.

History
Date	User	Action	Args
2022-04-11 14:56:33	admin	set	github: 46802
2010-04-27 21:26:06	pitrou	set	status: open -> closed; nosy: + pitrou, exarkun; messages: +; resolution: accepted -> fixed; stage: test needed -> resolved
2010-03-20 17:44:28	r.david.murray	set	stage: test needed; versions: + Python 3.1, Python 2.7, Python 3.2, - Python 3.0
2008-09-18 22:05:37	forest	set	nosy: + forest
2008-05-13 18:23:06	amak	set	nosy: + amak
2008-04-08 23:48:17	trent	set	messages: +
2008-04-08 11:49:32	trent	set	files: + trunk.2550-2.patch; messages: +
2008-04-07 17:20:41	gvanrossum	set	messages: +
2008-04-07 16:03:51	trent	set	messages: +
2008-04-07 15:10:09	gvanrossum	set	nosy: + gvanrossum; messages: +
2008-04-06 22:04:54	nnorwitz	set	resolution: accepted; messages: +; nosy: + nnorwitz
2008-04-06 21:25:02	trent	set	files: + trunk.2550.patch; messages: +
2008-04-06 21:21:34	trent	set	messages: +
2008-04-06 21:21:02	trent	set	messages: +
2008-04-06 21:20:26	trent	set	messages: +
2008-04-04 15:57:33	trent	create