CentOS Linux 5, Python 2.4.3 (but the code appears unchanged in 2.5 and trunk, so I don't believe this bug has already been fixed).

We have an xmlrpc server that subclasses DocXMLRPCServer.DocXMLRPCServer and SocketServer.ForkingMixIn. Under load, it sometimes crashes with an error in SocketServer.ForkingMixIn.collect_children.

The bug is that collect_children calls os.waitpid with pid 0, so it waits for any child. It then assumes that the pid it found is in the list self.active_children, and attempts to remove it from that list without a try block. However, another call to collect_children could have already removed it, so we get "ValueError: list.remove(x): x not in list".

The fix is just adding a try/except block around the attempt to remove pid from self.active_children.

diff -u SocketServer.py /tmp/SocketServer.py
--- SocketServer.py 2007-08-27 10:52:24.000000000 -0400
+++ /tmp/SocketServer.py 2007-09-20 15:34:00.000000000 -0400
@@ -421,7 +421,10 @@
             except os.error:
                 pid = None
             if not pid: break
-            self.active_children.remove(pid)
+            try:
+                self.active_children.remove(pid)
+            except ValueError:
+                pass
 
     def process_request(self, request, client_address):
         """Fork a new subprocess to process the request."""
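The failure described in the report can be seen without forking anything: list.remove raises ValueError when the pid is no longer in the list. The sketch below isolates that step. forget_child is a hypothetical helper name used only for illustration (it is not part of SocketServer), and the pids are made-up values.

```python
def forget_child(active_children, pid):
    """Drop pid from the tracking list.

    Returns True if the pid was tracked, False if it had already been
    removed (or was never tracked), instead of raising ValueError.
    Hypothetical helper, not part of SocketServer.
    """
    try:
        active_children.remove(pid)
    except ValueError:
        return False
    return True


if __name__ == "__main__":
    children = [101, 102]  # made-up pids for illustration
    forget_child(children, 101)  # first removal succeeds
    forget_child(children, 101)  # second removal is silently ignored
    try:
        children.remove(101)  # the unguarded call from the original code
    except ValueError as exc:
        print("unguarded remove failed: %s" % exc)
```

This only hides the symptom. It does not stop os.waitpid(0, ...) from collecting children that the server did not create.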
I've had the exact same error - but only when I used a subclass of XMLRPCServer, which installed signal handlers for SIGCHLD (which then called collect_children). Does your code install such a signal handler? (I found mine somewhere on the web). Do you start any subprocesses?
Hmm. I think the race can only happen if you call collect_children() concurrently from multiple threads or from a signal handler. The waitpid(0) bug (which affected anyone who spawned subprocesses from anything other than ForkingMixIn) is partly fixed by r61106, but I don't intend to make ForkingMixIn thread- or signal-safe. Let me know if this is enough for you. :)
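One way to avoid collecting unrelated subprocesses is to poll each tracked pid individually rather than calling os.waitpid(0, ...). The sketch below illustrates that general idea under stated assumptions. It is not a copy of r61106, and reap_tracked_children is a hypothetical function name. Like ForkingMixIn itself, it makes no attempt to be thread- or signal-safe.

```python
import os


def reap_tracked_children(active_children):
    """Poll every tracked pid without blocking.

    Pids whose process has exited (or that can no longer be waited on)
    are removed from active_children. Returns the list of pids dropped.
    Hypothetical helper, not part of SocketServer.
    """
    dropped = []
    for pid in list(active_children):  # iterate over a copy while mutating
        try:
            # WNOHANG: returns (0, 0) if the child is still running.
            done, _status = os.waitpid(pid, os.WNOHANG)
        except OSError:
            # Not our child, or already reaped elsewhere: stop tracking it.
            done = pid
        if done:
            try:
                active_children.remove(pid)
            except ValueError:
                pass  # someone else removed it first
            else:
                dropped.append(pid)
    return dropped
```

Because each call names a specific pid, a subprocess started elsewhere in the program is never reaped by accident. The cost is one waitpid call per tracked child on every pass.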