msg141295 - (view) |
Author: (gabriele.trombetti) |
Date: 2011-07-28 13:58 |
There seems to be a file descriptor (fd) leak in subprocess upon call to kill() and then destroying the current subprocess object (e.g. by scope exit) if the pipe functionality is being used. This is a reproducer: (Linux 2.6.25, Python 2.7.1 (r271:86832)) import subprocess, time def leaktest(): subp = subprocess.Popen("echo hi; sleep 200; echo bye", shell=True, stdout=subprocess.PIPE, stderr=subprocess.PIPE) #you can do something here subp.terminate() #subp.wait() #Fixes the bug #time.sleep(0.001) #fixes the bug #time.sleep(0.0000001) #doesn't fix the bug return True Launch the function multiple times interactively, each time the number of open file descriptors for the python process will increase by 2. You can see that by "ls -l /proc//fd" This seems to be a race condition because adding a time.sleep(0.001) before the return fixes the problem. Probably some kernel delay is responsible for the race. This bug is significant for daemons because the daemon will die once the number of open file descriptors hits the ulimit, usually 1024, so please fix. Note: until the bug is fixed, a simple workaround (not documented in module docs though) is to call Popen.wait() after Popen.kill() Thank you |
|
|
msg141307 - (view) |
Author: Charles-François Natali (neologix) *  |
Date: 2011-07-28 16:02 |
There's indeed a leak in your code, because you don't close the stdout and stderr file descriptors. Try with: subp.terminate() subp.stdout.close() subp.stderr.close() return True And you'll be just fine. The reason why this works with a short delay is subtle however, and there's actually a bug: - when a subprocess.Popen is created, its stdout and stderr are wrapped with os.fdopen() - file objects returned by fdopen() are automatically closed when the object is deallocated (i.e. when the file object goes out of scope in cPython) - so in theory, when your subprocess.Popen goes out of scope, the corresponding FDs should be closed, and you shoudn't have a leak (note that this assumes cPython, and will maybe not work with other implementations - you shouldn't rely on this) - the problem is that when subprocess.Popen's __del__ method is called before the process has exited (i.e. if you return from leaktest() before the process has exited), the Popen object is added to a list (subprocess._active), so that it can be waited (to avoid zombies) - Popen objects from this list are collected (i.e. the process is waited for, and if terminated it's removed from the list) synchronously when a new Popen() object is created (_cleanup funtion) The problem is that there's a bug in the collection code: def _cleanup(): for inst in _active[:]: res = inst._internal_poll(_deadstate=sys.maxint) if res is not None and res >= 0: try: _active.remove(inst) except ValueError: # This can happen if two threads create a new Popen instance. # It's harmless that it was already removed, so ignore. pass res = inst._internal_poll(_deadstate=sys.maxint) if res is not None and res >= 0: If the process exit code is negative (in your case, -SIGTERM) then the Popen object is not removed from the list. Two consequences: - the object is not deallocated, the corresponding FDs neither, and you hit RLIMIT_NOFILE - even if stdout and stderr are closed manually, the object itself is not deallocated, so you have an ever-growing _active list, and a memory leak (and walking the list takes longer and longer) Changing if res is not None and res >= 0: to if res is not None: fixes this. I'll look at this in more detail and provide a patch. |
|
|
msg141317 - (view) |
Author: Charles-François Natali (neologix) *  |
Date: 2011-07-28 21:30 |
Alright. I tested this on default, and couldn't reproduce the FD leak. It turned out to be due to another bug, affecting only the code path which calls pure C _posixsubprocess (which is the only implementation left in 3.3, but 3.2 still has the old pure-Python version). The code just forgets to set Popen._child_created to true after fork(), so when Popen.__del__() gets called before the process has exited, the object is not added to the _active list, and gets deallocated immediately. While this accidentaly "fixes" the FD leak, this has another - worse - side effet: the process remains a zombie. I'm thus attaching three patches, with tests: - one for 2.7, which fixes the original problem (i.e. remove the process from _active once exited, even if it got killed by a signal) - one for default, which also sets _child_created to True after fork() - another one for 3.2, which does the same thing as the one for default (but the code is a little different because 3.2 has both pure-Python and C implementation) Reviews welcome! |
|
|
msg141329 - (view) |
Author: Ross Lagerwall (rosslagerwall)  |
Date: 2011-07-29 05:43 |
The patches look good and seem to fix the issue. |
|
|
msg142337 - (view) |
Author: Roundup Robot (python-dev)  |
Date: 2011-08-18 15:18 |
New changeset 9ee802642d86 by Charles-François Natali in branch '2.7': Issue #12650: Fix a race condition where a subprocess.Popen could leak http://hg.python.org/cpython/rev/9ee802642d86 |
|
|
msg142348 - (view) |
Author: Roundup Robot (python-dev)  |
Date: 2011-08-18 15:54 |
New changeset 7d358379c5fb by Charles-François Natali in branch '2.7': Issue #12650: only run the tests on Unix. http://hg.python.org/cpython/rev/7d358379c5fb |
|
|
msg142366 - (view) |
Author: Roundup Robot (python-dev)  |
Date: 2011-08-18 16:48 |
New changeset 8ca7f109ce79 by Charles-François Natali in branch '3.2': Issue #12650: Fix a race condition where a subprocess.Popen could leak http://hg.python.org/cpython/rev/8ca7f109ce79 |
|
|
msg142373 - (view) |
Author: Roundup Robot (python-dev)  |
Date: 2011-08-18 17:12 |
New changeset 148d75d106f1 by Charles-François Natali in branch 'default': Issue #12650: Fix a race condition where a subprocess.Popen could leak http://hg.python.org/cpython/rev/148d75d106f1 |
|
|
msg142391 - (view) |
Author: Charles-François Natali (neologix) *  |
Date: 2011-08-18 18:21 |
Patch committed. This should also fix the original problem. Gabriele, thanks for the report. |
|
|
msg142402 - (view) |
Author: Charles-François Natali (neologix) *  |
Date: 2011-08-18 21:43 |
Reopening. The new tests are failing on solaris and Debian parallel buildbots: http://www.python.org/dev/buildbot/all/builders/x86 debian parallel 3.x/builds/2734/steps/test/logs/stdio http://www.python.org/dev/buildbot/all/builders/sparc solaris10 gcc 3.x/builds/3528/steps/test/logs/stdio """ ====================================================================== FAIL: test_zombie_fast_process_del (test.test_subprocess.POSIXProcessTestCase) ---------------------------------------------------------------------- Traceback (most recent call last): File "/home2/buildbot2/slave/3.x.loewis-parallel/build/Lib/test/test_subprocess.py", line 1519, in test_zombie_fast_process_del self.assertRaises(OSError, os.waitpid, pid, 0) AssertionError: OSError not raised by waitpid """ I've no idea what's going on here... This means that the child process didn't get reaped by _cleanup(). It could also maybe be due to the first child's PID being reused immediately by the second. |
|
|
msg142404 - (view) |
Author: Roundup Robot (python-dev)  |
Date: 2011-08-18 22:25 |
New changeset 122f952add3c by Charles-François Natali in branch '3.2': Issue #12650: fix failures on some buildbots, when a subprocess takes a long http://hg.python.org/cpython/rev/122f952add3c New changeset a32ae2749cd1 by Charles-François Natali in branch 'default': Issue #12650: fix failures on some buildbots, when a subprocess takes a long http://hg.python.org/cpython/rev/a32ae2749cd1 New changeset 5333ed548257 by Charles-François Natali in branch '2.7': Issue #12650: fix failures on some buildbots, when a subprocess takes a long http://hg.python.org/cpython/rev/5333ed548257 |
|
|
msg142405 - (view) |
Author: Charles-François Natali (neologix) *  |
Date: 2011-08-18 22:32 |
> This means that the child process didn't get reaped by _cleanup(). The most likely cause is that when _cleanup() calls waitpid(pid, WNOHANG), the child process hasn't exited yet. Since I had already set a rather high timeout to avoid this problem and we're only interested in checking that the child is correctly added to the _active list, I've removed the part of test_zombie_fast_process_del which checked that the child got collected when the next Popen instance is allocated (I added this check in test_leak_fast_process_del_killed, which doesn't have this timeout problem because the process gets killed). I'll watch the buildbots. |
|
|
msg142422 - (view) |
Author: Charles-François Natali (neologix) *  |
Date: 2011-08-19 10:27 |
The test now passes on the buildbots, closing. |
|
|