Issue 14396: Popen wait() doesn't handle spurious wakeups

Created on 2012-03-24 04:04 by amscanne, last changed 2022-04-11 14:57 by admin. This issue is now closed.

Messages (7)

msg156683

Author: Adin Scannell (amscanne)

Date: 2012-03-24 04:04

While running a complex Python process that executes a bunch of subprocesses (using the subprocess module, specifically calling communicate()), I found myself with occasional zombie processes piling up. It turns out Python is not correctly wait()ing for the children. Although in my case it happens for < 5% of subprocesses, it happens for random Popen objects, used in different ways (using Popen() and then read()/write()/wait() directly, or with communicate()). I'd love to find out I'm crazy, but I'm not doing anything too sneaky and the patch below fixes the problem.

I'm not sure why it's happening in my particular environment (maybe it just so happens that the child processes enter into states with particular timing, or the parent receives signals at the wrong moments) but it's very reproducible for me.

I believe that the cause of the zombie processes is as follows:

If you read the description of the waitpid system call (http://www.kernel.org/doc/man-pages/online/pages/man2/wait.2.html), there are several events that could cause waitpid() to return. I have no idea why, but even without WNOHANG set, it looks like I'm getting back an occasional 0 return value from waitpid(). Interrupted system call? Stopped child process? Not sure why at the moment. The documentation is a bit ambiguous as to whether this can happen, BUT looking at the example code at the bottom, it seems to handle this spurious wakeup case (which subprocess does not). The net result is that the child process has not exited or been killed. The Python code paths don't consider this possibility (as I believe in normal circumstances, it rarely happens!).

I discovered this bug on 2.7.2. I've prepared a patch for the 2.7 branch (75701:d46c1973d3c4), although I'm certain almost all versions, including the tip, suffer from this problem. I'm happy to port to other branches if necessary, although I think appropriate maintainers could whip it up in no time flat. I've tested my 2.7 fix and it solves my problem -- no more zombies. This patch does not change the behaviour of the Popen class in the normal case but allows it to handle spurious wakeups.
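The attached patch is not reproduced in this thread. As a rough illustration of the idea only, a wait loop that tolerates a spurious return might look like the sketch below. The helper name `wait_for_child` and its injectable `waitpid` parameter are hypothetical (the parameter exists so the loop can be exercised without forking); this is not the subprocess module's code.

```python
import errno
import os


def wait_for_child(pid, waitpid=os.waitpid):
    """Block until child `pid` is reaped and return its exit code.

    Hypothetical helper for illustration; not the subprocess implementation.
    Returns a negative signal number if the child was killed by a signal,
    or None if the child was already reaped elsewhere (ECHILD).
    """
    while True:
        try:
            reaped_pid, status = waitpid(pid, 0)  # blocking: no WNOHANG
        except OSError as exc:
            if exc.errno == errno.EINTR:
                continue  # interrupted by a signal: retry the call
            if exc.errno == errno.ECHILD:
                return None  # no such child; its status is unknown
            raise
        if reaped_pid != pid:
            # Spurious return (for example pid 0): the child has not been
            # reaped yet, so keep waiting instead of trusting `status`.
            continue
        if os.WIFSIGNALED(status):
            return -os.WTERMSIG(status)
        if os.WIFEXITED(status):
            return os.WEXITSTATUS(status)
        # Any other status (stopped/continued): the child is still alive.
```

The key point is the `reaped_pid != pid` check: the status is only interpreted once waitpid() reports the pid that was actually waited for.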

msg156684

Author: Gregory P. Smith (gregory.p.smith) * (Python committer)

Date: 2012-03-24 07:05

Thanks. I'll see that this fix gets into 2.7, 3.2 and 3.3.

Out of curiosity, what Linux kernel version and glibc version were you using?

I'm somewhat surprised that I haven't run into this before. :)

msg156685

Author: Adin Scannell (amscanne)

Date: 2012-03-24 08:14

Kernel is 3.0.0-15-generic (I believe the stock Ubuntu Oneiric kernel). The glibc version is Ubuntu EGLIBC 2.13-20ubuntu5.

I'm working on figuring out the exact conditions under which it happens and creating a harness. I'll post it when I've got it.

msg156743

Author: Charles-François Natali (neologix) * (Python committer)

Date: 2012-03-25 08:40

> I'm working on figuring out the exact conditions under which it happens and creating a harness. I'll post it when I've got it.

Please do so, because I'm quite skeptical about waitpid() returning 0 without WNOHANG. If you can reproduce it fairly consistently, you could try running under strace to see what's happening.
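No harness was posted in this thread. Purely as a hypothetical sketch of what one could look like, the function below forks short-lived children and counts blocking waitpid() calls (no WNOHANG) that return a pid other than the child's. The name `count_spurious_returns` and the default of 50 children are assumptions for illustration; on a typical system the count is expected to stay at zero, which is the skeptical position above.

```python
import os


def count_spurious_returns(children=50):
    """Fork `children` short-lived processes and count blocking waitpid()
    calls that return without reaping the child that was waited for.

    Hypothetical harness for illustration; POSIX only (uses os.fork).
    """
    spurious = 0
    for i in range(children):
        pid = os.fork()
        if pid == 0:
            os._exit(i % 256)  # child: exit immediately
        while True:
            reaped_pid, _status = os.waitpid(pid, 0)  # no WNOHANG
            if reaped_pid == pid:
                break  # child reaped normally
            spurious += 1  # returned without reaping: record and retry
    return spurious
```

Running such a script under strace, as suggested, would show what the underlying system call actually returned for any call counted as spurious.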

msg175320

Author: Roundup Robot (python-dev) (Python triager)

Date: 2012-11-11 05:10

New changeset d478df13abde by Gregory P. Smith in branch '3.2': Fixes issue #14396: Handle the odd rare case of waitpid returning 0 when http://hg.python.org/cpython/rev/d478df13abde

New changeset 61a0eace0f2e by Gregory P. Smith in branch '3.3': Fixes issue #14396: Handle the odd rare case of waitpid returning 0 http://hg.python.org/cpython/rev/61a0eace0f2e

New changeset 512c1120332f by Gregory P. Smith in branch 'default': Fixes issue #14396: Handle the odd rare case of waitpid returning 0 http://hg.python.org/cpython/rev/512c1120332f

msg175321

Author: Roundup Robot (python-dev) (Python triager)

Date: 2012-11-11 05:13

New changeset 82711f5ab507 by Gregory P. Smith in branch '2.7': Fixes issue #14396: Handle the odd rare case of waitpid returning 0 http://hg.python.org/cpython/rev/82711f5ab507

msg175322

Author: Gregory P. Smith (gregory.p.smith) * (Python committer)

Date: 2012-11-11 05:16

Regardless of knowing how to reproduce this system call behavior, the changes necessary to handle it robustly are easy enough. Fixed.

3.3+ already handled it if a timeout was specified (new feature). I only had to fix the default no timeout case.
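To illustrate why a timeout-based wait tolerates a 0 return naturally, here is a minimal sketch of a polling wait, assuming it is built on waitpid() with WNOHANG. The helper `wait_with_timeout`, its `poll_interval` parameter, and the injectable `waitpid` argument are hypothetical and are not the subprocess module's actual code.

```python
import os
import time


def wait_with_timeout(pid, timeout, waitpid=os.waitpid, poll_interval=0.01):
    """Poll for child `pid` until it is reaped or `timeout` seconds pass.

    Hypothetical helper for illustration. Returns the raw wait status,
    or raises TimeoutError if the child has not been reaped in time.
    """
    deadline = time.monotonic() + timeout
    while True:
        reaped_pid, status = waitpid(pid, os.WNOHANG)
        if reaped_pid == pid:
            return status  # child reaped: status is meaningful
        # With WNOHANG, a pid of 0 simply means "not exited yet", so a
        # polling loop has to check the returned pid on every iteration.
        if time.monotonic() >= deadline:
            raise TimeoutError("child %d did not exit in time" % pid)
        time.sleep(poll_interval)
```

Because a polling loop must already treat a 0 pid as "keep waiting", only a blocking wait that calls waitpid() once is exposed to the spurious return.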

History

Date | User | Action | Args
2022-04-11 14:57:28 | admin | set | github: 58604
2012-11-11 05:16:10 | gregory.p.smith | set | status: open -> closed; resolution: fixed; messages: +; versions: + Python 3.4
2012-11-11 05:13:36 | python-dev | set | messages: +
2012-11-11 05:10:39 | python-dev | set | nosy: + python-dev; messages: +
2012-03-25 08:40:40 | neologix | set | messages: +
2012-03-24 10:44:15 | pitrou | set | nosy: + neologix
2012-03-24 08:14:38 | amscanne | set | messages: +
2012-03-24 07:05:37 | gregory.p.smith | set | assignee: gregory.p.smith; messages: +; nosy: + gregory.p.smith, - gps; versions: + Python 3.2, Python 3.3
2012-03-24 05:48:26 | brian.curtin | set | keywords: + needs review; nosy: + gps; stage: patch review; versions: - Python 2.6, Python 3.1, Python 3.2, Python 3.3, Python 3.4
2012-03-24 04:04:23 | amscanne | create |