Message 291467 - Python tracker (original) (raw)

In trip_signal [1], the logic goes:

set the flag saying that this particular signal was tripped
write to the wakeup fd
set the global is_tripped flag saying "at least one signal was tripped", and do Py_AddPendingCall (which sets some global flags that the bytecode interpreter checks on every pass through the loop)

So the problem here is that it's step (2) that wakes up the main thread to check for signals, but it's step (3) that actually arranges for the Python-level signal handler to run. (Step (1) turns out to be irrelevant, because no-one looks at the per-signal flags unless the global is_tripped flag is set. This might be why no-one noticed this bug through code inspection though – I certainly missed it, despite explicitly checking for it several times!)

The result is that the following sequence of events is possible:

signal arrives (e.g. SIGINT)
trip_signal writes to the wakeup fd
the main thread blocked in IO wait sees this, and wakes up
the main thread checks for signals, and doesn't find any
the main thread empties the wakeup fd
the main thread goes back to sleep
trip_signal sets the flags to request the Python-level signal handler be run
the main thread doesn't notice, because it's asleep

It turns out that this is a real thing that can actually happen; it's causing an annoying intermittent failure in the trio testsuite on appveyor; and under the correct conditions I can reproduce it very reliably in my local Windows VM. See [2].

I think the fix is just to swap the order of steps (2) and (3), so we write to the wakeup fd last. Unfortunately I can't easily test this because I don't have a way to build CPython on Windows. But [2] has some IMHO pretty compelling evidence that this is what's happening.

[1] https://github.com/python/cpython/blob/6fab78e9027f9ebd6414995580781b480433e595/Modules/signalmodule.c#L238-L291 [2] https://github.com/python-trio/trio/issues/119