Issue 27889: ctypes interfers with signal handling (original) (raw)

Issue27889

Created on 2016-08-29 22:47 by Andre Merzky, last changed 2022-04-11 14:58 by admin. This issue is now closed.

Messages (15)
msg273889 - (view) Author: Andre Merzky (Andre Merzky) * Date: 2016-08-29 22:47
Summary: A thread uses signals to the MainThread to signal an abnormal condition. The main thread is expected to get that signal, invoke the signal handler, and raise an exception. In combination with 'ctypes' calls, that does not happen. Consider the following code: ------------------------------------------------ #!/usr/bin/env python import multiprocessing as mp import threading as mt import signal import time import os # from uuid.py line 400 import ctypes, ctypes.util lib = ctypes.CDLL(ctypes.util.find_library('uuid')) def sigusr2_handler(signum, frame): raise RuntimeError('caught sigusr2') signal.signal(signal.SIGUSR2, sigusr2_handler) def sub(pid): time.sleep(1) os.kill(pid, signal.SIGUSR2) try: # p = mp.Process(target=sub, args=[os.getpid()]) # p.start() t = mt.Thread(target=sub, args=[os.getpid()]) t.start() time.sleep(3) except Exception as e: print 'except: %s' % e else: print 'unexcepted' finally: # p.join() t.join() ------------------------------------------------ With Python 2.7 I would expect the output: ------------------------------------------------ except: caught sigusr2 ------------------------------------------------ but I get: ------------------------------------------------ Traceback (most recent call last): File "./bug.py", line 29, in print 'unexcepted' File "./bug.py", line 13, in sigusr2_handler raise RuntimeError('caught sigusr2') File "./bug.py", line 29, in print 'unexcepted' File "./bug.py", line 13, in sigusr2_handler raise RuntimeError('caught sigusr2') RuntimeError: caught sigusr2 ------------------------------------------------ most of the time. The problem only happens when the 'ctypes.CDLL' line is enabled -- commenting it out results in the expected behavior. That line is copied from uuid.py -- importing uuid.py triggers the same unexpected behavior, which is ultimately why I am stuck. Note that the problem only occurs when a *thread* sends the signal -- it does *not* happen if the signal is sent by the main thread or by a different process (switch to the multiprocessing code path for confirmation). Interestingly, the problem also disappears when a 'print' statement is added after the 'time.sleep(3)', which may (or may not) indicate a timing issue. I would welcome any suggestion on how to further triage this.
msg273891 - (view) Author: Eryk Sun (eryksun) * (Python triager) Date: 2016-08-30 00:36
I can't reproduce this issue in 2.7.11 on Linux. I also don't see how it could be related to ctypes. It doesn't meddle with signal handlers, and neither does Linux libuuid -- not as far I as I can tell with breakpoints set on signal() and sigaction(). Please provide the Python version and platform that you're using.
msg273894 - (view) Author: Andre Merzky (Andre Merzky) * Date: 2016-08-30 02:04
Thanks for checking! I use: $ uname -a Linux thinkie 3.11-2-amd64 #1 SMP Debian 3.11.8-1 (2013-11-13) x86_64 GNU/Linux $ python -V Python 2.7.5+ Note that the problem does not occur on every run -- but in more than 50% of the cases, for me. I am on a somewhat old machine right now (M620), not sure if that matters, and will try to reproduce on some other boxes tomorrow.
msg273913 - (view) Author: Andre Merzky (Andre Merzky) * Date: 2016-08-30 12:40
I can confirm the same unexpected behavior for Python 2.7.6 / Linux radical 4.2.0-25-generic #30~14.04.1-Ubuntu (2nd attempt) Python 2.7.9 / Linux login1.stampede.tacc.utexas.edu 2.6.32-431.17.1.el6.x86_64 (32nd attempt) Python 2.7.10 / Linux gordon-ln2.sdsc.edu 2.6.32-431.29.2.el6.x86_64 (5th attempt) Any suggestion for further testing?
msg273953 - (view) Author: Andre Merzky (Andre Merzky) * Date: 2016-08-30 19:07
I also see the problem on 2.7.11, on MacOS, but with significantly lower frequency. I can't tell if that is due to changes in Python, the different architecture, or whatever... $ python -V Python 2.7.11 $ uname -a Darwin cameo.local 14.4.0 Darwin Kernel Version 14.4.0: Thu May 28 11:35:04 PDT 2015; root:xnu-2782.30.5~1/RELEASE_X86_64 x86_64 $ while true; do i=$((i+1)); echo -n "$i: "; python t.py | break; done 1: except: caught sigusr2 2: except: caught sigusr2 3: except: caught sigusr2 ... 134: except: caught sigusr2 Traceback (most recent call last): File "t.py", line 30, in print 'unexcepted' File "t.py", line 14, in sigusr2_handler raise RuntimeError('caught sigusr2') RuntimeError: caught sigusr2
msg274488 - (view) Author: George Slavin (gslavin) * Date: 2016-09-06 01:00
I can reproduce this issue with Python 2.7.12 and Python 3.5.2 on a Arch linux VM using the python script provided, but I think this is an error in the code. The repro is tied to the time.sleep call in the try block. If I do time.sleep(1), I can reproduce the issue almost every time (with or without ctypes). When the time.sleep() call is too short, you reach the finally block before the main thread receives the signal. Because you're in the finally block, the exception isn't caught.
msg274524 - (view) Author: Andre Merzky (Andre Merzky) * Date: 2016-09-06 05:29
> The repro is tied to the time.sleep call in the try block. If I do time.sleep(1) Yes - if both sleeps, the one in the `try` block and the one in the thread's routine (`sub`) are equal, then you'll have the typical race, and you can well be in the `finally` clause when the exception arises. The problem though is different: please feel free to set the sleep in the `try` block to `sleep(100)` -- and unless the thread creation and startup takes 99 seconds, you should *not* run into that race, the problem however persists: ------------------------------------------------------- merzky@thinkie:~ $ grep -C 2 sleep bug.py def sub(pid): time.sleep(1) os.kill(pid, signal.SIGUSR2) -- t = mt.Thread(target=sub, args=[os.getpid()]) t.start() time.sleep(100) except Exception as e: print 'except: %s' % e merzky@thinkie:~ $ while true; do i=$((i+1)); echo -n "$i: "; python bug.py | break; done 1: except: caught sigusr2 2: except: caught sigusr2 3: except: caught sigusr2 4: Traceback (most recent call last): File "bug.py", line 30, in print 'unexcepted' File "bug.py", line 14, in sigusr2_handler raise RuntimeError('caught sigusr2') RuntimeError: caught sigusr2 ------------------------------------------------------ In this case, the begaviour does depend on `ctypes` -- or at least I have not seen that problem without `ctypes` being used. Thanks, Andre.
msg274530 - (view) Author: George Slavin (gslavin) * Date: 2016-09-06 06:47
Thanks for the reply! I've managed to reproduce the issue without using ctypes, so we can exclude ctypes as the cause of the problem :) The bug only occurs for me after hundreds of iterations of the script, so it is very intermittent. From my results, it appears the signal handler is called, but the exception isn't caught in the try block (this matches your result). From the call stack it appears the exception handler is called at "print 'unexcepted'", which is after the try block. Interestingly, if I place a print after time.sleep, I cannot repro this issue. Maybe there is a race between the signal handler execution and execution of the try block in the main thread? Adding the print might slow down the main thread enough that the race disappears. I don't know the source that well, so I'll need to do some digging to get any more information. I hope this helps! -George Slavin
msg274531 - (view) Author: Andre Merzky (Andre Merzky) * Date: 2016-09-06 06:56
Also thanks for the reply! :) Interesting that our results differ: as said, my code stumbles over it rather frequently, and I can't reproduce w/o ctypes activities. But indeed, the latter might just reinforce a race condition which is present either way... Yes, I also observed that a print statement (or other activities) makes the problem appear less frequently -- I would also guess that it influences a potential race somewhere. Thanks for looking into it! Andre.
msg274533 - (view) Author: George Slavin (gslavin) * Date: 2016-09-06 07:20
In case anyone else sees this thread, here's my trimmed down script to repro the issue: #!/usr/bin/env python2 import threading as mt import signal import time import os def sigusr2_handler(signum, frame): raise RuntimeError('caught sigusr2') signal.signal(signal.SIGUSR2, sigusr2_handler) def sub(pid): time.sleep(1) os.kill(pid, signal.SIGUSR2) try: t = mt.Thread(target=sub, args=[os.getpid()]) t.start() time.sleep(500) except Exception as e: print('except: %s' % e) else: print('unexcepted') finally: t.join() When I run the above, I get the repro after hundreds of iterations. From these results, it appears there is no guarentee that the signal handler will run before the main thread continues execution at the time.sleep(500) line. This would explain why we advance to the else clause before the exception is raised.
msg274534 - (view) Author: Andre Merzky (Andre Merzky) * Date: 2016-09-06 07:35
Hi George, > From these results, it appears there is no guarentee that the signal handler will run before the main thread continues execution at the time.sleep(500) line. This would explain why we advance to the else clause before the exception is raised. To me it looks like the problem pops up *way* before the `sleep(100)` (or whatever) finishes, in fact it looks consistently like the sleep is indeed interrupted after one second. I would it thus interpret differently, as the code should not be able to advance to the `else` clause at that time. Is that different for you?
msg274537 - (view) Author: George Slavin (gslavin) * Date: 2016-09-06 07:46
The docs say the sleep call will end if a signal is caught, so once the main thread wakes, it won't go back to sleep. On Sep 6, 2016 12:35 AM, "Andre Merzky" <report@bugs.python.org> wrote: > > Andre Merzky added the comment: > > Hi George, > > > From these results, it appears there is no guarentee that the signal > handler will run before the main thread continues execution at the > time.sleep(500) line. This would explain why we advance to the else clause > before the exception is raised. > > To me it looks like the problem pops up *way* before the `sleep(100)` (or > whatever) finishes, in fact it looks consistently like the sleep is indeed > interrupted after one second. I would it thus interpret differently, as > the code should not be able to advance to the `else` clause at that time. > > Is that different for you? > > ---------- > > _______________________________________ > Python tracker <report@bugs.python.org> > <http://bugs.python.org/issue27889> > _______________________________________ >
msg274541 - (view) Author: Andre Merzky (Andre Merzky) * Date: 2016-09-06 08:51
I think we are on the same page then, thanks. AFAIU, the C-level signal handler results in a flag being set, which is evaluated at some later point in time[1], after a certain number of opcodes have been executed. Could it be that those opcodes blindly continue to walk into the `else` clause despite the sleep interruption? [1] https://docs.python.org/3/library/signal.html#execution-of-python-signal-handlers
msg275294 - (view) Author: Andre Merzky (Andre Merzky) * Date: 2016-09-09 09:29
I would appreciate any suggestion on how to avoid this problem -- otherwise it seems that I can't reliably use signals in threaded contexts at all :( Thanks, Andre.
msg276203 - (view) Author: Andre Merzky (Andre Merzky) * Date: 2016-09-13 07:44
FWIW, a workaround seems to be a nested try/except clause: try: try: do_lots_of_work() except RuntimeError as e: print 'signal handling worked' except RuntimeError: print 'signal handling delayed' I did a stress test over 100k runs, and got no unexpected behavior that way. I am not sure if the underlying race condition is just hidden, or if the inner except clause triggers the signal handling code internally -- either way, while cumbersome to use, it seems to work. I'd still appreciate feedback on a cleaner solution. Thanks, Andre.
History
Date User Action Args
2022-04-11 14:58:35 admin set github: 72076
2021-02-25 12:35:33 eryksun set status: open -> closedresolution: out of datestage: resolved
2016-09-13 07:44:48 Andre Merzky set messages: +
2016-09-09 09:29:07 Andre Merzky set messages: +
2016-09-06 08:51:41 Andre Merzky set messages: +
2016-09-06 07:46:28 gslavin set messages: +
2016-09-06 07:35:07 Andre Merzky set messages: +
2016-09-06 07:20:51 gslavin set messages: +
2016-09-06 06:56:29 Andre Merzky set messages: +
2016-09-06 06:47:04 gslavin set messages: +
2016-09-06 05:29:59 Andre Merzky set messages: +
2016-09-06 01:00:04 gslavin set nosy: + gslavinmessages: +
2016-08-30 19:07:31 Andre Merzky set messages: +
2016-08-30 12:40:55 Andre Merzky set messages: +
2016-08-30 02:04:31 Andre Merzky set messages: +
2016-08-30 00:36:43 eryksun set nosy: + eryksunmessages: +
2016-08-29 22:47:38 Andre Merzky create