msg260273 - (view) |
Author: Mark Dickinson (mark.dickinson) *  |
Date: 2016-02-14 11:38 |
On OS X (10.9.5), I'm getting an apparent deadlock in the following simple Python script: #-------------------------------------------------------- import itertools import threading def is_prime(n): return n >= 2 and all(n % d for d in xrange(2, n)) def count_primes_in_range(start, stop): return sum(is_prime(n) for n in xrange(start, stop)) def main(): threads = [ threading.Thread( target=count_primes_in_range, args=(12500*i, 12500*(i+1)) ) for i in xrange(8) ] for thread in threads: thread.start() for thread in threads: thread.join() if __name__ == '__main__': for i in itertools.count(): print "Iteration: ", i main() #-------------------------------------------------------- Each iteration takes around 60 seconds, and I typically get a deadlock within the first 5 iterations. It looks as though the deadlock happens during the "thread.join", at a stage where some of the threads have already completed and been joined. The code hangs with no CPU activity, and a backtrace (attached) shows that all the background threads are waiting for the GIL, while the main thread is in a blocking `thread.join` call. I've attached a gdb-generated stack trace. I was unable to reproduce this with a debug build of Python. I *have* reproduced with a normal build of Python, and on various Python 2.7 executables from 3rd party sources (Apple, Macports, Enthought Canopy). I've also not yet managed to reproduce on Python 3, but I haven't tried that hard. I suspect it's a Python 2-only problem, though. (And yes, this is horrible code that doesn't make much sense. It's actually a cut-down example from a "how not to do it" part of a course on concurrency. Nevertheless, it shouldn't be deadlocking.) |
|
|
msg260274 - (view) |
Author: Mark Dickinson (mark.dickinson) *  |
Date: 2016-02-14 11:39 |
Here's the hanging code as an attached file. |
|
|
msg260275 - (view) |
Author: Mark Dickinson (mark.dickinson) *  |
Date: 2016-02-14 11:44 |
Build details: this was a version of Python built with a simple `./configure && make` from commit 32893d8a52a9. taniyama:cpython mdickinson$ ./python.exe Python 2.7.10+ (2.7:32893d8a52a9, Feb 14 2016, 11:08:14) [GCC 4.2.1 Compatible Apple LLVM 6.0 (clang-600.0.57)] on darwin Type "help", "copyright", "credits" or "license" for more information. |
|
|
msg260277 - (view) |
Author: Mark Dickinson (mark.dickinson) *  |
Date: 2016-02-14 12:16 |
One thing that looks a bit odd to my eyes is the different values for `lock` in the PyThread_acquire_lock calls (see below). All of threads 2-5 should be waiting on the GIL, so I'd expect to see the same value for `lock` in all those calls. Instead, threads 2 and 5 seem to be waiting on the same lock, but threads 3 and 4 seem to have their own lock. But I don't know this code well enough to know whether it's supposed to be this way. Thread 5 (Thread 0x1903 of process 14075): #2 0x00000001000e7e21 in PyThread_acquire_lock (lock=0xed3ed500, waitflag=) at Python/thread_pthread.h:429 Thread 4 (Thread 0x1803 of process 14075): #2 0x00000001000e7e21 in PyThread_acquire_lock (lock=0xed3ed300, waitflag=) at Python/thread_pthread.h:429 Thread 3 (Thread 0x1703 of process 14075): #2 0x00000001000e7e21 in PyThread_acquire_lock (lock=0xed3ed600, waitflag=) at Python/thread_pthread.h:429 Thread 2 (Thread 0x1603 of process 14075): #2 0x00000001000e7e21 in PyThread_acquire_lock (lock=0xed3ed300, waitflag=) at Python/thread_pthread.h:429 Thread 1 (Thread 0x1503 of process 14075): #2 0x00000001000e7e21 in PyThread_acquire_lock (lock=0x0, waitflag=) at Python/thread_pthread.h:429 |
|
|
msg260278 - (view) |
Author: Mark Dickinson (mark.dickinson) *  |
Date: 2016-02-14 12:17 |
> threads 2 and 5 Sorry, that should have said "threads 2 and 4". |
|
|
msg260279 - (view) |
Author: Serhiy Storchaka (serhiy.storchaka) *  |
Date: 2016-02-14 12:22 |
lock=0x0 looks suspicious. |
|
|
msg260281 - (view) |
Author: Mark Dickinson (mark.dickinson) *  |
Date: 2016-02-14 12:43 |
> lock=0x0 looks suspicious. Agreed. I think that *could* possibly be a consequence of the earlier multiple GILs, though I haven't figured out exactly how yet. |
|
|
msg260283 - (view) |
Author: Mark Dickinson (mark.dickinson) *  |
Date: 2016-02-14 14:32 |
Update: I *have* been able to reproduce in a debug build. Here's the stack trace. |
|
|
msg260372 - (view) |
Author: STINNER Victor (vstinner) *  |
Date: 2016-02-17 01:26 |
debug-build-stacktrace.txt: * Thread 1 is waiting on threading_hang.py:21: thread.join() * Thread 2-6 (5 threads) are waiting for the GIL The thread 1 is not supposed to hold the GIL: thread.join() is implemented by lock_PyThread_acquire_lock() at Modules/threadmodule.c:52: Py_BEGIN_ALLOW_THREADS i = PyThread_acquire_lock(self->lock_lock, i); Py_END_ALLOW_THREADS Note: I backported enhancements of python-gdb.py from Python 3, but I'm not sure that "Waiting for the GIL" line is reliable :-/ I recall vaguely a technical issue specific to Python 2 to check if a thread is waiting on the GIL or not. |
|
|
msg260374 - (view) |
Author: STINNER Victor (vstinner) *  |
Date: 2016-02-17 01:33 |
> I've also not yet managed to reproduce on Python 3, but I haven't tried that hard. I suspect it's a Python 2-only problem, though. Python 3.2 got a new GIL and many bugs related to the GIL were fixed in Python 3 too. The "workaround" is to use Python 3, until this Python 2 issue is fixed (if we are able to fix it!). |
|
|
msg260375 - (view) |
Author: STINNER Victor (vstinner) *  |
Date: 2016-02-17 01:46 |
>> lock=0x0 looks suspicious. > Agreed. I think that *could* possibly be a consequence of the earlier multiple GILs, though I haven't figured out exactly how yet. Don't trust gdb output when Python is compiled in release mode (with -O3). gdb output is only reliable when Python is compiled with -O0 (./configure --with-pydebug is enough for Python 2.7). |
|
|
msg260378 - (view) |
Author: Mark Dickinson (mark.dickinson) *  |
Date: 2016-02-17 07:14 |
> I'm not sure that "Waiting for the GIL" line is reliable Right; I was interpreting that line more as "I don't have the GIL at the moment", which I think applies to all threads, including the main thread. It's pretty clear that threads 2-5 really are waiting for the GIL, and (as you point out), for thread 1 to be where it is, it must have also released the GIL by that point. Agreed about the gdb results for the -O3 build. So it looks as though something's gone wrong with the "I've just released the GIL; now someone else can pick it up" signaling mechanism. |
|
|
msg260386 - (view) |
Author: STINNER Victor (vstinner) *  |
Date: 2016-02-17 10:40 |
"Each iteration takes around 60 seconds, and I typically get a deadlock within the first 5 iterations." I ran a modified version of threading_hang.py (I added some debug print) a whole night: it's still running after 118 iterations. Each iteration takes ~5 minutes, I'm using a debug build of Python 2.7 on Linux (Fedora 23). |
|
|
msg260406 - (view) |
Author: Mark Dickinson (mark.dickinson) *  |
Date: 2016-02-17 17:29 |
Thanks for testing! It wouldn't entirely surprise me if this turned out to be an Apple bug somewhere. |
|
|
msg260407 - (view) |
Author: STINNER Victor (vstinner) *  |
Date: 2016-02-17 17:39 |
> Thanks for testing! It wouldn't entirely surprise me if this turned out to be an Apple bug somewhere. I'm ran threading_hang.py on Mac OS X 10.10 Yosemite (darwin 14.5) on Python 2.7 compiled in release mode. After 28 iterations (~4 hours, 9 minutes per iteration), the bug was still not reproduced. It looks like my OS X version is more recent than yours. You should run your test on various OS X versions to try to identify the bug. Since it's likely a race condition, I would not be surprised that the bug is very hard to reproduce and may exist on other platforms, but may be less likely on other platforms or even less likely on some computers for various reasons... |
|
|
msg260422 - (view) |
Author: Ned Deily (ned.deily) *  |
Date: 2016-02-18 02:07 |
I'm not able to test it myself at the moment but you could also try ensuring your 2.7 build is with the most up-to-date ABI for your system: ./configure MACOSX_DEPLOYMENT_TARGET=10.9 2.7 builds do not set this automatically; 3.x builds do. |
|
|
msg260446 - (view) |
Author: Mark Dickinson (mark.dickinson) *  |
Date: 2016-02-18 09:39 |
Thanks, Ned. I tried with: ./configure --with-pydebug MACOSX_DEPLOYMENT_TARGET=10.9 && make I'm still able to reproduce. I plan to find some time this weekend to dig into the source and see if I can narrow down the cause (or at least make the hang easier / quicker to reproduce). |
|
|
msg260471 - (view) |
Author: Mark Dickinson (mark.dickinson) *  |
Date: 2016-02-18 17:31 |
A minor update: I did manage to verify that the GIL is indeed not held by any threads at the moment of the hang, so it does look very much as though a `pthread_cond_signal` signal has gone astray somehow. |
|
|
msg260742 - (view) |
Author: Mark Dickinson (mark.dickinson) *  |
Date: 2016-02-23 19:55 |
I've found OS X 10.10 and OS X 10.11 machines to test on. I'm able to reproduce on 10.10, but not on 10.11. This adds to the plausibility of the "Apple bug" explanation. |
|
|
msg352002 - (view) |
Author: Kirill Smelkov (navytux) * |
Date: 2019-09-11 17:45 |
Maybe related. |
|
|
msg352423 - (view) |
Author: Mark Dickinson (mark.dickinson) *  |
Date: 2019-09-14 09:37 |
> Maybe related. That looks plausible, but unfortunately I'm still able to reproduce the hang with your PR (commit 9b135c02aa1edab4c99c915c43cd62d988f1f9c1, macOS 10.14.6). |
|
|
msg352439 - (view) |
Author: Mark Dickinson (mark.dickinson) *  |
Date: 2019-09-14 16:01 |
FWIW, I've confirmed again that the exact same script on the same machine seems fine under Python 3.x. Given the imminent demise of Python 2, perhaps this issue is just destined to be an unsolved historical oddity. |
|
|
msg352459 - (view) |
Author: Kyle Stanley (aeros) *  |
Date: 2019-09-15 04:06 |
> FWIW, I've confirmed again that the exact same script on the same machine seems fine under Python 3.x. Given the imminent demise of Python 2, perhaps this issue is just destined to be an unsolved historical oddity. Since it doesn't seem to be occurring across every macOS platform, is specific to 2.7, and isn't proving to be an easy fix, it's probably not worthwhile to invest a significant amount of further time into the issue. Thanks for working on it though, at least we tried. (: |
|
|
msg352469 - (view) |
Author: Kirill Smelkov (navytux) * |
Date: 2019-09-15 09:06 |
> > Maybe related. > > That looks plausible, but unfortunately I'm still able to reproduce the hang with your PR (commit 9b135c02aa1edab4c99c915c43cd62d988f1f9c1, macOS 10.14.6). Thanks for feedback. Then hereby bug is probably deadlock of another kind. |
|
|
msg367349 - (view) |
Author: Zachary Ware (zach.ware) *  |
Date: 2020-04-27 01:56 |
As 2.7 is now EOL, I'm closing the issue. |
|
|