Issue 26360: Deadlock in thread.join on Python 2.7/macOS (original) (raw)

Issue26360

Created on 2016-02-14 11:38 by mark.dickinson, last changed 2022-04-11 14:58 by admin. This issue is now closed.

Files
File name Uploaded Description Edit
hang_backtrace.txt mark.dickinson,2016-02-14 11:38
threading_hang.py mark.dickinson,2016-02-14 11:39
debug-build-stacktrace.txt mark.dickinson,2016-02-14 14:32
Messages (25)
msg260273 - (view) Author: Mark Dickinson (mark.dickinson) * (Python committer) Date: 2016-02-14 11:38
On OS X (10.9.5), I'm getting an apparent deadlock in the following simple Python script: #-------------------------------------------------------- import itertools import threading def is_prime(n): return n >= 2 and all(n % d for d in xrange(2, n)) def count_primes_in_range(start, stop): return sum(is_prime(n) for n in xrange(start, stop)) def main(): threads = [ threading.Thread( target=count_primes_in_range, args=(12500*i, 12500*(i+1)) ) for i in xrange(8) ] for thread in threads: thread.start() for thread in threads: thread.join() if __name__ == '__main__': for i in itertools.count(): print "Iteration: ", i main() #-------------------------------------------------------- Each iteration takes around 60 seconds, and I typically get a deadlock within the first 5 iterations. It looks as though the deadlock happens during the "thread.join", at a stage where some of the threads have already completed and been joined. The code hangs with no CPU activity, and a backtrace (attached) shows that all the background threads are waiting for the GIL, while the main thread is in a blocking `thread.join` call. I've attached a gdb-generated stack trace. I was unable to reproduce this with a debug build of Python. I *have* reproduced with a normal build of Python, and on various Python 2.7 executables from 3rd party sources (Apple, Macports, Enthought Canopy). I've also not yet managed to reproduce on Python 3, but I haven't tried that hard. I suspect it's a Python 2-only problem, though. (And yes, this is horrible code that doesn't make much sense. It's actually a cut-down example from a "how not to do it" part of a course on concurrency. Nevertheless, it shouldn't be deadlocking.)
msg260274 - (view) Author: Mark Dickinson (mark.dickinson) * (Python committer) Date: 2016-02-14 11:39
Here's the hanging code as an attached file.
msg260275 - (view) Author: Mark Dickinson (mark.dickinson) * (Python committer) Date: 2016-02-14 11:44
Build details: this was a version of Python built with a simple `./configure && make` from commit 32893d8a52a9. taniyama:cpython mdickinson$ ./python.exe Python 2.7.10+ (2.7:32893d8a52a9, Feb 14 2016, 11:08:14) [GCC 4.2.1 Compatible Apple LLVM 6.0 (clang-600.0.57)] on darwin Type "help", "copyright", "credits" or "license" for more information.
msg260277 - (view) Author: Mark Dickinson (mark.dickinson) * (Python committer) Date: 2016-02-14 12:16
One thing that looks a bit odd to my eyes is the different values for `lock` in the PyThread_acquire_lock calls (see below). All of threads 2-5 should be waiting on the GIL, so I'd expect to see the same value for `lock` in all those calls. Instead, threads 2 and 5 seem to be waiting on the same lock, but threads 3 and 4 seem to have their own lock. But I don't know this code well enough to know whether it's supposed to be this way. Thread 5 (Thread 0x1903 of process 14075): #2 0x00000001000e7e21 in PyThread_acquire_lock (lock=0xed3ed500, waitflag=) at Python/thread_pthread.h:429 Thread 4 (Thread 0x1803 of process 14075): #2 0x00000001000e7e21 in PyThread_acquire_lock (lock=0xed3ed300, waitflag=) at Python/thread_pthread.h:429 Thread 3 (Thread 0x1703 of process 14075): #2 0x00000001000e7e21 in PyThread_acquire_lock (lock=0xed3ed600, waitflag=) at Python/thread_pthread.h:429 Thread 2 (Thread 0x1603 of process 14075): #2 0x00000001000e7e21 in PyThread_acquire_lock (lock=0xed3ed300, waitflag=) at Python/thread_pthread.h:429 Thread 1 (Thread 0x1503 of process 14075): #2 0x00000001000e7e21 in PyThread_acquire_lock (lock=0x0, waitflag=) at Python/thread_pthread.h:429
msg260278 - (view) Author: Mark Dickinson (mark.dickinson) * (Python committer) Date: 2016-02-14 12:17
> threads 2 and 5 Sorry, that should have said "threads 2 and 4".
msg260279 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2016-02-14 12:22
lock=0x0 looks suspicious.
msg260281 - (view) Author: Mark Dickinson (mark.dickinson) * (Python committer) Date: 2016-02-14 12:43
> lock=0x0 looks suspicious. Agreed. I think that *could* possibly be a consequence of the earlier multiple GILs, though I haven't figured out exactly how yet.
msg260283 - (view) Author: Mark Dickinson (mark.dickinson) * (Python committer) Date: 2016-02-14 14:32
Update: I *have* been able to reproduce in a debug build. Here's the stack trace.
msg260372 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2016-02-17 01:26
debug-build-stacktrace.txt: * Thread 1 is waiting on threading_hang.py:21: thread.join() * Thread 2-6 (5 threads) are waiting for the GIL The thread 1 is not supposed to hold the GIL: thread.join() is implemented by lock_PyThread_acquire_lock() at Modules/threadmodule.c:52: Py_BEGIN_ALLOW_THREADS i = PyThread_acquire_lock(self->lock_lock, i); Py_END_ALLOW_THREADS Note: I backported enhancements of python-gdb.py from Python 3, but I'm not sure that "Waiting for the GIL" line is reliable :-/ I recall vaguely a technical issue specific to Python 2 to check if a thread is waiting on the GIL or not.
msg260374 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2016-02-17 01:33
> I've also not yet managed to reproduce on Python 3, but I haven't tried that hard. I suspect it's a Python 2-only problem, though. Python 3.2 got a new GIL and many bugs related to the GIL were fixed in Python 3 too. The "workaround" is to use Python 3, until this Python 2 issue is fixed (if we are able to fix it!).
msg260375 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2016-02-17 01:46
>> lock=0x0 looks suspicious. > Agreed. I think that *could* possibly be a consequence of the earlier multiple GILs, though I haven't figured out exactly how yet. Don't trust gdb output when Python is compiled in release mode (with -O3). gdb output is only reliable when Python is compiled with -O0 (./configure --with-pydebug is enough for Python 2.7).
msg260378 - (view) Author: Mark Dickinson (mark.dickinson) * (Python committer) Date: 2016-02-17 07:14
> I'm not sure that "Waiting for the GIL" line is reliable Right; I was interpreting that line more as "I don't have the GIL at the moment", which I think applies to all threads, including the main thread. It's pretty clear that threads 2-5 really are waiting for the GIL, and (as you point out), for thread 1 to be where it is, it must have also released the GIL by that point. Agreed about the gdb results for the -O3 build. So it looks as though something's gone wrong with the "I've just released the GIL; now someone else can pick it up" signaling mechanism.
msg260386 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2016-02-17 10:40
"Each iteration takes around 60 seconds, and I typically get a deadlock within the first 5 iterations." I ran a modified version of threading_hang.py (I added some debug print) a whole night: it's still running after 118 iterations. Each iteration takes ~5 minutes, I'm using a debug build of Python 2.7 on Linux (Fedora 23).
msg260406 - (view) Author: Mark Dickinson (mark.dickinson) * (Python committer) Date: 2016-02-17 17:29
Thanks for testing! It wouldn't entirely surprise me if this turned out to be an Apple bug somewhere.
msg260407 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2016-02-17 17:39
> Thanks for testing! It wouldn't entirely surprise me if this turned out to be an Apple bug somewhere. I'm ran threading_hang.py on Mac OS X 10.10 Yosemite (darwin 14.5) on Python 2.7 compiled in release mode. After 28 iterations (~4 hours, 9 minutes per iteration), the bug was still not reproduced. It looks like my OS X version is more recent than yours. You should run your test on various OS X versions to try to identify the bug. Since it's likely a race condition, I would not be surprised that the bug is very hard to reproduce and may exist on other platforms, but may be less likely on other platforms or even less likely on some computers for various reasons...
msg260422 - (view) Author: Ned Deily (ned.deily) * (Python committer) Date: 2016-02-18 02:07
I'm not able to test it myself at the moment but you could also try ensuring your 2.7 build is with the most up-to-date ABI for your system: ./configure MACOSX_DEPLOYMENT_TARGET=10.9 2.7 builds do not set this automatically; 3.x builds do.
msg260446 - (view) Author: Mark Dickinson (mark.dickinson) * (Python committer) Date: 2016-02-18 09:39
Thanks, Ned. I tried with: ./configure --with-pydebug MACOSX_DEPLOYMENT_TARGET=10.9 && make I'm still able to reproduce. I plan to find some time this weekend to dig into the source and see if I can narrow down the cause (or at least make the hang easier / quicker to reproduce).
msg260471 - (view) Author: Mark Dickinson (mark.dickinson) * (Python committer) Date: 2016-02-18 17:31
A minor update: I did manage to verify that the GIL is indeed not held by any threads at the moment of the hang, so it does look very much as though a `pthread_cond_signal` signal has gone astray somehow.
msg260742 - (view) Author: Mark Dickinson (mark.dickinson) * (Python committer) Date: 2016-02-23 19:55
I've found OS X 10.10 and OS X 10.11 machines to test on. I'm able to reproduce on 10.10, but not on 10.11. This adds to the plausibility of the "Apple bug" explanation.
msg352002 - (view) Author: Kirill Smelkov (navytux) * Date: 2019-09-11 17:45
Maybe related.
msg352423 - (view) Author: Mark Dickinson (mark.dickinson) * (Python committer) Date: 2019-09-14 09:37
> Maybe related. That looks plausible, but unfortunately I'm still able to reproduce the hang with your PR (commit 9b135c02aa1edab4c99c915c43cd62d988f1f9c1, macOS 10.14.6).
msg352439 - (view) Author: Mark Dickinson (mark.dickinson) * (Python committer) Date: 2019-09-14 16:01
FWIW, I've confirmed again that the exact same script on the same machine seems fine under Python 3.x. Given the imminent demise of Python 2, perhaps this issue is just destined to be an unsolved historical oddity.
msg352459 - (view) Author: Kyle Stanley (aeros) * (Python committer) Date: 2019-09-15 04:06
> FWIW, I've confirmed again that the exact same script on the same machine seems fine under Python 3.x. Given the imminent demise of Python 2, perhaps this issue is just destined to be an unsolved historical oddity. Since it doesn't seem to be occurring across every macOS platform, is specific to 2.7, and isn't proving to be an easy fix, it's probably not worthwhile to invest a significant amount of further time into the issue. Thanks for working on it though, at least we tried. (:
msg352469 - (view) Author: Kirill Smelkov (navytux) * Date: 2019-09-15 09:06
> > Maybe related. > > That looks plausible, but unfortunately I'm still able to reproduce the hang with your PR (commit 9b135c02aa1edab4c99c915c43cd62d988f1f9c1, macOS 10.14.6). Thanks for feedback. Then hereby bug is probably deadlock of another kind.
msg367349 - (view) Author: Zachary Ware (zach.ware) * (Python committer) Date: 2020-04-27 01:56
As 2.7 is now EOL, I'm closing the issue.
History
Date User Action Args
2022-04-11 14:58:27 admin set github: 70548
2020-04-27 01:56:26 zach.ware set status: open -> closednosy: + zach.waremessages: + resolution: out of datestage: resolved
2019-09-15 09:06:46 navytux set messages: +
2019-09-15 04:06:22 aeros set nosy: + aerosmessages: +
2019-09-14 16:01:01 mark.dickinson set messages: +
2019-09-14 09:46:29 mark.dickinson set title: Deadlock in thread.join on Python 2.7/Mac OS X 10.9, 10.10 -> Deadlock in thread.join on Python 2.7/macOS
2019-09-14 09:37:11 mark.dickinson set messages: +
2019-09-11 17:45:33 navytux set nosy: + navytuxmessages: +
2016-02-23 19:55:48 mark.dickinson set title: Deadlock in thread.join on Python 2.7/Mac OS X 10.9 -> Deadlock in thread.join on Python 2.7/Mac OS X 10.9, 10.10
2016-02-23 19:55:39 mark.dickinson set messages: +
2016-02-18 17:31:24 mark.dickinson set messages: +
2016-02-18 09:39:26 mark.dickinson set messages: +
2016-02-18 02:07:19 ned.deily set nosy: + ned.deilymessages: +
2016-02-17 17:39:37 vstinner set messages: +
2016-02-17 17:29:03 mark.dickinson set messages: +
2016-02-17 10:42:59 vstinner set title: Deadlock in thread.join -> Deadlock in thread.join on Python 2.7/Mac OS X 10.9
2016-02-17 10:40:50 vstinner set messages: +
2016-02-17 07:14:03 mark.dickinson set messages: +
2016-02-17 01:46:41 vstinner set messages: +
2016-02-17 01:33:15 vstinner set messages: +
2016-02-17 01:26:35 vstinner set messages: +
2016-02-17 01:19:41 vstinner set nosy: + vstinner
2016-02-14 14:32:14 mark.dickinson set files: + debug-build-stacktrace.txtmessages: +
2016-02-14 12:43:15 mark.dickinson set messages: +
2016-02-14 12:22:01 serhiy.storchaka set messages: +
2016-02-14 12:17:08 mark.dickinson set messages: +
2016-02-14 12:16:13 mark.dickinson set messages: +
2016-02-14 11:57:10 serhiy.storchaka set nosy: + pitrou, serhiy.storchaka
2016-02-14 11:44:47 mark.dickinson set messages: +
2016-02-14 11:39:23 mark.dickinson set files: + threading_hang.pymessages: +
2016-02-14 11:38:44 mark.dickinson create