| msg151556 - (view) |
Author: Christoph Glaubitz (glaubich) |
Date: 2012-01-18 18:00 |
| Starting several threads, each just starting a subprocess.Popen and use communicate cause a deadlock in subprocess. (see attached file) I can only reproduce this with python 2.7.2, not with any other versions. 2.6.5, 2.6.7 and 3.2.2 are working fine. Also 2.4.6 is working, with a problem fixed by http://bugs.python.org/issue1731717. The attached script does not dead lock all the time, but at least every second call. I am on a linux x86_64. The python's are compiled by hand, except of 2.6.5. |
|
|
| msg151621 - (view) |
Author: Charles-François Natali (neologix) *  |
Date: 2012-01-19 13:36 |
| Here's the backtrace: """ #0 0x0000003bfb20c9b1 in sem_wait () from /lib64/libpthread.so.0 #1 0x000000000051a7c3 in PyThread_acquire_lock (lock=0x17db0750, waitflag=1) at Python/thread_pthread.h:321 #2 0x000000000051a9b4 in find_key (key=1, value=0x0) at Python/thread.c:268 #3 0x000000000051abdc in PyThread_get_key_value (key=1) at Python/thread.c:360 #4 0x00000000005025b1 in PyGILState_GetThisThreadState () at Python/pystate.c:598 #5 0x00000000005024f5 in _PyGILState_Reinit () at Python/pystate.c:547 #6 0x0000000000521fc7 in PyOS_AfterFork () at ./Modules/signalmodule.c:979 #7 0x000000000052461d in posix_fork (self=0x0, noargs=0x0) at ./Modules/posixmodule.c:3695 """ It's stuck in _PyGILState_Reinit(), when calling PyGILState_GetThisThreadState(). That's because in 2.7, TLS is emulated (see Python/thread.c), and it uses a global mutex. If this mutex is locked at the time of fork(), then the next call to TLS primitives (even PyGILState_GetThisThreadState()) will deadlock. Now, this particular bug is fixed in 2.7 since #13156, which backed-out _PyGILState_Reinit() because it was only relevant for native TLS implementations. The code is still present in 3.2 and and default, but this problem doesn't affect native TLS implementations. Just to be extra safe, we PyThread_ReInitTLS() - which resets this global mutex on emulated implementations, and is just a no-op on pthread and windows - should be moved earlier in PyOS_AfterFork(), to avoid this type of deadlock (I mean, PyGILState_GetThisThreadState() deadlock after fork() is bad). Patch attached. |
|
|
| msg151623 - (view) |
Author: Antoine Pitrou (pitrou) *  |
Date: 2012-01-19 14:16 |
| I think you forgot to attach the patch :) |
|
|
| msg151630 - (view) |
Author: Charles-François Natali (neologix) *  |
Date: 2012-01-19 14:45 |
| Here's the patch. It's probably possible to add a test for this, however I don't have access to my development machine, so I can't write it now. |
|
|
| msg151683 - (view) |
Author: Martijn van Oosterhout (Martijn.van.Oosterhout) |
Date: 2012-01-20 10:51 |
| What a concidence, we had this problem as well and just today tracked it down, and turns out someone filed this bug yesterday. I note the referenced bug (#13156) says that it's been fixed in the release branch but not yet released. I don't suppose there will be a release of 2.7 in the near future? |
|
|
| msg152162 - (view) |
Author: Charles-François Natali (neologix) *  |
Date: 2012-01-28 14:40 |
| And here's the test. To sum up: - reinit_tls.diff makes sure to call PyThread_ReInitTLS() at the beginning of PyOS_AfterFork(), so that the TLS API is usable within PyOS_AfterFork() (e.g. in _after_fork()). It would be applied to 3.2 and default, even though they are not affected (at least on platform withs a native TLS implementation) - reinit_tls_test.diff would be applied to 2.7, 3.2 and default: it should not trigger a failure on any of the active branches (2.7, 3.2 and default), it's just to prevent potential future regressions. Notes regarding the test: - I replaced subprocess with bare fork(), since it's way cheaper than spawning an entire subprocess - I reduced the number of threads compared to Christoph's reproducer, because I feel uncomfortable forking up to 1024 processes in parallel (even though in practise they won't be that much, because child processes will terminate before the other threads call fork()) - even with 2.7.2, it won't deadlock 100%, since the race window is really short (but having a failure on one of the buildbots once would be enough to detect the problem, should it ever resurface) |
|
|
| msg152445 - (view) |
Author: Charles-François Natali (neologix) *  |
Date: 2012-02-01 22:12 |
| I thought about this a bit more, and I realized that a slight variation of this bug also affects 2.7, and also older versions (i.e. before _PyGILState_Reinit() was introduced), because any code that gets called from PyOS_AfterFork() and uses the TLS API suffers from the same problem. I had the new test deadlock once, in theading._after_fork(): when trying to acquire an (unlocked) lock, it got stuck in Py_END_ALLOW_THREADS, which calls indirectly PyGILState_GetThisThreadState(), which calls find_key(). Although less likely, this bug has been present for a *long* time, it's surprising it had never been noticed before. So the fix - calling PyThread_ReInitTLS() right at the beginning of PyOS_AfterFork() - should also be applied to 2.7. I'll commit it tomorrow. |
|
|
| msg152474 - (view) |
Author: Roundup Robot (python-dev)  |
Date: 2012-02-02 19:39 |
| New changeset c3649173d093 by Charles-François Natali in branch '2.7': Issue #13817: After fork(), reinit the ad-hoc TLS implementation earlier to fix http://hg.python.org/cpython/rev/c3649173d093 New changeset 7b24dd587a7b by Charles-François Natali in branch '3.2': Issue #13817: After fork(), reinit the ad-hoc TLS implementation earlier to fix http://hg.python.org/cpython/rev/7b24dd587a7b New changeset a0100852b6fe by Charles-François Natali in branch 'default': Issue #13817: After fork(), reinit the ad-hoc TLS implementation earlier to fix http://hg.python.org/cpython/rev/a0100852b6fe |
|
|
| msg152480 - (view) |
Author: Charles-François Natali (neologix) *  |
Date: 2012-02-02 21:56 |
| Committed. Christoph, thanks for the report. |
|
|