Issue 24484: multiprocessing cleanup occasionally throws exception (original) (raw)

Issue24484

Created on 2015-06-21 15:08 by Jorge Herskovic, last changed 2022-04-11 14:58 by admin. This issue is now closed.

Files
File name	Uploaded	Description	Edit
run_finalizers.patch	pitrou,2017-06-13 09:32	review
finalizestress.py	pitrou,2017-06-13 10:03

Pull Requests
URL	Status	Linked	Edit
PR 2159	merged	pitrou,2017-06-13 10:31
PR 2166	merged	pitrou,2017-06-13 15:15
PR 2167	merged	pitrou,2017-06-13 15:23
PR 2168	merged	pitrou,2017-06-13 15:32

Messages (22)
msg245597 - (view)	Author: Jorge Herskovic (Jorge Herskovic)	Date: 2015-06-21 15:08
I'm writing a library on top of multiprocessing. As part of the test suite, I create and destroy dozens of processes repeatedly. About once in 50 runs, the tests complete successfully but the program crashes with: Error in atexit._run_exitfuncs: Traceback (most recent call last): File "/Library/Frameworks/Python.framework/Versions/3.4/lib/python3.4/multiprocessing/util.py", line 286, in _exit_function _run_finalizers(0) File "/Library/Frameworks/Python.framework/Versions/3.4/lib/python3.4/multiprocessing/util.py", line 246, in _run_finalizers items = [x for x in list(_finalizer_registry.items()) if f(x)] RuntimeError: dictionary changed size during iteration (this is on OS X 10.10.3, Python 3.4.3; I've been unable to replicate the behavior on Python 2.7.9, 3.5.0b2, or pypy despite hundreds of runs) The problematic code is available at https://github.com/jherskovic/mpetl and the tests should be run with nosetests
msg245629 - (view)	Author: Jorge Herskovic (Jorge Herskovic)	Date: 2015-06-22 13:47
This happens reliably between 1-5% of the time on my home Mac (a 4.0 GHz i7). My work Mac Pro, a lot slower, doesn't exhibit this behavior on the same 3.4.3 interpreter. Could it be related to a concurrency issue in the interpreter? We're attempting to replicate on other systems.
msg245687 - (view)	Author: Jorge Herskovic (Jorge Herskovic)	Date: 2015-06-23 13:01
More data: I've been unable to replicate this behavior on Ubuntu 15.04 on Python 3.4.3 over ~20 thousand test runs. (Same machine, running in a VM) An overnight repeated run on the original machine on OS X, 3.4.3, official distribution gave an actual frequency of 0.1% (7 crashes in 6980 runs). Approximately 3,300 runs on Python 2.7.9 on the same machine have not resulted in any crashes. I now built 3.4.3 from source using the latest clang and am trying it again, on the same machine, instead of using the official distribution. So far, 0 crashes in approximately 3,300 runs so far. At the moment, then, the problem seems localized to the official 3.4.3 binaries for OS X.
msg245704 - (view)	Author: Ned Deily (ned.deily) *	Date: 2015-06-24 02:14
Can you show the version info from the Python 3.4 where you see the failures? /usr/local/bin/python3.4 -c 'import sys;print(sys.version)'
msg245705 - (view)	Author: Ned Deily (ned.deily) *	Date: 2015-06-24 02:54
Exactly how are you building the Python 3.4 that does not fail for you? FWIW, I am able to reproduce the failure easily on a 10.10.3 system using python.org 64-bit/32-bit 3.4.3, using a current MacPorts source-built 3.4.3 (which is built with clang), and repeatedly using the current head of the 3.4 branch (post-3.4.3) built with clang, both with and without pydebug enabled.
msg245707 - (view)	Author: Jorge Herskovic (Jorge Herskovic)	Date: 2015-06-24 03:27
The failing Python: 3.4.3 (v3.4.3:9b73f1c3e601, Feb 23 2015, 02:52:03) [GCC 4.2.1 (Apple Inc. build 5666) (dot 3)] The non-failing Python: 3.4.3 (default, Jun 23 2015, 06:33:02) [GCC 4.2.1 Compatible Apple LLVM 6.1.0 (clang-602.0.53)] The non-failing python, I just downloaded the tarball and ran ./configure --prefix=/opt/python343, make, sudo make install, and then created a virtualenv using that binary. Given what you're telling me, I'll run some more loops with the "non-failing" python. I'm grateful you've been able to reproduce the problem, FWIW.
msg245713 - (view)	Author: Ned Deily (ned.deily) *	Date: 2015-06-24 03:52
FWIW, building 3.4.3 from source as you described results in failures about 30% of the time. With some of the earlier tests, it failed even more frequently. This is running on a 2.2GHz I7.
msg245731 - (view)	Author: Jorge Herskovic (Jorge Herskovic)	Date: 2015-06-24 11:45
20,000+ iterations of the test suite with my homebuilt 3.4.3 throw no exceptions. Weirder and weirder. Any suggestions? Without any actual knowledge ("unencumbered by the thought process"), a concurrency issue in the interpreter itself is my guess. Especially since I can't reproduce it at all on other versions. I'll see if I can come up with a shorter, self-contained test that can illustrate the issue too.
msg245738 - (view)	Author: Jorge Herskovic (Jorge Herskovic)	Date: 2015-06-24 13:17
Finally able to repro on my old Mac Pro at work (Universal 64/32bit 3.4.3 from python.org, OS X 10.10.3). For some reason I can't quite fathom, there were two exceptions this time. Error in atexit._run_exitfuncs: Traceback (most recent call last): File "/Library/Frameworks/Python.framework/Versions/3.4/lib/python3.4/multiprocessing/util.py", line 286, in _exit_function _run_finalizers(0) File "/Library/Frameworks/Python.framework/Versions/3.4/lib/python3.4/multiprocessing/util.py", line 246, in _run_finalizers items = [x for x in list(_finalizer_registry.items()) if f(x)] RuntimeError: dictionary changed size during iteration Exception ignored in: <Finalize object, dead> Traceback (most recent call last): File "/Library/Frameworks/Python.framework/Versions/3.4/lib/python3.4/multiprocessing/util.py", line 185, in __call__ File "/Library/Frameworks/Python.framework/Versions/3.4/lib/python3.4/multiprocessing/queues.py", line 207, in _finalize_close File "/Library/Frameworks/Python.framework/Versions/3.4/lib/python3.4/threading.py", line 342, in notify TypeError: 'NoneType' object is not callable
msg245739 - (view)	Author: Jorge Herskovic (Jorge Herskovic)	Date: 2015-06-24 13:28
Ok, sorry, one more important thing I omitted in the previous comment. ^^^ The second exception above seems to be related to a new test I added, one that includes a daemonic process which itself relies on daemonic threads. The cleanup of that seems to be FUBARed, and it makes the original exception much more likely to pop up as well.
msg245810 - (view)	Author: Jorge Herskovic (Jorge Herskovic)	Date: 2015-06-25 15:06
Ok, I think I tracked it down to Connections getting cleaned up in a non-orderly fashion, sometimes while _run_finalizers is executing. The following patch to lib/python3.4/multiprocessing/connection.py cures the problem on my machine. @@ -124,10 +124,10 @@ self._handle = handle self._readable = readable self._writable = writable - + self._shutdown = util.Finalize(self, self._finalizer) # XXX should we use util.Finalize instead of a __del__? - - def __del__(self): + + def _finalizer(self): if self._handle is not None: self._close()
msg250137 - (view)	Author: Davin Potts (davin) *	Date: 2015-09-07 23:09
The 'crash' type is reserved for core dump type situations; 'behavior' is more appropriate for the misbehavior your describe.
msg294546 - (view)	Author: Marcin Słowik (Marandil)	Date: 2017-05-26 11:09
I can confirm, I have managed to accidentally reproduce this issue on a custom build of 3.6.0 on RHEL. sys.version: '3.6.0 (default, Jan 18 2017, 11:23:11) [GCC 4.4.7 20120313 (Red Hat 4.4.7-16)]' The issue popped out of nowhere, on a code that was working w/out any hiccups on multiple systems; after an insignificant change was made to an unrelated part of code. I managed to "patch it up" by forcingly closing and joining all Queues before letting the main process to stop. Unfortunately, I cannot share the repro code (it's a) too large, b) closed), but I can confirm the issue was still open in 3.6.0.
msg295876 - (view)	Author: Antoine Pitrou (pitrou) *	Date: 2017-06-13 09:25
The simple answer here is _run_finalizers() is not thread-safe, but finalizers can be registered or cancelled from any thread. I could not write a simple synthetic script to reproduce the issue, but it should nevertheless be fixable.
msg295880 - (view)	Author: Antoine Pitrou (pitrou) *	Date: 2017-06-13 09:32
Marcin, Jorge, Ned, could one of you apply the following simple patch and see if that seems to solve the issue?
msg295886 - (view)	Author: Antoine Pitrou (pitrou) *	Date: 2017-06-13 10:03
Ok, I got to write a reproducer, which fails reliably here.
msg295927 - (view)	Author: Antoine Pitrou (pitrou) *	Date: 2017-06-13 15:11
New changeset 1eb6c0074d17f4fd425cacfdda893d65f5f77f0a by Antoine Pitrou in branch 'master': bpo-24484: Avoid race condition in multiprocessing cleanup (#2159) https://github.com/python/cpython/commit/1eb6c0074d17f4fd425cacfdda893d65f5f77f0a
msg295929 - (view)	Author: Antoine Pitrou (pitrou) *	Date: 2017-06-13 15:32
Python 2.7 doesn't suffer from this bug, so no actual fix is needed. Still, I'll backport the additional test.
msg295931 - (view)	Author: Antoine Pitrou (pitrou) *	Date: 2017-06-13 15:51
New changeset 6fd03459957ee53941183212457bba19f977679f by Antoine Pitrou in branch '3.6': [3.6] bpo-24484: Avoid race condition in multiprocessing cleanup (GH-2159) (#2166) https://github.com/python/cpython/commit/6fd03459957ee53941183212457bba19f977679f
msg295933 - (view)	Author: Antoine Pitrou (pitrou) *	Date: 2017-06-13 15:52
New changeset d09f1674d70dd84c5be37b5fe11bca4913146c6b by Antoine Pitrou in branch '2.7': [2.7] bpo-24484: Avoid race condition in multiprocessing cleanup (GH-2159) (#2168) https://github.com/python/cpython/commit/d09f1674d70dd84c5be37b5fe11bca4913146c6b
msg295934 - (view)	Author: Antoine Pitrou (pitrou) *	Date: 2017-06-13 15:54
New changeset a0ecaab006849ade08fbcf8413fe1e92d9c8d71c by Antoine Pitrou in branch '3.5': [3.5] bpo-24484: Avoid race condition in multiprocessing cleanup (GH-2159) (#2167) https://github.com/python/cpython/commit/a0ecaab006849ade08fbcf8413fe1e92d9c8d71c
msg295935 - (view)	Author: Antoine Pitrou (pitrou) *	Date: 2017-06-13 15:54
This should be fixed now. Please reopen if not.

History
Date	User	Action	Args
2022-04-11 14:58:18	admin	set	github: 68672
2017-06-13 15:54:48	pitrou	set	status: open -> closedresolution: fixedmessages: + stage: commit review -> resolved
2017-06-13 15:54:12	pitrou	set	messages: +
2017-06-13 15:52:31	pitrou	set	messages: +
2017-06-13 15:51:30	pitrou	set	messages: +
2017-06-13 15:32:48	pitrou	set	stage: needs patch -> commit reviewmessages: + versions: - Python 2.7
2017-06-13 15:32:20	pitrou	set	pull_requests: + <pull%5Frequest2219>
2017-06-13 15:23:00	pitrou	set	pull_requests: + <pull%5Frequest2218>
2017-06-13 15:15:15	pitrou	set	pull_requests: + <pull%5Frequest2217>
2017-06-13 15:11:25	pitrou	set	messages: +
2017-06-13 10:31:20	pitrou	set	pull_requests: + <pull%5Frequest2210>
2017-06-13 10:03:26	pitrou	set	files: + finalizestress.pymessages: +
2017-06-13 09:32:05	pitrou	set	files: + run_finalizers.patchkeywords: + patchmessages: +
2017-06-13 09:25:11	pitrou	set	versions: + Python 2.7, Python 3.5, Python 3.6, Python 3.7, - Python 3.4nosy: + pitroumessages: + stage: needs patch
2017-05-26 11:09:31	Marandil	set	nosy: + Marandilmessages: +
2015-09-07 23:09:25	davin	set	type: crash -> behaviormessages: +
2015-06-25 15:06:21	Jorge Herskovic	set	messages: +
2015-06-24 13:28:58	Jorge Herskovic	set	messages: +
2015-06-24 13:17:03	Jorge Herskovic	set	messages: +
2015-06-24 11:45:46	Jorge Herskovic	set	messages: +
2015-06-24 03:52:42	ned.deily	set	messages: +
2015-06-24 03:27:28	Jorge Herskovic	set	messages: +
2015-06-24 02:54:33	ned.deily	set	messages: +
2015-06-24 02:14:05	ned.deily	set	nosy: + ned.deilymessages: +
2015-06-23 13:01:52	Jorge Herskovic	set	messages: +
2015-06-22 13:47:57	Jorge Herskovic	set	messages: +
2015-06-21 22:08:37	ned.deily	set	nosy: + sbt, davin
2015-06-21 15:08:26	Jorge Herskovic	create