Unstable tests — Unofficial Python Development (Victor's notes) documentation
The multiprocessing tests leaked a lot of resources. Victor Stinner and others fixed dozens of bugs in these tests.
See also: Enable tracemalloc to get ResourceWarning traceback.
How to write reliable tests¶
Don’t use sleep as synchronization¶
Don’t use a sleep as a synchronization primitive between two threads or two processes. It will fail sooner or later.
- Threads: use threading.Event
- Processes: use a pipe (os.pipe()): write a byte when ready, read to wait
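The two bullet points above can be sketched as follows. This is a minimal illustration, not code taken from the CPython test suite: the function names and the CHILD_CODE string are hypothetical, and the process variant relies on the POSIX-only pass_fds parameter of subprocess.Popen.

```python
import os
import subprocess
import sys
import threading


def wait_for_thread_with_event():
    """Thread case: the worker sets an Event instead of the caller sleeping."""
    ready = threading.Event()
    results = []

    def worker():
        results.append("initialized")  # work that must be done first
        ready.set()                    # signal that the state is ready

    thread = threading.Thread(target=worker)
    thread.start()
    ready.wait()    # blocks until the worker calls ready.set()
    thread.join()
    return results


# Hypothetical child program: it writes one byte when it is ready.
CHILD_CODE = """
import os, sys
fd = int(sys.argv[1])
# ... child initialization would go here ...
os.write(fd, b"x")
os.close(fd)
"""


def wait_for_process_with_pipe():
    """Process case (POSIX only): read one byte to wait for the child."""
    read_fd, write_fd = os.pipe()
    try:
        proc = subprocess.Popen(
            [sys.executable, "-c", CHILD_CODE, str(write_fd)],
            pass_fds=(write_fd,),   # the child inherits the write end
        )
        # The parent keeps only the read end, so that read() returns b""
        # (EOF) if the child exits without writing.
        os.close(write_fd)
        write_fd = -1
        data = os.read(read_fd, 1)  # blocks until the child writes or exits
        proc.wait()
    finally:
        os.close(read_fd)
        if write_fd != -1:
            os.close(write_fd)
    return data
```

In both cases the waiting side blocks exactly as long as needed, so the test is neither slowed down by an oversized sleep nor broken by an undersized one.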
Don’t limit the maximum duration¶
Don’t make a test fail if it takes longer than a specified number of seconds. Example:
t1 = time.monotonic()
func()
t2 = time.monotonic()
self.assertLess(t2 - t1, 60.0)  # cannot happen
Python has buildbot workers which are very slow where “cannot happen” does happen. In most cases, the maximum duration is not a bug in Python and so the test must not fail.
For example, test_time had a test to ensure that time.sleep(0.5) takes less than 0.7 seconds. The test started to fail on slow buildbots where it took 0.8 seconds, so the maximum was extended to 1 second. The test was later modified to no longer check the maximum duration.
Another example: a sleep of 100 ms took 2 seconds on the “AMD64 OpenIndiana 3.x” buildbot: https://bugs.python.org/issue20336
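One way to keep such a timing test useful is to check only the minimum duration. The sketch below is illustrative and is not the actual test_time code: the class name, the helper run_sleep_tests() and the tolerance value are assumptions.

```python
import time
import unittest


class SleepTests(unittest.TestCase):
    def test_sleep_lower_bound_only(self):
        delay = 0.1
        t1 = time.monotonic()
        time.sleep(delay)
        dt = time.monotonic() - t1
        # Only check the lower bound. The 50 ms tolerance (an arbitrary
        # value chosen for this sketch) absorbs clock resolution effects.
        self.assertGreaterEqual(dt, delay - 0.05)
        # Deliberately no assertLess(dt, ...): a slow or heavily loaded
        # machine can take arbitrarily long without Python being at fault.


def run_sleep_tests():
    """Run the test case above and return the unittest result object."""
    suite = unittest.defaultTestLoader.loadTestsFromTestCase(SleepTests)
    return unittest.TextTestRunner(verbosity=0).run(suite)
```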
Debug race conditions¶
Debug test relying on time.sleep() or asyncio.sleep()¶
For example, test_asyncio: test_run_coroutine_threadsafe_with_timeout() has a race condition caused by the await asyncio.sleep(0.05) used in a test.
To reproduce the race condition, just use the smallest possible sleep of 1 nanosecond:
diff --git a/Lib/test/test_asyncio/test_tasks.py b/Lib/test/test_asyncio/test_tasks.py
index dde84b84b1..c94113712a 100644
--- a/Lib/test/test_asyncio/test_tasks.py
+++ b/Lib/test/test_asyncio/test_tasks.py
@@ -3160,7 +3160,7 @@ class RunCoroutineThreadsafeTests(test_utils.TestCase):
     async def add(self, a, b, fail=False, cancel=False):
         """Wait 0.05 second and return a + b."""
-        await asyncio.sleep(0.05)
+        await asyncio.sleep(1e-9)
         if fail:
             raise RuntimeError("Fail!")
         if cancel:
And run the test in a loop until it fails:
./python -m test test_asyncio -m test_run_coroutine_threadsafe_with_timeout -v -F
Debug Dangling process¶
For example, debug test_multiprocessing_spawn which logs:
Warning -- Dangling processes: {<SpawnProcess(QueueManager-1576, stopped)>}
https://bugs.python.org/issue38447
Get cases:
./python -m test test_multiprocessing_spawn --list-cases > cases
Bisect:
./python -m test.bisect_cmd -i cases -o bisect1 -n 5 -N 500 test_multiprocessing_spawn -R 3:3 --fail-env-changed
Debug reap_children() warning¶
For example, test_concurrent_futures logs this kind of warning:
0:27:13 load avg: 4.88 [416/419/1] test_concurrent_futures failed (env changed) (17 min 11 sec) -- running: test_capi (7 min 28 sec), test_gdb (8 min 49 sec), test_asyncio (23 min 23 sec)
beginning 6 repetitions
123456
.Warning -- reap_children() reaped child process 26487
.....
Warning -- multiprocessing.process._dangling was modified by test_concurrent_futures
  Before: set()
  After:  {<weakref at 0x7fdc08f44e30; to 'SpawnProcess' at 0x7fdc0a467c30>}
https://bugs.python.org/issue38448
Run the test in a loop until it fails:
./python -m test test_concurrent_futures --fail-env-changed -F
If it’s not enough, spawn more jobs in parallel, example with 10 processes:
./python -m test test_concurrent_futures --fail-env-changed -F -j10
If it’s not enough, use the previous commands, but also inject some workload. For example, run in a different terminal:
./python -m test -u all -r -F -j4
Hack reap_children() to detect more issues, sleep 100 ms before calling waitpid(WNOHANG):
diff --git a/Lib/test/support/__init__.py b/Lib/test/support/__init__.py
index 0f294c5b0f..d938ae6b16 100644
--- a/Lib/test/support/__init__.py
+++ b/Lib/test/support/__init__.py
@@ -2320,6 +2320,8 @@ def reap_children():
     if not (hasattr(os, 'waitpid') and hasattr(os, 'WNOHANG')):
         return
 
+    time.sleep(0.1)
+
     # Reap all our dead child processes so we don't leave zombies around.
     # These hog resources and might be causing some of the buildbots to die.
     while True:
Untested function which might help, count the number of child processes of a process on Linux: Add support.get_child_processes().
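As an illustration of the idea only, and not the implementation of the linked support.get_child_processes() patch, direct children can be listed on Linux by scanning /proc. The helper name get_child_pids() is hypothetical, and the sketch assumes a mounted Linux /proc filesystem.

```python
import os


def get_child_pids(parent_pid):
    """Return the PIDs whose parent is parent_pid (Linux only, uses /proc)."""
    children = []
    for entry in os.listdir("/proc"):
        if not entry.isdigit():
            continue
        try:
            with open(f"/proc/{entry}/stat", encoding="utf-8",
                      errors="replace") as f:
                stat = f.read()
        except OSError:
            # The process exited between listdir() and open()/read().
            continue
        # Format: "pid (comm) state ppid ...". The comm field may contain
        # spaces or ')' characters, so split after the last ')'.
        fields = stat.rpartition(")")[2].split()
        if len(fields) >= 2 and int(fields[1]) == parent_pid:
            children.append(int(entry))
    return sorted(children)
```

Calling get_child_pids(os.getpid()) before and after a test gives a rough way to spot child processes that the test left behind.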
Coredump in multiprocessing¶
FreeBSD buildbot workers were useful to detect crashes at Python exit, bugs related to dangling threads. It helps to add a random sleep at Python exit, in Modules/main.c.
Multiprocessing issues¶
Open¶
- 2018-07-20: multiprocessing.Pool and ThreadPool leak resources after being deleted
- 2017-07-19: Missing multiprocessing.queues.SimpleQueue.close() method (OPEN).
Fixed, rejected, out of date¶
- 2018-12-05, multiprocessing: test_multiprocessing_fork: test_del_pool() leaks dangling threads and processes on AMD64 FreeBSD CURRENT Shared 3.x
- 2018-07-18: test_multiprocessing_spawn: Dangling processes leaked on AMD64 FreeBSD 10.x Shared 3.x
- 2018-07-03: asyncio: BaseEventLoop.close() shutdowns the executor without waiting causing leak of dangling threads (FIXED in Python 3.9).
- 2018-05-28, test_multiprocessing: test_multiprocessing_fork: dangling threads warning (commit: call Pool.join)
- 2017-07-28: test_multiprocessing_spawn and test_multiprocessing_forkserver leak dangling processes (commit: remove Process.daemon=True, call Process.join)
- 2017-07-24, multiprocessing: multiprocessing.Pool should join “dead” processes (commit)
- 2017-07-09, multiprocessing: multiprocessing.Queue.join_thread() does nothing if created and use in the same process (commit)
- 2017-06-08, multiprocessing: Add close() to multiprocessing.Process
- 2017-05-03: Emit a ResourceWarning in concurrent.futures executor destructors (OUT OF DATE).
- 2017-04-26: Emit ResourceWarning in multiprocessing Queue destructor (REJECTED).
- 2016-04-15, multiprocessing: test_multiprocessing_spawn leaves processes running in background. Add more checks to _test_multiprocessing to detect dangling processes and threads.
- 2015-11-18, multiprocessing: test_multiprocessing_spawn ResourceWarning with -Werror (commit: use closefd=False)
- 2011-08-18: Warning – multiprocessing.process._dangling was modified by test_multiprocessing (commit: test_multiprocessing.py calls the terminate() method of all classes).
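Several of the fixes listed above come down to closing and joining a pool explicitly instead of relying on its destructor. The sketch below shows that pattern with multiprocessing.pool.ThreadPool, chosen here only so that the example stays self-contained. The function name is hypothetical and the code is not taken from any of the listed commits.

```python
import threading
from multiprocessing.pool import ThreadPool


def run_pool_and_clean_up():
    """Use a pool, then close and join it so that no thread is left behind."""
    baseline = threading.active_count()
    pool = ThreadPool(2)
    try:
        results = pool.map(abs, [-1, -2, -3])
    finally:
        pool.close()   # no more tasks will be submitted
        pool.join()    # wait for the worker and handler threads to exit
    # Number of threads still alive compared to before the pool was created.
    leaked = threading.active_count() - baseline
    return results, leaked
```

A test written this way leaves no dangling thread for the test runner to warn about when the test returns.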
Python issues¶
Open issues¶
Search for test_asyncio, multiprocessing tests.
- 2019-06-11: test__xxsubinterpreters fails randomly
Fixed issues¶
- 2018-05-16, socketserver: socketserver: Add an opt-in option to get Python 3.6 behavior on server_close()
- 2017-08-18, support: Make support.threading_cleanup() stricter (big issue with many fixes)
- 2017-08-18, test_logging: test_logging: ResourceWarning: unclosed socket
- 2017-08-18, socketserver: socketserver.ThreadingMixIn leaks running threads after server_close()
- 2017-08-09, socketserver: socketserver.ForkingMixIn.server_close() leaks zombie processes
Rejected, Not a Bug, Out of Date¶
Windows handles¶
Abandoned attempt to hunt for leaks of Windows handles:
- https://github.com/python/cpython/pull/7827 from https://bugs.python.org/issue18174
- https://github.com/python/cpython/pull/7966 from https://bugs.python.org/issue33966
Unlimited recursion¶
Some specific unit tests rely on the exact C stack size and how Python detects stack overflow. These tests are fragile because each platform uses a different stack size and behaves differently on stack overflow. For example, the stack size can depend on whether Python is compiled using PGO or not (it depends on function inlining).
The support.infinite_recursion() context manager reduces the risk of stack overflow. Example of tests using it:
- test_ast
- test_exceptions
- test_isinstance
- test_json
- test_pickle
- test_traceback
- test_tomllib: issue gh-108851
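The idea behind such a context manager can be sketched as follows. This is a simplified, hypothetical stand-in that assumes the helper works by temporarily lowering the Python recursion limit. The real test.support implementation may differ, and the names limited_recursion(), recurse_forever() and check_recursion_is_caught() are invented for the example.

```python
import contextlib
import sys


@contextlib.contextmanager
def limited_recursion(extra_depth=100):
    """Hypothetical stand-in: temporarily lower the recursion limit."""
    # Count the current frames so that the new limit is never below the
    # current depth (sys.setrecursionlimit() would refuse such a limit).
    depth = 0
    frame = sys._getframe()
    while frame is not None:
        depth += 1
        frame = frame.f_back
    original = sys.getrecursionlimit()
    # Never raise the limit above its original value.
    sys.setrecursionlimit(min(original, depth + extra_depth))
    try:
        yield
    finally:
        sys.setrecursionlimit(original)


def recurse_forever():
    return recurse_forever()


def check_recursion_is_caught():
    """Return True if unbounded recursion ends with a RecursionError."""
    with limited_recursion():
        try:
            recurse_forever()
        except RecursionError:
            return True
    return False
```

With a low limit, RecursionError is raised after about a hundred Python frames, long before the C stack could be exhausted on any platform.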
_Py_CheckRecursiveCall() is a portable but not reliable test: basic counter using sys.getrecursionlimit().
MSVC makes it possible to implement PyOS_CheckStack() (USE_STACKCHECK macro is defined) using alloca() and catching the STATUS_STACK_OVERFLOW error. It uses _resetstkoflw() to reset the stack overflow flag.
See also the Py_C_RECURSION_LIMIT constant.
WASI explicitly sets the stack memory in configure.ac:
dnl gh-117645: Set the memory size to 20 MiB, the stack size to 8 MiB,
dnl and move the stack first.
dnl https://github.com/WebAssembly/wasi-libc/issues/233
AS_VAR_APPEND([LDFLAGS_NODIST], [" -z stack-size=8388608 -Wl,--stack-first -Wl,--initial-memory=20971520"])
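The byte counts passed to the linker match the sizes given in the comment. A quick arithmetic check, where the variable names are purely illustrative:

```python
MIB = 1024 * 1024

# Values from the comment in configure.ac, converted to bytes.
stack_size = 8 * MIB        # passed as "-z stack-size=..."
initial_memory = 20 * MIB   # passed as "-Wl,--initial-memory=..."

print(stack_size, initial_memory)
```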
Tests¶
- test_pickle: test_bad_getattr()
- test_marshal: test_recursion_limit()
History¶
- 2019-04-29: macOS no longer specifies the stack size. Previously, it was set to 8 MiB (-Wl,-stack_size,1000000).
- 2018-07-05: test_marshal: “Improve tests for the stack overflow in marshal.loads()”
- 2018-06-04: test_marshal: “Reduces maximum marshal recursion depth on release builds” on Windows
- 2014-11-01: MAX_MARSHAL_STACK_DEPTH set to 1000 instead of 1500 on Windows
- 2013-07-07: Visual Studio project (PCbuild) now uses 4.2 MiB stack, instead of 2 MiB
- 2013-05-30: macOS sets the stack size to 8 MiB
- 2007-08-29: test_marshal: MAX_MARSHAL_STACK_DEPTH set to 1500 instead of 2000 on Windows for debug build
Notes¶
On FreeBSD, the sudo sysctl -w 'kern.corefile=%N.%P.core' command can be used to include the pid in coredump filenames, since 2 processes can crash at the same time.