Issue 14107: Debian bigmem buildbot hanging in test_bigmem


Created on 2012-02-24 15:17 by nadeem.vawda, last changed 2022-04-11 14:57 by admin. This issue is now closed.

Files
File name                    Uploaded
mem_watchdog_nonblock.diff   neologix, 2012-02-26 14:36
Messages (13)
msg154139 - Author: Nadeem Vawda (nadeem.vawda) (Python committer) Date: 2012-02-24 15:17
On the debian bigmem buildbot, test_bigmem hangs until it gets killed by a timeout:

http://www.python.org/dev/buildbot/all/builders/AMD64%20debian%20bigmem%203.x/builds/134/steps/test/logs/stdio
http://www.python.org/dev/buildbot/all/builders/AMD64%20debian%20bigmem%203.x/builds/131/steps/test/logs/stdio
http://www.python.org/dev/buildbot/all/builders/AMD64%20debian%20bigmem%203.x/builds/129/steps/test/logs/stdio
http://www.python.org/dev/buildbot/all/builders/AMD64%20debian%20bigmem%203.x/builds/128/steps/test/logs/stdio
http://www.python.org/dev/buildbot/all/builders/AMD64%20debian%20bigmem%203.x/builds/123/steps/test/logs/stdio
http://www.python.org/dev/buildbot/all/builders/AMD64%20debian%20bigmem%203.x/builds/85/steps/test/logs/stdio

This has happened on every run I've checked that hasn't crashed before reaching test_bigmem for some other reason (e.g. the recently-fixed zlib bus error), as far back as the buildbot history goes. Issue 5438 might be related.
msg154166 - Author: Roundup Robot (python-dev) (Python triager) Date: 2012-02-24 23:43
New changeset 4afcb25988c4 by Victor Stinner in branch 'default':
Issue #14107: fix bigmem tests on str.capitalize(), str.swapcase() and
http://hg.python.org/cpython/rev/4afcb25988c4
msg154168 - Author: STINNER Victor (vstinner) (Python committer) Date: 2012-02-24 23:53
I ran test_bigmem on my computer. The tests pass, but they are really slow.

----
$ ./python -m test -v -M 8G test_bigmem
(...)
Ran 164 tests in 3228.227s

OK (skipped=44)
1 test OK.
----

One hour, even though 44 tests are skipped. The skipped tests are the ones that use the most memory, so I expect they are even slower. Do we really need tests that use so much memory (and are so slow)?

---
test_capitalize (test.test_bigmem.StrTest) ... skipped 'not enough memory: 24.0G minimum needed'
test_encode_utf32 (test.test_bigmem.StrTest) ... skipped 'not enough memory: 9.0G minimum needed'
test_format (test.test_bigmem.StrTest) ... skipped 'not enough memory: 12.0G minimum needed'
test_repr_large (test.test_bigmem.StrTest) ... skipped 'not enough memory: 10.0G minimum needed'
test_split_large (test.test_bigmem.StrTest) ... skipped 'not enough memory: 20.0G minimum needed'
test_swapcase (test.test_bigmem.StrTest) ... skipped 'not enough memory: 24.0G minimum needed'
test_title (test.test_bigmem.StrTest) ... skipped 'not enough memory: 24.0G minimum needed'
test_translate (test.test_bigmem.StrTest) ... skipped 'not enough memory: 12.0G minimum needed'
test_split_large (test.test_bigmem.BytesTest) ... skipped 'not enough memory: 20.0G minimum needed'
test_compare (test.test_bigmem.TupleTest) ... skipped 'not enough memory: 32.0G minimum needed'
test_concat_large (test.test_bigmem.TupleTest) ... skipped 'not enough memory: 48.0G minimum needed'
test_concat_small (test.test_bigmem.TupleTest) ... skipped 'not enough memory: 24.0G minimum needed'
test_contains (test.test_bigmem.TupleTest) ... skipped 'not enough memory: 16.0G minimum needed'
test_from_2G_generator (test.test_bigmem.TupleTest) ... skipped 'not enough memory: 9.0G minimum needed'
test_from_almost_2G_generator (test.test_bigmem.TupleTest) ... skipped 'not enough memory: 9.0G minimum needed'
test_hash (test.test_bigmem.TupleTest) ... skipped 'not enough memory: 16.0G minimum needed'
test_index_and_slice (test.test_bigmem.TupleTest) ... skipped 'not enough memory: 16.0G minimum needed'
test_repeat_large (test.test_bigmem.TupleTest) ... skipped 'not enough memory: 48.0G minimum needed'
test_repeat_large_2 (test.test_bigmem.TupleTest) ... skipped 'not enough memory: 12.0G minimum needed'
test_repeat_small (test.test_bigmem.TupleTest) ... skipped 'not enough memory: 24.0G minimum needed'
test_repr_large (test.test_bigmem.TupleTest) ... skipped 'not enough memory: 22.0G minimum needed'
test_append (test.test_bigmem.ListTest) ... skipped 'not enough memory: 18.0G minimum needed'
test_compare (test.test_bigmem.ListTest) ... skipped 'not enough memory: 32.0G minimum needed'
test_concat_large (test.test_bigmem.ListTest) ... skipped 'not enough memory: 48.0G minimum needed'
test_concat_small (test.test_bigmem.ListTest) ... skipped 'not enough memory: 24.0G minimum needed'
test_contains (test.test_bigmem.ListTest) ... skipped 'not enough memory: 16.0G minimum needed'
test_count (test.test_bigmem.ListTest) ... skipped 'not enough memory: 16.0G minimum needed'
test_extend_large (test.test_bigmem.ListTest) ... skipped 'not enough memory: 32.0G minimum needed'
test_extend_small (test.test_bigmem.ListTest) ... skipped 'not enough memory: 16.0G minimum needed'
test_hash (test.test_bigmem.ListTest) ... skipped 'not enough memory: 16.0G minimum needed'
test_index (test.test_bigmem.ListTest) ... skipped 'not enough memory: 16.0G minimum needed'
test_index_and_slice (test.test_bigmem.ListTest) ... skipped 'not enough memory: 16.0G minimum needed'
test_inplace_concat_large (test.test_bigmem.ListTest) ... skipped 'not enough memory: 48.0G minimum needed'
test_inplace_concat_small (test.test_bigmem.ListTest) ... skipped 'not enough memory: 24.0G minimum needed'
test_inplace_repeat_large (test.test_bigmem.ListTest) ... skipped 'not enough memory: 32.0G minimum needed'
test_inplace_repeat_small (test.test_bigmem.ListTest) ... skipped 'not enough memory: 16.0G minimum needed'
test_insert (test.test_bigmem.ListTest) ... skipped 'not enough memory: 18.0G minimum needed'
test_pop (test.test_bigmem.ListTest) ... skipped 'not enough memory: 16.0G minimum needed'
test_remove (test.test_bigmem.ListTest) ... skipped 'not enough memory: 16.0G minimum needed'
test_repeat_large (test.test_bigmem.ListTest) ... skipped 'not enough memory: 48.0G minimum needed'
test_repeat_small (test.test_bigmem.ListTest) ... skipped 'not enough memory: 24.0G minimum needed'
test_repr_large (test.test_bigmem.ListTest) ... skipped 'not enough memory: 22.0G minimum needed'
test_reverse (test.test_bigmem.ListTest) ... skipped 'not enough memory: 16.0G minimum needed'
test_sort (test.test_bigmem.ListTest) ... skipped 'not enough memory: 16.0G minimum needed'
---
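[Editor's note: the per-test minimums in the skip messages come from each test's declared memory footprint. As a rough sketch of how such a test is written, using the test.support.bigmemtest decorator of this era (treat the exact constants and signature as approximate):]

"""
import unittest
from test.support import bigmemtest, _2G

class StrTest(unittest.TestCase):
    # memuse is the estimated peak memory use per item; the runner
    # multiplies it by size and skips the test when the -M limit is
    # lower -- hence "not enough memory: 24.0G minimum needed".
    @bigmemtest(size=_2G, memuse=3)
    def test_example(self, size):
        s = '-' * size
        self.assertEqual(len(s), size)
"""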
msg154169 - Author: Nadeem Vawda (nadeem.vawda) (Python committer) Date: 2012-02-25 00:03
> I ran test_bigmem on my computer. The tests pass, but they are really slow.

Is this with or without the fix you just pushed? How much RAM does your system have? I tried running with "-M 4.5G" on my 8GB machine earlier today, and it ate through all of my RAM plus 4GB of swap before the OOM killer terminated it.

> Do we really need tests that use so much memory (and are so slow)?

In general, I think it's important to exercise cases where 32-bit truncation issues and the like might arise, because otherwise they'll lie around undetected until they bite some unsuspecting user at an inopportune moment. But I don't know much about those specific tests, so I can't really offer much of an opinion here.

[Still, 48GB minimum? ೦_೦ ]
msg154170 - Author: STINNER Victor (vstinner) (Python committer) Date: 2012-02-25 00:06
>> I ran test_bigmem on my computer. The tests pass, but they are really slow.
>
> Is this with or without the fix you just pushed? How much RAM does your
> system have? I tried running with "-M 4.5G" on my 8GB machine earlier
> today, and it ate through all of my RAM plus 4GB of swap before the OOM
> killer terminated it.

My PC has 12 GB of RAM and no swap. I ran the test after my commit.

I tried to enable swap, but writing to the swap completely crashes my kernel. My / uses btrfs, and btrfs doesn't seem to support swap files, even with a loopback device.
msg154172 - Author: Nadeem Vawda (nadeem.vawda) (Python committer) Date: 2012-02-25 00:12
> My PC has 12 GB of RAM and no swap. I ran the test after my commit.

That explains it. I ran it earlier (obviously), so it included a bunch of those more-demanding tests, which must be where the OOM killer hit it.
msg154310 - Author: Nadeem Vawda (nadeem.vawda) (Python committer) Date: 2012-02-26 07:32
The buildbot still isn't happy:
http://www.python.org/dev/buildbot/all/builders/AMD64%20debian%20bigmem%203.x/builds/144/steps/test/logs/stdio

Also, I ran the test on my own machine and noticed two tests that are still exceeding their nominal peak memory usage:
- test_encode_utf7: expected 4.8G, actual 7.2G
- test_unicode_repr: expected 4.8G, actual 5.6G
msg154330 - Author: Charles-François Natali (neologix) (Python committer) Date: 2012-02-26 11:55
"""
Thread 0x00002ba588709700:
  File "/var/tmpfs/martin.vonloewis/3.x.loewis-parallel2/build/Lib/test/support.py", line 1168 in consumer
  File "/var/tmpfs/martin.vonloewis/3.x.loewis-parallel2/build/Lib/threading.py", line 682 in run
  File "/var/tmpfs/martin.vonloewis/3.x.loewis-parallel2/build/Lib/threading.py", line 729 in _bootstrap_inner
  File "/var/tmpfs/martin.vonloewis/3.x.loewis-parallel2/build/Lib/threading.py", line 702 in _bootstrap

Current thread 0x00002ba57b2d4260:
  File "/var/tmpfs/martin.vonloewis/3.x.loewis-parallel2/build/Lib/test/support.py", line 1198 in stop
  File "/var/tmpfs/martin.vonloewis/3.x.loewis-parallel2/build/Lib/test/support.py", line 1240 in wrapper
  File "/var/tmpfs/martin.vonloewis/3.x.loewis-parallel2/build/Lib/unittest/case.py", line 385 in _executeTestPart
  File "/var/tmpfs/martin.vonloewis/3.x.loewis-parallel2/build/Lib/unittest/case.py", line 440 in run
  File "/var/tmpfs/martin.vonloewis/3.x.loewis-parallel2/build/Lib/unittest/case.py", line 492 in __call__
  File "/var/tmpfs/martin.vonloewis/3.x.loewis-parallel2/build/Lib/unittest/suite.py", line 105 in run
  File "/var/tmpfs/martin.vonloewis/3.x.loewis-parallel2/build/Lib/unittest/suite.py", line 67 in __call__
  File "/var/tmpfs/martin.vonloewis/3.x.loewis-parallel2/build/Lib/unittest/suite.py", line 105 in run
  File "/var/tmpfs/martin.vonloewis/3.x.loewis-parallel2/build/Lib/unittest/suite.py", line 67 in __call__
  File "/var/tmpfs/martin.vonloewis/3.x.loewis-parallel2/build/Lib/unittest/runner.py", line 168 in run
  File "/var/tmpfs/martin.vonloewis/3.x.loewis-parallel2/build/Lib/test/support.py", line 1369 in _run_suite
  File "/var/tmpfs/martin.vonloewis/3.x.loewis-parallel2/build/Lib/test/support.py", line 1403 in run_unittest
  File "/var/tmpfs/martin.vonloewis/3.x.loewis-parallel2/build/Lib/test/test_bigmem.py", line 1252 in test_main
  File "/var/tmpfs/martin.vonloewis/3.x.loewis-parallel2/build/Lib/test/regrtest.py", line 1221 in runtest_inner
  File "/var/tmpfs/martin.vonloewis/3.x.loewis-parallel2/build/Lib/test/regrtest.py", line 907 in runtest
  File "/var/tmpfs/martin.vonloewis/3.x.loewis-parallel2/build/Lib/test/regrtest.py", line 710 in main
  File "/var/tmpfs/martin.vonloewis/3.x.loewis-parallel2/build/Lib/test/__main__.py", line 13 in <module>
  File "/var/tmpfs/martin.vonloewis/3.x.loewis-parallel2/build/Lib/runpy.py", line 73 in _run_code
  File "/var/tmpfs/martin.vonloewis/3.x.loewis-parallel2/build/Lib/runpy.py", line 160 in _run_module_as_main
"""

There's a problem with the _file_watchdog thread: if the pipe gets full (because the consumer thread doesn't get to run often enough), the watchdog thread will block on the write() to the pipe. Then the main thread tries to stop the watchdog:

"""
static void
cancel_file_watchdog(void)
{
    /* Notify cancellation */
    PyThread_release_lock(watchdog.cancel_event);

    /* Wait for thread to join */
    PyThread_acquire_lock(watchdog.running, 1);
    PyThread_release_lock(watchdog.running);

    /* The main thread should always hold the cancel_event lock */
    PyThread_acquire_lock(watchdog.cancel_event, 1);
}
"""

The `cancel_event` lock is released, but the watchdog thread is stuck on the write(). The only thing that could wake it up is a read() from the consumer thread, but the main thread - the one calling cancel_file_watchdog() - blocks when acquiring the `running` lock: since the GIL is not released, the consumer thread can't run, so it doesn't drain the pipe, and game over...

"""
/* We can't do anything if the consumer is too slow, just bail out */
if (write(watchdog.wfd, (void *) &x, sizeof(x)) < sizeof(x))
    break;
if (write(watchdog.wfd, data, data_len) < data_len)
    break;
"""

AFAICT, this can't happen, because the write end of the pipe is not in non-blocking mode (which would solve this issue).

Otherwise, there are two things I don't understand:

1. IIUC, the goal of the watchdog thread is to collect memory consumption in a timely manner. That's now the case, but since the information is printed by a regular thread, it doesn't bring any improvement (because it can be delayed for arbitrarily long) - or am I missing something?

2. Instead of using a thread and the faulthandler infrastructure to run without the GIL, why not simply use a subprocess? It could parse /proc/<pid>/statm at a regular interval and print stats to stdout (see the sketch below). It would also solve point 1.
msg154335 - Author: Charles-François Natali (neologix) (Python committer) Date: 2012-02-26 12:19
> 1. IIUC, the goal of the watchdog thread is to collect memory
> consumption in a timely manner. That's now the case, but since the
> information is printed by a regular thread, it doesn't bring any
> improvement (because it can be delayed for arbitrarily long) - or am
> I missing something?

Scratch that. It does bring an improvement for post-facto analysis (I thought at first it was used in a pseudo-realtime way).
msg154345 - Author: Antoine Pitrou (pitrou) (Python committer) Date: 2012-02-26 13:31
> """
> /* We can't do anything if the consumer is too slow, just bail out */
> if (write(watchdog.wfd, (void *) &x, sizeof(x)) < sizeof(x))
>     break;
> if (write(watchdog.wfd, data, data_len) < data_len)
>     break;
> """
>
> AFAICT, this can't happen, because the write end of the pipe is not in
> non-blocking mode (which would solve this issue).

I think my original plan was to put it in non-blocking mode, but I must have forgotten in the end.

> 2. Instead of using a thread and the faulthandler infrastructure to run
> without the GIL, why not simply use a subprocess? It could parse
> /proc/<pid>/statm at a regular interval and print stats to stdout.
> It would also solve point 1.

I can't think of any drawback off the top of my head, so that sounds reasonable.
msg154353 - Author: Charles-François Natali (neologix) (Python committer) Date: 2012-02-26 14:36
> I think my original plan was to put it in non-blocking mode, but I
> must have forgotten in the end.

Here's a patch.

> I can't think of any drawback off the top of my head, so that sounds
> reasonable.

I'll try to write up a patch (I have only 1.2GB RAM, so it won't be easy to test :-).
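[Editor's note: the gist of such a fix is to set O_NONBLOCK on the pipe's write end, so a full pipe makes write() fail with EAGAIN instead of blocking forever. The actual patch changes the C watchdog thread; this Python rendering is only illustrative:]

"""
import fcntl
import os

def set_nonblocking(fd):
    # With O_NONBLOCK set, writing to a full pipe raises
    # BlockingIOError (EAGAIN) instead of blocking the writer.
    flags = fcntl.fcntl(fd, fcntl.F_GETFL)
    fcntl.fcntl(fd, fcntl.F_SETFL, flags | os.O_NONBLOCK)

rfd, wfd = os.pipe()
set_nonblocking(wfd)
try:
    while True:
        os.write(wfd, b"x" * 4096)  # fills the pipe eventually...
except BlockingIOError:
    pass  # ...then fails fast; the watchdog can drop the sample and move on
"""

[With the write end non-blocking, a slow consumer makes the watchdog hit its existing "just bail out" branch instead of deadlocking against it.]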
msg154369 - Author: Roundup Robot (python-dev) (Python triager) Date: 2012-02-26 16:28
New changeset 52dc4fcd0d6f by Charles-François Natali in branch 'default':
Issue #14107: test: Fix a deadlock involving the memory watchdog thread.
http://hg.python.org/cpython/rev/52dc4fcd0d6f
msg154439 - Author: Charles-François Natali (neologix) (Python committer) Date: 2012-02-27 08:02
test_bigmem now passes on the buildbot:
http://python.org/dev/buildbot/all/builders/AMD64%20debian%20bigmem%203.x/builds/149/steps/test/logs/stdio
History
Date User Action Args
2022-04-11 14:57:27 admin set github: 58315
2012-02-27 18:01:25 neologix set status: open -> closed; resolution: fixed; stage: needs patch -> resolved
2012-02-27 08:02:24 neologix set messages: +
2012-02-26 16:28:37 python-dev set messages: +
2012-02-26 14:36:49 neologix set files: + mem_watchdog_nonblock.diff; keywords: + patch; messages: +
2012-02-26 13:31:30 pitrou set messages: +
2012-02-26 12:19:33 neologix set messages: +
2012-02-26 11:55:50 neologix set nosy: + neologix; messages: +
2012-02-26 07:32:35 nadeem.vawda set messages: +
2012-02-25 00:12:21 nadeem.vawda set messages: +
2012-02-25 00:06:51 vstinner set messages: +
2012-02-25 00:03:54 nadeem.vawda set messages: +
2012-02-24 23:53:10 vstinner set nosy: + vstinner; messages: +
2012-02-24 23:43:42 python-dev set nosy: + python-dev; messages: +
2012-02-24 15:17:25 nadeem.vawda create