[Python-Dev] Green buildbot failure.
David Bolen db3l.net at gmail.com
Mon Aug 12 00:49:45 CEST 2013
- Previous message: [Python-Dev] Green buildbot failure.
- Next message: [Python-Dev] Reaping threads and subprocesses
- Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]
Victor Stinner <victor.stinner at gmail.com> writes:
> test.regrtest uses faulthandler.dump_traceback_later() to stop the test after a timeout if the --timeout command line option is used.
The slave doesn't actually control the test parameters, which come from build/Tools/buildbot/test.bat (which runs build/PCBuild/rt.bat) plus anything sent from the master. But no, it doesn't look like that flow is currently using --timeout, so the main timeout in place is the one applied by the buildbot slave itself (currently 3900s, and based on output activity by the process under test).
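To illustrate what a timeout "based on output activity" means, here is a minimal sketch in Python. It is not buildbot's actual code; the function name run_with_inactivity_timeout and its return shape are made up for the example. The clock restarts on every line of output, and the child is killed only when it has been silent for the whole interval.

```python
import queue
import subprocess
import sys
import threading


def run_with_inactivity_timeout(cmd, idle_timeout):
    """Run cmd and kill it if it is silent for idle_timeout seconds.

    Hypothetical helper for illustration only (not buildbot's code).
    Returns (returncode, output_lines, timed_out).
    """
    proc = subprocess.Popen(
        cmd,
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,
        text=True,
    )
    lines = queue.Queue()

    def pump():
        # Forward each output line to the main thread; None marks EOF.
        for line in proc.stdout:
            lines.put(line)
        lines.put(None)

    threading.Thread(target=pump, daemon=True).start()

    output = []
    timed_out = False
    while True:
        try:
            # The timeout restarts on every line, so only silence counts.
            line = lines.get(timeout=idle_timeout)
        except queue.Empty:
            timed_out = True
            proc.kill()
            break
        if line is None:
            break
        output.append(line)

    proc.wait()
    return proc.returncode, output, timed_out


if __name__ == "__main__":
    # 3900 matches the slave timeout mentioned above.
    print(run_with_inactivity_timeout([sys.executable, "-c", "print('ok')"], 3900))
```

Note that a timeout like this lives in the supervising process, so it is lost if that supervisor goes away or loses track of the child.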
Windows buildbots also have an additional "kill" path where the build scripts build and execute a separate kill_python_d executable (in PCBuild) to kill off any python_d process. It does have some sequencing issues (it runs during the build stage rather than clean) but no matter where it is used, being part of the build sequence risks it being skipped if the master/slave connection breaks mid-test.
For some additional background, see email threads:
http://mail.python.org/pipermail/python-dev/2010-November/105585.html
http://mail.python.org/pipermail/python-dev/2010-December/106510.html
http://mail.python.org/pipermail/python-dev/2011-January/107776.html
Anyway, the termination in this particular case is completely separate from buildbot processing. It's a small script that runs constantly in the background, combining pslist and pskill from Sysinternals (pskill proved always able to kill the processes) to look for old python_d processes and kill them.
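The selection logic of such a watchdog can be sketched roughly as follows. This is not the actual script. The ProcInfo record, the two-hour MAX_AGE_SECONDS threshold and the function names are all assumptions for illustration, and the step that parses pslist output is deliberately left out since its format isn't described here. The only external detail relied on is that pskill accepts a process ID as its argument.

```python
import time
from typing import Iterable, List, NamedTuple


class ProcInfo(NamedTuple):
    """Hypothetical record for one row of a process listing."""
    pid: int
    name: str
    started: float  # start time, seconds since the epoch


# Assumed threshold; the real script's cutoff is not stated.
MAX_AGE_SECONDS = 2 * 60 * 60

TARGET_NAMES = ("python_d", "python_d.exe")


def find_stale(procs: Iterable[ProcInfo], now: float,
               max_age: float = MAX_AGE_SECONDS) -> List[int]:
    """Return PIDs of python_d processes older than max_age seconds."""
    return [
        p.pid
        for p in procs
        if p.name.lower() in TARGET_NAMES and now - p.started > max_age
    ]


def kill_commands(pids: Iterable[int]) -> List[List[str]]:
    """Build one pskill command line per PID (not executed here)."""
    return [["pskill", str(pid)] for pid in pids]


if __name__ == "__main__":
    now = time.time()
    sample = [
        ProcInfo(101, "python_d.exe", now - 3 * 60 * 60),  # old: selected
        ProcInfo(102, "python_d.exe", now - 60),           # recent: kept
        ProcInfo(103, "python.exe", now - 3 * 60 * 60),    # other name: kept
    ]
    for cmd in kill_commands(find_stale(sample, now)):
        print(" ".join(cmd))
```

A real watchdog would run this in a loop, feed it from an actual process listing, and pass each command to something like subprocess.run.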
My Windows buildbots have three additional layers of termination handling (beyond the standard buildbot timeout and kill_python in the test itself):
- Modification to buildbot slave code to prevent Windows process and file dialogs.
- AutoIt script in the background to acknowledge C RTL dialogs that the prior step doesn't block. (There have been past discussions about having Python itself disable RTL dialogs in test builds.)
- The external watchdog script as a fail-safe.
The first two cases will definitely be recognized as test failures since, although the dialogs are suppressed or acknowledged, the triggering code still receives a failure result.
The purpose of the watchdog script was to handle cases encountered for which the normal termination processing (buildbot or python itself) simply didn't seem to work. The buildbot slave/master thought the test ended or aborted, so started new tests, but a process remained stuck in memory from the prior test. The frequency of occurrence varied over time, but during some periods was a major pain in the neck adversely affecting buildbot stability.
Not sure if faulthandler's approach to process termination would have more luck, or if it would even run if, for example, the process was stuck in the RTL or at the Win32 layer.
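For reference, a minimal sketch of the faulthandler mechanism under discussion, using the documented faulthandler.dump_traceback_later() and faulthandler.cancel_dump_traceback_later() calls. The wrapper name run_with_watchdog is made up, and this is not how regrtest itself is written. The point is that the watchdog lives inside the process under test: with exit=True it dumps the tracebacks and then calls _exit(1) from its own thread.

```python
import faulthandler
import sys


def run_with_watchdog(func, timeout, file=None):
    """Call func(); if it runs longer than timeout seconds, dump all
    thread tracebacks to file and terminate the process.

    Hypothetical wrapper for illustration. file must be a real file
    with a fileno(), which is what faulthandler requires.
    """
    if file is None:
        file = sys.stderr
    # Arm the in-process watchdog thread.
    faulthandler.dump_traceback_later(timeout, exit=True, file=file)
    try:
        return func()
    finally:
        # Disarm it once func() has returned or raised in time.
        faulthandler.cancel_dump_traceback_later()


if __name__ == "__main__":
    print(run_with_watchdog(lambda: sum(range(10)), timeout=60))
```

Since everything here happens within the same process, it says nothing about whether the exit would actually succeed for the kind of stuck process the external watchdog was catching.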
I'd certainly be willing to retire the watchdog scripts (as long as I don't just end up firefighting stuck processes again), but I suspect the first challenge would be to figure out how to simulate an appropriately stuck process that would have required the watchdog script previously, given that it was never really obvious why they were hung.
-- David