[Python-Dev] Most 3.x buildbots are green again, please don't break them and watch them!

Victor Stinner victor.stinner at gmail.com
Wed Apr 13 07:40:44 EDT 2016

Hi,

Over the last few months, most 3.x buildbots were failing randomly. Some of them were always failing. I spent some time fixing almost all Windows and Linux buildbots. There were a lot of different issues.

So please try not to break the buildbots again, and remember to watch them from time to time:

http://buildbot.python.org/all/waterfall?category=3.x.stable&category=3.x.unstable

In the coming weeks, I will try to backport some fixes to Python 3.5 (if needed) to make these buildbots more stable too.

Python 2.7 buildbots are also in a sad state (e.g. test_marshal segfaults on Windows, see issue #25264). But it's not easy to get a Windows machine with the right compiler to work on Python 2.7 on Windows.

Maybe it's time to move more 3.x buildbots to the "stable" category? http://buildbot.python.org/all/waterfall?category=3.x.stable

By the way, I don't understand why "AMD64 OpenIndiana 3.x" is considered stable, since it has been failing with multiple issues for many months and nobody is working on these failures. I suggest moving this buildbot back to the unstable category.

We have many offline buildbots. What's the status of these buildbots? Should we expect them to come back soon?

Or would it be possible to hide them? It would help to check the status of all buildbots.

Failing buildbots:

AMD64 FreeBSD CURRENT 3.x: http://bugs.python.org/issue26566 -- I installed a fresh FreeBSD CURRENT in a VM and I'm unable to reproduce the failures. Maybe the buildbot slave is outdated and FreeBSD must be upgraded?
AMD64 OpenIndiana 3.x, x86 OpenIndiana 3.x: test_socket failures on sendfile. Sorry, but I'm not really interested in this OS.
PPC64 AIX 3.x: failing tests: test_httplib, test_httpservers, test_socket, test_distutils, test_asyncio, (...); random timeout failure in test_eintr, etc. I don't have access to AIX and I'm not interested in acquiring an AIX license, nor in installing it. I'm not sure that it's useful to have an AIX buildbot when no core developer has access to AIX and nobody is working on AIX failures. Maybe IBM wants to help us to support AIX? (Provide manpower, access to AIX servers, or something like that.)
x86 OpenBSD 3.x: 5 tests failed, test_crypt test_socket test_ssl test_strptime test_time. This OS needs some love ;-)
The 4 ICC buildbots are failing with stack overflows, segfaults, etc. Again, I'm not sure that these buildbots are useful since it looks like we don't support this compiler yet. Or do they help with the work on supporting this compiler? Who is working on ICC support?

FYI I also made some enhancements to regrtest (our test runner for the test suite), mostly to debug failures:

display the duration of tests taking longer than 30 seconds
new timestamp prefix, used to debug buildbot hangs
when parallel tests are interrupted, display progress while waiting for completion
add a timeout to the main process when using -jN: it should help to debug buildbot hangs
new "Run tests in parallel using 3 child processes" or "Run tests sequentially" message, which helps to understand how tests are running. There is the -j1 trap, which has no effect: tests are still run sequentially. By the way, I proposed to really use subprocesses when -j1 is used: http://bugs.python.org/issue25285
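
To make the first two items more concrete, here is a minimal sketch of a progress line with a timestamp prefix that also reports the previous test when it took 30 seconds or more. This is not regrtest's actual code: the helper names (format_duration, progress_line), the constant SLOW_THRESHOLD and the exact output format are assumptions made for illustration only.

```python
import datetime

# Assumption: tests at or above this duration are reported (30 seconds,
# as in the list above). The constant name is hypothetical.
SLOW_THRESHOLD = 30.0


def format_duration(seconds):
    """Format a duration in seconds as '4 min 37 sec' or '42 sec'."""
    minutes, secs = divmod(int(round(seconds)), 60)
    if minutes:
        return "%d min %d sec" % (minutes, secs)
    return "%d sec" % secs


def progress_line(elapsed, index, total, test_name, previous=None):
    """Build one progress line.

    elapsed is the number of seconds since the start of the run, used as
    the timestamp prefix. previous is an optional (name, duration) pair
    describing the test which just finished.
    """
    prefix = str(datetime.timedelta(seconds=int(elapsed)))
    line = "%s [%d/%d] %s" % (prefix, index, total, test_name)
    if previous is not None:
        name, duration = previous
        # Only mention the previous test if it was slow.
        if duration >= SLOW_THRESHOLD:
            line += " -- %s passed in %s" % (name, format_duration(duration))
    return line
```

With such a prefix, the last line printed before a hang shows both which test was running and how long the run had lasted at that point.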

The default timeout changed from 1 hour to 15 min. It's the maximum duration to run a single test file (e.g. test_os.py). On my Linux box, running the whole test suite in parallel (10 child processes for my 4 CPU cores with hyperthreading) with Python compiled in debug mode (slow) takes 4 min 37 sec.

Tell me if the default timeout is too low. It can be configured per buildbot if needed (TESTTIMEOUT env var).
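
As an illustration, the following sketch builds a test command line whose timeout comes from the TESTTIMEOUT environment variable, falling back to 15 minutes. The helper regrtest_command and the constant DEFAULT_TIMEOUT are hypothetical, and this is not how the build scripts actually consume the variable; only the -j and --timeout options of "python -m test" are taken as given.

```python
import os
import sys

# Assumption: 15 minutes expressed in seconds, matching the new default
# mentioned above. The constant name is hypothetical.
DEFAULT_TIMEOUT = 15 * 60


def regrtest_command(jobs, environ=None):
    """Return an argument list running the test suite with a timeout.

    jobs is the number of child processes (-jN). The timeout is read from
    the TESTTIMEOUT variable of environ (os.environ by default).
    """
    if environ is None:
        environ = os.environ
    timeout = int(environ.get("TESTTIMEOUT", DEFAULT_TIMEOUT))
    return [sys.executable, "-m", "test",
            "-j%d" % jobs, "--timeout=%d" % timeout]
```

For example, regrtest_command(4, {"TESTTIMEOUT": "3600"}) gives a command ending with "-j4 --timeout=3600", which could then be passed to subprocess.call().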

By the way, I'm always surprised by the huge difference in the time needed to run a build on the different slaves: from a few minutes to more than 3 hours. The fastest Windows slave takes 28 minutes (it runs tests in parallel using 4 child processes), whereas the 3 others (which run tests sequentially) take between 2 hours and more than 3 hours! Why does running tests on Windows take so long?

Maybe we should make sure that no buildbot runs tests sequentially, because it creates a lot of annoying side effects (even if it sometimes helps to find tricky bugs, sometimes bugs restricted to the tests themselves) and because a lot of tests simply wait a few seconds. So running multiple tests in parallel doesn't burn your CPU, it's just faster. IMHO the risk of random timeout failures is low compared to the speedup.
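
The point about tests which mostly wait can be shown with a small sketch. It is a simplification: regrtest -jN uses child processes, whereas this example uses threads from concurrent.futures to keep the code short, and the "tests" are plain time.sleep() calls. All function names are hypothetical.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def waiting_test(delay):
    """Stand-in for a test which spends its time waiting, not computing."""
    time.sleep(delay)
    return delay


def run_sequentially(delays):
    """Run all fake tests one after the other; return the elapsed time."""
    start = time.monotonic()
    for delay in delays:
        waiting_test(delay)
    return time.monotonic() - start


def run_in_parallel(delays, workers):
    """Run the fake tests using a pool of workers; return the elapsed time."""
    start = time.monotonic()
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # list() forces all results to be collected before the timer stops.
        list(pool.map(waiting_test, delays))
    return time.monotonic() - start
```

Calling run_sequentially([0.2] * 8) takes about the sum of the delays, while run_in_parallel([0.2] * 8, 4) should take roughly a quarter of that, with the CPU mostly idle in both cases.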

The most interesting bug was a deadlock in locale.setlocale() on Windows 7: the bug made the buildbot hang "sometimes" (randomly). Jeremy Kloth identified the bug, and Steve Dower informed us that it's already fixed in Visual Studio 2015 Update 1, so please update VS if you haven't done so yet. Steve added a post-build test to check if the ucrtbase/ucrtbased DLL has the known bug. => http://bugs.python.org/issue26624

Victor
