Issue 11962: Buildbot reliability - Python tracker

Created on 2011-04-30 06:43 by skrah, last changed 2022-04-11 14:57 by admin. This issue is now closed.

Files
File name              Uploaded                 Description
freebsd-amd64-log.txt  skrah, 2011-05-02 18:23
Messages (10)
msg134839 - (view) Author: Stefan Krah (skrah) * (Python committer) Date: 2011-04-30 06:43
The FreeBSD-AMD64 bot exhibits sporadic hanging in unspecific places. FreeBSD is running under kvm in the background. When the hanging occurs, the virtual machine uses 100% CPU and I can't log in via ssh, so I have to kill the kvm process. The fact that the ssh login fails if a user process is misbehaving seems like a FreeBSD/kvm issue to me. However, this problem did not occur when I set up the bot a couple of weeks ago. I've started a series of older revision builds to see if anything recent causes this.
msg134890 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2011-04-30 23:15
> The FreeBSD-AMD64 bot exhibits sporadic hanging in unspecific places.

You can try a shorter regrtest timeout: edit Lib/test/regrtest.py near:

    if hasattr(faulthandler, 'dump_tracebacks_later'):
        timeout = 60*60

(or use the --timeout option of the regrtest.py program).

If you have access to a terminal (using ssh), you can also register a signal to dump the traceback: edit regrtest.py to add

    import signal; faulthandler.register(signal.SIGUSR1, all_threads=True)

after "faulthandler.enable()". Then use "kill -USR1 pid" to dump the traceback.

Or the problem is an infinite loop while dumping the traceback because of a timeout :-D In this case, disable the timeout using the --timeout=0 option of regrtest.py.
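The two suggestions in this message can be combined in a small helper. This is only a sketch, assuming a current Python 3, where the function referred to above as `dump_tracebacks_later` is spelled `faulthandler.dump_traceback_later`. The helper names `install_hang_diagnostics` and `remove_hang_diagnostics` and their parameters are made up for illustration and are not part of regrtest. It does not call `faulthandler.enable()` itself, since regrtest already does that.

```python
import faulthandler
import signal
import sys


def install_hang_diagnostics(log_file=None, timeout=60 * 60):
    """Arm two ways of getting a traceback out of a hung test run.

    Hypothetical helper, not part of regrtest. `log_file` must be a file
    object backed by a real file descriptor; it defaults to sys.stderr.
    """
    if log_file is None:
        log_file = sys.stderr

    # On-demand dump: "kill -USR1 <pid>" writes the traceback of every
    # thread to log_file. POSIX only: register() is not available on Windows.
    faulthandler.register(signal.SIGUSR1, file=log_file, all_threads=True)

    # Watchdog: if the process is still running after `timeout` seconds,
    # dump the tracebacks of all threads once. A timeout of 0 (or less)
    # skips the watchdog, mirroring --timeout=0.
    if timeout > 0:
        faulthandler.dump_traceback_later(timeout, file=log_file)


def remove_hang_diagnostics():
    """Undo install_hang_diagnostics()."""
    faulthandler.cancel_dump_traceback_later()
    faulthandler.unregister(signal.SIGUSR1)
```

With this installed, sending SIGUSR1 from a second terminal shows where each thread is stuck without stopping the process, which is useful when the hang cannot be reproduced on demand.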
msg134901 - (view) Author: Stefan Krah (skrah) * (Python committer) Date: 2011-05-01 06:03
Thanks Victor, I can try some of that. Could this also be a problem with the buildbot software or a networking problem? The Ubuntu PPC bot might have the same issue. Here the tests appear to be finished but the clean doesn't start: http://www.python.org/dev/buildbot/all/builders/PPC%20Ubuntu%203.1/builds/387/steps/test/logs/stdio http://www.python.org/dev/buildbot/all/builders/PPC%20Ubuntu%203.1/builds/387
msg134922 - (view) Author: Ned Deily (ned.deily) * (Python committer) Date: 2011-05-01 19:36
That might be another instance of this: http://thread.gmane.org/gmane.comp.python.devel/123698 You might want to bring this up on python-dev.
msg134997 - (view) Author: Stefan Krah (skrah) * (Python committer) Date: 2011-05-02 18:23
Going through the logs, this indeed looks like a buildbot software issue to me. I attach the logs that correspond to this incident:

http://www.python.org/dev/buildbot/all/builders/AMD64%20FreeBSD%208.2%203.2/builds/85

After

    ...
    2011-04-30 01:10:56+0200 [Broker,client] closing stdin
    2011-04-30 01:10:56+0200 [Broker,client] using PTY: False
    ...

normally you should see:

    ... [-] command finished with signal None, exit code 0, elapsedTime:

But there is nothing until I restarted the bot.
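The pattern described in this message (a command starts but no "command finished" line follows) can be checked mechanically. The following is a rough sketch, not a buildbot tool: it assumes that every command logs one "using PTY:" line when it starts and one "command finished" line when it ends, which is only inferred from the excerpts quoted here. The function name `find_unfinished_command` is hypothetical.

```python
import re

# Timestamp prefix as it appears in the excerpts,
# e.g. "2011-04-30 01:10:56+0200".
TIMESTAMP = re.compile(r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}[+-]\d{4}) ")


def find_unfinished_command(lines):
    """Return the start timestamp of a command that never finished.

    Assumption: each command logs "using PTY:" at start and
    "command finished" at the end. Returns None if the last started
    command has a matching "command finished" line.
    """
    pending = None
    for line in lines:
        match = TIMESTAMP.match(line)
        if match is None:
            continue  # continuation line or unrelated output
        if "using PTY:" in line:
            pending = match.group(1)
        elif "command finished" in line:
            pending = None
    return pending
```

Passing an open log file to the function works because it only iterates over lines; a non-None result gives the time at which the hanging step started.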
msg135084 - (view) Author: Stefan Krah (skrah) * (Python committer) Date: 2011-05-03 22:15
Another instance:

    2011-05-03 20:18:08+0200 [Broker,client] closing stdin
    2011-05-03 20:18:08+0200 [Broker,client] using PTY: False
    2011-05-03 20:20:38+0200 [-] sending app-level keepalive

Again this is missing:

    ... [-] command finished with signal None, exit code 0, elapsedTime:

Also, as we speak the Ubuntu PPC bot is hanging as well:

http://www.python.org/dev/buildbot/all/builders/PPC%20Ubuntu%202.7/builds/386/steps/test/logs/stdio

Antoine, do you have access to the server logs for the relevant times? My bot is on CEST.
msg135085 - (view) Author: Barry A. Warsaw (barry) * (Python committer) Date: 2011-05-03 22:40
My Ubuntu PPC server is having hardware problems. It will just intermittently shut off. I've reset the SMU and the PRAM, vacuumed out the guts, reseated the RAM, pulled any possibly problematic 3rd party boards, and it still crashes. I was watching the syslog and it didn't look like a thermal shutdown, though it acted like that. The only thing I can think of is a power supply problem, so I'm going to see if I can find an inexpensive replacement. In the meantime, this machine will be offline for a couple of weeks at least.
msg135174 - (view) Author: Stefan Krah (skrah) * (Python committer) Date: 2011-05-05 07:10
The FreeBSD bot had these error messages in the log files:

1) kernel: swap_pager: indefinite wait buffer: device
2) Approaching the limit on PV entries, consider increasing either the vm.pmap.shpgperproc or the vm.pmap.pv_entry_max sysctl.

I set up the bot from scratch with these changes:

a) Use a swap partition (2 GB) instead of a swap file (2 GB).
b) Use these sysctls:
   kern.ipc.shm_use_phys=1
   vm.pmap.shpgperproc=4096
   vm.pmap.pv_entry_max=16777216
c) Use a self-compiled Python 2.7 instead of the system Python 2.6.

Let's see how that works out. Error 1) is bad; perhaps FreeBSD does not play well with the qcow2 image format under high load.
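When rebuilding a bot like this it is easy to lose one of the settings. The sketch below compares a block of "name=value" text against the three values listed in this message. The message does not say where or how the sysctls were applied, so the input format (sysctl.conf-style lines with optional # comments) is an assumption, and the names `WANTED`, `parse_settings` and `settings_to_fix` are hypothetical.

```python
# Values taken from the message above.
WANTED = {
    "kern.ipc.shm_use_phys": "1",
    "vm.pmap.shpgperproc": "4096",
    "vm.pmap.pv_entry_max": "16777216",
}


def parse_settings(text):
    """Parse "name=value" lines, ignoring blank lines and # comments."""
    settings = {}
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()
        if "=" not in line:
            continue
        name, value = line.split("=", 1)
        settings[name.strip()] = value.strip()
    return settings


def settings_to_fix(text, wanted=WANTED):
    """Return {name: (current value or None, wanted value)} for mismatches."""
    current = parse_settings(text)
    return {
        name: (current.get(name), value)
        for name, value in wanted.items()
        if current.get(name) != value
    }
```

An empty result means all three values are present as listed; anything else names the setting, what was found, and what was expected.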
msg135175 - (view) Author: Stefan Krah (skrah) * (Python committer) Date: 2011-05-05 07:36
On second thought, I don't want to debug possible qcow2 issues, so I made another change: d) Use raw format for the image.
msg135421 - (view) Author: Stefan Krah (skrah) * (Python committer) Date: 2011-05-07 09:06
I think the FreeBSD bot changes are working out fine. The Ubuntu-PPC issues were unrelated, so I'm closing this.
History
Date User Action Args
2022-04-11 14:57:16 admin set github: 56171
2011-05-07 09:06:55 skrah set status: open -> closed; messages: +; keywords: + buildbot; resolution: fixed; stage: resolved
2011-05-05 07:36:05 skrah set messages: +
2011-05-05 07:10:24 skrah set messages: +
2011-05-03 22:40:58 barry set messages: +
2011-05-03 22:17:09 skrah set nosy: + barry
2011-05-03 22:15:40 skrah set messages: + title: FreeBSD-AMD64 bot sporadic hanging -> Buildbot reliability
2011-05-02 18:23:41 skrah set files: + freebsd-amd64-log.txt; messages: +
2011-05-01 19:36:41 ned.deily set nosy: + ned.deily; messages: +
2011-05-01 06:03:43 skrah set messages: +
2011-04-30 23:15:44 vstinner set nosy: + vstinner; messages: +
2011-04-30 06:43:10 skrah create