Issue 19386: selectors test_interrupted_retry is flaky (original) (raw)

Created on 2013-10-25 08:06 by christian.heimes, last changed 2022-04-11 14:57 by admin. This issue is now closed.

Messages (13)
msg201225 - (view) Author: Christian Heimes (christian.heimes) * (Python committer) Date: 2013-10-25 08:06
The test seems to be unstable on some build bots: http://buildbot.python.org/all/builders/x86%20Gentoo%20Non-Debug%203.x/builds/5271/steps/test/logs/stdio ====================================================================== FAIL: test_interrupted_retry (test.test_selectors.EpollSelectorTestCase) ---------------------------------------------------------------------- Traceback (most recent call last): File "/var/lib/buildslave/3.x.murray-gentoo-wide/build/Lib/test/test_selectors.py", line 299, in test_interrupted_retry self.assertLess(time() - t, 2.5) AssertionError: 105.86884237639606 not less than 2.5 ---------------------------------------------------------------------- Ran 58 tests in 127.113s
msg201237 - (view) Author: Charles-François Natali (neologix) * (Python committer) Date: 2013-10-25 13:39
This failure doesn't make sense: whether SIGALRM is delivered or not, epoll() should time out after 2 seconds, not 105s. What's more, here's another test_subprocess failure: """ ====================================================================== ERROR: test_wait_timeout (test.test_subprocess.ProcessTestCase) ---------------------------------------------------------------------- Traceback (most recent call last): File "/var/lib/buildslave/3.x.murray-gentoo-wide/build/Lib/test/test_subprocess.py", line 993, in test_wait_timeout self.assertEqual(p.wait(timeout=3), 0) File "/var/lib/buildslave/3.x.murray-gentoo-wide/build/Lib/subprocess.py", line 1514, in wait raise TimeoutExpired(self.args, timeout) subprocess.TimeoutExpired: Command '['/var/lib/buildslave/3.x.murray-gentoo-wide/build/python', '-c', 'import time; time.sleep(0.3)']' timed out after 3 seconds """ Even more strange, a test_asyncio failure: """ ====================================================================== FAIL: test_interrupted_retry (test.test_selectors.EpollSelectorTestCase) ---------------------------------------------------------------------- Traceback (most recent call last): File "/var/lib/buildslave/3.x.murray-gentoo-wide/build/Lib/test/test_selectors.py", line 299, in test_interrupted_retry self.assertLess(time() - t, 2.5) AssertionError: 105.86884237639606 not less than 2.5 """ asyncio uses selectors as backend, so let's admit the test fails, but what's really surprising is that the timeout is *exactly* the same, up to the last digit. Something fishy is going on: David, did something change recently on that buildbot?
msg201238 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2013-10-25 14:05
Flaky VM clock perhaps?
msg201240 - (view) Author: Charles-François Natali (neologix) * (Python committer) Date: 2013-10-25 14:44
> Flaky VM clock perhaps? Yeah, thought about that, but here we see *exactly* the same value for two time intervals (using time.monotonic()), computed over different test suite runs. I'd expected something more erratic...
msg201241 - (view) Author: R. David Murray (r.david.murray) * (Python committer) Date: 2013-10-25 14:49
The only change on the buildbot for some months now was increasing the size of the in-memory /tmp filesystem, which I did yesterday. That also involved a reboot, though, and it is conceivable that there were software changes made since the last reboot and before this one. I don't *think* there were, but I can't promise it. No kernel changes, though, for sure.
msg201242 - (view) Author: Charles-François Natali (neologix) * (Python committer) Date: 2013-10-25 15:22
Could you try to run the test in a loop under strace ("strace -ttTf") ?
msg201678 - (view) Author: Charles-François Natali (neologix) * (Python committer) Date: 2013-10-29 21:35
Here's another failure: """ ====================================================================== FAIL: test_interpreter_shutdown (test.test_concurrent_futures.ProcessPoolShutdownTest) ---------------------------------------------------------------------- Traceback (most recent call last): File "/var/lib/buildslave/3.x.murray-gentoo-wide/build/Lib/test/test_concurrent_futures.py", line 77, in tearDown self.assertLess(dt, 60, "synchronization issue: test lasted too long") AssertionError: 108.88635897636414 not less than 60 : synchronization issue: test lasted too long """ Many timeout-related tests are failing randomly on this buildbot, so I assume that's really a buildbot issue (is it a VM?).
msg201679 - (view) Author: R. David Murray (r.david.murray) * (Python committer) Date: 2013-10-29 21:45
Yes, it is a linux vserver VM. I would think we would want tests to not fail in VMs, though, in this day and age.
msg201681 - (view) Author: R. David Murray (r.david.murray) * (Python committer) Date: 2013-10-29 21:47
Also, unless people haven't been reporting other errors, these buildbots have been stable up until recently, with the exception of the multiprocessing hangs that plague all the buildbots as far as I know. (My impression could be mistaken, though.)
msg201682 - (view) Author: Charles-François Natali (neologix) * (Python committer) Date: 2013-10-29 21:55
> Also, unless people haven't been reporting other errors, these buildbots have been stable up until recently Well, the fact that several tests - which were working just fine - started failing randomly recently (like test_concurrent_futures, test_subprocess) is a strong hint that the culprit is the buildbot (I mean, the test_selectors failure just calls epoll_wait(): there is a known overflow for epoll_wait() timeout, but for large timeout values, not 2s). This wouldn't be the first time I see clock issues on a virtual machine...
msg201685 - (view) Author: R. David Murray (r.david.murray) * (Python committer) Date: 2013-10-29 22:13
What could cause a VM clock problem on a machine that hasn't been modified, but the VMs have been rebooted recently (ie: it isn't stale state inside the VM itself). Anything you can suggest I look at?
msg201687 - (view) Author: R. David Murray (r.david.murray) * (Python committer) Date: 2013-10-29 22:20
I'm going to reboot the host just to see if that makes any difference. It's been up for 105 days.
msg203652 - (view) Author: R. David Murray (r.david.murray) * (Python committer) Date: 2013-11-21 16:21
Looking at the current buildbot history I don't see these errors any more. I'm not sure how frequent they were, but I think I'll mark this as resolved and we can reopen it if we see it again.
History
Date User Action Args
2022-04-11 14:57:52 admin set github: 63585
2013-11-21 16:21:32 r.david.murray set status: open -> closedresolution: not a bugmessages: + stage: needs patch -> resolved
2013-10-29 22:20:47 r.david.murray set messages: +
2013-10-29 22:13:17 r.david.murray set messages: +
2013-10-29 21:55:12 neologix set messages: +
2013-10-29 21:47:42 r.david.murray set messages: +
2013-10-29 21:45:19 r.david.murray set messages: +
2013-10-29 21:35:56 neologix set messages: +
2013-10-25 15:22:05 neologix set messages: +
2013-10-25 14:49:15 r.david.murray set messages: +
2013-10-25 14:44:48 neologix set messages: +
2013-10-25 14:05:25 pitrou set nosy: + pitroumessages: +
2013-10-25 13:39:51 neologix set nosy: + r.david.murraymessages: +
2013-10-25 08:06:31 christian.heimes create