msg153464 - (view) |
Author: Gregory P. Smith (gregory.p.smith) *  |
Date: 2012-02-16 07:57 |
Using a 32-bit Python 2.6.5 on a Linux system at work we observed the following:

  File "/.../lib/python2.6/tempfile.py", line 349, in mktemp
    name = names.next()
  File "/.../lib/python2.6/tempfile.py", line 134, in next
    letters = [choose(c) for dummy in "123456"]
  File "/.../lib/python2.6/random.py", line 261, in choice
    return seq[int(self.random() * len(seq))]  # raises IndexError if seq is empty
ValueError: cannot convert float NaN to integer

This is rare and hard to reproduce. The hardware appears to be healthy and this was on a server with ECC.

Some searching reveals that other people have hit this in random.choice in Python 2.7 as well: https://bugs.launchpad.net/ubuntu/+source/desktopcouch/+bug/886159

The ubuntu developer seems to think this is related to time.time() returning NaN at some point (I haven't looked into that myself). |
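For readers who want to see the failing expression in isolation, here is a minimal sketch of the index computation from the traceback. The helper name `choice_like_2_6` and the parameter `u` (standing in for the value returned by `self.random()`) are made up for illustration and are not part of the standard library. It only shows that a NaN at that point produces a ValueError from int(); it says nothing about where a NaN could have come from.

```python
def choice_like_2_6(seq, u):
    """Mirror the expression in the traceback; u stands in for self.random()."""
    return seq[int(u * len(seq))]


letters = "abcdefghijklmnopqrstuvwxyz"

# A normal value in [0.0, 1.0) simply selects an element.
ok = choice_like_2_6(letters, 0.5)  # int(13.0) == 13, so this picks 'n'

# A NaN survives the multiplication and makes int() raise ValueError.
try:
    choice_like_2_6(letters, float("nan"))
    nan_error = None
except ValueError as exc:
    nan_error = str(exc)
```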
|
|
msg153473 - (view) |
Author: Mark Dickinson (mark.dickinson) *  |
Date: 2012-02-16 10:43 |
Hmm, this is a little odd. For 2.7 at least, the error message is coming from PyLong_FromDouble in Objects/longobject.c. I can't immediately see how PyLong_FromDouble could be called by the random seeding process. So it seems more likely that the error is really coming from the int() call in the traceback. But now that implies that the random call is returning NaN, which looks unpossible from the code (random_random in Modules/_randommodule.c).

    static PyObject *
    random_random(RandomObject *self)
    {
        unsigned long a=genrand_int32(self)>>5, b=genrand_int32(self)>>6;
        return PyFloat_FromDouble((a*67108864.0+b)*(1.0/9007199254740992.0));
    }

So despite your comments about healthy hardware, my bet's on corrupted memory. :-) |
|
|
msg153474 - (view) |
Author: Mark Dickinson (mark.dickinson) *  |
Date: 2012-02-16 10:57 |
The bugs.launchpad.net URL shows a call to 'entropy.choice'. Any idea what 'entropy' is? Could it be that they're using their own Random subclass, not tied to the Python MT implementation? |
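To illustrate the subclass scenario raised here, the sketch below uses a deliberately broken, hypothetical subclass (`NaNRandom` is invented for this example; nothing is known about what 'entropy' actually is). It applies the same index expression as the 2.6 traceback through a local helper, `choice_2_6`, rather than relying on how any particular Python version implements choice() internally.

```python
import random


class NaNRandom(random.Random):
    """Hypothetical broken generator: random() always returns NaN."""

    def random(self):
        return float("nan")


def choice_2_6(rng, seq):
    """Same index expression as random.py line 261 in the traceback."""
    return seq[int(rng.random() * len(seq))]


def raises_value_error(rng):
    """Return True if choosing from a small sequence raises ValueError."""
    try:
        choice_2_6(rng, "abc")
    except ValueError:
        return True
    return False
```

With the stock generator the helper returns an element as usual; with the broken subclass it fails in the same place as the reported traceback.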
|
|
msg153475 - (view) |
Author: Raymond Hettinger (rhettinger) *  |
Date: 2012-02-16 11:09 |
The hypothesis that time.time() is returning NaN doesn't match the provided traceback. If time.time() had returned NaN, the exception would have happened earlier, on line 113 in random.py:

    long(time.time() * 256)

I'm wondering if the NaN arises in the C code for random():

    random_random(RandomObject *self)
    {
        unsigned long a=genrand_int32(self)>>5, b=genrand_int32(self)>>6;
        return PyFloat_FromDouble((a*67108864.0+b)*(1.0/9007199254740992.0));
    }

Upstream from that, only integers are used, so this would be the earliest a NaN could arise when running the code in choice(): ``return seq[int(self.random() * len(seq))]`` |
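The ordering argument can be sketched as follows. This is a simplified stand-in, not the real random.py: `seed_then_choose` is a hypothetical function that performs the seeding conversion first (using int() in place of Python 2's long()) and the choice expression second, and reports which step raised.

```python
def seed_then_choose(now, u, seq):
    """Return the step that raised ValueError, or "ok" if neither did.

    now: stand-in for time.time(); u: stand-in for self.random().
    """
    stage = "seed"
    try:
        int(now * 256)            # analogue of long(time.time() * 256)
        stage = "choice"
        seq[int(u * len(seq))]    # analogue of the expression in choice()
    except ValueError:
        return stage
    return "ok"


nan = float("nan")
clock_nan = seed_then_choose(nan, 0.5, "abc")    # fails while seeding
random_nan = seed_then_choose(1.0, nan, "abc")   # fails inside choice
healthy = seed_then_choose(1.0, 0.5, "abc")
```

Under this simplification, a NaN clock is caught at the seeding step, so it would not produce a traceback that ends in choice().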
|
|
msg153476 - (view) |
Author: Mark Dickinson (mark.dickinson) *  |
Date: 2012-02-16 11:29 |
> I'm wondering if the NaN arises in the C code for random():

I don't think that's possible. In the second line:

    return PyFloat_FromDouble((a*67108864.0+b)*(1.0/9007199254740992.0));

a and b are already C unsigned longs, so no matter what their value, the result of the expression is well in range for an IEEE 754 double, and on a normal machine there's just no realistic way that this calculation could produce a NaN. PyFloat_FromDouble does no manipulation of the C double, but just stores it directly in the PyFloat object.

I think there are two different things going on here.

(1) The Ubuntu error reporter seems to be using something other than the standard Random class, so all bets are off there without knowing more about what's being used. Chances seem good that whatever random number generator they're using really *is* producing a NaN.

(2) That leaves Greg's report above, where the standard Random class is apparently what's being used. Here I'm baffled---I can't see any realistic mechanism that might produce that traceback. |
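The range argument can be checked with a rough Python port of the arithmetic in random_random. This is a sketch under assumptions: the function name and the parameters `x` and `y` are invented here and stand in for two 32-bit outputs of genrand_int32; the Mersenne Twister itself is not reproduced.

```python
import math


def random_random(x, y):
    """Python port of the C expression; x and y are 32-bit unsigned values."""
    a = (x & 0xFFFFFFFF) >> 5   # top 27 bits
    b = (y & 0xFFFFFFFF) >> 6   # top 26 bits
    return (a * 67108864.0 + b) * (1.0 / 9007199254740992.0)


# Extremes of the 32-bit inputs give the extremes of the result.
lowest = random_random(0, 0)
highest = random_random(0xFFFFFFFF, 0xFFFFFFFF)
```

Since 67108864 is 2**26 and 9007199254740992 is 2**53, the numerator is an integer between 0 and 2**53 - 1, which a double represents exactly, so the result stays in [0.0, 1.0) and is never NaN for any pair of 32-bit inputs.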
|
|
msg153488 - (view) |
Author: Gregory P. Smith (gregory.p.smith) *  |
Date: 2012-02-16 17:29 |
I think my claim that the hardware appears healthy was premature. I misunderstood our initial error report internally on where the code ran and was looking at the wrong host. doh. my bad.

Several more of these have been found in the last week and they all suspiciously ran on the same machine. One of them had a _different_ failure that is an even stronger suggestion of bad hardware:

  File "/.../lib/python2.6/random.py", line 57, in <module>
    NV_MAGICCONST = 4 * _exp(-0.5)/_sqrt(2.0)
ValueError: math domain error

Sorry for the false alarm. |
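To see why that second failure points away from Python code, note that the expression involves only constants. The sketch below evaluates it with the math module and, purely as an illustration, shows that math.sqrt raises ValueError for an out-of-range argument. The helper `sqrt_raises` and the negative input are assumptions made for this example; nothing is known about what value the faulty machine actually computed.

```python
import math

# Same expression as random.py line 57; both arguments are fixed constants.
NV_MAGICCONST = 4 * math.exp(-0.5) / math.sqrt(2.0)


def sqrt_raises(x):
    """Return True if math.sqrt(x) raises ValueError (a domain error)."""
    try:
        math.sqrt(x)
    except ValueError:
        return True
    return False
```

On a working machine the constant evaluates to roughly 1.7155 and `sqrt_raises(2.0)` is False, so a domain error on that line means an intermediate value was not what the source code specifies.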
|
|
msg153521 - (view) |
Author: Raymond Hettinger (rhettinger) *  |
Date: 2012-02-17 01:23 |
Well, at least it was an interesting bug report ;-) |
|
|