Issue 1335972: Fix for int(string, base) wrong answers (take 2) (original) (raw)

Created on 2005-10-24 05:33 by alanmcintyre, last changed 2022-04-11 14:56 by admin. This issue is now closed.

Files
File name Uploaded Description Edit
python-mystrtoul3.tgz alanmcintyre,2005-10-24 05:33 Fix for int(string, base) errors and added tests
python-mystrtoul4.tgz alanmcintyre,2005-10-24 15:26 Corrected patch to fix problem with literals
python-mystrtoul5.tgz alanmcintyre,2005-12-21 04:20 Cleaned up digit lookup vector & added tests for base 10 and base 2**x
Messages (8)
msg48895 - (view) Author: Alan McIntyre (alanmcintyre) * (Python committer) Date: 2005-10-24 05:33
This incorporates patch #1334979, adds test cases for all the cases listed in bug #1334662, and adds test cases for evaluation of 2**32+1. There seem to be some minor speed improvements (simplistic stats shown below). Some simple performance test scripts have been included in the attached file as well. A lookup table was added for the maximum number of digits that can never overflow on a 32-bit ulong for each base. Overflow is only checked when this limit is exceeded by 1, and once the input string is determined to be too long (2 over the limit), the evaluation is halted and an overflow indication is returned. This appears to help reduce the evaluation time for very long strings (no time is wasted trying to evaluate all of it into a 32-bit ulong). Evaluation of each character has also been replaced by a lookup table. I'm not certain of the amount of speed benefit obtained from this; I added it early on and haven't had time to go back and test. It may be that it's not worth the extra static table. Baseline Python from CVS: alan@tarantula:~/python/dist/src# ./python -m timeit 'int("9")' 100000 loops, best of 3: 4 usec per loop alan@tarantula:~/python/dist/src# ./python -m timeit 'int("999999999")' 100000 loops, best of 3: 5.49 usec per loop alan@tarantula:~/python/dist/src# ./python -m timeit 'int("999999999999")' 100000 loops, best of 3: 11.8 usec per loop alan@tarantula:~/python/dist/src# ./python -m timeit 'int("999999999999999")' 100000 loops, best of 3: 13.4 usec per loop alan@tarantula:~/python/dist/src# ./python -m timeit 'int("1"*600)' 1000 loops, best of 3: 997 usec per loop Modified: alan@tarantula:~/python_testint/dist/src# ./python -m timeit 'int("9")' 100000 loops, best of 3: 3.63 usec per loop alan@tarantula:~/python_testint/dist/src# ./python -m timeit 'int("999999999")' 100000 loops, best of 3: 3.93 usec per loop alan@tarantula:~/python_testint/dist/src# ./python -m timeit 'int("999999999999")' 100000 loops, best of 3: 9.79 usec per loop alan@tarantula:~/python_testint/dist/src# ./python -m timeit 'int("999999999999999")' 100000 loops, best of 3: 11 usec per loop alan@tarantula:~/python_testint/dist/src# ./python -m timeit 'int("1"*600)' 1000 loops, best of 3: 905 usec per loop 10.2% faster for 1-digit int 39.7% faster for 9-digit int 20.5% faster for 12-digit int 21.8% faster for 15-digit int 10.2% faster for 600-digit int Test program that takes 750k ints from [0, 2**32) through stdin: Baseline: 8.114 sec (best of 5 consecutive runs) Modified: 6.774 sec (best of 5 consecutive runs) 19.8% faster NOTE: This patch causes new errors in test_array and test_compile, but it seems that these *should* be failing given the input string for long(), unless I'm missing something: ====================================================================== ERROR: test_repr (__main__.FloatTest) ---------------------------------------------------------------------- Traceback (most recent call last): File "Lib/test/test_array.py", line 187, in test_repr self.assertEqual(a, eval(repr(a), {"array": array.array})) ValueError: invalid literal for long(): 10000000000.0 ====================================================================== ERROR: test_repr (__main__.DoubleTest) ---------------------------------------------------------------------- Traceback (most recent call last): File "Lib/test/test_array.py", line 187, in test_repr self.assertEqual(a, eval(repr(a), {"array": array.array})) ValueError: invalid literal for long(): 10000000000.0 ---------------------------------------------------------------------- test test_compile crashed -- exceptions.ValueError: invalid literal for long(): 90000000000000.
msg48896 - (view) Author: Alan McIntyre (alanmcintyre) * (Python committer) Date: 2005-10-24 05:41
Logged In: YES user_id=1115903 I forgot to add that these results were obtained on a PIIIm 833MHz running Linux 2.4.2, GCC 3.2.2, with the Python 2.5a0 CVS source from about 8pm EST Oct 23, 2005.
msg48897 - (view) Author: funny_falcon (funny_falcon) Date: 2005-10-24 07:07
Logged In: YES user_id=1290388 Instead of: overflowed: /* spool through remaining characters */ while ((c = Py_CHARMASK(*str)) != '\0') str ++; Shoold be while ((c = Py_CHARMASK(*str)) != '\0') { c = digitlookup[c]; if (c < 0 | c >= base) /* non-"digit" character */ break; str++; } And why not static int digitlookup[] = { 37, 37, 37 ...... }; and if (c >= base) break;
msg48898 - (view) Author: Alan McIntyre (alanmcintyre) * (Python committer) Date: 2005-10-24 15:26
Logged In: YES user_id=1115903 Thanks, funny_falcon - that corrected the problem with the literals. I also included the change to the digit lookup table. The new patch is attached as python-mystrtoul4.tgz; it passes all tests now on my machine.
msg48899 - (view) Author: Tim Peters (tim.peters) * (Python committer) Date: 2005-11-01 00:55
Logged In: YES user_id=31435 This looks pretty good to me -- talk about every trick in the book . Note that the digitlimit vector is too cautious for bases that are powers of 2. For example, it's obvious that any string of 32 bits can't overflow an unsigned long, but the table cuts base 2 off at 31 instead. The formula should use log(2**32, base) instead: "N digits can't overflow" iff base**N-1 < 2**32 iff base**N < 2**32+1 base**N <= 2**32 iff N <= log(2**32, base) Assuming exact calculation of log(2**32, base) then (dubious, really), the floor of that is exactly the maximum safe N. The power-of-2 bases, and base 10, should be added to the tests. We really want to check that _all_ supported bases work, right?
msg48900 - (view) Author: Alan McIntyre (alanmcintyre) * (Python committer) Date: 2005-12-21 04:20
Logged In: YES user_id=1115903 I cleaned up the digitlimit vector and test cases now include all bases on [2, 36]. Uploaded as python-mystrtoul5.tgz.
msg48901 - (view) Author: Georg Brandl (georg.brandl) * (Python committer) Date: 2006-02-19 09:43
Logged In: YES user_id=1188172 Tim, if it looks good can it be applied?
msg48902 - (view) Author: Tim Peters (tim.peters) * (Python committer) Date: 2006-05-23 18:49
Logged In: YES user_id=31435 Thank you, Alan! A variant of the patch was checked in as revision 46133 for Python 2.5, affecting Lib/test/test_builtin.py Misc/ACKS Misc/NEWS Python/mystrtoul.c For future reference, note that C doesn't define whether an unqualified "char" is signed or unsigned, and your assumption that it was signed doesn't actually work everywhere ;-) Fixing that was the only real pain remaining here: thank you for the careful work and testing!
History
Date User Action Args
2022-04-11 14:56:13 admin set github: 42513
2005-10-24 05:33:10 alanmcintyre create