msg177863 - (view) |
Author: ganges master (gangesmaster) |
Date: 2012-12-20 23:37 |
I'm not sure if it's a bug or just an inconvenience, but when a string containing \x00 is passed to int/float/etc, they return a misleading exception: >>> int("abc") Traceback (most recent call last): File "", line 1, in ValueError: invalid literal for int() with base 10: 'abc' >>> int("\x00abc") Traceback (most recent call last): File "", line 1, in ValueError: invalid literal for int() with base 10: '' >>> float("\x00abc") Traceback (most recent call last): File "", line 1, in ValueError: could not convert string to float: I noticed the code does actually try to handle it: http://hg.python.org/cpython/file/39803c20c9bf/Objects/intobject.c#l1066 but still, the reported error is very misleading. |
|
|
msg177915 - (view) |
Author: Matthew Barnett (mrabarnett) *  |
Date: 2012-12-22 01:30 |
Python takes a long way round when converting strings to int. It does the following (I'll be talking about Python 3.3 here): 1. In function 'fix_decimal_and_space_to_ascii', the different kinds of spaces are converted to " " and the different kinds of digits are converted to their equivalents in the ASCII range; 2. The resulting string is converted to UTF-8; 3. The resulting string is passed to 'PyLong_FromString', which expects a null-terminated string. 4. If 'PyLong_FromString' is unable to parse the string as an int, it builds an error message using the string that was passed into it, which it does by converting that string _back_ into Unicode. As a result of step 4, the string that's reported as the value in the error message is _not_ necessarily correct. For example: >>> int("\N{ARABIC-INDIC DIGIT ONE}") 1 >>> int("#\N{ARABIC-INDIC DIGIT ONE}") Traceback (most recent call last): File "<pyshell#1>", line 1, in int("#\N{ARABIC-INDIC DIGIT ONE}") ValueError: invalid literal for int() with base 10: '#1' And it also means a "\x00" and anything after it will be omitted: >>> int("foo\x00bar") Traceback (most recent call last): File "<pyshell#2>", line 1, in int("foo\x00bar") ValueError: invalid literal for int() with base 10: 'foo' And in a final point, 'PyLong_FromString' limits the length of the value it reports in the error message, and the code that does it includes this line: slen = strlen(orig_str) < 200 ? strlen(orig_str) : 200; |
|
|
msg178003 - (view) |
Author: Matthew Barnett (mrabarnett) *  |
Date: 2012-12-23 18:12 |
It occurred to me that the truncation of the string when building the error message could cause a UnicodeDecodeError: >>> int("1".ljust(199) + "\u0100") Traceback (most recent call last): File "<pyshell#0>", line 1, in int("1".ljust(199) + "\u0100") UnicodeDecodeError: 'utf-8' codec can't decode byte 0xc4 in position 199: unexpected end of data This is because it's truncating a UTF-8 string, and the truncation is in the middle of a multi-byte sequence. |
|
|
msg178548 - (view) |
Author: Matthew Barnett (mrabarnett) *  |
Date: 2012-12-30 02:52 |
I've attached a patch. It now reports an invalid literal as-is: >>> int("#\N{ARABIC-INDIC DIGIT ONE}") Traceback (most recent call last): File "<pyshell#1>", line 1, in int("#\N{ARABIC-INDIC DIGIT ONE}") ValueError: invalid literal for int() with base 10: '#١' >>> int("foo\x00bar") Traceback (most recent call last): File "<pyshell#2>", line 1, in int("foo\x00bar") ValueError: invalid literal for int() with base 10: 'foo\x00bar' There's a slight difference in that it truncates to 200 codepoints, not 200 UTF-8 bytes. |
|
|
msg178587 - (view) |
Author: Matthew Barnett (mrabarnett) *  |
Date: 2012-12-30 17:12 |
I've attached a small additional patch for truncating the UTF-8. I don't know whether it's strictly necessary, but I don't know that it's unnecessary either! (Better safe than sorry.) |
|
|
msg188458 - (view) |
Author: Serhiy Storchaka (serhiy.storchaka) *  |
Date: 2013-05-05 19:40 |
Here is a patch based on Matthew's patch. It is smaller (+35 lines vs +59) but fixes error messages for more cases: int(b'123\0') -- bytes string with null without base. int(b'123\xbd') -- non-utf-8 bytes string. int('123\ud800') -- lone surrogate in unicode string. Unfortunately it is not easy to backport it to 2.7. PyErr_Format() in 2.7 works only with null-terminated strings. I propose to fix this issue on 3.3+ and declare it as "won't fix" for 2.7. |
|
|
msg188463 - (view) |
Author: STINNER Victor (vstinner) *  |
Date: 2013-05-05 20:17 |
int_from_str.patch: + strobj = PySequence_GetSlice(u, 0, 200); + if (strobj != NULL) { + PyErr_Format(PyExc_ValueError, + "invalid literal for int() with base %d: %R", + base, strobj); + Py_DECREF(strobj); + } Oh, it remembers me that #7330 is still open. |
|
|
msg188777 - (view) |
Author: Serhiy Storchaka (serhiy.storchaka) *  |
Date: 2013-05-09 13:31 |
Thanks, for 3.4 I will use new formatting feature. |
|
|
msg190212 - (view) |
Author: Serhiy Storchaka (serhiy.storchaka) *  |
Date: 2013-05-28 13:50 |
Are there any other comments? |
|
|
msg190866 - (view) |
Author: Serhiy Storchaka (serhiy.storchaka) *  |
Date: 2013-06-09 17:23 |
Patch updated. It now reuses code for bytes->int in longobject.c and abstract.c, doesn't raise UnicodeDecodeError for non-utf-8 bytes, and always reports an invalid bytes literal as a bytes object. |
|
|
msg193047 - (view) |
Author: Serhiy Storchaka (serhiy.storchaka) *  |
Date: 2013-07-14 12:37 |
Here is a patch for 2.7. |
|
|
msg193556 - (view) |
Author: Serhiy Storchaka (serhiy.storchaka) *  |
Date: 2013-07-22 18:13 |
If there are no objections I'm going to commit patches soon. |
|
|
msg193558 - (view) |
Author: Christian Heimes (christian.heimes) *  |
Date: 2013-07-22 18:45 |
I don't like the idea to change the behavior of 2.7 so late in its release cycle. Benjamin, what's your opinion? |
|
|
msg193565 - (view) |
Author: Benjamin Peterson (benjamin.peterson) *  |
Date: 2013-07-22 20:42 |
Yeah, let's just fix Python 3. |
|
|
msg194282 - (view) |
Author: Roundup Robot (python-dev)  |
Date: 2013-08-03 18:20 |
New changeset ecc8512b427d by Serhiy Storchaka in branch '3.3': Issue #16741: Fix an error reporting in int(). http://hg.python.org/cpython/rev/ecc8512b427d New changeset 4fd48a807812 by Serhiy Storchaka in branch 'default': Issue #16741: Fix an error reporting in int(). http://hg.python.org/cpython/rev/4fd48a807812 |
|
|
msg194302 - (view) |
Author: Serhiy Storchaka (serhiy.storchaka) *  |
Date: 2013-08-03 21:16 |
There is a test in test_unicode which expects an UnicodeError for int('\ud800'). Now it fails. Should we fix a test or int()? |
|
|
msg194304 - (view) |
Author: Alexander Belopolsky (belopolsky) *  |
Date: 2013-08-03 21:45 |
I'd say fix the test. Raising ValueError is correct in this case. UnicodeError was an implementation artifact. |
|
|
msg194310 - (view) |
Author: Roundup Robot (python-dev)  |
Date: 2013-08-03 22:06 |
New changeset 7b023134ad83 by Serhiy Storchaka in branch '3.3': Issue #16741: Remove testing of implementation artifact. http://hg.python.org/cpython/rev/7b023134ad83 New changeset 1b4772ab420f by Serhiy Storchaka in branch 'default': Issue #16741: Remove testing of implementation artifact. http://hg.python.org/cpython/rev/1b4772ab420f |
|
|