Issue 16741: int(), float(), etc think python strings are null-terminated (original) (raw)

process

Status: closed Resolution: fixed
Dependencies: Superseder:
Assigned To: serhiy.storchaka Nosy List: belopolsky, benjamin.peterson, chris.jerdonek, christian.heimes, ezio.melotti, gangesmaster, isoschiz, mark.dickinson, mrabarnett, python-dev, serhiy.storchaka, vstinner
Priority: normal Keywords: patch

Created on 2012-12-20 23:37 by gangesmaster, last changed 2022-04-11 14:57 by admin. This issue is now closed.

Files
File name Uploaded Description Edit
issue16741.patch mrabarnett,2012-12-30 02:52 review
issue16741#2.patch mrabarnett,2012-12-30 17:12
int_from_str.patch serhiy.storchaka,2013-05-05 19:40 review
int_from_str-3.3_2.patch serhiy.storchaka,2013-06-09 17:23 Patch for 3.3 review
int_from_str-2.7.patch serhiy.storchaka,2013-07-14 12:37 Patch for 2.7 review
Messages (18)
msg177863 - (view) Author: ganges master (gangesmaster) Date: 2012-12-20 23:37
I'm not sure if it's a bug or just an inconvenience, but when a string containing \x00 is passed to int/float/etc, they return a misleading exception: >>> int("abc") Traceback (most recent call last): File "", line 1, in ValueError: invalid literal for int() with base 10: 'abc' >>> int("\x00abc") Traceback (most recent call last): File "", line 1, in ValueError: invalid literal for int() with base 10: '' >>> float("\x00abc") Traceback (most recent call last): File "", line 1, in ValueError: could not convert string to float: I noticed the code does actually try to handle it: http://hg.python.org/cpython/file/39803c20c9bf/Objects/intobject.c#l1066 but still, the reported error is very misleading.
msg177915 - (view) Author: Matthew Barnett (mrabarnett) * (Python triager) Date: 2012-12-22 01:30
Python takes a long way round when converting strings to int. It does the following (I'll be talking about Python 3.3 here): 1. In function 'fix_decimal_and_space_to_ascii', the different kinds of spaces are converted to " " and the different kinds of digits are converted to their equivalents in the ASCII range; 2. The resulting string is converted to UTF-8; 3. The resulting string is passed to 'PyLong_FromString', which expects a null-terminated string. 4. If 'PyLong_FromString' is unable to parse the string as an int, it builds an error message using the string that was passed into it, which it does by converting that string _back_ into Unicode. As a result of step 4, the string that's reported as the value in the error message is _not_ necessarily correct. For example: >>> int("\N{ARABIC-INDIC DIGIT ONE}") 1 >>> int("#\N{ARABIC-INDIC DIGIT ONE}") Traceback (most recent call last): File "<pyshell#1>", line 1, in int("#\N{ARABIC-INDIC DIGIT ONE}") ValueError: invalid literal for int() with base 10: '#1' And it also means a "\x00" and anything after it will be omitted: >>> int("foo\x00bar") Traceback (most recent call last): File "<pyshell#2>", line 1, in int("foo\x00bar") ValueError: invalid literal for int() with base 10: 'foo' And in a final point, 'PyLong_FromString' limits the length of the value it reports in the error message, and the code that does it includes this line: slen = strlen(orig_str) < 200 ? strlen(orig_str) : 200;
msg178003 - (view) Author: Matthew Barnett (mrabarnett) * (Python triager) Date: 2012-12-23 18:12
It occurred to me that the truncation of the string when building the error message could cause a UnicodeDecodeError: >>> int("1".ljust(199) + "\u0100") Traceback (most recent call last): File "<pyshell#0>", line 1, in int("1".ljust(199) + "\u0100") UnicodeDecodeError: 'utf-8' codec can't decode byte 0xc4 in position 199: unexpected end of data This is because it's truncating a UTF-8 string, and the truncation is in the middle of a multi-byte sequence.
msg178548 - (view) Author: Matthew Barnett (mrabarnett) * (Python triager) Date: 2012-12-30 02:52
I've attached a patch. It now reports an invalid literal as-is: >>> int("#\N{ARABIC-INDIC DIGIT ONE}") Traceback (most recent call last): File "<pyshell#1>", line 1, in int("#\N{ARABIC-INDIC DIGIT ONE}") ValueError: invalid literal for int() with base 10: '#١' >>> int("foo\x00bar") Traceback (most recent call last): File "<pyshell#2>", line 1, in int("foo\x00bar") ValueError: invalid literal for int() with base 10: 'foo\x00bar' There's a slight difference in that it truncates to 200 codepoints, not 200 UTF-8 bytes.
msg178587 - (view) Author: Matthew Barnett (mrabarnett) * (Python triager) Date: 2012-12-30 17:12
I've attached a small additional patch for truncating the UTF-8. I don't know whether it's strictly necessary, but I don't know that it's unnecessary either! (Better safe than sorry.)
msg188458 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2013-05-05 19:40
Here is a patch based on Matthew's patch. It is smaller (+35 lines vs +59) but fixes error messages for more cases: int(b'123\0') -- bytes string with null without base. int(b'123\xbd') -- non-utf-8 bytes string. int('123\ud800') -- lone surrogate in unicode string. Unfortunately it is not easy to backport it to 2.7. PyErr_Format() in 2.7 works only with null-terminated strings. I propose to fix this issue on 3.3+ and declare it as "won't fix" for 2.7.
msg188463 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2013-05-05 20:17
int_from_str.patch: + strobj = PySequence_GetSlice(u, 0, 200); + if (strobj != NULL) { + PyErr_Format(PyExc_ValueError, + "invalid literal for int() with base %d: %R", + base, strobj); + Py_DECREF(strobj); + } Oh, it remembers me that #7330 is still open.
msg188777 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2013-05-09 13:31
Thanks, for 3.4 I will use new formatting feature.
msg190212 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2013-05-28 13:50
Are there any other comments?
msg190866 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2013-06-09 17:23
Patch updated. It now reuses code for bytes->int in longobject.c and abstract.c, doesn't raise UnicodeDecodeError for non-utf-8 bytes, and always reports an invalid bytes literal as a bytes object.
msg193047 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2013-07-14 12:37
Here is a patch for 2.7.
msg193556 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2013-07-22 18:13
If there are no objections I'm going to commit patches soon.
msg193558 - (view) Author: Christian Heimes (christian.heimes) * (Python committer) Date: 2013-07-22 18:45
I don't like the idea to change the behavior of 2.7 so late in its release cycle. Benjamin, what's your opinion?
msg193565 - (view) Author: Benjamin Peterson (benjamin.peterson) * (Python committer) Date: 2013-07-22 20:42
Yeah, let's just fix Python 3.
msg194282 - (view) Author: Roundup Robot (python-dev) (Python triager) Date: 2013-08-03 18:20
New changeset ecc8512b427d by Serhiy Storchaka in branch '3.3': Issue #16741: Fix an error reporting in int(). http://hg.python.org/cpython/rev/ecc8512b427d New changeset 4fd48a807812 by Serhiy Storchaka in branch 'default': Issue #16741: Fix an error reporting in int(). http://hg.python.org/cpython/rev/4fd48a807812
msg194302 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2013-08-03 21:16
There is a test in test_unicode which expects an UnicodeError for int('\ud800'). Now it fails. Should we fix a test or int()?
msg194304 - (view) Author: Alexander Belopolsky (belopolsky) * (Python committer) Date: 2013-08-03 21:45
I'd say fix the test. Raising ValueError is correct in this case. UnicodeError was an implementation artifact.
msg194310 - (view) Author: Roundup Robot (python-dev) (Python triager) Date: 2013-08-03 22:06
New changeset 7b023134ad83 by Serhiy Storchaka in branch '3.3': Issue #16741: Remove testing of implementation artifact. http://hg.python.org/cpython/rev/7b023134ad83 New changeset 1b4772ab420f by Serhiy Storchaka in branch 'default': Issue #16741: Remove testing of implementation artifact. http://hg.python.org/cpython/rev/1b4772ab420f
History
Date User Action Args
2022-04-11 14:57:39 admin set github: 60945
2014-05-23 20:39:13 terry.reedy link issue21546 superseder
2013-08-06 14:48:58 serhiy.storchaka set status: open -> closedresolution: fixed
2013-08-03 22:06:07 python-dev set messages: +
2013-08-03 21:45:20 belopolsky set nosy: + belopolskymessages: +
2013-08-03 21:16:13 serhiy.storchaka set status: closed -> openresolution: fixed -> (no value)messages: +
2013-08-03 18:33:00 serhiy.storchaka set status: open -> closedresolution: fixedstage: patch review -> resolved
2013-08-03 18:20:47 python-dev set nosy: + python-devmessages: +
2013-07-22 20:42:37 benjamin.peterson set messages: +
2013-07-22 18:45:13 christian.heimes set nosy: + christian.heimes, benjamin.petersonmessages: +
2013-07-22 18:13:18 serhiy.storchaka set messages: +
2013-07-14 12:37:58 serhiy.storchaka set files: + int_from_str-2.7.patchmessages: +
2013-06-09 17:23:10 serhiy.storchaka set files: + int_from_str-3.3_2.patchmessages: +
2013-05-28 13:50:15 serhiy.storchaka set messages: +
2013-05-09 13:31:11 serhiy.storchaka set messages: +
2013-05-05 20:17:02 vstinner set messages: +
2013-05-05 19:41:35 serhiy.storchaka set assignee: serhiy.storchaka
2013-05-05 19:40:45 serhiy.storchaka set files: + int_from_str.patchversions: - Python 3.2nosy: + chris.jerdonekmessages: +
2013-04-19 21:36:16 isoschiz set nosy: + isoschiz
2012-12-30 17🔞22 serhiy.storchaka set nosy: + serhiy.storchakaversions: + Python 3.4
2012-12-30 17:12:39 mrabarnett set files: + issue16741#2.patchmessages: +
2012-12-30 05:32:02 ezio.melotti set nosy: + vstinner, ezio.melottistage: patch review
2012-12-30 02:52:19 mrabarnett set files: + issue16741.patchkeywords: + patchmessages: +
2012-12-23 18:12:33 mrabarnett set messages: +
2012-12-22 01:30:16 mrabarnett set nosy: + mrabarnettmessages: +
2012-12-22 01:17:17 terry.reedy set versions: - Python 2.6, Python 3.1
2012-12-21 02:03:46 benjamin.peterson set nosy: + mark.dickinson
2012-12-20 23:37:21 gangesmaster create