msg106506
Author: Matt Giuca (mgiuca)
Date: 2010-05-26 04:11
In unicodeobject.c's unicodeescape_string, in UCS2 builds, if the last
character of the string is the start of a UTF-16 surrogate pair (between
u'\ud800' and u'\udfff'), there is a slight overrun problem. For example:

    >>> repr(u'abcd\ud800')

Upon reading ch = 0xd800, the test (ch >= 0xD800 && ch < 0xDC00)
succeeds, and the code then reads ch2 = *s++. At that point, s points
one character past the end of the string, so the value read will be
garbage. I imagine that, unless it falls on a segment boundary, the
worst that could happen is that the character u'\ud800' is interpreted
as some other wide character. Nevertheless, this is bad.

Note that *technically* this is never bad, because _PyUnicode_New
allocates an extra character and sets it to u'\u0000', so the above
example will always set ch2 to 0 and behave correctly. But that is a
tenuous thing to rely on, especially given the comment above
_PyUnicode_New:

    /* We allocate one more byte to make sure the string is
       Ux0000 terminated -- XXX is this needed ? */

I thought about removing that XXX, but I'd rather fix the problem.
Therefore, I have attached a patch which does a range check before
reading ch2:

    --- Objects/unicodeobject.c  (revision 81539)
    +++ Objects/unicodeobject.c  (working copy)
    @@ -3065,7 +3065,7 @@
         }
     #else
         /* Map UTF-16 surrogate pairs to '\U00xxxxxx' */
    -    else if (ch >= 0xD800 && ch < 0xDC00) {
    +    else if (ch >= 0xD800 && ch < 0xDC00 && size > 0) {
             Py_UNICODE ch2;
             Py_UCS4 ucs;

Also affects Python 3.
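A minimal standalone sketch of the failure mode, assuming a UCS2 build.
The typedefs, function name, and printf output are illustrative
stand-ins for CPython internals (the real loop in unicodeescape_string
writes into a result buffer and passes printable ASCII through
unescaped; this sketch escapes everything for brevity):

    #include <stdio.h>
    #include <stdint.h>

    typedef uint16_t Py_UNICODE;   /* UCS2 build */
    typedef uint32_t Py_UCS4;

    /* Escape a buffer of `size` characters. The `size > 0` guard is
       the proposed fix: with it, a lone high surrogate at the end of
       the buffer is escaped as \uXXXX; without it, ch2 = *s++ would
       read one Py_UNICODE past the end of the buffer. */
    static void escape(const Py_UNICODE *s, long size)
    {
        while (size-- > 0) {
            Py_UNICODE ch = *s++;
            if (ch >= 0xD800 && ch < 0xDC00 && size > 0) {
                Py_UNICODE ch2 = *s++;
                size--;
                if (ch2 >= 0xDC00 && ch2 <= 0xDFFF) {
                    Py_UCS4 ucs = (((ch & 0x03FF) << 10)
                                   | (ch2 & 0x03FF)) + 0x00010000;
                    printf("\\U%08x", (unsigned)ucs);
                    continue;
                }
                s--;        /* not a low surrogate: put ch2 back */
                size++;
            }
            printf("\\u%04x", (unsigned)ch);
        }
        printf("\n");
    }

    int main(void)
    {
        /* u'abcd\ud800' with no NUL terminator after it */
        const Py_UNICODE text[] = { 'a', 'b', 'c', 'd', 0xD800 };
        escape(text, 5);    /* ends with \ud800, without overrunning */
        return 0;
    }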
|
|
msg112291
Author: Georg Brandl (georg.brandl)
Date: 2010-08-01 08:49
Applied in r83395. Thanks!
|
|
msg112379
Author: Antoine Pitrou (pitrou)
Date: 2010-08-01 20:24
Well, the patch was technically useless since, as mentioned, unicode
strings are terminated by a NUL character by design. Anyway, I now get
the following error on the 2.7 branch. Perhaps it's related:

    ======================================================================
    FAIL: test_ucs4 (test.test_unicode.UnicodeTest)
    ----------------------------------------------------------------------
    Traceback (most recent call last):
      File "/home/antoine/cpython/27/Lib/test/test_unicode.py", line 941, in test_ucs4
        self.assertEqual(x, y)
    AssertionError: '\\udbc0\\udc00' != '\\U00100000'
|
|
msg112382
Author: Georg Brandl (georg.brandl)
Date: 2010-08-01 20:51
You're right. Reverted in r83444 and merging back, and I'll also remove the "XXX is this needed" from 2.7.
|
|
msg112434
Author: Marc-Andre Lemburg (lemburg)
Date: 2010-08-02 09:07
Antoine Pitrou wrote:
> Well, the patch was technically useless since, as mentioned, unicode
> strings are terminated by a NUL character by design.

There are two things to keep in mind:

* Unicode objects are NUL-terminated, but only very external APIs rely
  on this (e.g. code using the Windows Unicode API). Please don't make
  the code in unicodeobject.c itself rely on this subtle detail.

* The codecs work on Py_UNICODE* buffers which are *never* guaranteed
  to be NUL-terminated, so the problem in question is real.
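To make the second point concrete, here is a hedged standalone sketch
(the typedef and function are illustrative, not CPython API): a
codec-style routine receives (s, size) and may legally touch only
s[0] .. s[size-1], because the caller may pass a slice of a larger
buffer, in which case s[size] is an arbitrary character rather than NUL:

    #include <stddef.h>
    #include <stdint.h>

    typedef uint16_t Py_UNICODE;   /* illustration only */

    /* Everything this routine may read is s[0] .. s[size-1]. */
    static size_t count_high_surrogates(const Py_UNICODE *s, size_t size)
    {
        size_t i, n = 0;
        for (i = 0; i < size; i++)
            if (s[i] >= 0xD800 && s[i] < 0xDC00)
                n++;
        return n;
    }

    int main(void)
    {
        Py_UNICODE buf[] = { 'x', 0xD800, 0xDC00, 'y' };
        /* A slice from the middle of a larger buffer: here s[size] is
           'y', not NUL, so peeking at s[size] would be wrong. */
        return (int)count_high_surrogates(buf + 1, 2);
    }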
|
|
msg112441
Author: Antoine Pitrou (pitrou)
Date: 2010-08-02 09:47
> * Unicode objects are NUL-terminated, but only very external APIs
>   rely on this (e.g. code using the Windows Unicode API). Please
>   don't make the code in unicodeobject.c itself rely on this
>   subtle detail.

That's wishful thinking, don't you think? *I* am not making code in
unicodeobject.c rely on this. It has been so for years, long before I
was here. You should check who made that design decision in the first
place instead of putting the blame on me.

Besides, the fact that external APIs rely on it makes it much more
unchangeable than if it were an implementation detail.
|
|
msg112447
Author: Marc-Andre Lemburg (lemburg)
Date: 2010-08-02 12:03
Antoine Pitrou wrote:
>> * Unicode objects are NUL-terminated, but only very external APIs
>>   rely on this (e.g. code using the Windows Unicode API). Please
>>   don't make the code in unicodeobject.c itself rely on this
>>   subtle detail.
>
> That's wishful thinking, don't you think? *I* am not making code in
> unicodeobject.c rely on this. It has been so for years, long before I
> was here. You should check who made that design decision in the first
> place instead of putting the blame on me.

I'm not blaming you for this. However, I don't want more code to rely
on this behavior. The NUL-termination has never been documented, and my
decision to use NUL-termination on the PyUnicodeObject buffers was
merely a safety measure.

> Besides, the fact that external APIs rely on it makes it much more
> unchangeable than if it were an implementation detail.

It's an undocumented implementation detail. We can certainly deprecate
its use using the standard approach we have for this.

But all that is off-topic for this ticket, since codecs operate on
Py_UNICODE* buffers together with a size parameter, and relying on
those buffers being NUL-terminated is bound to cause problems.
|
|
msg112454
Author: Antoine Pitrou (pitrou)
Date: 2010-08-02 12:33
> But all that is off-topic for this ticket, since codecs operate on
> Py_UNICODE* buffers together with a size parameter, and relying on
> those buffers being NUL-terminated is bound to cause problems.

That's right. Then perhaps a fixed patch can be uploaded, if someone
cares enough.
|
|
msg112460
Author: Matt Giuca (mgiuca)
Date: 2010-08-02 13:17
OK, I finally had time to review this issue again.

Firstly, granted the original fix broke a test case, shouldn't we
figure out why it broke and fix it, rather than just revert the change
and continue relying on this tenuous assumption? Surely it's best to
have as little code relying on it as possible.

Secondly, please have a look at my patch again. It wasn't committed
properly -- no offense to Georg, it's an honest mistake! My patch was:

    --- Objects/unicodeobject.c  (revision 81539)
    +++ Objects/unicodeobject.c  (working copy)
    @@ -3065,7 +3065,7 @@
         }
     #else
         /* Map UTF-16 surrogate pairs to '\U00xxxxxx' */
    -    else if (ch >= 0xD800 && ch < 0xDC00) {
    +    else if (ch >= 0xD800 && ch < 0xDC00 && size > 0) {
             Py_UNICODE ch2;
             Py_UCS4 ucs;

The commit made in r83418 by Georg Brandl (and similarly r83395 in py3k):
http://svn.python.org/view/python/branches/release27-maint/Objects/unicodeobject.c?r1=82980&r2=83418

    --- Objects/unicodeobject.c  (revision 83417)
    +++ Objects/unicodeobject.c  (revision 83418)
    @@ -3067,7 +3067,7 @@
             ch2 = *s++;
             size--;
    -        if (ch2 >= 0xDC00 && ch2 <= 0xDFFF) {
    +        if (ch2 >= 0xDC00 && ch2 <= 0xDFFF && size) {
                 ucs = (((ch & 0x03FF) << 10) | (ch2 & 0x03FF)) + 0x00010000;
                 *p++ = '\\';
                 *p++ = 'U';
    @@ -3316,7 +3316,7 @@
             ch2 = *s++;
             size--;
    -        if (ch2 >= 0xDC00 && ch2 <= 0xDFFF) {
    +        if (ch2 >= 0xDC00 && ch2 <= 0xDFFF && size) {
                 ucs = (((ch & 0x03FF) << 10) | (ch2 & 0x03FF)) + 0x00010000;
                 *p++ = '\\';
                 *p++ = 'U';

I put the size check on the first character of the surrogate pair; in
the committed version the size check was on the second character (after
the "size" variable is decremented), causing it to break out of that
branch too early in some cases. Moving the size check to the outer if
block fixes the test breakage.

PS. Good find on the second copy of that code in the
PyUnicode_EncodeRawUnicodeEscape function. I've attached a new patch
which fixes both functions instead of just the unicodeescape_string
function. Passes all test cases on a UCS2 build of the 2.7 branch.
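The difference between the two placements can be reproduced with a
small standalone program. This is a sketch under the same caveats as
above (typedefs and printf output stand in for CPython internals; the
r83418 flag and function name are purely illustrative):

    #include <stdio.h>
    #include <stdint.h>

    typedef uint16_t Py_UNICODE;   /* UCS2 build, illustration only */
    typedef uint32_t Py_UCS4;

    /* r83418 != 0 selects the committed placement (check `size` on
       ch2, after the decrement); r83418 == 0 selects the patch's
       placement (guard the read of ch2 itself). */
    static void escape(const Py_UNICODE *s, long size, int r83418)
    {
        while (size-- > 0) {
            Py_UNICODE ch = *s++;
            if (ch >= 0xD800 && ch < 0xDC00 && (r83418 || size > 0)) {
                Py_UNICODE ch2 = *s++;   /* r83418 performs this read
                                            unconditionally, so it still
                                            relies on a terminator */
                size--;
                if (ch2 >= 0xDC00 && ch2 <= 0xDFFF && (!r83418 || size)) {
                    Py_UCS4 ucs = (((ch & 0x03FF) << 10)
                                   | (ch2 & 0x03FF)) + 0x00010000;
                    printf("\\U%08x", (unsigned)ucs);
                    continue;
                }
                s--;                     /* pair not combined: put ch2 back */
                size++;
            }
            printf("\\u%04x", (unsigned)ch);
        }
        printf("\n");
    }

    int main(void)
    {
        const Py_UNICODE pair[] = { 0xDBC0, 0xDC00 };   /* U+100000 */
        escape(pair, 2, 1);   /* prints \udbc0\udc00 -- the test failure */
        escape(pair, 2, 0);   /* prints \U00100000 -- the expected repr */
        return 0;
    }

With the committed placement, a valid surrogate pair at the very end of
the string leaves size == 0 after ch2 is consumed, so the pair is never
combined -- exactly the '\\udbc0\\udc00' != '\\U00100000' failure from
msg112379.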
|
msg124890
Author: Alexander Belopolsky (belopolsky)
Date: 2010-12-29 23:59
[MAL]
> * Unicode objects are NUL-terminated, but only very external APIs
>   rely on this (e.g. code using the Windows Unicode API). Please
>   don't make the code in unicodeobject.c itself rely on this
>   subtle detail.

I would like to note that several APIs have been introduced that
require NUL-terminated unicode strings: Py_UNICODE_strlen(),
Py_UNICODE_strcpy(), etc. I see them used quite extensively in
unicodeobject.c and elsewhere in the Python codebase. Perhaps this
train has left the station already.
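For reference, these helpers behave like their char* namesakes, which
is why their callers must hold NUL-terminated buffers. A hedged
re-implementation of one of them for illustration (CPython declares the
real functions in Include/unicodeobject.h; the typedef and the name
unicode_strlen here are not CPython's):

    #include <stddef.h>
    #include <stdint.h>

    typedef uint16_t Py_UNICODE;   /* UCS2 build, illustration only */

    /* Like strlen(): counts Py_UNICODE units up to, but not including,
       the terminating NUL. Walks until it finds a 0, so it must only
       ever be handed NUL-terminated data. */
    static size_t unicode_strlen(const Py_UNICODE *u)
    {
        const Py_UNICODE *p = u;
        while (*p)
            p++;
        return (size_t)(p - u);
    }

    int main(void)
    {
        const Py_UNICODE s[] = { 'a', 'b', 0 };
        return unicode_strlen(s) == 2 ? 0 : 1;
    }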
|
|
msg124894
Author: STINNER Victor (vstinner)
Date: 2010-12-30 00:45
> Unicode objects are NUL-terminated, but only very external APIs
> rely on this (e.g. code using the Windows Unicode API).

All Py_UNICODE_str*() functions rely on the NUL character. They are
useful when porting a function from bytes (char*) to unicode
(PyUnicodeObject): the APIs are very close. It would be possible to
avoid them with new functions that take the string's length.

All functions passing Py_UNICODE* as wchar_t* to the Windows wide
character API (the *W functions) also rely on the NUL character, and
Python's core uses a lot of these functions. If we didn't write a NUL
character, each such call would require creating a temporary copy of
the string ending with a NUL character, which is not efficient,
especially for long strings. And there is the problem of all third
party modules (written in C) relying on the NUL character.

I think we have good reasons not to remove the NUL character, so I
think we can continue to accept that the unicode[length] character may
be read. E.g. implement text.startswith("ab") as
"p = PyUnicode_AS_UNICODE(text); if (p[0] == 'a' && p[1] == 'b')"
without checking the length of text. Using the NUL character or the
length as the terminator condition doesn't really matter; I see just
one advantage for the NUL character: it is faster in some cases.
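Victor's startswith example, spelled out as a standalone sketch (the
typedef and function name are illustrative): because the buffer is
NUL-terminated, p[1] may be read without a length check; for a
one-character string, p[1] is the terminating NUL, so the comparison
simply fails instead of overrunning:

    #include <stdint.h>

    typedef uint16_t Py_UNICODE;   /* illustration only */

    /* Safe even for u"" or u"a": && short-circuits, and the
       terminating NUL stops the match before any overrun. */
    static int starts_with_ab(const Py_UNICODE *p)
    {
        return p[0] == 'a' && p[1] == 'b';
    }

    int main(void)
    {
        const Py_UNICODE just_a[] = { 'a', 0 };        /* u"a" + NUL */
        const Py_UNICODE abc[]    = { 'a', 'b', 'c', 0 };
        return (starts_with_ab(just_a) == 0
                && starts_with_ab(abc) == 1) ? 0 : 1;
    }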
|
|
msg124901
Author: Matt Giuca (mgiuca)
Date: 2010-12-30 02:37
> I think that we have good reasons to not remove the NUL character.

Please note: nobody is suggesting that we remove the NUL character. I
was merely suggesting that we not rely on it where it is unnecessary.

Returning to my original patch: if the code were using the NUL
character as a terminator, then this wouldn't be a bug. What the repr
code actually does is use the length; it does not explicitly search for
a NUL character. However, there is a *bug* where it reads one too many
characters in certain cases. As I said in the first place, it just
happens not to be catastrophic due to the presence of the NUL
character. But that does not mean this isn't a bug -- at the very
least, the code is confusing to read, because it does not do what it is
trying to do.

Anyway, the important issue is what Marc-Andre raised about buffers.
Since we are in agreement that there is a potential problem here, and I
have a patch which seems correct and doesn't break any test cases (note
my above post responding to the test case breakage), can it be applied?
|
|
msg176810
Author: Serhiy Storchaka (serhiy.storchaka)
Date: 2012-12-02 21:26
This patch is not needed for 3.3+. And for 2.7 and 3.2 it actually doesn't fix any bug in the current code.
|
|
msg178912
Author: Ezio Melotti (ezio.melotti)
Date: 2013-01-03 04:56
Should this be closed then?
|
|
msg178929
Author: Serhiy Storchaka (serhiy.storchaka)
Date: 2013-01-03 10:03
You can accept the patch. You can reject the patch. It doesn't matter.
|
|