Issue 1251300: Decoding with unicode_internal segfaults on UCS-4 builds

Created on 2005-08-03 19:49 by nhaldimann, last changed 2022-04-11 14:56 by admin. This issue is now closed.

Files
File name Uploaded Description
unicode_internal.diff nhaldimann,2005-08-05 14:50 Patch
unicode_internal.diff nhaldimann,2005-08-05 21:08 Improved Patch
Messages (11)
msg25964 - (view) Author: Nik Haldimann (nhaldimann) Date: 2005-08-03 19:49
On UCS-4 builds, decoding a byte string with the unicode_internal codec doesn't work correctly for code points from 0x80000000 upwards and can even segfault. I have observed the same behaviour on 2.5 from CVS and 2.4.0 on OS X/PowerPC as well as on 2.3.5 on Linux/x86. Here's an example:

    Python 2.5a0 (#1, Aug 3 2005, 21:34:05)
    [GCC 3.3 20030304 (Apple Computer, Inc. build 1671)] on darwin
    Type "help", "copyright", "credits" or "license" for more information.
    >>> "\x7f\xff\xff\xff".decode("unicode_internal")
    u'\U7fffffff'
    >>> "\x80\x00\x00\x00".decode("unicode_internal")
    u'\x00'
    >>> "\x80\x00\x00\x01".decode("unicode_internal")
    u'\x01'
    >>> "\x81\x00\x00\x00".decode("unicode_internal")
    Segmentation fault

On little-endian architectures the byte strings must be reversed for the same effect. I'm not sure I fully understand what's going on, but I see two possible solution strategies:

1. Make unicode_internal work for any code point up to 0xFFFFFFFF.
2. Make unicode_internal raise a UnicodeDecodeError for anything above 0x10FFFF (== sys.maxunicode for UCS-4 builds).

Since there are no Unicode code points above 0x10FFFF, the latter solution feels more correct to me, even though it might break backwards compatibility a tiny bit. The unicodeescape codec already does a similar thing:

    >>> u"\U00110000"
    UnicodeDecodeError: 'unicodeescape' codec can't decode bytes in position 0-9: illegal Unicode character
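[Editor's note: for illustration, here is a minimal pure-Python model of what solution 2 means on a UCS-4 build. It is only a sketch of the intended behaviour, not the actual C patch; the function name decode_unicode_internal_ucs4 is made up, and it assumes 4-byte units in native byte order, as the real codec uses.]

    import struct
    import sys

    def decode_unicode_internal_ucs4(data):
        # Sketch only: models the bounds checking the real C decoder needs.
        # Each unit is 4 native-order bytes, as on a UCS-4 build.
        if len(data) % 4:
            raise UnicodeDecodeError("unicode_internal", data,
                                     len(data) - len(data) % 4, len(data),
                                     "truncated data")
        result = []
        for i in range(0, len(data), 4):
            (value,) = struct.unpack("=L", data[i:i + 4])
            if value > sys.maxunicode:
                # Solution 2: reject anything above 0x10FFFF instead of
                # producing garbage or crashing.
                raise UnicodeDecodeError("unicode_internal", data, i, i + 4,
                                         "unichr(%d) not in range" % value)
            result.append(unichr(value))
        return u"".join(result)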
msg25965 - (view) Author: Marc-Andre Lemburg (lemburg) * (Python committer) Date: 2005-08-04 14:41
I think solution 2 is the right approach, since valid code points only go up to 0x10FFFF. Could you provide a patch?
msg25966 - (view) Author: Nik Haldimann (nhaldimann) Date: 2005-08-05 14:50
OK, I put something together. Please review carefully as I'm not very familiar with the C API. I have tested this with the CVS HEAD on OS X and Linux.
msg25967 - (view) Author: Walter Dörwald (doerwalter) * (Python committer) Date: 2005-08-05 16:03
Your patch doesn't support PEP 293 error handlers. Could you add support for that?
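[Editor's note: for readers unfamiliar with PEP 293, an error handler is a callable registered under a name with codecs.register_error; a decoder with handler support calls it for each undecodable chunk instead of failing unconditionally. The sketch below shows what that looks like from the user's side, assuming the patched decoder rejects values above sys.maxunicode and routes the error through the handler machinery; the handler name "fffd-replace" is made up and the outputs are expected behaviour, not verified against the patch.]

    import codecs

    def replace_with_fffd(exc):
        # A PEP 293 decode handler returns (replacement text, resume position).
        return (u"\ufffd", exc.end)

    codecs.register_error("fffd-replace", replace_with_fffd)

    # 0x00110000 is one past the last Unicode code point (big-endian byte
    # order shown; reverse the bytes on a little-endian UCS-4 build).
    data = "\x00\x11\x00\x00"

    try:
        data.decode("unicode_internal", "strict")
    except UnicodeDecodeError, exc:
        print "strict:", exc.reason
    print "ignore:", repr(data.decode("unicode_internal", "ignore"))
    print "custom:", repr(data.decode("unicode_internal", "fffd-replace"))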
msg25968 - (view) Author: Nik Haldimann (nhaldimann) Date: 2005-08-05 16:35
Ah, that PEP clears some things up for me. I will look into it, but I hope you realize this requires tinkering with unicodeobject.c, since the error handler code seems to live there.
msg25969 - (view) Author: Nik Haldimann (nhaldimann) Date: 2005-08-05 21:08
Here's the patch with error handler support + test. Again: please review carefully.
msg25970 - (view) Author: Walter Dörwald (doerwalter) * (Python committer) Date: 2005-08-18 20:17
The patch has a problem with input strings whose length is not a multiple of 4, e.g. "\x00".decode("unicode-internal") returns u"" instead of raising an error. Also, in a UCS-2 build most of the tests are irrelevant (it's not possible to create code points above 0x10ffff even when using surrogates), so they should probably be ifdef'd out.
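[Editor's note: a rough illustration of what is being asked for here; the class and method names are made up and the actual tests checked into CVS may differ. The UCS-4-only cases are guarded on sys.maxunicode so they are skipped on UCS-2 builds, and truncated input gets its own test.]

    import sys
    import unittest

    class UnicodeInternalTest(unittest.TestCase):
        def test_codepoint_above_maxunicode(self):
            # Only meaningful on a UCS-4 build; a UCS-2 build cannot
            # represent code points above 0x10FFFF at all.
            if sys.maxunicode > 0xffff:
                self.assertRaises(UnicodeDecodeError,
                                  "\x00\x11\x11\x00".decode,
                                  "unicode_internal")

        def test_truncated_input(self):
            # A single byte is not a whole unit and should raise instead of
            # silently decoding to u"".
            self.assertRaises(UnicodeDecodeError,
                              "\x00".decode, "unicode_internal")

    if __name__ == "__main__":
        unittest.main()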
msg25971 - (view) Author: Nik Haldimann (nhaldimann) Date: 2005-08-19 14:17
I agree about the ifdefs. I'm not sure how to handle input strings of incorrect length. I guess raising a UnicodeDecodeError is in order, but I don't think it makes sense to pass it through the error handler, since the data the handler would see is potentially nonsensical (e.g., the code point value). Can you comment on this? Is it OK to raise a UnicodeDecodeError and skip the error handler here?
msg25972 - (view) Author: Walter Dörwald (doerwalter) * (Python committer) Date: 2005-08-19 15:39
The data the handler sees is nonsensical by definition. ;) To get an idea of how to handle an incorrect length, take a look at Objects/unicodeobject.c::PyUnicode_DecodeUTF16Stateful().
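[Editor's note: the point of the pointer to PyUnicode_DecodeUTF16Stateful() is that the UTF-16 decoder already treats a trailing incomplete unit as an ordinary decode error and feeds it to the registered handler rather than special-casing it. Below is a rough Python-level illustration of that behaviour; the outputs are what is expected, and the exact error message may vary between versions.]

    # UTF-16 little-endian BOM, u"A", then a lone trailing byte.
    data = "\xff\xfe\x41\x00\x42"

    try:
        data.decode("utf-16")
    except UnicodeDecodeError, exc:
        print exc.reason                          # expected: 'truncated data'

    print repr(data.decode("utf-16", "ignore"))   # expected: u'A'
    print repr(data.decode("utf-16", "replace"))  # expected: u'A\ufffd'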
msg25973 - (view) Author: Marc-Andre Lemburg (lemburg) * (Python committer) Date: 2005-08-19 15:45
Assigning to Walter, the error handler expert :-)
msg25974 - (view) Author: Walter Dörwald (doerwalter) * (Python committer) Date: 2005-08-30 10:47
I've checked in a version that detects truncated data as:

    Include/unicodeobject.h 2.49
    Lib/test/test_codeccallbacks.py 1.18
    Lib/test/test_codecs.py 1.26
    Misc/NEWS 1.1358
    Modules/_codecsmodule.c 2.22
    Objects/unicodeobject.c 2.231

and

    Include/unicodeobject.h 2.48.2.1
    Lib/test/test_codeccallbacks.py 1.16.4.2
    Lib/test/test_codecs.py 1.15.2.8
    Misc/NEWS 1.1193.2.92
    Modules/_codecsmodule.c 2.20.2.2
    Objects/unicodeobject.c 2.230.2.1

Thanks for the patch!
History
Date User Action Args
2022-04-11 14:56:12 admin set github: 42248
2005-08-03 19:49:18 nhaldimann create