msg49780
Author: Baris Metin (tbmetin)
Date: 2006-03-20 12:44
The array module fails when handed a UTF-8 byte string, giving a SIGSEGV. The attached patch seems to do the trick...

gdb> run
(no debugging symbols found)
(no debugging symbols found)
[Thread debugging using libthread_db enabled]
[New Thread -1480337216 (LWP 31303)]
Python 2.4.2 (#1, Mar 20 2006, 12:08:06)
[GCC 3.4.5] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import array
>>> x = array.array("u")
>>> x.append(u"barış")
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
TypeError: array item must be unicode character
>>> x.append("barış")
>>> x
Program received signal SIGSEGV, Segmentation fault.
[Switching to Thread -1480337216 (LWP 31303)]
Error while running hook_stop:
Invalid type combination in ordering comparison.
0xa7ee0799 in PyUnicodeUCS4_FromUnicode () from /usr/lib/libpython2.4.so.1.0
|
|
msg49781
Author: Neal Norwitz (nnorwitz)
Date: 2006-03-24 06:11
With the stock 2.4.2 on my linux box I was able to reproduce this. I couldn't reproduce it with 2.4.3c1. Can you verify this is fixed in 2.4.3? Thanks.
|
|
msg49782
Author: Baris Metin (tbmetin)
Date: 2006-03-24 09:19
I'm able to reproduce the bug with 2.5a0 SVN (r43289). Please try with --enable-unicode=ucs4. The patch is against svn too.
|
|
msg49783
Author: Alyssa Coghlan (ncoghlan)
Date: 2006-03-24 14:43
To get the effect of the patch, it should be sufficient to just change the format character to an uppercase U. That doesn't seem like the right fix, though: the actual explosion isn't happening until later, when the array elements are being converted to Unicode for output.
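For illustration, modern Python 3 behaves the way this discussion is heading: the "u" typecode only accepts a one-character unicode string, and a byte string is rejected with a TypeError instead of having its raw bytes reinterpreted as Py_UNICODE. A minimal check of that observable behavior (not the patch itself):

```python
import array

a = array.array("u")
a.append("s")          # a single unicode character is accepted

try:
    a.append(b"bar")   # a byte string is rejected outright...
    rejected = False
except TypeError:      # ...with TypeError, not a segfault
    rejected = True

assert rejected
assert a.tounicode() == "s"
```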
|
|
msg49784
Author: Neal Norwitz (nnorwitz)
Date: 2006-03-25 07:29
Verified for 2.4 and head. The problem could probably exist with UCS-2 too if you use 'bar' (I think). I agree with Nick: this patch doesn't really solve the problem. The attached patch fixes the crash more generally, but I think there is a better solution. I hope Martin has time to review this and suggest a better fix. Martin, the change in getargs ensures that we don't try to convert an 8-bit string of length 5 to unicode. The change in unicodeobject ensures that we don't reference the array with a negative offset, as happens when the buffer conversion succeeds with an invalid unicode character.
|
|
msg49785
Author: Martin v. Löwis (loewis)
Date: 2006-03-26 21:36
The second part of the patch (checking that *u is not negative) is definitely right.

The first part (requiring an even number of bytes for a u# argument) probably requires discussion on python-dev (or this patch should be assigned to MAL): I don't think it should be allowed to pass a non-Unicode object to u# in the first place. In particular, if you pass a byte string, there is an implicit assumption that its byte encoding matches the internal representation of Py_UNICODE. This is bad: Python normally assumes the encoding of a string is the system encoding, which is normally ASCII. Of course, changing the call to raise a type error for 2.4.3 probably won't work either, because it might break existing code.

Anyway, I believe the latter fix alone should fix the crash: the current getargs implementation rounds down to the next multiple of sizeof(Py_UNICODE), thanks to the integer division, and u_setitem will then refuse the call if the length is not 1. IOW, it is possible to append between 4 and 7 bytes to a Unicode array.

I wonder why the patch fixes the problem: *u should be unsigned, and comparing an unsigned with a signed should convert the signed to unsigned, no?
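The rounding Martin describes can be sketched in plain Python. Assuming a --enable-unicode=ucs4 build (so sizeof(Py_UNICODE) is 4), any buffer of 4 to 7 bytes divides down to a reported length of 1, which is exactly why u_setitem let it through:

```python
SIZEOF_PY_UNICODE = 4  # assumption: a --enable-unicode=ucs4 build

def u_hash_length(nbytes):
    """Length the old u# converter reports for an nbytes-long
    buffer: integer division silently rounds down."""
    return nbytes // SIZEOF_PY_UNICODE

# u_setitem only refuses lengths != 1, so 4..7 byte buffers slip through:
assert [u_hash_length(n) for n in range(4, 8)] == [1, 1, 1, 1]

# the UTF-8 encoding of u"barış" is 7 bytes, hence "length 1"
assert u_hash_length(len("barış".encode("utf-8"))) == 1
```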
|
|
msg49786
Author: Neal Norwitz (nnorwitz)
Date: 2006-03-29 07:52
Attached is an updated patch that only does the (unsigned) cast in unicodeobject.c. The test included in the patch still crashes the interpreter, this time in unicodectype.c.
|
|
msg49787
Author: Martin v. Löwis (loewis)
Date: 2006-03-29 18:13
I think there is a design choice here: should it be an invariant that all characters in a Unicode string are below sys.maxunicode? If so, the procedures that create them should check whether that invariant holds. If not, all code that deals with them must consider cases where they are larger than sys.maxunicode or smaller than 0. Also, should it be a requirement that Py_UNICODE is an unsigned type? Then tests for < 0 could be dropped, of course.
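For what it's worth, modern CPython settled on the first option: the invariant holds, and the constructors enforce it. A quick demonstration of the enforced bounds (Python 3 semantics, not the 2.x code under discussion):

```python
import sys

# chr() enforces both ends of the invariant 0 <= code <= sys.maxunicode
assert chr(sys.maxunicode)            # the highest valid code point is fine

for bad in (sys.maxunicode + 1, -1):  # one past either end is refused
    try:
        chr(bad)
        enforced = False
    except ValueError:
        enforced = True
    assert enforced
```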
|
|
msg49788
Author: Martin v. Löwis (loewis)
Date: 2006-04-11 07:16
It turns out that the test for whether wchar_t is unsigned was reversed. Correcting the test makes the program stop crashing; this is fixed in r45264. Not sure whether this can be backported to 2.4; it would mean that Python stops using wchar_t as Py_UNICODE on Linux.
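The signedness question behind this can be illustrated without C: reinterpreting the same 32 bits both ways (a sketch using struct) shows why it matters whether Py_UNICODE is typedef'd to a signed wchar_t. Signed, a bit pattern like 0xFFFFFFFF compares as -1, so the "< 0" checks fire; unsigned, the same bits compare as a huge positive value and those checks are dead code.

```python
import struct

bits = b"\xff\xff\xff\xff"             # one 32-bit Py_UNICODE slot

signed, = struct.unpack("<i", bits)    # view as signed wchar_t
unsigned, = struct.unpack("<I", bits)  # view as unsigned type

assert signed == -1                    # "ch < 0" catches this value
assert unsigned == 0xFFFFFFFF          # "ch < 0" can never be true
assert unsigned > 0x10FFFF             # still far above sys.maxunicode
```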
|
|
msg49789
Author: Neal Norwitz (nnorwitz)
Date: 2006-04-14 05:22
I removed the code which tried to convert a buffer (string) to unicode. Committed revision 45374. We still have to address 2.4. Martin, I can make the fix if you tell me what to do.
|
|
msg49790
Author: Martin v. Löwis (loewis)
Date: 2006-06-05 10:46
I added a workaround for the crash in r46669: PyUnicode_FromUnicode now only uses the shared characters for non-negative values. Every other patch I thought of would have broken backwards compatibility in some way.
|
|