| msg26623 |
Author: Ralf Schmitt (titty) |
Date: 2005-10-19 08:23 |
| The following code doesn't work as expected:
ralf@stronzo:~$ cat t.py
#! /usr/bin/env python

s = 'Auguste and Louis Lumi\xe8re'
print repr(s)

u1 = s.decode('utf7')
print 'from utf7: %d %r' % (len(u1), u1)

u2 = u'Auguste and Louis Lumi\xe8re'
print ' u2: %d %r' % (len(u2), u2)
print 'u1==u2', u1==u2

e1 = u1.encode('utf8')
e2 = u2.encode('utf8')
print 'e1=%r' % e1
print 'e2=%r' % e2

unicode(e2, 'utf8')
unicode(e1, 'utf8')
ralf@stronzo:~$ python t.py
'Auguste and Louis Lumi\xe8re'
from utf7: 25 u'Auguste and Louis Lumi\xe8re'
 u2: 25 u'Auguste and Louis Lumi\xe8re'
u1==u2 False
e1='Auguste and Louis Lumi\xff\xbf\xbf\xa8re'
e2='Auguste and Louis Lumi\xc3\xa8re'
Traceback (most recent call last):
  File "t.py", line 19, in ?
    unicode(e1, 'utf8')
  File "/usr/local/lib/python2.4/encodings/utf_8.py", line 16, in decode
    return codecs.utf_8_decode(input, errors, True)
UnicodeDecodeError: 'utf8' codec can't decode byte 0xff in position 22: unexpected code byte |
|
|
| msg26624 |
Author: Marc-Andre Lemburg (lemburg) *  |
Date: 2005-10-19 10:30 |
| Hmm, running Python 2.4.2 I get:
>>> s = 'Auguste and Louis Lumi\xe8re'
>>> print repr(s)
'Auguste and Louis Lumi\xe8re'
>>> u1 = s.decode('utf7')
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
UnicodeDecodeError: 'utf7' codec can't decode bytes in position 0-22: unexpected special character
This looks correct, as UTF-7 may not contain characters that have the high bit set. |
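The claim about the high bit can be checked directly. The following is a minimal sketch written for Python 3 (an assumption; the thread itself concerns Python 2.4). It shows that a raw 0xE8 byte is rejected by the UTF-7 codec, while a properly encoded string uses only 7-bit bytes and round-trips.

```python
# Python 3 sketch (assumption: the thread itself uses Python 2.4 syntax).
raw = b'Auguste and Louis Lumi\xe8re'   # Latin-1 bytes, not valid UTF-7

try:
    raw.decode('utf-7')
    rejected = False
except UnicodeDecodeError:
    # Expected path: byte 0xE8 has the high bit set.
    rejected = True

text = 'Auguste and Louis Lumi\xe8re'
encoded = text.encode('utf-7')

# Every byte of a well-formed UTF-7 string fits in 7 bits.
all_seven_bit = all(b < 128 for b in encoded)
round_trip = encoded.decode('utf-7')

print(rejected, all_seven_bit, round_trip == text)
```

An interpreter that decodes `raw` without raising is showing the behaviour reported in this issue.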
|
|
| msg26625 |
Author: Ralf Schmitt (titty) |
Date: 2005-10-19 10:58 |
| On Debian testing and FreeBSD 4.11, using Python 2.4.2, '\xe8'.decode('utf7') succeeds. With the Windows version I also get that error. |
|
|
| msg26626 |
Author: Marc-Andre Lemburg (lemburg) *  |
Date: 2005-10-19 11:07 |
| I was testing on SuSE Linux 9.2. This sounds like a compiler bug. Could you try compiling with optimization switched off on FreeBSD? Thanks. |
|
|
| msg26627 |
Author: Sjoerd Mullender (sjoerd) *  |
Date: 2005-10-19 11:17 |
| The definition of SPECIAL in unicodeobject.c is wrong. It tests a character for > 127, but when characters are signed and Py_UNICODE expands to a signed type, this doesn't do what was intended. |
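The signedness problem described here can be illustrated without reading the C source. The sketch below uses `ctypes` to reinterpret the byte 0xE8 as a signed and as an unsigned 8-bit value. The helper `looks_special` is a hypothetical stand-in for the `> 127` part of the test, not the actual SPECIAL macro, and the masking step is one conventional remedy assumed for illustration rather than taken from the committed patch.

```python
import ctypes

byte = 0xE8  # the offending byte from 'Lumi\xe8re'

# Same bit pattern, read as a signed and as an unsigned 8-bit value.
as_signed = ctypes.c_byte(byte).value     # two's complement reading
as_unsigned = ctypes.c_ubyte(byte).value


def looks_special(c):
    # Hypothetical stand-in for the "> 127" comparison.
    return c > 127


# A negative value never compares greater than 127, so the byte slips through.
signed_result = looks_special(as_signed)
unsigned_result = looks_special(as_unsigned)

# Assumed remedy: mask to the low 8 bits before comparing.
masked_result = looks_special(as_signed & 0xFF)

print(as_signed, as_unsigned, signed_result, unsigned_result, masked_result)
```

If the signed value is then widened into a signed Py_UNICODE, it stays negative, which would explain why only some builds accept the malformed input.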
|
|
| msg26628 |
Author: Ralf Schmitt (titty) |
Date: 2005-10-19 11:29 |
| The problem *disappears* on FreeBSD if I configure *without* --enable-unicode=ucs4. I guess this is also the option the Debian packagers use, so it is not a compiler bug: FreeBSD uses gcc 2.95 and Debian uses gcc 4.0.x. |
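To tell which kind of build an interpreter is, `sys.maxunicode` can be inspected. This is a minimal sketch; the helper name `unicode_build` is hypothetical. On the Python 2 series discussed here, a narrow (UCS2) build reports 0xFFFF and a wide (UCS4) build reports 0x10FFFF. Python 3.3 and later always report 0x10FFFF, so the distinction no longer applies there.

```python
import sys


def unicode_build(maxunicode=None):
    """Classify an interpreter build by its largest code point.

    Hypothetical helper: pass a value explicitly to classify a build
    other than the one running this code.
    """
    if maxunicode is None:
        maxunicode = sys.maxunicode
    return 'UCS2' if maxunicode == 0xFFFF else 'UCS4'


print(unicode_build())
```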
|
|
| msg26629 |
Author: Marc-Andre Lemburg (lemburg) *  |
Date: 2005-10-19 12:09 |
| I can confirm this: on a UCS4 build, Python accepts the malformed UTF-7 string. I'll have a look at Sjoerd's suggestion. |
|
|
| msg26630 |
Author: Marc-Andre Lemburg (lemburg) *  |
Date: 2005-10-19 22:34 |
| Fixed in CVS:
Checking in unicodeobject.c;
/cvsroot/python/python/dist/src/Objects/unicodeobject.c,v  <--  unicodeobject.c
new revision: 2.233; previous revision: 2.232
done
I've marked this as a backport candidate. |
|
|