Issue 1331062: utf 7 codec broken

Created on 2005-10-19 08:23 by titty, last changed 2022-04-11 14:56 by admin. This issue is now closed.

Messages (8)
msg26623 - (view)	Author: Ralf Schmitt (titty)	Date: 2005-10-19 08:23
the following code doesn't work as expected:

ralf@stronzo:~$ cat t.py
#! /usr/bin/env python

s = 'Auguste and Louis Lumi\xe8re'
print repr(s)
u1 = s.decode('utf7')
print 'from utf7: %d %r' % (len(u1), u1)
u2 = u'Auguste and Louis Lumi\xe8re'
print ' u2: %d %r' % (len(u2), u2)
print 'u1==u2', u1==u2
e1 = u1.encode('utf8')
e2 = u2.encode('utf8')
print 'e1=%r' % e1
print 'e2=%r' % e2
unicode(e2, 'utf8')
unicode(e1, 'utf8')

ralf@stronzo:~$ python t.py
'Auguste and Louis Lumi\xe8re'
from utf7: 25 u'Auguste and Louis Lumi\xe8re'
 u2: 25 u'Auguste and Louis Lumi\xe8re'
u1==u2 False
e1='Auguste and Louis Lumi\xff\xbf\xbf\xa8re'
e2='Auguste and Louis Lumi\xc3\xa8re'
Traceback (most recent call last):
  File "t.py", line 19, in ?
    unicode(e1, 'utf8')
  File "/usr/local/lib/python2.4/encodings/utf_8.py", line 16, in decode
    return codecs.utf_8_decode(input, errors, True)
UnicodeDecodeError: 'utf8' codec can't decode byte 0xff in position 22: unexpected code byte
msg26624 - (view)	Author: Marc-Andre Lemburg (lemburg) *	Date: 2005-10-19 10:30
Logged In: YES user_id=38388

Hmm, running Python 2.4.2 I get:

>>> s = 'Auguste and Louis Lumi\xe8re'
>>> print repr(s)
'Auguste and Louis Lumi\xe8re'
>>> u1 = s.decode('utf7')
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
UnicodeDecodeError: 'utf7' codec can't decode bytes in position 0-22: unexpected special character

Which looks correct, as UTF-7 may not contain characters having the high bit set.
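For reference, the behaviour described here as correct can be sketched in present-day Python 3 (the report itself uses Python 2.4 syntax, so this is an illustration, not a reproduction). UTF-7 output consists only of 7-bit ASCII bytes, and a character such as U+00E8 is written as a modified-base64 run introduced by `+`, so a raw `\xe8` byte is malformed input. The helper name `decodes_as_utf7` is made up for this example.

```python
text = 'Auguste and Louis Lumi\xe8re'

# Well-formed UTF-7: the non-ASCII character becomes a '+...-' base64 run.
encoded = text.encode('utf-7')
print(encoded)

# Decoding the well-formed bytes gives back the original string.
round_trip = encoded.decode('utf-7')


def decodes_as_utf7(data):
    """Return True if `data` is accepted by the strict UTF-7 decoder."""
    try:
        data.decode('utf-7')
    except UnicodeDecodeError:
        return False
    return True


# The byte string from the report carries a raw high-bit byte (0xE8).
malformed = b'Auguste and Louis Lumi\xe8re'
print(decodes_as_utf7(malformed))
```

On an interpreter with a correct codec the second `print` should report that the malformed input is rejected, which is the result Marc-Andre saw on SuSE.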
msg26625 - (view)	Author: Ralf Schmitt (titty)	Date: 2005-10-19 10:58
Logged In: YES user_id=17929 On Debian testing and FreeBSD 4.11 using Python 2.4.2, '\xe8'.decode('utf7') succeeds... Using the Windows version I also get that error.
msg26626 - (view)	Author: Marc-Andre Lemburg (lemburg) *	Date: 2005-10-19 11:07
Logged In: YES user_id=38388 I was testing on SuSE Linux 9.2. Sounds like a compiler bug. Could you try compiling with optimization switched off on FreeBSD? Thanks.
msg26627 - (view)	Author: Sjoerd Mullender (sjoerd) *	Date: 2005-10-19 11:17
Logged In: YES user_id=43607 The definition of SPECIAL in unicodeobject.c is wrong. It tests a character for > 127, but when characters are signed and Py_UNICODE expands to a signed type, this doesn't do what was intended.
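This diagnosis can be modelled outside C. The sketch below uses Python 3's `ctypes` fixed-width integers to mimic what happens when the byte 0xE8 is read through a signed `char` and then widened. It rests on assumptions: the unsigned 16-bit and signed 32-bit types are stand-ins for Py_UNICODE on the two kinds of build, `is_high_bit_naive` and `is_high_bit_masked` are hypothetical names, and the masking variant is only one conventional remedy, not necessarily the change that was made to unicodeobject.c.

```python
import ctypes

BYTE = 0xE8  # the offending byte from the report

# Reading the byte through a signed 8-bit char gives a negative value.
as_signed_char = ctypes.c_byte(BYTE).value

# Assumed stand-ins for Py_UNICODE (illustration only):
# an unsigned 16-bit type and a signed 32-bit type.
narrow = ctypes.c_uint16(as_signed_char).value  # wraps to a large positive value
wide = ctypes.c_int32(as_signed_char).value     # stays negative


def is_high_bit_naive(c):
    """Model of a plain '> 127' test on the widened value."""
    return c > 127


def is_high_bit_masked(c):
    """Same test after masking to one byte; valid here because the
    decoder's input is a single byte."""
    return (c & 0xFF) > 127


print(as_signed_char, narrow, wide)
print(is_high_bit_naive(narrow))  # the high-bit byte is detected
print(is_high_bit_naive(wide))    # the high-bit byte slips through
print(is_high_bit_masked(wide))   # detected again once the sign is removed
```

Under these assumptions the naive test only misfires when both the char type and the wider type are signed, which matches the observation that the problem depends on how Python was configured.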
msg26628 - (view)	Author: Ralf Schmitt (titty)	Date: 2005-10-19 11:29
Logged In: YES user_id=17929 The problem disappears on FreeBSD if I configure without --enable-unicode=ucs4. I guess that is also what the Debian people are using, so this is not a compiler bug: FreeBSD uses gcc 2.95 and Debian uses 4.0.x.
msg26629 - (view)	Author: Marc-Andre Lemburg (lemburg) *	Date: 2005-10-19 12:09
Logged In: YES user_id=38388 I can confirm this: using a UCS4 build, Python accepts the malformed UTF-7 string. I'll have a look at Sjoerd's suggestion.
msg26630 - (view)	Author: Marc-Andre Lemburg (lemburg) *	Date: 2005-10-19 22:34
Logged In: YES user_id=38388

Fixed in CVS:

Checking in unicodeobject.c;
/cvsroot/python/python/dist/src/Objects/unicodeobject.c,v <-- unicodeobject.c
new revision: 2.233; previous revision: 2.232
done

I've marked this as a backport candidate.

History
Date	User	Action	Args
2022-04-11 14:56:13	admin	set	github: 42499
2005-10-19 08:23:23	titty	create