[Python-Dev] Regression in unicodestr.encode()? (original) (raw)

M.-A. Lemburg mal@lemburg.com
Wed, 10 Apr 2002 19:14:46 +0200

Previous message: [Python-Dev] Regression in unicodestr.encode()?
Next message: [Python-Dev] Regression in unicodestr.encode()?
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]

"Barry A. Warsaw" wrote:

I'm porting over the latest email package to Python 2.3cvs, and I've had one of my tests fail. I've narrowed it down to the following test case: a = u'\u6b63\u78ba\u306b\u8a00\u3046\u3068\u7ffb\u8a33\u306f\u3055\u308c\u3066\u3044\u307e\u305b\u3093\u3002\u4e00\u90e8\u306f\u30c9\u30a4\u30c4\u8a9e\u3067\u3059\u304c\u3001\u3042\u3068\u306f\u3067\u305f\u3089\u3081\u3067\u3059\u3002\u5b9f\u969b\u306b\u306f\u300cWenn ist das Nunstuck git und' print repr(a.encode('utf-8', 'replace')) In Python 2.2.1 I get '\xe6\xad\xa3\xe7\xa2\xba\xe3\x81\xab\xe8\xa8\x80\xe3\x81\x86\xe3\x81\xa8\xe7\xbf\xbb\xe8\xa8\xb3\xe3\x81\xaf\xe3\x81\x95\xe3\x82\x8c\xe3\x81\xa6\xe3\x81\x84\xe3\x81\xbe\xe3\x81\x9b\xe3\x82\x93\xe3\x80\x82\xe4\xb8\x80\xe9\x83\xa8\xe3\x81\xaf\xe3\x83\x89\xe3\x82\xa4\xe3\x83\x84\xe8\xaa\x9e\xe3\x81\xa7\xe3\x81\x99\xe3\x81\x8c\xe3\x80\x81\xe3\x81\x82\xe3\x81\xa8\xe3\x81\xaf\xe3\x81\xa7\xe3\x81\x9f\xe3\x82\x89\xe3\x82\x81\xe3\x81\xa7\xe3\x81\x99\xe3\x80\x82\xe5\xae\x9f\xe9\x9a\x9b\xe3\x81\xab\xe3\x81\xaf\xe3\x80\x8cWenn ist das Nunstuck git und' but in Python 2.3 cvs I get '\xe6\xad\xa3\xe7\xa2\xba\xe3\x81\xab\xe8\xa8\x80\xe3\x81\x86\xe3\x81\xa8\xe7\xbf\xbb\xe8\xa8\xb3\xe3\x81\xaf\xe3\x81\x95\xe3\x82\x8c\xe3\x81\xa6\xe3\x81\x84\xe3\x81\xbe\xe3\x81\x9b\xe3\x82\x93\xe3\x80\x82\xe4\xb8\x80\xe9\x83\xa8\xe3\x81\xaf\xe3\x83\x89\xe3\x82\xa4\xe3\x83\x84\xe8\xaa\x9e\xe3\x81\xa7\xe3\x81\x99\xe3\x81\x8c\xe3\x80\x81\xe3\x81\x82\xe3\x81\xa8\xe3\x81\xaf\xe3\x81\xa7\xe3\x81\x9f\xe3\x82\x89\xe3\x82\x81\xe3\x81\xa7\xe3\x81\x99\xe3\x80\x82\xe5\xae\x9f\xe9\x9a\x9b\xe3\x81\xab\xe3\x81\xaf\xe3\x80\x8cWenn ist das Nunstuck git u\x00\x00' Note that the last two characters, which should be n' and d' are now NULs. My very limited Tim-enlightened understanding is that encoding a string to UTF-8 should never produce a string with NULs. Is the utf-8 encoder in cvs broken?

Some debugging with gdb indicates that the codec is indeed writing the 'nd', but the final _PyString_Resize() (which allocates a new buffer and copies the data into that buffer) fails to copy the last two characters from the string or overwrites it with NULLs.

Looks like a pymalloc problem to me. Tim ?

In any case, I'll checkin a test case for this.

-- Marc-Andre Lemburg CEO eGenix.com Software GmbH

Company & Consulting: http://www.egenix.com/ Python Software: http://www.egenix.com/files/python/

Previous message: [Python-Dev] Regression in unicodestr.encode()?
Next message: [Python-Dev] Regression in unicodestr.encode()?
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]