Issue 541828: Regression in unicodestr.encode()

Created on 2002-04-10 01:56 by barry, last changed 2022-04-10 16:05 by admin. This issue is now closed.

Files
File name	Uploaded	Description
unicode.diff	loewis,2002-04-10 18:07

Messages (7)
msg10231 - (view)	Author: Barry A. Warsaw (barry) *	Date: 2002-04-10 01:56
I'm porting over the latest email package to Python 2.3cvs, and I've had one of my tests fail. I've narrowed it down to the following test case:
    a = u'\u6b63\u78ba\u306b\u8a00\u3046\u3068\u7ffb\u8a33\u306f\u3055\u308c\u3066\u3044\u307e\u305b\u3093\u3002\u4e00\u90e8\u306f\u30c9\u30a4\u30c4\u8a9e\u3067\u3059\u304c\u3001\u3042\u3068\u306f\u3067\u305f\u3089\u3081\u3067\u3059\u3002\u5b9f\u969b\u306b\u306f\u300cWenn ist das Nunstuck git und'
    print repr(a.encode('utf-8', 'replace'))
In Python 2.2.1 I get
    '\xe6\xad\xa3\xe7\xa2\xba\xe3\x81\xab\xe8\xa8\x80\xe3\x81\x86\xe3\x81\xa8\xe7\xbf\xbb\xe8\xa8\xb3\xe3\x81\xaf\xe3\x81\x95\xe3\x82\x8c\xe3\x81\xa6\xe3\x81\x84\xe3\x81\xbe\xe3\x81\x9b\xe3\x82\x93\xe3\x80\x82\xe4\xb8\x80\xe9\x83\xa8\xe3\x81\xaf\xe3\x83\x89\xe3\x82\xa4\xe3\x83\x84\xe8\xaa\x9e\xe3\x81\xa7\xe3\x81\x99\xe3\x81\x8c\xe3\x80\x81\xe3\x81\x82\xe3\x81\xa8\xe3\x81\xaf\xe3\x81\xa7\xe3\x81\x9f\xe3\x82\x89\xe3\x82\x81\xe3\x81\xa7\xe3\x81\x99\xe3\x80\x82\xe5\xae\x9f\xe9\x9a\x9b\xe3\x81\xab\xe3\x81\xaf\xe3\x80\x8cWenn ist das Nunstuck git und'
but in Python 2.3 cvs I get
    '\xe6\xad\xa3\xe7\xa2\xba\xe3\x81\xab\xe8\xa8\x80\xe3\x81\x86\xe3\x81\xa8\xe7\xbf\xbb\xe8\xa8\xb3\xe3\x81\xaf\xe3\x81\x95\xe3\x82\x8c\xe3\x81\xa6\xe3\x81\x84\xe3\x81\xbe\xe3\x81\x9b\xe3\x82\x93\xe3\x80\x82\xe4\xb8\x80\xe9\x83\xa8\xe3\x81\xaf\xe3\x83\x89\xe3\x82\xa4\xe3\x83\x84\xe8\xaa\x9e\xe3\x81\xa7\xe3\x81\x99\xe3\x81\x8c\xe3\x80\x81\xe3\x81\x82\xe3\x81\xa8\xe3\x81\xaf\xe3\x81\xa7\xe3\x81\x9f\xe3\x82\x89\xe3\x82\x81\xe3\x81\xa7\xe3\x81\x99\xe3\x80\x82\xe5\xae\x9f\xe9\x9a\x9b\xe3\x81\xab\xe3\x81\xaf\xe3\x80\x8cWenn ist das Nunstuck git u\x00\x00'
Note that the last two characters, which should be `n' and `d', are now NULs. My very limited Tim-enlightened understanding is that encoding a string to UTF-8 should never produce a string with NULs.
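A minimal Python 3 sketch of the check this report implies: UTF-8 output should contain a NUL byte only if the input itself contains U+0000. The helper name `encoded_has_spurious_nul` and the shortened sample string are assumptions made for illustration; they are not taken from the email package tests.

```python
# Python 3 sketch. The helper name and the shortened sample are illustrative
# assumptions, not code from the email package or from CPython's test suite.

def encoded_has_spurious_nul(text: str) -> bool:
    """Return True if UTF-8 encoding yields a NUL byte the input did not contain."""
    encoded = text.encode("utf-8", "replace")
    return b"\x00" in encoded and "\x00" not in text


# A run of 3-byte characters followed by ASCII, the same shape as the report.
sample = "\u6b63\u78ba\u306b\u8a00\u3046\u3068" * 5 + "Wenn ist das Nunstuck git und"
encoded = sample.encode("utf-8", "replace")

# The tail must survive intact and no NUL may appear.
assert encoded.endswith(b"git und")
assert not encoded_has_spurious_nul(sample)
```

A check like this catches the symptom only; it says nothing about where the codec wrote out of bounds.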
msg10232 - (view)	Author: Martin v. Löwis (loewis) *	Date: 2002-04-10 18:07
It appears that cbWritten can still run above cbAllocated, namely if a long sequence of 3-byte characters is followed by a long sequence of 1-byte or 2-byte characters. I'm still in favour of dropping the resizing of the result string and computing the number of bytes in a first pass. The code becomes clearer and faster that way; see the attached unicode.diff.
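The "compute the size first" idea can be sketched in Python 3 as follows. This only illustrates the two-pass technique; it is not the contents of unicode.diff, and the names `utf8_width`, `utf8_encoded_size` and `encode_two_pass` are hypothetical. The sketch assumes the input contains no lone surrogates.

```python
# Two-pass UTF-8 encoding sketch: pass 1 computes the exact output size,
# pass 2 fills a buffer of exactly that size, so no resize is ever needed.
# All names here are hypothetical; this is not the patch from the issue.

LEAD_MARKER = {2: 0xC0, 3: 0xE0, 4: 0xF0}  # lead-byte prefix per sequence length


def utf8_width(code_point: int) -> int:
    """Number of bytes UTF-8 needs for one code point."""
    if code_point < 0x80:
        return 1
    if code_point < 0x800:
        return 2
    if code_point < 0x10000:
        return 3
    return 4


def utf8_encoded_size(text: str) -> int:
    """First pass: exact size of the encoded result."""
    return sum(utf8_width(ord(ch)) for ch in text)


def encode_two_pass(text: str) -> bytes:
    """Second pass: write into a buffer allocated at its final size."""
    out = bytearray(utf8_encoded_size(text))
    pos = 0
    for ch in text:
        cp = ord(ch)
        width = utf8_width(cp)
        if width == 1:
            out[pos] = cp
        else:
            # Lead byte carries the top payload bits.
            out[pos] = LEAD_MARKER[width] | (cp >> (6 * (width - 1)))
            # Each continuation byte carries the next 6 bits.
            for k in range(1, width):
                shift = 6 * (width - 1 - k)
                out[pos + k] = 0x80 | ((cp >> shift) & 0x3F)
        pos += width
    assert pos == len(out)  # the buffer is filled exactly, never overrun
    return bytes(out)
```

The trade-off debated in the thread is that this walks the input twice, whereas the resizing approach walks it once and pays for a final shrink.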
msg10233 - (view)	Author: Marc-Andre Lemburg (lemburg) *	Date: 2002-04-10 18:53
I'm not in favour of the precomputation. We already had a discussion about the performance of this. About the cbWritten thingie: that was your invention, IIRC :-) I'll try ripping that bit out again and use pointer arithmetic instead. Still, I believe the real cause of the problem is in pymalloc, since a debugging session indicated that the codec did write the 'n' and 'd' characters. It's the final _PyString_Resize() which causes these to be dropped during the copying of the memory block.
msg10234 - (view)	Author: Marc-Andre Lemburg (lemburg) *	Date: 2002-04-10 20:37
Fix checked in. It probably does not apply to the 2.2.1 branch, since that branch uses a different technique.
msg10235 - (view)	Author: Marc-Andre Lemburg (lemburg) *	Date: 2002-04-10 20:50
Just confirmed: Python 2.2.1 definitely doesn't have this problem.
msg10236 - (view)	Author: Martin v. Löwis (loewis) *	Date: 2002-04-10 21:36
There is no bug in pymalloc. The codec wrote beyond the end of the allocated buffer, which causes undefined behaviour. The malloc implementation could not possibly know that the data extends beyond the space it provided to the application.
Python 2.2 suffers from the same problem: if you have a string of 10 characters, it will allocate 30 bytes. In UCS4 mode, if the first 6 characters each consume 4 bytes, this will consume 24 bytes, leaving 6 bytes (resizing would only be triggered if 4 bytes or less were left). Now, if the remaining 4 characters each consume 2 bytes, the total size written will be 32 bytes, causing a write of 2 bytes into unallocated memory. So this is the same problem.
About cbWritten: it was introduced in unicodeobject.c 2.41, where the checkin message says "New surrogate support in the UTF-8 codec. By Bill Tutt." So I'd challenge the claim that this is my doing.
As for computing the size in advance: your arguments on performance are not convincing, since your measurements were flawed.
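The arithmetic in the Python 2.2 example can be checked with a few lines of Python. This is only a sketch of the byte counting in the message above, assuming the 3-bytes-per-character initial allocation it describes; it does not model the codec's actual resize logic, and `overrun_bytes` is a hypothetical helper.

```python
# Byte-count check for the example above. Assumes an initial allocation of
# 3 bytes per input character, as stated in the message. Hypothetical helper.

def overrun_bytes(widths, bytes_per_char=3):
    """Bytes written past a buffer sized at bytes_per_char per character."""
    allocated = bytes_per_char * len(widths)
    return max(0, sum(widths) - allocated)


widths = [4] * 6 + [2] * 4                      # six 4-byte chars, four 2-byte chars
allocated = 3 * len(widths)                     # 30 bytes
left_after_six = allocated - sum(widths[:6])    # 6 bytes left, more than 4
total_written = sum(widths)                     # 32 bytes

print(allocated, left_after_six, total_written, overrun_bytes(widths))
# prints: 30 6 32 2
```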
msg10237 - (view)	Author: Tim Peters (tim.peters) *	Date: 2002-04-10 22:22
Note that the debug-build pymalloc does catch the overwrite, and complains about it as soon as the fatal realloc is entered. Unfortunately, the overwrite was so bad that it also destroyed the "serial number" info the debug pymalloc tried to display in its error report.
I agree Martin didn't introduce cbWritten (BTW, that kind of Hungarian naming is a sure sign that someone at Microsoft introduced it), but I don't care where it came from. What I do care about is that there weren't (and still aren't) asserts verifying that this delicate code isn't spilling over the allocated bounds.
About timing: last time we went around on this, the "measure once, cut once" version of the code was significantly slower in my timing tests too. I don't care so much if the code is tricky, but the trickier the code, the more asserts are required.
Note that pymalloc's realloc still doesn't give memory back when a small block is realloc'ed to a smaller size. That makes the current method enjoy a speed advantage (at the expense of using more memory) in the usual cases today, but this special advantage may not persist.
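The kind of bounds assertion asked for above can be sketched in Python 3 as a writer that checks every write against the allocated size. This illustrates the idea only; the real codec is C code in unicodeobject.c, and the `BoundedBuffer` class is a hypothetical stand-in for it.

```python
# Sketch of a bounds-checked output buffer. BoundedBuffer is a hypothetical
# class for illustration; it is not part of CPython.

class BoundedBuffer:
    def __init__(self, capacity: int):
        self._data = bytearray(capacity)
        self._written = 0

    def write(self, chunk: bytes) -> None:
        end = self._written + len(chunk)
        # Explicit check rather than an assert statement, so it still fires
        # when Python runs with -O.
        if end > len(self._data):
            raise AssertionError(
                "write of %d bytes would overrun buffer (%d of %d used)"
                % (len(chunk), self._written, len(self._data))
            )
        self._data[self._written:end] = chunk
        self._written = end

    def getvalue(self) -> bytes:
        return bytes(self._data[:self._written])


buf = BoundedBuffer(4)
buf.write(b"ab")
buf.write(b"cd")      # fills the buffer exactly
try:
    buf.write(b"e")   # one byte too many: reported instead of corrupting memory
except AssertionError as exc:
    print("caught:", exc)
```

With a check like this, the overrun is reported at the offending write rather than surfacing later inside realloc.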

History
Date	User	Action	Args
2022-04-10 16:05:12	admin	set	github: 36404
2002-04-10 01:56:11	barry	create