msg100678 - (view) |
Author: STINNER Victor (vstinner) *  |
Date: 2010-03-08 23:21 |
utf8 encoder doesn't work in backslashreplace error handler: >>> "\uDC80".encode("utf8", "backslashreplace") TypeError: error handler should have returned bytes |
|
|
msg100681 - (view) |
Author: STINNER Victor (vstinner) *  |
Date: 2010-03-08 23:23 |
See also issue #6697. |
|
|
msg100716 - (view) |
Author: Walter Dörwald (doerwalter) *  |
Date: 2010-03-09 11:55 |
After the patch the comment: /* Implementation limitations: only support error handler that return bytes, and only support up to four replacement bytes. */ no longer applies. Also I would like to see a version of this patch where the length limitation for the replacement returned from the error handler is removed (ideally for both the str and bytes case). |
|
|
msg103741 - (view) |
Author: STINNER Victor (vstinner) *  |
Date: 2010-04-20 19:38 |
New version without the hardcoded limit: don't use goto encodeUCS4;, chain if to limit indentation depth: it only costs one copy of the UCS4 (5 lines are duplicated). The buffer is now reallocated each time a surrogate escape is longer than 4 bytes. I don't know if "nallocated += repsize - 4;" can overflow or not. If yes, how can I detect the overflow? I added: /* FIXME: check integer overflow? */ |
|
|
msg103766 - (view) |
Author: Antoine Pitrou (pitrou) *  |
Date: 2010-04-20 21:09 |
> I don't know if "nallocated += repsize - 4;" can overflow or not. > If yes, how can I detect the overflow? Sure, if they are both Py_ssize_t, just use: if (nallocated > PY_SSIZE_T_MAX - repsize + 4) { /* handle overflow ... */ } |
|
|
msg103777 - (view) |
Author: STINNER Victor (vstinner) *  |
Date: 2010-04-20 21:50 |
Oh no :-( I realized that I removed the first message of this issue! . Copy/paste of the message: --- This issue is a regression introduced by r72208 to fix the issue #3672. Attached patch fixes PyUnicode_EncodeUTF8() if unicode_encode_call_errorhandler() returns an unicode string (eg. backslackreplace error handler). I don't know unicodeobject.c code (very well), and my patch should be far from being perfect. I suppose that the maximum length of an escaped characters is 8 bytes (xmlcharrefreplace error error for U+DFFFF). When the first lone surrogate is found, reallocate the buffer to size*8 bytes. The escaped character have to be an ASCII character or an UnicodeEncodeError is raised. Note: unicode_encode_ucs1() doesn't have hardcoded for the maximum length ot escaped string. Its code might be reused in PyUnicode_EncodeUTF8() to remove the hardcoded limits. --- |
|
|
msg103779 - (view) |
Author: STINNER Victor (vstinner) *  |
Date: 2010-04-20 21:59 |
Oops, I forgot the remove the reallocation in the unicode case in the patch version 2. Patch version 3: - micro-optimization: group both surrogates cases in the same if to avoid checking 0xD800 <= ch twice - check for integer overflow - (remove the duplication reallocation introduced by version 2) I think that PyUnicode_EncodeUTF8() is more readable after my patch: there maximum if depth is 2 instead of 3, and I removed the goto. It shouldn't change anything about performances for chacters < 0x800 (ASCII and Latin-1), and I expect similar performances for characters >= 0x800. |
|
|
msg103977 - (view) |
Author: STINNER Victor (vstinner) *  |
Date: 2010-04-22 19:41 |
Fixed: r80382 (py3k), r80383 (3.1). |
|
|