Issue 8092: utf8, backslashreplace and surrogates (original) (raw)

Created on 2010-03-08 23:21 by vstinner, last changed 2022-04-11 14:56 by admin. This issue is now closed.

Files
File name Uploaded Description Edit
utf8_surrogate_error-3.patch vstinner,2010-04-20 21:59
Messages (8)
msg100678 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2010-03-08 23:21
utf8 encoder doesn't work in backslashreplace error handler: >>> "\uDC80".encode("utf8", "backslashreplace") TypeError: error handler should have returned bytes
msg100681 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2010-03-08 23:23
See also issue #6697.
msg100716 - (view) Author: Walter Dörwald (doerwalter) * (Python committer) Date: 2010-03-09 11:55
After the patch the comment: /* Implementation limitations: only support error handler that return bytes, and only support up to four replacement bytes. */ no longer applies. Also I would like to see a version of this patch where the length limitation for the replacement returned from the error handler is removed (ideally for both the str and bytes case).
msg103741 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2010-04-20 19:38
New version without the hardcoded limit: don't use goto encodeUCS4;, chain if to limit indentation depth: it only costs one copy of the UCS4 (5 lines are duplicated). The buffer is now reallocated each time a surrogate escape is longer than 4 bytes. I don't know if "nallocated += repsize - 4;" can overflow or not. If yes, how can I detect the overflow? I added: /* FIXME: check integer overflow? */
msg103766 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2010-04-20 21:09
> I don't know if "nallocated += repsize - 4;" can overflow or not. > If yes, how can I detect the overflow? Sure, if they are both Py_ssize_t, just use: if (nallocated > PY_SSIZE_T_MAX - repsize + 4) { /* handle overflow ... */ }
msg103777 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2010-04-20 21:50
Oh no :-( I realized that I removed the first message of this issue! . Copy/paste of the message: --- This issue is a regression introduced by r72208 to fix the issue #3672. Attached patch fixes PyUnicode_EncodeUTF8() if unicode_encode_call_errorhandler() returns an unicode string (eg. backslackreplace error handler). I don't know unicodeobject.c code (very well), and my patch should be far from being perfect. I suppose that the maximum length of an escaped characters is 8 bytes (xmlcharrefreplace error error for U+DFFFF). When the first lone surrogate is found, reallocate the buffer to size*8 bytes. The escaped character have to be an ASCII character or an UnicodeEncodeError is raised. Note: unicode_encode_ucs1() doesn't have hardcoded for the maximum length ot escaped string. Its code might be reused in PyUnicode_EncodeUTF8() to remove the hardcoded limits. ---
msg103779 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2010-04-20 21:59
Oops, I forgot the remove the reallocation in the unicode case in the patch version 2. Patch version 3: - micro-optimization: group both surrogates cases in the same if to avoid checking 0xD800 <= ch twice - check for integer overflow - (remove the duplication reallocation introduced by version 2) I think that PyUnicode_EncodeUTF8() is more readable after my patch: there maximum if depth is 2 instead of 3, and I removed the goto. It shouldn't change anything about performances for chacters < 0x800 (ASCII and Latin-1), and I expect similar performances for characters >= 0x800.
msg103977 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2010-04-22 19:41
Fixed: r80382 (py3k), r80383 (3.1).
History
Date User Action Args
2022-04-11 14:56:58 admin set github: 52339
2010-04-22 19:41:20 vstinner set status: open -> closedresolution: fixedmessages: +
2010-04-22 10:26:00 vstinner set files: - utf8_surrogate_error-2.patch
2010-04-20 21:59:51 vstinner set files: + utf8_surrogate_error-3.patchmessages: +
2010-04-20 21:50:24 vstinner set messages: +
2010-04-20 21:09:53 pitrou set nosy: + pitroumessages: +
2010-04-20 21:08:37 vstinner set files: - utf8_surrogate_error.patch
2010-04-20 19:38:47 vstinner set files: + utf8_surrogate_error-2.patchmessages: +
2010-04-20 19:36:44 vstinner set messages: -
2010-04-20 11:16:16 vstinner link issue8242 dependencies
2010-03-09 11:55:53 doerwalter set nosy: + doerwaltermessages: +
2010-03-09 01:37:20 pitrou set versions: + Python 3.2nosy: + lemburg, loewispriority: normaltype: behaviorstage: patch review
2010-03-09 01:11:58 vstinner set files: + utf8_surrogate_error.patchkeywords: + patchmessages: +
2010-03-08 23:23:25 vstinner set messages: +
2010-03-08 23:21:05 vstinner create