Issue 16585: surrogateescape broken w/ multibytecodecs' encode

Created on 2012-11-30 20:20 by pjenvey, last changed 2022-04-11 14:57 by admin. This issue is now closed.

Messages (5)
msg176711 - Author: Philip Jenvey (pjenvey) * (Python committer) Date: 2012-11-30 20:20
surrogateescape claims to be "implemented by all standard Python codecs":
http://docs.python.org/3/library/codecs.html#codec-base-classes

However, it fails with multibytecodecs on encode:

Python 3.2.3+ (3.2:eb999002916c, Oct 26 2012, 16:11:03)
[GCC 4.2.1 (Apple Inc. build 5666) (dot 3)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> "\u30fb".encode('gb18030')
b'\x819\xa79'
>>> "\u30fb\udc80".encode('gb18030', 'surrogateescape')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: encoding error handler must return (unicode, int) tuple

The problem is that multibytecodec.c forces error handler results to always be unicode, and surrogateescape returns bytes here. (surrogatepass similarly returns bytes, but it claims to be utf-8 only.)

The error handler spec seems to imply that error handlers should always return unicode, because "The encoder will encode the replacement":
http://docs.python.org/3/library/codecs.html#codecs.register_error

But that is not really the case: some codecs special-case bytes results and copy them directly to the output, e.g.:
http://hg.python.org/cpython/file/ce3f0399ea33/Objects/unicodeobject.c#l6305
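
The difference between a handler that returns unicode and one that returns bytes can be seen from Python itself. The following is a minimal sketch, not code from the issue: it assumes Python 3.3 or later, uses the ascii codec as one example of a codec that copies bytes results straight to the output, and registers two hypothetical handler names (`demo.as_text` and `demo.as_bytes`) purely for illustration.

```python
import codecs


def as_text(exc):
    # Hypothetical handler: returns str, which the codec then encodes itself.
    if not isinstance(exc, UnicodeEncodeError):
        raise exc
    return ("?", exc.end)


def as_bytes(exc):
    # Hypothetical handler: returns bytes, which the codec copies verbatim.
    # 0xFF is a byte the ascii codec could never produce by encoding text.
    if not isinstance(exc, UnicodeEncodeError):
        raise exc
    return (b"\xff", exc.end)


codecs.register_error("demo.as_text", as_text)
codecs.register_error("demo.as_bytes", as_bytes)

text_result = "a\u30fbb".encode("ascii", "demo.as_text")
bytes_result = "a\u30fbb".encode("ascii", "demo.as_bytes")

print(text_result)   # replacement went through the encoder
print(bytes_result)  # replacement bypassed the encoder
```

In the second call the replacement never passes through the ascii encoder, which is the behaviour surrogateescape relies on when it hands back the original byte.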
msg176717 - Author: Benjamin Peterson (benjamin.peterson) * (Python committer) Date: 2012-11-30 20:50
Codecs should be fixed to accept bytes from the error handler and the definition in the docs loosened. Returning bytes seems to be useful.
msg176780 - Author: Walter Dörwald (doerwalter) * (Python committer) Date: 2012-12-02 10:38
And returning bytes is documented in PEP 383, as an extension to the PEP 293 machinery:

"""To convert non-decodable bytes, a new error handler ([2]) "surrogateescape" is introduced, which produces these surrogates. On encoding, the error handler converts the surrogate back to the corresponding byte. This error handler will be used in any API that receives or produces file names, command line arguments, or environment variables.

The error handler interface is extended to allow the encode error handler to return byte strings immediately, in addition to returning Unicode strings which then get encoded again (also see the discussion below)."""
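
The round trip that the PEP 383 passage describes can be sketched as follows. The byte string is a made-up example (a Latin-1 encoded file name that is not valid UTF-8); only the built-in `surrogateescape` handler and the utf-8 codec are used.

```python
# 0xE9 is not valid UTF-8 in this position, so a strict decode would fail.
raw = b"caf\xe9.txt"

# Decoding maps the undecodable byte 0xE9 to the lone surrogate U+DCE9.
text = raw.decode("utf-8", "surrogateescape")

# Encoding converts that surrogate back to the original byte.
restored = text.encode("utf-8", "surrogateescape")

print(ascii(text))
print(restored == raw)
```

On the encode side the handler returns the byte itself rather than a string to be re-encoded, which is the interface extension the quoted text refers to.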
msg176798 - Author: Roundup Robot (python-dev) (Python triager) Date: 2012-12-02 16:21
New changeset 5c88c72dec60 by Benjamin Peterson in branch '3.3':
support encoding error handlers that return bytes (closes #16585)
http://hg.python.org/cpython/rev/5c88c72dec60

New changeset 2181c37977d3 by Benjamin Peterson in branch 'default':
merge 3.3 (#16585)
http://hg.python.org/cpython/rev/2181c37977d3
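
A quick way to check the fix is to rerun the failing call from the original report. This is a sketch assuming an interpreter that includes the changesets above (3.3 or later); on an unfixed 3.2 build the second call raises the TypeError shown in the report.

```python
# The encodable character on its own, as in the original report.
plain = "\u30fb".encode("gb18030")

# The same character followed by a lone surrogate that stands for byte 0x80.
escaped = "\u30fb\udc80".encode("gb18030", "surrogateescape")

# With the fix, the escaped byte is appended unchanged to the encoded text.
print(escaped == plain + b"\x80")
```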
msg176799 - Author: Roundup Robot (python-dev) (Python triager) Date: 2012-12-02 16:33
New changeset 777aabdff35a by Benjamin Peterson in branch '3.3':
document that encoding error handlers may return bytes (#16585)
http://hg.python.org/cpython/rev/777aabdff35a
History
Date User Action Args
2022-04-11 14:57:38 admin set github: 60789
2012-12-02 16:33:24 python-dev set messages: +
2012-12-02 16:21:14 python-dev set status: open -> closed; nosy: + python-dev; messages: +; resolution: fixed; stage: needs patch -> resolved
2012-12-02 12:04:01 pitrou set assignee: docs@python -> (none); components: + Library (Lib), - Documentation, Interpreter Core, Unicode
2012-12-02 10:38:36 doerwalter set nosy: + doerwalter; messages: +
2012-11-30 21:29:14 serhiy.storchaka set assignee: docs@python; nosy: + docs@python; components: + Documentation; stage: needs patch
2012-11-30 20:50:11 benjamin.peterson set messages: +
2012-11-30 20:28:55 serhiy.storchaka set nosy: + lemburg, pitrou, vstinner, benjamin.peterson, ezio.melotti, serhiy.storchaka; type: behavior; components: + Unicode; versions: + Python 3.4
2012-11-30 20:20:22 pjenvey create