Issue 16585: surrogateescape broken w/ multibytecodecs' encode
Created on 2012-11-30 20:20 by pjenvey, last changed 2022-04-11 14:57 by admin. This issue is now closed.
Messages (5) | ||
---|---|---|
msg176711 | Author: Philip Jenvey (pjenvey) * |
Date: 2012-11-30 20:20 |
surrogateescape claims to be "implemented by all standard Python codecs"
http://docs.python.org/3/library/codecs.html#codec-base-classes

However, it fails with multibytecodecs on encode:

Python 3.2.3+ (3.2:eb999002916c, Oct 26 2012, 16:11:03)
[GCC 4.2.1 (Apple Inc. build 5666) (dot 3)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> "\u30fb".encode('gb18030')
b'\x819\xa79'
>>> "\u30fb\udc80".encode('gb18030', 'surrogateescape')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: encoding error handler must return (unicode, int) tuple

The problem is that multibytecodec.c forces error handler return results to always be unicode, and surrogateescape returns bytes here. (surrogatepass similarly returns bytes, but it claims to be utf-8 only.)

The error handler spec seems to imply that error handlers should always return unicode, because "The encoder will encode the replacement"
http://docs.python.org/3/library/codecs.html#codecs.register_error
but obviously that's not really the case: some codecs special-case bytes results and copy them directly to the output, e.g.:
http://hg.python.org/cpython/file/ce3f0399ea33/Objects/unicodeobject.c#l6305 | ||
msg176717 | Author: Benjamin Peterson (benjamin.peterson) * |
Date: 2012-11-30 20:50 |
Codecs should be fixed to accept bytes from the error handler, and the definition in the docs should be loosened. Returning bytes seems to be useful. | ||
msg176780 | Author: Walter Dörwald (doerwalter) * |
Date: 2012-12-02 10:38 |
And returning bytes is documented in PEP 383, as an extension to the PEP 293 machinery:

"""To convert non-decodable bytes, a new error handler ([2]) "surrogateescape" is introduced, which produces these surrogates. On encoding, the error handler converts the surrogate back to the corresponding byte. This error handler will be used in any API that receives or produces file names, command line arguments, or environment variables.

The error handler interface is extended to allow the encode error handler to return byte strings immediately, in addition to returning Unicode strings which then get encoded again (also see the discussion below).""" | ||
msg176798 | Author: Roundup Robot (python-dev) |
Date: 2012-12-02 16:21 |
New changeset 5c88c72dec60 by Benjamin Peterson in branch '3.3':
support encoding error handlers that return bytes (closes #16585)
http://hg.python.org/cpython/rev/5c88c72dec60

New changeset 2181c37977d3 by Benjamin Peterson in branch 'default':
merge 3.3 (#16585)
http://hg.python.org/cpython/rev/2181c37977d3 | ||
msg176799 | Author: Roundup Robot (python-dev) |
Date: 2012-12-02 16:33 |
New changeset 777aabdff35a by Benjamin Peterson in branch '3.3':
document that encoding error handlers may return bytes (#16585)
http://hg.python.org/cpython/rev/777aabdff35a |
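
For anyone landing on this issue later, here is a minimal sketch of the behavior after the fix, assuming an interpreter that includes changeset 5c88c72dec60 (Python 3.3 or later). It re-runs the encode call from the original report and also registers a custom error handler that returns bytes. The handler name `demo.question` and the function `replace_with_question_mark` are invented for this illustration and are not part of the standard library.

```python
import codecs


def replace_with_question_mark(exc):
    """Hypothetical encode error handler that returns bytes, not str."""
    if not isinstance(exc, UnicodeEncodeError):
        raise exc
    # One b"?" per unencodable character; encoding resumes at exc.end.
    return b"?" * (exc.end - exc.start), exc.end


# "demo.question" is an arbitrary name chosen for this example.
codecs.register_error("demo.question", replace_with_question_mark)

text = "\u30fb\udc80"  # the string from the original report

# Built-in handler: the lone surrogate U+DC80 is turned back into byte 0x80.
escaped = text.encode("gb18030", "surrogateescape")

# Custom handler: the bytes it returns are copied into the output unchanged.
replaced = text.encode("gb18030", "demo.question")

print(escaped)
print(replaced)
```

On an interpreter without the fix, both encode calls would be expected to raise the TypeError shown in the first message, because the multibyte codecs rejected any handler result that was not a (unicode, int) tuple.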
History | |||
---|---|---|---|
Date | User | Action | Args |
2022-04-11 14:57:38 | admin | set | github: 60789 |
2012-12-02 16:33:24 | python-dev | set | messages: + |
2012-12-02 16:21:14 | python-dev | set | status: open -> closed; nosy: + python-dev; messages: +; resolution: fixed; stage: needs patch -> resolved |
2012-12-02 12:04:01 | pitrou | set | assignee: docs@python -> (none); components: + Library (Lib), - Documentation, Interpreter Core, Unicode |
2012-12-02 10:38:36 | doerwalter | set | nosy: + doerwalter; messages: + |
2012-11-30 21:29:14 | serhiy.storchaka | set | assignee: docs@python; nosy: + docs@python; components: + Documentation; stage: needs patch |
2012-11-30 20:50:11 | benjamin.peterson | set | messages: + |
2012-11-30 20:28:55 | serhiy.storchaka | set | nosy: + lemburg, pitrou, vstinner, benjamin.peterson, ezio.melotti, serhiy.storchaka; type: behavior; components: + Unicode; versions: + Python 3.4 |
2012-11-30 20:20:22 | pjenvey | create |