Issue 17828: More informative error handling when encoding and decoding
Created on 2013-04-24 14:09 by ncoghlan, last changed 2022-04-11 14:57 by admin. This issue is now closed.
Messages (27)
Author: Alyssa Coghlan (ncoghlan) *
Date: 2013-04-24 14:09
Passing the wrong types to codecs can currently lead to rather confusing exceptions, like:
>>> b"ZXhhbXBsZQ==\n".decode("base64_codec")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib64/python3.2/encodings/base64_codec.py", line 20, in base64_decode
    return (base64.decodebytes(input), len(input))
  File "/usr/lib64/python3.2/base64.py", line 359, in decodebytes
    raise TypeError("expected bytes, not %s" % s.__class__.__name__)
TypeError: expected bytes, not memoryview

>>> codecs.decode("example", "utf8")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib64/python3.2/encodings/utf_8.py", line 16, in decode
    return codecs.utf_8_decode(input, errors, True)
TypeError: 'str' does not support the buffer interface
This situation could be improved by having the affected APIs use the exception chaining system to wrap these errors in a more informative exception that also displays information on the codec involved. Note that UnicodeEncodeError and UnicodeDecodeError are not appropriate, as those are specific to text encoding operations, while these new wrappers will apply to arbitrary codecs, regardless of whether or not they use the unicode error handlers. Furthermore, for backwards compatibility with existing exception handling, it is probably necessary to limit ourselves to specific exception types and ensure that the wrapper exceptions are subclasses of those types.
These new wrappers would have __cause__ set to the exception raised by the codec, but emit a message more along the lines of the following:
codecs.DecodeTypeError: encoding='utf8', details="TypeError: 'str' does not support the buffer interface"
Wrapping TypeError and ValueError should cover most cases, which would mean four new exception types in the codecs module:
Raised by codecs.decode, bytes.decode and bytearray.decode:
- codecs.DecodeTypeError
- codecs.DecodeValueError
Raised by codecs.encode, str.encode:
- codecs.EncodeTypeError
- codecs.EncodeValueError
Instances of UnicodeError wouldn't be wrapped, since they already contain codec information.
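A minimal sketch of how the proposed wrappers might behave, written as a standalone helper rather than a change to the codecs module. The two exception classes are defined locally purely for illustration (they reuse the names proposed above and are not assumed to exist in codecs), and decode_with_context is a hypothetical helper name.

```python
import codecs


# Illustrative stand-ins for two of the proposed exception types.
# Subclassing the builtin types keeps existing "except TypeError" and
# "except ValueError" handlers working.
class DecodeTypeError(TypeError):
    pass


class DecodeValueError(ValueError):
    pass


def _describe(encoding, exc):
    # Message shape follows the example given in the proposal.
    return 'encoding={!r}, details="{}: {}"'.format(
        encoding, type(exc).__name__, exc
    )


def decode_with_context(obj, encoding):
    """Decode obj, chaining codec failures to an exception naming the codec."""
    try:
        return codecs.decode(obj, encoding)
    except UnicodeError:
        # Already carries codec information, so it is left unwrapped.
        # This clause must come first: UnicodeError subclasses ValueError.
        raise
    except TypeError as exc:
        raise DecodeTypeError(_describe(encoding, exc)) from exc
    except ValueError as exc:
        raise DecodeValueError(_describe(encoding, exc)) from exc
```

Because the wrappers subclass the original types and set __cause__ through "raise ... from", callers that already catch TypeError or ValueError keep working, and the original exception stays available for debugging.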
Author: Alyssa Coghlan (ncoghlan) *
Date: 2013-04-24 14:15
There may also be some specific improvements to be made to str.encode, bytes.decode and bytearray.decode in relation to the additional type checks they do to enforce the appropriate input and output types (see the bizarre "expected bytes, not memoryview" example).
Author: Alyssa Coghlan (ncoghlan) *
Date: 2013-04-25 07:34
I tracked down the proximate cause of the weird exception in the bytes.decode case: the base64 module only accepts bytes and bytearray objects, instead of using memoryview to accept anything that supports the buffer API and provides a C-contiguous 8-bit view of the underlying data. Raised as issue 17839.
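The idea of accepting "anything that supports the buffer API" can be sketched with memoryview. This is only an illustration of the technique, not the change made for issue 17839; as_byte_view is a hypothetical helper name.

```python
def as_byte_view(obj):
    """Return a flat view of obj as unsigned 8-bit items.

    Accepts any object exporting a C-contiguous buffer (bytes, bytearray,
    array.array, another memoryview, ...).
    """
    view = memoryview(obj)  # raises TypeError if obj lacks the buffer API
    if not view.c_contiguous:
        raise ValueError("a C-contiguous buffer is required")
    # cast("B") reinterprets the underlying memory as plain bytes.
    return view.cast("B")
```

A function built on such a helper would reject str with a TypeError raised by memoryview itself, instead of relying on an isinstance check against bytes and bytearray.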
Author: Alyssa Coghlan (ncoghlan) *
Date: 2013-04-25 07:47
Here's an example of the specific type errors raised by additional checks in the text-encoding specific methods. I believe the main improvement needed here is to mention the encoding name in the exception message:
>>> "example".encode("rot_13")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: encoder did not return a bytes object (type=str)

>>> b'BZh91AY&SY\xc1uvK\x00\x00\x01F\x80\x00\x10\x00"\x04\x00\x00\x10 \x000\xcd\x00\xc1\xa0P\xe2\xeeH\xa7\n\x12\x18.\xae\xc9`'.decode("bz2_codec")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: decoder did not return a str object (type=bytes)
Author: Alyssa Coghlan (ncoghlan) *
Date: 2013-05-10 02:42
Ezio pointed out on IRC that the extra type checks in str.encode, bytes.decode and bytearray.decode should reference the appropriate codecs module function in addition to the codec in use.
So if str.encode produces something other than bytes, it should reference codecs.encode, while the binary decoding methods should mention codecs.decode if they produce something other than str.
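The output type check can be reproduced with a small custom codec. This is a sketch under stated assumptions: the codec name "demo_str_to_str" and the search function are made up for illustration, and the exact wording of the resulting TypeError depends on the Python version.

```python
import codecs


def _demo_search(name):
    """Codec search function for a single made-up codec name."""
    if name != "demo_str_to_str":
        return None

    def encode(text, errors="strict"):
        # Deliberately returns str rather than bytes.
        return text.upper(), len(text)

    def decode(text, errors="strict"):
        return text.lower(), len(text)

    return codecs.CodecInfo(encode, decode, name="demo_str_to_str")


codecs.register(_demo_search)

# The obj->obj function accepts whatever type the codec returns.
result = codecs.encode("example", "demo_str_to_str")

# str.encode insists on bytes and reports the mismatch as a TypeError.
try:
    "example".encode("demo_str_to_str")
except TypeError as exc:
    message = str(exc)
else:
    message = None
```

Here result holds the upper-cased string, while message holds the text of the error raised by the str.encode type check.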
Author: Ezio Melotti (ezio.melotti) *
Date: 2013-05-10 03:22
The attached patch changes the error message of str.encode/bytes.decode when the codec returns the wrong type:
>>> import codecs
>>> 'example'.encode('rot_13')
TypeError: encoder returned 'str' instead of 'bytes', use codecs.decode for str->str conversions
>>> codecs.encode('example', 'rot_13')
'rknzcyr'

>>> b'000102'.decode('hex_codec')
TypeError: decoder returned 'bytes' instead of 'str', use codecs.encode for bytes->bytes conversions
>>> codecs.decode(b'000102', 'hex_codec')
b'\x00\x01\x02'
This only solves part of the problem though, because individual codecs might raise other errors if the input type is wrong:
>>> 'example'.encode('hex_codec')
Traceback (most recent call last):
  File "/home/wolf/dev/py/py3k/Lib/encodings/hex_codec.py", line 16, in hex_encode
    return (binascii.b2a_hex(input), len(input))
TypeError: 'str' does not support the buffer interface
Author: Ezio Melotti (ezio.melotti) *
Date: 2013-05-10 03:52
To summarize:
- str.encode does only str->bytes;
- bytes.decode does only bytes->str;
- codecs.encode/decode do obj->obj;
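The three conversion paths in this summary can be illustrated as follows. The rot_13 and hex_codec results are the ones shown earlier in this thread; the exception type raised by the text-only method varies between Python versions, so the sketch catches both TypeError and LookupError rather than assuming one.

```python
import codecs

# str.encode: str -> bytes only
text_to_bytes = "example".encode("utf-8")

# bytes.decode: bytes -> str only
bytes_to_text = b"example".decode("utf-8")

# codecs.encode / codecs.decode: obj -> obj
text_to_text = codecs.encode("example", "rot_13")
bytes_to_bytes = codecs.decode(b"000102", "hex_codec")

# Asking the text-only method for a str -> str conversion fails.
try:
    "example".encode("rot_13")
except (TypeError, LookupError) as exc:
    text_method_error = exc
else:
    text_method_error = None
```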
The things that could go wrong are:
1. the input type is wrong (i.e. the codec doesn't accept the type of the input);
2. the input value is invalid;
3. for str.encode/bytes.decode only, the output type is wrong (i.e. the codec returned a non-bytes/non-str object);
My patch only covers 3. The four new exceptions suggested by Nick would cover the first 2 cases. For str.encode/bytes.decode, if we knew the input type accepted by the codec we could also provide a better error message (e.g. "codecs accepts '...', not '...'; use ... instead"), but we don't.
Author: Ezio Melotti (ezio.melotti) *
Date: 2013-05-10 04:36
The attached proof of concept catches Type/ValueError in str.encode and raises another exception with a better message:
>>> 'example'.encode('hex_codec')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: invalid input type for hex_codec codec ('str' does not support the buffer interface)
(note: the patch doesn't handle the exception chaining yet and probably leaks.)
If Nick's proposal is accepted, this should become a codecs.EncodeTypeError. The same should then be done for bytes.decode and for codecs.encode/decode.
Author: Alyssa Coghlan (ncoghlan) *
Date: 2013-11-04 13:00
Updated patch. The results of this suggest to me that the input wrappers are likely infeasible at this point in time, but improving the errors for the wrong output type is entirely feasible. Since the main conversion we need to prompt is things like "binary_object.decode(binary_codec)" -> "codecs.decode(binary_object, binary_codec)", I suggest we limit the scope of this issue to that part of the problem.
>>> import codecs
>>> codecs.encode(b"hello", "bz2_codec").decode("bz2_codec")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: 'bz2_codec' decoder returned 'bytes' instead of 'str'; use codecs.decode to decode to arbitrary types

>>> "hello".encode("bz2_codec")
TypeError: 'str' does not support the buffer interface

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: invalid input type for 'bz2_codec' codec (TypeError: 'str' does not support the buffer interface)

>>> "hello".encode("rot_13")
TypeError: 'rot_13' encoder returned 'str' instead of 'bytes'; use codecs.encode to encode to arbitrary types

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: invalid input type for 'rot_13' codec (TypeError: 'rot_13' encoder returned 'str' instead of 'bytes'; use codecs.encode to encode to arbitrary types)
Author: Alyssa Coghlan (ncoghlan) *
Date: 2013-11-04 13:20
Ah, came up with a relatively simple solution based on an internal helper function with an optional output flag:
>>> import codecs
>>> codecs.encode(b"hello", "bz2_codec").decode("bz2_codec")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: 'bz2_codec' decoder returned 'bytes' instead of 'str'; use codecs.decode to decode to arbitrary types

>>> "hello".encode("bz2_codec")
TypeError: 'str' does not support the buffer interface

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: invalid input type for 'bz2_codec' codec (TypeError: 'str' does not support the buffer interface)

>>> "hello".encode("rot_13")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: 'rot_13' encoder returned 'str' instead of 'bytes'; use codecs.encode to encode to arbitrary types
Author: Alyssa Coghlan (ncoghlan) *
Date: 2013-11-04 13:27
The other thing is that this patch doesn't wrap AttributeError. I'm OK with that, since I believe the only codec in the standard library that currently throws that for a bad input type is rot_13.
Author: STINNER Victor (vstinner) *
Date: 2013-11-04 13:30
It would be simpler to just drop these custom codecs (rot13, bz2, hex, etc.) instead of helping to use them :-)
Author: Marc-Andre Lemburg (lemburg) *
Date: 2013-11-04 14:46
On 04.11.2013 14:30, STINNER Victor wrote:
> It would be simpler to just drop these custom codecs (rot13, bz2, hex, etc.) instead of helping to use them :-)
-1 for the same reasons I keep repeating over and over and over again :-)
The codec system was designed to work obj->obj. Python 3 limits the types for the bytes/str helper methods, but that limitation does not extend to the codec design.
+1 on having better error messages. In the long run, we should add supported input/output type information to codecs, so that error reporting and codec introspection becomes easier.
Author: Alyssa Coghlan (ncoghlan) *
Date: 2013-11-04 23:39
I think I figured out a better way to structure this that avoids the need for the output flag and is more easily expanded to whitelist additional exception types as safe to wrap.
I'll try to come up with a new patch tonight.
Author: Alyssa Coghlan (ncoghlan) *
Date: 2013-11-05 13:48
New and improved implementation attached that extracts the exception chaining to a helper function and calls it only when it is the call into the codecs machinery that failed (eliminating the need for the output flag, and covering decoding as well as encoding).
TypeError, AttributeError and ValueError are all wrapped with chained exceptions that mention the codec that failed.
(Annoyingly, bz2_codec throws OSError instead of ValueError for bad input data, but wrapping OSError safely is a pain due to the extra state potentially carried on instances. So letting it escape unwrapped is the simpler and more conservative option at this point.)
>>> import codecs
>>> codecs.encode(b"hello", "bz2_codec").decode("bz2_codec")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: 'bz2_codec' decoder returned 'bytes' instead of 'str'; use codecs.decode to decode to arbitrary types

>>> b"hello".decode("rot_13")
AttributeError: 'memoryview' object has no attribute 'translate'

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AttributeError: decoding with 'rot_13' codec failed (AttributeError: 'memoryview' object has no attribute 'translate')

>>> "hello".encode("bz2_codec")
TypeError: 'str' does not support the buffer interface

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: encoding with 'bz2_codec' codec failed (TypeError: 'str' does not support the buffer interface)

>>> "hello".encode("rot_13")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: 'rot_13' encoder returned 'str' instead of 'bytes'; use codecs.encode to encode to arbitrary types
Author: Alyssa Coghlan (ncoghlan) *
Date: 2013-11-05 14:16
Checking the other binary<->binary and str<->str codecs with input type and value restrictions:
- they all throw TypeError and get wrapped appropriately when asked to encode str input (rot_13 throws the output type error)
- rot_13 throws an appropriately wrapped AttributeError when asked to decode a bytes or bytearray object
For bad value input, "uu_codec" is the only one that throws a normal ValueError. I couldn't figure out a way to get "quopri_codec" to complain about the input value, and the others throw a module-specific error:
- binascii (base64_codec, hex_codec) throws binascii.Error (a custom ValueError subclass)
- zlib (zlib_codec) throws zlib.error (inherits directly from Exception)
As with the OSError that escapes from bz2_codec, I think the simplest and most conservative option is to not worry about those at this point.
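The class relationships described above can be checked directly. This is a small sketch assuming a Python build with the zlib module available; decode_error is a hypothetical helper name, and the inputs are arbitrary invalid data chosen for illustration.

```python
import binascii
import codecs
import zlib


def decode_error(data, encoding):
    """Return the exception raised by decoding data, or None on success."""
    try:
        codecs.decode(data, encoding)
    except Exception as exc:
        return exc
    return None


# binascii.Error is a ValueError subclass; zlib.error is not.
binascii_is_value_error = issubclass(binascii.Error, ValueError)
zlib_is_value_error = issubclass(zlib.error, ValueError)

# Invalid input to the binary codecs surfaces those module-specific types.
hex_failure = decode_error(b"zz", "hex_codec")
zlib_failure = decode_error(b"not zlib data", "zlib_codec")
```

This is why a handler written as "except ValueError" catches bad hex or base64 input but lets zlib.error through.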
Author: Alyssa Coghlan (ncoghlan) *
Date: 2013-11-05 15:08
Updated patch adds systematic tests for the new error handling to test_codecs.TransformTests.
I also moved the codecs changes up to a "Codec handling improvements" section.
My rationale for doing that is that this is actually a pretty significant usability enhancement and Python 3 codec model clarification for heavy users of binary codecs coming from Python 2, and because I also plan to follow up on this issue by bringing back the shorthand aliases for these codecs that were removed in issue 10807 (thus closing issue 7475).
If issue 15216 gets finished (changing stream encodings after creation) that would also be a substantial enhancement worth mentioning here.
Author: Alyssa Coghlan (ncoghlan) *
Date: 2013-11-10 13:02
Updated patch (v5) with a more robust chaining mechanism provided as a private "_PyErr_TrySetFromCause" API. This version eliminates the previous whitelist in favour of checking directly for the ability to replace the exception with another instance of the same type without losing information.
This version also has more direct tests of the exception wrapping behaviour as a dedicated test class.
If I don't hear any objections in the next couple of days, I plan to commit this version.
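The idea of wrapping only when no information would be lost can be approximated in pure Python. This is a rough analogue, not the logic of the private C helper: the C version is described later in this thread as comparing the exception against the internals of BaseException, whereas this sketch only looks at args and instance attributes. try_wrap is a hypothetical name.

```python
def try_wrap(exc, prefix):
    """Return a same-type exception with extra context, or None.

    None means the exception carries state that a replacement built from
    a single message string could not reproduce, so the caller should
    let the original propagate unchanged.
    """
    args = exc.args
    # Only "no arguments" or "one message string" is treated as safe.
    if len(args) > 1 or (args and not isinstance(args[0], str)):
        return None
    # Extra instance attributes would be lost on replacement.
    if getattr(exc, "__dict__", None):
        return None
    message = "{} ({}: {})".format(prefix, type(exc).__name__, exc)
    try:
        wrapper = type(exc)(message)
    except Exception:
        # The type needs a different constructor signature.
        return None
    wrapper.__cause__ = exc
    return wrapper
```

A caller would raise the returned wrapper when it is not None and re-raise the original exception otherwise, so the exception type never changes and only the message gains the codec name.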
Author: Marc-Andre Lemburg (lemburg) *
Date: 2013-11-10 13:21
This doesn't look right:
diff -r 1ee45eb6aab9 Include/pyerrors.h
--- a/Include/pyerrors.h Sat Nov 09 23:15:52 2013 +0200
+++ b/Include/pyerrors.h Sun Nov 10 22:54:04 2013 +1000
...
+PyAPI_FUNC(PyObject *) _PyErr_TrySetFromCause(
+    const char *prefix_format,   /* ASCII-encoded string  */
+    ...
+    );
BTW: Why don't we make that API a public one ? It could be useful in C extensions as well.
In the error messages, I'd use "codecs.encode()" and "codecs.decode()" (ie. with parens) instead of "codecs.encode" and "codecs.decode".
Author: Alyssa Coghlan (ncoghlan) *
Date: 2013-11-10 14:34
On 10 November 2013 23:21, Marc-Andre Lemburg <report@bugs.python.org> wrote:
> This doesn't look right:
>
> diff -r 1ee45eb6aab9 Include/pyerrors.h
> --- a/Include/pyerrors.h Sat Nov 09 23:15:52 2013 +0200
> +++ b/Include/pyerrors.h Sun Nov 10 22:54:04 2013 +1000
> ...
> +PyAPI_FUNC(PyObject *) _PyErr_TrySetFromCause(
> +    const char *prefix_format,   /* ASCII-encoded string  */
> +    ...
> +    );
The signature? That API doesn't currently let you change the exception type, only the message (since the codecs machinery doesn't need to change the exception type, and changing the exception type is fraught with peril from a backwards compatibility point of view).
> BTW: Why don't we make that API a public one ? It could be useful in C extensions as well.
Because I'm not sure it's a good idea in general and hence am wary of promoting it too much at this point in time (especially given the severe limitations of what it can currently wrap). I'm convinced it's worth it in this particular case (since being told the codec involved directly makes the meaning of codec errors much clearer and even with the limitations it can still wrap most errors from standard library codecs), and the implementation has to be in exceptions.c since it pokes around comparing the exception details to the internals of BaseException to figure out if it can safely wrap the exception or not.
Issue 18861 also makes me wonder if there's an underlying structural problem in the way exception chaining currently works that could be better solved by making it possible to annotate traceback frames while unwinding the stack, which also makes me disinclined to add to the public C API in this area before 3.5.
Author: Alyssa Coghlan (ncoghlan) *
Date: 2013-11-10 14:39
On 10 November 2013 23:21, Marc-Andre Lemburg <report@bugs.python.org> wrote:
> This doesn't look right:
>
> diff -r 1ee45eb6aab9 Include/pyerrors.h
> --- a/Include/pyerrors.h Sat Nov 09 23:15:52 2013 +0200
> +++ b/Include/pyerrors.h Sun Nov 10 22:54:04 2013 +1000
> ...
> +PyAPI_FUNC(PyObject *) _PyErr_TrySetFromCause(
> +    const char *prefix_format,   /* ASCII-encoded string  */
> +    ...
> +    );
After sending my previous reply, I realised you may have been referring to the comment. I copied that from the PyErr_Format signature. According to http://docs.python.org/dev/c-api/unicode.html#PyUnicode_FromFormat, the format string still has to be ASCII-encoded, and if that's no longer true, it's a separate bug from this one that will require a docs fix as well.
> In the error messages, I'd use "codecs.encode()" and "codecs.decode()" (ie. with parens) instead of "codecs.encode" and "codecs.decode".
Forgot to reply to this part - I like it, will switch it over before committing.
Author: Marc-Andre Lemburg (lemburg) *
Date: 2013-11-10 15:39
On 10.11.2013 15:39, Nick Coghlan wrote:
> On 10 November 2013 23:21, Marc-Andre Lemburg <report@bugs.python.org> wrote:
>> This doesn't look right:
>>
>> diff -r 1ee45eb6aab9 Include/pyerrors.h
>> --- a/Include/pyerrors.h Sat Nov 09 23:15:52 2013 +0200
>> +++ b/Include/pyerrors.h Sun Nov 10 22:54:04 2013 +1000
>> ...
>> +PyAPI_FUNC(PyObject *) _PyErr_TrySetFromCause(
>> +    const char *prefix_format,   /* ASCII-encoded string  */
>> +    ...
>> +    );
Sorry about the false warning. After looking at those lines again, I realized that the "..." is the argument ellipsis, not some omitted code. At first this looked like a function definition to me :-)
> After sending my previous reply, I realised you may have been referring to the comment. I copied that from the PyErr_Format signature. According to http://docs.python.org/dev/c-api/unicode.html#PyUnicode_FromFormat, the format string still has to be ASCII-encoded, and if that's no longer true, it's a separate bug from this one that will require a docs fix as well.
Also note that it's not clear whether the "ASCII" refers to the format string or the resulting formatted string. For the format string, ASCII would probably be fine, but for the formatted string, UTF-8 should be allowed, since it's not uncommon to add e.g. parameter strings that caused the error to the error string.
That's a separate ticket, though.
>> In the error messages, I'd use "codecs.encode()" and "codecs.decode()" (ie. with parens) instead of "codecs.encode" and "codecs.decode".
> Forgot to reply to this part - I like it, will switch it over before committing.
Thanks.
Author: Alyssa Coghlan (ncoghlan) *
Date: 2013-11-13 13:32
Patch for the final version that I'm about to commit.
I realised the exception chaining would only trigger for the encode() and decode() methods, when it was equally applicable to the codecs.encode() and codecs.decode() functions, so I updated the test cases and moved it accordingly.
I also reworded the What's New text to better clarify the historical confusion around the nature of the codecs module that these changes are intended to rectify (since the intent is clear from the existence of codecs.encode and codecs.decode and their coverage in the test suite since Python 2.4).
Author: Roundup Robot (python-dev)
Date: 2013-11-13 13:51
New changeset 854a2cea31b9 by Nick Coghlan in branch 'default':
Close #17828: better handling of codec errors
http://hg.python.org/cpython/rev/854a2cea31b9
Author: Roundup Robot (python-dev)
Date: 2013-11-14 00:39
New changeset 99ba1772c469 by Christian Heimes in branch 'default':
Issue #17828: va_start() must be accompanied by va_end()
http://hg.python.org/cpython/rev/99ba1772c469
Author: Roundup Robot (python-dev)
Date: 2013-11-14 00:48
New changeset 26121ae22016 by Christian Heimes in branch 'default':
Issue #17828: _PyObject_GetDictPtr() may return NULL instead of a PyObject**
http://hg.python.org/cpython/rev/26121ae22016
Author: Christian Heimes (christian.heimes) *
Date: 2013-11-14 00:49
Coverity has found two issues in your patch. I have fixed them both.
History
Date | User | Action | Args
2022-04-11 14:57:44 | admin | set | github: 62028
2013-11-14 00:49:38 | christian.heimes | set | nosy: + christian.heimes; messages: +
2013-11-14 00:48:41 | python-dev | set | messages: +
2013-11-14 00:39:51 | python-dev | set | messages: +
2013-11-13 13:51:51 | python-dev | set | status: open -> closed; nosy: + python-dev; messages: +; resolution: fixed; stage: commit review -> resolved
2013-11-13 13:32:36 | ncoghlan | set | files: + issue17828_improved_codec_errors_v7.diff; messages: +; stage: needs patch -> commit review
2013-11-10 15:39:51 | lemburg | set | messages: +
2013-11-10 14:59:58 | ncoghlan | set | files: + issue17828_improved_codec_errors_v6.diff
2013-11-10 14:39:38 | ncoghlan | set | messages: +
2013-11-10 14:34:32 | ncoghlan | set | messages: +
2013-11-10 13:21:30 | lemburg | set | messages: +
2013-11-10 13:03:02 | ncoghlan | set | files: + issue17828_improved_codec_errors_v5.diff; messages: +
2013-11-05 15:08:34 | ncoghlan | set | files: + issue17828_improved_codec_errors_v4.diff; messages: +
2013-11-05 14:16:55 | ncoghlan | set | messages: +
2013-11-05 13:48:40 | ncoghlan | set | files: + issue17828_improved_codec_errors_v3.diff; messages: +
2013-11-04 23:39:59 | ncoghlan | set | assignee: ncoghlan; messages: +
2013-11-04 14:46:13 | lemburg | set | nosy: + lemburg; messages: +
2013-11-04 13:30:01 | vstinner | set | nosy: + vstinner; messages: +
2013-11-04 13:27:10 | ncoghlan | set | messages: +
2013-11-04 13:20:19 | ncoghlan | set | files: + issue17828_improved_codec_errors_v2.diff; messages: +
2013-11-04 13:00:08 | ncoghlan | set | files: + issue17828_improved_codec_errors.diff; messages: +
2013-05-10 04:36:13 | ezio.melotti | set | files: + issue17828-2.diff; messages: +
2013-05-10 03:52:40 | ezio.melotti | set | messages: +
2013-05-10 03:22:37 | ezio.melotti | set | files: + issue17828-1.diff; keywords: + patch; messages: +
2013-05-10 02:42:35 | ncoghlan | set | messages: +
2013-04-25 13:54:44 | barry | set | nosy: + barry
2013-04-25 07:47:11 | ncoghlan | set | messages: +
2013-04-25 07:34:34 | ncoghlan | set | messages: +
2013-04-24 14:24:49 | flox | set | nosy: + flox
2013-04-24 14:22:38 | ncoghlan | link |
2013-04-24 14:15:32 | ncoghlan | set | messages: +
2013-04-24 14:11:54 | ezio.melotti | set | nosy: + ezio.melotti; type: enhancement; stage: needs patch
2013-04-24 14:09:58 | ncoghlan | create |