Issue 17828: More informative error handling when encoding and decoding

Created on 2013-04-24 14:09 by ncoghlan, last changed 2022-04-11 14:57 by admin. This issue is now closed.

Messages (27)

msg187704 - (view)

Author: Alyssa Coghlan (ncoghlan) * (Python committer)

Date: 2013-04-24 14:09

Passing the wrong types to codecs can currently lead to rather confusing exceptions, like:

====================
>>> b"ZXhhbXBsZQ==\n".decode("base64_codec")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib64/python3.2/encodings/base64_codec.py", line 20, in base64_decode
    return (base64.decodebytes(input), len(input))
  File "/usr/lib64/python3.2/base64.py", line 359, in decodebytes
    raise TypeError("expected bytes, not %s" % s.__class__.__name__)
TypeError: expected bytes, not memoryview

>>> codecs.decode("example", "utf8")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib64/python3.2/encodings/utf_8.py", line 16, in decode
    return codecs.utf_8_decode(input, errors, True)
TypeError: 'str' does not support the buffer interface
====================

This situation could be improved by having the affected APIs use the exception chaining system to wrap these errors in a more informative exception that also displays information on the codec involved. Note that UnicodeEncodeError and UnicodeDecodeError are not appropriate, as those are specific to text encoding operations, while these new wrappers will apply to arbitrary codecs, regardless of whether or not they use the unicode error handlers. Furthermore, for backwards compatibility with existing exception handling, it is probably necessary to limit ourselves to specific exception types and ensure that the wrapper exceptions are subclasses of those types.

These new wrappers would have __cause__ set to the exception raised by the codec, but emit a message more along the lines of the following:

==============
codecs.DecodeTypeError: encoding='utf8', details="TypeError: 'str' does not support the buffer interface"
==============

Wrapping TypeError and ValueError should cover most cases, which would mean four new exception types in the codecs module:

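Raised by codecs.decode, bytes.decode and bytearray.decode:

  * codecs.DecodeTypeError
  * codecs.DecodeValueError

Raised by codecs.encode, str.encode:

  * codecs.EncodeTypeError
  * codecs.EncodeValueError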

Instances of UnicodeError wouldn't be wrapped, since they already contain codec information.
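As a rough illustration of the proposal above, the wrapping could be sketched in pure Python as follows. The names `DecodeTypeError`, `DecodeValueError` and `decode_with_context` are hypothetical stand-ins for the proposed additions; nothing here exists in the codecs module, and this is only a sketch of the idea, not the patch that was eventually committed.

```python
import codecs


class DecodeTypeError(TypeError):
    """Hypothetical wrapper; subclasses TypeError so existing handlers still match."""


class DecodeValueError(ValueError):
    """Hypothetical wrapper; subclasses ValueError for the same reason."""


def decode_with_context(data, encoding):
    """Decode via codecs.decode, wrapping TypeError/ValueError with codec details."""
    try:
        return codecs.decode(data, encoding)
    except UnicodeError:
        # Already carries codec information, so it is passed through unwrapped.
        raise
    except (TypeError, ValueError) as exc:
        wrapper = DecodeTypeError if isinstance(exc, TypeError) else DecodeValueError
        details = f"{type(exc).__name__}: {exc}"
        # "from exc" sets __cause__ so the original error stays visible.
        raise wrapper(f"encoding={encoding!r}, details={details!r}") from exc
```

Because each wrapper subclasses the exception type it wraps, code that already catches TypeError or ValueError would keep working.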

msg187706 - (view)

Author: Alyssa Coghlan (ncoghlan) * (Python committer)

Date: 2013-04-24 14:15

There may also be some specific improvements to be made to str.encode, bytes.decode and bytearray.decode in relation to the additional type checks they do to enforce the appropriate input and output types (see the bizarre "expected bytes, not memoryview" example).

msg187761 - (view)

Author: Alyssa Coghlan (ncoghlan) * (Python committer)

Date: 2013-04-25 07:34

I tracked down the proximate cause of the weird exception in the bytes.decode case: the base64 module only accepts bytes and bytearray objects, instead of using memoryview to accept anything that supports the buffer API and provides a C-contiguous 8-bit view of the underlying data. Raised as issue 17839.
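The memoryview-based approach mentioned above could look roughly like the sketch below. The helper name `as_byte_view` is hypothetical and is not taken from the base64 module; it only illustrates accepting any buffer exporter and viewing it as contiguous 8-bit data.

```python
def as_byte_view(obj):
    """Return a flat, 8-bit memoryview of any object exporting the buffer API."""
    try:
        view = memoryview(obj)
    except TypeError as exc:
        raise TypeError(
            f"expected a bytes-like object, not {type(obj).__name__}"
        ) from exc
    if not view.c_contiguous:
        raise TypeError("expected a C-contiguous buffer")
    # cast("B") reinterprets the buffer as unsigned bytes without copying.
    return view.cast("B")
```

With a helper like this, bytes, bytearray and memoryview inputs are all handled the same way, while str is rejected with a message naming the offending type.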

msg187763 - (view)

Author: Alyssa Coghlan (ncoghlan) * (Python committer)

Date: 2013-04-25 07:47

Here's an example of the specific type errors raised by additional checks in the text-encoding specific methods. I believe the main improvement needed here is to mention the encoding name in the exception message:

>>> "example".encode("rot_13")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: encoder did not return a bytes object (type=str)

>>> b'BZh91AY&SY\xc1uvK\x00\x00\x01F\x80\x00\x10\x00"\x04\x00\x00\x10 \x000\xcd\x00\xc1\xa0P\xe2\xeeH\xa7\n\x12\x18.\xae\xc9`'.decode("bz2_codec")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: decoder did not return a str object (type=bytes)

msg188804 - (view)

Author: Alyssa Coghlan (ncoghlan) * (Python committer)

Date: 2013-05-10 02:42

Ezio pointed out on IRC that the extra type checks in str.encode, bytes.decode and bytearray.decode should reference the appropriate codecs module function in addition to the codec in use.

So if str.encode produces something other than bytes, it should reference codecs.encode, while the binary decoding methods should mention codecs.decode if they produce something other than str.

msg188807 - (view)

Author: Ezio Melotti (ezio.melotti) * (Python committer)

Date: 2013-05-10 03:22

The attached patch changes the error message of str.encode/bytes.decode when the codec returns the wrong type:

>>> import codecs
>>> 'example'.encode('rot_13')
TypeError: encoder returned 'str' instead of 'bytes', use codecs.decode for str->str conversions
>>> codecs.encode('example', 'rot_13')
'rknzcyr'

>>> b'000102'.decode('hex_codec')
TypeError: decoder returned 'bytes' instead of 'str', use codecs.encode for bytes->bytes conversions
>>> codecs.decode(b'000102', 'hex_codec')
b'\x00\x01\x02'

This only solves part of the problem though, because individual codecs might raise other errors if the input type is wrong:

>>> 'example'.encode('hex_codec')
Traceback (most recent call last):
  File "/home/wolf/dev/py/py3k/Lib/encodings/hex_codec.py", line 16, in hex_encode
    return (binascii.b2a_hex(input), len(input))
TypeError: 'str' does not support the buffer interface
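For reference, the distinction these messages point at can be exercised directly: the module-level codecs.encode and codecs.decode functions accept arbitrary object-to-object codecs, while the str/bytes convenience methods do not. The helper names `transform` and `text_encode_fails` below are illustrative only, and the exact exception type raised by str.encode for a non-text codec is treated as an assumption (it differs between Python versions), so both candidates are caught.

```python
import codecs


def transform(obj, codec, direction="encode"):
    """Run an arbitrary codec through the type-neutral codecs functions."""
    func = codecs.encode if direction == "encode" else codecs.decode
    return func(obj, codec)


def text_encode_fails(text, codec):
    """Return True if str.encode refuses the codec (exception type varies by version)."""
    try:
        text.encode(codec)
    except (TypeError, LookupError):
        return True
    return False


rot13_text = transform("example", "rot_13")                        # str -> str
raw_bytes = transform(b"000102", "hex_codec", direction="decode")  # bytes -> bytes
```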

msg188808 - (view)

Author: Ezio Melotti (ezio.melotti) * (Python committer)

Date: 2013-05-10 03:52

To summarize:

The things that could go wrong are:

  1. the input type is wrong (i.e. the codec doesn't accept the type of the input);
  2. the input value is invalid;
  3. for str.encode/bytes.decode only, the output type is wrong (i.e. the codec returned a non-bytes/non-str object);

My patch only covers case 3. The four new exceptions suggested by Nick would cover the first two cases. For str.encode/bytes.decode, if we knew the input type accepted by the codec we could also provide a better error message (e.g. "codecs accepts '...', not '...'; use ... instead"), but we don't.

msg188809 - (view)

Author: Ezio Melotti (ezio.melotti) * (Python committer)

Date: 2013-05-10 04:36

The attached proof of concept catches Type/ValueError in str.encode and raises another exception with a better message:

>>> 'example'.encode('hex_codec')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: invalid input type for hex_codec codec ('str' does not support the buffer interface)

(note: the patch doesn't handle the exception chaining yet and probably leaks.)

If Nick's proposal is accepted, this should become a codecs.EncodeTypeError. The same should then be done for bytes.decode and for codecs.encode/decode.

msg202125 - (view)

Author: Alyssa Coghlan (ncoghlan) * (Python committer)

Date: 2013-11-04 13:00

Updated patch. The results of this suggest to me that the input wrappers are likely infeasible at this point in time, but improving the errors for the wrong output type is entirely feasible. Since the main conversion we need to prompt is things like "binary_object.decode(binary_codec)" -> "codecs.decode(binary_object, binary_codec)", I suggest we limit the scope of this issue to that part of the problem.

>>> import codecs
>>> codecs.encode(b"hello", "bz2_codec").decode("bz2_codec")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: 'bz2_codec' decoder returned 'bytes' instead of 'str'; use codecs.decode to decode to arbitrary types

>>> "hello".encode("bz2_codec")
TypeError: 'str' does not support the buffer interface

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: invalid input type for 'bz2_codec' codec (TypeError: 'str' does not support the buffer interface)

>>> "hello".encode("rot_13")
TypeError: 'rot_13' encoder returned 'str' instead of 'bytes'; use codecs.encode to encode to arbitrary types

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: invalid input type for 'rot_13' codec (TypeError: 'rot_13' encoder returned 'str' instead of 'bytes'; use codecs.encode to encode to arbitrary types)

msg202129 - (view)

Author: Alyssa Coghlan (ncoghlan) * (Python committer)

Date: 2013-11-04 13:20

Ah, came up with a relatively simple solution based on an internal helper function with an optional output flag:

>>> import codecs
>>> codecs.encode(b"hello", "bz2_codec").decode("bz2_codec")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: 'bz2_codec' decoder returned 'bytes' instead of 'str'; use codecs.decode to decode to arbitrary types

>>> "hello".encode("bz2_codec")
TypeError: 'str' does not support the buffer interface

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: invalid input type for 'bz2_codec' codec (TypeError: 'str' does not support the buffer interface)

>>> "hello".encode("rot_13")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: 'rot_13' encoder returned 'str' instead of 'bytes'; use codecs.encode to encode to arbitrary types

msg202131 - (view)

Author: Alyssa Coghlan (ncoghlan) * (Python committer)

Date: 2013-11-04 13:27

The other thing is that this patch doesn't wrap AttributeError. I'm OK with that, since I believe the only codec in the standard library that currently throws that for a bad input type is rot_13.

msg202133 - (view)

Author: STINNER Victor (vstinner) * (Python committer)

Date: 2013-11-04 13:30

It would be simpler to just drop these custom codecs (rot13, bz2, hex, etc.) instead of helping to use them :-)

msg202143 - (view)

Author: Marc-Andre Lemburg (lemburg) * (Python committer)

Date: 2013-11-04 14:46

On 04.11.2013 14:30, STINNER Victor wrote:

> It would be simpler to just drop these custom codecs (rot13, bz2, hex, etc.) instead of helping to use them :-)

-1 for the same reasons I keep repeating over and over and over again :-)

The codec system was designed to work obj->obj. Python 3 limits the types for the bytes/str helper methods, but that limitation does not extend to the codec design.

+1 on having better error messages. In the long run, we should add supported input/output type information to codecs, so that error reporting and codec introspection becomes easier.


msg202178 - (view)

Author: Alyssa Coghlan (ncoghlan) * (Python committer)

Date: 2013-11-04 23:39

I think I figured out a better way to structure this that avoids the need for the output flag and is more easily expanded to whitelist additional exception types as safe to wrap.

I'll try to come up with a new patch tonight.

msg202211 - (view)

Author: Alyssa Coghlan (ncoghlan) * (Python committer)

Date: 2013-11-05 13:48

New and improved implementation attached that extracts the exception chaining to a helper function and calls it only when it is the call into the codecs machinery that failed (eliminating the need for the output flag, and covering decoding as well as encoding).

TypeError, AttributeError and ValueError are all wrapped with chained exceptions that mention the codec that failed.

(Annoyingly, bz2_codec throws OSError instead of ValueError for bad input data, but wrapping OSError safely is a pain due to the extra state potentially carried on instances. So letting it escape unwrapped is the simpler and more conservative option at this point.)
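The wrapping behaviour described in this message can be approximated in pure Python as shown below. This is only a sketch under stated assumptions: the attached patch implements the logic in C, and the names `WRAPPABLE`, `call_codec` and `toy_rot13_decode` are hypothetical. Matching on the exact exception type (rather than with isinstance) is a choice made here so that subclasses such as UnicodeError, and unrelated types such as OSError, pass through untouched.

```python
# Exception types considered safe to re-create with a new message (assumed whitelist).
WRAPPABLE = (TypeError, AttributeError, ValueError)


def call_codec(operation, encoding, func, data):
    """Call a codec function, chaining a more informative error on failure."""
    try:
        return func(data)
    except Exception as exc:
        if type(exc) not in WRAPPABLE:
            # e.g. OSError or UnicodeError: let it escape unwrapped.
            raise
        message = (
            f"{operation} with {encoding!r} codec failed "
            f"({type(exc).__name__}: {exc})"
        )
        # Same exception type, new message, original kept as __cause__.
        raise type(exc)(message) from exc


def toy_rot13_decode(data):
    # Stand-in for a str-only codec fed binary data: memoryview has no translate().
    return memoryview(data).translate(None)
```

Calling `call_codec("decoding", "rot_13", toy_rot13_decode, b"hello")` raises an AttributeError whose message names the codec, with the original AttributeError attached as its cause.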

>>> import codecs
>>> codecs.encode(b"hello", "bz2_codec").decode("bz2_codec")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: 'bz2_codec' decoder returned 'bytes' instead of 'str'; use codecs.decode to decode to arbitrary types

>>> b"hello".decode("rot_13")
AttributeError: 'memoryview' object has no attribute 'translate'

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AttributeError: decoding with 'rot_13' codec failed (AttributeError: 'memoryview' object has no attribute 'translate')

>>> "hello".encode("bz2_codec")
TypeError: 'str' does not support the buffer interface

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: encoding with 'bz2_codec' codec failed (TypeError: 'str' does not support the buffer interface)

>>> "hello".encode("rot_13")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: 'rot_13' encoder returned 'str' instead of 'bytes'; use codecs.encode to encode to arbitrary types

msg202212 - (view)

Author: Alyssa Coghlan (ncoghlan) * (Python committer)

Date: 2013-11-05 14:16

Checking the other binary<->binary and str<->str codecs with input type and value restrictions:

For bad value input, "uu_codec" is the only one that throws a normal ValueError. I couldn't figure out a way to get "quopri_codec" to complain about the input value, and the others throw a module-specific error:

binascii (base64_codec, hex_codec) throws binascii.Error (a custom ValueError subclass)
zlib (zlib_codec) throws zlib.error (inherits directly from Exception)

As with the OSError that escapes from bz2_codec, I think the simplest and most conservative option is to not worry about those at this point.

msg202215 - (view)

Author: Alyssa Coghlan (ncoghlan) * (Python committer)

Date: 2013-11-05 15:08

Updated patch adds systematic tests for the new error handling to test_codecs.TransformTests.

I also moved the codecs changes up to a "Codec handling improvements" section.

My rationale for doing that is that this is actually a pretty significant usability enhancement and Python 3 codec model clarification for heavy users of binary codecs coming from Python 2, and because I also plan to follow up on this issue by bringing back the shorthand aliases for these codecs that were removed in issue 10807 (thus closing issue 7475).

If issue 15216 gets finished (changing stream encodings after creation) that would also be a substantial enhancement worth mentioning here.

msg202522 - (view)

Author: Alyssa Coghlan (ncoghlan) * (Python committer)

Date: 2013-11-10 13:02

Updated patch (v5) with a more robust chaining mechanism provided as a private "_PyErr_TrySetFromCause" API. This version eliminates the previous whitelist in favour of checking directly for the ability to replace the exception with another instance of the same type without losing information.

This version also has more direct tests of the exception wrapping behaviour as a dedicated test class.

If I don't hear any objections in the next couple of days, I plan to commit this version.
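The idea of checking whether an exception can be replaced by another instance of the same type without losing information can be sketched in Python as below. This is a simplified approximation, not the C implementation of `_PyErr_TrySetFromCause`: it only inspects the arguments and the instance dictionary and then attempts the construction, so it is an assumption that these checks capture the intent. The function name `try_rewrap` is hypothetical.

```python
def try_rewrap(exc, prefix):
    """Return a same-type exception with a prefixed message, or None if unsafe."""
    # More than one argument, or a non-string argument, could not be preserved.
    if len(exc.args) > 1:
        return None
    if exc.args and not isinstance(exc.args[0], str):
        return None
    # Extra instance attributes would be silently dropped by the replacement.
    if getattr(exc, "__dict__", None):
        return None
    message = f"{prefix} ({type(exc).__name__}: {exc})"
    try:
        replacement = type(exc)(message)
    except Exception:
        # A custom constructor that rejects a single string: leave it alone.
        return None
    replacement.__cause__ = exc
    return replacement
```

A caller would raise the returned exception when one is produced and re-raise the original unchanged when the result is None. Exception types that keep extra state outside the instance dictionary are not detected by this simplified check.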

msg202524 - (view)

Author: Marc-Andre Lemburg (lemburg) * (Python committer)

Date: 2013-11-10 13:21

On 10.11.2013 14:03, Nick Coghlan wrote:

> Updated patch (v5) with a more robust chaining mechanism provided as a private "_PyErr_TrySetFromCause" API. This version eliminates the previous whitelist in favour of checking directly for the ability to replace the exception with another instance of the same type without losing information.
>
> This version also has more direct tests of the exception wrapping behaviour as a dedicated test class.
>
> If I don't hear any objections in the next couple of days, I plan to commit this version.

This doesn't look right:

diff -r 1ee45eb6aab9 Include/pyerrors.h
--- a/Include/pyerrors.h Sat Nov 09 23:15:52 2013 +0200
+++ b/Include/pyerrors.h Sun Nov 10 22:54:04 2013 +1000
...
+PyAPI_FUNC(PyObject *) _PyErr_TrySetFromCause(

BTW: Why don't we make that API a public one ? It could be useful in C extensions as well.

In the error messages, I'd use "codecs.encode()" and "codecs.decode()" (ie. with parens) instead of "codecs.encode" and "codecs.decode".

msg202528 - (view)

Author: Alyssa Coghlan (ncoghlan) * (Python committer)

Date: 2013-11-10 14:34

On 10 November 2013 23:21, Marc-Andre Lemburg <report@bugs.python.org> wrote:

> Marc-Andre Lemburg added the comment:
>
> On 10.11.2013 14:03, Nick Coghlan wrote:
>> Updated patch (v5) with a more robust chaining mechanism provided as a private "_PyErr_TrySetFromCause" API. This version eliminates the previous whitelist in favour of checking directly for the ability to replace the exception with another instance of the same type without losing information.
>>
>> This version also has more direct tests of the exception wrapping behaviour as a dedicated test class.
>>
>> If I don't hear any objections in the next couple of days, I plan to commit this version.
>
> This doesn't look right:
>
> diff -r 1ee45eb6aab9 Include/pyerrors.h
> --- a/Include/pyerrors.h Sat Nov 09 23:15:52 2013 +0200
> +++ b/Include/pyerrors.h Sun Nov 10 22:54:04 2013 +1000
> ...
> +PyAPI_FUNC(PyObject *) _PyErr_TrySetFromCause(

The signature? That API doesn't currently let you change the exception type, only the message (since the codecs machinery doesn't need to change the exception type, and changing the exception type is fraught with peril from a backwards compatibility point of view).

> BTW: Why don't we make that API a public one ? It could be useful in C extensions as well.

Because I'm not sure it's a good idea in general and hence am wary of promoting it too much at this point in time (especially given the severe limitations of what it can currently wrap). I'm convinced it's worth it in this particular case (since being told the codec involved directly makes the meaning of codec errors much clearer and even with the limitations it can still wrap most errors from standard library codecs), and the implementation has to be in exceptions.c since it pokes around comparing the exception details to the internals of BaseException to figure out if it can safely wrap the exception or not.

Issue 18861 also makes me wonder if there's an underlying structural problem in the way exception chaining currently works that could be better solved by making it possible to annotate traceback frames while unwinding the stack, which also makes me disinclined to add to the public C API in this area before 3.5.

msg202529 - (view)

Author: Alyssa Coghlan (ncoghlan) * (Python committer)

Date: 2013-11-10 14:39

On 10 November 2013 23:21, Marc-Andre Lemburg <report@bugs.python.org> wrote:

> This doesn't look right:
>
> diff -r 1ee45eb6aab9 Include/pyerrors.h
> --- a/Include/pyerrors.h Sat Nov 09 23:15:52 2013 +0200
> +++ b/Include/pyerrors.h Sun Nov 10 22:54:04 2013 +1000
> ...
> +PyAPI_FUNC(PyObject *) _PyErr_TrySetFromCause(

After sending my previous reply, I realised you may have been referring to the comment. I copied that from the PyErr_Format signature. According to http://docs.python.org/dev/c-api/unicode.html#PyUnicode_FromFormat, the format string still has to be ASCII-encoded, and if that's no longer true, it's a separate bug from this one that will require a docs fix as well.

> In the error messages, I'd use "codecs.encode()" and "codecs.decode()" (ie. with parens) instead of "codecs.encode" and "codecs.decode".

Forgot to reply to this part - I like it, will switch it over before committing.

msg202532 - (view)

Author: Marc-Andre Lemburg (lemburg) * (Python committer)

Date: 2013-11-10 15:39

On 10.11.2013 15:39, Nick Coghlan wrote:

> On 10 November 2013 23:21, Marc-Andre Lemburg <report@bugs.python.org> wrote:
>> This doesn't look right:
>>
>> diff -r 1ee45eb6aab9 Include/pyerrors.h
>> --- a/Include/pyerrors.h Sat Nov 09 23:15:52 2013 +0200
>> +++ b/Include/pyerrors.h Sun Nov 10 22:54:04 2013 +1000
>> ...
>> +PyAPI_FUNC(PyObject *) _PyErr_TrySetFromCause(
>> +    const char *prefix_format,   /* ASCII-encoded string */
>> +    ...
>> +    );

Sorry about the false warning. After looking at those lines again, I realized that the "..." is the argument ellipsis, not some omitted code. At first this looked like a function definition to me :-)

> After sending my previous reply, I realised you may have been referring to the comment. I copied that from the PyErr_Format signature. According to http://docs.python.org/dev/c-api/unicode.html#PyUnicode_FromFormat, the format string still has to be ASCII-encoded, and if that's no longer true, it's a separate bug from this one that will require a docs fix as well.

Also note that it's not clear whether the "ASCII" refers to the format string or the resulting formatted string. For the format string, ASCII would probably be fine, but for the formatted string, UTF-8 should be allowed, since it's not uncommon to add e.g. parameter strings that caused the error to the error string.

That's a separate ticket, though.

>> In the error messages, I'd use "codecs.encode()" and "codecs.decode()" (ie. with parens) instead of "codecs.encode" and "codecs.decode".
>
> Forgot to reply to this part - I like it, will switch it over before committing.

Thanks.

msg202744 - (view)

Author: Alyssa Coghlan (ncoghlan) * (Python committer)

Date: 2013-11-13 13:32

Patch for the final version that I'm about to commit.

msg202748 - (view)

Author: Roundup Robot (python-dev) (Python triager)

Date: 2013-11-13 13:51

New changeset 854a2cea31b9 by Nick Coghlan in branch 'default': Close #17828: better handling of codec errors http://hg.python.org/cpython/rev/854a2cea31b9

msg202807 - (view)

Author: Roundup Robot (python-dev) (Python triager)

Date: 2013-11-14 00:39

New changeset 99ba1772c469 by Christian Heimes in branch 'default': Issue #17828: va_start() must be accompanied by va_end() http://hg.python.org/cpython/rev/99ba1772c469

msg202811 - (view)

Author: Roundup Robot (python-dev) (Python triager)

Date: 2013-11-14 00:48

New changeset 26121ae22016 by Christian Heimes in branch 'default': Issue #17828: _PyObject_GetDictPtr() may return NULL instead of a PyObject** http://hg.python.org/cpython/rev/26121ae22016

msg202812 - (view)

Author: Christian Heimes (christian.heimes) * (Python committer)

Date: 2013-11-14 00:49

Coverity has found two issues in your patch. I have fixed them both.

History

Date | User | Action | Args
2022-04-11 14:57:44 | admin | set | github: 62028
2013-11-14 00:49:38 | christian.heimes | set | nosy: + christian.heimes; messages: +
2013-11-14 00:48:41 | python-dev | set | messages: +
2013-11-14 00:39:51 | python-dev | set | messages: +
2013-11-13 13:51:51 | python-dev | set | status: open -> closed; nosy: + python-dev; messages: +; resolution: fixed; stage: commit review -> resolved
2013-11-13 13:32:36 | ncoghlan | set | files: + issue17828_improved_codec_errors_v7.diff; messages: +; stage: needs patch -> commit review
2013-11-10 15:39:51 | lemburg | set | messages: +
2013-11-10 14:59:58 | ncoghlan | set | files: + issue17828_improved_codec_errors_v6.diff
2013-11-10 14:39:38 | ncoghlan | set | messages: +
2013-11-10 14:34:32 | ncoghlan | set | messages: +
2013-11-10 13:21:30 | lemburg | set | messages: +
2013-11-10 13:03:02 | ncoghlan | set | files: + issue17828_improved_codec_errors_v5.diff; messages: +
2013-11-05 15:08:34 | ncoghlan | set | files: + issue17828_improved_codec_errors_v4.diff; messages: +
2013-11-05 14:16:55 | ncoghlan | set | messages: +
2013-11-05 13:48:40 | ncoghlan | set | files: + issue17828_improved_codec_errors_v3.diff; messages: +
2013-11-04 23:39:59 | ncoghlan | set | assignee: ncoghlan; messages: +
2013-11-04 14:46:13 | lemburg | set | nosy: + lemburg; messages: +
2013-11-04 13:30:01 | vstinner | set | nosy: + vstinner; messages: +
2013-11-04 13:27:10 | ncoghlan | set | messages: +
2013-11-04 13:20:19 | ncoghlan | set | files: + issue17828_improved_codec_errors_v2.diff; messages: +
2013-11-04 13:00:08 | ncoghlan | set | files: + issue17828_improved_codec_errors.diff; messages: +
2013-05-10 04:36:13 | ezio.melotti | set | files: + issue17828-2.diff; messages: +
2013-05-10 03:52:40 | ezio.melotti | set | messages: +
2013-05-10 03:22:37 | ezio.melotti | set | files: + issue17828-1.diff; keywords: + patch; messages: +
2013-05-10 02:42:35 | ncoghlan | set | messages: +
2013-04-25 13:54:44 | barry | set | nosy: + barry
2013-04-25 07:47:11 | ncoghlan | set | messages: +
2013-04-25 07:34:34 | ncoghlan | set | messages: +
2013-04-24 14:24:49 | flox | set | nosy: + flox
2013-04-24 14:22:38 | ncoghlan | link | issue7475 dependencies
2013-04-24 14:15:32 | ncoghlan | set | messages: +
2013-04-24 14:11:54 | ezio.melotti | set | nosy: + ezio.melotti; type: enhancement; stage: needs patch
2013-04-24 14:09:58 | ncoghlan | create |