Issue 998993: Decoding incomplete unicode

Created on 2004-07-27 20:35 by doerwalter, last changed 2022-04-11 14:56 by admin. This issue is now closed.

Files
File name    Uploaded
diff.txt     doerwalter, 2004-07-27 20:35
diff2.txt    doerwalter, 2004-08-10 19:22
diff3.txt    doerwalter, 2004-08-24 20:02
diff4.txt    doerwalter, 2004-08-27 19:00
Messages (9)
msg46471 - (view) Author: Walter Dörwald (doerwalter) * (Python committer) Date: 2004-07-27 20:35
Python's unicode machinery currently has problems when decoding incomplete input. When codecs.StreamReader.read() encounters a decoding error it reads more bytes from the input stream and retries decoding. This is broken for two reasons:

1) The error might be due to a malformed byte sequence in the input, a problem that can't be fixed by reading more bytes.
2) There may be no more bytes available at this time. Once more data is available decoding can't continue because bytes from the input stream have already been read and thrown away.

(sio.DecodingInputFilter has the same problems.)

To fix this, three changes are required:

a) We need stateful versions of the decoding functions that don't raise "truncated data" exceptions, but decode as much as possible and return the position where decoding stopped.
b) The StreamReader classes need to use those stateful versions of the decoding functions.
c) codecs.StreamReader needs to keep an internal buffer with the bytes read from the input stream that haven't been decoded into unicode yet.

For a) the Python API already exists: all decoding functions in the codecs module return a tuple containing the decoded unicode object and the number of bytes consumed. But this functionality isn't implemented in the decoders: codecs.utf_8_decode(u"aä".encode("utf-8")[:-1]) raises an exception instead of returning (u"a", 1). This can be fixed by extending the UTF-8 and UTF-16 decoding functions like this:

    PyObject *PyUnicode_DecodeUTF8Stateful(
        const char *s,
        int size,
        const char *errors,
        int *consumed)

If consumed == NULL, PyUnicode_DecodeUTF8Stateful() behaves like PyUnicode_DecodeUTF8() (i.e. it raises a "truncated data" exception). If consumed != NULL, it decodes as much as possible (raising exceptions for invalid byte sequences) and puts the number of bytes consumed into *consumed. Additionally, for UTF-7 we need to pass the decoder state around.
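The behaviour requested in a) can be illustrated with a small sketch. It assumes a current Python 3 interpreter (not the 2.x code this report was filed against), where the decoding functions in the codecs module accept an optional final flag that defaults to False. The variable names are illustrative only.

```python
import codecs

# "aä" encoded as UTF-8 is b"a\xc3\xa4"; dropping the last byte leaves
# an incomplete two-byte sequence at the end of the input.
truncated = "aä".encode("utf-8")[:-1]

# With final left at its default (False) the decoder stops before the
# incomplete sequence and reports how many bytes it consumed.
text, consumed = codecs.utf_8_decode(truncated)
leftover = truncated[consumed:]  # bytes to prepend to the next chunk

# The same applies to UTF-16: three bytes are one and a half code units.
truncated16 = "aä".encode("utf-16-le")[:-1]
text16, consumed16 = codecs.utf_16_le_decode(truncated16)

# A malformed byte is still an error, because more input cannot repair it.
try:
    codecs.utf_8_decode(b"a\xff")
    malformed_raised = False
except UnicodeDecodeError:
    malformed_raised = True
```

The caller is expected to keep leftover and prepend it to the next chunk of input, which is the job of the buffer described in c).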
An implementation of c) looks like this:

    def read(self, size=-1):
        if size < 0:
            data = self.bytebuffer + self.stream.read()
        else:
            data = self.bytebuffer + self.stream.read(size)
        (object, decodedbytes) = self.decode(data, self.errors)
        self.bytebuffer = data[decodedbytes:]
        return object

Unfortunately this changes the semantics: read() might return an empty string even if more data is available. But this can be fixed if we continue reading until at least one character is available.

The patch implements a few additional features:

* read() has an additional argument chars that can be used to specify the number of characters that should be returned.
* readline() is supported on all readers derived from codecs.StreamReader().
* readline() and readlines() have an additional option for dropping the u"\n".

The patch is still missing changes to the escape codecs ("unicode_escape" and "raw_unicode_escape"), but it has test cases that check the new functionality for all affected codecs (UTF-7, UTF-8, UTF-16, UTF-16-LE, UTF-16-BE).
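The "continue reading until at least one character is available" fix can be sketched as a standalone class. This is a minimal illustration under stated assumptions, not the code from the patch: BufferingReader is a hypothetical name, it targets Python 3, and it relies on codecs.utf_8_decode reporting the number of bytes it consumed.

```python
import codecs
import io


class BufferingReader:
    """Hypothetical reader that keeps undecoded bytes between read() calls."""

    def __init__(self, stream, decode=codecs.utf_8_decode, errors="strict"):
        self.stream = stream
        self.decode = decode
        self.errors = errors
        self.bytebuffer = b""

    def read(self, size=-1):
        while True:
            newdata = self.stream.read() if size < 0 else self.stream.read(size)
            data = self.bytebuffer + newdata
            text, decodedbytes = self.decode(data, self.errors)
            # Keep the undecoded tail for the next attempt.
            self.bytebuffer = data[decodedbytes:]
            # Stop once a character is available or the stream has no more data.
            if text or not newdata:
                return text


# Read one byte at a time so that every multi-byte character is split.
reader = BufferingReader(io.BytesIO("aä€".encode("utf-8")))
chunks = []
while True:
    chunk = reader.read(1)
    if not chunk:
        break
    chunks.append(chunk)
```

In this sketch an empty return value means the stream delivered no new bytes. Any incomplete trailing sequence stays in bytebuffer rather than raising an error.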
msg46472 - (view) Author: Marc-Andre Lemburg (lemburg) * (Python committer) Date: 2004-07-27 20:56
Walter, I think you should split this into multiple feature requests.

First of all, I agree that the current situation with StreamReader on malformed input is not really ideal. However, I don't think we need to fix anything in terms of adding new interfaces. Also, introducing state at the encode/decode level breaks the design of the codecs functions; only StreamReader/Writer may maintain state.

Now, the situation is not that bad though: the case of a codec continuing as far as possible and then returning the decoded data as well as the number of bytes consumed is basically just another error handling scheme. Let's call it "break". If errors is set to "break", the codec will stop decoding/encoding and return the coded data as well as the number of input characters consumed. You could then use this scheme in the StreamWriter/Reader to implement the "read as far as possible" scheme.
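The proposed "break" scheme can be approximated in a few lines. No error handler named "break" exists in the codecs module, so this sketch emulates the idea with a hypothetical helper, decode_break, that catches UnicodeDecodeError and uses its start attribute. It assumes Python 3.

```python
def decode_break(data, encoding="utf-8"):
    """Hypothetical stand-in for errors="break": stop at the first error.

    Returns the text decoded so far and the number of bytes consumed.
    """
    try:
        return data.decode(encoding), len(data)
    except UnicodeDecodeError as exc:
        # exc.start is the offset of the first byte that could not be decoded.
        good = data[: exc.start]
        return good.decode(encoding), exc.start


truncated_result = decode_break(b"a\xc3")   # input cut off mid-character
malformed_result = decode_break(b"a\xffb")  # input containing an invalid byte
```

In this sketch a truncated sequence and a malformed byte produce the same kind of result, so a caller that cares about the difference would have to tell them apart some other way.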
msg46473 - (view) Author: Martin v. Löwis (loewis) * (Python committer) Date: 2004-07-27 21:11
Marc-Andre, can you please specifically point to the places in the patch where it violates the principles you have stated? E.g. where does it maintain state outside the StreamReader/Writer?
msg46474 - (view) Author: Walter Dörwald (doerwalter) * (Python committer) Date: 2004-08-10 19:22
Here is a second version of the patch. It implements a final argument for read/write/decode/encode, which specifies whether this is the last call to the method; it adds a chunk reader/writer API to StreamReader/Writer; and it unifies the stateless/stateful decoding functions in the codecs module again.
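The effect of a final argument can be shown with the decoding functions in a current Python 3 codecs module, which take final as a third parameter. This is a sketch under that assumption, not the API from diff2.txt, and decode_chunks is a hypothetical helper.

```python
import codecs


def decode_chunks(chunks):
    """Decode UTF-8 byte chunks, passing final=True only for the last one."""
    chunks = list(chunks)
    pending = b""
    parts = []
    for index, chunk in enumerate(chunks):
        final = index == len(chunks) - 1
        data = pending + chunk
        # While final is False an incomplete trailing sequence is left
        # unconsumed; once final is True it is reported as an error.
        text, consumed = codecs.utf_8_decode(data, "strict", final)
        pending = data[consumed:]
        parts.append(text)
    return "".join(parts)


# A character split across two chunks decodes cleanly.
complete = decode_chunks([b"a\xc3", b"\xa4"])

# Input that ends in the middle of a character fails on the final call.
try:
    decode_chunks([b"a", b"\xc3"])
    truncated_raised = False
except UnicodeDecodeError:
    truncated_raised = True
```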
msg46475 - (view) Author: Marc-Andre Lemburg (lemburg) * (Python committer) Date: 2004-08-24 10:15
Walter, please update the first version of the patch as outlined in my python-dev posting:

* move the UTF-7 change to a separate patch (this won't get checked in for Python 2.4)
* remove the extra APIs from the _codecs patches (these are not needed; instead the existing APIs should be updated to use the ...Stateful() C APIs and pass along the possibly changed consumed setting)

Thanks.
msg46476 - (view) Author: Walter Dörwald (doerwalter) * (Python committer) Date: 2004-08-24 20:02
Here is a third version of the patch with the requested changes.
msg46477 - (view) Author: Walter Dörwald (doerwalter) * (Python committer) Date: 2004-08-27 19:00
diff4.txt includes patches to the documentation.
msg46478 - (view) Author: Marc-Andre Lemburg (lemburg) * (Python committer) Date: 2004-08-31 14:19
diff4.txt looks OK (even though I don't like the final argument in the _codecs module decode APIs). Please remove the UTF-7 #defines and then check it in. Thanks, Walter.
msg46479 - (view) Author: Walter Dörwald (doerwalter) * (Python committer) Date: 2004-09-07 20:30
Checked in as:

Doc/api/concrete.tex 1.56
Doc/lib/libcodecs.tex 1.33
Include/unicodeobject.h 2.46
Lib/codecs.py 1.34
Lib/encodings/utf_16.py 1.5
Lib/encodings/utf_16_be.py 1.4
Lib/encodings/utf_16_le.py 1.4
Lib/encodings/utf_8.py 1.3
Lib/test/test_codecs.py 1.13
Misc/NEWS 1.1129
Modules/_codecsmodule.c 2.20
Objects/unicodeobject.c 2.224

I've added documentation for the chars and keepends arguments. I've removed the #defines for the UTF7 codec, although I think they should be added back in: the C functions *do* exist, it's just the UCS2/UCS4 name mangling that's missing.

> diff4.txt looks OK (even though I don't like the final
> argument in the _codecs module decode APIs).

I think the other alternatives are worse:

1) Implement two versions of the decoding function that use a common PyUnicode_Decode???() (like the first patch does).
2) Implement two versions of the decoding functions, each one using a separate version of PyUnicode_Decode???().

I'll open the new report once 2.4 is out the door and we can start discussing the final argument and the feed API.
History
Date User Action Args
2022-04-11 14:56:06 admin set github: 40651
2004-07-27 20:35:29 doerwalter create