[Python-Dev] Decoding incomplete unicode
Walter Dörwald walter at livinglogic.de
Wed Aug 18 23:17:31 CEST 2004
- Previous message: [Python-Dev] Decoding incomplete unicode
- Next message: [Python-Dev] Decoding incomplete unicode
- Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]
Martin v. Löwis wrote:
We do need to extend the API between the stream codec and the encode/decode functions, no doubt about that. However, this is an extension that is well hidden from the user of the codec and won't break code.
So you agree to the part of Walter's change that introduces new C functions (PyUnicode_DecodeUTF7Stateful etc)?
I think most of the patch can be discarded: there is no need for .encode and .decode to take an additional argument.
But then a file that contains the two bytes 0x61, 0xc3 will never generate an error when read via a UTF-8 reader. The trailing 0xc3 will just be ignored.
Another option we have would be to add a final() method to the StreamReader, that checks if all bytes have been consumed. Maybe this should be done by StreamReader.close()?
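The truncated-input problem and the effect of an end-of-input check can be reproduced with the incremental decoder interface in today's codecs module. This is only a sketch: codecs.getincrementaldecoder postdates this thread and is not the API proposed in the patch, and the final=True call merely stands in for the final() check suggested above.

```python
import codecs

# Incremental UTF-8 decoder from today's standard library (an assumption
# for illustration; this is not the API proposed in the patch).
decoder = codecs.getincrementaldecoder("utf-8")()

# 0x61 decodes to "a"; the lone lead byte 0xc3 is buffered, not reported.
text = decoder.decode(b"\x61\xc3")

# Only signalling end of input turns the leftover byte into an error.
try:
    decoder.decode(b"", final=True)
    truncated = False
except UnicodeDecodeError:
    truncated = True
```

Without the last call the reader has no way to tell a truncated file from one that simply has more data coming.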
It is only necessary that the StreamReader and StreamWriter are stateful, and that only for a selected subset of codecs.
Marc-Andre, if the original patch (diff.txt) was applied: What specific change in that patch would break code? What specific code (C or Python) would break under that change? I believe the original patch can be applied as-is, and does not cause any breakage.
The first version has a broken implementation of the UTF-7 decoder. When decoding the byte sequence "+-" in two calls to decode() (i.e. pass "+" in one call and "-" in the next), no character got generated, because inShift (as a flag) couldn't remember whether characters were encountered between the "+" and the "-". Now inShift counts the number of characters (and the shortcut for a "+-" sequence appearing together has been removed).
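The split "+-" case makes a handy regression check. The sketch below uses the incremental UTF-7 decoder from today's codecs module, which is an assumption for illustration and not the code from either version of the patch.

```python
import codecs

# Incremental UTF-7 decoder from today's standard library (not the patch).
decoder = codecs.getincrementaldecoder("utf-7")()

# "+-" is the UTF-7 escape for a literal "+"; feed it one byte per call.
pieces = [decoder.decode(b"+"), decoder.decode(b"-")]
split_result = "".join(pieces)

# Decode the same two bytes in a single call, for comparison.
whole_result = b"+-".decode("utf-7")
```

A correct stateful decoder has to produce the same text either way.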
It also introduces a change between the codec and the encode/decode functions that is well hidden from the user of the codec.
Would a version of the patch without a final argument but with a feed() method be accepted?
I'm imagining implementing an XML parser that uses Python's unicode machinery and supports the xml.sax.xmlreader.IncrementalParser interface.
With a feed() method in the stream reader this is rather simple:
init()
{
    PyObject *reader = PyCodec_StreamReader(encoding, Py_None, NULL);
    self.reader = PyObject_CallObject(reader, NULL);
}

int feed(char *bytes)
{
    /* decode the new bytes and hand the resulting unicode to the parser */
    parse(PyObject_CallMethod(self.reader, "feed", "s", bytes));
}
The feed method itself is rather simple (see the second version of the patch).
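For readers without the patch at hand, a feed() method along these lines could look roughly as follows. This is a hypothetical sketch, not the code from the patch: the FeedReader class and its pending attribute are made up for illustration, and it relies on codecs.utf_8_decode accepting a final flag and returning the number of bytes consumed, as it does in today's Python.

```python
import codecs


class FeedReader:
    """Hypothetical feed()-style decoder; not the code from the patch."""

    def __init__(self, decode=codecs.utf_8_decode, errors="strict"):
        self.decode = decode    # stateful decode function: (data, errors, final)
        self.errors = errors
        self.pending = b""      # bytes of an incomplete sequence, kept for later

    def feed(self, data):
        # Prepend what the previous call could not decode yet.
        data = self.pending + data
        # final=False: an incomplete trailing sequence is not an error.
        text, consumed = self.decode(data, self.errors, False)
        self.pending = data[consumed:]
        return text


reader = FeedReader()
# The two bytes of U+00E4 are split across the chunk boundary.
chunks = [b"a\xc3", b"\xa4b"]
decoded = [reader.feed(chunk) for chunk in chunks]
```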
Without the feed() method, we need the following:
A StreamQueue class that
   a) supports writing at one end and reading at the other end
   b) has a method for pushing back unused bytes to be returned in the next call to read()
A StreamQueueWrapper class that
   a) gets passed the StreamReader factory in the constructor, creates a StreamQueue instance, puts it into an attribute and passes it to the StreamReader factory (which must also be put into an attribute).
   b) has a feed() method that calls write() on the stream queue and read() on the stream reader and returns the result
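To give an idea of the amount of machinery involved, the two classes could be sketched in Python like this. Everything here is an assumption for illustration: the method name pushback() is invented, and the sketch uses codecs.getreader from today's Python as the StreamReader factory. Today's StreamReader keeps undecoded bytes itself, so pushback() is never called in this sketch.

```python
import codecs


class StreamQueue:
    """Byte queue: write at one end, read at the other (hypothetical)."""

    def __init__(self):
        self._buffer = b""

    def write(self, data):
        self._buffer += data

    def read(self, size=-1):
        if size < 0:
            size = len(self._buffer)
        data, self._buffer = self._buffer[:size], self._buffer[size:]
        return data

    def pushback(self, data):
        # Unused bytes are returned first by the next read().
        self._buffer = data + self._buffer


class StreamQueueWrapper:
    """Gives a StreamReader a feed() method by way of a StreamQueue."""

    def __init__(self, reader_factory):
        self.queue = StreamQueue()
        self.reader = reader_factory(self.queue)

    def feed(self, data):
        self.queue.write(data)
        return self.reader.read()


wrapper = StreamQueueWrapper(codecs.getreader("utf-8"))
first = wrapper.feed(b"a\xc3")    # incomplete sequence at the end
second = wrapper.feed(b"\xa4b")   # rest of the sequence plus one more byte
```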
Then the C implementation of the parser looks something like this:
init()
{
    PyObject *module = PyImport_ImportModule("whatever");
    PyObject *wclass = PyObject_GetAttrString(module, "StreamQueueWrapper");
    PyObject *reader = PyCodec_StreamReader(encoding, Py_None, NULL);
    self.wrapper = PyObject_CallFunctionObjArgs(wclass, reader, NULL);
}

int feed(char *bytes)
{
    /* the wrapper queues the bytes and reads them back through the reader */
    parse(PyObject_CallMethod(self.wrapper, "feed", "s", bytes));
}
I find this neither easier to implement nor easier to explain.
Bye, Walter Dörwald