[Python-Dev] Decoding incomplete unicode
Walter Dörwald walter at livinglogic.de
Wed Aug 18 22:35:31 CEST 2004
M.-A. Lemburg wrote:
Walter Dörwald wrote:
I've thought about this some more. Perhaps I'm still missing something, but wouldn't it be possible to add a feeding mode to the existing stream codecs by creating a new queue data type (much like the queue you have in the test cases of your patch) and using the stream codecs on these?
No, because when the decode method encounters an incomplete chunk (and so returns a size that is smaller than the size of the input) read() would have to push the remaining bytes back into the queue. This would be code similar in functionality to the feed() method from the patch, with the difference that the buffer lives in the queue, not the StreamReader. So we won't gain any code simplification by going this route.
Maybe not code simplification, but the APIs will be well-separated.
They will not, because StreamReader.decode() already is a feed style API (but with state amnesia).
Any stream decoder that I can think of can be (and most are) implemented by overriding decode().
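To illustrate the "state amnesia" point, here is a minimal sketch in current Python 3 (not the Python 2 of this thread). It assumes the stateless function codecs.utf_8_decode(data, errors, final), which returns the decoded text together with the number of bytes consumed. The class name FeedDecoder and its feed() method are hypothetical and are not the API from the patch; they only show the idea of remembering the bytes that decode() could not consume.

```python
import codecs


class FeedDecoder:
    """Hypothetical feed-style UTF-8 decoder (illustration only)."""

    def __init__(self, errors="strict"):
        self.errors = errors
        self.pending = b""  # bytes left over from the previous call

    def feed(self, data):
        # Prepend what the stateless decoder could not consume last time.
        data = self.pending + data
        # final=False tells the decoder that a trailing partial
        # sequence is not an error, it is simply not consumed yet.
        text, consumed = codecs.utf_8_decode(data, self.errors, False)
        self.pending = data[consumed:]
        return text


# The stateless call alone forgets the incomplete byte:
text, consumed = codecs.utf_8_decode(b"\xc3", "strict", False)
# text == "" and consumed == 0; the caller must keep b"\xc3" around.

decoder = FeedDecoder()
pieces = [decoder.feed(bytes([b])) for b in "\u00e4\u20ac".encode("utf-8")]
```

The only state added on top of decode() is the pending buffer, which is the part that either the reader or a queue would have to own.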
If we require the queue type for feeding mode operation we are free to define whatever APIs are needed to communicate between the codec and the queue type, e.g. we could define a method that pushes a few bytes back onto the queue end (much like ungetc() in C).
That would of course be a possibility.
I think such a queue would be generally useful in other contexts as well, e.g. for implementing fast character-based pipes between threads, non-Unicode feeding parsers, etc. Using such a type you could potentially add a feeding mode to stream or file-object API based algorithms very easily.
Yes, so we could put this Queue class into a module with string utilities. Maybe string.py?
Hmm, I think a separate module would be better since we could then recode the implementation in C at some point (and after the API has settled). We'd only need a new name for it, e.g. StreamQueue or something.
Sounds reasonable.
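A rough sketch of what such a queue type might look like, in current Python 3. Everything here is an assumption made for illustration: the thread only proposes the name StreamQueue, while the method names write(), read() and unread() and the helper read_text() are hypothetical. The unread() method plays the role of the ungetc()-like pushback mentioned above.

```python
import codecs


class StreamQueue:
    """Hypothetical FIFO byte queue with pushback (illustration only)."""

    def __init__(self):
        self._buf = bytearray()

    def write(self, data):
        # Producer side: append bytes at the end of the queue.
        self._buf += data

    def read(self, size=-1):
        # Consumer side: remove and return up to size bytes (all if size < 0).
        if size is None or size < 0:
            size = len(self._buf)
        chunk = bytes(self._buf[:size])
        del self._buf[:size]
        return chunk

    def unread(self, data):
        # Push bytes back so that the next read() returns them first.
        self._buf[:0] = data


def read_text(queue):
    """Decode whatever is complete; push the incomplete tail back."""
    data = queue.read()
    text, consumed = codecs.utf_8_decode(data, "strict", False)
    queue.unread(data[consumed:])
    return text


q = StreamQueue()
q.write(b"\xe2\x82")        # first two bytes of a three-byte sequence
first = read_text(q)        # nothing complete yet
q.write(b"\xacx")           # last byte of the sequence plus an ASCII "x"
second = read_text(q)
```

Here the leftover bytes live in the queue rather than in the reader, which is the difference between the two designs under discussion.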
We could then have a new class, e.g. FeedReader, which wraps the above in a nice API, much like StreamReaderWriter and StreamRecoder.
But why should we, when decode() does most of what we need, and the rest has to be implemented in both versions?
To hide the details from the user. It should be possible to instantiate one of these StreamQueueReaders (named after the queue type) and simply use it in feeding mode without having to bother about the details behind the implementation. StreamReaderWriter and StreamRecoder exist for the same reason.
Let's compare example uses:
1) Having feed() as part of the StreamReader API:

    import codecs

    s = u"???".encode("utf-8")
    r = codecs.getreader("utf-8")()
    for c in s:
        print r.feed(c)

2) Explicitly using a queue object:

    import codecs
    from whatever import StreamQueue

    s = u"???".encode("utf-8")
    q = StreamQueue()
    r = codecs.getreader("utf-8")(q)
    for c in s:
        q.write(c)
        print r.read()

3) Using a special wrapper that implicitly creates a queue:

    import codecs
    from whatever import StreamQueueWrapper

    s = u"???".encode("utf-8")
    r = StreamQueueWrapper(codecs.getreader("utf-8"))
    for c in s:
        print r.feed(c)
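For readers who want to try the queue-based variant, the following is a runnable approximation in current Python 3. It is a sketch under assumptions: ByteQueue is a hypothetical stand-in for the proposed StreamQueue, the sample string is chosen here and is not the one from the examples above, and the code relies on the stock codecs.StreamReader keeping undecoded bytes between read() calls, as it does in current releases.

```python
import codecs


class ByteQueue:
    """Hypothetical stand-in for the proposed StreamQueue."""

    def __init__(self):
        self._buf = bytearray()

    def write(self, data):
        self._buf += data

    def read(self, size=-1):
        # Return up to size bytes (everything if size < 0), b"" when empty.
        if size is None or size < 0:
            size = len(self._buf)
        chunk = bytes(self._buf[:size])
        del self._buf[:size]
        return chunk


# Sample text chosen for this sketch: 1-, 2- and 3-byte UTF-8 sequences.
data = "a\u00e4\u20ac".encode("utf-8")

q = ByteQueue()
r = codecs.getreader("utf-8")(q)

pieces = []
for i in range(len(data)):
    q.write(data[i:i + 1])    # feed exactly one byte
    pieces.append(r.read())   # returns "" until a character is complete
```

Each read() returns an empty string while a multi-byte sequence is still incomplete, and the character once its last byte arrives. The caller still has to manage two objects, which is the usability cost that option 1) avoids.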
I very much prefer option 1).
"If the implementation is hard to explain, it's a bad idea."
Bye, Walter Dörwald