[Python-Dev] Decoding incomplete unicode

Walter Dörwald walter at livinglogic.de
Thu Aug 19 20:09:17 CEST 2004


M.-A. Lemburg wrote:

Walter Dörwald wrote:

Let's compare example uses:

1) Having feed() as part of the StreamReader API:
---
s = u"???".encode("utf-8")
r = codecs.getreader("utf-8")()
for c in s:
    print r.feed(c)
---

I consider adding a .feed() method to the stream codec bad design. .feed() is something you do on a stream, not a codec.

I don't care about the name, we can call it stateful_decode_byte_chunk() or whatever. (In fact I'd prefer to call it decode(), but that name is already taken by another method. Of course we could always rename decode() to _internal_decode() like Martin suggested.)
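A minimal sketch of the feed-style usage from option 1), assuming a current Python 3 where codecs.getincrementaldecoder() is available. This is not the StreamReader.feed() method under discussion; the helper name feed_bytes is made up for illustration, and the sample string is an arbitrary choice. It only shows what stateful decoding of byte chunks looks like: incomplete sequences are buffered and an empty string is returned until a character is complete.

```python
import codecs


def feed_bytes(data, encoding="utf-8"):
    """Decode data one byte at a time and return the text produced per byte.

    feed_bytes is a hypothetical helper, not part of the codecs module.
    """
    decoder = codecs.getincrementaldecoder(encoding)(errors="strict")
    pieces = []
    for i in range(len(data)):
        # The decoder keeps an incomplete multi-byte sequence in its own
        # state and returns "" until the character is complete.
        pieces.append(decoder.decode(data[i:i + 1]))
    return pieces


# U+00E4 takes two bytes in UTF-8, U+20AC takes three.
pieces = feed_bytes("\u00e4\u20ac".encode("utf-8"))
print(pieces)
```

The caller never sees the buffered bytes; the state lives entirely in the decoder object, which is the property the feed() proposal is after.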

2) Explicitly using a queue object:
---
from whatever import StreamQueue

s = u"???".encode("utf-8")
q = StreamQueue()
r = codecs.getreader("utf-8")(q)
for c in s:
    q.write(c)
    print r.read()
---

This is probably how an advanced codec writer would use the APIs to build new stream interfaces.
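The queue-based variant of option 2) can be sketched as follows, assuming a current Python 3. StreamQueue here is a hypothetical stand-in written for this example (the "whatever" module above does not exist), and decode_via_queue is likewise only an illustrative helper. The sketch relies on the UTF-8 StreamReader keeping undecoded trailing bytes between read() calls.

```python
import codecs


class StreamQueue:
    """Hypothetical in-memory byte queue: write() appends, read() drains."""

    def __init__(self):
        self._buffer = bytearray()

    def write(self, data):
        self._buffer.extend(data)

    def read(self, size=-1):
        # A negative size means "everything currently queued".
        if size < 0:
            size = len(self._buffer)
        chunk = bytes(self._buffer[:size])
        del self._buffer[:size]
        return chunk


def decode_via_queue(data, encoding="utf-8"):
    """Push data through the queue one byte at a time; collect each read()."""
    q = StreamQueue()
    r = codecs.getreader(encoding)(q)
    out = []
    for i in range(len(data)):
        q.write(data[i:i + 1])
        # The reader holds back bytes of an incomplete character.
        out.append(r.read())
    return out


result = decode_via_queue("\u00e4\u20ac".encode("utf-8"))
print(result)
```

Compared with option 1), the caller has to manage two objects (the queue and the reader), which is the extra explicitness being weighed in this thread.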

3) Using a special wrapper that implicitly creates a queue:
----
from whatever import StreamQueueWrapper

s = u"???".encode("utf-8")
r = StreamQueueWrapper(codecs.getreader("utf-8"))
for c in s:
    print r.feed(c)
----

This could be turned into something more straightforward, e.g.

from codecs import EncodedStream

# Load data
s = u"???".encode("utf-8")

# Write to encoded stream (one byte at a time) and print
# the read output
q = EncodedStream(inputencoding="utf-8", outputencoding="unicode")

This is confusing, because there is no encoding named "unicode". This should probably read:

q = EncodedQueue(encoding="utf-8", errors="strict")

for c in s:
    q.write(c)
    print q.read()

# Make sure we have processed all data:
if q.haspendingdata():
    raise ValueError, 'data truncated'

This should be the job of the error callback; the last part should probably be:

for c in s:
    q.write(c)
    print q.read()
print q.read(final=True)
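To illustrate the point about leaving truncated input to the error handling, here is a small sketch assuming a current Python 3. It uses codecs.getincrementaldecoder() rather than the EncodedQueue proposed above, and decode_chunks is a hypothetical helper. Passing final=True tells the decoder that no more input follows, so leftover bytes are handed to the configured error handler instead of being checked for by hand.

```python
import codecs


def decode_chunks(chunks, errors="strict"):
    """Decode an iterable of byte chunks, then flush with final=True.

    decode_chunks is a hypothetical helper, not part of the codecs module.
    """
    decoder = codecs.getincrementaldecoder("utf-8")(errors=errors)
    text = "".join(decoder.decode(chunk) for chunk in chunks)
    # final=True: any buffered, incomplete sequence is now an error and
    # is handled according to the errors argument.
    text += decoder.decode(b"", final=True)
    return text


# The second chunk is only the first byte of a two-byte sequence.
truncated = [b"a", b"\xc3"]

try:
    decode_chunks(truncated)
    strict_failed = False
except UnicodeDecodeError:
    strict_failed = True

replaced = decode_chunks(truncated, errors="replace")
print(strict_failed, repr(replaced))
```

With errors="strict" the truncation surfaces as a UnicodeDecodeError; with errors="replace" the dangling byte becomes U+FFFD, so no separate haspendingdata() check is needed.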

I very much prefer option 1).

I prefer the above example because it's easy to read and makes things explicit.

"If the implementation is hard to explain, it's a bad idea."

The user usually doesn't care about the implementation, only its interfaces.

Bye,
   Walter Dörwald