[Python-Dev] Decoding incomplete unicode

Walter Dörwald walter at livinglogic.de
Thu Aug 19 17:45:26 CEST 2004


Martin v. Löwis wrote:

Walter Dörwald wrote:

They will not, because StreamReader.decode() is already a feed-style API (but with state amnesia).

Any stream decoder that I can think of can be (and most are) implemented by overriding decode().

I consider that an unfortunate implementation artefact. You either use the stateless encode/decode that you get from codecs.getencoder()/codecs.getdecoder(), or you use the file API on the streams. You never ever use encode/decode on streams.

That is exactly the problem with the current API. StreamReader mixes two concepts:

  1. The stateful API, which allows decoding byte input in chunks, with the state of the decoder kept between calls.
  2. A file API where the chunks to be decoded are read from a byte stream.
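To make the two concepts concrete, here is a minimal sketch in Python 3. It assumes the incremental codec interface (`codecs.getincrementaldecoder`) that later Python versions provide, which is not part of the API discussed in this thread, and the sample string is arbitrary.

```python
import codecs
import io

data = "äöü".encode("utf-8")  # arbitrary sample: three characters, two bytes each

# Concept 1, the stateful API: the caller feeds byte chunks by hand and the
# decoder object remembers incomplete sequences between calls.
decoder = codecs.getincrementaldecoder("utf-8")()
fed = [decoder.decode(data[i:i + 1]) for i in range(len(data))]

# Concept 2, the file API: the reader pulls the bytes from a stream itself.
reader = codecs.getreader("utf-8")(io.BytesIO(data))
read_text = reader.read()
```

Feeding one byte at a time yields an empty string for every first byte of a pair and the character once the pair is complete; the file API returns the same text in one go.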

I would have preferred it if the default .write implementation had called self.internalencode, and if the Writer contained a Codec rather than inheriting from Codec.

This would separate the two concepts from above.
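A rough sketch of that containment design, written in Python 3 for illustration only. The class name `ContainedWriter` is hypothetical, and the contained encoder comes from `codecs.getincrementalencoder`, an interface from later Python versions rather than the API under discussion here.

```python
import codecs
import io


class ContainedWriter:
    """Hypothetical writer that holds an encoder instead of inheriting from Codec."""

    def __init__(self, stream, encoding, errors="strict"):
        self.stream = stream
        # The stateful encoder is a separate object owned by the writer.
        self._encoder = codecs.getincrementalencoder(encoding)(errors)

    def write(self, text):
        # The file API only delegates; all codec state lives in the encoder.
        self.stream.write(self._encoder.encode(text))

    def close(self):
        # Flush whatever the encoder may still be holding back.
        self.stream.write(self._encoder.encode("", final=True))


buf = io.BytesIO()
writer = ContainedWriter(buf, "utf-8")
writer.write("ä")
writer.write("b")
writer.close()
```

With this split, the encoder can also be used on its own when there is no stream to write to.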

Alas, for (I guess) simplicity, a more direct (and more confusing) approach was taken.

1) Having feed() as part of the StreamReader API:
---
s = u"???".encode("utf-8")
r = codecs.getreader("utf-8")()
for c in s:
    print r.feed(c)
---

Isn't that a totally unrelated issue? Aren't we talking about short reads on sockets etc?

We're talking about two problems:

  1. The current implementation does not really support the stateful API, because trailing incomplete byte sequences lead to errors.
  2. The current file API is not really convenient for decoding when the input is not read from a stream.
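Both problems can be illustrated with a short Python 3 sketch. The input below is a plain list of byte chunks (a stand-in for data that does not come from a stream), split in the middle of a multi-byte sequence. The stateful variant again assumes `codecs.getincrementaldecoder` from later Python versions.

```python
import codecs

data = "ä€".encode("utf-8")      # 2 bytes for "ä", 3 bytes for "€"
chunks = [data[:3], data[3:]]    # the split falls inside the "€" sequence

# Stateless decoding treats the trailing incomplete sequence as an error.
try:
    chunks[0].decode("utf-8")
    failed = False
except UnicodeDecodeError:
    failed = True

# A stateful decoder keeps the incomplete tail and finishes it on the next call.
decoder = codecs.getincrementaldecoder("utf-8")()
parts = [decoder.decode(chunk) for chunk in chunks]
```

Here the stateless call fails on the first chunk, while the stateful decoder returns "ä" for the first chunk and "€" once the remaining two bytes arrive.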

I would very much prefer to solve one problem at a time.

Bye, Walter Dörwald


