[Python-Dev] Decoding incomplete unicode

"Martin v. Löwis" martin at v.loewis.de
Wed Aug 18 23:57:22 CEST 2004


Walter Dörwald wrote:

But then a file that contains the two bytes 0x61, 0xc3 will never generate an error when read via a UTF-8 reader. The trailing 0xc3 will just be ignored.
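
To make the concern concrete, here is a minimal sketch using the codecs module as it exists in present-day Python 3, which is an assumption on the editor's part and not the patch under discussion. It reads the two bytes through a UTF-8 StreamReader and shows that the read completes without an exception.

```python
import codecs
import io

# A file-like object holding b"a" followed by a lone UTF-8 lead byte.
raw = io.BytesIO(b"\x61\xc3")
reader = codecs.getreader("utf-8")(raw)

# read() decodes what it can; the incomplete trailing byte is held back
# and no UnicodeDecodeError is raised.
text = reader.read()
print(repr(text))
```

Nothing in this read tells the caller that a byte was left undecoded, which is the behaviour being objected to.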

Another option we have would be to add a final() method to the StreamReader, that checks if all bytes have been consumed.
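
One way to picture such an end-of-input check is the final flag on the incremental decoders that later Python versions provide. This is a sketch under that assumption; it is not a final() method on StreamReader, which is only a proposal in this thread.

```python
import codecs

decoder = codecs.getincrementaldecoder("utf-8")()

# final defaults to False, so the lone lead byte is buffered, not rejected.
text = decoder.decode(b"\x61\xc3")

# Signalling the end of input forces the decoder to account for every byte.
try:
    decoder.decode(b"", final=True)
    truncated = False
except UnicodeDecodeError:
    truncated = True

print(repr(text), truncated)
```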

Alternatively, we could add a .buffer() method that returns any data that are still pending (either a Unicode string or a byte string).
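
A rough analogue of inspecting pending data, again assuming the incremental decoder API of later Python versions instead of a .buffer() method, is getstate(), whose first element is the not-yet-decoded input.

```python
import codecs

decoder = codecs.getincrementaldecoder("utf-8")()
text = decoder.decode(b"\x61\xc3")

# getstate() returns (buffered_input, additional_state); the first item
# holds the bytes that have been consumed but not yet decoded.
pending, _flags = decoder.getstate()
print(repr(text), repr(pending))
```

A caller that finds pending non-empty after the last chunk can decide for itself whether that is an error.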

Maybe this should be done by StreamReader.close()?

No. There is nothing wrong with only reading a part of a file.

Now inShift counts the number of characters (and the shortcut for a "+-" sequence appearing together has been removed).

Ok. I didn't actually check the correctness of the individual methods.

OTOH, I think time spent on UTF-7 is wasted, anyway.

Would a version of the patch without a final argument but with a feed() method be accepted?

I don't see the need for a feed method. .read() should just block until data are available, and that's it.

I'm imagining implementing an XML parser that uses Python's unicode machinery and supports the xml.sax.xmlreader.IncrementalParser interface.

I think this is out of scope of this patch. The incremental parser could implement a regular .read on a StringIO file that also supports .feed.
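
The arrangement described here can be sketched as follows. ByteQueue is a hypothetical helper invented for illustration, not something from the patch or the standard library, and the sketch relies on the behaviour of StreamReader in present-day Python 3, which keeps undecoded bytes between read() calls.

```python
import codecs


class ByteQueue:
    """Hypothetical file-like object: feed() appends bytes, read() drains them."""

    def __init__(self):
        self._pending = b""

    def feed(self, data):
        self._pending += data

    def read(self, size=-1):
        # A negative or missing size means "everything currently queued".
        if size is None or size < 0:
            size = len(self._pending)
        chunk, self._pending = self._pending[:size], self._pending[size:]
        return chunk


queue = ByteQueue()
reader = codecs.getreader("utf-8")(queue)

queue.feed(b"\x61\xc3")   # "a" plus the first byte of a two-byte sequence
first = reader.read()     # only "a" can be decoded so far

queue.feed(b"\xa4")       # the second byte arrives later
second = reader.read()    # the completed sequence decodes now

print(repr(first), repr(second))
```

The parser only ever calls read(); the feeding side stays outside the codec machinery.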

Without the feed() method, we need the following:

1) A StreamQueue class that

Why is that? I thought we are talking about "Decoding incomplete unicode"?

Regards, Martin


