[Python-Dev] Decoding incomplete unicode

Walter Dörwald walter at livinglogic.de
Thu Aug 19 18:49:44 CEST 2004

Martin v. Löwis wrote:

Walter Dörwald wrote:

But then a file that contains the two bytes 0x61, 0xc3 will never generate an error when read via a UTF-8 reader. The trailing 0xc3 will just be ignored.

Another option we have would be to add a final() method to the StreamReader, that checks if all bytes have been consumed. Alternatively, we could add a .buffer() method that returns any data that are still pending (either a Unicode string or a byte string).

Both approaches have one problem: Error handling won't work for them. If the error handling is "replace", the decoder should return U+FFFD for the final trailing incomplete sequence in read(). This won't happen when read() never reads those bytes from the input stream.
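To make the point about error handling concrete, here is a minimal sketch using the incremental decoder API of the codecs module. That API postdates this message, so it is not what the patch under discussion provides; it is used here only because its final flag illustrates the behaviour being asked for. The variable names are made up for the example.

```python
import codecs

# NOTE: codecs.getincrementaldecoder() was added to Python after this
# thread; it is used here only to illustrate the "final" idea.
decoder = codecs.getincrementaldecoder("utf-8")(errors="replace")

# 0x61 is "a"; 0xc3 starts a two-byte sequence that never completes.
pending = decoder.decode(b"\x61\xc3")      # not final: 0xc3 is held back
flushed = decoder.decode(b"", final=True)  # end of input: 0xc3 is replaced

# With strict error handling the same trailing byte raises an error,
# but only once the decoder is told that the input is complete.
strict = codecs.getincrementaldecoder("utf-8")(errors="strict")
strict.decode(b"\x61\xc3")
try:
    strict.decode(b"", final=True)
    strict_failed = False
except UnicodeDecodeError:
    strict_failed = True
```

Without a way to signal the end of the input, neither the replacement character nor the exception can ever be produced for the trailing byte.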

Maybe this should be done by StreamReader.close()?

No. There is nothing wrong with only reading a part of a file.

Yes, but if read() is called without arguments then everything from the input stream should be read and used.

Now inShift counts the number of characters (and the shortcut for a "+-" sequence appearing together has been removed).

Ok. I didn't actually check the correctness of the individual methods. OTOH, I think time spent on UTF-7 is wasted, anyway.

;) But it's a good example of how complicated state management can get.

Would a version of the patch without a final argument but with a feed() method be accepted?

I don't see the need for a feed method. .read() should just block until data are available, and that's it.

There are situations where this can never work: Take a look at xml.sax.xmlreader.IncrementalParser. This interface has a feed() method which the user can call multiple times to pass byte string chunks to the XML parser. These chunks have to be decoded by the parser. Now if the parser wants to use Python's StreamReader it has to wrap the bytes passed to the feed() method into a stream interface. This looks something like the Queue class from the patch:

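    class Queue(object):
        def __init__(self):
            # Bytes that have been written but not yet read.
            self._buffer = ""

        def write(self, chars):
            self._buffer += chars

        def read(self, size=-1):
            if size < 0:
                # Return everything and empty the buffer.
                s = self._buffer
                self._buffer = ""
                return s
            else:
                # Return at most size bytes and keep the rest.
                s = self._buffer[:size]
                self._buffer = self._buffer[size:]
                return s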

The parser creates such an object and passes it to the StreamReader constructor. When feed() is called for the XML parser, the parser calls queue.write(bytes) to put the bytes into the queue. The parser can then call read() on the StreamReader, which in turn reads from the other end of the queue, to get decoded data.
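The arrangement described above can be sketched roughly as follows. This is an illustration, not code from the patch: the Queue here holds a bytes buffer so that it runs on Python 3 (the original uses a Python 2 str), and the behaviour shown is that of the StreamReader in a current CPython, which keeps undecoded bytes between calls and does not block when the queue is empty.

```python
import codecs


class Queue(object):
    """Byte queue with a write end and a read end (Python 3 variant)."""

    def __init__(self):
        self._buffer = b""

    def write(self, data):
        self._buffer += data

    def read(self, size=-1):
        if size < 0:
            s, self._buffer = self._buffer, b""
        else:
            s, self._buffer = self._buffer[:size], self._buffer[size:]
        return s


queue = Queue()
reader = codecs.getreader("utf-8")(queue)

# First chunk ends in the middle of a two-byte sequence (0xc3 0xa4).
queue.write(b"a\xc3")
first = reader.read()    # only "a" can be decoded so far

# Second chunk completes the sequence.
queue.write(b"\xa4")
second = reader.read()   # the pending 0xc3 is combined with 0xa4
```

Every read() here returns as soon as the queue is drained. If read() waited for more data instead, the first call would never return, because the next write() can only happen after it does.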

But this will fail when StreamReader.read() blocks until more data is available. This will never happen, because the data will be put in the queue explicitly by calls to the feed() method of the XML parser.

Or take a look at sio.DecodingInputFilter. This should be an alternative implementation of reading a stream and decoding bytes to unicode. But the current implementation is broken because it uses the stateless API. And once we switch to the stateful API, DecodingInputFilter becomes useless: because it is forced to use the stream API of StreamReader, DecodingInputFilter.read() just looks like this: def read(self): return self.stream.read() (with stream being the stateful stream reader from codecs.getreader()).

I'm imagining implementing an XML parser that uses Python's unicode machinery and supports the xml.sax.xmlreader.IncrementalParser interface.

I think this is out of scope of this patch. The incremental parser could implement a regular .read on a StringIO file that also supports .feed.

This adds too much infrastructure, when the alternative implementation is trivial. Take a look at the first version of the patch. Implementing a feed() method just means factoring the lines:

    data = self.bytebuffer + newdata
    object, decodedbytes = self.decode(data, self.errors)
    self.bytebuffer = data[decodedbytes:]

into a separate method named feed():

    def feed(self, newdata):
        data = self.bytebuffer + newdata
        object, decodedbytes = self.decode(data, self.errors)
        self.bytebuffer = data[decodedbytes:]
        return object

So the feed functionality does already exist. It's just not in a usable form.
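As a rough, self-contained illustration of that factoring, the sketch below wraps the same three lines in a small class. The class name FeedDecoder is hypothetical and not part of the patch; codecs.utf_8_decode stands in for the codec's decode function, called in its default non-final mode so that an incomplete trailing sequence is left unconsumed.

```python
import codecs


class FeedDecoder(object):
    """Hypothetical stand-in for a StreamReader with a feed() method."""

    def __init__(self, errors="strict"):
        self.errors = errors
        self.bytebuffer = b""   # bytes not yet decoded

    def decode(self, data, errors):
        # Non-final decode: returns (text, number_of_bytes_consumed) and
        # leaves an incomplete trailing sequence unconsumed.
        return codecs.utf_8_decode(data, errors)

    def feed(self, newdata):
        data = self.bytebuffer + newdata
        obj, decodedbytes = self.decode(data, self.errors)
        # Keep undecoded bytes until the next call.
        self.bytebuffer = data[decodedbytes:]
        return obj


decoder = FeedDecoder()
part1 = decoder.feed(b"a\xc3")   # 0xc3 stays in decoder.bytebuffer
leftover = decoder.bytebuffer
part2 = decoder.feed(b"\xa4")    # completes the two-byte sequence
```

The caller pushes byte chunks in and gets decoded text back directly, without a stream object or a queue in between.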

Using a StringIO wouldn't work because we need both a read and a write position.

Without the feed() method, we need the following:

1) A StreamQueue class that

Why is that? I thought we are talking about "Decoding incomplete unicode"?

Well, I had to choose a subject. ;)

Bye, Walter Dörwald
