[Python-Dev] Decoding incomplete unicode

Hye-Shik Chang hyeshik at gmail.com
Wed Jul 28 14:46:47 CEST 2004


On Wed, 28 Jul 2004 11:38:16 +0200, Walter Dörwald <walter at livinglogic.de> wrote:

> Hye-Shik Chang wrote:
>
> > On Tue, 27 Jul 2004 22:39:45 +0200, Walter Dörwald
> > <walter at livinglogic.de> wrote:
> >
> >>Pythons unicode machinery currently has problems when decoding
> >>incomplete input.
> >>
> >>When codecs.StreamReader.read() encounters a decoding error it
> >>reads more bytes from the input stream and retries decoding.
> >>This is broken for two reasons:
> >>1) The error might be due to a malformed byte sequence in the input,
> >>   a problem that can't be fixed by reading more bytes.
> >>2) There may be no more bytes available at this time. Once more
> >>   data is available decoding can't continue because bytes from
> >>   the input stream have already been read and thrown away.
> >>(sio.DecodingInputFilter has the same problems)
> >
> > StreamReaders and -Writers from CJK codecs are not suffering from
> > this problems because they have internal buffer for keeping states
> > and incomplete bytes of a sequence. In fact, CJK codecs has its
> > own implementation for UTF-8 and UTF-16 on base of its multibytecodec
> > system. It provides a "working" StreamReader/Writer already. :)
>
> Seems you had the same problems with the builtin stream readers! ;)
> BTW, how do you solve the problem that incomplete byte sequences
> are retained in the middle of a stream, but should generate errors
> at the end?

Here is rough pseudocode (the real implementation is written in C in CJKCodecs):

class StreamReader:

    pending = ''  # incomplete bytes left over from the previous read

    def read(self, size=-1):
        while True:
            r = self.fp.read(size)
            eof = (len(r) == 0)  # nothing new was read: the end of the stream
            if self.pending:
                r = self.pending + r
                self.pending = ''

            outputbuffer = ''
            if r:
                try:
                    outputbuffer = r.decode('utf-8')
                except MBERR_TOOFEW:  # incomplete multibyte sequence
                    pass              # not an error yet; the tail is kept below
                except MBERR_ILLSEQ:  # illegal sequence
                    raise UnicodeDecodeError("illseq")

            if eof or size == -1:  # end of the stream
                if r has not been fully consumed for the output:
                    raise UnicodeDecodeError("toofew")

            if r has not been fully consumed for the output:
                self.pending = the unconsumed remainder of r

            if (size == -1 or             # one-shot read of everything
                len(outputbuffer) > 0 or  # output buffer isn't empty
                eof):                     # the end of the stream
                break

            size = 1  # read 1 byte at a time on the next try

        return outputbuffer
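
For comparison, the same buffering idea can be sketched in present-day Python 3 roughly as follows. This is only an illustration, not the CJKCodecs code: the class name PendingStreamReader is made up, and it leans on codecs.utf_8_decode, a low-level CPython helper that returns the decoded text together with the number of bytes consumed. Like the pseudocode, it assumes that an empty read means the end of the stream.

```python
import codecs
import io


class PendingStreamReader:
    """Hypothetical reader that keeps an undecoded tail between reads."""

    def __init__(self, fp):
        self.fp = fp        # binary file-like object
        self.pending = b""  # incomplete bytes left over from the previous read

    def read(self, size=-1):
        while True:
            chunk = self.fp.read(size)
            # Assumption: an empty read (or a read-everything call) means
            # no more data will ever arrive.
            at_eof = not chunk or size == -1
            data = self.pending + chunk
            # final=False tolerates an incomplete tail; final=True makes it
            # an error. Illegal sequences raise in both cases.
            text, consumed = codecs.utf_8_decode(data, "strict", at_eof)
            self.pending = data[consumed:]
            if text or at_eof:
                return text
            size = 1  # read 1 byte at a time until a character completes


reader = PendingStreamReader(io.BytesIO("a\u20acb".encode("utf-8")))
pieces = [reader.read(1) for _ in range(4)]
```

Reading the five bytes of "a\u20acb" with read(1) yields "a", then the euro sign once its three bytes have arrived, then "b", then an empty string at the end. If the stream stops in the middle of the euro sign, the final read raises UnicodeDecodeError instead of silently dropping the tail.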

CJKCodecs' multibytecodec structure has distinct internal error codes for "illegal sequence" and "incomplete sequence". Each internal codec also receives a flag that indicates whether an immediate flush is needed at that point (this is set at the end of a stream and in the simple one-shot decode functions).
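
The two error kinds and the flush flag can be illustrated with the incremental decoders in today's Python standard library, where the final argument plays a role similar to the flush flag. This is a sketch of the idea only; the helper name classify and its return strings are invented for the example and have nothing to do with the C-level MBERR_* codes.

```python
import codecs


def classify(data):
    """Return 'ok', 'incomplete' or 'illegal' for a UTF-8 byte string."""
    decoder = codecs.getincrementaldecoder("utf-8")()
    try:
        # Mid-stream decoding: an incomplete tail is buffered, not rejected.
        decoder.decode(data, final=False)
    except UnicodeDecodeError:
        return "illegal"
    try:
        # Flush: with final=True a leftover tail becomes an error.
        decoder.decode(b"", final=True)
    except UnicodeDecodeError:
        return "incomplete"
    return "ok"


results = [classify(b"\xe2\x82\xac"), classify(b"\xe2\x82"), classify(b"\xff")]
```

Here the complete euro sign is 'ok', its first two bytes alone are 'incomplete', and the stray 0xff byte is 'illegal' no matter how many more bytes follow.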

Regards, Hye-Shik


