[Python-Dev] Decoding incomplete unicode
Hye-Shik Chang hyeshik at gmail.com
Wed Jul 28 14:46:47 CEST 2004
On Wed, 28 Jul 2004 11:38:16 +0200, Walter Dörwald <walter at livinglogic.de> wrote:
Hye-Shik Chang wrote:
> On Tue, 27 Jul 2004 22:39:45 +0200, Walter Dörwald
> <walter at livinglogic.de> wrote:
>
>>Pythons unicode machinery currently has problems when decoding
>>incomplete input.
>>
>>When codecs.StreamReader.read() encounters a decoding error it
>>reads more bytes from the input stream and retries decoding.
>>This is broken for two reasons:
>>1) The error might be due to a malformed byte sequence in the input,
>>   a problem that can't be fixed by reading more bytes.
>>2) There may be no more bytes available at this time. Once more
>>   data is available decoding can't continue because bytes from
>>   the input stream have already been read and thrown away.
>>(sio.DecodingInputFilter has the same problems)
>
> StreamReaders and -Writers from CJK codecs are not suffering from
> this problems because they have internal buffer for keeping states
> and incomplete bytes of a sequence. In fact, CJK codecs has its
> own implementation for UTF-8 and UTF-16 on base of its multibytecodec
> system. It provides a "working" StreamReader/Writer already. :)

Seems you had the same problems with the builtin stream readers! ;)

BTW, how do you solve the problem that incomplete byte sequences
are retained in the middle of a stream, but should generate errors
at the end?

Rough pseudocode follows (the real implementation in CJKCodecs is written in C):

  class StreamReader:

      pending = '' # incomplete

      def read(self, size=-1):
          while True:
              r = fp.read(size)
              if self.pending:
                  r = self.pending + r
                  self.pending = ''

              if r:
                  try:
                      outputbuffer = r.decode('utf-8')
                  except MBERR_TOOFEW: # incomplete multibyte sequence
                      pass
                  except MBERR_ILLSEQ: # illegal sequence
                      raise UnicodeDecodeError, "illseq"

              if not r or size == -1: # end of the stream
                  if r have not consumed up for the output:
                      raise UnicodeDecodeError, "toofew"

              if r have not consumed up for the output:
                  self.pending = remainders of r

              if (size == -1 or # one time read up
                  len(outputbuffer) > 0 or # output buffer isn't empty
                  original length of r == 0): # the end of the stream
                  break

              size = 1 # read 1 byte in next try

          return outputbuffer

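
For readers who want something they can run, here is a minimal Python 3 sketch of the same buffering idea. It is not the CJKCodecs code: the names `split_incomplete_tail` and `BufferingUTF8Reader` are hypothetical, the "too few bytes" case is detected by inspecting the last few bytes instead of by an internal error code, and, like the pseudocode, it assumes that an empty read means the end of the stream.

```python
import io


def split_incomplete_tail(data):
    """Return (complete, tail), where tail is a truncated trailing UTF-8 sequence.

    Hypothetical helper: only the last three bytes are inspected, because a
    UTF-8 sequence is at most four bytes long.
    """
    for back in range(1, min(3, len(data)) + 1):
        byte = data[-back]
        if byte & 0xC0 == 0x80:
            continue  # continuation byte, keep looking for the lead byte
        if byte >= 0xF0:
            need = 4
        elif byte >= 0xE0:
            need = 3
        elif byte >= 0xC0:
            need = 2
        else:
            need = 1
        if need > back:
            # The lead byte announces more bytes than are available so far.
            return data[:-back], data[-back:]
        break
    return data, b""


class BufferingUTF8Reader:
    """Hypothetical reader that keeps incomplete bytes between read() calls."""

    def __init__(self, stream):
        self.stream = stream
        self.pending = b""  # incomplete trailing bytes from the previous read

    def read(self, size=-1):
        while True:
            chunk = self.stream.read(size)
            # Assumption: a read-everything call or an empty read is the end.
            at_end = size < 0 or not chunk
            data = self.pending + chunk
            self.pending = b""
            if at_end:
                # Strict decode: leftover incomplete bytes now raise
                # UnicodeDecodeError (the "toofew" case at end of stream).
                return data.decode("utf-8")
            complete, self.pending = split_incomplete_tail(data)
            # Illegal sequences raise UnicodeDecodeError here (the "illseq" case).
            text = complete.decode("utf-8")
            if text:
                return text
            size = 1  # nothing decodable yet: read one more byte and retry


reader = BufferingUTF8Reader(io.BytesIO(b"a\xe4\xb8\xad"))
first = reader.read(2)   # the second byte starts a three-byte sequence
second = reader.read(1)  # completed from the pending bytes plus further reads
```

The first call returns only the complete character and keeps the dangling lead byte in `pending`; the second call keeps reading one byte at a time until that character can be decoded.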

The multibytecodec structure in CJKCodecs has distinct internal error codes for "illegal sequence" and "incomplete sequence". Each internal codec also receives a flag that indicates whether an immediate flush is needed at that point (this is set at the end of a stream and for the simple one-shot decode functions).

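
As an illustration only, the effect of such a flush flag can be seen with the incremental decoder interface in Python's `codecs` module, whose `decode()` takes a `final` argument. This is a stand-in from the standard library, not the CJKCodecs C internals, and the helper name `decode_truncated` is hypothetical.

```python
import codecs

# With final=False the decoder buffers an incomplete sequence instead of failing.
decoder = codecs.getincrementaldecoder("utf-8")()
first = decoder.decode(b"\xe4\xb8", final=False)  # two bytes of a three-byte sequence
second = decoder.decode(b"\xad", final=False)     # the missing third byte arrives


def decode_truncated():
    """Return the error reason when the same truncated input is flushed."""
    strict = codecs.getincrementaldecoder("utf-8")()
    try:
        # final=True says no more input will come, so incomplete bytes are an error.
        strict.decode(b"\xe4\xb8", final=True)
    except UnicodeDecodeError as exc:
        return exc.reason
    return None
```

The same two bytes are harmless in the middle of a stream and an error once the caller signals that no more input will follow.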
Regards, Hye-Shik