[Python-Dev] Decoding incomplete unicode (original) (raw)
Walter Dörwald walter at livinglogic.de
Wed Jul 28 17:13:17 CEST 2004
- Previous message: [Python-Dev] Decoding incomplete unicode
- Next message: [Python-Dev] Fix import errors to have data
- Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]
Hye-Shik Chang wrote:
[...]
BTW, how do you solve the problem that incomplete byte sequences are retained in the middle of a stream, but should generate errors at the end? Rough pseudo code here: (it's written in C in CJKCodecs) class StreamReader: pending = '' # incomplete def read(self, size=-1): while True: r = fp.read(size) if self.pending: r = self.pending + r self.pending = '' if r: try: outputbuffer = r.decode('utf-8') except MBERRTOOFEW: # incomplete multibyte sequence pass except MBERRILLSEQ: # illegal sequence raise UnicodeDecodeError, "illseq" if not r or size == -1: # end of the stream if r have not consumed up for the output: raise UnicodeDecodeError, "toofew"
Here's the problem: I'd like the streamreader to be able to continue even when there is no input available now. Perhaps there should be an additional argument to read() named final? If final is true, the stream reader makes sure that all pending bytes have been used up.
if r have not consumed up for the output: self.pending = remainders of r
if (size == -1 or # one time read up len(outputbuffer) > 0 or # output buffer isn't empty original length of r == 0): # the end of the stream break size = 1 # read 1 byte in next try return outputbuffer
CJKcodecs' multibytecodec structure has distinguished internal error codes for "illegal sequence" and "incomplete sequence". And each internal codecs receive a flag that indicates if immediate flush is needed at time. (for the end of streams and simple decode functions)
Bye, Walter Dörwald
- Previous message: [Python-Dev] Decoding incomplete unicode
- Next message: [Python-Dev] Fix import errors to have data
- Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]