[Python-Dev] Decoding incomplete unicode

Walter Dörwald walter at livinglogic.de
Tue Aug 10 21:24:20 CEST 2004

OK, here are my current thoughts on the codec problem:

The optimal solution (ignoring backwards compatibility) would look like this: codecs.lookup() would return the following entries (this could be done by replacing the 4-entry tuple with a real object):

decode: The stateless decoding function
encode: The stateless encoding function
chunkdecoder: The stateful chunk decoder
chunkencoder: The stateful chunk encoder
streamreader: The stateful stream decoder
streamwriter: The stateful stream encoder
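
As a rough illustration of the "real object" idea, here is a minimal sketch in Python 3 (where the str/unicode types of this mail correspond to bytes/str). The class name CodecEntry and the helper lookup_entry() are hypothetical. Filling the chunkdecoder/chunkencoder slots from the incrementaldecoder/incrementalencoder attributes of codecs.CodecInfo is an assumption made only so that the example runs.

```python
import codecs
from dataclasses import dataclass
from typing import Any, Callable


@dataclass(frozen=True)
class CodecEntry:
    """Hypothetical object replacing the 4-entry tuple."""
    decode: Callable[..., Any]        # stateless decoding function
    encode: Callable[..., Any]        # stateless encoding function
    chunkdecoder: Callable[..., Any]  # factory for stateful chunk decoders
    chunkencoder: Callable[..., Any]  # factory for stateful chunk encoders
    streamreader: Callable[..., Any]  # factory for stateful stream decoders
    streamwriter: Callable[..., Any]  # factory for stateful stream encoders


def lookup_entry(name):
    """Hypothetical lookup; borrows the pieces from codecs.lookup()."""
    info = codecs.lookup(name)
    return CodecEntry(
        decode=info.decode,
        encode=info.encode,
        # Assumption: the incremental codecs stand in for the chunk codecs.
        chunkdecoder=info.incrementaldecoder,
        chunkencoder=info.incrementalencoder,
        streamreader=info.streamreader,
        streamwriter=info.streamwriter,
    )


entry = lookup_entry("utf-8")
decoder = entry.chunkdecoder(errors="strict")
```

Callers would then use attribute access (entry.chunkdecoder) instead of unpacking a tuple by position.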

The functions and classes look like this:

Stateless decoder: decode(input, errors='strict'): Function that decodes the (str) input object and returns a (unicode) output object. The decoder must decode the complete input without any remaining undecoded bytes.

Stateless encoder: encode(input, errors='strict'): Function that encodes the complete (unicode) input object and returns a (str) output object.

Stateful chunk decoder: chunkdecoder(errors='strict'): A factory function that returns a stateful decoder with the following method:

 decode(input, final=False):
     Decodes a chunk of input and returns the decoded unicode
     object. This method can be called multiple times and
     the state of the decoder will be kept between calls.
     This includes trailing incomplete byte sequences
     that will be retained until the next call to decode().
     When the argument final is true, this is the last call
     to decode() and trailing incomplete byte sequences will
     not be retained, but a UnicodeError will be raised.
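
The decode() behaviour described above can be sketched as follows, in Python 3 (bytes in, str out) and for UTF-8 only. The class name SketchChunkDecoder is hypothetical, and using codecs.utf_8_decode() with its final argument as the underlying decoding function is an assumption of this sketch.

```python
import codecs


class SketchChunkDecoder:
    """Hypothetical stateful chunk decoder for UTF-8."""

    def __init__(self, errors="strict"):
        self.errors = errors
        self.buffer = b""  # trailing incomplete bytes from the last call

    def decode(self, input, final=False):
        data = self.buffer + input
        # With final=False an incomplete trailing sequence is left
        # unconsumed; with final=True it raises a UnicodeError.
        text, consumed = codecs.utf_8_decode(data, self.errors, final)
        self.buffer = data[consumed:]
        return text


decoder = SketchChunkDecoder()
# U+00E4 is b"\xc3\xa4" and U+20AC is b"\xe2\x82\xac"; both are split
# across chunk boundaries here.
parts = [
    decoder.decode(b"\xc3"),
    decoder.decode(b"\xa4\xe2\x82"),
    decoder.decode(b"\xac", final=True),
]
```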

Stateful chunk encoder: chunkencoder(errors='strict'): A factory function that returns a stateful encoder with the following method:

 encode(input, final=False):
     Encodes a chunk of input and returns the encoded str
     object. When final is true this is the last call to
     encode().
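
A small sketch of why an encoder needs state at all, written in Python 3 (str in, bytes out): a UTF-16 encoder must write the BOM only once, on the first chunk. The class name SketchChunkEncoder is hypothetical, and this particular encoder has nothing to flush, so final is accepted but unused.

```python
import codecs


class SketchChunkEncoder:
    """Hypothetical stateful chunk encoder for UTF-16 (little endian)."""

    def __init__(self, errors="strict"):
        self.errors = errors
        self.first = True  # state: has the BOM been written yet?

    def encode(self, input, final=False):
        output = input.encode("utf-16-le", self.errors)
        if self.first:
            output = codecs.BOM_UTF16_LE + output
            self.first = False
        # final is unused here; an encoder with pending state (for
        # example a shift sequence) would flush it when final is true.
        return output


encoder = SketchChunkEncoder()
encoded = encoder.encode("a") + encoder.encode("b", final=True)
```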

Stateful stream decoder: streamreader(stream, errors='strict'): A factory function that returns a stateful decoder for reading from the byte stream stream, with the following methods:

 read(size=-1, chars=-1, final=False):
     Read unicode characters from the stream. When data
     is read from the stream it should be done in chunks of
     size bytes. If size == -1 all the remaining data
     from the stream is read. chars specifies the number
     of characters to read from the stream. read() may return
     fewer than chars characters if there's not enough data
     available in the byte stream. If chars == -1 as many
     characters are read as are available in the stream.
     Transient errors are ignored and trailing incomplete
     byte sequences are retained when final is false. Otherwise
     a UnicodeError is raised in the case of incomplete byte
     sequences.
 readline(size=-1):
         ...
 next():
         ...
 __iter__():
         ...
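
To make the interplay of size, chars and final more concrete, here is a minimal sketch of read() in Python 3, for UTF-8 only. The class name SketchStreamReader and its two buffers are hypothetical, readline(), next() and __iter__() are left out, and codecs.utf_8_decode() with a final argument is assumed as the decoding function.

```python
import codecs
import io


class SketchStreamReader:
    """Hypothetical stream decoder for UTF-8 supporting final."""

    def __init__(self, stream, errors="strict"):
        self.stream = stream
        self.errors = errors
        self.bytebuffer = b""  # undecoded trailing bytes
        self.charbuffer = ""   # decoded but not yet returned characters

    def read(self, size=-1, chars=-1, final=False):
        while chars < 0 or len(self.charbuffer) < chars:
            newdata = self.stream.read() if size < 0 else self.stream.read(size)
            data = self.bytebuffer + newdata
            text, consumed = codecs.utf_8_decode(data, self.errors, final)
            self.bytebuffer = data[consumed:]
            self.charbuffer += text
            # Stop when the stream is exhausted, or after one chunk if
            # the caller did not ask for a specific number of characters.
            if not newdata or chars < 0:
                break
        if chars < 0:
            result, self.charbuffer = self.charbuffer, ""
        else:
            result = self.charbuffer[:chars]
            self.charbuffer = self.charbuffer[chars:]
        return result


# "a", U+00E4 and U+20AC, read in two-byte chunks that split characters.
reader = SketchStreamReader(io.BytesIO(b"a\xc3\xa4\xe2\x82\xac"))
pieces = [reader.read(size=2), reader.read(size=2), reader.read(size=2, final=True)]
```

With final=False a truncated stream simply yields fewer characters. Only a call with final=True turns the leftover bytes into a UnicodeError.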

Stateful stream encoder: streamwriter(stream, errors='strict'): A factory function that returns a stateful encoder for writing unicode data to the byte stream stream, with the following methods:

 write(data, final=False):
     Encodes the unicode object data and writes it
     to the stream. If final is true this is the last
     call to write().
 writelines(data):
     ...

I know that this is quite a departure from the current API, and I'm not sure if we can get all of the functionality without sacrificing backwards compatibility.

I don't particularly care about the "bytes consumed" return value from the stateless codec. The codec should always have returned only the encoded/decoded object, but I guess fixing this would break too much code. And users who are only interested in the stateless functionality will probably use unicode.encode/str.decode anyway.
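
For illustration, the difference between the two stateless entry points looks like this in Python 3, where bytes.decode() plays the role of str.decode(). Only documented calls are used; the variable names are arbitrary.

```python
import codecs

# The stateless codec function returns (object, length consumed).
stateless_decode = codecs.getdecoder("utf-8")
text, consumed = stateless_decode(b"abc")

# The method on the object itself returns only the decoded object.
plain = b"abc".decode("utf-8")
```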

For the stateful API it would be possible to combine the chunk and stream decoder/encoder into one class with the following methods (for the decoder):

 __init__(stream, errors='strict'):
     Like the current StreamReader constructor, but stream may be
     None, if only the chunk API is used.
 decode(input, final=False):
     Like decode() in the current StreamReader (i.e. it returns a
     (unicode, int) tuple). This does not keep the remaining bytes
     in a buffer; that is the job of the caller.
 feed(input, final=False):
     Decodes input and returns a decoded unicode object. This method
     calls decode() internally and manages the byte buffer.
 read(size=-1, chars=-1, final=False):
 readline(size=-1):
 next():
 __iter__():
     See above.

As before, implementers of decoders only need to implement decode().
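
A minimal sketch of this split between decode() and feed(), in Python 3: the base class owns the byte buffer, and a concrete codec supplies only decode(). The class names SketchReaderBase and SketchUTF16LEReader are hypothetical, the read() family is omitted, and codecs.utf_16_le_decode() is assumed as the underlying function.

```python
import codecs


class SketchReaderBase:
    """Hypothetical combined chunk/stream decoder base class."""

    def __init__(self, stream=None, errors="strict"):
        self.stream = stream  # may be None if only feed() is used
        self.errors = errors
        self.bytebuffer = b""

    def decode(self, input, final=False):
        """Return a (text, bytes consumed) tuple; keeps no buffer."""
        raise NotImplementedError

    def feed(self, input, final=False):
        """Decode input, retaining unconsumed bytes for the next call."""
        data = self.bytebuffer + input
        text, consumed = self.decode(data, final)
        self.bytebuffer = data[consumed:]
        return text


class SketchUTF16LEReader(SketchReaderBase):
    # The codec author implements only decode().
    def decode(self, input, final=False):
        return codecs.utf_16_le_decode(input, self.errors, final)


reader = SketchUTF16LEReader()
fed = [reader.feed(b"a"), reader.feed(b"\x00b\x00", final=True)]
direct = SketchUTF16LEReader().decode(b"a\x00b")
```

Here feed() hides the leftover byte from the caller, while decode() reports via the consumed count that the caller still owns it.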

To be able to support the final argument the decoding functions in _codecsmodule.c could get an additional argument. With this they could be used for the stateless codecs too and we can reduce the number of functions again.

Unfortunately adding the final argument breaks all of the current codecs, but dropping the final argument requires one of two changes:

1) When the input stream is exhausted, the bytes read are parsed as if final=True. That's the way the CJK codecs currently handle it, but unfortunately this doesn't work with the feed decoder.
2) Simply ignore any remaining undecoded bytes at the end of the stream.

If we really have to drop the final argument, I'd prefer 2).
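
The two alternatives can be compared with a small Python 3 sketch for the stream case. The function decode_stream() and its at_eof parameter are hypothetical and exist only for this comparison: at_eof="raise" treats the leftover bytes as final input once the stream is exhausted, at_eof="ignore" silently drops them.

```python
import codecs
import io


def decode_stream(stream, chunk_size, at_eof="raise"):
    """Hypothetical helper: decode a UTF-8 byte stream chunk by chunk."""
    buffer = b""
    pieces = []
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            break  # stream exhausted
        data = buffer + chunk
        text, consumed = codecs.utf_8_decode(data, "strict", False)
        buffer = data[consumed:]
        pieces.append(text)
    if buffer and at_eof == "raise":
        # First alternative: parse the leftover as if final=True,
        # which raises for an incomplete sequence.
        pieces.append(codecs.utf_8_decode(buffer, "strict", True)[0])
    # Second alternative (at_eof="ignore"): drop the leftover bytes.
    return "".join(pieces)


truncated = b"a\xc3"  # "a" followed by half of a two-byte sequence
ignored = decode_stream(io.BytesIO(truncated), 1, at_eof="ignore")
```

Note that the first alternative relies on seeing the end of the stream, which a pure feed() interface never does.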

I've uploaded a second version of the patch. It implements the final argument, adds the feed() method to StreamReader and again merges the duplicate decoding functions in the codecs module. Note that the patch isn't really finished (the final argument isn't completely supported in the encoders and the CJK and escape codecs are unchanged), but it should be sufficient as a base for discussion.

Bye, Walter Dörwald
