[Python-Dev] Stateful codecs [Was: str object going in Py3K]

Walter Dörwald walter at livinglogic.de
Sat Feb 18 17:11:39 CET 2006


M.-A. Lemburg wrote:

Walter Dörwald wrote:

I'd suggest we keep codecs.lookup() the way it is and instead add new functions to the codecs module, e.g. codecs.getencoderobject() and codecs.getdecoderobject().

Changing the codec registration is not much of a problem: we could simply allow 6-tuples to be passed into the registry.

OK, so codecs.lookup() returns 4-tuples, but the registry stores 6-tuples and the search functions must return 6-tuples. And we add codecs.getencoderobject() and codecs.getdecoderobject() as well as new classes codecs.StatefulEncoder and codecs.StatefulDecoder. What about old search functions that return 4-tuples?

The registry should then simply set the missing entries to None and the getencoderobject()/getdecoderobject() would then have to raise an error.

Sounds simple enough and we don't lose backwards compatibility.

Perhaps we should also deprecate codecs.lookup() in Py 2.5 ?!

+1, but I'd like to have a replacement for this, i.e. a function that returns all info the registry has about an encoding:

1. Name
2. Encoder function
3. Decoder function
4. Stateful encoder factory
5. Stateful decoder factory
6. Stream writer factory
7. Stream reader factory

and if this is an object with attributes, we won't have any problems if we extend it in the future.

Shouldn't be a problem: just expose the registry dictionary via the codecs module. The rest can then be done in a Python function defined in codecs.py using a CodecInfo class.
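To make the registry idea above concrete, here is a minimal sketch, not the actual patch. The CodecInfo record follows the seven-item list; the helper make_info(), the attribute names, and the order of the entries inside the 6-tuple are assumptions made for illustration only.

```python
from collections import namedtuple

# Hypothetical record with one attribute per item in the list above.
CodecInfo = namedtuple(
    "CodecInfo",
    "name encode decode statefulencoder statefuldecoder streamwriter streamreader",
)


def make_info(name, entry):
    """Build a CodecInfo from an old 4-tuple or a new 6-tuple."""
    if len(entry) == 4:
        # Old search function: (encode, decode, streamreader, streamwriter).
        encode, decode, reader, writer = entry
        enc_factory = dec_factory = None  # missing entries are set to None
    elif len(entry) == 6:
        # Assumed order for the new 6-tuple; the thread does not fix it.
        encode, decode, enc_factory, dec_factory, reader, writer = entry
    else:
        raise TypeError("search functions must return 4- or 6-tuples")
    return CodecInfo(
        name=name,
        encode=encode,
        decode=decode,
        statefulencoder=enc_factory,
        statefuldecoder=dec_factory,
        streamwriter=writer,
        streamreader=reader,
    )


def getencoderobject(info, errors="strict"):
    """Return a stateful encoder, or raise if the codec predates them."""
    if info.statefulencoder is None:
        raise LookupError("codec %r has no stateful encoder" % info.name)
    return info.statefulencoder(errors)
```

Because the result is an object with named attributes, a later extension only adds an attribute and does not change the length of a tuple that existing callers unpack.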

This would require the Python code to call codecs.lookup() and then look into the codecs dictionary (normalizing the encoding name again). Maybe we should make a version of __PyCodec_Lookup() that allows 4- and 6-tuples available to Python and use that? The official PyCodec_Lookup() would then have to downgrade the 6-tuples to 4-tuples.

BTW, if we change the API, can we fix the return value of the stateless functions? As the stateless function always encodes/decodes the complete string, returning the length of the string doesn't make sense. codecs.getencoder() and codecs.getdecoder() would have to continue to return the old variant of the functions, but codecs.getinfo("latin-1").encoder would be the new encoding function.

No: you can still write stateless encoders or decoders that do not process the whole input string. Just because we don't have any of those in Python, doesn't mean that they can't be written and used. A stateless codec might want to leave the work of buffering bytes at the end of the input data which cannot be processed to the caller.
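As an illustration of a decoding function that does not consume all of its input, the sketch below uses codecs.utf_8_decode() as found in current CPython, called with final=False. That helper and its third argument are not part of the proposal in this thread; the example only shows what the returned length can tell the caller.

```python
import codecs

euro = "\u20ac".encode("utf-8")   # three bytes: e2 82 ac
chunk = b"ab" + euro[:2]          # the last character is cut off

# final=False asks the function to stop before an incomplete sequence
# instead of raising an error.
text, consumed = codecs.utf_8_decode(chunk, "strict", False)

# The caller keeps the bytes that were not consumed ...
leftover = chunk[consumed:]

# ... and feeds them back in once the rest of the data has arrived.
rest, consumed_rest = codecs.utf_8_decode(leftover + euro[2:], "strict", True)
```

This works for UTF-8 because the unconsumed bytes are the only thing that has to be remembered between calls. A codec that carries other state between calls is not covered by this sketch.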

But what would the caller do with that info? It can't retry encoding/decoding the rejected input, because the state of the codec has been thrown away already.

It is also possible to write stateful codecs on top of such stateless encoding and decoding functions.

That's what the codec helper functions from Python/_codecs.c are for.
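A stateful decoder layered on a stateless decoding function could look roughly like the following. This is a sketch under assumptions, not the code from the patch: BufferingDecoder is a hypothetical class, and it relies on a decode callable with the signature (data, errors, final) that returns (text, consumed), which is what codecs.utf_8_decode() provides in current CPython.

```python
import codecs


class BufferingDecoder:
    """Hypothetical stateful decoder wrapping a stateless decode function."""

    def __init__(self, decode, errors="strict"):
        self._decode = decode   # callable(data, errors, final) -> (text, consumed)
        self._errors = errors
        self._pending = b""     # bytes the stateless function did not consume

    def decode(self, data, final=False):
        # Prepend whatever was left over from the previous call.
        data = self._pending + data
        text, consumed = self._decode(data, self._errors, final)
        # Remember the unconsumed tail for the next call.
        self._pending = data[consumed:]
        return text


dec = BufferingDecoder(codecs.utf_8_decode)
raw = "na\u00efve \u20ac".encode("utf-8")

# Feed the input one byte at a time, then flush with final=True.
parts = [dec.decode(raw[i:i + 1]) for i in range(len(raw))]
parts.append(dec.decode(b"", final=True))
result = "".join(parts)
```

The only state kept here is the buffer of pending bytes, so the wrapper stays independent of the encoding as long as the wrapped function reports how much input it consumed.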

Anyway, I've started implementing a patch that just adds codecs.StatefulEncoder/codecs.StatefulDecoder. UTF8, UTF8-Sig, UTF-16, UTF-16-LE and UTF-16-BE are already working.

Bye,
   Walter Dörwald
