[Python-Dev] Stateful codecs [Was: str object going in Py3K]
Walter Dörwald walter at livinglogic.de
Sat Feb 18 22:08:19 CET 2006
M.-A. Lemburg wrote:
Walter Dörwald wrote:
M.-A. Lemburg wrote:
Walter Dörwald wrote:
[...]
Perhaps we should also deprecate codecs.lookup() in Py 2.5 ?!

+1, but I'd like to have a replacement for this, i.e. a function that returns all info the registry has about an encoding:

1. Name
2. Encoder function
3. Decoder function
4. Stateful encoder factory
5. Stateful decoder factory
6. Stream writer factory
7. Stream reader factory

and if this is an object with attributes, we won't have any problems if we extend it in the future.

Shouldn't be a problem: just expose the registry dictionary via the codecs module. The rest can then be done in a Python function defined in codecs.py using a CodecInfo class.

This would require the Python code to call codecs.lookup() and then look into the codecs dictionary (normalizing the encoding name again). Maybe we should make a version of PyCodecLookup() that allows 4- and 6-tuples available to Python and use that? The official PyCodecLookup() would then have to downgrade the 6-tuples to 4-tuples.

Hmm, you're right: the dictionary may not have the requested codec info yet (it's only used as cache) and only a call to PyCodecLookup() would fill it.
I'm now trying a different approach: codecs.lookup() returns a subclass of tuple. We could deprecate calling __getitem__() in 2.5/2.6 and then remove the tuple subclassing later.
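A minimal sketch of the tuple-subclass idea, to make the backward-compatibility argument concrete. The class name CodecInfoSketch and the attribute names statefulencoder/statefuldecoder are hypothetical and chosen only for illustration; they are not taken from the patch. Old callers keep unpacking four items, new callers read attributes:

```python
import codecs


class CodecInfoSketch(tuple):
    """Hypothetical stand-in: a 4-tuple for old callers, attributes for new ones."""

    def __new__(cls, name, encode, decode, streamreader, streamwriter,
                statefulencoder=None, statefuldecoder=None):
        # Old-style callers still see (encode, decode, streamreader, streamwriter).
        self = super().__new__(cls, (encode, decode, streamreader, streamwriter))
        # New-style callers use named attributes, which can grow over time.
        self.name = name
        self.encode = encode
        self.decode = decode
        self.streamreader = streamreader
        self.streamwriter = streamwriter
        self.statefulencoder = statefulencoder
        self.statefuldecoder = statefuldecoder
        return self


# Build an instance from pieces the codecs module already exposes.
info = CodecInfoSketch(
    "latin-1",
    codecs.getencoder("latin-1"),
    codecs.getdecoder("latin-1"),
    codecs.getreader("latin-1"),
    codecs.getwriter("latin-1"),
)

# Existing code that unpacks the 4-tuple keeps working unchanged.
encode, decode, reader, writer = info
```

Adding a further attribute later does not change the length of the tuple, so code that indexes or unpacks the result is unaffected.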
BTW, if we change the API, can we fix the return value of the stateless functions? As the stateless function always encodes/decodes the complete string, returning the length of the string doesn't make sense. codecs.getencoder() and codecs.getdecoder() would have to continue to return the old variant of the functions, but codecs.getinfo("latin-1").encoder would be the new encoding function.

No: you can still write stateless encoders or decoders that do not process the whole input string. Just because we don't have any of those in Python, doesn't mean that they can't be written and used. A stateless codec might want to leave the work of buffering bytes at the end of the input data which cannot be processed to the caller.

But what would the caller do with that info? It can't retry encoding/decoding the rejected input, because the state of the codec has been thrown away already.

This depends a lot on the nature of the codec. It may well be possible to work on chunks of input data in a stateless way, e.g. say you have a string of 4-byte hex values, then the decode function would be able to work on 4 bytes each and let the caller buffer any remaining bytes for the next call. There'd be no need for keeping state in the decoder function.
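The 4-byte hex case could look roughly like the sketch below. The function names hex4_decode and feed_chunks are made up for this illustration, and the assumption that each 4-byte group is the hex code of one character is mine. The decoder keeps no state; it reports how many bytes it consumed and the caller carries the remainder over to the next call:

```python
def hex4_decode(data):
    """Stateless: decode only the complete 4-byte hex groups in data.

    Returns (text, consumed), where consumed may be less than len(data).
    """
    usable = len(data) - len(data) % 4
    chars = [
        chr(int(data[i:i + 4].decode("ascii"), 16))
        for i in range(0, usable, 4)
    ]
    return "".join(chars), usable


def feed_chunks(chunks):
    """Caller-side buffering: keep unprocessed trailing bytes for the next call."""
    pending = b""
    out = []
    for chunk in chunks:
        data = pending + chunk
        text, consumed = hex4_decode(data)
        out.append(text)
        pending = data[consumed:]  # bytes the decoder did not process
    return "".join(out), pending


# The input is split at arbitrary points, not on 4-byte boundaries.
result, leftover = feed_chunks([b"00", b"4100", b"42"])
```

The returned count is what makes this work: without it the caller could not tell which trailing bytes to keep.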
So an incomplete byte sequence would be silently ignored.
It is also possible to write stateful codecs on top of such stateless encoding and decoding functions.
That's what the codec helper functions from Python/codecs.c are for.

I'm not sure what you mean here.
_codecs.utf_8_decode() etc. use (result, count) tuples as the return value, because those functions are the building blocks of the codecs themselves.
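As an illustration of why the (result, count) return value is useful as a building block, here is a rough sketch of a stateful decoder layered on codecs.utf_8_decode(). The class name SketchStatefulDecoder is hypothetical and this is not the code from the patch; it only shows the buffering pattern under the assumption that the building-block function is called with final=False until the last chunk:

```python
import codecs


class SketchStatefulDecoder:
    """Hypothetical stateful decoder built on a (result, count) function."""

    def __init__(self, errors="strict"):
        self.errors = errors
        self.pending = b""  # undecoded bytes carried over between calls

    def decode(self, data, final=False):
        data = self.pending + data
        # With final=False an incomplete trailing sequence is left unconsumed.
        text, consumed = codecs.utf_8_decode(data, self.errors, final)
        self.pending = data[consumed:]
        return text


euro = "\u20ac".encode("utf-8")  # three bytes: e2 82 ac

decoder = SketchStatefulDecoder()
first = decoder.decode(euro[:1])   # only the lead byte so far
second = decoder.decode(euro[1:])  # the rest of the sequence arrives
```

Here the state lives entirely in the wrapper object, while the underlying function stays stateless.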
Anyway, I've started implementing a patch that just adds codecs.StatefulEncoder/codecs.StatefulDecoder. UTF8, UTF8-Sig, UTF-16, UTF-16-LE and UTF-16-BE are already working.

Nice :-)
gencodec.py is updated now too. The rest should be manageable too. I'll leave updating the CJKV codecs to Hye-Shik though.
Bye, Walter Dörwald