[Python-Dev] Unicode charmap decoders slow

Walter Dörwald walter at livinglogic.de
Thu Oct 6 09:28:05 CEST 2005


Martin v. Löwis wrote:

> Walter Dörwald wrote:
>
>> OK, here's a patch that implements this enhancement to PyUnicode_DecodeCharmap(): http://www.python.org/sf/1313939
>
> Looks nice!
>
>> Creating the decoding_map as a string should probably be done by gencodec.py directly. This way the first import of the codec would be faster too.
>
> Hmm. How would you represent the string in source code? As a Unicode literal? With \u escapes,

Yes, simply by outputting repr(decoding_string).
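As a rough illustration of that idea (not the actual gencodec.py change), the generator could write the table out as a single escaped literal. The sketch below assumes Python 3, where ascii() plays the role that repr() played for unicode objects in Python 2; the helper name emit_decoding_string is made up for this example.

```python
def emit_decoding_string(decoding_string):
    """Return one line of Python source that recreates decoding_string."""
    # ascii() escapes every non-ASCII character, so the generated
    # source stays pure ASCII and needs no coding header.
    return "decoding_string = " + ascii(decoding_string)


# Hypothetical input: the identity table for all 256 byte values.
table = "".join(chr(i) for i in range(256))
source = emit_decoding_string(table)

# Executing the generated line gives back exactly the same table.
namespace = {}
exec(source, namespace)
```

The generated line is compact but, as discussed further down, it carries no per-entry comments.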

> or in a UTF-8 source file?

This might become unreadable if your editor can't detect the coding header.

> Or as a UTF-8 string, with an explicit decode call?

This is another possibility, but it is unreadable too. We could, however, add the real code points as comments.
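To make the UTF-8 variant concrete, here is a minimal sketch in Python 3 syntax. It stores just two entries as UTF-8 bytes and decodes them once at import time; the comments carry the real code points, since the byte escapes alone do not show them.

```python
# Sketch only: two table entries stored as a UTF-8 byte string.
decoding_string = (
    b"\xc2\x9c"  # 0x04 -> U+009C: CONTROL
    b"\x09"      # 0x05 -> U+0009: HORIZONTAL TABULATION
).decode("utf-8")
```

Without the comments a reader would have to decode b"\xc2\x9c" by hand to see that it means U+009C.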

> I like the current dictionary style for being readable, as it also adds the Unicode character names into comments.

We could use

    decoding_string = (
        u"\u009c"  # 0x0004 -> U+009C: CONTROL
        u"\u0009"  # 0x0005 -> U+0009: HORIZONTAL TABULATION
        # ...
    )
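For reference, a string table of this shape can be tried out from Python with codecs.charmap_decode(), the undocumented helper that the standard library's generated codecs call. The sketch below uses Python 3 syntax and a made-up three-entry table (a real codec table covers all 256 byte values); each byte of the input is used as an index into the string.

```python
import codecs

# Hypothetical table for illustration only; the mappings are invented.
decoding_table = (
    "\u0000"  # 0x00 -> U+0000: NULL
    "\u0041"  # 0x01 -> U+0041: LATIN CAPITAL LETTER A
    "\u20ac"  # 0x02 -> U+20AC: EURO SIGN
)

# Returns the decoded text and the number of bytes consumed.
text, consumed = codecs.charmap_decode(b"\x01\x02\x01", "strict", decoding_table)
```

Because the lookup is a plain index into the string, no dictionary access is needed per byte, which is where the speedup comes from.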

However, the current approach has the advantage that only those byte values that differ from the identity mapping have to be specified.
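The two styles can be combined by keeping the short dictionary of differences in the source and deriving the full string from it when the module is imported. A minimal sketch under that assumption follows (Python 3 syntax; build_decoding_string is a made-up helper name, and undefined entries are not handled here).

```python
def build_decoding_string(overrides):
    """Build a 256-character table: identity mapping plus the overrides."""
    # Start from the identity mapping: byte value i -> code point i.
    table = [chr(i) for i in range(256)]
    # Apply only the entries that differ from the identity mapping.
    for byte_value, code_point in overrides.items():
        table[byte_value] = chr(code_point)
    return "".join(table)


# Only the differing byte values are listed, as in the dictionary style.
overrides = {
    0x04: 0x009C,  # CONTROL
    0x05: 0x0009,  # HORIZONTAL TABULATION
}
decoding_string = build_decoding_string(overrides)
```

This keeps the source short and readable, at the cost of doing the conversion on the first import of the codec.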

Bye, Walter Dörwald


