[Python-Dev] Support of UTF-16 and UTF-32 source encodings

Stephen J. Turnbull stephen at xemacs.org
Sun Nov 15 11:42:12 EST 2015


Random832 writes:

> "Stephen J. Turnbull" <stephen at xemacs.org> writes:
>
> > I don't see any good reason for allowing non-ASCII-compatible encodings in the reference CPython interpreter.
>
> There might be a case for having the tokenizer not care about encodings at all and just operate on a stream of unicode characters provided by a different layer.

That's exactly what the PEP 263 implementation does in Python 2 (with the caveat that Python 2 doesn't know anything about Unicode: it's a UTF-8 stream, and the non-ASCII characters are treated as bytes of unknown semantics, so they can't be used in syntax). I don't know about Python 3; I haven't looked at the decoding of source programs. But I would assume it still implements PEP 263, except that since str is now either widechars or the PEP 393 encoding (i.e., flexible widechars), that encoding is now used instead of UTF-8.
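
To make the layering concrete, here is a rough sketch in Python 3 using the standard library's tokenize module. It is only an illustration of "decode first, then tokenize text", not a description of what the C tokenizer actually does internally; the helper name tokens_from_bytes and the sample source are made up for the example.

```python
import io
import tokenize


def tokens_from_bytes(source_bytes):
    """Decode source bytes per PEP 263, then tokenize the decoded text.

    Hypothetical helper for illustration only.
    """
    # Layer 1: look for a BOM or a coding cookie in the first two lines.
    encoding, _ = tokenize.detect_encoding(io.BytesIO(source_bytes).readline)
    text = source_bytes.decode(encoding)
    # Layer 2: the tokenizer is fed str lines and never sees the on-disk bytes.
    tokens = list(tokenize.generate_tokens(io.StringIO(text).readline))
    return encoding, tokens


# A Latin-1 encoded source with a non-ASCII identifier.
latin1_source = "# -*- coding: latin-1 -*-\ncaf\u00e9 = 1\n".encode("latin-1")
encoding, tokens = tokens_from_bytes(latin1_source)
names = [tok.string for tok in tokens if tok.type == tokenize.NAME]
```

In this sketch only the first layer depends on the bytes being ASCII-compatible enough to find the cookie; the second layer works purely on characters.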

I'm sure that there are plenty of ASCII-isms in the tokenizer, in the sense that it assumes the ASCII character (not byte) repertoire, but I'm not sure why Serhiy thinks that the tokenizer cares about the on-disk representation. As I say, though, I haven't looked at the code, so he might be right.

Steve


