[Python-3000] PEP 3120 (Was: PEP Parade)
"Martin v. Löwis" martin at v.loewis.de
Thu May 3 09:19:04 CEST 2007
- Previous message: [Python-3000] PEP Parade
- Next message: [Python-3000] PEP 3120 (Was: PEP Parade)
- Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]
> S 3120 Using UTF-8 as the default source encoding - von Löwis
>
> The basic idea seems very reasonable. I expect that the changes to the parser may be quite significant, though. Also, the parser ought to be weaned off C stdio in favor of Python's own I/O library. I wonder if it's really possible to let the parser read the raw bytes, though -- this would seem to rule out supporting encodings like UTF-16. Somehow I wonder if it wouldn't be easier if the parser operated on Unicode input? That way, parsing Unicode strings (which we must support, as all strings will become Unicode) will be simpler.
Actually, the changes should be fairly minimal. The parser already transforms all input (no matter what the source encoding is) to UTF-8 before doing the parsing; this has worked well, as all keywords continue to be one-byte characters. The parser also already special-cases UTF-8 as the input encoding by not putting it through a codec. That can also stay, except that it should now check that any non-ASCII bytes are well-formed UTF-8.
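To illustrate the check described above: a minimal sketch of a UTF-8 well-formedness test in Python (the helper name is hypothetical; the real parser would do this byte-by-byte in C):

```python
def is_well_formed_utf8(source: bytes) -> bool:
    """Return True if every byte sequence in `source` is valid UTF-8."""
    try:
        source.decode("utf-8")  # strict decoding rejects malformed sequences
        return True
    except UnicodeDecodeError:
        return False

# ASCII and properly encoded non-ASCII pass:
assert is_well_formed_utf8("Löwis".encode("utf-8"))
# A lead byte (0xC3) not followed by a continuation byte fails:
assert not is_well_formed_utf8(b"\xc3\x28")
```

Since ASCII bytes are valid UTF-8 as-is, this check is a no-op for purely ASCII source files.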
Untangling the parser from stdio - sure. I also think it would be desirable to read the whole source into a buffer, rather than reading it line by line. That might be a bigger change, making the tokenizer a multi-stage algorithm:
- read input into a buffer
- determine source encoding (looking at a BOM, else a declaration within the first two lines, else default to UTF-8)
- if the source encoding is not UTF-8, pass it through a codec (decode to string, encode to UTF-8). Otherwise, check that all bytes are really well-formed UTF-8.
- start parsing
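The stages above could be sketched as follows (a hedged illustration, not the actual C implementation; the function name and the simplified coding-declaration regex are assumptions loosely modeled on PEP 263):

```python
import re

# Simplified version of the PEP 263 coding-declaration pattern.
CODING_RE = re.compile(rb"^[ \t\f]*#.*?coding[:=][ \t]*([-\w.]+)")

def source_to_utf8(raw: bytes) -> bytes:
    """Normalize a source buffer to UTF-8, per the staged scheme above."""
    # Stage: BOM check. A UTF-8 BOM pins the encoding; strip it.
    if raw.startswith(b"\xef\xbb\xbf"):
        raw = raw[3:]
        raw.decode("utf-8")  # well-formedness check
        return raw
    # Stage: look for a declaration within the first two lines.
    encoding = "utf-8"  # default per PEP 3120
    for line in raw.split(b"\n", 2)[:2]:
        m = CODING_RE.match(line)
        if m:
            encoding = m.group(1).decode("ascii")
            break
    # Stage: decode with the declared encoding and re-encode as UTF-8.
    # When the declared encoding is already UTF-8, this doubles as the
    # well-formedness check.
    return raw.decode(encoding).encode("utf-8")
```

After this normalization the tokenizer itself only ever sees well-formed UTF-8, so parsing proper can proceed unchanged.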
As for UTF-16: the lexer currently does not support UTF-16 as a source encoding, as we require an ASCII superset.
I'm not sure whether UTF-16 needs to be supported as a source encoding, but with the above changes, it would be fairly easy to support, assuming we detect UTF-16 from the BOM (we can't use the encoding declaration, because that works only for ASCII supersets).
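BOM-based detection along these lines is straightforward to sketch (the helper name is hypothetical; `codecs` provides the BOM constants):

```python
import codecs

def detect_bom(raw: bytes):
    """Return (encoding, bom_length) from a leading BOM, or (None, 0)."""
    # The three BOMs have no common prefix, so order doesn't matter here.
    if raw.startswith(codecs.BOM_UTF8):        # EF BB BF
        return "utf-8", 3
    if raw.startswith(codecs.BOM_UTF16_LE):    # FF FE
        return "utf-16-le", 2
    if raw.startswith(codecs.BOM_UTF16_BE):    # FE FF
        return "utf-16-be", 2
    return None, 0  # no BOM: fall back to the declaration, then UTF-8
```

This is exactly why the encoding declaration can't be used for UTF-16: the `# coding:` comment is itself encoded, and in UTF-16 its bytes are no longer an ASCII sequence the tokenizer could scan for.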
Regards, Martin