msg202331 - (view) |
Author: STINNER Victor (vstinner) *  |
Date: 2013-11-07 12:40 |
The Python parser (Parser/tokenizer.c) has a translate_into_utf8() function that decodes a string from the input encoding and re-encodes it to UTF-8. This step is unnecessary when the input string is already encoded in UTF-8, which is common nowadays: Linux, Mac OS X and many other operating systems use UTF-8 as the default locale encoding, UTF-8 is the default encoding for Python scripts, etc. The compile(), eval() and exec() functions pass UTF-8 encoded strings to the parser. The attached patch adds an input_is_utf8 flag to the tokenizer to skip translate_into_utf8() if the input string is already encoded in UTF-8. |
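
To make the proposed fast path concrete, here is a minimal Python sketch of the idea. It is not the C code from the patch: the function name `translate_to_utf8` and its signature are hypothetical stand-ins, and only the `input_is_utf8` flag name is taken from the message above.

```python
def translate_to_utf8(data: bytes, encoding: str, input_is_utf8: bool = False) -> bytes:
    """Hypothetical stand-in for the tokenizer's decode/re-encode step."""
    if input_is_utf8:
        # Fast path: the caller promises that data is already UTF-8,
        # so the decode/encode round trip is skipped entirely.
        return data
    # Slow path: decode from the input encoding, then encode to UTF-8.
    return data.decode(encoding).encode("utf-8")


latin1_source = b"caf\xe9"        # "café" encoded in Latin-1
utf8_source = b"caf\xc3\xa9"      # "café" encoded in UTF-8

print(translate_to_utf8(latin1_source, "latin-1"))
print(translate_to_utf8(utf8_source, "utf-8", input_is_utf8=True))
```

Note that the fast path as sketched performs no validation: if the flag is set for data that is not actually UTF-8, the bytes pass through unchanged.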
|
|
msg202334 - (view) |
Author: STINNER Victor (vstinner) *  |
Date: 2013-11-07 12:56 |
The patch has an issue: importing test.bad_coding2 (UTF-8 with a BOM) no longer raises a SyntaxError. |
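
The kind of check that can be lost here can be illustrated from pure Python, assuming the regression concerns the rule that a UTF-8 BOM combined with a conflicting coding cookie must be rejected. The sketch uses `tokenize.detect_encoding()` from the standard library, not the C tokenizer the patch touches; the helper names `detect` and `bom_conflict` are made up for illustration.

```python
import io
import tokenize

UTF8_BOM = b"\xef\xbb\xbf"


def detect(source: bytes) -> str:
    """Return the encoding that tokenize would use for this source."""
    encoding, _lines = tokenize.detect_encoding(io.BytesIO(source).readline)
    return encoding


def bom_conflict(source: bytes) -> bool:
    """True if the BOM and the coding cookie disagree (SyntaxError)."""
    try:
        detect(source)
    except SyntaxError:
        return True
    return False


print(detect(UTF8_BOM + b"x = 1\n"))
print(bom_conflict(UTF8_BOM + b"# coding: latin-1\nx = 1\n"))
```

A fast path that skips encoding handling for "already UTF-8" input has to keep this consistency check, otherwise such files are silently accepted.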
|
|
msg202339 - (view) |
Author: Serhiy Storchaka (serhiy.storchaka) *  |
Date: 2013-11-07 13:48 |
The parser should check that the input is actually valid UTF-8 data. |
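
As an illustration of what such a check amounts to, a strict UTF-8 decode rejects invalid input. This is a Python sketch only; the helper name `check_utf8` is hypothetical, and the C tokenizer would use its own validation routine.

```python
def check_utf8(data: bytes) -> bool:
    """Return True if data is well-formed UTF-8, False otherwise."""
    try:
        # Strict decoding raises on invalid bytes, truncated sequences
        # and encoded surrogates.
        data.decode("utf-8", errors="strict")
    except UnicodeDecodeError:
        return False
    return True


print(check_utf8("x = 'é'".encode("utf-8")))  # valid input
print(check_utf8(b"x = '\xff'"))              # 0xFF never appears in UTF-8
```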
|
|
msg202340 - (view) |
Author: STINNER Victor (vstinner) *  |
Date: 2013-11-07 14:03 |
> The parser should check that the input is actually valid UTF-8 data. Ah yes, correct. It looks like the input data is still checked for valid UTF-8. I suppose that byte strings have to be decoded from UTF-8 because Python 3 manipulates Unicode strings, not byte strings. The patch only skips calls to translate_into_utf8(str, tok->encoding); calls to translate_into_utf8(str, tok->enc) are unchanged (notice: encoding != enc :-)). But it looks like translate_into_utf8(str, tok->enc) is not called if tok->enc is NULL. If tok->encoding is "utf-8" and tok->enc is NULL, maybe the input string is not decoded from UTF-8. But that sounds strange, because Python uses Unicode strings. Don't trust me, I would prefer an explanation from Benjamin, who knows the parser internals better than I do :-) |
|
|
msg202346 - (view) |
Author: Martin v. Löwis (loewis) *  |
Date: 2013-11-07 15:44 |
tok->enc and tok->encoding should always have the same value, except that tok->enc gets set earlier. tok->enc is used when parsing from strings, to remember what codec to use. For file-based parsing, the codec object that is created knows what encoding to use; for string-based parsing, tok->enc stores the encoding. If the code is to be simplified, unifying the cases of string-based parsing and file-based parsing might be a worthwhile goal. |
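
The two cases can be modelled loosely in Python. This is only an analogy to the C tokenizer, with hypothetical function names: in the file-based path a codec reader object carries the encoding, while in the string-based path the encoding name has to be remembered separately (the role described for tok->enc).

```python
import codecs
import io


def utf8_lines_from_file(stream, encoding: str) -> list:
    """File-based path: the codec reader object knows the encoding."""
    reader = codecs.getreader(encoding)(stream)
    return [line.encode("utf-8") for line in reader]


def utf8_lines_from_string(data: bytes, enc: str) -> list:
    """String-based path: the stored encoding name drives the decode."""
    text = data.decode(enc)
    return [line.encode("utf-8") for line in text.splitlines(keepends=True)]


source = b"caf\xe9 = 1\nx = 2\n"  # Latin-1 encoded source
print(utf8_lines_from_file(io.BytesIO(source), "latin-1"))
print(utf8_lines_from_string(source, "latin-1"))
```

Both paths produce the same UTF-8 lines, which is why keeping a single piece of encoding state for both looks feasible in principle.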
|
|
msg202700 - (view) |
Author: STINNER Victor (vstinner) *  |
Date: 2013-11-12 15:47 |
> If the code is to be simplified, unifying the cases of string-based parsing and file-based parsing might be a worthwhile goal. Ah yes, if the enc and encoding attributes are almost the same, it would be nice to merge them! But I'm not sure that I understand: do you prefer to merge them in this issue or in a new issue? |
|
|