Issue 19519: Parser: don't transcode input string to UTF-8 if it is already encoded to UTF-8

Created on 2013-11-07 12:40 by vstinner, last changed 2022-04-11 14:57 by admin. This issue is now closed.

Files
File name	Uploaded	Description
input_is_utf8.patch	vstinner, 2013-11-07 12:56

Messages (6)
msg202331	Author: STINNER Victor (vstinner) *	Date: 2013-11-07 12:40
The Python parser (Parser/tokenizer.c) has a translate_into_utf8() function that decodes a string from the input encoding and re-encodes it as UTF-8. This step is unnecessary if the input string is already encoded as UTF-8, which is common nowadays: Linux, Mac OS X and many other operating systems now use UTF-8 as the default locale encoding, UTF-8 is the default encoding for Python scripts, etc. The compile(), eval() and exec() functions pass UTF-8 encoded strings to the parser. The attached patch adds an input_is_utf8 flag to the tokenizer to skip translate_into_utf8() if the input string is already encoded as UTF-8.
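The idea behind the flag can be sketched in Python rather than C. This is only an illustration under stated assumptions, not the attached patch: prepare_source() is a hypothetical helper, and the Python translate_into_utf8() below merely mimics what the C function of the same name is described as doing.

```python
import codecs


def translate_into_utf8(data: bytes, encoding: str) -> bytes:
    """Decode from the input encoding, then re-encode as UTF-8."""
    return data.decode(encoding).encode("utf-8")


def prepare_source(data: bytes, encoding: str) -> bytes:
    """Hypothetical helper: skip the round trip for UTF-8 input."""
    # codecs.lookup() normalises spellings such as "utf8" or "UTF-8".
    input_is_utf8 = codecs.lookup(encoding).name == "utf-8"
    if input_is_utf8:
        # Fast path: the bytes are returned untouched (and unvalidated).
        return data
    return translate_into_utf8(data, encoding)
```

Note that the fast path returns the bytes as they are, so nothing in this sketch verifies that they really are UTF-8.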
msg202334	Author: STINNER Victor (vstinner) *	Date: 2013-11-07 12:56
The patch has an issue: importing test.bad_coding2 (UTF-8 with a BOM) no longer raises a SyntaxError.
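The kind of check involved can be observed from pure Python with tokenize.detect_encoding(), which is documented to raise SyntaxError when a BOM and a coding cookie disagree. The sketch below uses a made-up input, not the actual contents of test.bad_coding2, and the helper names detect() and bom_conflicts() are hypothetical. It shows the behaviour of the tokenize module, not of the C tokenizer the patch touches.

```python
import io
import tokenize

BOM = b"\xef\xbb\xbf"  # UTF-8 byte order mark


def detect(source: bytes) -> str:
    """Return the encoding that tokenize detects for a bytes source."""
    encoding, _lines = tokenize.detect_encoding(io.BytesIO(source).readline)
    return encoding


def bom_conflicts(source: bytes) -> bool:
    """True if the BOM/cookie combination is rejected with SyntaxError."""
    try:
        detect(source)
    except SyntaxError:
        return True
    return False
```

With this sketch, a BOM followed by a "# coding: latin-1" line is rejected, while a BOM followed by "# coding: utf-8" is accepted.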
msg202339	Author: Serhiy Storchaka (serhiy.storchaka) *	Date: 2013-11-07 13:48
The parser should check that the input is actually valid UTF-8 data.
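A minimal sketch of such a check, written in Python for illustration: the helper name is_valid_utf8() is hypothetical, and the C tokenizer would need its own equivalent. It relies only on the strict UTF-8 codec, which rejects truncated sequences, overlong forms and encoded surrogates.

```python
def is_valid_utf8(data: bytes) -> bool:
    """Return True if data decodes cleanly as strict UTF-8."""
    try:
        data.decode("utf-8", errors="strict")
    except UnicodeDecodeError:
        return False
    return True
```

For example, the single byte 0xE9 (Latin-1 "é") fails this check, whereas the two-byte sequence 0xC3 0xA9 passes.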
msg202340	Author: STINNER Victor (vstinner) *	Date: 2013-11-07 14:03
> The parser should check that the input is actually valid UTF-8 data.

Ah yes, correct. It looks like the input data is still checked for valid UTF-8. I suppose that the byte strings should be decoded from UTF-8, because Python 3 manipulates Unicode strings, not byte strings.

The patch only skips calls to translate_into_utf8(str, tok->encoding); calls to translate_into_utf8(str, tok->enc) are unchanged (notice: encoding != enc :-)). But it looks like translate_into_utf8(str, tok->enc) is not called if tok->enc is NULL. If tok->encoding is "utf-8" and tok->enc is NULL, maybe the input string is not decoded from UTF-8. But that sounds strange, because Python uses Unicode strings.

Don't trust me; I would prefer an explanation from Benjamin, who knows the parser internals better than I do :-)
msg202346	Author: Martin v. Löwis (loewis) *	Date: 2013-11-07 15:44
tok->enc and tok->encoding should always have the same value, except that tok->enc gets set earlier. tok->enc is used when parsing from strings, to remember what codec to use. For file based parsing, the codec object created knows what encoding to use; for string-based parsing, tok->enc stores the encoding. If the code is to be simplified, unifying the cases of string-based parsing and file-based parsing might be a worthwhile goal.
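The two routes can be approximated from Python, though only loosely. In the sketch below, compile() on a bytes object stands in for string-based parsing, and tokenize.open(), which wraps the file in a decoding text stream, stands in for file-based parsing with a codec object. The helper names and the sample source are hypothetical, and neither function exercises tok->enc or tok->encoding directly.

```python
import os
import tempfile
import tokenize

# Hypothetical sample: a Latin-1 encoded source with a coding cookie.
SOURCE = b"# coding: latin-1\ns = '\xe9'\n"


def run_from_bytes(source: bytes) -> str:
    """String-based route: the parser decodes the bytes using the cookie."""
    namespace = {}
    exec(compile(source, "<string>", "exec"), namespace)
    return namespace["s"]


def run_from_file(source: bytes) -> str:
    """File-based route (approximation): a decoding stream reads the file."""
    with tempfile.NamedTemporaryFile(suffix=".py", delete=False) as tmp:
        tmp.write(source)
        path = tmp.name
    try:
        # tokenize.open() detects the encoding and returns a text stream.
        with tokenize.open(path) as stream:
            text = stream.read()
        namespace = {}
        exec(compile(text, path, "exec"), namespace)
        return namespace["s"]
    finally:
        os.unlink(path)
```

Both helpers should yield the same one-character string for the sample source, which is the behaviour a unified code path would have to preserve.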
msg202700	Author: STINNER Victor (vstinner) *	Date: 2013-11-12 15:47
> If the code is to be simplified, unifying the cases of string-based parsing and file-based parsing might be a worthwhile goal.

Ah yes, if the enc and encoding attributes are almost the same, it would be nice to merge them! But I'm not sure that I understand: do you prefer to merge them in this issue or in a new issue?

History
Date	User	Action	Args
2022-04-11 14:57:53	admin	set	github: 63718
2015-10-02 21:09:19	vstinner	set	status: open -> closed; resolution: out of date
2013-11-12 15:47:18	vstinner	set	messages: +
2013-11-07 15:44:39	loewis	set	nosy: + loewis; messages: +
2013-11-07 14:03:11	vstinner	set	messages: +
2013-11-07 13:48:31	serhiy.storchaka	set	messages: +
2013-11-07 12:56:52	vstinner	set	files: + input_is_utf8.patch; messages: +
2013-11-07 12:43:31	vstinner	set	files: - input_is_utf8.patch
2013-11-07 12:40:55	vstinner	create