msg256952
Author: 王杰 (王杰)
Date: 2015-12-24 03:49
|
|
I use CentOS 7.0 with LANG=gbk. I have a file "gbk-utf-8.py" whose encoding is GBK:

    # -*- coding:utf-8 -*-
    import chardet

    if __name__ == '__main__':
        s = '中文'
        print s, chardet.detect(s)

When I execute it, everything is ok. However, it raises a SyntaxError (as I expected) after I change "coding:utf-8" to "coding:utf8":

    File "gbk-utf8.py", line 2
    SyntaxError: 'utf8' codec can't decode byte 0xd6 in position 0: invalid continuation byte

Is this ok? Or where am I wrong?
|
|
|
|
msg257005
Author: Terry J. Reedy (terry.reedy)
Date: 2015-12-25 18:03
|
|
What Python version? |
|
|
|
|
msg257020
Author: 王杰 (王杰)
Date: 2015-12-26 08:57
|
|
Python 2.7 |
|
|
|
|
msg257023
Author: STINNER Victor (vstinner)
Date: 2015-12-26 10:59
|
|
Here is a fix with a patch. |
|
|
|
|
msg257024
Author: STINNER Victor (vstinner)
Date: 2015-12-26 11:00
|
|
> Here is a fix with a patch.

Oops, I meant 'with a unit test', sorry ;-)
|
|
|
|
msg257025
Author: STINNER Victor (vstinner)
Date: 2015-12-26 11:01
|
|
> I have a file "gbk-utf-8.py" whose encoding is GBK.

I don't understand why you use "# coding: utf-8" if the file is encoded as GBK. Why not use "# coding: gbk"?
|
|
|
|
msg257026
Author: Serhiy Storchaka (serhiy.storchaka)
Date: 2015-12-26 11:25
|
|
The problem is not that an error is raised with coding:utf8, but that it isn't raised with coding:utf-8. Here is an example with bad iso8859-3. An error is raised as expected. |
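A minimal sketch of such a reproduction (assuming Python 2.7; the file name is hypothetical, and byte 0xA5 is chosen because it is unassigned in iso8859-3):

    # Write a file that declares iso8859-3 but contains a byte that
    # encoding does not define.
    with open('bad-iso8859-3.py', 'wb') as f:
        f.write('# -*- coding: iso8859-3 -*-\n')
        f.write('s = "\xa5"\n')  # 0xa5 is undefined in iso8859-3

    # Running "python2.7 bad-iso8859-3.py" then fails with a SyntaxError,
    # because this encoding goes through the real codec, not a shortcut.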
|
|
|
|
msg257028
Author: 王杰 (王杰)
Date: 2015-12-26 12:27
|
|
I'm learning about Python's encoding rules, and I wrote this as a test case.
|
|
|
|
msg257047
Author: Marc-Andre Lemburg (lemburg)
Date: 2015-12-26 20:35
|
|
Please fold these cases into one:

    if (strcmp(buf, "utf-8") == 0 ||
        strncmp(buf, "utf-8-", 6) == 0)
        return "utf-8";
    else if (strcmp(buf, "utf8") == 0 ||
             strncmp(buf, "utf8-", 5) == 0)
        return "utf-8";

->

    if (strcmp(buf, "utf-8") == 0 ||
        strncmp(buf, "utf-8-", 6) == 0 ||
        strcmp(buf, "utf8") == 0 ||
        strncmp(buf, "utf8-", 5) == 0)
        return "utf-8";
|
msg257050
Author: STINNER Victor (vstinner)
Date: 2015-12-26 21:46
|
|
In Python, there are multiple implementations of the utf-8 codec with many shortcuts. I'm not surprised to see bugs depending on the exact syntax of the utf-8 codec name. Maybe we need to share even more code to normalize and compare codec names. (I think that py3 is better than py2 on this part.) |
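The normalization that the codec registry already performs can be seen from the stdlib (a sketch; the tokenizer shortcut bypasses this lookup entirely, which is how the spellings can diverge):

    import codecs

    # All of these spellings resolve to the same codec; CodecInfo.name
    # holds the canonical form.
    for name in ('utf-8', 'utf8', 'UTF_8', 'u8'):
        print name, '->', codecs.lookup(name).name
    # Each line prints "<spelling> -> utf-8".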
|
|
|
|
msg257051
Author: Marc-Andre Lemburg (lemburg)
Date: 2015-12-26 22:05
|
|
On 26.12.2015 22:46, STINNER Victor wrote:
> In Python, there are multiple implementations of the utf-8 codec with
> many shortcuts. I'm not surprised to see bugs depending on the exact
> syntax of the utf-8 codec name. Maybe we need to share even more code
> to normalize and compare codec names. (I think that py3 is better than
> py2 on this part.)

There's only one implementation (the one in unicodeobject.c), which is used directly or via the wrapper in the encodings package, but there are a few shortcuts to bypass the codec registry scattered around the code, since UTF-8 is such a commonly used codec.

In the case in question, the codec registry should trigger decoding via the encodings package (rather than going directly to the C APIs), so it will eventually end up using the same code. I wonder why this does not trigger the exception.
|
|
|
|
msg257060
Author: Serhiy Storchaka (serhiy.storchaka)
Date: 2015-12-27 01:05
|
|
> I wonder why this does not trigger the exception.

Because in the case of utf-8 and iso-8859-1 the decoding and encoding steps are omitted.

In the general case the input is decoded from the specified encoding and then encoded to UTF-8 for the parser. But for the utf-8 and iso-8859-1 encodings the parser gets the raw data.
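In rough Python pseudocode, the two paths look something like this (a simplified sketch, not the actual tokenizer code):

    def source_for_parser(raw, declared):
        # Shortcut: for these two encodings the raw bytes are trusted
        # and handed to the parser without any verification.
        if declared in ('utf-8', 'iso-8859-1'):
            return raw
        # General case: decoding raises on invalid input, so a wrong
        # coding cookie is caught here and surfaces as a SyntaxError.
        return raw.decode(declared).encode('utf-8')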
|
|
|
|
msg257074
Author: Marc-Andre Lemburg (lemburg)
Date: 2015-12-27 12:33
|
|
On 27.12.2015 02:05, Serhiy Storchaka wrote:
>> I wonder why this does not trigger the exception.
>
> Because in the case of utf-8 and iso-8859-1 the decoding and encoding
> steps are omitted.
>
> In the general case the input is decoded from the specified encoding
> and then encoded to UTF-8 for the parser. But for the utf-8 and
> iso-8859-1 encodings the parser gets the raw data.

Right, but since the tokenizer doesn't know about "utf8", it should reach out to the codec registry to get a properly encoded version of the source code (even though this is an unnecessary round-trip).

There are a few other aliases for UTF-8 which would likely trigger the same problem:

    # utf_8 codec
    'u8'        : 'utf_8',
    'utf'       : 'utf_8',
    'utf8'      : 'utf_8',
    'utf8_ucs2' : 'utf_8',
    'utf8_ucs4' : 'utf_8',
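These can be listed from the registry itself (a sketch; output as of Python 2.7):

    import encodings.aliases

    # Every alias that maps to the utf_8 codec module.
    for alias, target in sorted(encodings.aliases.aliases.items()):
        if target == 'utf_8':
            print alias  # u8, utf, utf8, utf8_ucs2, utf8_ucs4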
|
|
|
|
msg259990
Author: Serhiy Storchaka (serhiy.storchaka)
Date: 2016-02-10 07:54
|
|
I think the correct way is not to add "utf8" as a special case, but to remove "utf-8". Here is a patch.
|
|
|
|
msg260054
Author: Jim Jewett (Jim.Jewett)
Date: 2016-02-10 22:57
|
|
Does (did?) the utf8 special case allow for a much faster startup time, by not requiring all of the codecs machinery? |
|
|
|
|
msg260078
Author: Marc-Andre Lemburg (lemburg)
Date: 2016-02-11 08:16
|
|
Serhiy: Removing the shortcut would slow down the tokenizer a lot, since UTF-8 encoded source code is the norm, not the exception.

The "problem" here is that the tokenizer trusts the source code to be in the correct encoding when you use one of utf-8 or iso-8859-1, and then skips the usual "decode into unicode, then encode to utf-8" step.

From a purist point of view, you are right: Python should always pass through those steps to detect encoding errors. But from a practical point of view, I think the optimization is fine.
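The overhead being avoided can be estimated with a rough timing sketch (illustrative only; the real tokenizer cost will differ):

    import timeit

    SETUP = "data = 'x = 1  # comment\\n' * 10000"
    # With the shortcut: the raw bytes are used as-is.
    print timeit.timeit("data", setup=SETUP, number=1000)
    # Without it: an extra decode/encode round-trip on every compile.
    print timeit.timeit("data.decode('utf-8').encode('utf-8')",
                        setup=SETUP, number=1000)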
|
|
|
|