msg106129 - (view) |
Author: Dan Buch (meatballhat) |
Date: 2010-05-20 03:17 |
I noticed while running ``python3 -m tabnanny -v Lib/*.py`` that the process died at heapq.py. The 0xe7 byte in "François Pinard" (in the ``__about__`` attr) was the culprit. The attached patch replaces it with the escape '\xe7'. Changing the encoding cookie was not necessary to make it work, but seemed like a good idea at the time (I forget if it even matters... haven't worked much in py3k yet.) |
|
|
msg106134 - (view) |
Author: ysj.ray (ysj.ray) |
Date: 2010-05-20 09:08 |
This is a problem in the tabnanny module itself: it always reads the Python source file as text in the platform-dependent encoding, that is, it opens the file with the builtin "open()" and no encoding argument. It doesn't parse the encoding cookie at the beginning of the source file! So if a Python source file contains a character that cannot be decoded with that platform-dependent encoding, tabnanny fails when checking the file. This affects not only heapq.py but also several other standard modules. The platform-dependent encoding is chosen in the following order: 1. os.device_encoding(fd) 2. locale.getpreferredencoding() 3. ascii. I wonder why tabnanny works this way. Is this the intended behaviour? On my platform, if I use tabnanny to check a source file that contains some Chinese characters and is encoded in 'gbk', a UnicodeDecodeError is raised. If this is not the intended behaviour, I guess that to fix the problem we have to change the way tabnanny reads the source file, so that it works the way the Python compiler does: first open the file in "rb" mode, then detect the encoding with tokenize.detect_encoding(), then use the detected encoding to open the source file again in text mode. |
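The two-step approach proposed above can be sketched roughly as follows. This is only an illustration, not the patch that was eventually committed: the helper name `open_source` and the temporary demo file are assumptions made for the example, while `tokenize.detect_encoding()` is the real stdlib function discussed in this thread.

```python
import os
import tempfile
import tokenize


def open_source(path):
    """Hypothetical helper: open a Python source file in text mode
    using the encoding declared by its cookie (or BOM)."""
    # First pass: binary mode, so nothing is decoded yet.
    with open(path, "rb") as raw:
        encoding, _ = tokenize.detect_encoding(raw.readline)
    # Second pass: text mode with the detected encoding.
    return open(path, "r", encoding=encoding)


# Demo: a gbk-encoded source file with a matching encoding cookie.
source = "# -*- coding: gbk -*-\nname = '中文'\n"
with tempfile.NamedTemporaryFile("wb", suffix=".py", delete=False) as tmp:
    tmp.write(source.encode("gbk"))
try:
    with open_source(tmp.name) as f:
        text = f.read()
finally:
    os.remove(tmp.name)
```

With this approach the result no longer depends on the locale of the machine running the check, because the encoding comes from the file itself.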
|
|
msg106135 - (view) |
Author: ysj.ray (ysj.ray) |
Date: 2010-05-20 09:16 |
I added "tim_one" to the nosy list since I found this name in Misc/maintainers:tabnanny. Sorry if I did something improper. If this is really a problem, I'm glad to supply a patch for it. Thanks! |
|
|
msg106137 - (view) |
Author: Éric Araujo (eric.araujo) *  |
Date: 2010-05-20 11:04 |
PEP 8, section “encodings”, says that stdlib source code in 3.x should always use ASCII or UTF-8, without an encoding magic comment (since UTF-8 is the default now and ASCII is a subset of UTF-8); it explicitly mentions author names in comments or docstrings as the use case for UTF-8 bytes instead of escapes. tl;dr: Don’t mangle people’s names, fix tabnanny. |
|
|
msg106151 - (view) |
Author: Dan Buch (meatballhat) |
Date: 2010-05-20 13:38 |
Removed the patch because the fix should be made to tabnanny itself. |
|
|
msg106201 - (view) |
Author: Benjamin Peterson (benjamin.peterson) *  |
Date: 2010-05-20 22:29 |
The correct fix is to use tokenize.detect_encoding, if anyone wants to provide a patch. |
|
|
msg106202 - (view) |
Author: STINNER Victor (vstinner) *  |
Date: 2010-05-20 22:33 |
> The correct fix is to use tokenize.detect_encoding, > if anyone wants to provide a patch. Done :-) The attached patch opens the file in binary mode to call tokenize.detect_encoding() and then uses the detected encoding to open the file a second time (in text (unicode) mode). |
|
|
msg106203 - (view) |
Author: Benjamin Peterson (benjamin.peterson) *  |
Date: 2010-05-20 22:42 |
You should handle the case of encoding being None. |
|
|
msg106204 - (view) |
Author: STINNER Victor (vstinner) *  |
Date: 2010-05-20 23:02 |
> You should handle the case of encoding being None. detect_encoding() never returns None for the encoding. If there is no cookie, utf8 is returned by default. |
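The default described above can be checked with a short sketch. The `detect` helper is a hypothetical wrapper introduced only for this example; it feeds in-memory bytes to the real `tokenize.detect_encoding()`.

```python
import io
import tokenize


def detect(source_bytes):
    """Hypothetical wrapper: return the encoding name that
    tokenize.detect_encoding() reports for the given source bytes."""
    return tokenize.detect_encoding(io.BytesIO(source_bytes).readline)[0]


# No cookie and no BOM: the default is returned, never None.
no_cookie = detect(b"x = 1\n")

# An explicit cookie is honoured (and normalized to a canonical name).
with_cookie = detect(b"# -*- coding: latin-1 -*-\nx = 1\n")

# A UTF-8 BOM without a cookie is reported as 'utf-8-sig'.
with_bom = detect(b"\xef\xbb\xbfx = 1\n")
```

In all three cases the first element of the returned tuple is a string, so a caller can pass it straight to `open()` without a None check.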
|
|
msg106205 - (view) |
Author: Benjamin Peterson (benjamin.peterson) *  |
Date: 2010-05-20 23:12 |
>> You should handle the case of encoding being None. > detect_encoding() never returns None for the encoding. If there is no cookie, utf8 is returned by default. Ah, right. Looks OK then. |
|
|
msg106225 - (view) |
Author: STINNER Victor (vstinner) *  |
Date: 2010-05-21 10:53 |
Committed: r81393 (py3k), r81394 (3.1). |
|
|