msg175200 - (view) |
Author: STINNER Victor (vstinner) *  |
Date: 2012-11-08 22:52 |
Attached patch changes how support.TESTFN_UNDECODABLE is computed on UNIX: it uses the filesystem encoding in *strict* mode, without the surrogateescape error handler. So we can use support.TESTFN_UNDECODABLE to check whether a function correctly uses the surrogateescape error handler and/or whether it behaves correctly with non-ASCII characters. The patch also uses support.TESTFN_UNDECODABLE (only on UNIX) in test_cmd_line_script.test_non_ascii() to check that the fix for #16218 also works with the UTF-8 locale encoding. Please test the patch on UNIX, Windows and Mac OS X. We may also use support.TESTFN_UNDECODABLE in test_cmd_line_script.test_non_ascii() on Windows, I will check. Windows has some strange behaviour with undecodable characters: some of them are replaced by a character with a similar glyph. |
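A minimal sketch of the idea described above, not the code from the patch: a byte string counts as undecodable when decoding it with the filesystem encoding in strict mode raises UnicodeDecodeError. The helper name `find_undecodable` and the candidate list are assumptions made for illustration only.

```python
import sys

def find_undecodable(candidates, encoding=None):
    """Return the first candidate that fails strict decoding, or None.

    Hypothetical helper; the real patch may select the name differently.
    """
    if encoding is None:
        encoding = sys.getfilesystemencoding()
    for name in candidates:
        try:
            # 'strict' is the default error handler, so surrogateescape
            # is deliberately not involved here.
            name.decode(encoding)
        except UnicodeDecodeError:
            return name
    return None

# Illustrative candidates, not the list used by test.support.
candidates = [b'\xff', b'\x81', b'\xed\xb2\x80']
print(find_undecodable(candidates, 'ascii'))
print(find_undecodable(candidates, 'latin-1'))
print(find_undecodable(candidates))  # uses the filesystem encoding
```

Returning None when every candidate decodes lets a caller skip the test instead of failing on encodings such as ISO-8859-1 that accept every byte.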
|
|
msg175201 - (view) |
Author: STINNER Victor (vstinner) *  |
Date: 2012-11-08 22:53 |
The patch contains two print statements to help debug the patch itself; they must be removed later.
+print("TESTFN_UNDECODABLE = %a" % TESTFN_UNDECODABLE)
+print("TESTFN_NONASCII = %a" % TESTFN_NONASCII) |
|
|
msg175202 - (view) |
Author: STINNER Victor (vstinner) *  |
Date: 2012-11-08 23:04 |
> We may also use support.TESTFN_UNDECODABLE
> in test_cmd_line_script.test_non_ascii() on Windows
Oh, subprocess doesn't support passing bytes arguments to a program anymore (since Python 3.0). http://bugs.python.org/issue4036#msg100376 So it's better to use TESTFN_NONASCII instead for this test ;-) It confirms that we need two constants depending on the context. It depends on the platform and how the data is read/written: sometimes undecodable characters are supported on any platform (ex: base64 encoder), sometimes undecodable characters are not supported (ex: distutils expects valid metadata), sometimes it depends on the platform (ex: this test). |
|
|
msg175209 - (view) |
Author: STINNER Victor (vstinner) *  |
Date: 2012-11-08 23:50 |
> Please test the patch on UNIX, Windows and Mac OS X.
The full test suite passes on:
* Linux with UTF-8 locale encoding
* Linux with ASCII locale encoding
* Windows with cp932 ANSI code page
* Mac OS 10.8 with ASCII locale encoding (and utf-8/surrogateescape for the filesystem encoding)
($LANG, $LC_ALL, $LC_CTYPE are not set) |
|
|
msg175221 - (view) |
Author: Serhiy Storchaka (serhiy.storchaka) *  |
Date: 2012-11-09 11:00 |
Try b'\x81', b'\x98', b'\xae', b'\xd5', b'\xff'. They are undecodable in all 1-byte encodings.
b'\x81' : shift_jis_2004 shift_jis shift_jisx0213 cp869 cp874 cp932 cp1250 cp1252 cp1253 cp1254 cp1255 cp1257 cp1258
b'\x98' : shift_jis_2004 shift_jis shift_jisx0213 cp874 cp932 cp1250 cp1251 cp1253 cp1257
b'\xae' : iso8859-3 iso8859-6 iso8859-7 cp424
b'\xd5' : iso8859-8 cp856 cp857
b'\xff' : hp-roman8 iso8859-6 iso8859-7 iso8859-8 iso8859-11 shift_jis_2004 shift_jis shift_jisx0213 tis-620 cp864 cp874 cp1253 cp1255 |
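A table like the one above can be rebuilt with a short loop. This is only a sketch: the helper name `undecodable_in` and the short encoding list are assumptions, and the result reflects the codecs bundled with the running Python, not necessarily what the operating system does.

```python
def undecodable_in(data, encodings):
    """Return the encodings in which `data` fails strict decoding."""
    failing = []
    for encoding in encodings:
        try:
            data.decode(encoding)
        except UnicodeDecodeError:
            failing.append(encoding)
    return failing

# A small illustrative subset; extend it with any codec names of interest.
encodings = ['latin-1', 'cp1251', 'cp1252', 'iso8859-7', 'iso8859-8']
for data in (b'\x81', b'\x98', b'\xae', b'\xd5', b'\xff'):
    print(data, ':', ' '.join(undecodable_in(data, encodings)))
```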
|
|
msg175222 - (view) |
Author: Serhiy Storchaka (serhiy.storchaka) *  |
Date: 2012-11-09 11:09 |
Try b'\xed\xb2\x80' and b'\xed\xb4\x80' for UTF-8 (on Unix and Mac OS X). b'\xed\xb2\x80' is b'\x80'.decode('utf-8', 'surrogateescape').encode('utf-8', 'surrogatepass'). b'\xed\xb4\x80' is '\udd00'.encode('utf-8', 'surrogatepass') and '\udd00' can't be encoded with surrogateescape. |
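The two byte sequences above can be checked directly. The sketch below only uses the standard codec error handlers; the helper names `is_strictly_decodable` and `can_surrogateescape` are made up for this example.

```python
# b'\x80' is not valid UTF-8; surrogateescape maps it to the lone surrogate U+DC80.
escaped = b'\x80'.decode('utf-8', 'surrogateescape')

# surrogatepass encodes a lone surrogate as if it were an ordinary code point.
first = escaped.encode('utf-8', 'surrogatepass')
second = '\udd00'.encode('utf-8', 'surrogatepass')

def is_strictly_decodable(data, encoding='utf-8'):
    """True if `data` decodes with the default (strict) error handler."""
    try:
        data.decode(encoding)
    except UnicodeDecodeError:
        return False
    return True

def can_surrogateescape(text, encoding='utf-8'):
    """True if `text` can be encoded with the surrogateescape handler."""
    try:
        text.encode(encoding, 'surrogateescape')
    except UnicodeEncodeError:
        return False
    return True

print(first, second)
print(is_strictly_decodable(first), is_strictly_decodable(second))
print(can_surrogateescape(escaped), can_surrogateescape('\udd00'))
```

surrogateescape only covers U+DC80 to U+DCFF, the range that stands for the raw bytes 0x80 to 0xFF, which is why '\udd00' is rejected while '\udc80' is accepted.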
|
|
msg175223 - (view) |
Author: Serhiy Storchaka (serhiy.storchaka) *  |
Date: 2012-11-09 11:14 |
> The full test suite passes on:
The matter is not only in the fact that tests passed. They should fail if the original bug occurs again. Have you tried to restore the bugs? |
|
|
msg175271 - (view) |
Author: STINNER Victor (vstinner) *  |
Date: 2012-11-10 10:50 |
> The matter is not only in the fact that tests passed.
Right, but I don't want to introduce a regression :-)
> They should fail if the original bug occurs again. Have you tried to restore the bugs?
test_cmd_line_script.test_non_ascii() comes from the issue #16218, changeset 23ebe277e982. I checked this issue: support_undecodable.patch checks for non-regression with UTF-8 (and ASCII and ISO-8859-1) locale encoding on UNIX. test_genericpath.test_non_ascii() comes from the issue #3426, this fix comes from the issue #3187, changeset 8a7c930abab6. I don't want to spend time on trying the new test on this issue because 8a7c930abab6 is a major change, and I don't see how to revert it just to test the issue. I consider the issue as fixed, and the new test should not reduce the test coverage, but just increase it ;-) |
|
|
msg175272 - (view) |
Author: Roundup Robot (python-dev)  |
Date: 2012-11-10 11:07 |
New changeset 6b8a8bc6ba9c by Victor Stinner in branch 'default': Issue #16444, #16218: Use TESTFN_UNDECODABLE on UNIX http://hg.python.org/cpython/rev/6b8a8bc6ba9c |
|
|
msg175275 - (view) |
Author: Serhiy Storchaka (serhiy.storchaka) *  |
Date: 2012-11-10 12:21 |
TESTFN_UNDECODABLE is not detected for cp1250, cp1251, cp1252, cp1254, cp1257 and cp1258. Just add b'\x81\x98\xae\xd5\xff': at least one of these bytes is undecodable in any encoding which has undecodable bytes at all. |
|
|
msg175291 - (view) |
Author: Antoine Pitrou (pitrou) *  |
Date: 2012-11-10 18:24 |
I suppose you noticed you broke a bunch of buildbots :) |
|
|
msg175296 - (view) |
Author: Roundup Robot (python-dev)  |
Date: 2012-11-10 21:31 |
New changeset 398f8770bf0d by Victor Stinner in branch 'default': Issue #16444: disable undecodable characters in test_non_ascii() test until http://hg.python.org/cpython/rev/398f8770bf0d |
|
|
msg175396 - (view) |
Author: STINNER Victor (vstinner) *  |
Date: 2012-11-11 21:51 |
> TESTFN_UNDECODABLE is not detected for cp1250, cp1251, cp1252, cp1254, cp1257 and cp1258. The Python encoding and the real codec used by Windows are different: Python fails to decode bytes 0x80-0x9f, but Windows does decode them. I prefer to avoid these bytes to not rely too much on the Python codec. |
|
|
msg175399 - (view) |
Author: Serhiy Storchaka (serhiy.storchaka) *  |
Date: 2012-11-11 22:08 |
These encodings are used not only on Windows. |
|
|
msg175402 - (view) |
Author: STINNER Victor (vstinner) *  |
Date: 2012-11-11 22:15 |
> I suppose you noticed you broke a bunch of buildbots :)
Failures occur on FreeBSD, OpenIndiana and some other buildbots which don't set a locale and so use the "C" locale. main() decodes command line arguments from the locale encoding using _Py_char2wchar(). On these OSes, the "C" locale uses the ISO-8859-1 encoding, but the problem is that nl_langinfo(CODESET) announces ASCII :-/ test_cmd_line.test_undecodable_code() handles this case. Extract of a comment:
# _Py_char2wchar() decoded b'\xff' as '\xff' even if the locale is
# C and the locale encoding is ASCII. It occurs on FreeBSD, Solaris
# and Mac OS X.
Mac OS X is now using UTF-8 to decode the command line arguments. I just created the issue #16455 to fix FreeBSD and OpenIndiana. I propose to close this issue because I consider it as fixed (#16455 will reenable TESTFN_UNDECODABLE in test_cmd_line_script). |
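To see what a given machine announces, something like the following can be run under the "C" locale. It is only a sketch: the function name `locale_encoding_report` is hypothetical, and the script shows the announced encodings without reproducing the _Py_char2wchar() mismatch itself.

```python
import locale
import sys

def locale_encoding_report():
    """Collect the encodings Python reports for the current process."""
    report = {
        'filesystem': sys.getfilesystemencoding(),
        'preferred': locale.getpreferredencoding(False),
    }
    # nl_langinfo() is not available on every platform (e.g. Windows).
    if hasattr(locale, 'nl_langinfo'):
        report['nl_langinfo(CODESET)'] = locale.nl_langinfo(locale.CODESET)
    return report

for key, value in locale_encoding_report().items():
    print('%s = %s' % (key, value))
```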
|
|
msg175406 - (view) |
Author: STINNER Victor (vstinner) *  |
Date: 2012-11-11 23:12 |
> These encodings are used not only on Windows.
You can use cpXXX encodings explicitly to read or write a file, but these encodings are not used for sys.getfilesystemencoding() (or sys.stdout.encoding). |
|
|
msg175413 - (view) |
Author: Roundup Robot (python-dev)  |
Date: 2012-11-12 00:24 |
New changeset 6017f09ead53 by Victor Stinner in branch '3.3': Issue #16218, #16444: Backport improvment on tests for non-ASCII characters http://hg.python.org/cpython/rev/6017f09ead53 |
|
|
msg175423 - (view) |
Author: Serhiy Storchaka (serhiy.storchaka) *  |
Date: 2012-11-12 08:05 |
> You can use cpXXX encodings explicitly to read or write a file, but these
> encodings are not used for sys.getfilesystemencoding() (or
> sys.stdout.encoding).
At least CP1251 was used for many Cyrillic locales in the pre-UTF-8 era (I still use it sometimes). Even now CP1251 is the default encoding for Byelorussian and Bulgarian:
$ grep CP /usr/share/i18n/SUPPORTED
be_BY CP1251
bg_BG CP1251
ru_RU.CP1251 CP1251
yi_US CP1255 |
|
|
msg176893 - (view) |
Author: Serhiy Storchaka (serhiy.storchaka) *  |
Date: 2012-12-04 10:40 |
Ping. |
|
|
msg176955 - (view) |
Author: Roundup Robot (python-dev)  |
Date: 2012-12-04 20:42 |
New changeset ed0ff4b3d1c4 by Victor Stinner in branch 'default': Issue #16444: test more bytes in support.TESTFN_UNDECODABLE to support more Windows code pages http://hg.python.org/cpython/rev/ed0ff4b3d1c4 |
|
|
msg176958 - (view) |
Author: STINNER Victor (vstinner) *  |
Date: 2012-12-04 20:53 |
Ooook, all remaining issues about undecodable bytes should now be fixed (until someone opens a new one? :-)) |
|
|
msg178868 - (view) |
Author: Roundup Robot (python-dev)  |
Date: 2013-01-03 00:59 |
New changeset 41658a4fb3cc by Victor Stinner in branch '3.2': Issue #16218, #16414, #16444: Backport FS_NONASCII, TESTFN_UNDECODABLE, http://hg.python.org/cpython/rev/41658a4fb3cc New changeset 4d40c1ce8566 by Victor Stinner in branch '3.3': (Merge 3.2) Issue #16218, #16414, #16444: Backport FS_NONASCII, http://hg.python.org/cpython/rev/4d40c1ce8566 |
|
|