Issue 16258: test_local.TestEnUSCollection failures on Solaris 10
Created on 2012-10-17 02:19 by trent, last changed 2022-04-11 14:57 by admin.
Messages (17) | ||
---|---|---|
msg173124 - (view) | Author: Trent Nelson (trent) * |
Date: 2012-10-17 02:19 |
======================================================================
ERROR: test_strxfrm (test.test_locale.TestEnUSCollation)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/cpython/buildslave/3.x.snakebite-solaris10-u10ga2-sparc64-1/build/Lib/test/test_locale.py", line 346, in test_strxfrm
    self.assertLess(locale.strxfrm('a'), locale.strxfrm('b'))
ValueError: character U+101010e is not in range [U+0000; U+10ffff]

======================================================================
ERROR: test_strxfrm_with_diacritic (test.test_locale.TestEnUSCollation)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/cpython/buildslave/3.x.snakebite-solaris10-u10ga2-sparc64-1/build/Lib/test/test_locale.py", line 367, in test_strxfrm_with_diacritic
    self.assertLess(locale.strxfrm('à'), locale.strxfrm('b'))
ValueError: character U+101010e is not in range [U+0000; U+10ffff]

----------------------------------------------------------------------

Haven't investigated yet. | ||
msg173164 - (view) | Author: Trent Nelson (trent) * |
Date: 2012-10-17 12:56 |
With the caveat that I know absolutely nothing about locales, here's what I've been able to reduce the problem down to:

zinc (alias s11, Solaris 11 x64):

>>> locale.setlocale(locale.LC_ALL, 'C')
'C'
>>> locale.strxfrm('a')
'a'
>>> locale.setlocale(locale.LC_ALL, 'en_US.UTF-8')
'en_US.UTF-8'
>>> locale.strxfrm('a')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ValueError: character U+10105a3 is not in range [U+0000; U+10ffff]
>>>

nitrogen (alias s10, Solaris 10 SPARC):

>>> locale.setlocale(locale.LC_ALL, 'en_US.UTF-8')
'en_US.UTF-8'
>>> locale.strxfrm('a')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ValueError: character U+101010e is not in range [U+0000; U+10ffff]

Not sure how relevant it is, but on both those Solaris boxes, locale.LC_ALL returns 6, whereas on BSD and OS X it always seems to return 0. | ||
msg173166 - (view) | Author: Jesús Cea Avión (jcea) * |
Date: 2012-10-17 13:02 |
I can reproduce this on my x86 Solaris 10 update 10. | ||
msg173167 - (view) | Author: Antoine Pitrou (pitrou) * |
Date: 2012-10-17 13:03 |
With the system Python on s10:

Python 2.6.8 (unknown, Apr 13 2012, 17:08:12) [C] on sunos5
Type "help", "copyright", "credits" or "license" for more information.
>>> import locale
>>> locale.strxfrm('a')
'a'
>>> locale.setlocale(locale.LC_ALL, 'en_US.UTF-8')
'en_US.UTF-8'
>>> locale.strxfrm('a')
'\x01\x01\x01\x0e\x01\x01\x01\x01\x01\x01\x01\x02\x01\x01\x0fi\x01\x01\x01\x01'
>>> locale.strxfrm('a').decode('utf-8')
u'\x01\x01\x01\x0e\x01\x01\x01\x01\x01\x01\x01\x02\x01\x01\x0fi\x01\x01\x01\x01'

The difference between Python 2 and Python 3 is that Python 3 uses wcsxfrm, not strxfrm. Apparently Solaris' wcsxfrm is some broken thing that returns the same thing as strxfrm, cast to a wchar_t *, hence the character U+101010e (corresponding to the '\x01\x01\x01\x0e' bytestring above). | ||
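A minimal sketch of the reinterpretation described above, assuming a 4-byte big-endian wchar_t (as on SPARC). The variable names are illustrative only. It regroups the bytes that Python 2's strxfrm() returned into 32-bit units and checks the first unit against the Unicode range that Python 3.3 enforces.

```python
import struct

# Byte string returned by Python 2's locale.strxfrm('a') in the session above.
raw = b"\x01\x01\x01\x0e\x01\x01\x01\x01\x01\x01\x01\x02\x01\x01\x0fi\x01\x01\x01\x01"

# Assumption: wchar_t is 4 bytes wide and big-endian (SPARC), so every
# group of four bytes is read back as one "character".
units = struct.unpack(">%dI" % (len(raw) // 4), raw)
first = units[0]
print(hex(first))  # 0x101010e, the value named in the ValueError

# The value is above U+10FFFF, so it cannot be stored in a str.
try:
    chr(first)
except ValueError as exc:
    print("rejected:", exc)
```

The other units are out of range in the same way, so the failure does not depend on which character is transformed first.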
msg173168 - (view) | Author: Jesús Cea Avión (jcea) * |
Date: 2012-10-17 13:05 |
BTW, this works in Python 3.2:

x86, 32 bit python, Solaris 10 update 10:

"""
Python 3.2.3 (default, Apr 12 2012, 13:29:13)
[GCC 4.7.0] on sunos5
Type "help", "copyright", "credits" or "license" for more information.
>>> import locale
>>> locale.setlocale(locale.LC_ALL, 'en_US.UTF-8')
'en_US.UTF-8'
>>> locale.strxfrm('a')
'���\U00010f69�'
""" | ||
msg173171 - (view) | Author: Antoine Pitrou (pitrou) * |
Date: 2012-10-17 13:34 |
It only works on Python 3.2 because PyUnicode_FromWideChar is more permissive, it seems. The first character in the wchar_t string returned by Solaris is still 0x101010e. | ||
msg173172 - (view) | Author: Antoine Pitrou (pitrou) * |
Date: 2012-10-17 13:44 |
(by the way, I also tried a memset() before calling wcsxfrm(): no change) | ||
msg173199 - (view) | Author: STINNER Victor (vstinner) * |
Date: 2012-10-17 19:28 |
Python 3.2 rejects characters outside the range U+0000-U+10ffff in some operations, but not everywhere. I fixed Python 3.3 to be more strict and always reject characters outside this range.

I noticed the Solaris issue with mbstowcs() on locale encodings other than UTF-8: #13441.

I asked on python-dev whether it's more important to be strict on Unicode, or whether we need to handle the wcsxfrm() issue: http://mail.python.org/pipermail/python-dev/2011-December/114759.html

Stefan Krah answered: "Yes, if the cause is a broken mbstowcs() that sounds good." http://mail.python.org/pipermail/python-dev/2011-December/114781.html

I asked for help on the OpenIndiana IRC channel, but nobody had a locale encoding other than UTF-8. I didn't have access to a Solaris box, so I chose to skip the failing tests on Solaris. My commit 2a2d0872d993 (and 7ffe3d304487) skips many locales to work around this issue in test__locale. | ||
msg289382 - (view) | Author: Serhiy Storchaka (serhiy.storchaka) * |
Date: 2017-03-10 15:51 |
May be is related to this issue. Is this issue still reproducible? | ||
msg296414 - (view) | Author: Peter (petriborg) | Date: 2017-06-20 12:16 |
I'm getting the same 2 errors in Python 3.4.6 on Solaris 11. Comes up when you run 'gmake test' or ./python -W default -bb -E -W error::BytesWarning -m test -r -w -j 0 -v test_locale.py | ||
msg296415 - (view) | Author: STINNER Victor (vstinner) * |
Date: 2017-06-20 12:23 |
A solution for that would be to return the raw byte string or a list of integers, rather than a Unicode string. I don't think that the locale.strxfrm() result is supposed to be displayed in a terminal; it should only be used to compare two strings, or as a key function for list.sort(), for example. | ||
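A short sketch of the key-function use mentioned above. The helper name `locale_sorted` and its default locale name are illustrative assumptions, not part of the locale module. Note that it changes the process-wide LC_COLLATE setting, and that on an affected Solaris build the strxfrm() call itself would raise the ValueError reported in this issue.

```python
import locale


def locale_sorted(words, locale_name="en_US.UTF-8"):
    """Sort words using locale.strxfrm() as the key function.

    locale_name is an illustrative default. If that locale is not
    installed, the currently active locale is used instead.
    """
    try:
        locale.setlocale(locale.LC_COLLATE, locale_name)
    except locale.Error:
        pass  # keep whatever locale is currently active
    # The transformed strings are only compared with each other,
    # never displayed.
    return sorted(words, key=locale.strxfrm)


print(locale_sorted(["b", "à", "a"]))
```

The resulting order depends on the collation rules of the locale that ends up active, so no particular output is assumed here.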
msg296416 - (view) | Author: STINNER Victor (vstinner) * |
Date: 2017-06-20 12:26 |
Currently, the function is documented to return a string: https://docs.python.org/dev/library/locale.html#locale.strxfrm "Transforms a string to one that can be used in locale-aware comparisons." The problem is that we don't have enough developers who care about Solaris/illumos to fix these issues (propose patches). test_locale is just *one* example. The curses module has been broken for years on Solaris, if I recall correctly... | ||
msg296418 - (view) | Author: Serhiy Storchaka (serhiy.storchaka) * |
Date: 2017-06-20 12:47 |
It is possible to use the special "encoding" for transformed strings on platforms with broken wcsxfrm(). All codes < 0x10000 are not changed. Codes >= 0x10000 are encoded as a pair: 0x10000 + (code >> 16), code & 0xffff. | ||
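A rough sketch of the pair encoding proposed above, written in Python for illustration only; an actual fix would live in the C implementation of locale.strxfrm(). The function name `encode_transformed` and the 32-bit bound on the input values are assumptions.

```python
def encode_transformed(codes):
    """Map raw wchar_t values (ints) to a str that Python accepts.

    Values below 0x10000 are kept as-is. Larger values become a pair:
    0x10000 + (code >> 16), then code & 0xffff.
    Assumption: every value fits in an unsigned 32-bit wchar_t.
    """
    out = []
    for code in codes:
        if not 0 <= code <= 0xFFFFFFFF:
            raise ValueError("expected an unsigned 32-bit value")
        if code < 0x10000:
            out.append(chr(code))
        else:
            # First element is in 0x10000..0x1ffff, so it stays below
            # U+10FFFF and marks the start of a pair.
            out.append(chr(0x10000 + (code >> 16)))
            out.append(chr(code & 0xFFFF))
    return "".join(out)


key = encode_transformed([0x101010E])
print([hex(ord(ch)) for ch in key])  # ['0x10101', '0x10e']
```

Because the first element of a pair is always at least 0x10000 and the high half is compared before the low half, comparing two encoded strings gives the same result as comparing the original sequences of values.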
msg296435 - (view) | Author: STINNER Victor (vstinner) * |
Date: 2017-06-20 14:20 |
> It is possible to use the special "encoding" for transformed strings on platforms with broken wcsxfrm().

I wouldn't say that the function is wrong. wchar_t is 32 bits long, so the function is free to use numbers > 0x10ffff. It's more a Python limitation, no? | ||
msg296440 - (view) | Author: Serhiy Storchaka (serhiy.storchaka) * |
Date: 2017-06-20 14:36 |
Agree, it's more a Python limitation. | ||
msg296441 - (view) | Author: STINNER Victor (vstinner) * |
Date: 2017-06-20 14:38 |
> Agree, it's more a Python limitation.

What do you think of changing locale.strxfrm() from str to bytes or tuple? I prefer a tuple. But again, I'm not super motivated by this change. IMHO there are more severe issues that should be fixed in Solaris. | ||
msg296445 - (view) | Author: Serhiy Storchaka (serhiy.storchaka) * |
Date: 2017-06-20 14:54 |
This will change the documented behavior. Even if we allow this change in a new feature release, it can't be made in maintained releases. A tuple of integers uses excessive memory and is slow. A bytes object is more compact (but may be less compact than a string) and faster. But on a little-endian platform every wchar_t would have to be converted to big-endian so that comparison of bytes objects gives the right order. |
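A small sketch of the byte-order point made above, assuming unsigned 4-byte wchar_t values. The helper name `bytes_key` is hypothetical. Packing each value big-endian makes bytes comparison agree with numeric comparison, whereas the native little-endian layout does not.

```python
import struct


def bytes_key(codes):
    """Pack 32-bit values big-endian so that comparing the resulting
    bytes objects matches comparing the values numerically."""
    return struct.pack(">%dI" % len(codes), *codes)


small, large = 0x0002, 0x0100

# Big-endian: the most significant byte comes first, so order is kept.
print(bytes_key([small]) < bytes_key([large]))  # True

# Little-endian: the least significant byte comes first, so the
# comparison is decided by the wrong byte.
print(struct.pack("<I", small) < struct.pack("<I", large))  # False
```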
History | |||
---|---|---|---|
Date | User | Action | Args |
2022-04-11 14:57:37 | admin | set | github: 60462 |
2017-06-20 14:54:41 | serhiy.storchaka | set | messages: + |
2017-06-20 14:38:53 | vstinner | set | messages: + |
2017-06-20 14:36:13 | serhiy.storchaka | set | messages: + |
2017-06-20 14:20:32 | vstinner | set | messages: + |
2017-06-20 12:48:36 | serhiy.storchaka | set | components: + Extension Modules, - Interpreter Core |
2017-06-20 12:47:47 | serhiy.storchaka | set | type: behavior; messages: + ; components: + Interpreter Core; versions: + Python 3.5, Python 3.6, Python 3.7, - Python 3.3, Python 3.4 |
2017-06-20 12:26:30 | pitrou | set | nosy: - pitrou |
2017-06-20 12:26:12 | vstinner | set | messages: + |
2017-06-20 12:23:29 | vstinner | set | messages: + |
2017-06-20 12:16:35 | petriborg | set | nosy: + petriborg; messages: + |
2017-03-10 15:51:28 | serhiy.storchaka | set | nosy: + serhiy.storchaka; messages: + |
2012-10-17 19:28:28 | vstinner | set | messages: + |
2012-10-17 14:36:26 | jcea | set | nosy: + vstinner |
2012-10-17 14:35:41 | jcea | link | issue13441 superseder |
2012-10-17 13:44:36 | pitrou | set | messages: + |
2012-10-17 13:34:00 | pitrou | set | messages: + |
2012-10-17 13:05:36 | jcea | set | keywords: + 3.3regression; messages: + |
2012-10-17 13:03:20 | pitrou | set | nosy: + loewis, pitrou; messages: + |
2012-10-17 13:02:59 | jcea | set | messages: + |
2012-10-17 12:56:34 | trent | set | messages: + |
2012-10-17 03:08:51 | jcea | set | nosy: + jcea |
2012-10-17 02:19:55 | trent | create |