[Python-Dev] Reject characters bigger than U+10FFFF and Solaris issues
Victor Stinner victor.stinner at haypocalc.com
Thu Dec 8 02:43:40 CET 2011
Hi,
I would like to deny the creation of a Unicode string containing characters outside the range [U+0000; U+10FFFF]. The check is already present in some places (e.g. the builtin chr() function), but not everywhere. The last important function is PyUnicode_FromWideChar, the function used to decode text from the OS.
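To make the intended check concrete, here is a minimal sketch in Python. It is not the actual C code of PyUnicode_FromWideChar; the helper name str_from_code_points is hypothetical and only illustrates the range test on a sequence of integer code points.

```python
MAX_CODE_POINT = 0x10FFFF  # upper bound of the Unicode codespace


def str_from_code_points(code_points):
    """Build a str from integer code points (hypothetical helper).

    Any value outside [U+0000; U+10FFFF] is rejected with ValueError,
    which is also what the builtin chr() does.
    """
    code_points = list(code_points)
    for cp in code_points:
        if not 0 <= cp <= MAX_CODE_POINT:
            raise ValueError(
                "character U+%X is not in range [U+0000; U+10FFFF]" % cp)
    return "".join(map(chr, code_points))
```

With this check, str_from_code_points([0x41, 0x10FFFF]) succeeds, whereas a value such as 0x30000020 raises ValueError.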
The problem is that test_locale fails on Solaris with such checks. I would like to know how to handle the Solaris issues. One possible solution is to not work around them at all: just raise exceptions and skip the failing tests on Solaris ;-) Another solution is to modify locale.strxfrm() on all platforms to return a list of int instead of a str. The type of the result is not really important, we just have to be able to compare two results (equal, greater than, less than or equal, etc.). Another solution?
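As a rough illustration of the "list of int" idea: Python lists of integers already compare lexicographically, so they can serve as collation keys even when the values are above U+10FFFF. The key values below are made up for the example (shaped like the Solaris wcsxfrm() output), and nothing here is the real locale.strxfrm() API.

```python
# Hypothetical collation keys; the values are above 0x10FFFF, so they
# could not be stored in a str, but they compare fine as lists of int.
key_a = [0x1010163, 0x1010101, 0x1010103]
key_b = [0x1010164, 0x1010101]
key_c = [0x1010163, 0x1010101]

# Lexicographic comparison: element by element, a shorter prefix sorts first.
assert key_c < key_a < key_b
assert key_a == list(key_a)

# The keys can be sorted directly, which is all a strxfrm() result is for.
sorted_keys = sorted([key_b, key_a, key_c])
```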
--
The two Solaris issues:
- in the hu_HU locale, localeconv() returns U+30000020 for the thousands separator
- locale.strxfrm() calls wcsxfrm() which returns characters in the range [0x1000000; 0x1FFFFFF]
For localeconv(), the value is the b'\xA0' byte string decoded from an encoding looking like ISO-8859-?? (b'\xA0' is not decodable from UTF-8). It looks like a bug in the decoder. It also looks like OpenIndiana doesn't use ISO-8859 locales anymore, only UTF-8 locales (which is much better!). I'm unable to reproduce the issue on my OpenIndiana VM.
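The decoding side of this can be checked from Python. The sketch below assumes ISO-8859-2 purely for illustration (the exact ISO-8859 variant used by the locale is not known here); it only shows that b'\xA0' is rejected by UTF-8 and decodes to a character well inside the valid range with an ISO-8859 codec, unlike the value returned on Solaris.

```python
raw = b"\xA0"

# b'\xA0' alone is not valid UTF-8.
try:
    raw.decode("utf-8")
    utf8_ok = True
except UnicodeDecodeError:
    utf8_ok = False

# Assumption: ISO-8859-2, chosen only as an example of an ISO-8859 encoding.
decoded = raw.decode("iso8859-2")
code_point = ord(decoded)          # U+00A0, NO-BREAK SPACE

# The value reported by localeconv() on Solaris is far outside the range.
solaris_value = 0x30000020
in_range = solaris_value <= 0x10FFFF
```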
For wcsxfrm(), I'm not sure of the range. Example of a result: {0x1010163, 0x1010101, 0x1010103, 0x1010101, 0x1010103, 0x1010101, 0x1010101}. It looks like wcsxfrm() uses the result of strxfrm(), grouping the bytes 3 by 3 and adding 0x1000000 to each group. Example of strxfrm() output for the same input: {0x01, 0x01, 0x63, 0x01, 0x01, 0x01, ...}.
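The guessed relation between the two outputs can be written down as a small sketch. This is only my reading of the example above, not the Solaris implementation: the function name is hypothetical, the big-endian byte order is inferred from the first two groups, and rejecting a trailing partial group is an arbitrary choice.

```python
def group_strxfrm_bytes(data, offset=0x1000000):
    """Pack bytes 3 by 3 (big-endian) and add offset to each group.

    Hypothetical reconstruction of how wcsxfrm() might derive its output
    from the strxfrm() bytes on Solaris.
    """
    if len(data) % 3:
        # Assumption: how a trailing partial group is handled is unknown.
        raise ValueError("length must be a multiple of 3")
    return [offset + int.from_bytes(data[i:i + 3], "big")
            for i in range(0, len(data), 3)]


# First six bytes of the strxfrm() example above.
groups = group_strxfrm_bytes(bytes([0x01, 0x01, 0x63, 0x01, 0x01, 0x01]))
```

For these six bytes the result is [0x1010163, 0x1010101], which matches the first two values of the wcsxfrm() example.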
See http://bugs.python.org/issue13441 for more information.
Victor