[Python-Dev] Reject characters bigger than U+10FFFF and Solaris issues

Victor Stinner victor.stinner at haypocalc.com
Thu Dec 8 02:43:40 CET 2011


Hi,

I would like to forbid the creation of a Unicode string containing characters outside the range [U+0000; U+10FFFF]. The check is already present in some places (e.g. the builtin chr() function), but not everywhere. The last important function without it is PyUnicode_FromWideChar, the function used to decode text from the OS.
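To illustrate the kind of check being discussed, here is a minimal Python sketch. The helper names check_code_point() and chr_is_rejected() are hypothetical and are not part of CPython; the only real behaviour relied on is that the builtin chr() raises ValueError for values above 0x10FFFF.

```python
MAX_CODE_POINT = 0x10FFFF  # upper bound of the range [U+0000; U+10FFFF]


def check_code_point(value):
    """Hypothetical helper: return value, or raise ValueError if out of range."""
    if not 0 <= value <= MAX_CODE_POINT:
        raise ValueError(
            "character U+%X is not in range [U+0000; U+10FFFF]" % value)
    return value


def chr_is_rejected(value):
    """Return True if the builtin chr() refuses the given value."""
    try:
        chr(value)
    except ValueError:
        return True
    return False
```

With these helpers, check_code_point(0x10FFFF) passes, while check_code_point(0x110000) raises ValueError, matching what chr() already does.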

The problem is that test_locale fails on Solaris with such checks. I would like to know how to handle the Solaris issues. One possible solution is to not handle them at all: just raise exceptions and skip the failing tests on Solaris ;-) Another solution is to modify locale.strxfrm() on all platforms to return a list of int instead of a str. The type of the result is not really important; we just have to be able to compare two results (equal, greater, less than or equal, etc.). Any other ideas?
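As a rough sketch of why a list of int would be enough as a sort key: Python lists compare element by element, so they support the same comparisons as strings, including values above 0x10FFFF. The keys below are illustrative only; the dictionary, the names "a" and "b", and the helper sort_by_int_key() are assumptions made for this example, not actual locale.strxfrm() output.

```python
# Hypothetical transformed keys, stored as lists of int rather than str.
# The integers exceed 0x10FFFF, so they could not be stored in a str.
HYPOTHETICAL_KEYS = {
    "a": [0x1010163, 0x1010101, 0x1010103],
    "b": [0x1010163, 0x1010101, 0x1010104],
}


def sort_by_int_key(words, keys=HYPOTHETICAL_KEYS):
    """Sort words using their list-of-int key (lexicographic comparison)."""
    return sorted(words, key=lambda word: keys[word])
```

Here sort_by_int_key(["b", "a"]) orders "a" before "b", because the two keys first differ in their third element.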

The two Solaris issues:

- in the hu_HU locale, localeconv() returns U+30000020 for the thousands separator
- locale.strxfrm() calls wcsxfrm() which returns characters in the range [0x1000000; 0x1FFFFFF]

For localeconv(), the value comes from the b'\xA0' byte string, decoded from an encoding that looks like ISO-8859-?? (b'\xA0' is not decodable from UTF-8). It looks like a bug in the decoder. It also looks like OpenIndiana doesn't use ISO-8859 locales anymore, only UTF-8 locales (which is much better!). I'm unable to reproduce the issue on my OpenIndiana VM.
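The decoding claim can be checked from Python. This is only a sketch: the exact ISO-8859 variant used by the Solaris locale is not known from the message, so ISO-8859-2 is picked here purely as an assumption, and try_decode() is a hypothetical helper.

```python
def try_decode(raw, encoding):
    """Hypothetical helper: decode raw bytes, or return None on failure."""
    try:
        return raw.decode(encoding)
    except UnicodeDecodeError:
        return None


data = b'\xA0'

# A lone 0xA0 byte is a continuation byte, so it is not valid UTF-8.
utf8_result = try_decode(data, 'utf-8')

# Assumption: ISO-8859-2 stands in for the unknown ISO-8859-?? encoding.
# In that encoding, 0xA0 is NO-BREAK SPACE (U+00A0).
latin2_result = try_decode(data, 'iso-8859-2')
```

A correct decoder therefore yields U+00A0 for this byte, which is well inside the valid range, unlike the U+30000020 value reported on Solaris.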

For wcsxfrm(), I'm not sure of the range. Example of a result: {0x1010163, 0x1010101, 0x1010103, 0x1010101, 0x1010103, 0x1010101, 0x1010101}. It looks like wcsxfrm() takes the result of strxfrm(), groups the bytes 3 by 3, and adds 0x1000000 to each group. Example of strxfrm() output for the same input: {0x01, 0x01, 0x63, 0x01, 0x01, 0x01, ...}.
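The guessed relationship between the two outputs can be sketched in Python. This is a reconstruction of the hypothesis above, not the actual Solaris implementation: the function name group_bytes() is hypothetical, the big-endian byte order is inferred from the example values, and how a trailing group of fewer than 3 bytes would be handled is unknown, so the sketch simply rejects it.

```python
OFFSET = 0x1000000  # value apparently added to each 3-byte group


def group_bytes(data):
    """Pack bytes 3 by 3 (big-endian, an inference) and add OFFSET to each."""
    if len(data) % 3:
        # Assumption: partial trailing groups are not modelled here.
        raise ValueError("length must be a multiple of 3")
    return [OFFSET + int.from_bytes(data[i:i + 3], 'big')
            for i in range(0, len(data), 3)]


# First six bytes of the strxfrm() output quoted in the message.
sample = bytes([0x01, 0x01, 0x63, 0x01, 0x01, 0x01])
print([hex(value) for value in group_bytes(sample)])
# prints ['0x1010163', '0x1010101']
```

The two values match the start of the wcsxfrm() result quoted above. Every value produced this way is at least 0x1000000, and so always exceeds U+10FFFF, which is why the new range check fails on it.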

See http://bugs.python.org/issue13441 for more information.

Victor
