[Python-Dev] len(chr(i)) = 2?
Alexander Belopolsky alexander.belopolsky at gmail.com
Tue Nov 23 20:11:06 CET 2010
On Mon, Nov 22, 2010 at 1:13 PM, Raymond Hettinger <raymond.hettinger at gmail.com> wrote: ..
> Any explanation we give users needs to let them know two things:
> * that we cover the entire range of unicode not just BMP
> * that sometimes len(chr(i)) is one and sometimes two
This discussion motivated me to start looking into how well the Python library itself is prepared to deal with len(chr(i)) = 2. I was not surprised to find that textwrap does not handle the issue well:
>>> from textwrap import wrap
>>> len(wrap(' \U00010140' * 80, 20))
12
>>> len(wrap(' \U00000140' * 80, 20))
8
That module should probably be rewritten to properly implement the Unicode line breaking algorithm <http://unicode.org/reports/tr14/tr14-22.html>.
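To make the discrepancy above concrete, here is a minimal sketch that counts a string both ways: in UTF-16 code units, which is what len() reports on a narrow build, and in code points, with each surrogate pair counted once. The helper names utf16_units and code_points are made up for illustration and are not part of textwrap. The sketch assumes Python 3.4 or later, where the 'surrogatepass' error handler works with the UTF-16 codecs.

```python
def utf16_units(s):
    """Count UTF-16 code units in s (what len() reports on a narrow build)."""
    # 'surrogatepass' lets lone surrogates through instead of raising.
    return len(s.encode("utf-16-le", "surrogatepass")) // 2


def code_points(s):
    """Count code points in s, treating a high+low surrogate pair as one."""
    count = 0
    i = 0
    while i < len(s):
        is_pair = (
            "\ud800" <= s[i] <= "\udbff"
            and i + 1 < len(s)
            and "\udc00" <= s[i + 1] <= "\udfff"
        )
        i += 2 if is_pair else 1
        count += 1
    return count


s = " \U00010140" * 80
print(utf16_units(s), code_points(s))  # 240 160
```

A wrapping routine that budgets line width in code units rather than code points would see the first number, which is consistent with the extra lines in the session above.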
Yet finding a bug in a str object method after a five-minute review was a bit discouraging:
>>> 'xyz'.center(20, '\U00010140')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: The fill character must be exactly one character long
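For comparison, a surrogate-aware version of this operation can be sketched in pure Python. This is only an illustration under stated assumptions, not how str.center() is implemented: split_code_points and center_cp are hypothetical names, and the rounding rule is intended to mirror what CPython does for str.center().

```python
def split_code_points(s):
    """Split s into a list of code points, keeping surrogate pairs together."""
    out = []
    i = 0
    while i < len(s):
        step = 1
        if ("\ud800" <= s[i] <= "\udbff" and i + 1 < len(s)
                and "\udc00" <= s[i + 1] <= "\udfff"):
            step = 2
        out.append(s[i:i + step])
        i += step
    return out


def center_cp(text, width, fill=" "):
    """Like str.center(), but measures text and fill in code points."""
    if len(split_code_points(fill)) != 1:
        raise TypeError("The fill character must be exactly one code point")
    margin = width - len(split_code_points(text))
    if margin <= 0:
        return text
    # Intended to mirror the rounding CPython applies in str.center().
    left = margin // 2 + (margin & width & 1)
    return fill * left + text + fill * (margin - left)


print(center_cp("xyz", 20, "\U00010140"))
```

The point of the sketch is that the fill is validated by code point count rather than by len(), so a fill stored as a surrogate pair is accepted.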
Given the apparent difficulty of writing even basic text processing algorithms in the presence of surrogate pairs, I wonder how wise it is to expose Python users to them. As Wikipedia explains [1]:
""" Because the most commonly used characters are all in the Basic Multilingual Plane, converting between surrogate pairs and the original values is often not tested thoroughly. This leads to persistent bugs, and potential security holes, even in popular and well-reviewed application software. """
Since UCS-2 (the Character Encoding Form (CEF)) is now defined [2] to cover only the BMP, maybe rather than changing the terms used in the reference manual, we should tighten the code to conform to the updated standards?
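To illustrate what "covering only the BMP" would mean for a given string, a check might look like the sketch below. The name is_bmp_only is invented here and is not an existing or proposed API; the sketch only shows the condition, it does not argue for where (or whether) such a check should be enforced.

```python
def is_bmp_only(s):
    """True if every element of s is a BMP code point that is not a surrogate."""
    for ch in s:
        cp = ord(ch)
        # Above U+FFFF: non-BMP on a wide build.
        # U+D800..U+DFFF: half of a surrogate pair on a narrow build.
        if cp > 0xFFFF or 0xD800 <= cp <= 0xDFFF:
            return False
    return True


print(is_bmp_only("\u0140"))      # True
print(is_bmp_only("\U00010140"))  # False
```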
Again, given that the str object itself has at least one non-BMP character bug as we are closing in on the third major release of py3k, how likely are third-party developers to get their libraries right as they port to 3.x?
[1] http://en.wikipedia.org/wiki/UTF-16/UCS-2
[2] http://unicode.org/reports/tr17/#CharacterEncodingForm