[Python-Dev] len(chr(i)) = 2?
Terry Reedy tjreedy at udel.edu
Tue Nov 23 23:44:07 CET 2010
On 11/23/2010 2:11 PM, Alexander Belopolsky wrote:
> This discussion motivated me to start looking into how well Python library itself is prepared to deal with len(chr(i)) = 2. I was not
Good idea!
> surprised to find that textwrap does not handle the issue that well:
>
> >>> len(wrap(' \U00010140' * 80, 20))
> 12
> >>> len(wrap(' \U00000140' * 80, 20))
> 8
How well does textwrap handle composable pairs (letter + accent)? Does it count two code points as one character space, and avoid putting line breaks between them? I suspect textwrap should be regarded as (extended?)_ascii_textwrap.
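A small sketch for checking the letter + accent case. It assumes a current Python 3 and that textwrap measures width with len(), i.e. in code points, which appears to be how the stdlib implementation behaves. The sample strings are made up for illustration.

```python
import textwrap
import unicodedata

# 'e' followed by U+0301 COMBINING ACUTE ACCENT displays as one character
# but is stored as two code points.
decomposed = "e\u0301"
# NFC normalization folds the pair into the single code point U+00E9.
composed = unicodedata.normalize("NFC", decomposed)

print(len(decomposed), len(composed))  # 2 1

# Ten accented letters occupy ten display columns in either form.
text_decomposed = decomposed * 10   # 20 code points
text_composed = composed * 10       # 10 code points

# Wrapping at width 5 forces textwrap to break the long "word".
lines_decomposed = textwrap.wrap(text_decomposed, width=5)
lines_composed = textwrap.wrap(text_composed, width=5)

print(len(lines_decomposed), len(lines_composed))

# Report any line that begins with a combining mark, which means the
# break fell between a letter and its accent.
for number, line in enumerate(lines_decomposed, start=1):
    if unicodedata.combining(line[0]):
        print("line", number, "starts with a combining mark")
```

If the assumption holds, the decomposed text wraps to twice as many lines as the composed text, and some of its lines start with an orphaned accent.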
> That module should probably be rewritten to properly implement the Unicode line breaking algorithm <http://unicode.org/reports/tr14/tr14-22.html>.
Probably a good idea
> Yet finding a bug in a str object method after a 5 min review was a bit discouraging:
>
> >>> 'xyz'.center(20, '\U00010140')
> Traceback (most recent call last):
>   File "<stdin>", line 1, in <module>
> TypeError: The fill character must be exactly one character long
Again, what does it do with letter + decorator combinations? It seems to me that the whole notion that one code point == one printed character space is broken once one leaves ASCII. Perhaps we need an is_uchar function to recognize multi-code sequences, including surrogate pairs, that represent one char for the purpose of character-oriented functions.
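A rough sketch of what such a helper might look like. The names user_chars and center_uchar are hypothetical and not part of any library. The grouping rule (one base code point plus any following code points with a nonzero canonical combining class, with a high/low surrogate pair kept together) is only an approximation: it misses marks whose combining class is 0 and is not a full grapheme cluster implementation.

```python
import unicodedata


def user_chars(s):
    """Split s into approximate user-perceived characters (hypothetical helper)."""
    clusters = []
    i = 0
    while i < len(s):
        j = i + 1
        # Keep a high surrogate together with the low surrogate after it,
        # as would be needed on a narrow build.
        if ("\ud800" <= s[i] <= "\udbff" and j < len(s)
                and "\udc00" <= s[j] <= "\udfff"):
            j += 1
        # Attach any following combining marks to the base.
        while j < len(s) and unicodedata.combining(s[j]):
            j += 1
        clusters.append(s[i:j])
        i = j
    return clusters


def center_uchar(s, width, fill=" "):
    """Center s in width clusters; fill may be any single cluster.

    Any odd leftover padding goes on the right, which may differ from
    the rounding used by str.center.
    """
    if len(user_chars(fill)) != 1:
        raise TypeError("fill must be exactly one user-perceived character")
    pad = max(width - len(user_chars(s)), 0)
    left = pad // 2
    return fill * left + s + fill * (pad - left)


print(len(user_chars("e\u0301x")))                      # 2
print(len(user_chars(center_uchar("xyz", 7, "e\u0301"))))  # 7
```

Counting clusters instead of code points lets the fill be a letter + accent pair or a surrogate pair, both of which the len()-based check rejects.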
> Given the apparent difficulty of writing even basic text processing algorithms in presence of surrogate pairs, I wonder how wise it is to expose Python users to them. As Wikipedia explains, [1]
>
> """ Because the most commonly used characters are all in the Basic Multilingual Plane, converting between surrogate pairs and the original values is often not tested thoroughly. This leads to persistent bugs, and potential security holes, even in popular and well-reviewed application software. """
So we did not test thoroughly enough and need to add appropriate unit tests as bugs are fixed.
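As a starting point for such tests, here is a minimal sketch of the conversion the Wikipedia passage refers to, using the standard UTF-16 arithmetic. The function names to_surrogates and from_surrogates are made up for this example. The check at the end compares against Python's own utf-16-be codec.

```python
def to_surrogates(cp):
    """Return the (high, low) UTF-16 code units for a non-BMP code point."""
    if not 0x10000 <= cp <= 0x10FFFF:
        raise ValueError("code point is not in a supplementary plane")
    offset = cp - 0x10000                  # 20-bit value
    return 0xD800 + (offset >> 10), 0xDC00 + (offset & 0x3FF)


def from_surrogates(high, low):
    """Combine a high and a low surrogate back into one code point."""
    if not (0xD800 <= high <= 0xDBFF and 0xDC00 <= low <= 0xDFFF):
        raise ValueError("not a valid surrogate pair")
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)


# Round-trip the code point from the thread and both range boundaries.
for cp in (0x10000, 0x10140, 0x10FFFF):
    high, low = to_surrogates(cp)
    assert from_surrogates(high, low) == cp
    # Cross-check against the codec's big-endian UTF-16 encoding.
    encoded = chr(cp).encode("utf-16-be")
    assert encoded == high.to_bytes(2, "big") + low.to_bytes(2, "big")

print([hex(unit) for unit in to_surrogates(0x10140)])  # ['0xd800', '0xdd40']
```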
-- Terry Jan Reedy