[Python-Dev] Divorcing str and unicode (no more implicit conversions).

"Martin v. Löwis" martin at v.loewis.de
Tue Oct 25 23:21:43 CEST 2005

Previous message: [Python-Dev] Divorcing str and unicode (no more implicit conversions).
Next message: [Python-Dev] make testall hanging on HEAD?

Guido van Rossum wrote:

> Yes but why? What does this invariant do for him?

I don't know about this person, but there are a few things that don't work properly in UTF-16 mode:

- The Unicode character database fails to look things up: u"\U0001D670".isupper() gives False, but should give True (since it denotes MATHEMATICAL MONOSPACE CAPITAL A). It gives True in UCS-4 mode.
- As a result, normalization of such characters doesn't work, either: it should normalize to LATIN CAPITAL LETTER A under NFKC, but doesn't.
- Regular expressions have only limited support. In particular, adding non-BMP characters to character classes is not possible: [\U0001D670] will match any character that is either \uD835 or \uDE70, whereas it matches only MATHEMATICAL MONOSPACE CAPITAL A in UCS-4 mode.
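
The three points above can be reproduced roughly as follows. This is only a sketch, and it assumes a current Python 3 interpreter: since PEP 393 (Python 3.3) there are no separate UTF-16 and UCS-4 builds, so the UTF-16 behaviour is approximated here by splitting the character into its two surrogate code units by hand. The helper to_surrogate_pair and the variable names are made up for illustration and are not part of any Python API.

```python
import re
import unicodedata

CH = "\U0001D670"  # MATHEMATICAL MONOSPACE CAPITAL A


def to_surrogate_pair(ch):
    """Split a non-BMP character into its two UTF-16 code units.

    Hypothetical helper, used only to imitate what a UTF-16 build stores.
    """
    offset = ord(ch) - 0x10000
    return chr(0xD800 + (offset >> 10)) + chr(0xDC00 + (offset & 0x3FF))


# Whole-code-point view: roughly what UCS-4 mode sees.
wide_isupper = CH.isupper()                     # database lookup succeeds
wide_nfkc = unicodedata.normalize("NFKC", CH)   # compatibility-normalized form
wide_class = re.compile("[" + CH + "]")         # class with one member

# Code-unit view: roughly what UTF-16 mode sees.
PAIR = to_surrogate_pair(CH)                    # "\ud835\ude70"
narrow_isupper = PAIR.isupper()                 # surrogates carry no case
narrow_class = re.compile("[" + PAIR + "]")     # class with two members

print("isupper, code point view:", wide_isupper)
print("isupper, code unit view: ", narrow_isupper)
print("NFKC, code point view:   ", wide_nfkc)
print("class matches lone \\ud835, code point view:",
      wide_class.fullmatch("\ud835") is not None)
print("class matches lone \\ud835, code unit view: ",
      narrow_class.fullmatch("\ud835") is not None)
```

The last two lines show the character-class problem: once the class is built from code units, each surrogate is an independent member, so a lone \uD835 or \uDE70 matches.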

There might be more limitations, but those are the ones that come to mind easily. While I could imagine fixing the first two with some effort, the third one is really tricky (unless you would accept a "wide" representation of a character class even if the Unicode representation is only narrow).

Regards, Martin
