[Python-Dev] Divorcing str and unicode (no more implicit conversions).
"Martin v. Löwis" martin at v.loewis.de
Tue Oct 25 23:21:43 CEST 2005
- Previous message: [Python-Dev] Divorcing str and unicode (no more implicit conversions).
- Next message: [Python-Dev] make testall hanging on HEAD?
- Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]
Guido van Rossum wrote:
> Yes but why? What does this invariant do for him?
I don't know about this person, but there are a few things that don't work properly in UTF-16 mode:
- the Unicode character database fails to look things up. u"\U0001D670".isupper() gives False, but should give True (since it denotes MATHEMATICAL MONOSPACE CAPITAL A). It gives True in UCS-4 mode.
- as a result, normalization of such characters doesn't work, either. It should normalize to "LATIN CAPITAL LETTER A" under NFKC, but doesn't.
- regular expressions only have limited support. In particular, adding non-BMP characters to character classes is not possible. [\U0001D670] will match any character that is either \uD835 or \uDE70, whereas it only matches MATHEMATICAL MONOSPACE CAPITAL A in UCS-4 mode.
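The three points above can be checked with a short script. This is only a sketch, written for a current Python 3 interpreter, which no longer has separate UTF-16 and UCS-4 builds, so it shows the behaviour described for UCS-4 mode; on a UTF-16 build of that time the results would differ as listed. The helper name check_non_bmp is made up for illustration.

```python
import re
import unicodedata

CH = "\U0001D670"  # MATHEMATICAL MONOSPACE CAPITAL A, outside the BMP


def check_non_bmp(ch):
    """Collect the three behaviours from the list (hypothetical helper)."""
    return {
        # 1. character database lookup
        "name": unicodedata.name(ch),
        "isupper": ch.isupper(),
        # 2. compatibility normalization
        "nfkc": unicodedata.normalize("NFKC", ch),
        # 3. character class containing a non-BMP character
        "class_matches_char": re.fullmatch("[\U0001D670]", ch) is not None,
        "class_matches_lone_surrogate":
            re.fullmatch("[\U0001D670]", "\ud835") is not None,
    }


results = check_non_bmp(CH)
for key, value in results.items():
    print(key, "=", value)
```

With code points stored whole, the class matches only the intended character and not a lone \uD835.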
There might be more limitations, but those are the ones that come to mind easily. While I could imagine fixing the first two with some effort, the third one is really tricky (unless you would accept a "wide" representation of a character class even if the Unicode representation is only narrow).
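To illustrate why the character class is the tricky case: in UTF-16, U+1D670 is stored as two code units, so a class written as [\U0001D670] is really a class of two separate members. The sketch below computes the surrogate pair and imitates that narrow class on a current Python 3 interpreter by building it from the two lone surrogates. The function name utf16_surrogates is an assumption for illustration, not part of any library.

```python
import re


def utf16_surrogates(ch):
    """Return the (high, low) UTF-16 code units of a non-BMP character."""
    cp = ord(ch)
    if cp < 0x10000:
        raise ValueError("BMP character, no surrogate pair needed")
    offset = cp - 0x10000
    # top 10 bits go into the high surrogate, bottom 10 into the low one
    return 0xD800 + (offset >> 10), 0xDC00 + (offset & 0x3FF)


high, low = utf16_surrogates("\U0001D670")
print(hex(high), hex(low))

# Imitation of what a narrow build sees: a class with two members.
narrow_class = re.compile("[" + chr(high) + chr(low) + "]")

matches_high = narrow_class.fullmatch(chr(high)) is not None
matches_low = narrow_class.fullmatch(chr(low)) is not None
matches_whole_char = narrow_class.fullmatch("\U0001D670") is not None
```

A "wide" representation of the class would amount to recombining such pairs into one code point before the class is built, which is what the parenthetical above suggests.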
Regards, Martin