[Python-Dev] Divorcing str and unicode (no more implicit conversions).
Neil Hodgson nyamatongwe at gmail.com
Wed Oct 26 07:49:39 CEST 2005
- Previous message: [Python-Dev] Divorcing str and unicode (no more implicit conversions).
- Next message: [Python-Dev] Divorcing str and unicode (no more implicit conversions).
- Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]
M.-A. Lemburg:
You mean a slice that slices out the next ?
Yes.
This sounds a lot like you'd want iterators for the various index types. Should be possible to implement on top of the proposed APIs, e.g. itergraphemes(u), itercodepoints(u), etc.
Iterators may be helpful, but can also be too restrictive when the processing is not completely iterative, such as peeking ahead or looking behind to wrap at a word boundary in the display example. There should be
It was more that a move away from indexes to slices may leave less scope for error. The PEP provides ways to specify what you want to examine or modify, but it looks to me like returning indexes will lead to code repetition or additional variables, and with them an increase in fragility.
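To make the index versus slice point concrete, here is a minimal sketch in current Python 3. The name grapheme_slices is hypothetical and is not one of the PEP's proposed APIs. It also uses a deliberate simplification: a grapheme is taken to be a code point with combining class 0 plus any following code points with a nonzero combining class, which is not the full Unicode segmentation algorithm.

```python
import unicodedata

def grapheme_slices(u):
    """Yield one slice object per approximate grapheme in u.

    Simplifying assumption: a new grapheme starts at every code point
    whose canonical combining class is 0. Real grapheme segmentation
    has more rules than this.
    """
    start = 0
    for i in range(1, len(u)):
        if unicodedata.combining(u[i]) == 0:
            # u[i] starts a new grapheme, so close the previous one.
            yield slice(start, i)
            start = i
    if u:
        yield slice(start, len(u))

text = "e\u0301a"  # 'e' + COMBINING ACUTE ACCENT, then 'a'
slices = list(grapheme_slices(text))
pieces = [text[s] for s in slices]
```

Each slice carries both ends of the span, so the caller never has to pair up a start index with a separately computed end index. Collecting the slices into a list also allows peeking ahead or looking behind, which a plain iterator does not.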
Note that what most people refer to as "character" is a grapheme in Unicode speak.
A grapheme-oriented string type may be worthwhile although you'd probably have to choose a particular normalisation form to ease processing.
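As a small illustration of why a normalisation form would have to be chosen, this sketch uses the standard unicodedata module in current Python 3. Picking NFC here is only an assumption for the example; nothing in this thread settles on a form.

```python
import unicodedata

composed = "\u00e9"      # LATIN SMALL LETTER E WITH ACUTE, one code point
decomposed = "e\u0301"   # 'e' followed by COMBINING ACUTE ACCENT

# The same grapheme compares unequal as raw code point sequences.
same_before = composed == decomposed

# After normalising both to one form, the comparison succeeds.
nfc_a = unicodedata.normalize("NFC", composed)
nfc_b = unicodedata.normalize("NFC", decomposed)
same_after = nfc_a == nfc_b
```

Without a fixed form, a grapheme-oriented type would have to treat both spellings as equal on every comparison.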
Given that interpretation, "breaking" Unicode "characters" is something you won't ever work around by using larger code units such as UCS4 compatible ones.
I still think we can reduce the scope for errors.
Furthermore, you should also note that surrogates (two code units encoding one code point) are part of Unicode life. While you don't need them when storing Unicode in UCS4 code units, they can still be part of the Unicode data and the programmer has to be aware of these.
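For reference, a short sketch of what "two code units encoding one code point" looks like. It assumes a current Python 3 interpreter, where str is indexed by code point, and the helper name utf16_code_units is made up for this example.

```python
def utf16_code_units(u):
    """Return the UTF-16 code units of u as a list of integers."""
    data = u.encode("utf-16-le")
    # Each code unit is two bytes, stored little-endian.
    return [int.from_bytes(data[i:i + 2], "little")
            for i in range(0, len(data), 2)]

clef = "\U0001D11E"  # MUSICAL SYMBOL G CLEF, outside the BMP
units = utf16_code_units(clef)

# The same pair computed by hand from the code point.
offset = ord(clef) - 0x10000
high = 0xD800 + (offset >> 10)    # high (lead) surrogate
low = 0xDC00 + (offset & 0x3FF)   # low (trail) surrogate
```

Here units is [0xD834, 0xDD1E], matching high and low. Code that indexes by code unit can land between the two halves.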
Many programmers can and will ignore surrogates. One day that may bite them, but we can't close off text processing to those who have no idea what surrogates or directional marks are, that sorting is locale-dependent, or how the NFC and NFKD normalization forms differ.
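For anyone unsure of the NFC versus NFKD difference, a minimal sketch with the standard unicodedata module. The ligature is just a convenient example character.

```python
import unicodedata

ligature = "\ufb01"  # LATIN SMALL LIGATURE FI

# NFC applies only canonical mappings, so the ligature is kept.
nfc = unicodedata.normalize("NFC", ligature)

# NFKD also applies compatibility mappings, splitting it into 'f' and 'i'.
nfkd = unicodedata.normalize("NFKD", ligature)
```

So nfc is still the single ligature code point while nfkd is the two-letter string "fi". The compatibility forms can change what the text looks like, not just how it is encoded.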
Neil