[Python-Dev] Divorcing str and unicode (no more implicit conversions).
Antoine Pitrou solipsis at pitrou.net
Mon Oct 24 23:22:23 CEST 2005
There are many design alternatives: one option would be to support three internal representations in a single type, generating the others from the existing one as needed. The default, initial representation might be UTF-8, with UCS-4 only being generated when indexing occurs, and UCS-2 only being generated when the API requires it. On concatenation, always concatenate just one representation: either one that is already present in both operands, else UTF-8.
Wouldn't it be simpler to use:
- one-byte representation if every character <= 0xFF
- two-byte representation if every character <= 0xFFFF
- four-byte representation otherwise
Then combining several strings means using the larger representation as a result (*). In practice, most use cases will not involve the four-byte representation.
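The width rule above can be sketched in a few lines. This is only an illustration, not an actual implementation proposal: it uses a modern Python 3 str as a stand-in for a sequence of code points, and the names char_width and concat_width are made up for this example.

```python
def char_width(s):
    """Bytes per character needed under the one/two/four-byte scheme."""
    # Highest code point in the string; an empty string fits in one byte.
    highest = max(map(ord, s), default=0)
    if highest <= 0xFF:
        return 1
    if highest <= 0xFFFF:
        return 2
    return 4


def concat_width(a, b):
    """Combining two strings uses the larger of the two representations."""
    return max(char_width(a), char_width(b))
```

For instance, concatenating a pure Latin-1 string with one containing U+20AC would give a two-byte result, and the four-byte form only appears once a character above 0xFFFF is involved.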
(*) A heuristic can be invented so that, when producing a smaller string (by stripping, slicing, etc.), it will "sometimes" check whether a narrower representation is possible. For example: store the length of the string when the last check occurred, and do a new check when the length falls below half that value.
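A rough sketch of that heuristic, under the assumption that a string is modelled as a plain list of code points. The class PackedString, its slice method and the helper needed_width are hypothetical names for illustration only:

```python
def needed_width(code_points):
    """Narrowest of the 1/2/4-byte representations that fits every character."""
    highest = max(code_points, default=0)
    if highest <= 0xFF:
        return 1
    if highest <= 0xFFFF:
        return 2
    return 4


class PackedString:
    """Hypothetical string that re-checks its width only occasionally."""

    def __init__(self, code_points, width=None, checked_len=None):
        self.code_points = list(code_points)
        if width is None:
            # Fresh string: do a full scan and remember the length scanned.
            width = needed_width(self.code_points)
            checked_len = len(self.code_points)
        self.width = width
        self.checked_len = checked_len

    def slice(self, start, stop):
        part = self.code_points[start:stop]
        if len(part) < self.checked_len / 2:
            # Fell below half the length at the last check: scan again.
            return PackedString(part)
        # Otherwise inherit the width, even if it is wider than necessary.
        return PackedString(part, self.width, self.checked_len)
```

With a ten-character string whose only wide character is the last one, slicing off the first six characters keeps the two-byte width without scanning, while slicing that result down to four characters triggers a new check and narrows it to one byte. The cost of the scans stays proportional to the amount of shrinking, at the price of sometimes holding a wider representation than needed.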
Regards
Antoine.