[Python-Dev] Divorcing str and unicode (no more implicit conversions).

Antoine Pitrou solipsis at pitrou.net
Mon Oct 24 23:22:23 CEST 2005


There are many design alternatives: one option would be to support three internal representations in a single type, generating the others from the one that already exists, as needed. The default, initial representation might be UTF-8, with UCS-4 only being generated when indexing occurs, and UCS-2 only being generated when the API requires it. On concatenation, always concatenate just one representation: either one that is already present in both operands, else UTF-8.

Wouldn't it be simpler to use:

- one-byte representation if every character <= 0xFF
- two-byte representation if every character <= 0xFFFF
- four-byte representation otherwise

Combining several strings then means using the widest representation among the operands as the result (*). In practice, most use cases will not involve the four-byte representation.
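To make the width rule concrete, here is a minimal sketch in pure Python. The names `width_for` and `concat_width` are made up for illustration; a real implementation would live in C inside the string type and would not rescan the operands on every concatenation.

```python
def width_for(text: str) -> int:
    """Return the bytes per character needed for `text`: 1, 2, or 4.

    Hypothetical helper: scans every character and picks the narrowest
    fixed-width representation that can hold the largest one.
    """
    highest = max(map(ord, text), default=0)  # empty string fits in one byte
    if highest <= 0xFF:
        return 1
    if highest <= 0xFFFF:
        return 2
    return 4


def concat_width(*parts: str) -> int:
    """Width of a concatenation: the widest representation among the parts."""
    return max((width_for(part) for part in parts), default=1)
```

For instance, `concat_width("abc", "\u20ac")` yields 2, since the euro sign does not fit in one byte but everything fits in two.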

(*) A heuristic could be devised so that, when producing a smaller string (by stripping, slicing, etc.), the implementation occasionally checks whether a narrower representation is possible. For example: store the length of the string at the time of the last check, and run a new check when the length falls below half that value.
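The heuristic in (*) could look roughly like the following sketch. `FlexString`, `narrowest_width`, and the `checked_len` attribute are hypothetical names, and the wrapper around a plain `str` only models the bookkeeping, not the actual storage.

```python
def narrowest_width(text: str) -> int:
    """Bytes per character (1, 2, or 4) needed to store `text`."""
    highest = max(map(ord, text), default=0)
    return 1 if highest <= 0xFF else 2 if highest <= 0xFFFF else 4


class FlexString:
    """Hypothetical string that remembers its width and last-check length."""

    def __init__(self, text, width=None, checked_len=None):
        self.text = text
        # Without inherited bookkeeping, do a full scan and record its length.
        self.width = narrowest_width(text) if width is None else width
        self.checked_len = len(text) if checked_len is None else checked_len

    def __getitem__(self, key):
        piece = self.text[key]
        if len(piece) * 2 < self.checked_len:
            # Length fell below half the last-checked length: rescan,
            # which may select a narrower representation.
            return FlexString(piece)
        # Otherwise inherit the parent's width without scanning.
        return FlexString(piece, self.width, self.checked_len)
```

Because the threshold halves after each check, the total rescanning cost over a chain of shrinking slices stays bounded by a constant multiple of the original length.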

Regards

Antoine.
