[Python-Dev] Divorcing str and unicode (no more implicit conversions).

Guido van Rossum guido at python.org
Tue Oct 25 00:47:22 CEST 2005


On 10/24/05, "Martin v. Löwis" <martin at v.loewis.de> wrote:

> Guido van Rossum wrote:
> > Changing the APIs would be much work, although perhaps not impossible
> > for Python 3000. For example, Raymond Hettinger's partition() API
> > doesn't refer to indices at all, and can replace many uses of find()
> > or index().
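
A minimal sketch of the point about partition(), assuming a hypothetical "key=value" input line. The helper names are illustrative only, and the code uses str.partition() as it exists in current Python rather than the API as it was being discussed at the time:

```python
# Hypothetical input and helper names, for illustration only.
line = "name=Guido"

def split_with_find(s, sep="="):
    """Index-based style: find() returns a position that is then used to slice."""
    i = s.find(sep)
    if i == -1:
        return s, ""
    return s[:i], s[i + len(sep):]

def split_with_partition(s, sep="="):
    """Index-free style: partition() hands back (head, sep, tail) directly."""
    head, _, tail = s.partition(sep)
    return head, tail

print(split_with_find(line))
print(split_with_partition(line))
```

Both helpers return the same pair, but only the first one depends on integer positions into the string, which is the part that becomes costly under a variable-width internal representation.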

> I think Neil's proposal is not to make them go away, but to implement
> them less efficiently. For example, if the internal representation is
> UTF-8, indexing requires linear time, as opposed to constant time. If
> the internal representation is UTF-16, and you have a flag to indicate
> whether there are any surrogates in the string, indexing is constant
> if the flag is false, else linear.
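
The cost difference described above can be sketched as follows. This is an illustration under stated assumptions, not how any Python implementation stores strings: utf8_char_at and Utf16Text are hypothetical names, and indexing is by code point.

```python
import struct

def utf8_char_at(buf, index):
    """Return the code point at position `index` in UTF-8 encoded bytes.

    Hypothetical helper: it scans from the start, so the cost is O(n).
    """
    if index < 0:
        raise IndexError("negative index not supported in this sketch")
    count = -1
    start = None
    for pos, byte in enumerate(buf):
        # Bytes of the form 10xxxxxx are continuation bytes; any other
        # byte starts a new code point.
        if byte & 0xC0 != 0x80:
            count += 1
            if count == index:
                start = pos
            elif count == index + 1:
                return buf[start:pos].decode("utf-8")
    if start is None:
        raise IndexError("index out of range")
    return buf[start:].decode("utf-8")

class Utf16Text:
    """Hypothetical UTF-16 storage with a 'contains surrogates' flag."""

    def __init__(self, text):
        data = text.encode("utf-16-le")
        self.units = struct.unpack("<%dH" % (len(data) // 2), data)
        self.has_surrogates = any(0xD800 <= u <= 0xDFFF for u in self.units)

    def char_at(self, index):
        if index < 0:
            raise IndexError("negative index not supported in this sketch")
        if not self.has_surrogates:
            # Every code point is exactly one 16-bit unit: O(1) lookup.
            return chr(self.units[index])
        # Otherwise walk the units, stepping over surrogate pairs: O(n).
        pos = 0
        remaining = index
        while pos < len(self.units):
            width = 2 if 0xD800 <= self.units[pos] <= 0xDBFF else 1
            if remaining == 0:
                chunk = self.units[pos:pos + width]
                raw = struct.pack("<%dH" % len(chunk), *chunk)
                return raw.decode("utf-16-le")
            remaining -= 1
            pos += width
        raise IndexError("index out of range")
```

In the UTF-16 case the flag decides once, at construction time, whether later lookups can take the constant-time path.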

I understand all that. My point is that it's a bad idea to offer an indexing operation that isn't O(1).

> > Perhaps we could provide a different kind of API to support the
> > latter, perhaps based on a mutable character buffer data type without
> > direct indexing?

> There are different design goals conflicting here:
> - some think: "all my data is ASCII, so I want to only use one byte
>   per character".
> - others think: "all my data goes to the Windows API, so I want to use
>   2 bytes per character".
> - yet others think: "I want all of Unicode, with proper, efficient
>   indexing, so I want four bytes per char".
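
The three storage costs in the list above can be checked with a short sketch. The function name encoded_sizes and the sample strings are assumptions chosen for illustration:

```python
def encoded_sizes(text):
    """Return the byte length of `text` under three Unicode encodings."""
    encodings = ("utf-8", "utf-16-le", "utf-32-le")
    return {enc: len(text.encode(enc)) for enc in encodings}

# Hypothetical samples: pure ASCII, one accented letter, one non-BMP character.
for sample in ("hello", "h\u00e9llo", "\U0001F600"):
    print(repr(sample), len(sample), encoded_sizes(sample))
```

Only the four-byte form has a fixed width for every code point, which is what makes constant-time indexing straightforward there.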

I doubt the last one though. Probably they really don't want efficient indexing, they want to perform higher-level operations that currently are only possible using efficient indexing or slicing. With the right API, perhaps they could work just as efficiently with an internal representation of UTF-8.

> It's not so much a matter of API as a matter of internal
> representation. The API doesn't have to change (except for the very
> low-level C API that directly exposes Py_UNICODE*, perhaps).

I think the API should reflect the representation to some extent, namely it shouldn't claim to have operations that are typically thought of as O(1) that can only be implemented as O(n). An internal representation of UTF-8 might make everyone happy except heavy Windows users; but it requires changes to the API so people won't be writing Python 2.x-style string slinging code.

--
--Guido van Rossum (home page: http://www.python.org/~guido/)
