[Python-3000] Unicode and OS strings
Guido van Rossum guido at python.org
Tue Sep 18 23:26:09 CEST 2007
On 9/18/07, Jim Jewett <jimjjewett at gmail.com> wrote:
> On 9/18/07, Stephen J. Turnbull <stephen at xemacs.org> wrote:
> > There's no UTF-8 in Python's internal string encoding. What are you
> > talking about?
>
> (At least as of a few days ago) In Python 3 there is; strings are unicode. A PyUnicodeObject object has two encodings that you can grab from a pointer (which means they have to be there; you don't have time to generate them like you would with a function pointer).
Incorrect. The pointer can be NULL. The API for getting the UTF-8 encoding is a function (moreover a function whose name starts with _Py).
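To illustrate why a pointer that can be NULL fits with an accessor function, here is a minimal Python sketch of a lazily filled cache. The class `LazyText` and the method `as_utf8` are hypothetical names invented for this illustration; only the field name `defenc` comes from the discussion, and this is not the actual C implementation.

```python
class LazyText:
    """Hypothetical stand-in for a string object with a cached UTF-8 form."""

    def __init__(self, value):
        self.value = value
        self.defenc = None  # plays the role of a NULL pointer: nothing cached yet

    def as_utf8(self):
        # Encode on the first request only, then hand back the cached bytes.
        if self.defenc is None:
            self.defenc = self.value.encode("utf-8")
        return self.defenc
```

Callers that read `defenc` directly may see nothing there; only the accessor guarantees a value.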
> One of these (str) is the "internal encoding" which is chosen at compile time, and the other (defenc) is now hard-coded to UTF-8.
> Hashing is also based on the UTF-8 bytestring.
Not any more as of a few hours ago; the hashing based on UTF-8 was excessively expensive, and I rewrote it to directly use the code units(?) (or whatever they are called -- the Py_UNICODE values). For strings not using code units(?) >= 2**16 this will give the same value on all platforms; if there are code units(?) >= 2**16, results vary since these will be represented as surrogates on 2-byte systems but not on 4-byte systems.
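The platform difference can be sketched in Python. This is only an illustration under stated assumptions: `code_units` emulates 2-byte and 4-byte storage using the UTF-16 and UTF-32 codecs, and `toy_hash` is a made-up hash over those values, not the algorithm actually used by the interpreter.

```python
def code_units(text, width):
    """Return the integer code units of text as a 2-byte or 4-byte build would store them."""
    if width == 2:
        data = text.encode("utf-16-le")
    elif width == 4:
        data = text.encode("utf-32-le")
    else:
        raise ValueError("width must be 2 or 4")
    return [int.from_bytes(data[i:i + width], "little")
            for i in range(0, len(data), width)]


def toy_hash(units):
    """Hypothetical hash over code units; NOT the real string hash."""
    h = 0
    for u in units:
        h = ((h * 1000003) ^ u) & 0xFFFFFFFF
    return h


bmp = "caf\u00e9"        # every value is below 2**16
astral = "a\U00010000"   # contains a value >= 2**16

# Same code units on both storage widths, so the hash agrees.
same = toy_hash(code_units(bmp, 2)) == toy_hash(code_units(bmp, 4))

# A surrogate pair on 2-byte storage, one unit on 4-byte storage, so it differs.
differs = toy_hash(code_units(astral, 2)) != toy_hash(code_units(astral, 4))
```

Here `code_units(astral, 2)` yields three values (the letter plus the surrogate pair 0xD800, 0xDC00), while `code_units(astral, 4)` yields two, which is why any hash fed directly from the stored values diverges between the two builds.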
--
--Guido van Rossum (home page: http://www.python.org/~guido/)