PEP: 393
Title: Flexible String Representation
Version: $Revision$
Last-Modified: $Date$
Author: Martin v. Löwis <martin@v.loewis.de>
Status: Final
Type: Standards Track
Content-Type: text/x-rst
Created: 24-Jan-2010
Python-Version: 3.3
Post-History:

Abstract
========

The Unicode string type is changed to support multiple internal
representations, depending on the character with the largest Unicode
ordinal (1, 2, or 4 bytes). This will allow a space-efficient
representation in common cases, but give access to full UCS-4 on all
systems. For compatibility with existing APIs, several representations
may exist in parallel; over time, this compatibility should be phased
out. The distinction between narrow and wide Unicode builds is
dropped. An implementation of this PEP is available at [1]_.

Rationale
=========

There are two classes of complaints about the current implementation
of the unicode type: on systems only supporting UTF-16, users complain
that non-BMP characters are not properly supported. On systems using
UCS-4 internally (and also sometimes on systems using UCS-2), there is
a complaint that Unicode strings take up too much memory - especially
compared to Python 2.x, where the same code would often use ASCII
strings (i.e. ASCII-encoded byte strings). With the proposed approach,
ASCII-only Unicode strings will again use only one byte per character,
while still allowing efficient indexing of strings containing non-BMP
characters (as strings containing them will use 4 bytes per
character).

One problem with the approach is support for existing applications
(e.g. extension modules). For compatibility, redundant representations
may be computed. Applications are encouraged to phase out reliance on
a specific internal representation if possible. As interaction with
other libraries will often require some sort of internal
representation, the specification chooses UTF-8 as the recommended way
of exposing strings to C code.

For many strings (e.g. ASCII), multiple representations may actually
share memory (e.g. the shortest form may be shared with the UTF-8 form
if all characters are ASCII). With such sharing, the overhead of
compatibility representations is reduced. If representations do share
data, it is also possible to omit structure fields, reducing the base
size of string objects.

Specification
=============

Unicode structures are now defined as a hierarchy of structures,
namely::

  typedef struct {
    PyObject_HEAD
    Py_ssize_t length;
    Py_hash_t hash;
    struct {
        unsigned int interned:2;
        unsigned int kind:2;
        unsigned int compact:1;
        unsigned int ascii:1;
        unsigned int ready:1;
    } state;
    wchar_t *wstr;
  } PyASCIIObject;

  typedef struct {
    PyASCIIObject _base;
    Py_ssize_t utf8_length;
    char *utf8;
    Py_ssize_t wstr_length;
  } PyCompactUnicodeObject;

  typedef struct {
    PyCompactUnicodeObject _base;
    union {
        void *any;
        Py_UCS1 *latin1;
        Py_UCS2 *ucs2;
        Py_UCS4 *ucs4;
    } data;
  } PyUnicodeObject;

Objects for which both size and maximum character are known at
creation time are called "compact" unicode objects; character data
immediately follow the base structure. If the maximum character is
less than 128, they use the PyASCIIObject structure; the UTF-8 data is
then the ASCII data itself, and the UTF-8 length and the wstr length
are the same as the length of the ASCII data. For non-ASCII strings,
the PyCompactUnicodeObject structure is used. Resizing compact objects
is not supported.

Objects for which the maximum character is not given at creation time
are called "legacy" objects, created through
PyUnicode_FromStringAndSize(NULL, length). They use the
PyUnicodeObject structure. Initially, their data is only in the wstr
pointer; when PyUnicode_READY is called, the data pointer (union) is
allocated. Resizing is possible as long as PyUnicode_READY has not
been called.

The fields have the following interpretations:
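
Returning to the selection rule itself, the following standalone C
sketch derives the character width (1, 2, or 4 bytes) and the ASCII
case from the largest code point in a buffer. The helper names
``max_code_point``, ``bytes_per_char``, and ``is_ascii_only`` are
hypothetical and not part of the CPython API. The thresholds 0x100 and
0x10000 are assumptions read off the ``latin1``, ``ucs2``, and ``ucs4``
members of the union above; the limit of 128 for the ASCII case is the
one stated in the text.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper (not CPython API): largest code point in cps[0..n). */
static uint32_t max_code_point(const uint32_t *cps, size_t n)
{
    uint32_t maxchar = 0;
    assert(cps != NULL || n == 0);
    for (size_t i = 0; i < n; i++) {
        if (cps[i] > maxchar)
            maxchar = cps[i];
    }
    return maxchar;
}

/* Assumed thresholds: a maximum below 0x100 fits a 1-byte unit,
   below 0x10000 a 2-byte unit, anything larger needs 4 bytes. */
static int bytes_per_char(const uint32_t *cps, size_t n)
{
    uint32_t maxchar = max_code_point(cps, n);
    if (maxchar < 0x100)
        return 1;
    if (maxchar < 0x10000)
        return 2;
    return 4;
}

/* A maximum character below 128 corresponds to the PyASCIIObject case. */
static int is_ascii_only(const uint32_t *cps, size_t n)
{
    return max_code_point(cps, n) < 128;
}
```

For the code points of "Löwis" (which include U+00F6),
``bytes_per_char`` yields 1 while ``is_ascii_only`` yields 0; a single
code point above U+FFFF raises the width to 4 for the whole string.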

+-------+---------+------+--------+-------+--------+-------+--------+-------+
|       | 32-bit  |64-bit| 32-bit |64-bit | 32-bit |64-bit | 32-bit |64-bit |
+-------+---------+------+--------+-------+--------+-------+--------+-------+
|1      | 32      | 64   | 40     | 64    | 32     | 56    | 40     | 80    |
+-------+---------+------+--------+-------+--------+-------+--------+-------+
|2      | 40      | 64   | 40     | 72    | 32     | 56    | 40     | 80    |
+-------+---------+------+--------+-------+--------+-------+--------+-------+
|3      | 40      | 64   | 48     | 72    | 32     | 56    | 40     | 80    |
+-------+---------+------+--------+-------+--------+-------+--------+-------+
|4      | 40      | 72   | 48     | 80    | 32     | 56    | 48     | 80    |
+-------+---------+------+--------+-------+--------+-------+--------+-------+
|5      | 40      | 72   | 56     | 80    | 32     | 56    | 48     | 80    |
+-------+---------+------+--------+-------+--------+-------+--------+-------+
|6      | 48      | 72   | 56     | 88    | 32     | 56    | 48     | 80    |
+-------+---------+------+--------+-------+--------+-------+--------+-------+
|7      | 48      | 72   | 64     | 88    | 32     | 56    | 48     | 80    |
+-------+---------+------+--------+-------+--------+-------+--------+-------+
|8      | 48      | 80   | 64     | 96    | 40     | 64    | 48     | 88    |
+-------+---------+------+--------+-------+--------+-------+--------+-------+

The runtime effect depends significantly on the API being used. After
porting the relevant pieces of code to the new API, the iobench,
stringbench, and json benchmarks typically see slowdowns of 1% to 30%;
specific benchmarks may show speedups, and others may show
significantly larger slowdowns.

In actual measurements of a Django application ([2]_), significant
reductions of memory usage could be found. For example, the storage
for Unicode objects was reduced to 2216807 bytes, down from 6378540
bytes for a wide Unicode build, and down from 3694694 bytes for a
narrow Unicode build (all on a 32-bit system). This reduction came
from the prevalence of ASCII strings in this application; out of
36,000 strings (with 1,310,000 chars), 35713 were ASCII strings (with
1,300,000 chars).
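
The size of these reductions can be checked with simple arithmetic.
The sketch below uses a hypothetical helper, ``percent_saved``, that
is not taken from the measurement scripts; it only restates the byte
counts quoted above as relative savings.

```c
#include <assert.h>

/* Hypothetical helper: percentage of memory saved when storage shrinks
   from `before` bytes to `after` bytes. */
static double percent_saved(double before, double after)
{
    assert(before > 0.0);
    return 100.0 * (before - after) / before;
}
```

With the figures quoted above, ``percent_saved(6378540, 2216807)``
comes to roughly 65% relative to the wide build, and
``percent_saved(3694694, 2216807)`` to roughly 40% relative to the
narrow build.
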
The sources for these strings were not further analysed; many of them
likely originate from identifiers in the library, and string constants
in Django's source code.

In comparison to Python 2, both Unicode and byte strings need to be
accounted for. In the test application, Unicode and byte strings
combined had a length of 2,046,000 units (bytes/chars) in 2.x, and
2,200,000 units in 3.x. On a 32-bit system, where the 2.x build used
32-bit wchar_t/Py_UNICODE, the 2.x test used 3,620,000 bytes, and the
3.x build 3,340,000 bytes. This reduction in 3.x using the PEP
compared to 2.x only occurs when comparing with a wide unicode build.

Porting Guidelines
==================

Only a small fraction of C code is affected by this PEP, namely code
that needs to look "inside" unicode strings. That code doesn't
necessarily need to be ported to this API, as the existing API will
continue to work correctly. In particular, modules that need to
support both Python 2 and Python 3 might get too complicated when
simultaneously supporting this new API and the old Unicode API.

In order to port modules to the new API, try to eliminate the use of
these API elements:

.. [2] Django measurement results
   http://www.dcl.hpi.uni-potsdam.de/home/loewis/djmemprof/

Copyright
=========

This document has been placed in the public domain.

..
   Local Variables:
   mode: indented-text
   indent-tabs-mode: nil
   sentence-end-double-space: t
   fill-column: 70
   coding: utf-8
   End: