[Python-Dev] PEP 393 Summer of Code Project
Guido van Rossum guido at python.org
Wed Aug 24 21:34:34 CEST 2011
On Wed, Aug 24, 2011 at 11:52 AM, Glenn Linderman <v+python at g.nevcal.com> wrote:
> On 8/24/2011 9:00 AM, Stefan Behnel wrote:
>> Nick Coghlan, 24.08.2011 15:06:
>>> On Wed, Aug 24, 2011 at 10:46 AM, Terry Reedy wrote:
>>>> In utf16.py, attached to http://bugs.python.org/issue12729 I propose for consideration a prototype of a different solution to the 'mostly BMP chars, few non-BMP chars' case. Rather than expand every character from 2 bytes to 4, attach an array cpdex of character (ie code point, not code unit) indexes. Then for indexing and slicing, the correction is simple, simpler than I first expected: code-unit-index = char-index + bisect.bisect_left(cpdex, char_index), where code-unit-index is the adjusted index into the full underlying double-byte array. This adds a time penalty of log2(len(cpdex)), but avoids most of the space penalty and the consequent time penalty of moving more bytes around and increasing cache misses.
>>>
>>> Interesting idea, but putting on my C programmer hat, I say -1. Non-uniform cell size = not a C array = standard C array manipulation idioms don't work = pain (no matter how simple the index correction happens to be). The nice thing about PEP 383 is that it gives us the smallest storage array that is both an ordinary C array and has sufficiently large individual elements to handle every character in the string.
>>
>> +1
>
> Yes, this sounds like a nice benefit, but the problem is it is false. The correct statement would be: The nice thing about PEP 383 is that it gives us the smallest storage array that is both an ordinary C array and has sufficiently large individual elements to handle every Unicode codepoint in the string.
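For concreteness, the correction Terry describes above can be sketched in a few lines of Python. The class and attribute names below (WideString, units, cpdex) are illustrative only and are not taken from the utf16.py prototype on the tracker:

    import bisect

    class WideString:
        """Toy 2-byte-code-unit string with an auxiliary index of the
        non-BMP characters (an illustrative sketch, not the utf16.py code)."""

        def __init__(self, text):
            # Store the text as UTF-16-LE code units; non-BMP characters
            # become surrogate pairs (two code units each).
            self.units = text.encode('utf-16-le')
            # cpdex[i] is the character index of the i-th non-BMP character.
            self.cpdex = [i for i, ch in enumerate(text) if ord(ch) > 0xFFFF]

        def __getitem__(self, char_index):
            # Every non-BMP character before char_index adds one extra code
            # unit, so the correction is exactly the bisect formula above.
            n_before = bisect.bisect_left(self.cpdex, char_index)
            unit_index = char_index + n_before
            # The character itself spans two code units if it is non-BMP.
            is_non_bmp = (n_before < len(self.cpdex)
                          and self.cpdex[n_before] == char_index)
            width = 4 if is_non_bmp else 2
            start = 2 * unit_index
            return self.units[start:start + width].decode('utf-16-le')

    s = WideString("a\U0001F600b")
    print(s[0], s[2])              # a b
    print(s[1] == "\U0001F600")    # True

Each index operation costs O(log len(cpdex)), which is the time/space trade-off Terry mentions.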
(PEP 393, I presume. :-)
> As Tom eloquently describes in the referenced issue (is Tom ever non-eloquent?), not all characters can be represented in a single codepoint.
But this is also beside the point (except insofar as we have to remind ourselves not to confuse the two in the docs).
> It seems there are three concepts in Unicode: code units, codepoints, and characters, none of which are equivalent (and the first of which varies according to the encoding). It also seems (to me) that Unicode has failed in its original premise of being an easy way to handle "big char" for "all languages" with fixed-size elements, but it is not clear that its original premise is achievable regardless of the size of "big char" when mixed directionality is desired, and it seems that support of some single languages requires mixed directionality, not to mention mixed-language support.
I see nothing wrong with having the language's fundamental data types (i.e., the unicode object, and even the re module) defined in terms of codepoints, not characters, and I see nothing wrong with len() returning the number of codepoints (as long as it is advertised as such). After all, UTF-8 also defines an encoding for a sequence of code points. Characters that require two or more codepoints are not represented specially in UTF-8 -- they are represented as two or more encoded codepoints. The added requirement that UTF-8 must only be used to represent valid characters is just that -- it doesn't affect how strings are encoded, just what is considered valid at a higher level.
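A short illustration of that codepoint-level view (plain Python 3 behaviour, shown here only to make the distinction concrete):

    import unicodedata

    s = "e\u0301"                 # 'e' + COMBINING ACUTE ACCENT:
                                  # one character, two codepoints
    print(len(s))                 # 2 -- len() counts codepoints
    print(s.encode("utf-8"))      # b'e\xcc\x81' -- each codepoint gets its
                                  # own UTF-8 byte sequence
    print(len(unicodedata.normalize("NFC", s)))   # 1 -- composed to U+00E9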
> Given the required variability of character size in all presently defined Unicode encodings, I tend to agree with Tom that UTF-8, together with some technique of translating character index to code unit offset, may provide the best overall space utilization and adequate CPU efficiency.
There is no doubt that UTF-8 is the most space efficient. I just don't think it is worth giving up O(1) indexing of codepoints -- it would change programmers' expectations too much.
OTOH I am sold on getting rid of the added complexities of "narrow builds" where not even all codepoints can be represented without using surrogate pairs (i.e. two code units per codepoint) and indexing uses code units instead of codepoints. I think this is an area where PEP 393 has a huge advantage: users can get rid of their exceptions for narrow builds.
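The narrow-build wart is easy to show. The snippet below assumes a pre-PEP-393 narrow build on one branch (e.g. a stock 3.2 Windows build) and a wide or PEP 393 build on the other:

    import sys

    s = "\U00010400"                      # one non-BMP codepoint

    if sys.maxunicode == 0xFFFF:          # narrow build (pre-PEP 393)
        print(len(s))                     # 2 -- len() and indexing see UTF-16 code units
        print(hex(ord(s[0])))             # 0xd801 -- a lone surrogate code unit
    else:                                 # wide build, or CPython 3.3+ with PEP 393
        print(len(s))                     # 1 -- len() and indexing see codepoints
        print(s[0] == s)                  # True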
> On the other hand, there are large subsets of applications that simply do not require support for bidirectional text or composed characters, and for those that do not, it remains to be seen whether the price to be paid for supporting those features is too high for such applications. So far, we don't have implementations to benchmark to figure that out!
I think you are saying that many apps can ignore the distinction between codepoints and characters. Given the complexity of bidi rendering and normalization (which will always remain an issue) I agree; this is much less likely to be a burden than the narrow-build issues with code units vs. codepoints.
What should the stdlib do? It should try to skirt the issue where it can (using the garbage-in-garbage-out principle) and advertise what it supports where there is a difference. I don't see why all the stdlib should be made aware of multi-codepoint-characters and other bidi requirements, but it should be clear to the user who has such requirements which stdlib operations they can safely use.
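As a concrete example of where that difference shows through, the usual string operations (and the re module) work per codepoint, which is exactly what a user with multi-codepoint characters needs to be told (stock behaviour, shown purely for illustration):

    import re

    s = "cafe\u0301"                  # "café" spelled with a combining accent

    print(s[:4])                      # 'cafe' -- slicing can split a character
    print(s[::-1])                    # the combining accent ends up in the wrong place
    print(len(re.findall(r".", s)))   # 5 -- re matches codepoints, not characters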
> What does this mean for Python? Well, if Python is willing to limit its support for applications to the subset for which the "big char" solution is sufficient, then PEP 393 provides a way to do that, one that looks to be pretty effective at reducing memory consumption for those applications that use short strings, most of which can be classified by content into the 1-byte or 2-byte representations. Applications that support long strings are more likely to be bitten by the occasional "outlier" character that is longer than the average character, doubling or quadrupling the space needed to represent such strings and eliminating a significant portion of the space savings the PEP provides for other applications.
This seems more of an intuition than a fact. I could easily imagine the facts being that even for large strings, usually either there are no outliers, or there is a significant number of outliers. (E.g. Tom Christiansen's OSCON preso falls in the latter category :-).
As long as it works I don't really mind that there are some extreme cases that are slow. You'll always have that.
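For what it's worth, the mechanics of the outlier effect are easy to demonstrate once an implementation exists. Under PEP 393 as it later shipped in CPython 3.3, sys.getsizeof shows the whole array widening for a single outlier (exact figures vary by platform and version):

    import sys

    ascii_only  = "a" * 1000
    one_bmp     = "a" * 999 + "\u20ac"       # one non-Latin-1 BMP codepoint
    one_non_bmp = "a" * 999 + "\U0001F600"   # one non-BMP codepoint

    for s in (ascii_only, one_bmp, one_non_bmp):
        print(len(s), sys.getsizeof(s))
    # Typical 64-bit CPython 3.3+ output: roughly 1049, 2074 and 4076 bytes --
    # a single outlier codepoint doubles or quadruples the whole string's storage.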
> Benchmarks may or may not fully reflect the actual requirements of all applications, so conclusions based on benchmarking can easily be blind-sided by the realities of other applications unless the benchmarks are carefully constructed.
Yeah, it's a learning process.
> It is possible that the ideas in PEP 393, with its support for multiple underlying representations, could be the basis for some more complex representations that would better support characters rather than only code points, but Martin has stated he is not open to additional representations, so the PEP itself cannot be that basis (although, with care that may or may not be taken in the implementation of the PEP, the implementation may still provide that basis).
There is always the possibility of representations that are defined purely by userland code and can only be manipulated by that specific code. But expecting C extensions to support new representations that haven't been defined yet sounds like a bad idea.
--
--Guido van Rossum (python.org/~guido)