[Python-Dev] PEP 393 Summer of Code Project

Victor Stinner victor.stinner at haypocalc.com
Wed Aug 24 23:10:32 CEST 2011


On Wednesday, 24 August 2011 at 20:52:51, Glenn Linderman wrote:

> Given the required variability of character size in all presently defined Unicode encodings, I tend to agree with Tom that UTF-8, together with some technique of translating character index to code unit offset, may provide the best overall space utilization and adequate CPU efficiency.

UTF-8 can use more space than latin1 or UCS2:

text="abc"; len(text.encode("latin1")), len(text.encode("utf8")) (3, 3) text="ééé"; len(text.encode("latin1")), len(text.encode("utf8")) (3, 6) text="€€€"; len(text.encode("utf-16-le")), len(text.encode("utf8")) (6, 9) text="北京"; len(text.encode("utf-16-le")), len(text.encode("utf8")) (4, 6)

UTF-8 uses less space than PEP 393 only if you have few non-ASCII characters (or few non-BMP characters).
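To make the trade-off concrete, here is a rough sketch (not CPython code) of the rule in the PEP: the per-character payload is 1 byte if all code points are below U+0100, 2 bytes if below U+10000, otherwise 4 bytes; object headers and caches are ignored here.

def pep393_payload(text):
    # Approximate PEP 393 character payload: the width is chosen from the
    # largest code point in the string (headers and caches ignored).
    m = max(map(ord, text)) if text else 0
    width = 1 if m < 0x100 else 2 if m < 0x10000 else 4
    return len(text) * width

for text in ("abc", "ééé", "€€€", "北京"):
    print(ascii(text), pep393_payload(text), len(text.encode("utf-8")))

Running it reproduces the sizes above: é fits in the 1-byte (Latin-1) representation, so PEP 393 needs 3 bytes where UTF-8 needs 6.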

About speed, I guess that O(n) (UTF-8 indexing) is slower than O(1) (PEP 393 indexing).
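As an illustration of why indexing by code point into a UTF-8 buffer is O(n) (only a sketch, not how CPython implements anything): to find the n-th code point you have to scan the bytes and skip continuation bytes.

def utf8_index(data, n):
    # Byte offset of the n-th code point in UTF-8 bytes `data`.
    # Continuation bytes look like 0b10xxxxxx; every other byte starts a
    # new code point, so the scan is O(n).
    count = 0
    for offset, byte in enumerate(data):
        if byte & 0xC0 != 0x80:
            if count == n:
                return offset
            count += 1
    raise IndexError("string index out of range")

data = "aé€北".encode("utf-8")
print(utf8_index(data, 3))   # 6: 1 + 2 + 3 bytes precede the 4th character

With PEP 393 (as with the current UCS-2/UCS-4 implementations), text[i] is a single array lookup instead.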

> ... Applications that support long strings are more likely to be bitten by the occasional "outlier" character that is longer than the average character, doubling or quadrupling the space needed to represent such strings, and eliminating a significant portion of the space savings the PEP is providing for other applications.

In these worst cases, PEP 393 is not worse than the current implementation: it uses just as much memory as Python in wide mode (the mode used on Linux and Mac OS X, because wchar_t is 32 bits there). But it uses double the memory of Python in narrow mode (Windows).

I agree that UTF-8 is better in these corner cases, but I also bet that most Python programs will use less memory and will be faster with PEP 393. You can already try the pep-393 branch on your own programs.
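A rough worked example of such a worst case (payload only, ignoring object headers): for s = "a" * 1000 + "\U0001F600", UTF-8 needs about 1004 bytes, a narrow (UTF-16) build about 2004 bytes (the non-BMP character becomes a surrogate pair), and PEP 393, like a wide build, about 4004 bytes, because the single outlier character forces the 4-bytes-per-character representation.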

> Benchmarks may or may not fully reflect the actual requirements of all applications, so conclusions based on benchmarking can easily be blind-sided by the realities of other applications, unless the benchmarks are carefully constructed.

I used stringbench and "./python -m test test_unicode". I plan to try iobench.

Which other benchmark tool should be used? Should we write a new one?
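In the meantime, very rough spot checks can be done with timeit (only a sketch; it is no substitute for stringbench or iobench, and the numbers depend heavily on the workload):

import timeit

# Micro-benchmark a few string operations on a mostly-ASCII string with
# one non-Latin-1 outlier at the end (the PEP 393 "worst case" shape).
setup = 'text = "x" * 10000 + "\\u20ac"'
for stmt in ('text[5000]', 'text[1000:2000]', 'text.encode("utf-8")'):
    best = min(timeit.repeat(stmt, setup, number=10000))
    print(stmt, best)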

> It is possible that the ideas in PEP 393, with its support for multiple underlying representations, could be the basis for some more complex representations that would better support characters rather than only supporting code points, ...

I don't think that the default Unicode type is the best place for this. The base Unicode type has to be very efficient.

If you have unusual needs, write your own type. Maybe based on the base type?
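For example, a character-oriented type could be layered on top of str. The following toy sketch (not a real proposal, and far from a full UAX #29 implementation) just attaches combining marks to the preceding base character:

import unicodedata

class GraphemeString:
    # Toy user-level type built on top of str: indexes by (simplified)
    # grapheme clusters instead of code points.  Combining marks are
    # attached to the preceding base character; this is NOT full UAX #29.
    def __init__(self, text):
        self._text = text
        self._clusters = []
        for ch in text:
            if self._clusters and unicodedata.combining(ch):
                self._clusters[-1] += ch
            else:
                self._clusters.append(ch)

    def __len__(self):
        return len(self._clusters)

    def __getitem__(self, index):
        return self._clusters[index]

s = GraphemeString("e\u0301tude")   # 'é' written as e + COMBINING ACUTE ACCENT
print(len(s), s[0])                 # 5 é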

Victor


