[Python-Dev] Optimize Unicode strings in Python 3.3
Victor Stinner victor.stinner at gmail.com
Fri May 4 01:45:15 CEST 2012
Hi,
Different people are working on improving the performance of Unicode strings in Python 3.3. This Python version is very different from Python 3.2 because of PEP 393, and it is still unclear to me what the best way is to create a new Unicode string.
There are different approaches:
- Use the legacy (Py_UNICODE) API and let PyUnicode_READY() convert the result to the canonical form. The CJK codecs still use this API.
- Use a Py_UCS4 buffer and then convert it to the canonical form (ASCII, UCS1 or UCS2). This is the approach taken by io.StringIO. io.StringIO is used not only to write but also to read, so a Py_UCS4 buffer is a good compromise.
- Use the PyAccu API: an optimized C version of the Python idiom chunks = []; for ...: chunks.append(text); return ''.join(chunks).
- Two steps: compute the length and maximum character of the output string, allocate the output string, and then write the characters. str%args used this approach until today (see the sketch after this list).
- Optimistic approach: start with an ASCII buffer, then enlarge and widen (to UCS2 and then UCS4) the buffer as new characters are written. This approach is used by the UTF-8 decoder and, as of today, by str%args.
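To make the two-step approach concrete, here is a minimal sketch using only the public PEP 393 API; copy_string() is a hypothetical example of mine (a plain copy, just to show the two passes), not code from CPython:

#include <Python.h>

/* Two-step construction: first scan the input to compute the length
   and the maximum character of the output, then allocate the output
   string in its canonical form and write the characters into it. */
static PyObject *
copy_string(PyObject *input)
{
    Py_ssize_t i, len;
    Py_UCS4 maxchar = 0;
    int kind;
    void *data;
    PyObject *result;

    if (PyUnicode_READY(input) < 0)
        return NULL;
    len = PyUnicode_GET_LENGTH(input);
    kind = PyUnicode_KIND(input);
    data = PyUnicode_DATA(input);

    /* Step 1: compute the length and the maximum character. */
    for (i = 0; i < len; i++) {
        Py_UCS4 ch = PyUnicode_READ(kind, data, i);
        if (ch > maxchar)
            maxchar = ch;
    }

    /* Step 2: allocate the string (the ASCII, UCS1, UCS2 or UCS4
       representation is chosen from maxchar) and write the characters. */
    result = PyUnicode_New(len, maxchar);
    if (result == NULL)
        return NULL;
    for (i = 0; i < len; i++) {
        Py_UCS4 ch = PyUnicode_READ(kind, data, i);
        PyUnicode_WRITE(PyUnicode_KIND(result), PyUnicode_DATA(result), i, ch);
    }
    return result;
}

For the Py_UCS4-buffer approach, PyUnicode_FromKindAndData(PyUnicode_4BYTE_KIND, buffer, length) does the conversion to the canonical form in a single call.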
The optimistic approach uses realloc() to resize the string. It is faster than the PyAccu approach (at least for short ASCII strings), maybe because it avoids creating temporary short strings. realloc() appears to be efficient on Linux and Windows (at least Windows 7). A simplified sketch of such a writer follows.
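The names writer_t and writer_write_char() below are mine, and the sketch widens straight from UCS1 to UCS4 for brevity, whereas the real code goes UCS1 -> UCS2 -> UCS4:

#include <Python.h>

/* Simplified optimistic writer: start with a UCS1 buffer, enlarge it
   with PyMem_Realloc(), and widen it to UCS4 only when a character
   outside Latin-1 is written. */
typedef struct {
    Py_UCS1 *ucs1;      /* narrow buffer, used while all chars <= 0xFF */
    Py_UCS4 *ucs4;      /* wide buffer, used after the first widening */
    Py_ssize_t len, allocated;
} writer_t;

static int
writer_write_char(writer_t *w, Py_UCS4 ch)
{
    /* Enlarge: overallocate to amortize the cost of realloc(). */
    if (w->len >= w->allocated) {
        Py_ssize_t newalloc = w->allocated + w->allocated / 2 + 16;
        if (w->ucs4 != NULL) {
            Py_UCS4 *buf = PyMem_Realloc(w->ucs4, newalloc * sizeof(Py_UCS4));
            if (buf == NULL)
                return -1;
            w->ucs4 = buf;
        }
        else {
            Py_UCS1 *buf = PyMem_Realloc(w->ucs1, newalloc * sizeof(Py_UCS1));
            if (buf == NULL)
                return -1;
            w->ucs1 = buf;
        }
        w->allocated = newalloc;
    }
    /* Widen: copy the UCS1 prefix into a fresh UCS4 buffer. */
    if (ch > 0xFF && w->ucs4 == NULL) {
        Py_ssize_t i;
        w->ucs4 = PyMem_Malloc(w->allocated * sizeof(Py_UCS4));
        if (w->ucs4 == NULL)
            return -1;
        for (i = 0; i < w->len; i++)
            w->ucs4[i] = w->ucs1[i];
        PyMem_Free(w->ucs1);
        w->ucs1 = NULL;
    }
    if (w->ucs4 != NULL)
        w->ucs4[w->len++] = ch;
    else
        w->ucs1[w->len++] = (Py_UCS1)ch;
    return 0;
}

A finish step can then build the final string with PyUnicode_FromKindAndData() on the filled buffer, passing PyUnicode_1BYTE_KIND or PyUnicode_4BYTE_KIND depending on whether the writer was widened.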
Various notes:
- PyUnicode_READ() is slower than reading a Py_UNICODE array.
- Some decoders unroll the main loop to process 4 or 8 bytes (on 32- and 64-bit CPUs, respectively) per step; see the sketch after these notes.
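As an illustration of that unrolling trick, here is a word-at-a-time ASCII scan; the helper name ascii_prefix_length() is mine, but the mask idea is the one used by the ascii and utf-8 decoders:

#include <stdint.h>
#include <stddef.h>
#include <string.h>

/* Return the number of leading ASCII bytes in s[0..len).  The middle
   of the buffer is scanned 8 bytes at a time: 0x8080808080808080 has
   the top bit of every byte set, so a nonzero AND means that at least
   one byte in the word is >= 0x80. */
static size_t
ascii_prefix_length(const unsigned char *s, size_t len)
{
    const unsigned char *p = s;
    const unsigned char *end = s + len;

    /* Scan byte by byte until p is 8-byte aligned. */
    while (p < end && ((uintptr_t)p & 7) != 0) {
        if (*p & 0x80)
            return (size_t)(p - s);
        p++;
    }
    /* Scan 8 bytes per iteration. */
    while (p + 8 <= end) {
        uint64_t word;
        memcpy(&word, p, 8);    /* p is aligned: this compiles to one load */
        if (word & UINT64_C(0x8080808080808080))
            break;              /* some byte is non-ASCII: finish byte-wise */
        p += 8;
    }
    /* Scan the remaining bytes. */
    while (p < end && (*p & 0x80) == 0)
        p++;
    return (size_t)(p - s);
}

A decoder can memcpy such an all-ASCII prefix directly into the UCS1 buffer of the result and fall back to the character-by-character path only for the rest of the input.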
I am interested in hearing about other tricks to optimize Unicode strings in Python, and from anyone who would like to work on this topic.
There are open issues related to optimizing Unicode:
- #11313: Speed up default encode()/decode()
- #12807: Optimization/refactoring for {bytearray, bytes, unicode}.strip()
- #14419: Faster ascii decoding
- #14624: Faster utf-16 decoder
- #14625: Faster utf-32 decoder
- #14654: More fast utf-8 decoding
- #14716: Use unicode_writer API for str.format()
Victor