[Python-Dev] RFC: Add a new builtin strarray type to Python?

Victor Stinner victor.stinner at haypocalc.com
Sat Oct 1 19:17:56 CEST 2011


Hi,

Since the integration of PEP 393, str += str is no longer super-fast (just fast). For example, adding a single character to a string has to copy all characters to a new string. I suppose the performance of many applications that manipulate text may be affected by this issue, especially text templating libraries.
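To make the cost concrete, here is a minimal sketch comparing repeated += with collecting the pieces and joining once. The function names are made up for illustration, and no timings are shown because they depend on the build and the machine.

```python
def build_with_concat(pieces):
    # Each += may have to copy the text accumulated so far into a new string.
    result = ""
    for piece in pieces:
        result += piece
    return result


def build_with_join(pieces):
    # Gather the pieces first; str.join() then builds the result in one pass.
    return "".join(list(pieces))


# 1000 single-character pieces, appended one at a time.
pieces = [chr(ord("a") + i % 26) for i in range(1000)]
assert build_with_concat(pieces) == build_with_join(pieces)
```

Both functions produce the same text; only the number of intermediate copies differs.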

io.StringIO has also been changed to store characters as Py_UCS4 (4 bytes) instead of Py_UNICODE (2 or 4 bytes). This class doesn't benefit from the new PEP 393.

I propose adding a new builtin type to Python to address both issues (CPU and memory): strarray. This type would have the same API as str, except:

I'm writing this email to ask you if this type solves a real issue, or if we can just prove the super-fast str.join(list of str).

--

strarray is similar to bytearray, but different: strarray('abc')[0] is 'a', not 97, and strarray can store any Unicode character (not only integers in range 0-255).
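The difference from bytearray can be sketched in a few lines. StrArraySketch below is a hypothetical, list-backed stand-in written only for this illustration. It is not the implementation linked in this mail, and its small method set is an assumption.

```python
class StrArraySketch:
    """Hypothetical mutable text buffer; not the proposed implementation."""

    def __init__(self, text=""):
        # One str of length 1 per code point.
        self._chars = list(text)

    def __getitem__(self, index):
        # Integer indexing returns a str, as str does (slices are not handled).
        return self._chars[index]

    def __iadd__(self, text):
        # In-place append: the existing characters are not copied.
        self._chars.extend(text)
        return self

    def __str__(self):
        return "".join(self._chars)


# bytearray indexing yields an int, and values above 255 are rejected.
assert bytearray(b"abc")[0] == 97
try:
    bytearray([0x20AC])
except ValueError:
    pass

buf = StrArraySketch("abc")
assert buf[0] == "a"
buf += "\u20ac"  # EURO SIGN, outside the 0-255 range
assert str(buf) == "abc\u20ac"
```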

I wrote a quick and dirty implementation in Python just to be able to play with the API, and to get an idea of the amount of work required to implement it:

https://bitbucket.org/haypo/misc/src/tip/python/strarray.py

(Some methods are untested: see the included TODO list.)

--

Implementing strarray in C is not trivial, and it would be easier to do it in 3 steps:

(a) Use a Py_UCS4 array
(b) The array type depends on the content: best memory footprint, as in PEP 393
(c) Use strarray to implement a new io.StringIO

Or we can just stop after step (a).

--

strarray API has to be discussed.

Most bytearray methods return a new object in most cases. I don't understand why; it's not efficient. I don't know if we can do in-place operations for strarray methods that have the same name as bytearray methods (which are not in-place methods).
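For reference, a short check of the bytearray behaviour in question, using only existing bytearray methods: upper() hands back a new object, while += and extend() modify the buffer in place.

```python
data = bytearray(b"abc")

# A str-like method such as upper() returns a new bytearray.
upper = data.upper()
assert upper is not data
assert data == bytearray(b"abc")   # the original is unchanged
assert upper == bytearray(b"ABC")

# += and extend() mutate the existing object instead.
alias = data
data += b"def"
assert alias is data
data.extend(b"!")
assert alias == bytearray(b"abcdef!")
```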

str has some methods that bytes and bytearray don't have, like format. We may do in-place operations for these methods.

Victor


