[Python-Dev] [Python-checkins] cpython: Optimize string slicing to use the new API

Victor Stinner victor.stinner at haypocalc.com
Wed Oct 5 01:59:35 CEST 2011


On 04/10/2011 20:09, "Martin v. Löwis" wrote:

On 04.10.11 19:50, Antoine Pitrou wrote:

On Tue, 04 Oct 2011 19:49:09 +0200 "Martin v. Löwis"<martin at v.loewis.de> wrote:

>>> + result = PyUnicode_New(slicelength, PyUnicode_MAX_CHAR_VALUE(self));
>>>
>>> This is incorrect: the maxchar of the slice might be smaller than the maxchar of the input string.
>>
>> I thought that heuristic would be good enough. I'll try to fix it.
>
> No - strings must always be in the canonical form.

I added a check in _PyUnicode_CheckConsistency() (debug mode) to ensure that newly created strings always use the most efficient storage.

> For example, PyUnicode_RichCompare considers strings unequal if they have different kinds. As a consequence, your slice result may not compare equal to a canonical variant of itself.
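The comparison problem described above can be sketched with a small standalone C model. This is only an illustration, not CPython's implementation: `ToyStr` and every `toy_*` function are hypothetical names, and code points are stored as `uint32_t` regardless of `kind` to keep the sketch short. It assumes, as stated above, that equality is rejected as soon as the kinds differ.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Toy model only: this is NOT CPython's PyUnicodeObject. */
typedef struct {
    int kind;               /* bytes per character: 1, 2 or 4 */
    size_t length;          /* number of code points */
    const uint32_t *chars;  /* code points, stored wide for simplicity */
} ToyStr;

/* Narrowest kind able to hold every code point up to maxchar. */
static int toy_kind_for(uint32_t maxchar)
{
    if (maxchar < 0x100)
        return 1;
    if (maxchar < 0x10000)
        return 2;
    return 4;
}

/* Largest code point in chars[start .. start + len). */
static uint32_t toy_max_char(const uint32_t *chars, size_t start, size_t len)
{
    uint32_t maxchar = 0;
    for (size_t i = 0; i < len; i++) {
        if (chars[start + i] > maxchar)
            maxchar = chars[start + i];
    }
    return maxchar;
}

/* Slice that inherits the parent's kind (the heuristic under discussion). */
static ToyStr toy_slice_inherit(const ToyStr *s, size_t start, size_t len)
{
    assert(start <= s->length && len <= s->length - start);
    ToyStr r = { s->kind, len, s->chars + start };
    return r;
}

/* Slice that rescans its own characters to pick the narrowest kind. */
static ToyStr toy_slice_canonical(const ToyStr *s, size_t start, size_t len)
{
    assert(start <= s->length && len <= s->length - start);
    ToyStr r = { toy_kind_for(toy_max_char(s->chars, start, len)),
                 len, s->chars + start };
    return r;
}

/* Equality that bails out early when kinds or lengths differ. */
static bool toy_equal(const ToyStr *a, const ToyStr *b)
{
    if (a->kind != b->kind || a->length != b->length)
        return false;
    for (size_t i = 0; i < a->length; i++) {
        if (a->chars[i] != b->chars[i])
            return false;
    }
    return true;
}
```

In this model, slicing the first two characters of a kind-2 string such as "ab€c" with `toy_slice_inherit` yields a kind-2 "ab", while `toy_slice_canonical` yields a kind-1 "ab"; `toy_equal` then reports the two as different even though their characters match.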

I see this as a micro-optimization. IMO we should not rely on strings being canonical, because we cannot expect every developer of third-party modules to write perfect code, and some (lazy developers!) may prefer to use a fixed maximum character (e.g. 0xFFFF).

To be able to rely on such an assumption, we have to make sure that strings are in canonical form (always check before using a string?). But it would slow down Python because you have to scan the whole string to get the maximum character (see my change in _PyUnicode_CheckConsistency).
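As a rough illustration of that cost, a canonical-form check might look like the following standalone C sketch. It is not the code of _PyUnicode_CheckConsistency; `min_kind` and `uses_narrowest_kind` are hypothetical names, and the input is assumed to be an array of code points. The point is only that the maximum character is not known until every element has been visited, so the check is linear in the string length.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: minimal bytes per character for a given maximum
 * code point (1 up to 0xFF, 2 up to 0xFFFF, 4 otherwise). */
static int min_kind(uint32_t maxchar)
{
    if (maxchar < 0x100)
        return 1;
    if (maxchar < 0x10000)
        return 2;
    return 4;
}

/* Returns true when stored_kind is the narrowest kind that fits chars.
 * The loop touches every element, so the cost is O(len). */
static bool uses_narrowest_kind(const uint32_t *chars, size_t len,
                                int stored_kind)
{
    assert(stored_kind == 1 || stored_kind == 2 || stored_kind == 4);
    uint32_t maxchar = 0;
    for (size_t i = 0; i < len; i++) {
        if (chars[i] > maxchar)
            maxchar = chars[i];
    }
    return stored_kind == min_kind(maxchar);
}
```

Running such a check before every use of a string would add a full pass over the data to each operation, which is the slowdown referred to above.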

I would prefer to drop this micro-optimization and tolerate non-canonical strings (strings not using the most efficient storage).

Even though PEP 393 is fully backward compatible (except that PyUnicode_AS_UNICODE and PyUnicode_AsUnicode may now return NULL), it is already a big change (developers may want to move to the new API to benefit from the advantages of PEP 393), and very few developers understand Unicode correctly.

It's safer to see PEP 393 as a best-effort method. Hopefully, most (or all?) strings created by Python itself are in canonical form.

Victor


