[Python-Dev] Auto-str and auto-unicode in join

Nick Coghlan ncoghlan at iinet.net.au
Fri Aug 27 01:53:19 CEST 2004


Tim Peters wrote:

> I needed a break from intractable database problems, and am almost done with PyUnicode_Join(). I'm not doing auto-unicode(), though, so there will still be plenty of fun left for Nick!

I actually got that mostly working (off slightly out-of-date CVS though).

Joining a sequence of 10 integers with auto-str seems to take about 60% of the time of a str(x) list comprehension on that same sequence (and the PySequence_Fast call means that a generator is slightly slower than a list comp!). For a sequence which mixed strings and non-strings, the gains could only increase.
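A rough way to reproduce the baseline side of this comparison on a current Python is sketched below. It is only an illustration: auto-str join is not available in any released interpreter, so the sketch times just the list-comprehension and generator forms of an explicit str() join. The function names are made up for the example, and the timings will vary by machine and Python version.

```python
import timeit

values = list(range(10))  # ten integers, matching the sequence size mentioned above


def join_list_comp(seq):
    # Explicit conversion with a list comprehension; join receives a real list.
    return "".join([str(x) for x in seq])


def join_generator(seq):
    # Explicit conversion with a generator; join must first materialise it
    # into a sequence before it can size the result.
    return "".join(str(x) for x in seq)


if __name__ == "__main__":
    for fn in (join_list_comp, join_generator):
        elapsed = timeit.timeit(lambda: fn(values), number=100_000)
        print(f"{fn.__name__}: {elapsed:.3f}s for 100,000 calls")
```

Both functions return the same string; only the time spent building the intermediate sequence differs.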

However, there is one somewhat curly problem I'm not sure what to do about.

To avoid slowing down the common case of string join (a list of only strings) it is necessary to do the promotion to string in the type-check & size-calculation pass.

That's fine in the case of a list that consists of only strings and non-basestrings, or the case of a unicode separator - every non-basestring is converted using either PyObject_Str or PyObject_Unicode.

Where it gets weird is something like this:

    ''.join([an_int, a_unicode_str])
    u''.join([an_int, a_unicode_str])

In the first case, the int will first be converted to a string via PyObject_Str, and then that string representation is what will get converted to Unicode after the detection of the unicode string causes the join to be handed over to Unicode join.

In the latter case, the int is converted directly to Unicode.

So my question would be, is it reasonable to expect that PyObject_Unicode(PyObject_Str(some_object)) give the same answer as PyObject_Unicode(some_object)?
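To make the question concrete, here is a small model of an object for which the two routes disagree. It is a sketch written for Python 3, which has no separate unicode type, so `to_unicode` is a hypothetical stand-in for PyObject_Unicode (prefer a `__unicode__` method, otherwise fall back to `str()`), and the `Price` class is invented purely for illustration.

```python
class Price:
    """Hypothetical object whose str and unicode conversions differ."""

    def __init__(self, amount):
        self.amount = amount

    def __str__(self):
        # Plays the role of the 8-bit string form (PyObject_Str).
        return "EUR %d" % self.amount

    def __unicode__(self):
        # Python 2 style hook; Python 3 never calls this on its own.
        return "\u20ac%d" % self.amount


def to_unicode(obj):
    """Stand-in for PyObject_Unicode: use __unicode__ if the type defines it."""
    hook = getattr(type(obj), "__unicode__", None)
    if hook is not None:
        return hook(obj)
    return str(obj)


p = Price(5)
direct = to_unicode(p)         # converted straight to unicode
via_str = to_unicode(str(p))   # converted to str first, then to unicode

print(direct == via_str)
```

For an int the two routes agree in this model, since int defines no `__unicode__`; the difference only appears for types that define both conversions differently.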

If not, then the string join would have to do something whereby it kept a 'pristine' version of the sequence around to hand over to the Unicode join.

My first attempt at implementing this feature had that property, but also had the effect of introducing about a 1% slowdown of the standard sequence-of-strings case (it introduced an extra if statement to see if a 'stringisation' pass was required after the initial type checking and sizing pass). For longer sequences than 10 strings, I imagine the relative slowdown would be much less.

Hmm... I think I see a way to implement this while still avoiding adding any code to the standard path through the function. It would be slower, though, when an iterator is passed in and we automatically invoke PyObject_Str but don't end up delegating to the Unicode join, because it involves making a copy of the sequence that only gets used if the Unicode join is invoked. (If the original object is a real sequence rather than an iterator, there is no extra overhead: we have to make the copy anyway, to avoid mutating the user's sequence.)
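The idea of keeping an untouched copy for the Unicode join can be sketched roughly as follows. This is a Python 3 model and not the C patch: `bytes` stands in for the 8-bit str type and `str` for unicode, and `to_bytes`, `text_join` and `bytes_join` are hypothetical names. For simplicity the sketch always builds two lists, whereas the scheme described above only pays for the second one when an iterator is passed in.

```python
def to_bytes(obj):
    """Stand-in for PyObject_Str: the 8-bit string form of obj."""
    return str(obj).encode("ascii")


def text_join(sep, items):
    """Stand-in for the Unicode join: every item is converted directly."""
    parts = []
    for item in items:
        if isinstance(item, bytes):
            item = item.decode("ascii")
        elif not isinstance(item, str):
            item = str(item)  # direct conversion, never routed through to_bytes
        parts.append(item)
    return sep.join(parts)


def bytes_join(sep, iterable):
    """Model of the 8-bit join with auto-str and delegation to the Unicode join."""
    pristine = list(iterable)  # plays the role of the PySequence_Fast result
    working = list(pristine)   # copy that receives the converted items
    for i, item in enumerate(working):
        if isinstance(item, str):
            # A unicode item was found: hand over the unconverted items.
            return text_join(sep.decode("ascii"), pristine)
        if not isinstance(item, bytes):
            working[i] = to_bytes(item)  # promote during the type-check pass
    return sep.join(working)


print(bytes_join(b"-", [1, b"a", 2]))
print(bytes_join(b"-", [1, "a", 2]))
```

Because the conversions are written into `working`, the caller's list is never mutated, and the Unicode join sees each original object rather than its already-stringified form.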

If people are definitely interested in this feature, I could probably put a patch together next week.

Regards, Nick.
