[Python-Dev] Auto-str and auto-unicode in join

M.-A. Lemburg mal at egenix.com
Fri Aug 27 11:02:05 CEST 2004


Nick Coghlan wrote:

Tim Peters wrote:

I needed a break from intractable database problems, and am almost done with PyUnicode_Join(). I'm not doing auto-unicode(), though, so there will still be plenty of fun left for Nick!

I actually got that mostly working (off slightly out-of-date CVS though). Joining a sequence of 10 integers with auto-str seems to take about 60% of the time of a str(x) list comprehension on that same sequence (and the PySequence_Fast call means that a generator is slightly slower than a list comp!). For a sequence which mixed strings and non-strings, the gains could only increase.

However, there is one somewhat curly problem I'm not sure what to do about. To avoid slowing down the common case of string join (a list of only strings) it is necessary to do the promotion to string in the type-check & size-calculation pass.

That's fine in the case of a list that consists of only strings and non-basestrings, or the case of a unicode separator - every non-basestring is converted using either PyObject_Str or PyObject_Unicode.

Where it gets weird is something like this:

    ''.join([an_int, a_unicode_str])
    u''.join([an_int, a_unicode_str])

This gives you a TypeError, so it's a non-issue (.join() does not do an implicit call to str(obj) on the list elements).
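The behaviour described here can be checked with a small sketch. It is written for Python 3, where str is already Unicode, so it only illustrates the missing implicit str() call, not the str/unicode split discussed in this thread. The helper name auto_str_join is hypothetical and simply spells out what an auto-str join would be equivalent to.

```python
def auto_str_join(sep, items):
    """Hypothetical helper: join items, calling str() on non-strings first."""
    return sep.join([item if isinstance(item, str) else str(item)
                     for item in items])


mixed = [1, "two", 3.0]

# The built-in join does not convert its elements, so this raises TypeError.
try:
    "-".join(mixed)
    join_error = None
except TypeError as exc:
    join_error = exc

# The explicit conversion is what callers have to write today.
joined = auto_str_join("-", mixed)
print(joined)
```

The list comprehension inside the helper is the "str(x) list comprehension" that the quoted timing comparison refers to.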

The real issue is the case where you have [a_str, a_unicode_obj] and for that the current implementation already does the right thing, namely to look for Unicode objects in the length checking pass.

In the first case, the int will first be converted to a string via PyObject_Str, and then that string representation is what will get converted to Unicode after the detection of the unicode string causes the join to be handed over to Unicode join.

In the latter case, the int is converted directly to Unicode. So my question would be, is it reasonable to expect that PyObject_Unicode(PyObject_Str(some_object)) gives the same answer as PyObject_Unicode(some_object)?

If not, then the string join would have to do something whereby it kept a 'pristine' version of the sequence around to hand over to the Unicode join.
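To make the question concrete, here is a rough Python 3 emulation of an object for which the two conversion paths disagree. Python 3 has no separate unicode type, so to_unicode below is a hypothetical stand-in for PyObject_Unicode (prefer a __unicode__ method, otherwise fall back to str()), and the Temperature class is invented purely for illustration.

```python
def to_unicode(obj):
    """Hypothetical stand-in for PyObject_Unicode.

    Prefers a __unicode__ method when the type defines one and falls
    back to str() otherwise. This is an assumption made for the sketch,
    not the actual C implementation.
    """
    method = getattr(type(obj), "__unicode__", None)
    if method is not None:
        return method(obj)
    return str(obj)


class Temperature:
    """Invented example type whose two conversions differ."""

    def __init__(self, degrees):
        self.degrees = degrees

    def __str__(self):
        return "%d deg" % self.degrees

    def __unicode__(self):
        return "%d\u00b0" % self.degrees


t = Temperature(21)
via_str = to_unicode(str(t))   # str first, then Unicode
direct = to_unicode(t)         # straight to Unicode
print(via_str == direct)
```

For a type like this, converting through str() first loses whatever the direct Unicode conversion would have produced, which is why a join that has already stringified the item cannot simply hand its result to the Unicode join.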

My first attempt at implementing this feature had that property, but also had the effect of introducing about a 1% slowdown of the standard sequence-of-strings case (it introduced an extra if statement to see if a 'stringisation' pass was required after the initial type checking and sizing pass). For longer sequences than 10 strings, I imagine the relative slowdown would be much less.

Hmm... I think I see a way to implement this, while still avoiding adding any code to the standard path through the function. It'd be slower for the case where an iterator is passed in, and we automatically invoke PyObject_Str but don't end up delegating to Unicode join, though, as it involves making a copy of the sequence that only gets used if the Unicode join is invoked. (If the original object is a real sequence, rather than an iterator, there is no extra overhead - we have to make the copy anyway, to avoid mutating the user's sequence).

If people are definitely interested in this feature, I could probably put a patch together next week.

Regards,
Nick.
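The "pristine copy" idea can be sketched in Python 3 by letting bytes play the role of the old 8-bit str and str play the role of unicode. This is only a model of the approach under those assumptions, not the C patch: the function names string_join and unicode_join are hypothetical, ASCII is assumed as the default encoding, and a separate converted list is used instead of converting in place.

```python
def unicode_join(sep, items):
    """Stand-in for the Unicode join: convert each item directly to text."""
    parts = []
    for item in items:
        if isinstance(item, str):
            parts.append(item)
        elif isinstance(item, bytes):
            parts.append(item.decode("ascii"))
        else:
            parts.append(str(item))  # direct conversion, like PyObject_Unicode
    return sep.join(parts)


def string_join(sep, iterable):
    """Stand-in for the 8-bit string join with auto-str promotion."""
    # Untouched copy of the input. It is only needed if we delegate, and
    # it also keeps the caller's own sequence from being modified.
    pristine = list(iterable)
    converted = []
    for item in pristine:
        if isinstance(item, str):
            # A Unicode item was found: hand over the original items so
            # nothing is converted to an 8-bit string first.
            return unicode_join(sep.decode("ascii"), pristine)
        if not isinstance(item, bytes):
            item = str(item).encode("ascii")  # like PyObject_Str
        converted.append(item)
    return sep.join(converted)


source = [1, b"a", 2.5]
narrow = string_join(b", ", source)
wide = string_join(b"-", iter([1, "x", 2]))
print(narrow, wide)
```

In this model the common all-strings path never looks at the pristine copy; its cost is only paid up front, which mirrors the trade-off described above for the iterator case.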



-- Marc-Andre Lemburg eGenix.com



