[Python-Dev] Re: [Python-checkins] python/dist/src/Lib/test test_string.py, 1.25, 1.26

Walter Dörwald walter at livinglogic.de
Thu Aug 26 21:54:34 CEST 2004


Tim Peters wrote:

> [Walter Dörwald]
>
>> I'm working on it, however I discovered that unicode.join() doesn't optimize this special case:
>>
>>     s = "foo"
>>     assert "".join([s]) is s
>>
>>     u = u"foo"
>>     assert u"".join([s]) is s
>>
>> The second assertion fails.
>
> Well, in that example it has to fail, because the input (s) wasn't a unicode string to begin with, but u"".join() must return a unicode string. Maybe you intended to say that
>
>     assert u"".join([u]) is u
>
> fails

Argl, you're right.

> (which is also true today, but doesn't need to be true tomorrow).

I've removed the test today, so it won't fail tomorrow. ;)

>> I'd say that this test (joining a one-item sequence returns the item itself) should be removed because it tests an implementation detail.
>
> Nevertheless, it's an important pragmatic detail. We should never throw away a test just because rearrangement makes a test less convenient.

So, should I put the test back in (in test_str.py)?
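
If the test does go back in, one option is to keep the guaranteed behaviour and the identity shortcut in separate test methods, so that the second is visibly an implementation detail. The following is only a sketch in present-day Python, where str is the unicode type; the class name and the CPython-only skip condition are assumptions made for illustration and are not what test_str.py contains:

```python
import platform
import unittest


class SingleItemJoinTest(unittest.TestCase):
    def test_value(self):
        # Guaranteed everywhere: the result equals the only item.
        s = "foo"
        self.assertEqual("".join([s]), s)

    @unittest.skipUnless(
        platform.python_implementation() == "CPython",
        "returning the item itself is an implementation detail",
    )
    def test_identity(self):
        # Pragmatic detail: no copy is made for a one-item sequence.
        s = "foo"
        self.assertIs("".join([s]), s)
```

Saved to a file, this could be run with python -m unittest; on other implementations the identity check is skipped rather than failing.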

>> I'm not sure whether the optimization should be added to unicode.find().
>
> Believing you mean join(), yes.

Unfortunately the implementations of str.join and unicode.join look completely different: str.join does a PySequence_Fast() and then tests whether the sequence length is 0 or 1, whereas unicode.join iterates through the argument via PyObject_GetIter()/PyIter_Next().

Adding the optimization might result in a complete rewrite of PyUnicode_Join().
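
To make the difference concrete, here is a rough Python model of the two strategies described above. It is not the C code: the function names are made up, io.StringIO merely stands in for the output buffer, and the exact-type check on the single item is an assumption about how such a shortcut would be guarded.

```python
import io


def _copy_all(sep, items):
    """General path: copy every item (and separator) into a fresh buffer."""
    buffer = io.StringIO()
    for index, item in enumerate(items):
        if index:
            buffer.write(sep)
        buffer.write(item)
    return buffer.getvalue()


def join_via_sequence(sep, iterable):
    # Stand-in for PySequence_Fast(): lists and tuples are used directly,
    # any other iterable is first copied into a list.
    items = iterable if isinstance(iterable, (list, tuple)) else list(iterable)
    if len(items) == 0:
        return ""
    # Assumption: the shortcut applies only to exact str items.
    if len(items) == 1 and type(items[0]) is str:
        return items[0]  # the item itself, no copy
    return _copy_all(sep, items)


def join_via_iterator(sep, iterable):
    # Stand-in for PyObject_GetIter()/PyIter_Next(): the number of items is
    # unknown until the iterator is exhausted, so every item is copied.
    return _copy_all(sep, iter(iterable))
```

Because the length is known up front only in the first variant, the 0 and 1 checks fit there naturally; the second has no obvious place for them without first collecting the items.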

> Doing common endcases efficiently in C code is an important quality-of-implementation concern, lest people need to add reams of optimization test-&-branch guesses in their own Python code. For example, the SpamBayes tokenizer has many passes that split input strings on magical separators of one kind or another, pasting the remaining pieces together again via string.join(). It's explicitly noted in the code that special-casing the snot out of "separator wasn't found" in Python is a lot slower than letting string.join(singleelementlist) just return the list element, so that simple, uniform Python code works well in all cases. It's expected that most of these SB passes won't find the separator they're looking for, and it's important not to make endless copies of unboundedly large strings in the expected case. The more heavily used unicode strings become, the more important that they treat users kindly in such cases too.
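
The pattern described above can be sketched as follows. This is not SpamBayes code; both helper names are hypothetical, and plain str.split() stands in for whatever the real passes use to find their separators.

```python
def remove_marker(text, marker):
    """Uniform version: split on the marker and paste the pieces together."""
    # marker must be non-empty; str.split() rejects an empty separator.
    return "".join(text.split(marker))


def remove_marker_guarded(text, marker):
    """Special-cased version: test for the marker before doing any work."""
    if marker not in text:
        return text
    return "".join(text.split(marker))
```

Both return equal results. Whether the uniform version hands back the original object when the marker is absent, instead of a copy of a possibly very large string, depends on join() (and split()) taking the one-item shortcut, which is exactly the point under discussion.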

Seems like we have to rewrite PyUnicode_Join().

Bye, Walter Dörwald


