[Python-Dev] Re: [Python-checkins] python/dist/src/Lib/test test_string.py, 1.25, 1.26

Tim Peters tim.peters at gmail.com
Thu Aug 26 20:21:25 CEST 2004


[Walter Dörwald]

I'm working on it, however I discovered that unicode.join() doesn't optimize this special case:

    s = "foo"
    assert "".join([s]) is s
    u = u"foo"
    assert u"".join([s]) is s

The second assertion fails.

Well, in that example it has to fail, because the input (s) wasn't a unicode string to begin with, but u"".join() must return a unicode string. Maybe you intended to say that

assert u"".join([u]) is u

fails (which is also true today, but doesn't need to be true tomorrow).
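The distinction can be checked directly. The sketch below uses modern Python, where str is the unicode type; the one-item fast path does exist in current CPython, but it is an implementation detail, not something the language guarantees:

```python
# CPython implementation detail: joining a one-item list containing an
# exact str returns that very object rather than building a copy.
s = "x" * 1000              # a long, non-interned string
joined = "".join([s])
print(joined == s)          # always True, by definition of join
print(joined is s)          # True on CPython: the fast path hands back s

# A str subclass does not qualify for the fast path, so the general
# code runs and the result is a fresh, exact str.
class MyStr(str):
    pass

m = MyStr("spam")
print("".join([m]) is m)    # False on CPython: a new str is built
```

This mirrors Tim's point: the identity behaviour is a quality-of-implementation property that tests may reasonably pin down for CPython, even though other implementations are free to return an equal copy.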

I'd say that this test (joining a one item sequence returns the item itself) should be removed because it tests an implementation detail.

Nevertheless, it's an important pragmatic detail. We should never throw away a test just because rearrangement makes it less convenient.

I'm not sure whether the optimization should be added to unicode.find().

Believing you mean join(), yes. Doing common endcases efficiently in C code is an important quality-of-implementation concern, lest people need to add reams of optimization test-&-branch guesses in their own Python code. For example, the SpamBayes tokenizer has many passes that split input strings on magical separators of one kind or another, pasting the remaining pieces together again via string.join(). It's explicitly noted in the code that special-casing the snot out of "separator wasn't found" in Python is a lot slower than letting string.join(single_element_list) just return the list element, so that simple, uniform Python code works well in all cases. It's expected that most of these SB passes won't find the separator they're looking for, and it's important not to make endless copies of unboundedly large strings in the expected case. The more heavily used unicode strings become, the more important that they treat users kindly in such cases too.
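The SpamBayes-style pattern Tim describes can be sketched as follows. The helper here is hypothetical, not the actual SpamBayes code, and the no-copy behaviour in the expected case is a CPython implementation detail:

```python
def remove_separator(text, sep):
    """Drop every occurrence of sep from text.

    Hypothetical helper illustrating the pattern: no special
    'separator not found' branch is needed, because when sep is
    absent, text.split(sep) is just [text], and on CPython
    "".join([text]) hands text back without copying it.
    """
    return "".join(text.split(sep))

big = "no magical separators here " * 10000
print(remove_separator(big, "\x00") is big)  # True on CPython: no copy made
print(remove_separator("a-b-c", "-"))        # abc
```

The simple, uniform code above stays fast precisely because the expected case (separator absent) never copies the unboundedly large input; without the join() fast path, each such pass would duplicate the whole string.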


