In py3k, PyUnicode_Join inherits some complexity from the 2.x days. However, some of the precautions taken there may no longer be needed. Witness the following comment:

    /* Grrrr.  A codec may be invoked to convert str objects to
     * Unicode, and so it's possible to call back into Python code
     * during PyUnicode_FromObject(), and so it's possible for a sick
     * codec to change the size of fseq (if seq is a list).  Therefore
     * we have to keep refetching the size -- can't assume seqlen
     * is invariant.
     */

If the size no longer has to be refetched, it may also become possible to preallocate the target buffer all at once (as bytes.join does) rather than resizing it incrementally. Marc-Andre, what do you think?
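To make the "preallocate all at once" idea concrete, here is a rough pure-Python model of a two-pass join: the first pass computes the exact result size, the second fills a buffer that is allocated once and never resized. This is only a sketch of the strategy, not the actual C code of bytes.join; the name join_preallocated is made up for illustration.

```python
def join_preallocated(sep, items):
    """Join bytes objects, allocating the result buffer exactly once.

    Hypothetical helper: it models the two-pass strategy only and is
    not taken from the CPython sources.
    """
    items = list(items)  # snapshot so both passes see the same elements

    # Pass 1: validate the items and compute the exact result size.
    total = len(sep) * max(len(items) - 1, 0)
    for index, item in enumerate(items):
        if not isinstance(item, bytes):
            raise TypeError(
                f"sequence item {index}: expected bytes, "
                f"{type(item).__name__} found"
            )
        total += len(item)

    # Pass 2: allocate once, then copy each piece into place.
    buf = bytearray(total)
    pos = 0
    for index, item in enumerate(items):
        if index:
            buf[pos:pos + len(sep)] = sep
            pos += len(sep)
        buf[pos:pos + len(item)] = item
        pos += len(item)

    assert pos == total  # the buffer was filled exactly
    return bytes(buf)
```

The two-pass approach is only valid if the sequence cannot change between the passes, which is exactly the invariant the quoted comment says could not be assumed in 2.x.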
The comment gives the wrong impression: the problem is not (only) that a codec might be evil, but that a codec may well execute Python code and thus allow the list to be changed by other threads during the operation. Now, since in Python 3.x codecs are no longer invoked during the join, it is probably safe to assume that no Python code is executed while PyUnicode_Join() is running, but please double-check. It would also be wise to add a sanity check at the end of the loop to verify that the sequence length has indeed not changed (perhaps as an assert).
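As an illustration of the suggested sanity check, here is a small Python model of the loop. The hook parameter is purely hypothetical: it stands in for any Python code that might run in the middle of the join (a codec in 2.x, or another thread). The check is placed at the end of each iteration here, which is one reading of "at the end of the loop"; in C it could just as well be a single assert after the loop.

```python
def join_checked(sep, seq, hook=None):
    """Join str items from a list, verifying the list keeps its size.

    Hypothetical model: 'hook' simulates Python code running mid-join.
    """
    seqlen = len(seq)  # fetched once, assumed invariant from here on
    result = ""
    for i in range(seqlen):
        item = seq[i]
        if not isinstance(item, str):
            raise TypeError(
                f"sequence item {i}: expected str instance, "
                f"{type(item).__name__} found"
            )
        if i:
            result += sep
        result += item

        if hook is not None:
            hook(seq)  # may mutate seq behind our back

        # Sanity check: the length we cached must still be valid.
        if len(seq) != seqlen:
            raise RuntimeError("sequence changed size during join")
    return result
```

With no hook the function behaves like a plain join; a hook that appends to or removes from the list trips the check instead of letting the loop run on a stale length.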
Well, the potentially dangerous function would have been PyUnicode_FromObject, but in py3k it only accepts unicode instances (either exact or subclasses). Since we are only interested in the underlying buffer, we can replace those calls with PyUnicode_Check. I'll work on a patch and keep you updated.
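The restriction to unicode instances can be observed from the Python level. The snippet below is a small check against current CPython 3 behaviour, not part of any patch; the names Loud and try_join are made up for the demonstration. A str subclass is accepted and its underlying buffer is used (an overridden __str__ is not called), while a bytes item is rejected with TypeError rather than being decoded.

```python
class Loud(str):
    """Hypothetical str subclass whose __str__ differs from its buffer."""

    def __str__(self):
        return "SHOUT"


def try_join(sep, items):
    """Return the joined string, or the exception name if join refuses."""
    try:
        return sep.join(items)
    except TypeError as exc:
        return type(exc).__name__


# Subclass instance: join reads the underlying value, not __str__().
subclass_result = try_join("-", [Loud("a"), "b"])

# bytes item: no implicit decoding is attempted, join raises TypeError.
bytes_result = try_join("-", ["a", b"b"])
```

Since no conversion is attempted, no user-defined code needs to run per item, which is what makes a plain type check sufficient.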