Attached is a testcase which concatenates a string 100000 times and then a bytes object 100000 times. Here is my result:

sworddragon@ubuntu:~/tmp$ ./test.py
String: 0.03165316581726074
Bytes : 0.5805566310882568
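The attachment itself isn't shown here; a minimal sketch of such a benchmark (the labels match the output above, but the timing helper and its names are my own) could look like this:

```python
import time

N = 100000

def bench(label, empty, chunk):
    # Repeatedly append a short chunk with += and time the whole loop.
    result = empty
    start = time.perf_counter()
    for _ in range(N):
        result += chunk
    elapsed = time.perf_counter() - start
    print(label, elapsed)
    return elapsed

bench("String:", "", "x")
bench("Bytes :", b"", b"x")
```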
It is definitely not a good idea to rely on that += optimization for strings. bytes obviously doesn't have the same optimization. (str didn't either for a while in Python 3, and there was some controversy around adding it back, precisely because one should not rely on it.)
Indeed. If you want to concatenate a lot of bytes objects efficiently, there are three solutions:

- concatenate to a bytearray
- write to an io.BytesIO object
- use b''.join to concatenate all objects at once
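For illustration, the three approaches side by side (a minimal sketch using only the stdlib; the chunk data is made up):

```python
import io

chunks = [b"spam"] * 1000

# 1. Concatenate into a mutable bytearray, then freeze it to bytes.
buf = bytearray()
for chunk in chunks:
    buf += chunk
result1 = bytes(buf)

# 2. Write to an in-memory binary stream and fetch its contents.
stream = io.BytesIO()
for chunk in chunks:
    stream.write(chunk)
result2 = stream.getvalue()

# 3. Join all pieces in one call.
result3 = b"".join(chunks)

assert result1 == result2 == result3
```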
I have extended the benchmark a little, and here are my new results:

concatenate_string()           : 0.037489
concatenate_bytes()            : 2.920202
concatenate_bytearray()        : 0.157311
concatenate_string_io()        : 0.035397
concatenate_bytes_io()         : 0.032835
concatenate_string_join()      : 0.170623
concatenate_string_and_encode(): 0.037280

- As we already know, concatenating bytes is much slower than concatenating strings.
- concatenate_bytearray() shows that doing this with bytearrays is about 4 times slower than concatenating strings. It also returns a bytearray, and I couldn't figure out how to convert it simply to a bytes object in this short time.
- Interestingly, concatenate_string_io() shows that using a StringIO object is faster than concatenating strings directly.
- Even more interesting, concatenate_bytes_io() shows that a BytesIO object is the fastest solution of all.
- Using .join in concatenate_string_join() shows that it is slow too.
- Curiously, I couldn't test concatenate_bytes_join() as it results in an exception, and searching the documentation I couldn't find a join method for bytes objects to figure out what is wrong.
- I have also tested in concatenate_string_and_encode() how fast it is to concatenate strings and then simply encode them. The performance impact compared to concatenating strings directly is too small for the test to measure.

Summary: BytesIO is the fastest solution but needs an extra import. Concatenating strings and then encoding them seems to be the most practical solution if io is not already imported. But I'm wondering why Python can't simply have the string optimization for bytes too.
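For reference, both open points above have direct answers in the language; a quick sketch (the variable names are mine, only stdlib behavior is assumed):

```python
# A bytearray can be frozen to bytes with a plain constructor call.
ba = bytearray(b"hello ")
ba += b"world"
frozen = bytes(ba)

# bytes objects do have a join() method; it accepts an iterable of
# bytes-like objects, and passing str pieces raises TypeError, which
# may explain the exception seen in the benchmark.
joined = b"".join([b"hello", b" ", b"world"])

assert frozen == joined == b"hello world"
```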
Please take these observations and questions to python-list. They aren't really appropriate for the bug tracker. We aren't going to add the optimization shortcut for bytes unless someone does a bunch of convincing on python-ideas, which seems unlikely (but not impossible).