Issue 1569040: Speed up using + for string concatenation

Created on 2006-10-02 04:04 by larry, last changed 2022-04-11 14:56 by admin. This issue is now closed.

Files
File name Uploaded Description Edit
python.string.concatenation.object.txt larry,2006-10-02 04:04 Patch with respect to revision 52084.
lch.python.concatenation.2.patch larry,2006-10-05 22:13 Patch against revision 52195. Cleaned up patch slightly: reduced memory usage slightly, removed marker comments, fixed some comments and some whitespace. review
concat.benchmarks.zip larry,2006-10-13 07:26 Zip file containing output from pybench and stringbench tests.
lch.python.concatenation.4.patch larry,2006-10-13 07:28 Patch against revision 52323. Added max() for platforms that don't have it, purged // comments, trimmed lines to below 80 characters. review
concat.test.py larry,2006-10-13 07:32 My hacked-together benchmark. test #12 is the case I optimized for.
linux.concat.pybench.results.zip larry,2006-10-13 08:05 Zip file containing output from pybench run on a Linux 2.6 system.
Messages (12)
msg51181 - (view) Author: Larry Hastings (larry) * (Python committer) Date: 2006-10-02 04:04
The core concept: adding two strings together no longer returns a pure "string" object. Instead, it returns a "string concatenation" object which holds references to the two strings but does not actually concatenate them... yet. The strings are concatenated only when someone requests the string's value, at which point it allocates all the space it needs and renders the concatenated string all at once. More to the point, if you add multiple strings together (a + b + c), it *doesn't* compute the intermediate strings (a + b). Upsides to this approach: * String concatenation using + is now the fastest way to concatenate strings (that I know of). * In particular, prepending is *way* faster than it used to be. It used to be a pathological case, n! or something. Now it's linear. * Throw off the shackles of "".join([]), you don't need it anymore. * Did I mention it was faster? Downsides to this approach: * Changes how PyStringObjects are stored internally; ob_sval is no longer a char[1], but a char *. This makes each StringObject four bytes larger. * Adds another memory dereference in order to get the value of a string, which is a teensy-weensy slowdown. * Would force a recompile of all C modules that deal directly with string objects (which I imagine is most of them). * Also, *requires* that C modules use the PyString_AS_STRING() macro, rather than casting the object and grabbing ob_sval directly. (I was pleased to see that the Python source was very good about using this macro; if all Python C modules are this well-behaved, this point is happily moot.) * On a related note, the file Mac/Modules/MacOS.c implies that there are Mac-specific Python scripts that peer directly into string objects. These would have to be changed to understand the new semantics. * String concatenation objects are 36 bytes larger than string objects, and this space will often go unreclaimed after the string is rendered. 
* When rendered, string concatenation objects storing long strings will allocate a second buffer from the heap to store the string. So this adds some minor allocation overhead (though this is offset by the speed gain from the approach overall). * Will definitely need some heavy review before it could go in, in particular I worry I got the semantics surrounding "interned" strings wrong.
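The scheme described in msg51181 can be sketched in pure Python. This is a toy illustration of the idea only, not the C patch itself; the class and attribute names here are hypothetical. A concatenation object keeps references to its parts and joins them in a single pass only when the value is actually requested.

```python
class LazyConcat:
    """Toy model of the patch's 'string concatenation' object.

    Holds references to its parts; the real string is built only
    when someone asks for the value (via render() / str()).
    """

    def __init__(self, *parts):
        self.parts = list(parts)  # references only, no copying yet
        self._rendered = None     # cache for the final string

    def __add__(self, other):
        # a + b + c accumulates one flat parts list -- the
        # intermediate (a + b) string is never materialized.
        return LazyConcat(*self.parts, other)

    def render(self):
        if self._rendered is None:
            # One allocation, one pass -- same cost as "".join().
            self._rendered = "".join(self.parts)
        return self._rendered

    __str__ = render


s = LazyConcat("a") + "b" + "c"
print(s.render())  # prints: abc
```

Each `+` is O(1) here (it only extends the parts list), so building up n pieces and rendering once is linear in the total length, which is the property the patch claims for CPython's `+`.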
msg84466 - (view) Author: Daniel Diniz (ajaksu2) * (Python triager) Date: 2009-03-30 02:34
IIRC, this was rejected as part of a larger string views proposal. Leaving open so that current performance optimizers can take a look at this :)
msg84467 - (view) Author: Benjamin Peterson (benjamin.peterson) * (Python committer) Date: 2009-03-30 02:43
I'm rejecting this because previous string "views" have been rejected.
msg84504 - (view) Author: Larry Hastings (larry) * (Python committer) Date: 2009-03-30 04:56
I'm not saying that killing a two-year-old DOA patch is the wrong move--though I hold out hope that lazy string concatenation in CPython will yet happen. But you shouldn't kill this patch just because you think it has something to do with "string views"--it doesn't. "String views" were suggested by Josiah Carlson. A "string view" was a separate object with a reference to a string (or strings); this patch changed the implementation of strings directly. Josiah Carlson would be the first to point out that "lazy strings" and "string views" were different: http://mail.python.org/pipermail/python-3000/2007-January/005546.html Here's my take on what happened with "lazy string concatenation". GvR asked me to port it to Py3k. I optimistically combined "lazy string concatenation" with another optimization, "lazy string slices", and submitted the combined patch. GvR examined the two patches together and disliked it because of the behavior of the "lazy string slices". http://bugs.python.org/issue1629305 I subsequently tried to drum up interest in lazy concatenation without lazy string slices: http://mail.python.org/pipermail/python-3000/2007-January/005641.html but there was no strong interest. Finally, GvR officially rejected the patch. I hold out hope that "lazy string concatenation", separate from "lazy string slices", could still make it in. (Truthfully, I hope they *both* could make it in someday, if I did a better job.) I wouldn't claim these are a no-brainer--see my original description of the patch for caveats--but I still think they're worth the tradeoffs. FWIW, ajaksu2, another implementations of Python *has* already implemented this approach: PyPy. http://codespeak.net/pypy/dist/pypy/doc/interpreter-optimizations.html#string-join-objects Writing for the sake of posterity, /larry/
msg84540 - (view) Author: Benjamin Peterson (benjamin.peterson) * (Python committer) Date: 2009-03-30 12:18
In that case, I think it's probably ok to reopen until we have a more definite accept or reject.
msg84541 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2009-03-30 12:23
For information: I haven't done any benchmarks, but the StringIO type in Python 3.1 should be very efficient for this kind of purpose.
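The StringIO approach Antoine mentions is plain standard-library usage: write pieces into an in-memory buffer and extract the result once, instead of re-copying the accumulated string on every append. A small example:

```python
import io

# Build a string incrementally; the buffer grows amortized,
# so there is no quadratic re-copying of the prefix.
buf = io.StringIO()
for i in range(5):
    buf.write(f"chunk{i} ")

result = buf.getvalue()  # one final string, built at the end
print(result)            # prints: chunk0 chunk1 chunk2 chunk3 chunk4
```

Like "".join(), this defers producing the final string until all the pieces are known, which is the same idea the lazy-concatenation patch builds into `+` itself.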
msg84756 - (view) Author: Raymond Hettinger (rhettinger) * (Python committer) Date: 2009-03-31 07:26
IIRC, this has already been rejected on python-dev in a number of discussions (search the archives for "ropes"). Also, Armin long ago implemented some optimizations for string concatenation in a number of contexts.
msg84789 - (view) Author: Larry Hastings (larry) * (Python committer) Date: 2009-03-31 14:38
rhettinger: It's a bit unfair to paint the lazy string concatenation patch with the adjective "ropes", then point out ropes have been rejected many times. Lazy string concatenation objects are a form of specialized rope, but they don't share the downsides of those other "ropes" proposals.

The major problems with conventional rope implementations are a) slowdown, b) complexity, and c) you must use a new API to interact with them: http://mail.python.org/pipermail/python-dev/2000-February/002321.html

Lazy string concatenation makes Python faster, it isolates its complexity locally inside the string object implementation, and it makes only two changes to the API. Those two changes are: one, you may no longer access the string directly, and two, APIs that returned the internal string (PyString_AsString, PyString_AS_STRING*) may fail in low-memory conditions. You don't need to use a new API to interact with the string; traditional APIs like strchr work fine.

* Those were the names in 2.6, anyway. I don't know what the modern names would be in 3.1.
msg84834 - (view) Author: Raymond Hettinger (rhettinger) * (Python committer) Date: 2009-03-31 16:49
I think lazy evaluation was discussed in the same thread. Either I or someone else suggested it and there was some issue because the string struct had long been exposed and people were accessing it directly.
msg84842 - (view) Author: Larry Hastings (larry) * (Python committer) Date: 2009-03-31 17:15
Thanks for pointing that out! I think I found it; the discussion is actually about lazy string slices, and it starts here: http://mail.python.org/pipermail/python-dev/2000-February/002317.html

If I were to resubmit lazy string slices (which wouldn't be in this patch), the slices would only have weakrefs on the original string and render when the original string was destroyed. That addresses some of the concerns about lazy string slices, though they are still uncomfortably (internally) complex.

As for the second concern: the official way is to use the accessors, and CPython consistently uses them itself. I don't know what the right answer is. We *could* make PyStringObject a private type to force the issue, though that would add overhead. (Well, PyUnicodeObject really; this would never be accepted in the 2.x series at this point.)
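The lazy-slice design Larry describes -- a slice that holds no copy of its data until the base string is about to go away -- can be approximated in pure Python. This is a hypothetical toy, not the proposed C implementation: the real proposal would hook the string's destructor directly, whereas here a `str` subclass notifies its views from `__del__`, and the view keeps only a weak reference to the base (plain `str` objects cannot be weak-referenced, hence the subclass).

```python
import weakref


class TrackedStr(str):
    """str subclass that tells its registered slice views to snapshot
    their data just before it is destroyed (a pure-Python stand-in
    for the C-level destructor hook described above)."""

    def __new__(cls, value):
        obj = super().__new__(cls, value)
        obj._views = []
        return obj

    def __del__(self):
        for view in self._views:
            view.detach(str(self))  # self is still valid inside __del__


class LazySlice:
    """Stores only (weak base ref, start, stop); the slice's
    characters are copied out lazily, or when the base dies."""

    def __init__(self, base, start, stop):
        self._base = weakref.ref(base)  # no strong reference to base
        self.start, self.stop = start, stop
        self.data = None
        base._views.append(self)

    def detach(self, snapshot):
        # Called from TrackedStr.__del__: last chance to copy.
        if self.data is None:
            self.data = snapshot[self.start:self.stop]

    def value(self):
        if self.data is None:
            self.data = self._base()[self.start:self.stop]
        return self.data


base = TrackedStr("hello world")
view = LazySlice(base, 6, 11)
del base            # destructor fires; the view snapshots its slice
print(view.value())  # prints: world
```

This relies on CPython's reference counting to run `__del__` promptly, which is also why the C-level version of the idea is less fragile than this sketch.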
msg176356 - (view) Author: Ramchandra Apte (Ramchandra Apte) * Date: 2012-11-25 14:21
Buuump.
msg176357 - (view) Author: Benjamin Peterson (benjamin.peterson) * (Python committer) Date: 2012-11-25 14:26
This issue hasn't been touched in years and would certainly need a completely new patch.
History
Date User Action Args
2022-04-11 14:56:20 admin set github: 44063
2012-11-25 14:26:00 benjamin.peterson set status: open -> closed; resolution: rejected; messages: +
2012-11-25 14:21:15 Ramchandra Apte set nosy: + Ramchandra Apte; messages: +
2010-08-11 20:18:44 eric.araujo link issue1590352 superseder
2009-03-31 17:15:26 larry set messages: +
2009-03-31 16:49:25 rhettinger set messages: +
2009-03-31 14:38:28 larry set messages: +
2009-03-31 07:26:11 rhettinger set nosy: + rhettinger; messages: +
2009-03-30 12:23:50 pitrou set nosy: + pitrou; messages: +
2009-03-30 12:18:28 benjamin.peterson set status: closed -> open; resolution: rejected -> (no value); messages: +
2009-03-30 04:56:54 larry set messages: +
2009-03-30 02:43:55 benjamin.peterson set status: open -> closed; nosy: + benjamin.peterson; messages: +; resolution: rejected
2009-03-30 02:34:26 ajaksu2 set priority: normal -> low; versions: - Python 2.6; nosy: + ajaksu2; messages: +; type: performance
2006-10-02 04:04:17 larry create