Issue 1613130: str.split creates new string even if pattern not found (original) (raw)

Issue1613130

Created on 2006-12-11 13:03 by pitrou, last changed 2022-04-11 14:56 by admin. This issue is now closed.

Messages (6)
msg30782 - (view)	Author: Antoine Pitrou (pitrou) *	Date: 2006-12-11 13:03
Hello, Several string methods avoid allocating a new string when the operation result is trivially the same as one of the parameters (e.g. replacing a non-existing substring). However, split() does not exhibit this optimization, it always constructs a new string even if no splitting occurs: $ python Python 2.5 (r25:51908, Oct 6 2006, 15:22:41) [GCC 4.1.2 20060928 (prerelease) (Ubuntu 4.1.1-13ubuntu4)] on linux2 Type "help", "copyright", "credits" or "license" for more information. >>> s = "abcde" * 2 >>> id(s) 3084139400L >>> id(str(s)) 3084139400L >>> id("" + s) 3084139400L >>> id(s.strip()) 3084139400L >>> id(s.replace("g", "h")) 3084139400L >>> [id(x) for x in s.partition("h")] [3084139400L, 3084271768L, 3084271768L] >>> [id(x) for x in s.split("h")] [3084139360L]
msg30783 - (view)	Author: Antoine Pitrou (pitrou) *	Date: 2006-12-12 11:35
Ok, I did a patch which partially adds the optimization (the patch is at home, I can't post it right now). I have a few questions though: - there is a USE_FAST flag which can bring some speedups when a multicharacter separator is used; however, it is not enabled by default, is there a reason for this? - where and by whom is maintained stringbench.py, so that I can propose additional tests for it (namely, tests for unmatched split())? - split() implementation is duplicated between str and unicode (the unicode versions having less optimizations), would it be useful to "stringlib'ify" split()? - rsplit() does quite similar things as split(), has anyone tried to factor similar parts? do you see any caveats doing so?
msg30784 - (view)	Author: Georg Brandl (georg.brandl) *	Date: 2006-12-12 16:21
Sounds like this is best assigned to Fredrik.
msg30785 - (view)	Author: Raymond Hettinger (rhettinger) *	Date: 2007-04-11 17:09
Dropping the priority. This pay-off is near zero and likely not worth the cost of making the code more complex than it already is.
msg30786 - (view)	Author: Antoine Pitrou (pitrou) *	Date: 2007-04-12 09:19
Hi, > Dropping the priority. This pay-off is near zero and likely not worth the > cost of making the code more complex than it already is. No problem! The more interesting question actually was whether it made any sense to factor out the split() implementation in "stringlib" so as to share the implementation between str and unicode. Also, as for the USE_FAST question you asked on python-dev, I may have an answer: if you try to enable USE_FAST you'll see that some operations are indeed faster on large strings (say 100s or 1000s of characters), but they become slower on small strings because of the larger overhead of the search algorithm. Thus USE_FAST could negatively impact Python programs which process a lot of small strings.
msg102082 - (view)	Author: Florent Xicluna (flox) *	Date: 2010-04-01 11:26
Fixed with #7622, and USE_FAST flag removed.

History
Date	User	Action	Args
2022-04-11 14:56:21	admin	set	github: 44326
2010-04-01 11:26:43	flox	set	superseder: [patch] improve unicode methods: split() rsplit() and replace()
2010-04-01 11:26:14	flox	set	status: open -> closednosy: + floxmessages: + resolution: out of datestage: needs patch -> resolved
2009-03-14 01:45:12	pitrou	set	versions: + Python 3.1, Python 2.7, - Python 2.6, Python 2.5, Python 3.0type: enhancement -> performancestage: needs patch
2007-09-20 16:47:01	georg.brandl	set	type: enhancementversions: + Python 2.6, Python 3.0
2006-12-11 13:03:23	pitrou	create