Issue 1123: split(None, maxsplit) does not strip whitespace correctly

Created on 2007-09-07 01:18 by nirs, last changed 2022-04-11 14:56 by admin. This issue is now closed.

Messages (11)
msg55720 - (view) Author: Nir Soffer (nirs) * Date: 2007-09-07 01:18
The string object .split docs say (http://docs.python.org/lib/string-methods.html):

"If sep is not specified or is None, a different splitting algorithm is applied. First, whitespace characters (spaces, tabs, newlines, returns, and formfeeds) are stripped from both ends."

If the maxsplit argument is set and is smaller than the number of possible parts, trailing whitespace is not removed. Examples:

>>> 'k: v\n'.split(None, 1)
['k:', 'v\n']

Expected: ['k:', 'v']

>>> u'k: v\n'.split(None, 1)
[u'k:', u'v\n']

Expected: [u'k:', u'v']

With larger values of maxsplit, it works correctly:

>>> 'k: v\n'.split(None, 2)
['k:', 'v']
>>> u'k: v\n'.split(None, 2)
[u'k:', u'v']

This looks like an implementation bug, because it does not make sense for the stripping to depend on the maxsplit argument, and such behavior would be hard to explain.

Maybe the stripping should be removed in Python 3? It does not make sense to strip a string behind your back when you want to split it, and the caller can easily strip the string if needed.
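The caller-side strip mentioned at the end of this report can be sketched as below. This is only an illustration under the assumption that the caller wants the "Expected" results shown above; `split_stripped` is a hypothetical helper name, not something in the standard library.

```python
def split_stripped(text, maxsplit=-1):
    """Hypothetical helper: strip both ends first, then split on whitespace.

    Stripping up front means the last element never carries trailing
    whitespace, whatever value maxsplit has.
    """
    return text.strip().split(None, maxsplit)


# Plain split keeps the trailing newline when maxsplit cuts the scan short.
print('k: v\n'.split(None, 1))        # ['k:', 'v\n']
# Stripping first gives the result the report expected.
print(split_stripped('k: v\n', 1))    # ['k:', 'v']
```

Whitespace inside the remainder is still preserved, so `split_stripped('k: v w\n', 1)` keeps `'v w'` as a single element.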
msg55806 - (view) Author: Fredrik Lundh (effbot) * (Python committer) Date: 2007-09-10 22:13
Looks like a *documentation* bug to me; at the implementation level, None just means "no empty parts, treat runs of whitespace as separators".
msg55807 - (view) Author: Nir Soffer (nirs) * Date: 2007-09-10 22:32
I did not look into the source, but obviously there is stripping of leading and trailing whitespace. When you specify a separator you get:

>>> '  '.split(' ')
['', '', '']
>>> ' a b '.split(' ')
['', 'a', 'b', '']

So one would expect to get this without stripping:

>>> ' a b '.split()
['', 'a', 'b', '']

But you get this:

>>> ' a b '.split()
['a', 'b']

So the documentation is correct.
msg55809 - (view) Author: Fredrik Lundh (effbot) * (Python committer) Date: 2007-09-10 22:41
But wasn't your complaint that the implementation didn't match the documentation? As I said, the *implementation* treats "runs of whitespace" as separators, except for whitespace at the beginning or end (or in other words, it never returns empty strings). That matches the documentation, except for the "first" in "first, whitespace characters are stripped from both ends". As far as I can tell, the documentation has never matched the implementation here.
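The "runs of whitespace as separators, never empty strings" reading can be illustrated with a small comparison. This is a sketch, not the CPython implementation: `split_runs` is a hypothetical helper, it ignores maxsplit entirely, and it assumes the `re` module's `\s` class agrees with what `str.split()` treats as whitespace for the ASCII inputs used here.

```python
import re


def split_runs(text):
    """Hypothetical helper: split on runs of whitespace, then drop empty parts."""
    # re.split yields '' at either end when the text starts or ends with
    # whitespace; filtering those out mimics "no empty parts".
    return [part for part in re.split(r"\s+", text) if part]


print(re.split(r"\s+", " a b "))   # ['', 'a', 'b', '']
print(split_runs(" a b "))         # ['a', 'b']
print(" a b ".split())             # ['a', 'b']
```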
msg55819 - (view) Author: Nir Soffer (nirs) * Date: 2007-09-11 11:12
There is a problem only when maxsplit is smaller than the available splits. In other cases, the docs and the behavior match.
msg55962 - (view) Author: Sean Reifschneider (jafo) * (Python committer) Date: 2007-09-17 11:05
I believe this is just a place where the documentation could be cleared up. Seems to me the confusion is from the document saying (paraphrased): "white space is removed from both ends". Perhaps it should say something like "runs of 1 or more whitespace are collapsed (up to the maximum split), and then split on" or simply "split on runs of 1 or more whitespace. In other words, 3 spaces together would be treated as a single split-point instead of 3 0-length fields separated by spaces." So, in the first example provided by "nirs" in this issue, "both ends" refers to both the left and right side of "k:". Since maxsplit is 1, the second part (v) is left untouched. This is the intended operation. This is a documentation bug, not a library bug. Fred: Thoughts on wording?
msg56021 - (view) Author: Brett Cannon (brett.cannon) * (Python committer) Date: 2007-09-19 02:22
The algorithm is actually kind of odd::

>>> " a b".split(None, 0)
['a b']
>>> "a b ".split(None, 0)
['a b ']
>>> "a b ".split(None, 1)
['a', 'b ']

So trailing whitespace on the original string is stripped only if the number of splits is great enough to lead to a possible split past the last element. But leading whitespace is always removed.

Basically the algorithm stops looking for whitespace once it has encountered maxsplit instances of contiguous whitespace plus leading whitespace.
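The behavior described here can be approximated in pure Python. The following is a sketch written from that description, not a copy of the C source; `split_whitespace` is a hypothetical name, and using `str.isspace()` as the definition of whitespace is an assumption.

```python
def split_whitespace(text, maxsplit=-1):
    """Pure-Python sketch of text.split(None, maxsplit)."""
    parts = []
    i, n = 0, len(text)
    while True:
        # Skip a run of whitespace. On the first pass this is the leading
        # whitespace, which is therefore always dropped.
        while i < n and text[i].isspace():
            i += 1
        if i == n:
            break  # Nothing left, so no empty final element is produced.
        if maxsplit >= 0 and len(parts) == maxsplit:
            # Split budget used up: keep the remainder verbatim,
            # including any trailing whitespace.
            parts.append(text[i:])
            break
        # Collect one word, up to the next whitespace character.
        j = i
        while j < n and not text[j].isspace():
            j += 1
        parts.append(text[i:j])
        i = j
    return parts


print(split_whitespace(" a b", 0))    # ['a b']
print(split_whitespace("a b ", 1))    # ['a', 'b ']
print(split_whitespace("a b ", 2))    # ['a', 'b']
```

Trailing whitespace only disappears when the loop gets to scan past the last word, which happens only if maxsplit has not been reached by then.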
msg56024 - (view) Author: Sean Reifschneider (jafo) * (Python committer) Date: 2007-09-19 02:42
Looking at the current documentation (http://docs.python.org/dev/library/string.html#string.split), I don't see the wording the original poster mentions. The current documentation of the separator is clear and reasonable. Unless someone can suggest specific wording changes to the documentation, I'm going to call this closed.
msg56026 - (view) Author: Nir Soffer (nirs) * Date: 2007-09-19 04:12
I quoted the str.split docs:

- http://docs.python.org/lib/string-methods.html
- http://docs.python.org/dev/library/stdtypes.html
- http://docs.python.org/dev/3.0/library/stdtypes.html

Does the string.split doc explain this?

>>> ' a b '.split(None, 1)
['a', 'b ']
>>> ' a b '.split(None, 2)
['a', 'b']

The .split method docs are clearer and describe this in a very simple way. This is a better description of the current behavior:

"If sep is not specified or is None, a different splitting algorithm is applied. First, whitespace characters (spaces, tabs, newlines, returns, and formfeeds) are stripped from the start of the string. Then, words are separated by arbitrary length strings of whitespace characters. Consecutive whitespace delimiters are treated as a single delimiter ("' 1 \t 2 \n 3 '.split()" returns "['1', '2', '3']"). If maxsplit is nonzero, at most maxsplit number of splits occur, and the remainder of the string is returned as the final element of the list, unless it is empty. Splitting an empty string or a string consisting of just whitespace returns an empty list."
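Each claim in the proposed wording can be checked against the interpreter. The table below is a sketch of such a check; `CASES` and `failing_cases` are hypothetical names, and the expected values are the ones quoted in this thread plus a few edge cases that follow from the wording.

```python
# Each entry: (input, maxsplit, result expected from input.split(None, maxsplit)).
CASES = [
    (" 1 \t 2 \n 3 ", -1, ["1", "2", "3"]),  # runs of whitespace act as one delimiter
    (" a b ", 0, ["a b "]),                  # leading whitespace dropped, rest untouched
    (" a b ", 1, ["a", "b "]),               # remainder keeps its trailing whitespace
    (" a b ", 2, ["a", "b"]),                # enough splits to scan past the last word
    ("a ", 1, ["a"]),                        # empty remainder is not returned
    ("", -1, []),                            # empty string gives an empty list
    ("   ", -1, []),                         # whitespace-only string gives an empty list
]


def failing_cases(cases=CASES):
    """Return the (input, maxsplit) pairs whose actual result differs."""
    return [(text, maxsplit)
            for text, maxsplit, expected in cases
            if text.split(None, maxsplit) != expected]


print(failing_cases())   # []
```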
msg56257 - (view) Author: Brett Cannon (brett.cannon) * (Python committer) Date: 2007-10-07 20:05
Re-opening as jafo was referring to the string module's function implementation which is deprecated. The real issue is that the built-in types docs are bad.
msg56272 - (view) Author: Georg Brandl (georg.brandl) * (Python committer) Date: 2007-10-08 07:50
This should now be fixed in r58368.
History
Date User Action Args
2022-04-11 14:56:26 admin set github: 45464
2007-10-08 07:50:36 georg.brandl set status: open -> closed; nosy: + georg.brandl; resolution: fixed; messages: +
2007-10-07 20:05:53 brett.cannon link issue1240 superseder
2007-10-07 20:05:30 brett.cannon set status: closed -> open; assignee: fdrake ->; messages: +; resolution: not a bug -> (no value); versions: + Python 2.6
2007-09-19 04:12:51 nirs set messages: +
2007-09-19 02:42:36 jafo set status: open -> closed; resolution: not a bug; messages: +
2007-09-19 02:22:39 brett.cannon set nosy: + brett.cannon; messages: +
2007-09-17 11:05:20 jafo set priority: low; assignee: fdrake; messages: +; components: + Documentation, - Library (Lib); nosy: + fdrake, jafo
2007-09-11 11:12:44 nirs set messages: +
2007-09-10 22:41:05 effbot set messages: +
2007-09-10 22:32:47 nirs set messages: +
2007-09-10 22:13:35 effbot set nosy: + effbot; messages: +
2007-09-07 17:31:54 gvanrossum set messages: -
2007-09-07 17:31:49 gvanrossum set messages: -
2007-09-07 02:04:13 nirs set type: behavior; messages: +
2007-09-07 01:19:23 nirs set messages: +; title: split(None, maxplit) does not strip whitespace correctly -> split(None, maxsplit) does not strip whitespace correctly
2007-09-07 01:18:40 nirs create