Issue 1105286: Undocumented implicit strip() in split(None) string method


Created on 2005-01-19 15:04 by yohell, last changed 2022-04-11 14:56 by admin. This issue is now closed.

Messages (12)
msg23981 - Author: YoHell (yohell) Date: 2005-01-19 15:04
Hi!

I noticed that the string method split() first does an implicit strip() before splitting when it's used with no arguments or with None as the separator (sep in the docs). There is no mention of this implicit strip() in the docs.

Example 1:

s = " word1 word2 "
s.split()

then returns ['word1', 'word2'] and not ['', 'word1', 'word2', ''] as one might expect.

WHY IS THIS BAD?

1. Because it's undocumented. See: http://www.python.org/doc/current/lib/string-methods.html#l2h-197
2. Because it may lead to unexpected behavior in programs.

Example 2: FASTA sequence headers are one-line descriptors of biological sequences and have this form: ">" + Identifier + whitespace + free text description. Let sHeader be a Python string containing a FASTA header. One could then use the following syntax to extract the identifier from the header:

sID = sHeader[1:].split(None, 1)[0]

However, this does not work if sHeader contains a faulty FASTA header where the identifier is missing or consists of whitespace. In that case sID will contain the first word of the free text description, which is not the desired behavior.

WHAT SHOULD BE DONE?

The implicit strip() should be removed, or programmers should at least be given the option to turn it off. At the very least it should be documented so that programmers have a chance of adapting their code to it.

Thank you for an otherwise splendid language!

/Joel Hedlund
Ph.D. Student
IFM Bioinformatics
Linköping University
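Editorial illustration of the FASTA case described above: a minimal sketch, assuming Python 3. The helper name fasta_identifier and the sample headers are made up for this example and do not come from the report. It shows the one-liner picking up the first word of the description, and one possible way to detect a missing identifier by checking for leading whitespace before splitting.

```python
def fasta_identifier(header):
    """Return the identifier of a FASTA header line, or '' if it is missing.

    Hypothetical helper for illustration; not part of any library.
    """
    if not header.startswith(">"):
        raise ValueError("not a FASTA header: %r" % header)
    body = header[1:]
    # split(None, 1) skips leading whitespace, so a missing identifier
    # has to be detected before splitting.
    if not body or body[0].isspace():
        return ""
    return body.split(None, 1)[0]


good = ">sp|P12345 some protein"   # made-up header with an identifier
bad = "> free text only"           # made-up header without an identifier

print(good[1:].split(None, 1)[0])   # sp|P12345
print(bad[1:].split(None, 1)[0])    # free  (first word of the description)
print(repr(fasta_identifier(bad)))  # ''
```

Whether an empty string, None, or an exception is the right signal for a faulty header depends on the calling code; the empty string is only one choice.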
msg23982 - Author: Tim Peters (tim.peters) * (Python committer) Date: 2005-01-19 16:56
I think the docs for split() under "String Methods" are quite clear:

"""
...
If sep is not specified or is None, a different splitting algorithm is applied. Words are separated by arbitrary length strings of whitespace characters (spaces, tabs, newlines, returns, and formfeeds). Consecutive whitespace delimiters are treated as a single delimiter ("'1 2 3'.split()" returns "['1', '2', '3']"). Splitting an empty string returns "['']".
"""

This won't change, because mountains of code rely on this behavior -- it's probably the single most common use case for .split().
msg23983 - Author: YoHell (yohell) Date: 2005-01-20 10:15
In RE to tim_one:

> I think the docs for split() under "String Methods" are quite
> clear:

On the contrary, my friend, and here's why:

> """
> ...
> If sep is not specified or is None, a different splitting
> algorithm is applied.

This sentence does not say that whitespace will be implicitly stripped from the edges of the string.

> Words are separated by arbitrary length strings of whitespace
> characters (spaces, tabs, newlines, returns, and formfeeds).

Neither does this one.

> Consecutive whitespace delimiters are treated as a single delimiter
> ("'1 2 3'.split()" returns "['1', '2', '3']").

And not that one.

> Splitting an empty string returns "['']".
> """

And that last one does not mention it either. In fact, there is no mention in the docs of how separators on the edges of strings are treated by the split method. Furthermore, there is no mention that s.split(sep) treats them differently when sep is None than it does otherwise.

Example:

>>> ",2,".split(',')
['', '2', '']
>>> " 2 ".split()
['2']

This inconsistent behavior is not in line with how beautifully thought out the Python language is otherwise, and how brilliantly everything else is documented on the http://python.org/doc/ documentation pages.

> This won't change, because mountains of code rely on this
> behavior -- it's probably the single most common use case
> for .split().

I thought as much. However, it would be really easy for an admin to add a line of documentation to .split() to explain this. That would certainly help make me a happier man, and hopefully others too.

Cheers guys!
/Joel
msg23984 - Author: Raymond Hettinger (rhettinger) * (Python committer) Date: 2005-01-20 14:04
What new wording do you propose to be added?
msg23985 - Author: Jim Jewett (jimjjewett) Date: 2005-01-20 14:28
Replacing the quoted line:

"""
...
If sep is not specified or is None, a different splitting algorithm is applied. First, whitespace (spaces, tabs, newlines, returns, and formfeeds) is stripped from both ends. Then words are separated by arbitrary length strings of whitespace characters. Consecutive whitespace delimiters are treated as a single delimiter ("'1 2 3'.split()" returns "['1', '2', '3']"). Splitting an empty (or whitespace-only) string returns "['']".
"""
msg23986 - Author: Raymond Hettinger (rhettinger) * (Python committer) Date: 2005-01-20 14:50
The proposed wording is fine. If there are no objections or concerns, I'll apply it soon.
msg23987 - Author: YoHell (yohell) Date: 2005-01-20 14:59
Brilliant, guys! Thanks again for a superb scripting language, and with documentation to match!

Take care!
/Joel Hedlund
msg23988 - Author: Terry J. Reedy (terry.reedy) * (Python committer) Date: 2005-01-24 07:15
To me, the removal of whitespace at the ends (stripping) is consistent with the removal (or collapsing) of extra whitespace in between, so that .split() does not return empty words anywhere. Consider:

>>> ',1,,2,'.split(',')
['', '1', '', '2', '']

If ' 1  2 '.split() were to return null strings at the beginning and end of the list, then to be consistent, it should also put one in the middle. One can get this by being explicit (mixed WS can be handled by translation):

>>> ' 1  2 '.split(' ')
['', '1', '', '2', '']

Having said this, I also agree that the extra words proposed by jj are helpful.

BUG?? In 2.2, splitting an empty or whitespace-only string produces an empty list [], not a list with a null word [''].

>>> ''.split()
[]
>>> ' '.split()
[]

which is what I see as consistent with the rest of the no-null-word behavior. Has this changed since? (Yes, must upgrade.) I could find no indication of such a change in either the tracker or CVS.
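Editorial illustration of the "mixed WS can be handled by translation" remark above: a minimal sketch, assuming Python 3 and ASCII whitespace only. The names WS_TO_SPACE and split_keep_empty are hypothetical. Every whitespace character is first mapped to a plain space, and the result is then split on an explicit ' ' so that empty words are kept.

```python
# Map each ASCII whitespace character to a plain space.
# Assumption: only tab, newline, carriage return, form feed and
# vertical tab need handling; other Unicode whitespace is not covered.
WS_TO_SPACE = str.maketrans("\t\n\r\f\v", "     ")


def split_keep_empty(text):
    """Split on single whitespace characters, keeping empty words."""
    return text.translate(WS_TO_SPACE).split(" ")


sample = " 1\t\t2\n"
print(split_keep_empty(sample))  # ['', '1', '', '2', '']
print(sample.split())            # ['1', '2']
print("".split(), " ".split())   # [] []
```

The last line matches the empty-list result reported in the message for empty and whitespace-only strings.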
msg23989 - Author: Bastian Kleineidam (calvin) Date: 2005-01-24 12:51
This should probably also be added to rsplit()?
msg23990 - Author: YoHell (yohell) Date: 2006-11-07 14:06
I'm opening this again, since the docs still don't reflect the behavior of the method.

from the docs:

"""
If sep is not specified or is None, a different splitting algorithm is applied. First, whitespace characters (spaces, tabs, newlines, returns, and formfeeds) are stripped from both ends.
"""

This is not true when maxsplit is given. Example:

>>> " foo bar ".split(None)
['foo', 'bar']
>>> " foo bar ".split(None, 1)
['foo', 'bar ']

Whitespace is obviously not stripping whitespace from the ends of the string before splitting the rest of the string.
msg23991 - Author: YoHell (yohell) Date: 2006-11-07 14:11
*resubmission: grammar corrected*

I'm opening this again, since the docs still don't reflect the behavior of the method.

from the docs:

"""
If sep is not specified or is None, a different splitting algorithm is applied. First, whitespace characters (spaces, tabs, newlines, returns, and formfeeds) are stripped from both ends.
"""

This is not true when maxsplit is given. Example:

>>> " foo bar ".split(None)
['foo', 'bar']
>>> " foo bar ".split(None, 1)
['foo', 'bar ']

Whitespace is obviously not stripped from the ends before the rest of the string is split.
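Editorial illustration of the maxsplit case reported above: a minimal sketch, assuming Python 3. The helper name first_word_and_rest is hypothetical. It shows that with maxsplit the whitespace on the unsplit side is preserved, and that an explicit strip() before splitting is one way to get stripped pieces.

```python
def first_word_and_rest(text):
    """Hypothetical helper: strip both ends, then split off the first word."""
    return text.strip().split(None, 1)


s = "  foo   bar  "
print(s.split(None))           # ['foo', 'bar']
print(s.split(None, 1))        # ['foo', 'bar  ']   trailing whitespace kept
print(s.rsplit(None, 1))       # ['  foo', 'bar']   leading whitespace kept
print(first_word_and_rest(s))  # ['foo', 'bar']
```

Note that strip() here only cleans the ends; any run of whitespace inside the remainder is left as it is.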
msg23992 - Author: Raymond Hettinger (rhettinger) * (Python committer) Date: 2007-01-06 02:16
I think the current wording is clear enough and that further attempts to specify corner cases will only make the docs harder to understand and less useful.
History
Date                 User    Action  Args
2022-04-11 14:56:09  admin   set     github: 41462
2005-01-19 15:04:27  yohell  create