Issue 3262: re.split doesn't split with zero-width regex

Created on 2008-07-02 22:07 by mrabarnett, last changed 2022-04-11 14:56 by admin. This issue is now closed.

Files
File name Uploaded Description Edit
split_zero_width.diff mrabarnett,2008-07-03 00:59
Pull Requests
URL Status Linked Edit
PR 4471 merged serhiy.storchaka,2017-11-19 23:36
PR 4678 closed serhiy.storchaka,2017-12-02 17:32
Messages (15)
msg69134 - (view) Author: Matthew Barnett (mrabarnett) * (Python triager) Date: 2008-07-02 22:07
re.split doesn't split a string when the regex matches zero characters. For example: re.split(r'\b', 'a b') returns ['a b'] instead of ['', 'a', ' ', 'b', '']. re.split(r'(?<!\w)(?=\w)', 'a b') returns ['a b'] instead of ['', 'a ', 'b'].
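The expected results quoted above can be reproduced without relying on re.split, by walking the matches with re.finditer. This is a minimal sketch only: split_zero_width is a hypothetical helper name (not part of the re module, and not the attached patch), and it ignores capturing groups.

```python
import re

def split_zero_width(pattern, text):
    """Split text at every match of pattern, including empty matches.

    Hypothetical helper for illustration; capturing groups are not
    added to the result the way re.split would add them.
    """
    pieces = []
    last = 0
    for match in re.finditer(pattern, text):
        # Keep the text between the previous match and this one.
        pieces.append(text[last:match.start()])
        last = match.end()
    # Keep whatever follows the final match.
    pieces.append(text[last:])
    return pieces

print(split_zero_width(r'\b', 'a b'))             # ['', 'a', ' ', 'b', '']
print(split_zero_width(r'(?<!\w)(?=\w)', 'a b'))  # ['', 'a ', 'b']
```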
msg69139 - (view) Author: Matthew Barnett (mrabarnett) * (Python triager) Date: 2008-07-02 22:51
The attached patch appears to work.
msg69146 - (view) Author: Guido van Rossum (gvanrossum) * (Python committer) Date: 2008-07-02 23:28
Probably by design. There's probably even a unittest for this behavior.
msg69150 - (view) Author: Matthew Barnett (mrabarnett) * (Python triager) Date: 2008-07-02 23:57
I've found that this issue has been discussed before: #988761.
msg69157 - (view) Author: Matthew Barnett (mrabarnett) * (Python triager) Date: 2008-07-03 00:59
New patch version after studying #988761 and doing more testing.
msg69408 - (view) Author: Mike Coleman (mkc) Date: 2008-07-08 02:36
I don't want to discourage you, but #852532, which is essentially the same bug report, was closed--without explanation--as 'wont fix' in April, after four-plus years. I wish you good luck--this is an important and irritating bug, in my opinion...
msg69438 - (view) Author: Matthew Barnett (mrabarnett) * (Python triager) Date: 2008-07-08 16:39
There appear to be 2 opinions on this issue:
1. It's a bug, a corner case that got missed.
2. It's always been like this, so it's probably a design decision, although no-one can point to where or when the decision was made...
Looking at the code, I think it's a bug. Expected behaviour: if 'pattern' is a non-capturing regex, then re.split(pattern, text) == re.sub(pattern, MARKER, text).split(MARKER).
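The equivalence stated above can be used as a reference implementation for non-capturing patterns. This is only a sketch: split_via_sub and the choice of a NUL character as MARKER are assumptions made for illustration, and the marker must not occur in the text being split.

```python
import re

MARKER = "\x00"  # assumed never to appear in the input text

def split_via_sub(pattern, text, marker=MARKER):
    """Split by replacing every match with a marker, then splitting on it.

    Hypothetical helper; only meaningful for patterns without
    capturing groups, as in the expected behaviour described above.
    """
    if marker in text:
        raise ValueError("marker already occurs in text")
    return re.sub(pattern, marker, text).split(marker)

print(split_via_sub(r'\b', 'a b'))             # ['', 'a', ' ', 'b', '']
print(split_via_sub(r'(?<!\w)(?=\w)', 'a b'))  # ['', 'a ', 'b']
```

For patterns that always consume at least one character, the result agrees with re.split, so the zero-width case is the only place where the two can diverge.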
msg69852 - (view) Author: Mike Coleman (mkc) Date: 2008-07-16 22:40
I think it's probably both. The original design was incorrect, though this probably wasn't apparent to the designer. But as a significant user of 're', it really stands out as a problem.
msg70749 - (view) Author: Guido van Rossum (gvanrossum) * (Python committer) Date: 2008-08-05 16:08
I think it's better to leave this alone. Such a subtle change is likely to trip over more people in worse ways than the alleged "bug".
msg70752 - (view) Author: Mike Coleman (mkc) Date: 2008-08-05 16:18
Okay. For what it's worth, note that my original 2004 patch for this (#988761) is completely backward compatible (a flag must be set in the call to get the new behavior).
msg73523 - (view) Author: Matthew Barnett (mrabarnett) * (Python triager) Date: 2008-09-21 19:41
I wonder whether it could be put into Python 3 where certain breaks in backwards compatibility are to be expected.
msg73567 - (view) Author: Jeffrey C. Jacobs (timehorse) Date: 2008-09-22 11:54
I think Mike Coleman's proposal of enabling this behaviour via a flag is probably best and IMHO we should consider it under these circumstances. Intuitively, I think your interpretation of what re.split should do under zero-width conditions is logical, and I almost think this should be a 2-minor number transition à la from __future__ import zeroWidthRegexpSplit if we are to consider it as the long-term 'right thing to do'. 3000 (3.0) seems a good place to also consider it for true overhaul / reexamination, especially as we are writing 'upgrade' scripts for many of the other Python features. However, I would say this: Guido has spoken and it may be too late for the pebbles to vote. I would like to add this patch as a new item to the general Regexp Enhancements thread of issue 2636 though, as I think it is an idea worth considering when overhauling Regexp.
msg73592 - (view) Author: Guido van Rossum (gvanrossum) * (Python committer) Date: 2008-09-22 20:39
The problem with doing this per 3.0 is that it's impossible to write a conversion script. I'm okay with adding a flag to enable this behavior though. Please open a new bug with a new patch, preferably one that applies cleanly to the trunk, and a separate patch for the py3k branch unless the trunk patch merges cleanly. There should also be unittests and documentation. The patches should be marked for Python 2.7 and 3.1 -- it's way too late to get this into 2.6 and 3.0.
msg104226 - (view) Author: Tim Pietzcker (pietzcker) Date: 2010-04-26 12:29
Sorry to revive this dormant (?) topic - has anybody brought this any further? This "feature" has tripped me up a few times, and I would be all for adding a flag to enable the "split on zero-size matches" behavior, but I myself am not competent enough to code a patch.
msg104257 - (view) Author: Matthew Barnett (mrabarnett) * (Python triager) Date: 2010-04-26 17:31
You could try the regex module mentioned in issue 2636.
History
Date User Action Args
2022-04-11 14:56:36 admin set github: 47512
2021-11-04 14:19:04 eryksun set nosy: - ahmedsayeed1982
2021-11-04 14:18:56 eryksun set messages: -
2021-11-04 12:09:24 ahmedsayeed1982 set versions: - Python 2.6, Python 2.5, Python 3.1; nosy: + ahmedsayeed1982, - gvanrossum, mkc, timehorse, filip, pietzcker, mrabarnett; messages: +; components: + Tests, - Regular Expressions
2017-12-02 17:32:37 serhiy.storchaka set pull_requests: + <pull_request4589>
2017-11-19 23:36:58 serhiy.storchaka set pull_requests: + <pull_request4406>
2010-08-04 05:05:56 terry.reedy set status: open -> closed
2010-04-26 17:31:46 mrabarnett set messages: +
2010-04-26 12:29:45 pietzcker set nosy: + pietzcker; messages: +; versions: + Python 2.6, Python 3.1, Python 2.7
2008-09-22 20:40:00 gvanrossum set messages: +
2008-09-22 11:54:30 timehorse set messages: +
2008-09-21 19:41:19 mrabarnett set messages: +
2008-09-21 11:58:49 timehorse set nosy: + timehorse
2008-08-05 16:18:46 mkc set messages: +
2008-08-05 16:08:32 gvanrossum set resolution: rejected; messages: +
2008-07-16 22:40:59 mkc set messages: +
2008-07-08 16:39:18 mrabarnett set messages: +
2008-07-08 02:36:23 mkc set messages: +
2008-07-08 02:20:49 mkc set nosy: + mkc
2008-07-07 11:40:01 filip set nosy: + filip
2008-07-03 00:59:38 mrabarnett set files: - split_zero_width.diff
2008-07-03 00:59:01 mrabarnett set files: + split_zero_width.diff; messages: +
2008-07-02 23:57:16 mrabarnett set messages: +
2008-07-02 23:28:53 gvanrossum set nosy: + gvanrossum; messages: +
2008-07-02 22:51:51 mrabarnett set files: + split_zero_width.diff; keywords: + patch; messages: +
2008-07-02 22:07:48 mrabarnett create