msg69134 - (view) |
Author: Matthew Barnett (mrabarnett) *  |
Date: 2008-07-02 22:07 |
re.split doesn't split a string when the regex matches zero characters. For example: re.split(r'\b', 'a b') returns ['a b'] instead of ['', 'a', ' ', 'b', '']. re.split(r'(?<!\w)(?=\w)', 'a b') returns ['a b'] instead of ['', 'a ', 'b']. |
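To make the requested results concrete, here is a minimal sketch that builds the split by hand from re.finditer, so it does not depend on how re.split itself treats zero-width matches (as far as I know, that differs between interpreter versions). The helper name split_including_empty is hypothetical and is not part of the re module or of the attached patch. It ignores capturing groups and is only meant for purely zero-width or purely non-empty patterns.

```python
import re


def split_including_empty(pattern, text):
    """Hypothetical helper: split `text` at every match of `pattern`,
    including matches that consume zero characters.

    Capturing groups are not echoed into the result, unlike re.split.
    """
    pieces = []
    last = 0  # end position of the previous match
    for match in re.finditer(pattern, text):
        # Keep everything between the previous match and this one.
        pieces.append(text[last:match.start()])
        last = match.end()
    # Keep whatever follows the final match.
    pieces.append(text[last:])
    return pieces


print(split_including_empty(r'\b', 'a b'))
print(split_including_empty(r'(?<!\w)(?=\w)', 'a b'))
```

For the two patterns from the report, this produces the lists that the report says re.split should return.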
|
|
msg69139 - (view) |
Author: Matthew Barnett (mrabarnett) *  |
Date: 2008-07-02 22:51 |
The attached patch appears to work. |
|
|
msg69146 - (view) |
Author: Guido van Rossum (gvanrossum) *  |
Date: 2008-07-02 23:28 |
Probably by design. There's probably even a unittest for this behavior. |
|
|
msg69150 - (view) |
Author: Matthew Barnett (mrabarnett) *  |
Date: 2008-07-02 23:57 |
I've found that this issue has been discussed before: #988761. |
|
|
msg69157 - (view) |
Author: Matthew Barnett (mrabarnett) *  |
Date: 2008-07-03 00:59 |
New patch version after studying #988761 and doing more testing. |
|
|
msg69408 - (view) |
Author: Mike Coleman (mkc) |
Date: 2008-07-08 02:36 |
I don't want to discourage you, but #852532, which is essentially the same bug report, was closed--without explanation--as 'wont fix' in April, after four-plus years. I wish you good luck--this is an important and irritating bug, in my opinion... |
|
|
msg69438 - (view) |
Author: Matthew Barnett (mrabarnett) *  |
Date: 2008-07-08 16:39 |
There appear to be 2 opinions on this issue: 1. It's a bug, a corner case that got missed. 2. It's always been like this, so it's probably a design decision, although no-one can point to where or when the decision was made... Looking at the code, I think it's a bug. Expected behaviour: if 'pattern' is a non-capturing regex, then re.split(pattern, text) == re.sub(pattern, MARKER, text).split(MARKER). |
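The right-hand side of that identity can be written out as a short sketch. This is only an illustration under stated assumptions: the pattern has no capturing groups, and MARKER is a string that never occurs in the text (a NUL character is assumed here). The function name split_via_sub is hypothetical.

```python
import re

MARKER = "\x00"  # assumption: this character never appears in the text


def split_via_sub(pattern, text, marker=MARKER):
    """Hypothetical reference: substitute every match with a marker,
    then split on that marker."""
    if marker in text:
        raise ValueError("marker must not occur in the text")
    # A callable replacement avoids backslash processing of the marker.
    marked = re.sub(pattern, lambda match: marker, text)
    return marked.split(marker)


print(split_via_sub(r'\b', 'a b'))
print(split_via_sub(r',\s*', 'a, b,c'))
```

Comparing this against re.split(pattern, text) on a given interpreter shows whether the identity holds there for a particular pattern.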
|
|
msg69852 - (view) |
Author: Mike Coleman (mkc) |
Date: 2008-07-16 22:40 |
I think it's probably both. The original design was incorrect, though this probably wasn't apparent to the designer. But to me, as a significant user of 're', it really stands out as a problem. |
|
|
msg70749 - (view) |
Author: Guido van Rossum (gvanrossum) *  |
Date: 2008-08-05 16:08 |
I think it's better to leave this alone. Such a subtle change is likely to trip over more people in worse ways than the alleged "bug". |
|
|
msg70752 - (view) |
Author: Mike Coleman (mkc) |
Date: 2008-08-05 16:18 |
Okay. For what it's worth, note that my original 2004 patch for this (#988761) is completely backward compatible (a flag must be set in the call to get the new behavior). |
|
|
msg73523 - (view) |
Author: Matthew Barnett (mrabarnett) *  |
Date: 2008-09-21 19:41 |
I wonder whether it could be put into Python 3 where certain breaks in backwards compatibility are to be expected. |
|
|
msg73567 - (view) |
Author: Jeffrey C. Jacobs (timehorse) |
Date: 2008-09-22 11:54 |
I think Mike Coleman's proposal of enabling this behaviour via a flag is probably best, and IMHO we should consider it under these circumstances. Intuitively, I think your interpretation of what re.split should do under zero-width conditions is logical, and I almost think this should be a 2-minor number transition à la from __future__ import zeroWidthRegexpSplit if we are to consider it as the long-term 'right thing to do'. 3000 (3.0) seems a good place to also consider it for true overhaul / reexamination, especially as we are writing 'upgrade' scripts for many of the other Python features. However, I would say this: Guido has spoken, and it may be too late for the pebbles to vote. I would like to add this patch as a new item to the general Regexp Enhancements thread of issue 2636 though, as I think it is an idea worth considering when overhauling Regexp. |
|
|
msg73592 - (view) |
Author: Guido van Rossum (gvanrossum) *  |
Date: 2008-09-22 20:39 |
The problem with doing this per 3.0 is that it's impossible to write a conversion script. I'm okay with adding a flag to enable this behavior though. Please open a new bug with a new patch, preferably one that applies cleanly to the trunk, and a separate patch for the py3k branch unless the trunk patch merges cleanly. There should also be unittests and documentation. The patches should be marked for Python 2.7 and 3.1 -- it's way too late to get this into 2.6 and 3.0. |
|
|
msg104226 - (view) |
Author: Tim Pietzcker (pietzcker) |
Date: 2010-04-26 12:29 |
Sorry to revive this dormant (?) topic - has anybody brought this any further? This "feature" has tripped me up a few times, and I would be all for adding a flag to enable the "split on zero-size matches" behavior, but I myself am not competent enough to code a patch. |
|
|
msg104257 - (view) |
Author: Matthew Barnett (mrabarnett) *  |
Date: 2010-04-26 17:31 |
You could try the regex module mentioned in issue 2636. |
|
|