[Python-Dev] Re: re.split on empty patterns

Mike Coleman mkc at mathdogs.com
Sun Aug 8 00:36:17 CEST 2004


"A.M. Kuchling" <amk at amk.ca> writes:

>>> re.split('x*', 'abxxxcdefxxx')
['ab', 'cdef', '']
>>> re.split('x*', 'abxxxcdefxxx', emptyok=True)
['', 'a', 'b', '', 'c', 'd', 'e', 'f', '', '']

(I think the result of the second match points up a bug in the patch; the empty strings in the middle seem wrong to me. Assume that gets fixed.)

I believe it's correct in the sense that it matches the design I had in mind. Of course, this could be the wrong design choice.

I think (some of) the alternatives can be stated this way:

  1. Empty matches are not considered at all for splitting. (present behavior)

  2. Empty matches are only considered when they are not adjacent to non-empty matches. (what you have in mind, I think)

  3. Empty matches are always considered. (the patch behavior)

(It's understood that matches that fall within other matches are not considered. For empty matches, an adjacent match doesn't count as "within".)

I would say that alternatives 2 and 3 are better than 1 because they retain more information, and I would argue that alternative 3 is better than alternative 2 for the same reason. I don't have a killer example, but here is a somewhat abstract one that shows the difference between 2 and 3:

# alternative 2:
re.split(r'xxx|(?=abc)', 'zzxxxabczz', emptyok=True) --> ['zz', 'abczz']
re.split(r'xxx|(?=abc)', 'zzxxxbbczz', emptyok=True) --> ['zz', 'bbczz']

# alternative 3:
re.split(r'xxx|(?=abc)', 'zzxxxabczz', emptyok=True) --> ['zz', '', 'abczz']
re.split(r'xxx|(?=abc)', 'zzxxxbbczz', emptyok=True) --> ['zz', 'bbczz']

In English, the third alternative allows you to notice that there were two adjacent matches, even if the second match was empty. With the second alternative, this is missed.

Of course, because of the match algorithm, we cannot notice an empty match immediately followed by a non-empty match, so there's kind of an asymmetry here. With alternative 2 that asymmetry wouldn't be present, since we'd fail to notice empty matches on either side.

Alternative 2 does have the advantage of matching the expectations of a naive user a little better. Alternative 3 is more powerful, but perhaps a little less obvious.
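To make the three alternatives concrete, here is a minimal sketch that emulates each of them in pure Python. The function name split_by_policy and its policy argument are hypothetical and exist only for illustration; this is not the patch itself. It assumes a simple left-to-right scan that moves one character forward after an empty match, so an empty match immediately followed by a non-empty match at the same position is never seen.

```python
import re

def split_by_policy(pattern, string, policy):
    """Split `string` on matches of `pattern` (hypothetical helper).

    policy 1: empty matches are never used for splitting
    policy 2: an empty match is used only if it does not touch a non-empty match
    policy 3: empty matches are always used
    """
    pat = re.compile(pattern)

    # Collect (start, end) spans with an explicit left-to-right scan.
    spans = []
    pos = 0
    while pos <= len(string):
        m = pat.match(string, pos)
        if m is None:
            pos += 1
        elif m.end() > m.start():
            spans.append((m.start(), m.end()))
            pos = m.end()
        else:
            # Empty match: record it, then step forward one character.
            spans.append((pos, pos))
            pos += 1

    # Positions where a non-empty match starts or ends.
    touched = set()
    for start, end in spans:
        if end > start:
            touched.update((start, end))

    pieces = []
    prev = 0
    for start, end in spans:
        is_empty = start == end
        if is_empty and policy == 1:
            continue
        if is_empty and policy == 2 and start in touched:
            continue
        pieces.append(string[prev:start])
        prev = end
    pieces.append(string[prev:])
    return pieces

print(split_by_policy(r'xxx|(?=abc)', 'zzxxxabczz', 2))  # ['zz', 'abczz']
print(split_by_policy(r'xxx|(?=abc)', 'zzxxxabczz', 3))  # ['zz', '', 'abczz']
print(split_by_policy(r'xxx|(?=abc)', 'zzxxxbbczz', 3))  # ['zz', 'bbczz']
```

With the pattern 'x*' and the string 'abxxxcdefxxx', policy 1 gives the present result, policy 3 gives the emptyok=True result quoted at the top, and policy 2 gives the same list without the empty strings next to the runs of x.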

(If you're reading this, what do you think?)

Anyway, we therefore can't just make this the default in 2.4. We could trigger a warning when emptyok is not supplied and a split pattern results in a zero-length match; users could supply emptyok=False to avoid the warning. Patterns that never have a zero-length match would never get the warning. 2.5 could then set emptyok to True.

I like this compromise. A variant would be to warn if the pattern could match empty, rather than warning when it does match empty. I'm not sure whether it would be easy to determine this, though.

Note: raising the warning might cause a serious performance hit for patterns that get zero-length matches a lot, which would make 2.4 slower in certain cases.

The above variant would ameliorate this, though in theory no one should be using split patterns that can match the empty string anyway.
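As a rough illustration of the warn-when-it-does-match scheme, the check could look something like the sketch below. The helper name check_split_pattern, the warning text, and the choice of FutureWarning are all assumptions made for this example; a real implementation would presumably do the check inside the split loop rather than in a separate pass over the string.

```python
import re
import warnings

def check_split_pattern(pattern, string, emptyok=None):
    """Warn if emptyok was not supplied and `pattern` produces a
    zero-length match somewhere in `string` (hypothetical helper).

    Returns True if a warning was issued.
    """
    if emptyok is not None:
        # The caller chose a behaviour explicitly, so stay quiet.
        return False
    for m in re.finditer(pattern, string):
        if m.start() == m.end():
            warnings.warn(
                "split pattern produced a zero-length match; "
                "pass emptyok explicitly",
                FutureWarning,
                stacklevel=2,
            )
            return True
    return False
```

A pattern such as 'x+' never triggers the warning, while 'x*' does unless emptyok is passed explicitly.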

Cheers, Mike


