[Python-Dev] re.split on empty patterns

A.M. Kuchling amk at amk.ca
Sat Aug 7 16:51:42 CEST 2004


The re.split() function ignores zero-length pattern matches. Patch #988761 adds an emptyok flag to split() that causes zero-length matches to trigger a split. For example:

>>> re.split(r'\b', 'this is a sentence')  # does nothing; \b is always length 0
['this is a sentence']
>>> re.split(r'\b', 'this is a sentence', emptyok=True)
['', 'this', ' ', 'is', ' ', 'a', ' ', 'sentence', '']
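To make the proposed semantics concrete, here is a minimal sketch that approximates emptyok=True on top of re.finditer(). The helper name split_emptyok is hypothetical and is not part of the patch. It ignores capturing groups and maxsplit, both of which the real split() handles, and its results for patterns that mix empty and non-empty matches (such as 'x*') depend on how finditer() reports adjacent matches.

```python
import re

def split_emptyok(pattern, string):
    """Split string at every match of pattern, zero-length matches included.

    Hypothetical helper for illustration only; capturing groups and
    maxsplit are deliberately not handled.
    """
    pieces = []
    last = 0  # end of the previous match
    for m in re.finditer(pattern, string):
        # Keep the text between the previous match and this one,
        # even when that text (or the match itself) is empty.
        pieces.append(string[last:m.start()])
        last = m.end()
    pieces.append(string[last:])  # the tail after the final match
    return pieces

print(split_emptyok(r'\b', 'this is a sentence'))
# ['', 'this', ' ', 'is', ' ', 'a', ' ', 'sentence', '']
```

The leading and trailing '' come from the word boundaries at the very start and end of the string.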

Without the patch, the various zero-length assertions are pretty useless; with it, they can serve a purpose with split():

>>> re.split(r'(?m)$', 'line1\nline2\n', emptyok=True)
['line1', '\nline2', '\n', '']
>>> # Split file into sections
>>> re.split("(?m)(?=^[[])", """[section1]
... foo=bar
...
... [section2]
... coyote=wiley
... """, emptyok=True)
['', '[section1]\nfoo=bar\n\n', '[section2]\ncoyote=wiley\n']

Zero-length matches often result in a '' at the beginning or end, or between characters, but I think users can handle that. IMHO this feature is clearly useful, and I would be happy to commit the patch as-is.
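As one illustration of how a caller might handle the stray '' entries, they can be filtered out after the split. The list below is copied from the \b example above; the variable names are only for illustration.

```python
# Result of the r'\b' split shown earlier, with '' at both ends.
parts = ['', 'this', ' ', 'is', ' ', 'a', ' ', 'sentence', '']

# Drop the empty strings but keep the whitespace pieces.
non_empty = [p for p in parts if p]
print(non_empty)
# ['this', ' ', 'is', ' ', 'a', ' ', 'sentence']
```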

Question: do we want to make this option the new default? Existing patterns that can produce zero-length matches would change their meanings:

>>> re.split('x*', 'abxxxcdefxxx')
['ab', 'cdef', '']
>>> re.split('x*', 'abxxxcdefxxx', emptyok=True)
['', 'a', 'b', '', 'c', 'd', 'e', 'f', '', '']

(I think the result of the second match points up a bug in the patch; the empty strings in the middle seem wrong to me. Assume that gets fixed.)

Anyway, we therefore can't just make this the default in 2.4. We could trigger a warning when emptyok is not supplied and a split pattern results in a zero-length match; users could supply emptyok=False to avoid the warning. Patterns that never have a zero-length match would never get the warning. 2.5 could then set emptyok to True.
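The transition scheme could look roughly like the sketch below. Everything here is an assumption for illustration: the wrapper name split_compat, the _UNSET sentinel, the warning category and its message are all hypothetical, and the old behaviour is only approximated by skipping zero-length matches from re.finditer().

```python
import re
import warnings

_UNSET = object()  # sentinel: caller did not pass emptyok at all

def split_compat(pattern, string, emptyok=_UNSET):
    """Sketch of the proposed 2.4 behaviour (hypothetical wrapper).

    emptyok=True   split on zero-length matches too
    emptyok=False  ignore zero-length matches, silently
    not supplied   ignore them, but warn if any occurred
    """
    split_on_empty = emptyok is True
    pieces, last, saw_empty = [], 0, False
    for m in re.finditer(pattern, string):
        if m.start() == m.end():
            saw_empty = True
            if not split_on_empty:
                continue  # old behaviour: zero-length matches never split
        pieces.append(string[last:m.start()])
        last = m.end()
    pieces.append(string[last:])
    if emptyok is _UNSET and saw_empty:
        warnings.warn(
            "pattern produced a zero-length match; pass emptyok=False "
            "to keep the current behaviour",
            FutureWarning, stacklevel=2)
    return pieces

print(split_compat('x*', 'abxxxcdefxxx', emptyok=False))
# ['ab', 'cdef', '']
```

A pattern such as ',' never produces a zero-length match, so it would never set saw_empty and never warn, whichever way the flag is passed.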

Note: raising the warning might cause a serious performance hit for patterns that get zero-length matches a lot, which would make 2.4 slower in certain cases.

Thoughts? Does this need a PEP?

--amk


