[Python-Dev] pre-PEP [corrected]: Complete, Structured Regular Expression Group Matching (original) (raw)
Mike Coleman mkc at mathdogs.com
Sat Aug 7 07:08:08 CEST 2004
- Previous message: [Python-Dev] Re: pre-PEP: Complete, Structured Regular Expression Group Matching
- Next message: [Python-Dev] pre-PEP [corrected]: Complete, Structured Regular Expression Group Matching
- Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]
The first version contained several errors. Here is a new version with corrections. It also incorporates the feedback received so far.
Mike
############################################################################## PEP: XXX Title: Complete, Structured Regular Expression Group Matching Version: Revision:1.4Revision: 1.4 Revision:1.4 Last-Modified: Date:2004/08/0704:57:57Date: 2004/08/07 04:57:57 Date:2004/08/0704:57:57 Author: Mike Coleman <mkc at mathdogs.com> Status: Draft Type: Standards Track Content-Type: text/x-rst Created: 1-Aug-2004 Python-Version: 2.4 Post-History: 1-Aug-2004
Abstract
This proposal is to add a new method re.structmatch
that fully
captures matches for groups that match repeatedly. The method returns
a structured match tree whose shape is determined by the pattern
groups. Ultimately this will allow a string that can be described by
a Python regular expressions (e.g., the contents of /etc/passwd
or
.ini
files, or the output of diff
) to be parsed into the
obvious parse tree with a single call to re.structmatch
.
Motivation
A notable limitation of the re.match
method is that it fails to
capture all group match information for repeatedly matched groups.
For example, in a call like this ::
m0 = re.match(r'([A-Z]+|[a-z]+)*', 'XxxxYzz')
one would like to see that the group which matched four times matched
the strings 'X'
, 'xxx'
, 'Y'
and 'zz'
. In the current
implementation, only the last match for each group ('zz'
in this
case) is available, even though the other matches are calculated
implicitly during the matching process. For simple cases, the missing
strings might be easily recovered by other means, but for more
complicated cases this is a significant annoyance.
The simplest extension would be to collect all of the matches for each group in a list, so that in the call above for example ::
m0.group(1) --> ['X', 'xxx', 'Y', 'zz']
(A mechanism like this is apparently to be added to Perl in version 6 [#PERL6]_, though that did not inspire this PEP.)
This immediately suggests the question of what is to be done for nested repeats, like so ::
m1 = re.match(r'("([A-Z]+|[a-z]+)*"\s*)*', '"Xx" "yy" "ZzZ"')
We could have ::
m1.group(2) --> ['X', 'x', 'yy', 'Z', 'z', 'Z' ]
but this flattened result would discard useful information about the structure of the match. A more useful result would be ::
m1.group(2) --> [['X', 'x'], ['yy'], ['Z', 'z', 'Z']]
This is definitely an improvement. Consider the following slightly more complicated case, though ::
m1 = re.match(r'("([A-Z]+|[a-z]+)*"(\s*))*', '"Xx" "yy" "ZzZ"')
which would give ::
m1.group(2) --> [['X', 'x'], ['yy'], ['Z', 'z', 'Z']]
m1.group(3) --> [' ', ' ', '']
This is less useful than the putative conceptual structure of the match, which would be something like ::
[[['X', 'x'], ' '],
[['yy'], ' '],
[['Z', 'z', 'Z'], '']]
In this simple case, this structure could be recovered using the
zip
function, but in the general case this is a non-trivial
transformation.
This PEP proposes a mechanism to address the general case directly.
Proposal
The new function re.structmatch
has the same arguments as
re.match
and will match using exactly the same algorithm. Instead
of returning a MatchObject upon success, re.structmatch
will
return a "tree" (or rather, a forest) in the form of a Python list.
First we will define some terms. Groups in the input pattern are
parenthesized subexpressions that "capture" strings when matched, and
are of the form '(abc)'
or '(?P<name>abc)'
. Leaf groups are
groups that do not contain any subgroups. So, for example, the group
'(abc)'
is a leaf group and the group '(a(xyz)b)'
is not a
leaf group (but it does contain the leaf group '(xyz)'
). We shall
call parenthesized expressions that are not groups "nongroups"; these
include constructs like '(?:abc)'
as well as lookahead and
lookbehind assertions.
The structure of the returned tree is determined from the grouping tree present in the pattern in the following manner:
Leaf groups not followed immediately by a repeat operator map to a single string::
re.structmatch(r'(...)', 'abcdef') --> ['abc'] re.structmatch(r'(...).(..)', 'abcdef') --> ['abc', 'ef']
Leaf groups followed immediately by a repeat operator map to a list of strings::
re.structmatch(r'([^d])', 'abcdef') --> [['a', 'b', 'c']] re.structmatch(r'([^d])(.)', 'abcdef') --> [['a', 'b', 'c'], ['d', 'e', 'f']] re.structmatch(r'(..)', 'abcdef') --> [['ab', 'cd', 'ef']] re.structmatch(r'(.){2}', 'abcdef') --> [['a', 'b']]
Non-leaf groups not followed immediately by a repeat operator map to a list of the mappings of their subgroups::
re.structmatch(r'(...)', 'abcdef') --> ['abc'] re.structmatch(r'((...))', 'abcdef') --> [['abc']] re.structmatch(r'(((...)))', 'abcdef') --> [[['abc']]] re.structmatch(r'((...)())', 'abcdef') --> [['abc'], []] re.structmatch(r'(.(.)(.(.)).(.))', 'abcdef') --> ['a', ['b'], ['c', ['d']], 'e', ['f']]
Non-leaf groups followed immediately by a repeat operator map to a list of the mappings of their repeated matches::
re.structmatch(r'((.).(.))', 'abcdef') --> [[['a', 'c'], ['d', 'f']]] re.structmatch(r'(([ab])(x))', 'baxbxx') --> [[['b', 'a'], ['x']], [['b'], ['x', 'x']]]
In the case of alternation, only the matched groups appear in the output::
re.structmatch(r'([^a])+|([^d])+', 'abcdef') --> [['a', 'b', 'c']]
If it's important to know which alternative matched, named groups can be used.
Named groups map to a pair where the first member is the name and the second member is what the unnamed group would have mapped to::
re.structmatch(r'([^d])(?P.)', 'abcdef') --> [['a', 'b', 'c'], ('rest', ['d', 'e', 'f'])]
Nongroups do not affect the output structure. Compared to non-leaf groups, nongroups have the effect of "flattening" the output, like so::
re.structmatch(r'((.).(.))', 'abcdef') --> [['a', 'c']] re.structmatch(r'(.).(.)', 'abcdef') --> ['a', 'c'] re.structmatch(r'(?:(.).(.))', 'abcdef') --> ['a', 'c']
re.structmatch(r'((.).(.))', 'abcdef') --> [[['a', 'c'], ['d', 'f']]] re.structmatch(r'(?:(.).(.))', 'abcdef') --> [['a', 'c', 'd', 'f']] (Nongroups that do not contain leaf groups have no effect whatsoever on the output structure.)
Additional Features
By analogy with 're.match'
, regular expression objects will also
have a 'structmatch'
method, with the same signature as the
'match'
method.
In many cases in which 're.structmatch'
fails to match, the cause will
be due to an unexpected error in the format of the string being
matched. In order to assist the calling program in generating a more
meaningful possible error message, 're.structmatch'
will return the
endpoint of the largest match against the searched string. So, for
example ::
re.structmatch('abcd', 'abxxx') --> 2
re.structmatch('abcde|z', 'abxxx') --> 2
re.structmatch('z|abcde', 'abxxx') --> 2
re.structmatch('x*?y', 'xxx') --> 3
(This return value should be considered advisory rather than exact, as future improvements in the match algorithm may make it difficult to calculate the exact value.)
An alternative suggested by Michael Hudson would be to throw an
exception in this case. This makes more sense, but would have the
disadvantage of diverging from the behavior of 're.match'
.
Examples
Parsing /etc/group
If /etc/group
contains the following lines ::
root:x:0:
daemon:x:1:
bin:x:2:
sys:x:3:
then it can be parsed as follows ::
p = r'''(?xm) # VERBOSE, MULTILINE
(
(?:
# one field, preceded by a ':' if
# the field is not the line's first
(?:^|:) ([^:\n]*)
)*
\n
)*
\Z # assert that the entire
# input was matched
'''
s = open('/etc/group').read()
tree = re.structmatch(p, s)
which will give tree
the following value::
[['root', 'x', '0', ''],
['daemon', 'x', '1', ''],
['bin', 'x', '2', ''],
['sys', 'x', '3', '']]
Note that the above pattern is written to allow any /etc/group
field to be empty. The pattern won't work correctly for versions of
Python prior to 2.4 because of the possibility of repeating empty
matches. This problem seems to have been fixed in Python 2.4. (A
slightly more complicated pattern would work for pre-2.4 versions.)
Parsing .ini
files
The Microsoft .ini
file format is found in various contexts (there
is one in the Python source distribution:
Tools/freeze/extensions_win32.ini
). Given a file with contents ::
[singers]
good=Annie Lennox
bad=William Shatner
[comedians]
good=Monty Python
the file can be parsed with pattern ::
r'''(?xm) # VERBOSE, MULTILINE
\s* # leading whitespace
( # begin sections
^\[ ([^]]+) \] \s* # section header
( # begin params
^([^=]+) = # param name
(.*) $ # param value
\s* # intervening whitespace
)* # end params
)* # end sections
\Z # assert that the entire
# input was matched
'''
to give ::
[['singers',
['good', 'Annie Lennox'],
['bad', 'William Shatner']],
['comedians',
['good', 'Monty Python']]]
The pattern given omits support for .ini
file comments for the
sake of simplicity, but this could be added.
Details
The proposal states that re.structmatch
will match using exactly
the same algorithm as re.match
. This might be a little odd for a
pattern like r'(x|y|z)*aaa\1'
, where the algorithm will require
that the \1
back-reference match (at most) one character. It's
not obvious whether there is any better option, though, and there are
advantages of implementation and simplicity for keeping the match
algorithms of these two methods identical. (A similar argument
applies to '(?P=NAME)'
.)
Discussion
Part of the reason the proposed method would be so useful is that
re.split
currently does not split on empty matches. If it had
this feature, one could split on lookahead and lookbehind assertions,
which would significantly ease parsing of strings with a recursive
regular structure (e.g., .ini
files). Patch #988761
_ will
correct this re.split
deficiency, if it is accepted.
.. _#988761: https://sourceforge.net/tracker/?func=detail&aid=988761&group_id=5470&atid=305470
For particularly complex patterns, the current 99 group limit might actually become a practical problem.
One could imagine a variation in which subtrees of named groups might be captured in dictionaries rather than lists, with the group names used as keys.
Rejected Alternatives
Several simpler alternatives are rejected in the Motivation
_ section
above. Although these alternatives would be better than nothing, they
would not adequately satisfy the goal of this proposal, which is to
allow the programmer to extract the structure of a string in an
immediate manner.
It would be possible to use tuples for some substructures in the return tree, rather than composing it strictly of lists. This alternative was rejected because it was thought useful to be able to modify the resulting tree in place, perhaps to add decorations, etc., and tuples would make this more difficult.
References
.. [#PERL6] Multiple regex matches in Perl Apocalypse 5 (http://www.perl.com/pub/a/2002/06/04/apo5.html?page=22#rfc%20360:%20allow%20multiply%20matched%20groups%20in%20regexes%20to%20return%20a%20listref%20of%20all%20matches)
Acknowledgements
Thanks to Raymond Hettinger, Michael Hudson and Edward Loper for feedback and improvements.
Copyright
This document has been placed in the public domain.
.. Local Variables: mode: indented-text indent-tabs-mode: nil sentence-end-double-space: t fill-column: 70 End:
- Previous message: [Python-Dev] Re: pre-PEP: Complete, Structured Regular Expression Group Matching
- Next message: [Python-Dev] pre-PEP [corrected]: Complete, Structured Regular Expression Group Matching
- Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]