[Python-Dev] pre-PEP [corrected]: Complete, Structured Regular Expression Group Matching (original) (raw)

Mike Coleman mkc at mathdogs.com
Sat Aug 7 07:08:08 CEST 2004


The first version contained several errors. Here is a new version with corrections. It also incorporates the feedback received so far.

Mike

############################################################################## PEP: XXX Title: Complete, Structured Regular Expression Group Matching Version: Revision:1.4Revision: 1.4 Revision:1.4 Last-Modified: Date:2004/08/0704:57:57Date: 2004/08/07 04:57:57 Date:2004/08/0704:57:57 Author: Mike Coleman <mkc at mathdogs.com> Status: Draft Type: Standards Track Content-Type: text/x-rst Created: 1-Aug-2004 Python-Version: 2.4 Post-History: 1-Aug-2004

Abstract

This proposal is to add a new method re.structmatch that fully captures matches for groups that match repeatedly. The method returns a structured match tree whose shape is determined by the pattern groups. Ultimately this will allow a string that can be described by a Python regular expressions (e.g., the contents of /etc/passwd or .ini files, or the output of diff) to be parsed into the obvious parse tree with a single call to re.structmatch.

Motivation

A notable limitation of the re.match method is that it fails to capture all group match information for repeatedly matched groups. For example, in a call like this ::

m0 = re.match(r'([A-Z]+|[a-z]+)*', 'XxxxYzz')

one would like to see that the group which matched four times matched the strings 'X', 'xxx', 'Y' and 'zz'. In the current implementation, only the last match for each group ('zz' in this case) is available, even though the other matches are calculated implicitly during the matching process. For simple cases, the missing strings might be easily recovered by other means, but for more complicated cases this is a significant annoyance.

The simplest extension would be to collect all of the matches for each group in a list, so that in the call above for example ::

m0.group(1) --> ['X', 'xxx', 'Y', 'zz']

(A mechanism like this is apparently to be added to Perl in version 6 [#PERL6]_, though that did not inspire this PEP.)

This immediately suggests the question of what is to be done for nested repeats, like so ::

m1 = re.match(r'("([A-Z]+|[a-z]+)*"\s*)*', '"Xx" "yy" "ZzZ"')

We could have ::

m1.group(2) --> ['X', 'x', 'yy', 'Z', 'z', 'Z' ]

but this flattened result would discard useful information about the structure of the match. A more useful result would be ::

m1.group(2) --> [['X', 'x'], ['yy'], ['Z', 'z', 'Z']]

This is definitely an improvement. Consider the following slightly more complicated case, though ::

m1 = re.match(r'("([A-Z]+|[a-z]+)*"(\s*))*', '"Xx" "yy" "ZzZ"')

which would give ::

m1.group(2) --> [['X', 'x'], ['yy'], ['Z', 'z', 'Z']]
m1.group(3) --> [' ', ' ', '']

This is less useful than the putative conceptual structure of the match, which would be something like ::

[[['X', 'x'], ' '],
 [['yy'], ' '], 
 [['Z', 'z', 'Z'], '']]

In this simple case, this structure could be recovered using the zip function, but in the general case this is a non-trivial transformation.

This PEP proposes a mechanism to address the general case directly.

Proposal

The new function re.structmatch has the same arguments as re.match and will match using exactly the same algorithm. Instead of returning a MatchObject upon success, re.structmatch will return a "tree" (or rather, a forest) in the form of a Python list.

First we will define some terms. Groups in the input pattern are parenthesized subexpressions that "capture" strings when matched, and are of the form '(abc)' or '(?P<name>abc)'. Leaf groups are groups that do not contain any subgroups. So, for example, the group '(abc)' is a leaf group and the group '(a(xyz)b)' is not a leaf group (but it does contain the leaf group '(xyz)'). We shall call parenthesized expressions that are not groups "nongroups"; these include constructs like '(?:abc)' as well as lookahead and lookbehind assertions.

The structure of the returned tree is determined from the grouping tree present in the pattern in the following manner:

Additional Features

By analogy with 're.match', regular expression objects will also have a 'structmatch' method, with the same signature as the 'match' method.

In many cases in which 're.structmatch' fails to match, the cause will be due to an unexpected error in the format of the string being matched. In order to assist the calling program in generating a more meaningful possible error message, 're.structmatch' will return the endpoint of the largest match against the searched string. So, for example ::

    re.structmatch('abcd', 'abxxx') --> 2
    re.structmatch('abcde|z', 'abxxx') --> 2
    re.structmatch('z|abcde', 'abxxx') --> 2
    re.structmatch('x*?y', 'xxx') --> 3

(This return value should be considered advisory rather than exact, as future improvements in the match algorithm may make it difficult to calculate the exact value.)

An alternative suggested by Michael Hudson would be to throw an exception in this case. This makes more sense, but would have the disadvantage of diverging from the behavior of 're.match'.

Examples

Parsing /etc/group

If /etc/group contains the following lines ::

root:x:0:
daemon:x:1:
bin:x:2:
sys:x:3:

then it can be parsed as follows ::

p = r'''(?xm)           # VERBOSE, MULTILINE
        (
          (?:
            # one field, preceded by a ':' if
            # the field is not the line's first
            (?:^|:) ([^:\n]*)
          )*
          \n
        )*
        \Z              # assert that the entire
                        #   input was matched
     '''
s = open('/etc/group').read()
tree = re.structmatch(p, s)

which will give tree the following value::

[['root', 'x', '0', ''],
 ['daemon', 'x', '1', ''],
 ['bin', 'x', '2', ''],
 ['sys', 'x', '3', '']]

Note that the above pattern is written to allow any /etc/group field to be empty. The pattern won't work correctly for versions of Python prior to 2.4 because of the possibility of repeating empty matches. This problem seems to have been fixed in Python 2.4. (A slightly more complicated pattern would work for pre-2.4 versions.)

Parsing .ini files

The Microsoft .ini file format is found in various contexts (there is one in the Python source distribution: Tools/freeze/extensions_win32.ini). Given a file with contents ::

[singers]
good=Annie Lennox
bad=William Shatner

[comedians]
good=Monty Python

the file can be parsed with pattern ::

r'''(?xm)           # VERBOSE, MULTILINE
    \s*             # leading whitespace
    (               # begin sections
      ^\[ ([^]]+) \] \s*  # section header
      (             # begin params
        ^([^=]+) =  # param name
        (.*) $      # param value
        \s*         # intervening whitespace
      )*            # end params
    )*              # end sections
    \Z              # assert that the entire
                    #   input was matched
 '''

to give ::

[['singers',
  ['good', 'Annie Lennox'],
  ['bad', 'William Shatner']],
 ['comedians',
  ['good', 'Monty Python']]]

The pattern given omits support for .ini file comments for the sake of simplicity, but this could be added.

Details

The proposal states that re.structmatch will match using exactly the same algorithm as re.match. This might be a little odd for a pattern like r'(x|y|z)*aaa\1', where the algorithm will require that the \1 back-reference match (at most) one character. It's not obvious whether there is any better option, though, and there are advantages of implementation and simplicity for keeping the match algorithms of these two methods identical. (A similar argument applies to '(?P=NAME)'.)

Discussion

Part of the reason the proposed method would be so useful is that re.split currently does not split on empty matches. If it had this feature, one could split on lookahead and lookbehind assertions, which would significantly ease parsing of strings with a recursive regular structure (e.g., .ini files). Patch #988761_ will correct this re.split deficiency, if it is accepted.

.. _#988761: https://sourceforge.net/tracker/?func=detail&aid=988761&group_id=5470&atid=305470

For particularly complex patterns, the current 99 group limit might actually become a practical problem.

One could imagine a variation in which subtrees of named groups might be captured in dictionaries rather than lists, with the group names used as keys.

Rejected Alternatives

Several simpler alternatives are rejected in the Motivation_ section above. Although these alternatives would be better than nothing, they would not adequately satisfy the goal of this proposal, which is to allow the programmer to extract the structure of a string in an immediate manner.

It would be possible to use tuples for some substructures in the return tree, rather than composing it strictly of lists. This alternative was rejected because it was thought useful to be able to modify the resulting tree in place, perhaps to add decorations, etc., and tuples would make this more difficult.

References

.. [#PERL6] Multiple regex matches in Perl Apocalypse 5 (http://www.perl.com/pub/a/2002/06/04/apo5.html?page=22#rfc%20360:%20allow%20multiply%20matched%20groups%20in%20regexes%20to%20return%20a%20listref%20of%20all%20matches)

Acknowledgements

Thanks to Raymond Hettinger, Michael Hudson and Edward Loper for feedback and improvements.

Copyright

This document has been placed in the public domain.

.. Local Variables: mode: indented-text indent-tabs-mode: nil sentence-end-double-space: t fill-column: 70 End:



More information about the Python-Dev mailing list