[Python-Dev] Re: pre-PEP [corrected]: Complete, Structured Regular Expression Group Matching

Mike Coleman mkc at mathdogs.com
Thu Aug 12 06:25:38 CEST 2004


"Stephen J. Turnbull" <stephen at xemacs.org> writes:

I didn't have that in mind, but I can live with it. Regexp doesn't compare well even with that when you consider error-checking and program maintenance.

It's true that regexp matching is pretty much a binary thing--either it does match or it doesn't. For human generated files, this isn't so good, but my domain of interest is machine-generated files, where I "know" that the pattern will match. If the match fails in this case, I'm happy enough to just detect it and bail, possibly with an error message that says "looks like there's a problem somewhere around char x in line y".

Re maintenance, yeah regexp is pretty terse and ugly. Generally, though, I'd rather deal with a reasonably well-considered 80 char regexp than 100 lines of code that does the same thing.

It's not obvious to me how to make grammar rules pretty in Python, but implementing an SLR parser-generator should be at most a couple of days' work to get something you could live with.

There are several of these packages available and I'm all in favor of their use, particularly for problems of sufficient complexity. These are currently hindered, though, by the fact that none have been elected for inclusion in the standard library.

Furthermore, since we have 're', and it's not going away, I'd really like to fix this repeated match deficiency, one way or another.
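To be concrete about the deficiency: with the existing 're', a group inside a repetition reports only its last repetition, so all of the earlier matches are lost. For example:

    import re

    # The repeated group matches three times, but only the final
    # repetition is available afterwards.
    m = re.match(r'(\w+ )*', 'a b c ')
    print(m.group(0))   # 'a b c '  -- the whole repeated region matched
    print(m.group(1))   # 'c '      -- but only the last repetition is kept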

I agree it would be useful. I think that a parser generator would be equally useful for this, as easy to learn as regexps are (assuming somebody comes up with a tasteful way to express grammar rules), and more powerful with even a basic implementation.

I will be +1 in favor if someone wants to pursue it.

BTW, do you have a sample implementation for re.structmatch? Fredrik seems to have some doubt about ease of implementation.

Erik Heneryd posted an intriguing Python prototype, which I'm still looking at.

Mike> I agree that, like list comprehensions (for example), it
Mike> needs to be applied with good judgement.

I don't see the analogy to list comprehensions.

Just that although list comprehensions are cool, you can still make an unreadable mess with them, as I've found myself doing occasionally (nesting complex comprehensions four levels deep, etc.).
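For instance (a contrived sketch, not from any real program):

    # A comprehension nested a few levels deep gets hard to read fast:
    matrix = [[1, 2], [3, 4], [5, 6]]
    pairs = [(x, y) for row in matrix for x in row for y in row if x != y]

    # The plain-loop version is longer but much easier to follow:
    pairs = []
    for row in matrix:
        for x in row:
            for y in row:
                if x != y:
                    pairs.append((x, y))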

My objection is that it throws away a lot of structure, and therefore is liable to return the wrong parse, or simply an error with no hint as to where the data is malformed.

Hmm. Regarding the lack of error diagnosis, I'm not too concerned about this, for the reason I mention above. When 're.structmatch' does fail, though, it returns a "farthest match position", which will usually be of some aid, I would think.

Regarding getting the parse wrong, sure, this could happen. Users will have to be judicious. The answer here is to write RE's with some care, realize that some matches may require further checking after being returned by re.structmatch, and further realize that some parsing problems are too complex for this method and require a grammar-based approach instead. This doesn't really seem worse than the situation for re.match, though, to me.

Because they're all too often not "multiple matches". They're matches for different objects that happen to match the same regular expression, and the notation for the parser should express that without requiring comments. re.structmatch encourages a style where users deliberately throw away structure; your /etc/group parser is an example. (See below.)

Hmm. You're quibbling with my example. Yes, I could have written the expression to "know" that there were only four fields and that the third was a number, etc. I chose not to for pedagogical purposes, but it could easily have been done the other way. And in practice the programmer can choose the specificity of the match depending on exactly what they're trying to accomplish. Is this really different from what would happen with grammars?
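To make that choice concrete, here is roughly how it looks line-by-line with the existing 're'. (A sketch only; the sample data and the particular field checks are made up for illustration.)

    import re

    sample = "wheel:x:10:root,mkc\naudio:x:63:\n"

    # Loose version: treat every field alike and keep whatever shows up.
    loose = [line.split(':') for line in sample.splitlines()]

    # Specific version: "know" that there are exactly four fields, that the
    # third is a number, and bail with a location if a line doesn't conform.
    specific = re.compile(r'^([^:\n]*):([^:\n]*):(\d+):(.*)$')
    for lineno, line in enumerate(sample.splitlines(), 1):
        m = specific.match(line)
        if m is None:
            raise ValueError("looks like a problem around line %d: %r"
                             % (lineno, line))
        group, passwd, gid, members = m.groups()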

Furthermore, since this is going to be recursive, users are going to end up with "list of list of ..." nested to arbitrary depth, depending on how complex the regexp is. Isn't that what we have Lisp for?

The shape of the result follows directly from the shape of the grouping in the RE, yes. This is construed as a benefit. And yes, Lisp's backtick (? it's been a while) expressions were part of what inspired me on this.
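Roughly speaking, and purely as an illustration (the exact form is defined in the pre-PEP, not here), the /etc/group example might come back as something like:

    # Hypothetical result shape only; re.structmatch does not exist yet.
    # One element per repetition of the outer group, with the inner
    # groups nested inside it, mirroring the grouping in the RE.
    result = [
        ['wheel', 'x', '10', 'root,mkc'],
        ['audio', 'x', '63', ''],
    ]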

I suppose you would be satisfied if you could represent

    file   := file line
    line   := group ':' passwd ':' number ':' any '\n'
    group  := '[A-Za-z0-9]+'
    passwd := '[A-Za-z0-9]+'
    number := '[0-9]+'
    any    := '.*'

by

    '(([A-Za-z0-9]+:)*.\n)'

This isn't really fair. If you know that that's what you want to parse, then you could use a more appropriate re.structmatch equivalent like

p = r'''(?xm)           # VERBOSE, MULTILINE
        (
          (?P<group>  [A-Za-z0-9]+) :
          (?P<passwd> [A-Za-z0-9]+) :
          (?P<number> [0-9]+)       :
          (?P<any>    .*)
          \n
        )*
        \Z              # assert that the entire
                        #   input was matched
     '''

which will return the submatches tagged by those group names.
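With today's re.match, of course, the pattern above keeps only the last line's groups -- which is exactly the deficiency structmatch is meant to fix. In the meantime the usual workaround is to drop the outer ()* and run a per-line pattern, roughly:

    import re

    sample = "wheel:x:10:root,mkc\naudio:x:63:\n"

    # Per-line pattern plus finditer: this is the bookkeeping that
    # re.structmatch would take over.
    line_pat = re.compile(r'''(?x)
            (?P<group>  [A-Za-z0-9]+) :
            (?P<passwd> [A-Za-z0-9]+) :
            (?P<number> [0-9]+)       :
            (?P<any>    .*)
        ''')
    entries = [m.groupdict() for m in line_pat.finditer(sample)]
    # -> [{'group': 'wheel', 'passwd': 'x', 'number': '10', 'any': 'root,mkc'},
    #     {'group': 'audio', 'passwd': 'x', 'number': '63', 'any': ''}]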

improving 'passwd' to check for the characteristic signature of a crypted password or a "no password authentication" flag would be annoying with re.structmatch, while it's trivial with a parser.

If this signature can be expressed as an RE, then I suspect it would be easy enough to add. If not, then probably you'd have to check with code after the fact, even if you're using a grammar parser. Or am I missing something?
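For what it's worth, a signature check of that sort does seem expressible as an RE. The forms below are just the usual crypt conventions and may need adjusting for any particular system; it's a sketch, not a complete rule:

    import re

    # 'x', '*', or '!'            -> password stored elsewhere / disabled
    # 13 chars of [./0-9A-Za-z]   -> traditional DES crypt output
    # '$id$salt$hash'             -> MD5-style and newer crypt output
    passwd_re = re.compile(r'''(?x)
          (?P<passwd>
              [x*!]
            | [./0-9A-Za-z]{13}
            | \$\d\$[^$:]+\$[^:\s]+
          ) \Z
        ''')

    for field in ('x', 'ab/cd.ef12GH3', '$1$abcd$notARealHash', 'bogus'):
        print("%s -> %s" % (field, 'ok' if passwd_re.match(field) else 'rejected'))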

I grant that a full-blown parser generator is overkill for your target set of applications. But I stick with my judgement that regexps as they are are an attractive nuisance, and re.structmatch would make the situation worse in proportion to its usage. I hope your proposal will be superseded either by a scanner package, or even by a parser package.

I think you're saying that you just don't care for regexps, period. This reminds me of jwz's quip ("now you have two problems"), and I have some sympathy for this point of view, but practically speaking 're' is here and it's not going away any decade soon. Given that, I'd like to fix this multiple match deficiency.

I'm in favor of the scanner package, but it doesn't cover the same problem space as re.structmatch, so I don't see it as a replacement. A parser package might be, but I don't see any sign that that's coming any time soon. (I'd be happy enough to be wrong about this.)

Cheers, Mike


