bpo-433030: Add support of atomic grouping in regular expressions by serhiy-storchaka · Pull Request #31982 · python/cpython (original) (raw)

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service andprivacy statement. We’ll occasionally send you account related emails.

Already on GitHub?Sign in to your account

Conversation20 Commits9 Checks0 Files changed

Conversation

This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.Learn more about bidirectional Unicode characters

[ Show hidden characters]({{ revealButtonHref }})

serhiy-storchaka

@serhiy-storchaka

tim-one

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't know anything about re's implementation. Does anyone? 😉

terryjreedy

successful, continues to match the rest of the pattern following it. If the
subsequent pattern fails to match, the stack can only be unwound to a point
*before* the ``(?>...)`` because once exited, the expression, known as an
:dfn:`Atomic Group`, has thrown away all stack points within itself. Thus,

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is 'stack point' the same as 'backtrack point'?

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure yet.

@tim-one

BTW, I'll add that I think these (atomic groups and possessive quantifiers) are, and by far, the most important things Python's engine is missing among "modern" features. Although, ya, I would rather see MRAB fold his regex engine in, since it has essentially all "modern" features already.

@serhiy-storchaka

@serhiy-storchaka

serhiy-storchaka

Comment on lines +696 to +697

capture = True
atomic = False

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We could use a three-state flag (capturing/non-capturing/atomic) instead of two boolean flags. It could even be a tiny bit faster if use built-in singletons True/False/None. But the current code may be clearer.

@serhiy-storchaka

@serhiy-storchaka

I did have doubts about this feature, because it is an advanced feature not supported in older regular expression implementations, and it is difficult to learning.

But some regular expressions (e.g. in textwrap) can be simplified with atomic groups, and we have found recently that atomic groups can allow to fix some security issues. So there is a need of atomic groups in the stdlib. Slowly they will be used in user code too.

@tim-one

I'd say, to the contrary, that these are easy to learn. Indeed, that's why they're near-universal now among other regexp engines. The possibility for backtracking is rarely needed for lexical analysis, and gimmicks that promise not to backtrack save users from worlds of unintended quadratic, cubic, ..., even exponential time (non-)matching disaster cases.

@tim-one

Example: a while back, our fnmatch.py was found to take exponential time in some non-matching cases, a "security hole". This was entirely due to unintended and unwanted backtracking in re. But there's no straightforward way now to stop that from happening. Instead I redid the regexp construction in a highly convoluted way, relying on that matches done inside a lookahead assertion don't support backtracking. So do the "real" matching in a lookahead assertion, wrap the guts of the lookahead assertion inside a named group, then do the actual matching of the characters via a backreference to that named group.

But I expect there are only a relative handful of people now who could repair that if another bug turns up, because the workaround was so obscure and technical.

With atomic groups, though, the workaround could be entirely straightforward - indeed, it would be obvious.

@serhiy-storchaka

This is a really impressive example. I shudder at the thought that I would have to repeat this work in a modified copy of fnmatch.translate(). Now I think I'll just use regex and atomic grouping in that code.

If there are no objections I am going to merge this PR tomorrow. I can't say that I understand this code 100%, but on the other hand I don't see anything openly suspicious, the tests passed successfully, and the core has remained stable for many years.

@tim-one

I agree merging is the best thing to do, and especially since it's early in the release cycle.

Some encouragement: while I haven't studied Python's re engine, as a general principle engines "like this" use much the same implementation code for atomic groups and forward lookahead assertions. Both try to match starting from the current position, and once they find something that matches, they're done, period - if, after getting out of the group, something later fails to match, they do no internal backtracking. The only "real" difference is that the atomic group consumes the substring it matched (advances the current search position), but the assertion does not. That's why I expect Fredrick said (on the bpo report) that most of the code to do atomic groups was already there.

After you merge this, I intend to change fnmatch.py to exploit it. It would give me real joy to rip out the excruciating cleverness I added 😄.

@serhiy-storchaka

There are some differences in a code of atomic groups and lookahead assertions. I have tested them, and atomic groups seems work correctly. For example re.fullmatch(r'(?=a+)', 'a') returns None, but re.fullmatch(r'(?>a+)', 'a') returns a match. I'll continue to search further.

But I have found a bug in possessive repetition. For example, re.fullmatch(r'a++', 'ab') should return None, but returns a match.

@serhiy-storchaka

Other bug: re.findall(r'a*+', 'aab') and re.findall(r'a?+', 'aab') hang.

Well, I wrote the code which allows non-possessive repetition to work correctly in these cases, so I know how to fix it.

@serhiy-storchaka

@serhiy-storchaka

@thatbirdguythatuknownot

But I have found a bug in possessive repetition. For example, re.fullmatch(r'a++', 'ab') should return None, but returns a match.

@serhiy-storchaka Shouldn't that not be a bug and shouldn't that match a?

@tim-one

@thatbirdguythatuknownot, that's a bug. Note that fullmatch() is being used. That requires matching the entire string, not just a proper prefix. The trailing b in the string isn't matched.

@thatbirdguythatuknownot

@thatbirdguythatuknownot, that's a bug. Note that fullmatch() is being used. That requires matching the entire string, not just a proper prefix. The trailing b in the string isn't matched.

I see. Thanks for clarifying.

@ghost ghost mentioned this pull request

Apr 17, 2022