<regex>: Revise caret parsing in basic and grep mode by muellerj2 · Pull Request #5165 · microsoft/STL
I think I have grasped the underlying issue now, but I need maintainer direction to proceed because there are three possible solutions with different pros and cons.
- Section 9.3.8 of the POSIX standard says that a caret at the beginning of a subexpression (capture or non-capture group) may be interpreted as an anchor or as an ordinary character.
- Thus, both libc++ and libstdc++ conform to the POSIX standard, even though they behave completely differently: libc++ interprets a caret at the beginning of a subexpression as an ordinary character, while libstdc++ treats it as an anchor.
- The test case is non-portable because it assumes that a caret at the beginning of a subexpression is treated as an ordinary character. To make it portable, the caret would have to be escaped by a backslash.
- MSVC STL is peculiar: It treats a caret at the beginning of a subexpression as an ordinary character, but interprets it as an anchor if the subexpression in turn is also at the beginning of a subexpression. I don't think this violates the letter of the POSIX standard, but it is very odd behavior. (I think the parser code clearly suggests that treatment as an anchor was intended; the implementation just suffers from a bug.)
- (Even weirder, MSVC STL always interprets the dollar sign as an ordinary character at the end of a subexpression, never as an anchor.)
- This PR fixes the bug in the parser code as an unintended side effect: With this PR, `_Parser::_Beg_expr()` now returns the correct result because it is called after the node for the capture or non-capture group has been added to the NFA, while it was previously called before that node was created.
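To make the differences above easy to reproduce, here is a minimal sketch of a probe that classifies how a given standard library treats a caret in a basic-mode pattern. The names `CaretMeaning`, `probe_caret` and `escaped_caret_is_literal` are made up for this illustration; they are not part of the PR or of any library. The probe only observes matching behavior and says nothing about parser internals.

```cpp
#include <cassert>
#include <regex>

// Hypothetical helper names, used for illustration only.
enum class CaretMeaning { Anchor, OrdinaryChar, Unclear };

// Classify how the library treats the caret in a basic-mode pattern
// that is expected to match either "a" (caret is an anchor) or
// "^a" (caret is an ordinary character).
inline CaretMeaning probe_caret(const char* pattern) {
    const std::regex re(pattern, std::regex_constants::basic);
    const bool as_anchor = std::regex_match("a", re);
    const bool as_literal = std::regex_match("^a", re);
    if (as_anchor && !as_literal) {
        return CaretMeaning::Anchor;
    }
    if (as_literal && !as_anchor) {
        return CaretMeaning::OrdinaryChar;
    }
    return CaretMeaning::Unclear;
}

// The portable spelling: a backslash-escaped caret is an ordinary character.
inline bool escaped_caret_is_literal() {
    const std::regex re("\\(\\^a\\)", std::regex_constants::basic);
    return std::regex_match("^a", re) && !std::regex_match("a", re);
}
```

Going by the description above, `probe_caret("\\(^a\\)")` should report `OrdinaryChar` on libc++ and `Anchor` on libstdc++. On MSVC STL before this PR, it should report `OrdinaryChar`, while the nested pattern `probe_caret("\\(\\(^a\\)\\)")` should report `Anchor`. `escaped_caret_is_literal()` is expected to hold on all three.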
Given that the POSIX standard does not mandate how a caret at the beginning of a subexpression is to be handled, we have three options for how to proceed:
1. Always treat a caret as an ordinary character, aligning with libc++. This is unlikely to break users and leads to consistent treatment of carets, but restricts the expressiveness of the regular expressions more than the other options.
2. Always treat a caret as an anchor, aligning with libstdc++. This is probably the option most likely to break some users, but it yields consistent treatment of carets and greater expressiveness of regular expressions.
3. Retain the old behavior. This won't break any users and is as expressive as always treating a caret as an anchor (with some weird rules attached to it), but the treatment of carets is just plain weird.
Which option do you consider appropriate? I'm willing to make the changes for any of them. (Even if we go for option 2, we should add tests to make sure we don't regress this behavior in the future. And I think there is still a minor bug in the handling of carets at the beginning of subexpressions, but it can be fixed by changing a single line.)