`<regex>`: Properly parse backslashes in character classes of basic regexes by muellerj2 · Pull Request #5523 · microsoft/STL
Fixes #5379.
This renames the parser to add a new member variable describing the lexer mode: Default or inside character class. This allows the lexer to correctly process a backslash when parsing a character class/bracket expression.
I also tried to do this without renaming the parser, but then we would have to pass the lexer mode (inside or outside a character class) as an argument to every function that processes escapes in any way, which is a bit of a pain. Renaming the parser requires the fewest changes to the logic itself.
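To illustrate why keeping the mode as lexer state is convenient, here is a minimal sketch of a toy lexer, not the actual `_Parser2` code. All names (`ToyLexer`, `LexMode`, `class_members`) are hypothetical, and the sketch ignores ranges, negation, and the rule that a leading `]` is literal. It shows the escape check returning false inside a bracket expression and the mode being switched before the next token is fetched.

```cpp
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Hypothetical names throughout; this is a toy, not the STL's _Parser2.
enum class LexMode : unsigned char { Default, CharClass };
enum class Tok { Literal, Escaped, OpenBracket, CloseBracket, End };

class ToyLexer {
public:
    explicit ToyLexer(std::string pattern) : pat_(std::move(pattern)) { next(); }

    Tok kind() const { return kind_; }
    char value() const { return value_; }
    void set_mode(LexMode m) { mode_ = m; }

    // Classify the token at the current position under the current mode,
    // then advance past it.
    void next() {
        if (pos_ >= pat_.size()) {
            kind_ = Tok::End;
            value_ = '\0';
            return;
        }
        if (is_esc()) {
            kind_ = Tok::Escaped;
            value_ = pat_[pos_ + 1];
            pos_ += 2;
            return;
        }
        value_ = pat_[pos_++];
        if (mode_ == LexMode::Default && value_ == '[') {
            kind_ = Tok::OpenBracket;
        } else if (mode_ == LexMode::CharClass && value_ == ']') {
            kind_ = Tok::CloseBracket;
        } else {
            kind_ = Tok::Literal;
        }
    }

private:
    // A backslash starts an escape only outside a bracket expression.
    bool is_esc() const {
        return mode_ == LexMode::Default && pos_ + 1 < pat_.size() && pat_[pos_] == '\\';
    }

    std::string pat_;
    std::size_t pos_ = 0;
    Tok kind_ = Tok::End;
    char value_ = '\0';
    LexMode mode_ = LexMode::Default;
};

// Collects the raw members of every bracket expression in the pattern.
std::vector<std::string> class_members(const std::string& pattern) {
    ToyLexer lex(pattern);
    std::vector<std::string> result;
    while (lex.kind() != Tok::End) {
        if (lex.kind() != Tok::OpenBracket) {
            lex.next();
            continue;
        }
        lex.set_mode(LexMode::CharClass); // switch first ...
        lex.next();                       // ... so the first member is lexed in class mode
        std::string members;
        while (lex.kind() != Tok::CloseBracket && lex.kind() != Tok::End) {
            members += lex.value();
            lex.next();
        }
        lex.set_mode(LexMode::Default); // switch back before lexing what follows ']'
        if (lex.kind() == Tok::CloseBracket) {
            lex.next();
        }
        result.push_back(members);
    }
    return result;
}
```

In this toy, `class_members(R"([\a]b)")` yields one entry containing both the backslash and `a`. If `set_mode` were called after `next()` instead, the backslash would be consumed as an escape and only `a` would be collected. No function other than `is_esc()` needs to know the mode, which is the point of storing it as a member.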
Since the parser is renamed, this PR also makes a number of minor cleanups to the parser and to the builder (which is renamed as well so that these cleanups are possible).
The PR is split into several commits to simplify reviewing:
- Rename `_Parser` to `_Parser2`.
- Since we have renamed the parser, we can strip any version numbers from member functions.
- Clean up the parse flags, which we can do now because there is no longer any chance of mix-and-matching the parser constructor and the parser member function `_Compile`. Specifically:
  - `_L_brk_bal` is assigned its own bit; previously it was `_L_brk_bal = 0x20000000, // ']' special only after '[' (ERE, BRE); TRANSITION, ABI: same value as _L_brk_rstr`
  - The `_L_grp_esc` flag is added to the awk flags so that the workaround in `_ClassAtom` can be removed.
  - I also extended the `_Lang_flags` enum and the `_L_flags` member variable to `unsigned long long` so that we can add more flags more easily in the future. (This already adds `_L_dsh_rstr` to signify that the dash `-` cannot appear as the starting point of a character range in BREs and EREs, but doesn't perform the parser changes to support it yet.)
- Remove the unused member `_Begin` from the parser.
- Slightly reorder the parser member variables to reduce padding a bit. (`_Char` is usually a `char` or `wchar_t`, so it [plus the single-byte `_Mode` member variable added in the last commit] can usually fit into the four bytes the compiler must add after `_Mchar`.)
- Rename `_Builder` to `_Builder2`.
- Strip version numbers from member functions of the builder.
- Remove obsolete members `_Bmax` and `_Tmax` from the builder.
- Actually fix #5379 (`<regex>`: Backslashes in character classes are sometimes not matched in basic regular expressions), essentially by making `_Is_esc()` always return false when not in default (read: outside-bracketed-character-class) mode. Note that it matters how we change the lexer mode in `_Parser2::_Alternative()`: `_Next()` and `_Expect()` process the first token inside or outside the square brackets, so we must change the mode before calling these functions. The tests check that we didn't get this wrong.
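The flag cleanup in the list above (a dedicated bit for `_L_brk_bal` and a 64-bit flag type) can be sketched in isolation. The enum below is hypothetical: only the old value `0x20000000` comes from this description. The other bit positions and all `toy_` names are assumptions chosen for illustration, not the values used in the STL.

```cpp
#include <type_traits>

// Hypothetical flag set; names and new bit positions are illustrative only.
enum ToyLangFlags : unsigned long long {
    toy_brk_rstr = 0x20000000ULL, // the value both flags used to share
    toy_brk_bal  = 0x40000000ULL, // assumed position: a bit of its own
    toy_dsh_rstr = 1ULL << 32,    // assumed position: needs more than 32 bits
};

// Tests a single flag in a combined flag word.
constexpr bool has_flag(unsigned long long flags, ToyLangFlags f) {
    return (flags & f) != 0;
}

// With distinct bits, setting one flag no longer implies the other.
static_assert(has_flag(toy_brk_rstr, toy_brk_rstr), "flag should be set");
static_assert(!has_flag(toy_brk_rstr, toy_brk_bal), "flags must not alias");

// The wider underlying type leaves room for flags above bit 31.
static_assert(std::is_same<std::underlying_type<ToyLangFlags>::type, unsigned long long>::value,
              "flag type should be unsigned long long");
static_assert(has_flag(toy_brk_bal | toy_dsh_rstr, toy_dsh_rstr), "high bit should be usable");
```

When two flags share a value, any code that tests one of them cannot tell which was requested. That is why the aliasing could only be removed once the parser constructor and `_Compile` could no longer be mixed across versions.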