<regex>
: Avoid generating unnecessary single-branch _N_if
nodes by muellerj2 · Pull Request #5539 · microsoft/STL (original) (raw)
The parser likes to generate single-branch _N_if
nodes in the NFA for no good reason. These nodes have a noticeable runtime cost in the matcher. This PR avoids generating these nodes.
This change has side effects:
- It strictly reduces the necessary recursion in the matcher to match a specific input string to a specific regex. This means that some strings that ran into a stack overflow or an
error_stack
exception before will match successfully after this PR. - Conversely, this means that some
_Matcher2::_Do_rep()
calls can now take the place previously occupied by recursive calls of_Matcher2::_Do_if()
._Matcher2::_Do_rep()
takes more bytes from the stack (I didn't check recently, but I remember it was about 100 bytes per call?), so there will be a very small set of input strings that will run into an actual stack overflow now instead of anerror_stack
exception.
Without further changes, there would be more side effects because some loops would get classified as simple now. However, I preempted this reclassification by adjusting _Parser2::_Calculate_loop_simplicity()
. There are two reasons for this.
- The simple loop optimization, as implemented, reduces the required amount of stack space but in exchange can square the running time when matching some input strings to a regex. I think, though, that we could take this risk because the input strings couldn't be longer than a few hundred characters anyway before running into a stack overflow, so it's unlikely to make a major runtime difference in practice when matching didn't run into a stack overflow or
error_stack
exception before, and it's more important to make more regexes resistant against stack overflow. For example, if we were to allow the reclassification, this would fix the stack overflow for the regex in : Grouping within repetition causes regex stack error #997. - The more important reason is that I'm not sure yet what the best subset of simple loops is for a non-recursive matcher, so I would like to retain more wiggle room here until the requirements become clearer while rewriting the matcher.
I added no new tests because (a) there is no obvious additional set of regexes to test and (b) this change affects the generated NFAs for all regexes without exception (all of them contained at least one single-branch _N_if
node before), so the whole existing test suite serves as test coverage for this change.
Benchmark
Benchmark | Before | After | Speedup |
---|---|---|---|
bm_lorem_search/"bibe"/2 | 31145 ns | 28878 ns | 1.08 |
bm_lorem_search/"bibe"/3 | 61384 ns | 57199 ns | 1.07 |
bm_lorem_search/"bibe"/4 | 122768 ns | 114397 ns | 1.07 |
bm_lorem_search/"(bibe)"/2 | 57812 ns | 43945 ns | 1.32 |
bm_lorem_search/"(bibe)"/3 | 141246 ns | 92072 ns | 1.53 |
bm_lorem_search/"(bibe)"/4 | 284934 ns | 187976 ns | 1.52 |
bm_lorem_search/"(bibe)+"/2 | 92072 ns | 62779 ns | 1.47 |
bm_lorem_search/"(bibe)+"/3 | 179983 ns | 122768 ns | 1.47 |
bm_lorem_search/"(bibe)+"/4 | 376607 ns | 256319 ns | 1.47 |
bm_lorem_search/"(?:bibe)+"/2 | 55804 ns | 48652 ns | 1.15 |
bm_lorem_search/"(?:bibe)+"/3 | 112305 ns | 96257 ns | 1.17 |
bm_lorem_search/"(?:bibe)+"/4 | 224609 ns | 196725 ns | 1.14 |