<regex>: Process minimum number of reps in simple loops non-recursively by muellerj2 · Pull Request #5762 · microsoft/STL
Simple loops are a subset of all loops with the following properties:
- In each trajectory in the NFA, the surrounding loops will not be repeated.
- As a corollary, simple loops can be entered at most once in each trajectory.
- The repeated pattern is branchless (no `_Do_if` node, no character class matching collating elements of various sizes).
- The repeated pattern always matches strings of the same length.
- The repeated pattern does not introduce new captures. (But I'm not sure whether I want to lift this restriction in the future, so I haven't taken advantage of this property yet.)
Properties 1-3 mean that we don't have to worry about restoring loop state: whenever we enter the loop, we can just overwrite the state because the data relates to a trajectory whose processing is complete. Moreover, the repeated pattern is branchless, so it can't happen that we try some first branch in the repeated pattern, have to backtrack, and then have to restore the loop state (especially any counters) before trying another branch in the repeated pattern.
As long as we haven't reached the minimum number of reps, we must match more repetitions, so we do not try to match the tail of the regex. Thus, there is also no branch after matching a repetition until the minimum is reached. This means that we don't have to add any stack frames with non-trivial unwinding logic that try the other branch or restore the loop state during backtracking.
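To make the "no branch until the minimum is reached" argument concrete, here is a minimal sketch of a toy matcher, not the STL implementation. The function `match_min_reps` and its parameters are hypothetical, and the repeated pattern is assumed to be a fixed literal, which is trivially branchless and of constant length.

```cpp
#include <cassert>
#include <cstddef>
#include <optional>
#include <string_view>

// Hypothetical toy model, not the STL matcher: match `min_reps` copies of the
// fixed literal `piece` starting at `pos`, using a plain loop instead of recursion.
// Returns the position after the last mandatory repetition, or nullopt on failure.
std::optional<std::size_t> match_min_reps(
    std::string_view text, std::size_t pos, std::string_view piece, std::size_t min_reps) {
    assert(pos <= text.size());
    for (std::size_t rep = 0; rep < min_reps; ++rep) {
        // Below the minimum, another repetition is mandatory, so there is no
        // alternative (such as the tail of the regex) to remember for later.
        if (text.substr(pos, piece.size()) != piece) {
            // Nothing to restore here: the caller goes back to the state it
            // saved when the loop was entered.
            return std::nullopt;
        }
        pos += piece.size();
        if (piece.empty()) {
            break; // an empty repetition stays empty, so further reps change nothing
        }
    }
    return pos;
}
```

Because the loop never records a choice point, a failure inside it only needs the state saved at loop entry, which is the situation the paragraph above describes.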
For now, this results in the following implementation to match the minimum number of reps in a simple loop:
- Matching `_N_rep` just sets up the loop state for the first repetition if at least one repetition is necessary. No unwinding logic is added to restore the loop state because the state relates to an outdated trajectory (or is just the initial state). If this isn't a simple loop or a simple one which doesn't have to match at least one rep, we essentially delegate to the old code.
- When matching `_N_end_rep`, we first check whether this is a simple loop and whether we are still processing the minimum number of reps.
  - If this isn't a simple loop, we delegate to `_Do_rep()`.
  - If this is a simple loop, but we aren't processing the minimum number of reps of a simple loop anymore, then this is a recursive `_Match_pat(_Node->_Next)` call in `_Do_rep0()`, so we have to keep doing what `_N_end_rep` did previously in this case: Set `_Next` to `nullptr` so that we exit `_Match_pat()`.
  - If we are still processing the minimum number of reps and we just completed the first repetition, we have to check whether the first repetition matched an empty string. If so, we know that this repeated pattern always matches an empty string, so we can just stop repeating the pattern and immediately try matching the tail of the regex (non-recursively).
  - If we have to do one more repetition to reach the minimum, we reset the capturing groups, increase the loop counter and set `_Next` to the start of the repeated pattern.
  - If we have reached the minimum number of repetitions, we hand off the processing to the old `_Do_rep0()` code.
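The case analysis for `_N_end_rep` can be sketched as a small decision function. This is a simplified model under stated assumptions, not the code in the PR: `EndRepAction`, `LoopState`, and `on_end_rep` are hypothetical names, and the order of the checks simply follows the order of the bullets above.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical names throughout; this models only the case analysis.
enum class EndRepAction {
    delegate_to_do_rep,  // not a simple loop: use the old code path
    exit_match_pat,      // simple loop past the minimum: the recursive call must return
    match_tail,          // first rep matched the empty string: skip the remaining reps
    repeat_pattern,      // minimum not reached yet: run the repeated pattern again
    hand_off_to_do_rep0, // minimum reached: the old code handles the remaining reps
};

struct LoopState {
    bool simple;            // the loop was classified as simple
    bool matching_min_reps; // still in the non-recursive minimum phase
    std::size_t reps_done;  // completed repetitions, including the one just finished
    std::size_t min_reps;   // minimum number of repetitions required
    std::size_t rep_start;  // input position where the latest repetition began
};

EndRepAction on_end_rep(const LoopState& st, std::size_t current_pos) {
    if (!st.simple) {
        return EndRepAction::delegate_to_do_rep;
    }
    if (!st.matching_min_reps) {
        return EndRepAction::exit_match_pat;
    }
    assert(st.reps_done >= 1); // we only get here after finishing a repetition
    if (st.reps_done == 1 && current_pos == st.rep_start) {
        // A fixed-length pattern that matched nothing once will always match nothing.
        return EndRepAction::match_tail;
    }
    if (st.reps_done < st.min_reps) {
        // In the real matcher this step would also reset capture groups and
        // increase the loop counter before jumping back to the pattern start.
        return EndRepAction::repeat_pattern;
    }
    return EndRepAction::hand_off_to_do_rep0;
}
```

Every branch returns exactly one action and none of them records a choice point, which matches the claim that the minimum phase needs no non-trivial unwinding logic.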
We don't have to run any stack unwinding logic during backtracking to process the minimum number of reps of a simple loop, but we still have to add a stack frame to save the match state at the time when the simple loop was entered. For this reason, we add a new opcode for doing nothing during stack unwinding. Since we can leave this stack frame to the standard unwinding logic in `_Match_pat()` now, we can remove all explicit `_Pop_frame()` calls in `_Do_rep0()`.
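The idea of a frame that saves match state but carries a do-nothing opcode can be illustrated with a toy frame stack. This is only a sketch under assumptions: `UnwindOp`, `Frame`, and `ToyMatcher` are hypothetical and far simpler than the matcher's real frames. The second opcode exists only to contrast with the no-op one.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical opcodes: `nothing` models the new do-nothing opcode,
// `restore_counter` stands in for any frame with real unwinding logic.
enum class UnwindOp { nothing, restore_counter };

struct Frame {
    UnwindOp op;
    std::size_t saved_pos; // match state captured when the frame was pushed
    int saved_counter;     // only consulted by restore_counter
};

struct ToyMatcher {
    std::size_t pos = 0;
    int counter = 0;
    std::vector<Frame> frames;

    // Record the current state together with the opcode to run on unwinding.
    void push(UnwindOp op) { frames.push_back({op, pos, counter}); }

    // Generic unwinding step: pop the top frame, restore the saved match
    // state, then run the opcode-specific logic (none for `nothing`).
    void unwind_one() {
        assert(!frames.empty());
        const Frame frame = frames.back();
        frames.pop_back();
        pos = frame.saved_pos;
        switch (frame.op) {
        case UnwindOp::nothing:
            break; // the frame existed only to carry the saved state
        case UnwindOp::restore_counter:
            counter = frame.saved_counter;
            break;
        }
    }
};
```

Since the generic unwinding step already pops the frame and restores the saved state, code that pushes a `nothing` frame never needs to pop it explicitly, which is the simplification the paragraph above describes for `_Do_rep0()`.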
Repetitions of simple loops didn't really increase the stack usage counter (other than some temporary increase by 1) prior to this PR, so to reduce complexity I have decided not to replicate the updates to the stack usage counter in the new logic. Even so, this PR replicates the updates to the complexity counter. (This PR also adds an update to the complexity counter to the logic of `_Disjunction_eval_alt_always`, which was missing from #5745.)