<regex>: Process minimum number of reps in simple loops non-recursively by muellerj2 · Pull Request #5762 · microsoft/STL

Towards #997 and #1528.

Simple loops are a subset of all loops with the following properties:

  1. In each trajectory in the NFA, the surrounding loops will not be repeated.
  2. As a corollary, simple loops can be entered at most once in each trajectory.
  3. The repeated pattern is branchless (no _Do_if node, no character class matching collating elements of various sizes).
  4. The repeated pattern always matches strings of the same length.
  5. The repeated pattern does not introduce new captures. (But I'm not sure whether I want to lift this restriction in the future, so I haven't taken advantage of this property yet.)
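To make properties 3-5 more concrete, here is a minimal sketch of how such a check could look on a toy pattern representation. The node kinds and the function name are hypothetical and do not correspond to the node types in <regex>. Properties 1-2 depend on the surrounding loops and are not modelled.

```cpp
#include <cstddef>
#include <optional>
#include <vector>

// Hypothetical node kinds for a toy pattern; <regex> uses its own node types.
enum class node_kind { literal_char, any_char, alternation, capture_group, collating_class };

struct node {
    node_kind kind;
};

// Returns the fixed match length of the repeated pattern if it satisfies
// properties 3-5 (branchless, fixed length, no new captures), else nullopt.
std::optional<std::size_t> simple_body_length(const std::vector<node>& body) {
    std::size_t length = 0;
    for (const node& n : body) {
        switch (n.kind) {
        case node_kind::literal_char:
        case node_kind::any_char:
            ++length; // always consumes exactly one character
            break;
        case node_kind::alternation:     // a branch (violates property 3)
        case node_kind::collating_class: // may match elements of various sizes (property 3)
        case node_kind::capture_group:   // introduces a new capture (property 5)
            return std::nullopt;
        }
    }
    return length;
}
```

In this toy model every accepted node consumes exactly one character, so property 4 holds automatically for any body that passes the check.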

Properties 1-3 mean that we don't have to worry about restoring loop state: Whenever we enter the loop, we can just overwrite the loop state, because the old data relates to a trajectory whose processing is already complete. Moreover, the repeated pattern is branchless, so we can never end up trying a first branch in the repeated pattern, backtracking, and then having to restore the loop state (especially any counters) before trying another branch in the repeated pattern.

As long as we haven't reached the minimum number of reps, we must match more repetitions, so we do not try to match the tail of the regex. Thus, there is also no branch after matching a repetition until the minimum is reached. This means that we don't have to add any stack frames with non-trivial unwinding logic that try the other branch or restore the loop state during backtracking.
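The idea can be sketched with a toy matcher. This is not the code in <regex>: the simple_loop struct and match_min_reps function are hypothetical, and the repeated pattern is assumed to be a plain literal so that it is branchless and of fixed length.

```cpp
#include <cassert>
#include <cstddef>
#include <string_view>

// Hypothetical toy model: the repeated pattern is a fixed literal, so it is
// branchless and always consumes the same number of characters.
struct simple_loop {
    std::string_view body; // repeated pattern
    std::size_t min_reps;  // minimum number of repetitions
};

// Consumes the mandatory repetitions starting at `pos` in a plain loop.
// No frame is pushed per repetition: if any repetition fails, the whole
// trajectory fails and there is no loop state to restore.
// Returns true and advances `pos` on success; leaves `pos` unchanged on failure.
bool match_min_reps(const simple_loop& loop, std::string_view input, std::size_t& pos) {
    assert(!loop.body.empty());
    assert(pos <= input.size());
    std::size_t cur = pos;
    for (std::size_t rep = 0; rep < loop.min_reps; ++rep) {
        if (input.substr(cur, loop.body.size()) != loop.body) {
            return false;
        }
        cur += loop.body.size();
    }
    pos = cur;
    return true;
}
```

The tail of the regex is only tried after this function succeeds, which is why the loop body contains no branch point.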

For now, this results in the following implementation to match the minimum number of reps in a simple loop:

We don't have to run any stack unwinding logic during backtracking to process the minimum number of reps of a simple loop, but we still have to add a stack frame to save the match state at the time when the simple loop was entered. For this reason, we add a new opcode for doing nothing during stack unwinding. Since we can leave this stack frame to the standard unwinding logic in _Match_pat() now, we can remove all explicit _Pop_frame() calls in _Do_rep0().
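As a rough illustration of a frame that does nothing during unwinding, consider the following sketch. The opcode names, the frame layout and the unwind function are hypothetical and much simpler than the matcher's actual stack frames.

```cpp
#include <cstddef>
#include <optional>
#include <vector>

// Hypothetical opcodes; the names in <regex> differ.
enum class unwind_op {
    nothing,        // frame only stored state; unwinding just discards it
    try_alternative // frame marks a branch point that must be resumed
};

struct frame {
    unwind_op op;
    std::size_t saved_pos; // input position when the frame was pushed
};

// Pops frames until one needs real work. Returns the position to resume from,
// or std::nullopt when the stack is exhausted (the match attempt has failed).
std::optional<std::size_t> unwind(std::vector<frame>& stack) {
    while (!stack.empty()) {
        const frame top = stack.back();
        stack.pop_back();
        switch (top.op) {
        case unwind_op::nothing:
            break; // no alternative to try, no loop state to restore
        case unwind_op::try_alternative:
            return top.saved_pos;
        }
    }
    return std::nullopt;
}
```

Because the generic loop already pops the frame, the code that pushed a `nothing` frame needs no explicit pop of its own on the failure path.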

Repetitions of simple loops didn't really increase the stack usage counter (other than some temporary increase by 1) prior to this PR, so to reduce complexity I have decided not to replicate the updates to the stack usage counter in the new logic. This PR does replicate the updates to the complexity counter, though. (This PR also adds an update to the complexity counter to the logic of _Disjunction_eval_alt_always which was missing from #5745.)
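A minimal sketch of the counter handling, assuming that an exhausted complexity budget is reported as std::regex_constants::error_complexity. The complexity_budget type and charge_min_reps function are hypothetical; the real counter and its limit live inside the matcher.

```cpp
#include <cstddef>
#include <regex>

// Hypothetical budget; the real limit and counter names in <regex> differ.
struct complexity_budget {
    std::size_t remaining;

    // Charges one unit of work and reports error_complexity once the budget is used up.
    void charge() {
        if (remaining == 0) {
            throw std::regex_error(std::regex_constants::error_complexity);
        }
        --remaining;
    }
};

// Runs the mandatory repetitions in a plain loop. Each repetition still counts
// as work even though no stack frame is pushed for it, so only the complexity
// counter is updated and there is no stack usage counter to adjust.
void charge_min_reps(complexity_budget& budget, std::size_t min_reps) {
    for (std::size_t rep = 0; rep < min_reps; ++rep) {
        budget.charge();
    }
}
```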