<regex>: Add multiline option and make non-multiline mode the default by muellerj2 · Pull Request #5535 · microsoft/STL
Resolves #73 (also tracked by DevCom-268592 / VSO-629739) and implements LWG-2503. Arguably also resolves DevCom-436138 to the degree that it is reasonable (namely, when the anchor appears in the regex before it starts branching); see the benchmark. Unblocks four libcxx tests.
This PR also aligns the multiline mode with ECMAScript's specification. The anchors now match at any of ECMAScript's line terminators: carriage returns, line feeds, line separators and paragraph separators. Before, the anchors only matched at line feeds.
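As a rough illustration of the line-terminator rule described above, the following sketch spells out the four ECMAScript line terminators and where the anchors may match in multiline mode. The helper names (`is_ecmascript_line_terminator`, `caret_matches_at`, `dollar_matches_at`) are hypothetical and are not the STL's internal code. The sketch also ignores match flags such as `match_not_bol`.

```cpp
#include <cassert>
#include <cstddef>
#include <string_view>

// Hypothetical helper (not the STL's internal code): true for the four
// line terminators named by the ECMAScript specification.
constexpr bool is_ecmascript_line_terminator(char32_t ch) noexcept {
    return ch == U'\n'      // U+000A LINE FEED
        || ch == U'\r'      // U+000D CARRIAGE RETURN
        || ch == U'\u2028'  // U+2028 LINE SEPARATOR
        || ch == U'\u2029'; // U+2029 PARAGRAPH SEPARATOR
}

// In multiline mode, '^' matches at the start of the text or directly
// after a line terminator.
constexpr bool caret_matches_at(std::u32string_view text, std::size_t pos) noexcept {
    return pos == 0
        || (pos <= text.size() && is_ecmascript_line_terminator(text[pos - 1]));
}

// In multiline mode, '$' matches at the end of the text or directly
// before a line terminator.
constexpr bool dollar_matches_at(std::u32string_view text, std::size_t pos) noexcept {
    return pos == text.size()
        || (pos < text.size() && is_ecmascript_line_terminator(text[pos]));
}

// Small usage example: a carriage return now starts a new line for '^'.
inline void demo_line_terminators() {
    assert(caret_matches_at(U"a\rb", 2));   // directly after '\r'
    assert(!caret_matches_at(U"a\rb", 1));  // directly after 'a'
    assert(dollar_matches_at(U"a\nb", 1));  // directly before '\n'
}
```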
The PR provides _REGEX_MAKE_MULTILINE_MODE_DEFAULT as an escape hatch that restores multiline mode as the default; when it is defined, non-multiline mode is not available.
For POSIX grammars, the new multiline option has no effect. While I find this unfortunate, this behavior appears to have been specified in [re.synopt].
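To make the user-visible difference concrete, here is a minimal sketch that uses only the standard `std::regex` interface (C++17 for `std::regex::multiline`). The helper `found` is hypothetical. The assertions describe the behavior expected from a library that implements LWG-2503 without the escape hatch defined; an older implementation that always treats `^` as multiline would fail the first one.

```cpp
#include <cassert>
#include <regex>
#include <string>

// Hypothetical helper: compiles the pattern with the given syntax flags and
// reports whether it matches anywhere in the text.
inline bool found(const std::string& text, const char* pattern,
                  std::regex::flag_type flags) {
    const std::regex re(pattern, flags);
    return std::regex_search(text, re);
}

inline void demo_multiline_option() {
    const std::string text = "alpha\nbeta";

    // Non-multiline mode (the new default): '^' only matches at the very
    // beginning of the text, so "beta" on the second line is not found.
    assert(!found(text, "^beta", std::regex::ECMAScript));
    assert(found(text, "^alpha", std::regex::ECMAScript));

    // With the multiline option, '^' also matches after a line terminator.
    assert(found(text, "^beta", std::regex::ECMAScript | std::regex::multiline));
}
```

For a POSIX grammar such as `std::regex::extended`, adding `std::regex::multiline` would be expected to behave like the first case, in line with the [re.synopt] wording mentioned above.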
To simplify the logic in the matcher and avoid some preprocessor #ifdefs, the matcher's internal copy of the regex syntax flags _Sflags is mutated before matching starts:
- The multiline flag is set for all grammars when the escape hatch is defined.
- The multiline flag is cleared for POSIX grammars when the escape hatch is not defined.
These mutations ensure that multiline mode is enabled if and only if the multiline flag is set in _Sflags.
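The two mutations can be sketched as a single normalization step. This is only an illustration under stated assumptions: the flag values, `normalize_flags`, and `multiline_enabled` are hypothetical stand-ins, and the escape hatch is modeled as a runtime `bool` rather than the preprocessor macro the PR actually uses.

```cpp
#include <cassert>

// Hypothetical stand-ins for the regex syntax flags (not the STL's values).
enum SyntaxFlags : unsigned {
    kECMAScript = 1u << 0,
    kExtended   = 1u << 1, // represents any POSIX grammar in this sketch
    kMultiline  = 1u << 2,
};

// Returns the matcher's private copy of the flags after the mutation.
constexpr unsigned normalize_flags(unsigned flags, bool escape_hatch_defined) noexcept {
    if (escape_hatch_defined) {
        // Escape hatch: multiline mode for every grammar.
        return flags | kMultiline;
    }
    // No escape hatch: POSIX grammars ignore the multiline option.
    const bool is_posix = (flags & kExtended) != 0;
    return is_posix ? (flags & ~kMultiline) : flags;
}

// After normalization, this single test decides the anchor behavior.
constexpr bool multiline_enabled(unsigned flags) noexcept {
    return (flags & kMultiline) != 0;
}

inline void demo_flag_normalization() {
    assert(multiline_enabled(normalize_flags(kExtended, true)));
    assert(!multiline_enabled(normalize_flags(kExtended | kMultiline, false)));
    assert(multiline_enabled(normalize_flags(kECMAScript | kMultiline, false)));
}
```

Normalizing once up front means the matching code only has to test one bit, which is the simplification the paragraph above describes.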
I see a potential concern with the implementation in this PR: Even if the escape hatch is defined, the matcher still changes behavior, because anchors now match not just at line feeds but at all ECMAScript line terminators. It can reasonably be argued that the behavior should be completely unchanged if the escape hatch is defined. Even so, I opted to submit the implementation with ECMAScript-conforming line terminators in this PR first because this simplifies the implementation considerably.
Benchmark
Run only for the pattern "^bibe", to show that this resolves DevCom-436138.
| Benchmark | Time | CPU | Iterations |
|---|---|---|---|
| bm_lorem_search/"^bibe"/2 | 56.4 ns | 57.8 ns | 10000000 |
| bm_lorem_search/"^bibe"/3 | 55.9 ns | 54.4 ns | 11200000 |
| bm_lorem_search/"^bibe"/4 | 55.5 ns | 56.2 ns | 10000000 |