MatchKind in regex_automata - Rust (original) (raw)


#[non_exhaustive]

pub enum MatchKind {
    All,
    LeftmostFirst,
}

Expand description

The kind of match semantics to use for a regex pattern.

The default match kind is LeftmostFirst, and this corresponds to the match semantics used by most backtracking engines, such as Perl.

§Leftmost first or “preference order” match semantics

Leftmost-first semantics determine which match to report when there are multiple paths through a regex that match at the same position. The tie is essentially broken by how a backtracker would behave. For example, consider running the regex foofoofoo|foofoo|foo on the haystack foofoo. In this case, both the foofoo and foo branches match at position 0. So should the end of the match be 3 or 6?

A backtracker will conceptually work by trying foofoofoo and failing. Then it will try foofoo, find the match and stop there. Thus, the leftmost-first match position is 6. This is called “leftmost-first” or “preference order” because the order of the branches as written in the regex pattern is what determines how to break the tie.

(Note that leftmost-longest match semantics, which break ties by always taking the longest matching string, are not currently supported by this crate. These match semantics tend to be found in POSIX regex engines.)

This example shows how leftmost-first semantics work, and how it even applies to multi-pattern regexes:

use regex_automata::{
    nfa::thompson::pikevm::PikeVM,
    Match,
};

let re = PikeVM::new_many(&[
    r"foofoofoo",
    r"foofoo",
    r"foo",
])?;
let mut cache = re.create_cache();
let got: Vec<Match> = re.find_iter(&mut cache, "foofoo").collect();
let expected = vec![Match::must(1, 0..6)];
assert_eq!(expected, got);

§All matches

The All match semantics report any and all matches, and generally will attempt to match as much as possible. It doesn’t respect any sort of match priority at all, so things like non-greedy matching don’t work in this mode.

The fact that non-greedy matching doesn’t work generally makes most forms of unanchored non-overlapping searches have unintuitive behavior. Namely, unanchored searches behave as if there is a (?s-u:.)*? prefix at the beginning of the pattern, which is specifically non-greedy. Since it will be treated as greedy in All match semantics, this generally means that it will first attempt to consume all of the haystack and is likely to wind up skipping matches.

Generally speaking, All should only be used in two circumstances:

This example demonstrates the counter-intuitive behavior of All semantics when using a standard leftmost unanchored search:

use regex_automata::{
    nfa::thompson::pikevm::PikeVM,
    Match, MatchKind,
};

let re = PikeVM::builder()
    .configure(PikeVM::config().match_kind(MatchKind::All))
    .build("foo")?;
let hay = "first foo second foo wat";
let mut cache = re.create_cache();
let got: Vec<Match> = re.find_iter(&mut cache, hay).collect();
// Notice that it completely skips the first 'foo'!
let expected = vec![Match::must(0, 17..20)];
assert_eq!(expected, got);

This second example shows how All semantics are useful for an overlapping search. Note that we use lower level lazy DFA APIs here since the NFA engines only currently support a very limited form of overlapping search.

use regex_automata::{
    hybrid::dfa::{DFA, OverlappingState},
    HalfMatch, Input, MatchKind,
};

let re = DFA::builder()
    // If we didn't set 'All' semantics here, then the regex would only
    // match 'foo' at offset 3 and nothing else. Why? Because the state
    // machine implements preference order and knows that the 'foofoo' and
    // 'foofoofoo' branches can never match since 'foo' will always match
    // when they match and take priority.
    .configure(DFA::config().match_kind(MatchKind::All))
    .build(r"foo|foofoo|foofoofoo")?;
let mut cache = re.create_cache();
let mut state = OverlappingState::start();
let input = Input::new("foofoofoo");
let mut got = vec![];
loop {
    re.try_search_overlapping_fwd(&mut cache, &input, &mut state)?;
    let m = match state.get_match() {
        None => break,
        Some(m) => m,
    };
    got.push(m);
}
let expected = vec![
    HalfMatch::must(0, 3),
    HalfMatch::must(0, 6),
    HalfMatch::must(0, 9),
];
assert_eq!(expected, got);

This enum is marked as non-exhaustive

Non-exhaustive enums could have additional variants added in future. Therefore, when matching against variants of non-exhaustive enums, an extra wildcard arm must be added to account for any future variants.

§

Report all possible matches.

§

Report only the leftmost matches. When multiple leftmost matches exist, report the match corresponding to the part of the regex that appears first in the syntax.

§

§

§

§

§

§