PikeVM in regex_automata::nfa::thompson::pikevm - Rust

pub struct PikeVM { /* private fields */ }

Available on crate features nfa-thompson and nfa-pikevm only.

A virtual machine for executing regex searches with capturing groups.

§Infallible APIs

Unlike most other regex engines in this crate, a PikeVM never returns an error at search time. It supports all Anchored configurations, never quits and works on haystacks of arbitrary length.

There are two caveats to mention though:

If an invalid pattern ID is given to a search via Anchored::Pattern, then the PikeVM will report “no match.” This is consistent with all other regex engines in this crate.
When using PikeVM::which_overlapping_matches with a PatternSet that has insufficient capacity to store all valid pattern IDs, a match for a PatternID that cannot be inserted is silently dropped, as if it did not match.

§Advice

The PikeVM is generally the most “powerful” regex engine in this crate. “Powerful” in this context means that it can handle any regular expression that is parseable by regex-syntax and any size haystack. Regrettably, the PikeVM is also often the slowest regex engine in practice. This results in an annoying situation where one generally tries to pick any other regex engine (or perhaps none at all) before being forced to fall back to a PikeVM.

For example, a common strategy for dealing with capturing groups is to actually look for the overall match of the regex using a faster regex engine, like a lazy DFA. Once the overall match is found, one can then run the PikeVM on just the match span to find the spans of the capturing groups. In this way, the faster regex engine does the majority of the work, while the PikeVM only lends its power in a more limited role.

Unfortunately, this isn’t always possible because the faster regex engines don’t support all of the regex features in regex-syntax. This notably includes (and is currently limited to) Unicode word boundaries. So if your pattern has Unicode word boundaries, you typically can’t use a DFA-based regex engine at all (unless you enable heuristic support for it). (The one-pass DFA can handle Unicode word boundaries for anchored searches only, but in a cruel sort of joke, many Unicode features tend to result in making the regex not one-pass.)

§Example

This example shows that the PikeVM implements Unicode word boundaries correctly by default.

use regex_automata::{nfa::thompson::pikevm::PikeVM, Match};

let re = PikeVM::new(r"\b\w+\b")?;
let mut cache = re.create_cache();

let mut it = re.find_iter(&mut cache, "Шерлок Холмс");
assert_eq!(Some(Match::must(0, 0..12)), it.next());
assert_eq!(Some(Match::must(0, 13..23)), it.next());
assert_eq!(None, it.next());
