Regex in regex_automata::meta - Rust

pub struct Regex { /* private fields */ }

Available on crate feature meta only.

A regex matcher that works by composing several other regex matchers automatically.

In effect, a meta regex papers over a lot of the quirks or performance problems in each of the regex engines in this crate. Its goal is to provide an infallible and simple API that “just does the right thing” in the common case.

A meta regex is the implementation of a Regex in the regex crate. Indeed, the regex crate API is essentially just a light wrapper over this type. This includes the regex crate’s RegexSet API!

§Composition

This is called a “meta” matcher precisely because it uses other regex matchers to provide a convenient high level regex API. Here are some examples of how other regex matchers are composed:

When calling Regex::captures, instead of immediately running a slower but more capable regex engine like the PikeVM, the meta regex engine will usually first look for the bounds of a match with a higher throughput regex engine like a lazy DFA. Only when a match is found is a slower engine like the PikeVM used to find the matching span for each capture group.
While higher throughput engines like the lazy DFA cannot handle Unicode word boundaries in general, they can still be used on pure ASCII haystacks by pretending that Unicode word boundaries are just plain ASCII word boundaries. However, if a haystack is not ASCII, the meta regex engine will automatically switch to a (possibly slower) regex engine that supports Unicode word boundaries in general.
In some cases where a regex pattern is just a simple literal or a small set of literals, an actual regex engine won’t be used at all. Instead, substring or multi-substring search algorithms will be employed.

There are many other forms of composition happening too, but the above should give a general idea. In particular, it may perhaps be surprising that multiple regex engines might get executed for a single search. That is, the decision of what regex engine to use is not just based on the pattern, but also based on the dynamic execution of the search itself.
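
The ASCII word boundary fallback described above can be modeled with a small sketch that uses only the standard library. Nothing in it is part of regex_automata: Engine, find_word, ascii_word, unicode_word and meta_find are hypothetical names, char::is_alphanumeric only approximates Unicode's notion of a word character, and the haystack is inspected up front for simplicity, whereas the text describes a decision made during the search. The word being searched for is assumed to be non-empty and to consist of word characters.

```rust
/// Which hypothetical engine ended up answering the search.
#[derive(Debug, PartialEq)]
pub enum Engine {
    FastAscii,
    SlowUnicode,
}

/// ASCII-only definition of a word character.
pub fn ascii_word(c: char) -> bool {
    c.is_ascii_alphanumeric() || c == '_'
}

/// Rough approximation of a Unicode-aware word character.
pub fn unicode_word(c: char) -> bool {
    c.is_alphanumeric() || c == '_'
}

/// Finds the leftmost occurrence of `word` that has no word character on
/// either side, where `is_word` decides what counts as a word character.
pub fn find_word(
    hay: &str,
    word: &str,
    is_word: impl Fn(char) -> bool,
) -> Option<(usize, usize)> {
    for (start, _) in hay.char_indices() {
        if !hay[start..].starts_with(word) {
            continue;
        }
        let end = start + word.len();
        let before = hay[..start].chars().next_back().map_or(false, &is_word);
        let after = hay[end..].chars().next().map_or(false, &is_word);
        if !before && !after {
            return Some((start, end));
        }
    }
    None
}

/// Picks an engine based on the haystack, not just on the pattern.
pub fn meta_find(hay: &str, word: &str) -> (Engine, Option<(usize, usize)>) {
    if hay.is_ascii() {
        // ASCII rules and Unicode rules agree on pure ASCII input.
        (Engine::FastAscii, find_word(hay, word, ascii_word))
    } else {
        // ASCII rules could report a wrong match here, so switch engines.
        (Engine::SlowUnicode, find_word(hay, word, unicode_word))
    }
}
```

In this model, applying the ASCII rules to a haystack such as "δfoo foo" would report the first "foo" as a whole word, because "δ" is not an ASCII word character. The switch to the Unicode-aware path is what avoids that incorrect result.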

The primary reason for this composition is performance. The fundamental tension is that the faster engines tend to be less capable, and the more capable engines tend to be slower.

Note that the forms of composition that are allowed are determined by compile time crate features and configuration. For example, if the hybrid feature isn’t enabled, or if Config::hybrid has been disabled, then the meta regex engine will never use a lazy DFA.

§Synchronization and cloning

Most of the regex engines in this crate require some kind of mutable “scratch” space to read and write from while performing a search. Since a meta regex composes these regex engines, a meta regex also requires mutable scratch space. This scratch space is called a Cache.

Most regex engines also usually have a read-only component, typically a Thompson NFA.

In order to make the Regex API convenient, most of the routines hide the fact that a Cache is needed at all. To achieve this, a memory pool is used internally to retrieve Cache values in a thread safe way that also permits reuse. This in turn implies that every such search call requires some form of synchronization. Usually this synchronization is fast enough to not notice, but in some cases, it can be a bottleneck. This typically occurs when all of the following are true:

The same Regex is shared across multiple threads simultaneously, usually via a util::lazy::Lazy or something similar from the once_cell or lazy_static crates.
The primary unit of work in each thread is a regex search.
Searches are run on very short haystacks.

This particular case can lead to high contention on the pool used by a Regex internally, which can in turn noticeably increase latency. This cost can be mitigated in one of the following ways:

Use a distinct copy of a Regex in each thread, usually by cloning it. Cloning a Regex does not do a deep copy of its read-only component. But it does lead to each Regex having its own memory pool, which in turn eliminates the problem of contention. In general, this technique should not result in any additional memory usage when compared to sharing the same Regex across multiple threads simultaneously.
Use lower level APIs, like Regex::search_with, which permit passing a Cache explicitly. In this case, it is up to you to determine how best to provide a Cache. For example, you might put a Cache in thread-local storage if your use case allows for it.

Overall, this is an issue that happens rarely in practice, but it can happen.
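
To make the first mitigation concrete, here is a toy model of the arrangement described in this section, written with only the standard library. It is a sketch under stated assumptions, not the crate's actual implementation: ToyRegex, Program, Cache and count_matches_per_thread are hypothetical names, the "regex" is just a literal substring, and the pool is a plain Mutex<Vec<Cache>>. It shows why cloning is cheap (the read-only part sits behind an Arc) while still giving each clone a pool that no other thread touches.

```rust
use std::sync::{Arc, Mutex};
use std::thread;

/// Stand-in for the read-only component (for example, a compiled NFA).
struct Program {
    literal: String,
}

/// Stand-in for the mutable scratch space a search needs.
#[derive(Default)]
struct Cache {
    starts: Vec<usize>,
}

/// Toy model: a shared read-only program plus a private pool of caches.
pub struct ToyRegex {
    program: Arc<Program>,
    pool: Mutex<Vec<Cache>>,
}

impl ToyRegex {
    pub fn new(literal: &str) -> ToyRegex {
        ToyRegex {
            program: Arc::new(Program { literal: literal.to_string() }),
            pool: Mutex::new(Vec::new()),
        }
    }

    /// Convenience search: borrows a cache from the pool, which takes a lock.
    pub fn is_match(&self, hay: &str) -> bool {
        let mut cache = self.pool.lock().unwrap().pop().unwrap_or_default();
        cache.starts.clear();
        let literal = self.program.literal.as_str();
        cache.starts.extend(hay.match_indices(literal).map(|(i, _)| i));
        let matched = !cache.starts.is_empty();
        // Return the cache so that a later search can reuse its allocation.
        self.pool.lock().unwrap().push(cache);
        matched
    }

    /// True when both values point at the same read-only program.
    pub fn shares_program_with(&self, other: &ToyRegex) -> bool {
        Arc::ptr_eq(&self.program, &other.program)
    }
}

impl Clone for ToyRegex {
    /// Shallow clone: the program is shared, the pool is brand new.
    fn clone(&self) -> ToyRegex {
        ToyRegex {
            program: Arc::clone(&self.program),
            pool: Mutex::new(Vec::new()),
        }
    }
}

/// Gives every thread its own clone, so no two threads share a pool.
pub fn count_matches_per_thread(re: &ToyRegex, haystacks: Vec<String>) -> usize {
    let handles: Vec<thread::JoinHandle<bool>> = haystacks
        .into_iter()
        .map(|hay| {
            let re = re.clone();
            thread::spawn(move || re.is_match(&hay))
        })
        .collect();
    handles
        .into_iter()
        .map(|handle| handle.join().unwrap())
        .filter(|&matched| matched)
        .count()
}
```

Each thread in count_matches_per_thread still takes a lock inside is_match, but the lock belongs to that thread's own clone and is therefore never contended.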

§Warning: spin-locks may be used in alloc-only mode

When this crate is built without the std feature and the high level APIs on a Regex are used, then a spin-lock will be used to synchronize access to an internal pool of Cache values. This may be undesirable because a spin-lock is effectively impossible to implement correctly in user space. That is, more concretely, the spin-lock could result in a deadlock.

If you want to avoid the use of spin-locks when the std feature is disabled, then you must use APIs that accept a Cache value explicitly, such as Regex::search_with.
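
The calling pattern behind explicit Cache passing can be sketched without any lock at all. The types below are hypothetical stand-ins (ToyRegex, ToyCache and count_matching_lines are not part of regex_automata, and the matcher only handles a literal substring). The point of the sketch is the ownership structure: the matcher is used through a shared reference, while all mutable state lives in a cache that the caller creates once and lends to each search.

```rust
/// Hypothetical scratch space owned by the caller.
#[derive(Debug, Default)]
pub struct ToyCache {
    /// Candidate start offsets, reused across searches.
    starts: Vec<usize>,
}

/// Hypothetical matcher for a single literal; it holds no pool and no lock.
pub struct ToyRegex {
    literal: String,
}

impl ToyRegex {
    pub fn new(literal: &str) -> ToyRegex {
        ToyRegex { literal: literal.to_string() }
    }

    /// Creates scratch space suitable for this matcher.
    pub fn create_cache(&self) -> ToyCache {
        ToyCache::default()
    }

    /// Returns the span of the leftmost match, using only `cache` for
    /// mutable state, so no synchronization is needed.
    pub fn search_with(&self, cache: &mut ToyCache, hay: &str) -> Option<(usize, usize)> {
        cache.starts.clear();
        cache
            .starts
            .extend(hay.match_indices(self.literal.as_str()).map(|(i, _)| i));
        cache
            .starts
            .first()
            .map(|&start| (start, start + self.literal.len()))
    }
}

/// Runs many searches while reusing a single caller-owned cache.
pub fn count_matching_lines(re: &ToyRegex, lines: &[&str]) -> usize {
    let mut cache = re.create_cache();
    let mut count = 0;
    for line in lines {
        if re.search_with(&mut cache, line).is_some() {
            count += 1;
        }
    }
    count
}
```

Because the borrow checker guarantees exclusive access to the cache for the duration of each call, this shape needs neither a pool nor a spin-lock, which is why it remains usable when only alloc is available.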

§Example

use regex_automata::meta::Regex;

let re = Regex::new(r"^[0-9]{4}-[0-9]{2}-[0-9]{2}$")?;
assert!(re.is_match("2010-03-14"));

§Example: anchored search

This example shows how to use Input::anchored to run an anchored search, even when the regex pattern itself isn’t anchored. An anchored search guarantees that if a match is found, then the start offset of the match corresponds to the offset at which the search was started.

use regex_automata::{meta::Regex, Anchored, Input, Match};

let re = Regex::new(r"\bfoo\b")?;
let input = Input::new("xx foo xx").range(3..).anchored(Anchored::Yes);
// The offsets are in terms of the original haystack.
assert_eq!(Some(Match::must(0, 3..6)), re.find(input));

// Notice that no match occurs here, because \b still takes the
// surrounding context into account, even if it means looking back
// before the start of your search.
let hay = "xxfoo xx";
let input = Input::new(hay).range(2..).anchored(Anchored::Yes);
assert_eq!(None, re.find(input));
// Indeed, you cannot achieve the above by simply slicing the
// haystack itself, since the regex engine can't see the
// surrounding context. This is why 'Input' permits setting
// the bounds of a search!
let input = Input::new(&hay[2..]).anchored(Anchored::Yes);
// WRONG!
assert_eq!(Some(Match::must(0, 0..3)), re.find(input));

§Example: earliest search

This example shows how to use Input::earliest to run a search that might stop before finding the typical leftmost match.

use regex_automata::{meta::Regex, Input, Match};

let re = Regex::new(r"[a-z]{3}|b")?;
let input = Input::new("abc").earliest(true);
assert_eq!(Some(Match::must(0, 1..2)), re.find(input));

// Note that "earliest" isn't really a match semantic unto itself.
// Instead, it is merely an instruction to whatever regex engine
// gets used internally to quit as soon as it can. For example,
// this regex uses a different search technique, and winds up
// producing a different (but valid) match!
let re = Regex::new(r"abc|b")?;
let input = Input::new("abc").earliest(true);
assert_eq!(Some(Match::must(0, 0..3)), re.find(input));

§Example: change the line terminator

This example shows how to enable multi-line mode by default and change the line terminator to the NUL byte:

use regex_automata::{meta::Regex, util::syntax, Match};

let re = Regex::builder()
    .syntax(syntax::Config::new().multi_line(true))
    .configure(Regex::config().line_terminator(b'\x00'))
    .build(r"^foo$")?;
let hay = "\x00foo\x00";
assert_eq!(Some(Match::must(0, 1..4)), re.find(hay));

Convenience constructors for a Regex using the default configuration.
