Input in regex_automata - Rust
pub struct Input<'h> { /* private fields */ }
The parameters for a regex search including the haystack to search.
It turns out that regex searches have a few parameters, and those parameters have defaults that work in the vast majority of cases. This Input type exists to make that common case seamless while also providing an avenue for changing the parameters of a search. In particular, this type enables doing so without a combinatorial explosion of different methods and/or superfluous parameters in the common cases.
An Input permits configuring the following things:
- Whether to search only a substring of a haystack, while taking the broader context into account for resolving look-around assertions.
- Whether to search for all patterns in a regex, or to only search for one pattern in particular.
- Whether to perform an anchored or unanchored search.
- Whether to report a match as early as possible.
All of these parameters, except for the haystack, have sensible default values. This means that the minimal search configuration is simply a call to Input::new with your haystack. Setting any other parameter is optional.
Moreover, for any H that implements AsRef<[u8]>, there exists a From<H> for Input implementation. This is useful because many of the search APIs in this crate accept an Into<Input>. This means you can provide strings or byte strings to these routines directly, and they’ll automatically get converted into an Input for you.
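The conversion pattern described above can be sketched with the standard library alone. This is a minimal model, not the crate's code: SearchInput and searchable_len are hypothetical names invented for illustration, and the real Input carries more parameters than a haystack and a span.

```rust
/// Hypothetical stand-in for a search input: a borrowed haystack plus a span.
#[derive(Debug, PartialEq)]
struct SearchInput<'h> {
    haystack: &'h [u8],
    start: usize,
    end: usize,
}

// Anything that can be viewed as bytes converts into a `SearchInput`
// whose span covers the entire haystack (the default).
impl<'h, H: ?Sized + AsRef<[u8]>> From<&'h H> for SearchInput<'h> {
    fn from(haystack: &'h H) -> SearchInput<'h> {
        let bytes = haystack.as_ref();
        SearchInput { haystack: bytes, start: 0, end: bytes.len() }
    }
}

/// Hypothetical search routine. It accepts anything convertible into a
/// `SearchInput` and returns the number of bytes it would search.
fn searchable_len<'h, I: Into<SearchInput<'h>>>(input: I) -> usize {
    let input = input.into();
    input.haystack[input.start..input.end].len()
}
```

With this shape, a caller can pass "foobar", b"foobar", or a reference to a String to searchable_len without constructing the input by hand, and can still pass a fully configured SearchInput when the defaults are not enough.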
The lifetime parameter 'h refers to the lifetime of the haystack.
§Organization
The API of Input is split into a few different parts:
- A builder-like API that transforms an Input by value. Examples: Input::span and Input::anchored.
- A setter API that permits mutating parameters in place. Examples: Input::set_span and Input::set_anchored.
- A getter API that permits retrieving any of the search parameters. Examples: Input::get_span and Input::get_anchored.
- A few convenience getter routines that don’t conform to the above naming pattern due to how common they are. Examples: Input::haystack, Input::start and Input::end.
- Miscellaneous predicates and other helper routines that are useful in some contexts. Examples: Input::is_char_boundary.
An Input exposes so much because it is meant to be used by both callers of regex engines and implementors of regex engines. A constraining factor is that regex engines should accept a &Input as their lowest level API, which means that implementors should only use the “getter” APIs of an Input.
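The split between builder, setter, and getter APIs can be illustrated with a small std-only model. MiniInput and engine_sees are hypothetical names used only for this sketch; the point is the shape of the three method styles and why a routine that receives a shared reference is limited to the getters.

```rust
/// Hypothetical miniature of a search configuration.
#[derive(Clone, Debug)]
struct MiniInput<'h> {
    haystack: &'h [u8],
    earliest: bool,
}

impl<'h> MiniInput<'h> {
    fn new(haystack: &'h [u8]) -> MiniInput<'h> {
        MiniInput { haystack, earliest: false }
    }

    /// Builder style: consumes the value and returns the updated one.
    fn earliest(mut self, yes: bool) -> MiniInput<'h> {
        self.set_earliest(yes);
        self
    }

    /// Setter style: mutates the configuration in place.
    fn set_earliest(&mut self, yes: bool) {
        self.earliest = yes;
    }

    /// Getter style: reads a parameter through a shared reference.
    fn get_earliest(&self) -> bool {
        self.earliest
    }

    /// Convenience getter that skips the `get_` prefix.
    fn haystack(&self) -> &'h [u8] {
        self.haystack
    }
}

/// Hypothetical engine entry point. It only receives `&MiniInput`, so it
/// can only call getters.
fn engine_sees(input: &MiniInput<'_>) -> (usize, bool) {
    (input.haystack().len(), input.get_earliest())
}
```

The builder form reads well when a configuration is assembled in one expression, while the setter form suits loops that reuse one value and adjust it between searches.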
§Valid bounds and search termination
An Input permits setting the bounds of a search via either Input::span or Input::range. The bounds set must be valid, or else a panic will occur. Bounds are valid if and only if:
- The bounds represent a valid range into the input’s haystack.
- Or the end bound is a valid ending bound for the haystack and the start bound is exactly one greater than the end bound.
In the latter case, Input::is_done will return true, which indicates that any search receiving such an input should immediately return with no match.
Note that while Input is used for reverse searches in this crate, the Input::is_done predicate assumes a forward search. Because unsigned offsets are used internally, there is no way to tell from only the offsets whether a reverse search is done or not.
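The validity rule above can be written down as a small predicate. This is a sketch under the stated rule only: bounds_are_valid and is_done are hypothetical free functions over plain offsets, not the crate's internals.

```rust
/// Hypothetical helper: are `start..end` acceptable bounds for a haystack
/// of length `haystack_len`?
fn bounds_are_valid(haystack_len: usize, start: usize, end: usize) -> bool {
    // The end must always be a valid exclusive end bound for the haystack.
    // The start is either within the range (`start <= end`) or exactly one
    // past the end, which is the special "search is done" state.
    end <= haystack_len && (start <= end || start - 1 == end)
}

/// Forward-search termination: the start has moved past the end.
fn is_done(start: usize, end: usize) -> bool {
    start > end
}
```

For a haystack of length 6, the bounds 6..6 are valid and describe an empty search that may still report an empty match, while a start of 7 with an end of 6 is valid only as the terminal state. A start of 8 with an end of 6 is rejected.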
§Regex engine support
Any regex engine accepting an Input must support at least the following things:
- Searching a &[u8] for matches.
- Searching a substring of &[u8] for a match, such that any match reported must appear entirely within that substring.
- For a forwards search, a match should never be reported when Input::is_done returns true. (For reverse searches, termination should be handled outside of Input.)
Supporting other aspects of an Input is optional, but regex engines should handle aspects they don’t support gracefully. How this is done is generally up to the regex engine. For example, this crate generally treats unsupported anchored modes as an error to report, but for simplicity, in the meta regex engine, trying to search with an invalid pattern ID just results in no match being reported.
Create a new search configuration for the given haystack.
Set the span for this search.
This routine validates the span eagerly: if the span given does not correspond to valid bounds for this search’s haystack, the call itself panics (see the Panics section below).
This routine is generic over how a span is provided. While a Span may be given directly, one may also provide a std::ops::Range<usize>. To provide anything supported by range syntax, use the Input::range method.
The default span is the entire haystack.
Note that Input::range overrides this method and vice versa.
§Panics
This panics if the given span does not correspond to valid bounds in the haystack or the termination of a search.
§Example
This example shows how the span of the search can impact whether a match is reported or not. This is particularly relevant for look-around operators, which might take things outside of the span into account when determining whether they match.
use regex_automata::{
    nfa::thompson::pikevm::PikeVM,
    Match, Input,
};
// Look for 'at', but as a distinct word.
let re = PikeVM::new(r"\bat\b")?;
let mut cache = re.create_cache();
let mut caps = re.create_captures();
// Our haystack contains 'at', but not as a distinct word.
let haystack = "batter";
// A standard search finds nothing, as expected.
let input = Input::new(haystack);
re.search(&mut cache, &input, &mut caps);
assert_eq!(None, caps.get_match());
// But if we wanted to search starting at position '1', we might
// slice the haystack. If we do this, it's impossible for the \b
// anchors to take the surrounding context into account! And thus,
// a match is produced.
let input = Input::new(&haystack[1..3]);
re.search(&mut cache, &input, &mut caps);
assert_eq!(Some(Match::must(0, 0..2)), caps.get_match());
// But if we specify the span of the search instead of slicing the
// haystack, then the regex engine can "see" outside of the span
// and resolve the anchors correctly.
let input = Input::new(haystack).span(1..3);
re.search(&mut cache, &input, &mut caps);
assert_eq!(None, caps.get_match());
This may seem a little ham-fisted, but this scenario tends to come up if some other regex engine found the match span and now you need to re-process that span to look for capturing groups. (e.g., Run a faster DFA first, find a match, then run the PikeVM on just the match span to resolve capturing groups.) In order to implement that sort of logic correctly, you need to set the span on the search instead of slicing the haystack directly.
The other advantage of using this routine to specify the bounds of the search is that the match offsets are still reported in terms of the original haystack. For example, the second search in the example above reported a match at position 0, even though 'at' starts at offset 1, because we sliced the haystack.
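To see why slicing loses context, consider how a word boundary assertion is resolved. The sketch below is a simplified, ASCII-only check written for illustration; is_word_byte and is_word_boundary are hypothetical helpers, and the crate's own \b is Unicode-aware by default.

```rust
/// Is `b` an ASCII word byte, i.e. one of [0-9A-Za-z_]?
fn is_word_byte(b: u8) -> bool {
    b.is_ascii_alphanumeric() || b == b'_'
}

/// Hypothetical ASCII-only word boundary check at offset `at`. A boundary
/// exists when exactly one side of the offset is a word byte. Positions
/// outside the haystack count as non-word.
fn is_word_boundary(haystack: &[u8], at: usize) -> bool {
    let before = at > 0 && is_word_byte(haystack[at - 1]);
    let after = at < haystack.len() && is_word_byte(haystack[at]);
    before != after
}
```

On the slice "at" taken from "batter", offsets 0 and 2 both look like boundaries because nothing is visible on either side. On the full haystack, offsets 1 and 3 are surrounded by word bytes, so neither is a boundary. Keeping the full haystack and narrowing only the span preserves that information.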
Like Input::span, but accepts any range instead.
This routine validates the range eagerly: if the range given does not correspond to valid bounds for this search’s haystack, the call itself panics (see the Panics section below).
The default range is the entire haystack.
Note that Input::span overrides this method and vice versa.
§Panics
This routine will panic if the given range could not be converted to a valid Range. For example, this would panic when given 0..=usize::MAX since it cannot be represented using a half-open interval in terms of usize.
This also panics if the given range does not correspond to valid bounds in the haystack or the termination of a search.
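The conversion from arbitrary range syntax to a half-open interval can be sketched as follows. This is an illustration using std::ops::RangeBounds, with a hypothetical helper named to_half_open that returns None where the routine documented here would panic.

```rust
use std::ops::{Bound, Range, RangeBounds};

/// Hypothetical helper: convert any range syntax into a half-open
/// `Range<usize>`, or `None` if it cannot be represented that way.
fn to_half_open<R: RangeBounds<usize>>(
    range: R,
    haystack_len: usize,
) -> Option<Range<usize>> {
    let start = match range.start_bound() {
        Bound::Included(&i) => i,
        // An excluded start of usize::MAX has no successor.
        Bound::Excluded(&i) => i.checked_add(1)?,
        Bound::Unbounded => 0,
    };
    let end = match range.end_bound() {
        // An included end of usize::MAX has no exclusive equivalent.
        Bound::Included(&i) => i.checked_add(1)?,
        Bound::Excluded(&i) => i,
        // An open end means "up to the end of the haystack".
        Bound::Unbounded => haystack_len,
    };
    Some(start..end)
}
```

An inclusive range such as 2..=4 becomes 2..5, while 0..=usize::MAX yields None because its exclusive end would overflow usize.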
§Example
use regex_automata::Input;
let input = Input::new("foobar");
assert_eq!(0..6, input.get_range());
let input = Input::new("foobar").range(2..=4);
assert_eq!(2..5, input.get_range());
Sets the anchor mode of a search.
When a search is anchored (so that’s Anchored::Yes or Anchored::Pattern), a match must begin at the start of a search. When a search is not anchored (that’s Anchored::No), regex engines will behave as if the pattern started with a (?s-u:.)*?. This prefix permits a match to appear anywhere.
By default, the anchored mode is Anchored::No.
WARNING: this is subtly different from using a ^ at the start of your regex. A ^ forces a regex to match exclusively at the start of a haystack, regardless of where you begin your search. In contrast, anchoring a search will allow your regex to match anywhere in your haystack, but the match must start at the beginning of a search.
For example, consider the haystack aba and the following searches:
- The regex ^a is compiled with Anchored::No and searches aba starting at position 2. Since ^ requires the match to start at the beginning of the haystack and 2 > 0, no match is found.
- The regex a is compiled with Anchored::Yes and searches aba starting at position 2. This reports a match at [2, 3] since the match starts where the search started. Since there is no ^, there is no requirement for the match to start at the beginning of the haystack.
- The regex a is compiled with Anchored::Yes and searches aba starting at position 1. Since b corresponds to position 1 and since the search is anchored, it finds no match. While the regex matches at other positions, configuring the search to be anchored requires that it only report a match that begins at the same offset as the beginning of the search.
- The regex a is compiled with Anchored::No and searches aba starting at position 1. Since the search is not anchored and the regex does not start with ^, the search executes as if there is a (?s:.)*? prefix that permits it to match anywhere. Thus, it reports a match at [2, 3].
Note that the Anchored::Pattern mode is like Anchored::Yes, except it only reports matches for a particular pattern.
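The four cases above can be reproduced with a toy literal matcher. This is a deliberately simplified model, not how any engine in this crate works: toy_search is a hypothetical function, the needs_haystack_start flag stands in for a leading ^, and the pattern is a plain byte string. It assumes start <= end <= haystack.len().

```rust
/// Hypothetical toy engine: find the literal `needle` within
/// `haystack[start..end]`, returning match offsets relative to the full
/// haystack. `needs_haystack_start` models a leading `^`.
fn toy_search(
    haystack: &[u8],
    needle: &[u8],
    needs_haystack_start: bool,
    start: usize,
    end: usize,
    anchored: bool,
) -> Option<(usize, usize)> {
    // An anchored search only tries the first position. An unanchored
    // search tries every position, like an implicit `(?s:.)*?` prefix.
    let last_candidate = if anchored { start } else { end };
    for at in start..=last_candidate {
        // `^` looks at the haystack, not at where the search began.
        if needs_haystack_start && at != 0 {
            continue;
        }
        let match_end = at + needle.len();
        if match_end <= end && &haystack[at..match_end] == needle {
            return Some((at, match_end));
        }
    }
    None
}
```

The model keeps the two notions separate: the anchored flag limits where the search may begin a match, while the ^ flag constrains the match position relative to the haystack itself.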
§Example
This demonstrates the differences between an anchored search and a pattern that begins with ^ (as described in the above warning message).
use regex_automata::{
    nfa::thompson::pikevm::PikeVM,
    Anchored, Match, Input,
};
let haystack = "aba";
let re = PikeVM::new(r"^a")?;
let (mut cache, mut caps) = (re.create_cache(), re.create_captures());
let input = Input::new(haystack).span(2..3).anchored(Anchored::No);
re.search(&mut cache, &input, &mut caps);
// No match is found because 2 is not the beginning of the haystack,
// which is what ^ requires.
assert_eq!(None, caps.get_match());
let re = PikeVM::new(r"a")?;
let (mut cache, mut caps) = (re.create_cache(), re.create_captures());
let input = Input::new(haystack).span(2..3).anchored(Anchored::Yes);
re.search(&mut cache, &input, &mut caps);
// An anchored search can still match anywhere in the haystack, it just
// must begin at the start of the search which is '2' in this case.
assert_eq!(Some(Match::must(0, 2..3)), caps.get_match());
let re = PikeVM::new(r"a")?;
let (mut cache, mut caps) = (re.create_cache(), re.create_captures());
let input = Input::new(haystack).span(1..3).anchored(Anchored::Yes);
re.search(&mut cache, &input, &mut caps);
// No match is found since we start searching at offset 1 which
// corresponds to 'b'. Since there is no '(?s:.)*?' prefix, no match
// is found.
assert_eq!(None, caps.get_match());
let re = PikeVM::new(r"a")?;
let (mut cache, mut caps) = (re.create_cache(), re.create_captures());
let input = Input::new(haystack).span(1..3).anchored(Anchored::No);
re.search(&mut cache, &input, &mut caps);
// Since anchored=no, an implicit '(?s:.)*?' prefix was added to the
// pattern. Even though the search starts at 'b', the 'match anything'
// prefix allows the search to match 'a'.
let expected = Some(Match::must(0, 2..3));
assert_eq!(expected, caps.get_match());
Whether to execute an “earliest” search or not.
When running a non-overlapping search, an “earliest” search will return the match location as early as possible. For example, given a pattern of foo[0-9]+ and a haystack of foo12345, a normal leftmost search will return foo12345 as a match. But an “earliest” search for regex engines that support “earliest” semantics will return foo1 as a match, since as soon as the first digit following foo is seen, the engine knows it has found a match.
Note that “earliest” semantics generally depend on the regex engine. Different regex engines may determine there is a match at different points. So there is no guarantee that “earliest” matches will always return the same offsets for all regex engines. The “earliest” notion is really about when the particular regex engine determines there is a match rather than a consistent semantic unto itself. This is often useful for implementing “did a match occur or not” predicates, but sometimes the offset is useful as well.
This is disabled by default.
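The difference can be sketched with a hand-written matcher for the pattern above. This is only a model of the idea: match_foo_digits is a hypothetical function hard-coded to foo[0-9]+ at offset 0, and real engines may decide that a match exists at different points.

```rust
/// Hypothetical hand-written matcher for `foo[0-9]+`, anchored at offset 0.
/// Returns the end offset of the match, if any.
fn match_foo_digits(haystack: &[u8], earliest: bool) -> Option<usize> {
    if !haystack.starts_with(b"foo") {
        return None;
    }
    let mut end = None;
    for (i, b) in haystack.iter().enumerate().skip(3) {
        if !b.is_ascii_digit() {
            break;
        }
        // `foo` plus at least one digit has been seen, so a match ending
        // just after this byte is known to exist.
        end = Some(i + 1);
        if earliest {
            // Stop as soon as any match is known.
            break;
        }
        // Otherwise keep going so the greedy `+` consumes more digits.
    }
    end
}
```

On foo12345, the normal mode returns an end offset of 8 and the earliest mode returns 4. A caller that only needs a yes or no answer does the same work in the earliest mode regardless of how many digits follow.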
§Example
This example shows the difference between “earliest” searching and normal searching.
use regex_automata::{nfa::thompson::pikevm::PikeVM, Match, Input};
let re = PikeVM::new(r"foo[0-9]+")?;
let mut cache = re.create_cache();
let mut caps = re.create_captures();
// A normal search implements greediness like you expect.
let input = Input::new("foo12345");
re.search(&mut cache, &input, &mut caps);
assert_eq!(Some(Match::must(0, 0..8)), caps.get_match());
// When 'earliest' is enabled and the regex engine supports
// it, the search will bail once it knows a match has been
// found.
let input = Input::new("foo12345").earliest(true);
re.search(&mut cache, &input, &mut caps);
assert_eq!(Some(Match::must(0, 0..4)), caps.get_match());
Set the span for this search configuration.
This is like the Input::span method, except this mutates the span in place.
This routine is generic over how a span is provided. While a Span may be given directly, one may also provide a std::ops::Range<usize>.
§Panics
This panics if the given span does not correspond to valid bounds in the haystack or the termination of a search.
§Example
use regex_automata::Input;
let mut input = Input::new("foobar");
assert_eq!(0..6, input.get_range());
input.set_span(2..4);
assert_eq!(2..4, input.get_range());
Set the span for this search configuration given any range.
This is like the Input::range method, except this mutates the span in place.
This routine validates the range eagerly: if the range given does not correspond to valid bounds for this search’s haystack, the call itself panics (see the Panics section below).
§Panics
This routine will panic if the given range could not be converted to a valid Range. For example, this would panic when given 0..=usize::MAX since it cannot be represented using a half-open interval in terms of usize.
This also panics if the given span does not correspond to valid bounds in the haystack or the termination of a search.
§Example
use regex_automata::Input;
let mut input = Input::new("foobar");
assert_eq!(0..6, input.get_range());
input.set_range(2..=4);
assert_eq!(2..5, input.get_range());
Set the starting offset for the span for this search configuration.
This is a convenience routine for only mutating the start of a span without having to set the entire span.
§Panics
This panics if the span resulting from the new start position does not correspond to valid bounds in the haystack or the termination of a search.
§Example
use regex_automata::Input;
let mut input = Input::new("foobar");
assert_eq!(0..6, input.get_range());
input.set_start(5);
assert_eq!(5..6, input.get_range());
Set the ending offset for the span for this search configuration.
This is a convenience routine for only mutating the end of a span without having to set the entire span.
§Panics
This panics if the span resulting from the new end position does not correspond to valid bounds in the haystack or the termination of a search.
§Example
use regex_automata::Input;
let mut input = Input::new("foobar");
assert_eq!(0..6, input.get_range());
input.set_end(5);
assert_eq!(0..5, input.get_range());
Set the anchor mode of a search.
This is like Input::anchored, except it mutates the search configuration in place.
§Example
use regex_automata::{Anchored, Input, PatternID};
let mut input = Input::new("foobar");
assert_eq!(Anchored::No, input.get_anchored());
let pid = PatternID::must(5);
input.set_anchored(Anchored::Pattern(pid));
assert_eq!(Anchored::Pattern(pid), input.get_anchored());
Set whether the search should execute in “earliest” mode or not.
This is like Input::earliest, except it mutates the search configuration in place.
§Example
use regex_automata::Input;
let mut input = Input::new("foobar");
assert!(!input.get_earliest());
input.set_earliest(true);
assert!(input.get_earliest());
Return a borrow of the underlying haystack as a slice of bytes.
§Example
use regex_automata::Input;
let input = Input::new("foobar");
assert_eq!(b"foobar", input.haystack());
Return the start position of this search.
This is a convenience routine for search.get_span().start.
When Input::is_done is false, this is guaranteed to return an offset that is less than or equal to Input::end. Otherwise, the offset is one greater than Input::end.
§Example
use regex_automata::Input;
let input = Input::new("foobar");
assert_eq!(0, input.start());
let input = Input::new("foobar").span(2..4);
assert_eq!(2, input.start());
Return the end position of this search.
This is a convenience routine for search.get_span().end.
This is guaranteed to return an offset that is a valid exclusive end bound for this input’s haystack.
§Example
use regex_automata::Input;
let input = Input::new("foobar");
assert_eq!(6, input.end());
let input = Input::new("foobar").span(2..4);
assert_eq!(4, input.end());
Return the span for this search configuration.
If one was not explicitly set, then the span corresponds to the entire range of the haystack.
When Input::is_done is false, the span returned is guaranteed to correspond to valid bounds for this input’s haystack.
§Example
use regex_automata::{Input, Span};
let input = Input::new("foobar");
assert_eq!(Span { start: 0, end: 6 }, input.get_span());
Return the span as a range for this search configuration.
If one was not explicitly set, then the span corresponds to the entire range of the haystack.
When Input::is_done is false, the range returned is guaranteed to correspond to valid bounds for this input’s haystack.
§Example
use regex_automata::Input;
let input = Input::new("foobar");
assert_eq!(0..6, input.get_range());
Return the anchored mode for this search configuration.
If no anchored mode was set, then it defaults to Anchored::No.
§Example
use regex_automata::{Anchored, Input, PatternID};
let mut input = Input::new("foobar");
assert_eq!(Anchored::No, input.get_anchored());
let pid = PatternID::must(5);
input.set_anchored(Anchored::Pattern(pid));
assert_eq!(Anchored::Pattern(pid), input.get_anchored());
Return whether this search should execute in “earliest” mode.
§Example
use regex_automata::Input;
let input = Input::new("foobar");
assert!(!input.get_earliest());
Return true if and only if this search can never return any other matches.
This occurs when the start position of this search is greater than the end position of the search.
§Example
use regex_automata::Input;
let mut input = Input::new("foobar");
assert!(!input.is_done());
input.set_start(6);
assert!(!input.is_done());
input.set_start(7);
assert!(input.is_done());
Returns true if and only if the given offset in this search’s haystack falls on a valid UTF-8 encoded codepoint boundary.
If the haystack is not valid UTF-8, then the behavior of this routine is unspecified.
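A boundary check of this kind can be done by inspecting a single byte. The following is a sketch of one way to do it, assuming the haystack is valid UTF-8; is_utf8_boundary is a hypothetical free function, and its treatment of offsets past the end is an assumption of this sketch.

```rust
/// Hypothetical byte-level check: does `at` fall on a UTF-8 codepoint
/// boundary in `bytes`? Assumes `bytes` is valid UTF-8.
fn is_utf8_boundary(bytes: &[u8], at: usize) -> bool {
    match bytes.get(at) {
        // The end of the haystack is a boundary; anything past it is not.
        None => at == bytes.len(),
        // ASCII bytes (0xxxxxxx) and leading bytes (11xxxxxx) begin a
        // codepoint. Continuation bytes (10xxxxxx) do not.
        Some(&b) => (b & 0b1100_0000) != 0b1000_0000,
    }
}
```

For haystacks that are already a &str, the standard library's str::is_char_boundary answers the same question for offsets up to the string's length.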
§Example
This shows where codepoint boundaries do and don’t exist in valid UTF-8.
use regex_automata::Input;
let input = Input::new("☃");
assert!(input.is_char_boundary(0));
assert!(!input.is_char_boundary(1));
assert!(!input.is_char_boundary(2));
assert!(input.is_char_boundary(3));
assert!(!input.is_char_boundary(4));