Properties in regex_syntax::hir - Rust
pub struct Properties(/* private fields */);
A type that collects various properties of an HIR value.
Properties are always scalar values and represent meta data that is computed inductively on an HIR value. Properties are defined for all HIR values.
All methods on a Properties value take constant time and are meant to be cheap to call.
Returns the length (in bytes) of the smallest string matched by this HIR.
A return value of 0 is possible and occurs when the HIR can match an empty string.
None is returned when there is no minimum length. This occurs in precisely the cases where the HIR matches nothing, i.e., the language the regex matches is empty. An example of such a regex is \P{any}.
Returns the length (in bytes) of the longest string matched by this HIR.
A return value of 0 is possible and occurs when nothing longer than the empty string is in the language described by this HIR.
None is returned when there is no longest matching string. This occurs when the HIR matches nothing or when there is no upper bound on the length of matching strings. Examples of such regexes are \P{any} (matches nothing) and a+ (has no upper bound).
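To make "computed inductively" concrete, here is a minimal sketch that derives a minimum and maximum length bottom-up over a toy expression tree. The Toy type, its variants, and the min_len and max_len functions are hypothetical names invented for this illustration; they are not the crate's Hir or its actual implementation, and the combination rules are a simplifying assumption.

```rust
/// A toy expression tree (hypothetical; far smaller than the crate's `Hir`).
enum Toy {
    /// Matches nothing at all, like `\P{any}`.
    Fail,
    /// Matches only the empty string.
    Empty,
    /// Matches exactly these bytes.
    Literal(Vec<u8>),
    /// Matches each sub-expression in sequence.
    Concat(Vec<Toy>),
    /// Matches `sub` between `min` and `max` times; `max: None` is unbounded.
    Repeat { min: usize, max: Option<usize>, sub: Box<Toy> },
}

/// Length in bytes of the shortest match, or `None` if nothing matches.
fn min_len(t: &Toy) -> Option<usize> {
    match t {
        Toy::Fail => None,
        Toy::Empty => Some(0),
        Toy::Literal(bytes) => Some(bytes.len()),
        // Summing `Option`s yields `None` as soon as one child is `None`.
        Toy::Concat(subs) => subs.iter().map(min_len).sum(),
        Toy::Repeat { min, sub, .. } => {
            min_len(sub).map(|n| n.saturating_mul(*min))
        }
    }
}

/// Length in bytes of the longest match, or `None` if nothing matches or
/// the length has no upper bound.
fn max_len(t: &Toy) -> Option<usize> {
    match t {
        Toy::Fail => None,
        Toy::Empty => Some(0),
        Toy::Literal(bytes) => Some(bytes.len()),
        Toy::Concat(subs) => subs.iter().map(max_len).sum(),
        Toy::Repeat { max, sub, .. } => {
            let reps = (*max)?; // unbounded repetition has no maximum
            max_len(sub)?.checked_mul(reps)
        }
    }
}
```

Under these assumed rules, a tree modelling ab?c? yields Some(1) and Some(3), and a tree modelling a+ yields Some(1) and None. Each node only combines values already computed for its children, which is why reading a stored property afterwards can be a constant-time operation.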
Returns a set of all look-around assertions that appear at least once in this HIR value.
Returns a set of all look-around assertions that appear as a prefix for this HIR value. That is, the set returned corresponds to the set of assertions that must be passed before matching any bytes in a haystack.
For example, hir.look_set_prefix().contains(Look::Start) returns true if and only if the HIR is fully anchored at the start.
Returns a set of all look-around assertions that appear as a possible prefix for this HIR value. That is, the set returned corresponds to the set of assertions that may be passed before matching any bytes in a haystack.
For example, hir.look_set_prefix_any().contains(Look::Start) returns true if and only if it’s possible for the regex to match through an anchored assertion before consuming any input.
Returns a set of all look-around assertions that appear as a suffix for this HIR value. That is, the set returned corresponds to the set of assertions that must be passed in order to be considered a match after all other consuming HIR expressions.
For example, hir.look_set_suffix().contains(Look::End) returns true if and only if the HIR is fully anchored at the end.
Returns a set of all look-around assertions that appear as a possible suffix for this HIR value. That is, the set returned corresponds to the set of assertions that may be passed in order to be considered a match after all other consuming HIR expressions.
For example, hir.look_set_suffix_any().contains(Look::End) returns true if and only if it’s possible for the regex to match through an anchored assertion at the end of a match without consuming any input.
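The difference between the "must" sets (prefix, suffix) and the "may" sets (prefix_any, suffix_any) can be sketched with a toy model. Everything below (Look with only two variants, LookSet, Toy, can_consume, prefix_sets) is hypothetical and written for this illustration; it assumes that an alternation intersects the "must" sets of its branches and unions their "may" sets, and it is not the crate's implementation. Alternations are assumed to be non-empty.

```rust
/// Toy look-around kinds (hypothetical; the crate's `Look` has more variants).
#[derive(Clone, Copy)]
enum Look {
    Start = 0b01,
    End = 0b10,
}

/// A tiny bitset of `Look` values.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
struct LookSet(u8);

impl LookSet {
    const EMPTY: LookSet = LookSet(0);
    const FULL: LookSet = LookSet(0b11);

    fn contains(self, look: Look) -> bool {
        (self.0 & (look as u8)) != 0
    }
}

/// A toy expression tree (hypothetical; far smaller than the crate's `Hir`).
enum Toy {
    Look(Look),
    Literal(&'static str),
    Concat(Vec<Toy>),
    Alt(Vec<Toy>),
}

/// True if the expression can match at least one byte.
fn can_consume(t: &Toy) -> bool {
    match t {
        Toy::Look(_) => false,
        Toy::Literal(s) => !s.is_empty(),
        Toy::Concat(subs) | Toy::Alt(subs) => subs.iter().any(can_consume),
    }
}

/// Returns `(must, may)`: assertions every match passes before consuming
/// input, and assertions that at least one match might pass.
fn prefix_sets(t: &Toy) -> (LookSet, LookSet) {
    match t {
        Toy::Look(look) => (LookSet(*look as u8), LookSet(*look as u8)),
        Toy::Literal(_) => (LookSet::EMPTY, LookSet::EMPTY),
        Toy::Concat(subs) => {
            let (mut must, mut may) = (LookSet::EMPTY, LookSet::EMPTY);
            for sub in subs {
                let (m, a) = prefix_sets(sub);
                must.0 |= m.0;
                may.0 |= a.0;
                // Stop at the first child that can consume input.
                if can_consume(sub) {
                    break;
                }
            }
            (must, may)
        }
        Toy::Alt(subs) => {
            // "Must" requires every branch to agree; "may" needs only one.
            let (mut must, mut may) = (LookSet::FULL, LookSet::EMPTY);
            for sub in subs {
                let (m, a) = prefix_sets(sub);
                must.0 &= m.0;
                may.0 |= a.0;
            }
            (must, may)
        }
    }
}
```

In this model, a tree for ^a has Start in both sets, while a tree for ^a|b has Start only in the "may" set, because the b branch can match without passing the anchor.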
Return true if and only if the corresponding HIR will always match valid UTF-8.
When this returns false, then it is possible for this HIR expression to match invalid UTF-8, including by matching between the code units of a single UTF-8 encoded codepoint.
Note that this returns true even when the corresponding HIR can match the empty string. Since an empty string can technically appear between UTF-8 code units, it is possible for a match to be reported that splits a codepoint which could in turn be considered matching invalid UTF-8. However, it is generally assumed that such empty matches are handled specially by the search routine if it is absolutely required that matches not split a codepoint.
§Example
This code example shows the UTF-8 property of a variety of patterns.
use regex_syntax::{ParserBuilder, parse};
// Examples of 'is_utf8() == true'.
assert!(parse(r"a")?.properties().is_utf8());
assert!(parse(r"[^a]")?.properties().is_utf8());
assert!(parse(r".")?.properties().is_utf8());
assert!(parse(r"\W")?.properties().is_utf8());
assert!(parse(r"\b")?.properties().is_utf8());
assert!(parse(r"\B")?.properties().is_utf8());
assert!(parse(r"(?-u)\b")?.properties().is_utf8());
assert!(parse(r"(?-u)\B")?.properties().is_utf8());
// Unicode mode is enabled by default, and in
// that mode, all \x hex escapes are treated as
// codepoints. So this actually matches the UTF-8
// encoding of U+00FF.
assert!(parse(r"\xFF")?.properties().is_utf8());
// Now we show examples of 'is_utf8() == false'.
// The only way to do this is to force the parser
// to permit invalid UTF-8, otherwise all of these
// would fail to parse!
let parse = |pattern| {
    ParserBuilder::new().utf8(false).build().parse(pattern)
};
assert!(!parse(r"(?-u)[^a]")?.properties().is_utf8());
assert!(!parse(r"(?-u).")?.properties().is_utf8());
assert!(!parse(r"(?-u)\W")?.properties().is_utf8());
// Conversely to the equivalent example above,
// when Unicode mode is disabled, \x hex escapes
// are treated as their raw byte values.
assert!(!parse(r"(?-u)\xFF")?.properties().is_utf8());
// Note that just because we disabled UTF-8 in the
// parser doesn't mean we still can't use Unicode.
// It is enabled by default, so \xFF is still
// equivalent to matching the UTF-8 encoding of
// U+00FF by default.
assert!(parse(r"\xFF")?.properties().is_utf8());
// Even though we use raw bytes that individually
// are not valid UTF-8, when combined together, the
// overall expression *does* match valid UTF-8!
assert!(parse(r"(?-u)\xE2\x98\x83")?.properties().is_utf8());
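The note about empty matches splitting a codepoint can be illustrated with the standard library alone. The sketch below is not part of regex_syntax; utf8_safe_empty_matches is a hypothetical helper showing one way a search routine might discard empty matches that fall between the code units of a codepoint.

```rust
/// Returns every byte offset in `haystack` at which an empty match could be
/// reported without splitting a UTF-8 encoded codepoint.
fn utf8_safe_empty_matches(haystack: &str) -> Vec<usize> {
    // The empty string occurs at every offset from 0 to len inclusive,
    // but only some of those offsets sit on a codepoint boundary.
    (0..=haystack.len())
        .filter(|&i| haystack.is_char_boundary(i))
        .collect()
}
```

For the haystack "a\u{2603}" (the letter a followed by the three-byte snowman E2 98 83), the candidate offsets are 0 through 4, and the helper keeps 0, 1, and 4. Offsets 2 and 3 lie inside the snowman's encoding and are dropped.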
Returns the total number of explicit capturing groups in the corresponding HIR.
Note that this does not include the implicit capturing group corresponding to the entire match that is typically included by regex engines.
§Example
This method will return 0 for a and 1 for (a):
use regex_syntax::parse;
assert_eq!(0, parse("a")?.properties().explicit_captures_len());
assert_eq!(1, parse("(a)")?.properties().explicit_captures_len());
Returns the total number of explicit capturing groups that appear in every possible match.
If the number of capture groups can vary depending on the match, then this returns None. That is, a value is only returned when the number of matching groups is invariant or “static.”
Note that this does not include the implicit capturing group corresponding to the entire match.
§Example
This shows a few cases where a static number of capture groups is available and a few cases where it is not.
use regex_syntax::parse;
let len = |pattern| {
    parse(pattern).map(|h| {
        h.properties().static_explicit_captures_len()
    })
};
assert_eq!(Some(0), len("a")?);
assert_eq!(Some(1), len("(a)")?);
assert_eq!(Some(1), len("(a)|(b)")?);
assert_eq!(Some(2), len("(a)(b)|(c)(d)")?);
assert_eq!(None, len("(a)|b")?);
assert_eq!(None, len("a|(b)")?);
assert_eq!(None, len("(b)*")?);
assert_eq!(Some(1), len("(b)+")?);
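One way to see why these results come out as they do is a toy inductive rule set. The sketch below is an assumption made for illustration, not the crate's code: the Toy type and the static_captures function are hypothetical. It assumes that a concatenation sums its children, an alternation needs every branch to agree, and a repetition that may match zero times loses any captures its body contains.

```rust
/// A toy expression tree (hypothetical; far smaller than the crate's `Hir`).
enum Toy {
    Literal,
    Capture(Box<Toy>),
    Concat(Vec<Toy>),
    Alt(Vec<Toy>),
    /// Repeats `sub` at least `min` times, with no upper bound.
    Repeat { min: u32, sub: Box<Toy> },
}

/// Number of capture groups that participate in every match, if that
/// number is the same for all matches.
fn static_captures(t: &Toy) -> Option<usize> {
    match t {
        Toy::Literal => Some(0),
        Toy::Capture(sub) => static_captures(sub).map(|n| n + 1),
        // `None` in any child makes the whole sum `None`.
        Toy::Concat(subs) => subs.iter().map(static_captures).sum(),
        Toy::Alt(subs) => {
            let mut counts = subs.iter().map(static_captures);
            // Toy choice: an empty alternation has no static count.
            let first = counts.next()??;
            if counts.all(|c| c == Some(first)) {
                Some(first)
            } else {
                None
            }
        }
        Toy::Repeat { min, sub } => {
            let n = static_captures(sub)?;
            // With zero iterations allowed, the groups may not participate.
            if *min == 0 && n > 0 {
                None
            } else {
                Some(n)
            }
        }
    }
}
```

Under these rules, trees modelling (a)|(b) and (b)+ both give Some(1), while (a)|b and (b)* give None, which agrees with the assertions in the example.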
Return true if and only if this HIR is a simple literal. This is only true when this HIR expression is either itself a Literal or a concatenation of only Literals.
For example, f and foo are literals, but f+, (foo), foo() and the empty string are not (even though they contain sub-expressions that are literals).
Return true if and only if this HIR is either a simple literal or an alternation of simple literals. This is only true when this HIR expression is either itself a Literal or a concatenation of only Literals or an alternation of only Literals.
For example, f, foo, a|b|c, and foo|bar|baz are alternation literals, but f+, (foo), foo(), and the empty pattern are not (even though they contain sub-expressions that are literals).
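The two literal predicates described above can be sketched as shape checks on a toy tree. The Toy type and both functions below are hypothetical stand-ins written for this illustration, not the crate's Hir or its methods. The sketch assumes that each alternation branch must itself be a simple literal.

```rust
/// A toy expression tree (hypothetical; far smaller than the crate's `Hir`).
enum Toy {
    Empty,
    Literal(&'static str),
    Capture(Box<Toy>),
    Plus(Box<Toy>),
    Concat(Vec<Toy>),
    Alt(Vec<Toy>),
}

/// True for a literal or a non-empty concatenation of only literals.
fn is_literal(t: &Toy) -> bool {
    match t {
        Toy::Literal(_) => true,
        Toy::Concat(subs) => {
            !subs.is_empty()
                && subs.iter().all(|s| matches!(s, Toy::Literal(_)))
        }
        _ => false,
    }
}

/// True for a simple literal or a non-empty alternation of simple literals.
fn is_alternation_literal(t: &Toy) -> bool {
    match t {
        Toy::Alt(subs) => !subs.is_empty() && subs.iter().all(is_literal),
        other => is_literal(other),
    }
}
```

In this model a tree for foo satisfies both predicates, a tree for a|b|c satisfies only the second, and trees for f+, (foo), and the empty pattern satisfy neither.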
Returns the total amount of heap memory usage, in bytes, used by this Properties value.
Returns a new set of properties that corresponds to the union of the iterator of properties given.
This is useful when one has multiple Hir expressions and wants to combine them into a single alternation without constructing the corresponding Hir. This routine provides a way of combining the properties of each Hir expression into one set of properties representing the union of those expressions.
§Example: union with HIRs that never match
This example shows that unioning properties together with one that represents a regex that never matches will “poison” certain attributes, like the minimum and maximum lengths.
use regex_syntax::{hir::Properties, parse};
let hir1 = parse("ab?c?")?;
assert_eq!(Some(1), hir1.properties().minimum_len());
assert_eq!(Some(3), hir1.properties().maximum_len());
let hir2 = parse(r"[a&&b]")?;
assert_eq!(None, hir2.properties().minimum_len());
assert_eq!(None, hir2.properties().maximum_len());
let hir3 = parse(r"wxy?z?")?;
assert_eq!(Some(2), hir3.properties().minimum_len());
assert_eq!(Some(4), hir3.properties().maximum_len());
let unioned = Properties::union([
    hir1.properties(),
    hir2.properties(),
    hir3.properties(),
]);
assert_eq!(None, unioned.minimum_len());
assert_eq!(None, unioned.maximum_len());
The maximum length can also be “poisoned” by a pattern that has no upper bound on the length of a match. The minimum length remains unaffected:
use regex_syntax::{hir::Properties, parse};
let hir1 = parse("ab?c?")?;
assert_eq!(Some(1), hir1.properties().minimum_len());
assert_eq!(Some(3), hir1.properties().maximum_len());
let hir2 = parse(r"a+")?;
assert_eq!(Some(1), hir2.properties().minimum_len());
assert_eq!(None, hir2.properties().maximum_len());
let hir3 = parse(r"wxy?z?")?;
assert_eq!(Some(2), hir3.properties().minimum_len());
assert_eq!(Some(4), hir3.properties().maximum_len());
let unioned = Properties::union([
    hir1.properties(),
    hir2.properties(),
    hir3.properties(),
]);
assert_eq!(Some(1), unioned.minimum_len());
assert_eq!(None, unioned.maximum_len());
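The "poisoning" seen in both examples can be summarised as a small combining rule over (minimum, maximum) pairs. The function below is a hypothetical standard-library sketch that reproduces the behaviour of the two examples; it is not the crate's implementation, and its result for an empty input is an arbitrary choice made here.

```rust
/// Combines `(minimum_len, maximum_len)` pairs so that any `None` poisons
/// the corresponding result; otherwise takes the smallest minimum and the
/// largest maximum.
fn union_lens(
    pairs: &[(Option<usize>, Option<usize>)],
) -> (Option<usize>, Option<usize>) {
    // Collecting into `Option<Vec<_>>` gives `None` if any item is `None`.
    let mins: Option<Vec<usize>> = pairs.iter().map(|p| p.0).collect();
    let maxs: Option<Vec<usize>> = pairs.iter().map(|p| p.1).collect();
    (
        mins.and_then(|v| v.into_iter().min()),
        maxs.and_then(|v| v.into_iter().max()),
    )
}
```

Feeding in the lengths from the first example, (Some(1), Some(3)), (None, None), and (Some(2), Some(4)), gives (None, None). Replacing the middle pair with (Some(1), None), as in the second example, gives (Some(1), None).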
Tests for self and other values to be equal, and is used by ==.
Tests for !=. The default implementation is almost always sufficient, and should not be overridden without very good reason.