regex_syntax::hir::literal - Rust

Provides literal extraction from Hir expressions.

An Extractor pulls literals out of Hir expressions and returns a Seq of Literals.

The purpose of literal extraction is generally to provide avenues for optimizing regex searches. The main idea is that substring searches can be an order of magnitude faster than a regex search. Therefore, if one can execute a substring search to find candidate match locations and only run the regex search at those locations, then performance can improve dramatically.
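As a rough illustration of the candidate-then-verify idea, here is a minimal sketch that uses only the standard library. The function names `find_with_prefilter` and `matches_foo_digit` are hypothetical and not part of this crate. The verifier is a hand-written stand-in for a regex engine running the pattern `foo[0-9]` anchored at a given offset.

```rust
/// Finds candidate offsets with a plain substring search and runs the
/// (more expensive) verifier only at those offsets.
///
/// `verify_at` is a stand-in for "run the regex anchored at this position".
fn find_with_prefilter<F>(haystack: &str, literal: &str, verify_at: F) -> Vec<usize>
where
    F: Fn(&str, usize) -> bool,
{
    let mut matches = Vec::new();
    if literal.is_empty() {
        // An empty literal would make every position a candidate,
        // which defeats the purpose of a prefilter.
        return matches;
    }
    let mut start = 0;
    while let Some(offset) = haystack[start..].find(literal) {
        let candidate = start + offset;
        if verify_at(haystack, candidate) {
            matches.push(candidate);
        }
        // Resume just past the first character of this candidate so that
        // overlapping candidates are still considered.
        start = candidate + literal.chars().next().map_or(1, |c| c.len_utf8());
    }
    matches
}

/// Hypothetical stand-in for a regex engine: does `foo[0-9]` match at `at`?
fn matches_foo_digit(haystack: &str, at: usize) -> bool {
    let rest = &haystack.as_bytes()[at..];
    rest.starts_with(b"foo") && rest.get(3).map_or(false, |b| b.is_ascii_digit())
}
```

Calling `find_with_prefilter("foo1 food foo22 fo", "foo", matches_foo_digit)` produces three candidates (offsets 0, 5 and 10), and the verifier rejects the one at offset 5.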

With that said, literal optimizations are generally a black art because even though substring search is generally faster, if the number of candidates produced is high, then it can create a lot of overhead by ping-ponging between the substring search and the regex search.

Here are some heuristics that might be used to help increase the chances of effective literal optimizations:

(It should be noted that there are always pathological cases that can make any kind of literal optimization a net slowdown. This is why it might be a good idea to be conservative, or even to provide a means for literal optimizations to be dynamically disabled if they are determined to be ineffective according to some measure.)
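One possible "measure" for disabling a literal optimization at runtime is the fraction of candidates that turn into confirmed matches. The sketch below assumes that measure. The `PrefilterStats` type and both of its thresholds are hypothetical and would need tuning against real workloads. Nothing here is provided by this crate.

```rust
/// Tracks how often prefilter candidates become confirmed regex matches.
struct PrefilterStats {
    candidates: u64,
    confirmed: u64,
    /// Do not judge the prefilter before this many candidates were seen.
    min_candidates: u64,
    /// Minimum acceptable ratio of confirmed matches to candidates.
    min_hit_rate: f64,
}

impl PrefilterStats {
    fn new(min_candidates: u64, min_hit_rate: f64) -> Self {
        PrefilterStats { candidates: 0, confirmed: 0, min_candidates, min_hit_rate }
    }

    /// Record the outcome of verifying one candidate.
    fn record(&mut self, confirmed: bool) {
        self.candidates += 1;
        if confirmed {
            self.confirmed += 1;
        }
    }

    /// Returns false once enough candidates were seen and too few of them
    /// were real matches; the caller can then fall back to the regex engine.
    fn is_effective(&self) -> bool {
        if self.candidates == 0 || self.candidates < self.min_candidates {
            return true; // not enough evidence yet
        }
        (self.confirmed as f64) / (self.candidates as f64) >= self.min_hit_rate
    }
}
```

A search loop would call `record` after each verification and stop using the substring search once `is_effective` returns false.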

You’re encouraged to explore the methods on Seq, which permit shrinking the size of sequences in a preference-order preserving fashion.
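To give a feel for what "shrinking in a preference-order preserving fashion" can mean, here is a toy sketch over plain byte vectors. The function `truncate_prefixes` is hypothetical and is not one of the methods on Seq. It also ignores the distinction the crate draws between exact and inexact literals, which a real truncation would have to track.

```rust
/// Keeps at most `len` leading bytes of each literal, then drops repeats
/// while keeping the first occurrence, so earlier (preferred) literals
/// stay ahead of later ones.
fn truncate_prefixes(literals: &[Vec<u8>], len: usize) -> Vec<Vec<u8>> {
    let mut out: Vec<Vec<u8>> = Vec::new();
    for lit in literals {
        let short = lit[..lit.len().min(len)].to_vec();
        // Linear scan is fine for the small sequences this is meant for.
        if !out.contains(&short) {
            out.push(short);
        }
    }
    out
}
```

For instance, truncating `foobar`, `foobaz` and `quux` to four bytes leaves the two literals `foob` and `quux`, in that order.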

Finally, note that it isn’t strictly necessary to use an Extractor. Namely, an Extractor only uses public APIs of the Seq and Literal types, so it is possible to implement your own extractor, for example for n-grams or “inner” literals (i.e., not prefix or suffix literals). The Extractor is mostly responsible for the case analysis over Hir expressions. Most of the “trickier” parts concern how to combine literal sequences, and that is all implemented on Seq.
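To make "combining literal sequences" concrete, the following toy sketch models a sequence of prefix literals as a plain `Vec<String>`. The helper names `cross` and `union` are chosen for illustration, and the bodies are simplified assumptions rather than the crate's implementation. In particular, they ignore exact versus inexact literals and have no notion of an unbounded sequence.

```rust
/// Prefix literals for a concatenation `AB`: every literal of `a`
/// followed by every literal of `b`, in preference order.
fn cross(a: &[String], b: &[String]) -> Vec<String> {
    let mut out = Vec::with_capacity(a.len() * b.len());
    for x in a {
        for y in b {
            out.push(format!("{}{}", x, y));
        }
    }
    out
}

/// Prefix literals for an alternation `A|B`: the literals of `a`, then
/// those of `b`, keeping only the first occurrence of each literal.
fn union(a: &[String], b: &[String]) -> Vec<String> {
    let mut out: Vec<String> = Vec::new();
    for lit in a.iter().chain(b) {
        if !out.contains(lit) {
            out.push(lit.clone());
        }
    }
    out
}
```

Under this model, the pattern `(a|b)(x|y)` yields the cross product `ax`, `ay`, `bx`, `by`. The size of the result grows multiplicatively with each concatenation, which is one reason small sequences matter.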