Implementing Regular Expressions (original) (raw)
Russ Cox
This page collects resources about implementing regular expression search efficiently.
Articles and Notes
“Regular Expression Matching Can Be Simple And Fast”
“Regular Expression Matching: the Virtual Machine Approach”
An introduction to submatch tracking during efficient (non-backtracking) NFA-based regular expression matching.
Supporting programs:https://github.com/rsc/re1/
“Regular Expression Matching in the Wild”
“Regular Expression Matching with a Trigram Index”
If you want to read Ken Thompson's original 1968 paper (see below), you'll want to take this with you.
“Regular Expressions: Languages, Algorithms, and Software”
by Brian W. Kernighan and Rob Pike
The cleanest, simplest, backtracking implementation you'll ever see.
(But still backtracking, so runs very slowly on some inputs.)
See also Chapter 9 of _The Practice of Programming_and Chapter 1 of Beautiful Code.
Efficient Implementations
RE2 regular expression library
Efficient automaton-based implementation of Perl-syntax regular expressions (excluding backreferences). Written by Russ Cox.
Plan 9 regular expression library
Efficient (non-backtracking) NFA implementation with submatch tracking. Accepts UTF-8 and wide-character Unicode input. Traditional egrep syntax only. Written by Rob Pike.
TRE regular expression library
Efficient (non-backtracking) NFA implementation with submatch tracking. POSIX-compliant, wide-character input. Written by Ville Laurikari.
Very fast DFA-based grep(1) implementation. Accepts UTF-8 input. Traditional egrep syntax only. Written by Ken Thompson.
References
Michael Rabin and Dana Scott, “Finite automata and their decision problems,” IBM Journal of Research and Development 3 (1959), pp. 114–125.http://www.research.ibm.com/journal/rd/032/ibmrd0302C.pdf
Introduces the concept of NFAs, shows they are equivalent in power to DFAs. Won Rabin and Scott a Turing Award
Ken Thompson, “Regular expression search algorithm,” Communications of the ACM 11(6) (June 1968), pp. 419–422.http://doi.acm.org/10.1145/363347.363387(PDF)
The first efficient regular expression search. Four pages but dense: every time I read it I learn something new. Take an IBM 7094 cheat sheet with you. See also old slidesfrom a lecture about the paper.
Ville Laurikari, “NFAs with Tagged Transitions, their Conversion to Deterministic Automata and Application to Regular Expressions,” in Proceedings of the Symposium on String Processing and Information Retrieval, September 2000.http://laurikari.net/ville/spire2000-tnfa.ps
How to track submatches during efficient (non-backtracking) NFA-based regular expression search.
M. Douglas McIlroy, “Enumerating the strings of regular languages,” Journal of Functional Programming 14 (2004), pp. 503–518.http://www.cs.dartmouth.edu/~doug/nfa.ps.gz (preprint)
Not relevant to implementing regular expression matching, but a beautiful example of regular expression analysis and manipulation using Haskell.
Copyright © 2007 Russ Cox. All Rights Reserved.
https://swtch.com/~rsc/regexp/