Approximate matching of regular expressions
Abstract
Given a sequence _A_ and regular expression _R_, the _approximate regular expression matching_ problem is to find a sequence matching _R_ whose optimal alignment with _A_ is the highest scoring of all such sequences. This paper develops an algorithm to solve the problem in time _O_(_MN_), where _M_ and _N_ are the lengths of _A_ and _R_. Thus, the time requirement is asymptotically no worse than for the simpler problem of aligning two fixed sequences. Our method is superior to an earlier algorithm by Wagner and Seiferas in several ways. First, it treats real-valued costs, in addition to integer costs, with no loss of asymptotic efficiency. Second, it requires only _O_(_N_) space to deliver just the score of the best alignment. Finally, its structure permits implementation techniques that make it extremely fast in practice. We extend the method to accommodate gap penalties, as required for typical applications in molecular biology, and further refine it to search for substrings of _A_ that strongly align with a sequence in _R_, as required for typical database searches. We also show how to deliver an optimal alignment between _A_ and _R_ in only _O_(_N_ + log _M_) space using _O_(_MN_ log _M_) time. Finally, an _O_(_MN_(_M_ + _N_) + _N_² log _N_) time algorithm is presented for alignment scoring schemes where the cost of a gap is an arbitrary increasing function of its length.
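As a rough illustration of the problem statement, the sketch below computes the cost of the best alignment between a sequence and any string accepted by a small automaton, keeping only one row of values per symbol of _A_ (one value per automaton state). It is not the paper's algorithm: it assumes unit edit costs (minimising cost rather than maximising score), takes a hand-built automaton instead of parsing a regular expression, and closes each row with a simple repeat-until-stable relaxation, which does not achieve the _O_(_MN_) bound. The names `closure` and `approx_regex_distance` and the edge-list encoding are hypothetical, introduced only for this example.

```python
import math


def closure(row, edges):
    """Relax moves that consume no symbol of A until no value improves.

    An epsilon edge (label None) costs 0; a labelled edge taken here costs 1,
    because its symbol has no counterpart in A.
    """
    changed = True
    while changed:
        changed = False
        for u, v, label in edges:
            cost = row[u] + (0 if label is None else 1)
            if cost < row[v]:
                row[v] = cost
                changed = True
    return row


def approx_regex_distance(a, n_states, edges, start, accept):
    """Unit-cost edit distance from `a` to the nearest string the automaton accepts.

    `edges` is a list of (from_state, to_state, label) triples (assumed encoding).
    Only the current row of n_states values is kept between symbols of `a`.
    """
    row = [math.inf] * n_states
    row[start] = 0
    closure(row, edges)
    for ch in a:
        # Option 1: leave ch unmatched and stay in the same state (cost 1).
        new = [d + 1 for d in row]
        # Option 2: pair ch with a labelled edge (cost 0 on match, 1 otherwise).
        for u, v, label in edges:
            if label is not None:
                new[v] = min(new[v], row[u] + (0 if label == ch else 1))
        row = closure(new, edges)
    return row[accept]


# Hand-built automaton for ab*c: state 0 is the start, state 2 accepts.
AB_STAR_C = [(0, 1, "a"), (1, 1, "b"), (1, 2, "c")]
# Hand-built automaton for a?b, using one epsilon edge.
OPT_A_B = [(0, 1, "a"), (0, 1, None), (1, 2, "b")]

if __name__ == "__main__":
    print(approx_regex_distance("abxc", 3, AB_STAR_C, 0, 2))
    print(approx_regex_distance("aab", 3, OPT_A_B, 0, 2))
```

Because all costs are non-negative, the relaxation in `closure` always terminates. Replacing it with a fixed number of ordered sweeps over the automaton is where a faster method would differ from this sketch.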
Literature
- Abarbanel, R. M., P. R. Wieneke, E. Mansfield, D. A. Jaffe and D. L. Brutlag. 1984. “Rapid Searches for Complex Patterns in Biological Molecules.” Nucleic Acids Res. 12, 263–280.
- Aho, A. 1980. “Pattern Matching in Strings.” In _Formal Language Theory_, R. Book (Ed.). New York: Academic Press.
- Aho, A., J. E. Hopcroft and J. D. Ullman. 1983. Data Structures and Algorithms, pp. 203–208. Reading, MA: Addison-Wesley.
- Cohen, F. E., R. M. Abarbanel, I. D. Kuntz and R. J. Fletterick. 1986. “Turn Prediction in Proteins Using a Pattern-Matching Approach.” Biochemistry 25, 266–275.
- Fitch, W. M. and T. F. Smith. 1983. “Optimal Sequence Alignments.” Proc. Natn. Acad. Sci. U.S.A. 80, 1382–1386.
- Gotoh, O. 1982. “An Improved Algorithm for Matching Biological Sequences.” J. Molec. Biol. 162, 705–708.
- Hecht, M. S. 1977. Flow Analysis of Computer Programs. Amsterdam: North-Holland.
- Hecht, M. S. and J. D. Ullman. 1975. “A Simple Algorithm for Global Data Flow Analysis Programs.” SIAM J. Computing 4, 519–532.
- Hopcroft, J. E. and J. D. Ullman. 1979. Introduction to Automata Theory, Languages, and Computation. Reading, MA: Addison-Wesley.
- Kennedy, K. 1975. “Node Listing Techniques Applied to Data Flow Analysis.” Proceedings of the 2nd ACM Conference on Principles of Programming Languages, 10–21.
- Levenshtein, V. I. 1966. “Binary Codes Capable of Correcting Deletions, Insertions, and Reversals.” Cybernetics Control Theory 10, 707–710.
- Miller, W. 1987. A Software Tools Sampler. New Jersey: Prentice-Hall.
- Miller, W. and E. W. Myers. 1988a. “A Simple Row-Replacement Method.” Software-Practice and Experience 18, 597–611.
- Miller, W. and E. W. Myers. 1988b. “Sequence Comparison with Concave Weighting Functions.” Bull. Math. Biol. 50, 97–120.
- Myers, E. W. and W. Miller. 1988a. “Row replacement Algorithms for Screen Editors.” ACM Trans. Prog. Lang. Systems (to be published).
- Myers, E. W. and W. Miller. 1988b. “Optimal Alignments in Linear Space.” CABIOS 4, 11–17.
- Pennello, T. J. 1986. “Very Fast LR Parsing.” Proceedings of the SIGPLAN'86 Symposium on Compiler Construction. ACM SIGPLAN Notices 21, 145–150.
- Sankoff, D. and J. B. Kruskal. 1983. Time Warps, String Edits and Macromolecules: The Theory and Practice of Sequence Comparison. Reading, MA: Addison-Wesley.
- Sellers, P. H. 1980. “The Theory and Computation of Evolutionary Distances: Pattern Recognition.” J. Algorithms 1, 359–373.
- Sellers, P. H. 1984. “Pattern Recognition in Genetic Sequences by Mismatch Density.” Bull. Math. Biol. 46, 501–514.
- Thompson, K. 1968. “Regular Expression Search Algorithm.” Comm. ACM 11, 419–422.
- Wagner, R. A. 1974. “Order-n Correction of Regular Languages.” Comm. ACM 17, 265–268.
- Wagner, R. A. and J. I. Seiferas. 1978. “Correcting Counter-Automaton-Recognizable Languages.” SIAM J. Computing 7, 357–375.
- Waterman, M. S. 1984. “General Methods for Sequence Comparison.” Bull. Math. Biol. 46, 473–500.
- Waterman, M. S., T. F. Smith and W. A. Beyer. 1976. “Some Biological Sequence Metrics.” Adv. Maths 20, 367–387.
Author information
Authors and Affiliations
- Eugene W. Myers, Department of Computer Science, University of Arizona, 85721, Tucson, AZ, U.S.A.
- Webb Miller, Department of Computer Science, The Pennsylvania State University, 16802, University Park, PA, U.S.A.
About this article
Cite this article
Myers, E.W., Miller, W. Approximate matching of regular expressions. Bull. Math. Biol. 51, 5–37 (1989). https://doi.org/10.1007/BF02458834
- Received: 16 June 1988
- Issue Date: January 1989
- DOI: https://doi.org/10.1007/BF02458834