A fast bit-vector algorithm for approximate string matching based on dynamic programming
Published: 01 May 1999
Abstract
The approximate string matching problem is to find all locations at which a query of length m matches a substring of a text of length n with k-or-fewer differences. Simple and practical bit-vector algorithms have been designed for this problem, most notably the one used in agrep. These algorithms compute a bit representation of the current state-set of the k-difference automaton for the query, and asymptotically run in either O(nmk/w) or O(nm log σ/w) time, where w is the word size of the machine (e.g., 32 or 64 in practice) and σ is the size of the pattern alphabet. Here we present an algorithm of comparable simplicity that requires only O(nm/w) time by virtue of computing a bit representation of the relocatable dynamic programming matrix for the problem. Thus, the algorithm's performance is independent of k, and it is found to be more efficient than the previous results for many choices of k and small m.
Moreover, because the algorithm is not dependent on k, it can be used to rapidly compute blocks of the dynamic programming matrix as in the 4-Russians algorithm of Wu et al. (1996). This gives rise to an O(kn/w) expected-time algorithm for the case where m may be arbitrarily large. In practice this new algorithm, which computes a region of the dynamic programming (d.p.) matrix w entries at a time using the basic algorithm as a subroutine, is significantly faster than our previous 4-Russians algorithm, which computes the same region 4 or 5 entries at a time using table lookup. This performance improvement yields a code that is either superior to or competitive with all existing algorithms, except for some filtration algorithms that are superior when k/m is sufficiently small.
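The basic idea in the first paragraph of the abstract can be illustrated with a minimal sketch. It is not the paper's code: the function names `myers_search` and `dp_search` are hypothetical, the bit-update formulas follow the commonly published form of the basic algorithm, and Python's unbounded integers stand in for a machine word, so an explicit mask of m bits replaces the assumption that m fits in one word of size w. The second function is the plain column-by-column dynamic program, included only as a reference to compare against.

```python
def myers_search(pattern, text, k):
    """Return end positions j in text where pattern matches with <= k differences.

    Sketch of the bit-vector approach: each bit of pv/mv records whether the
    vertical difference between adjacent d.p. cells in the current column
    is +1 or -1 (neither bit set means 0).
    """
    m = len(pattern)
    if m == 0:
        raise ValueError("pattern must be non-empty")
    mask = (1 << m) - 1          # keeps every vector to m bits (assumption: emulates a word)
    high = 1 << (m - 1)          # bit for the last row of the d.p. column

    # peq[c] has bit i set when pattern[i] == c
    peq = {}
    for i, c in enumerate(pattern):
        peq[c] = peq.get(c, 0) | (1 << i)

    pv, mv, score = mask, 0, m   # first column is 0, 1, ..., m: all vertical deltas are +1
    hits = []
    for j, c in enumerate(text):
        eq = peq.get(c, 0)
        xv = eq | mv
        xh = ((((eq & pv) + pv) & mask) ^ pv) | eq
        ph = mv | (~(xh | pv) & mask)   # horizontal delta +1
        mh = pv & xh                    # horizontal delta -1
        if ph & high:
            score += 1
        elif mh & high:
            score -= 1
        ph = (ph << 1) & mask           # shifting in 0: the top row of the matrix is all zeros
        mh = (mh << 1) & mask
        pv = mh | (~(xv | ph) & mask)
        mv = ph & xv
        if score <= k:
            hits.append(j)
    return hits


def dp_search(pattern, text, k):
    """Reference O(nm) dynamic program over one column at a time."""
    m = len(pattern)
    col = list(range(m + 1))
    hits = []
    for j, c in enumerate(text):
        prev_diag, col[0] = col[0], 0   # a match may start anywhere in the text
        for i in range(1, m + 1):
            cur = min(col[i] + 1, col[i - 1] + 1, prev_diag + (pattern[i - 1] != c))
            prev_diag, col[i] = col[i], cur
        if col[m] <= k:
            hits.append(j)
    return hits


if __name__ == "__main__":
    print(myers_search("abc", "abc", 0))  # prints [2]
```

Note that `k` appears only in the final comparison of each loop iteration, which is the sense in which the running time does not depend on k: the per-character work is a fixed number of word operations.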
References
[1]
BAEZA-YATES, R. A., AND GONNET, G. H. 1992. A new approach to text searching. Commun. ACM 35, 74-82.
[2]
BAEZA-YATES, R. A., AND NAVARRO, G. 1996. A faster algorithm for approximate string matching. In Proceedings of the 7th Symposium on Combinatorial Pattern Matching. Lecture Notes in Computer Science, vol. 1075. Springer-Verlag, New York, pp. 1-23.
[3]
BAEZA-YATES, R.A. AND NAVARRO, G. 1999. Analysis for algorithm engineering: Improving an algorithm for approximate pattern matching. Unpublished manuscript.
[4]
CHAO, K.M., HARDISON, R.C., AND MILLER, W. 1992. Recent developments in linear-space alignment methods: A survey. J. Comput. Biol. 1, 271-291.
[5]
CHANG, W. I., AND LAMPE, J. 1992. Theoretical and empirical comparisons of approximate string matching algorithms. In Proceedings of the 3rd Symposium on Combinatorial Pattern Matching. Lecture Notes in Computer Science, vol. 644. Springer-Verlag, New York, pp. 172-181.
[6]
CHANG, W.I. AND LAWLER, E.L. 1994. Sublinear expected time approximate matching and biological applications. Algorithmica 12, 327-344.
[7]
CHAO, K.M., HARDISON, R.C., AND MILLER, W. 1992. Recent developments in linear-space alignment methods: A survey. J. Comput. Biol. 1, 271-291.
[8]
CHAO, K. M., PEARSON, W. R., AND MILLER, W. 1992. Aligning two sequences within a specified diagonal band. Comput. Appl. BioSciences 8, 481-487.
[9]
COBBS, A. 1995. Fast approximate matching using suffix trees. In Proceedings of the 6th Symposium on Combinatorial Pattern Matching. Lecture Notes in Computer Science, vol. 937. Springer-Verlag, New York, pp. 41-54.
[10]
GALIL, Z., AND PARK, K. 1990. An improved algorithm for approximate string matching. SIAM J. Comput. 19, 989-999.
[11]
LANDAU, G. M., AND VISHKIN, U. 1988. Fast string matching with k differences. J. Comput. Syst. Sci. 37, 63-78.
[12]
MASEK, W. J., AND PATERSON, M.S. 1980. A faster algorithm for computing string edit distances. J. Comput. Syst. Sci. 20, 18-31.
[13]
MYERS, E.W. 1994. A sublinear algorithm for approximate keywords searching. Algorithmica 12, 345-374.
[14]
PEVZNER, P., AND WATERMAN, M.S. 1995. Multiple filtration and approximate pattern matching. Algorithmica 13, 135-154.
[15]
SELLERS, P.H. 1980. The theory and computations of evolutionary distances: Pattern recognition. J. Algorithms 1, 359-373.
[16]
SUTINEN, E., AND TARHIO, J. 1996. Filtration with q-samples in approximate string matching. In Proceedings of the 7th Symposium on Combinatorial Pattern Matching. Lecture Notes in Computer Science, vol. 1075. Springer-Verlag, New York, pp. 50-63.
[17]
UKKONEN, E. 1985. Finding approximate patterns in strings. J. Algorithms 6, 132-137.
[18]
UKKONEN, E. 1992. Approximate string-matching with q-grams and maximal matches. Theoret. Comput. Sci. 92, 191-211.
[19]
UKKONEN, E. 1993. Approximate string matching over suffix trees. In Proceedings of the 4th Symposium on Combinatorial Pattern Matching. Lecture Notes in Computer Science, vol. 684. Springer-Verlag, New York, pp. 228-242.
[20]
WAGNER, R.A., AND FISCHER, M.J. 1974. The string to string correction problem. J. ACM 21, 168-173.
[21]
WU, S., AND MANBER, U. 1992. Fast text searching allowing errors. Commun. ACM 35, 10, 83-91.
[22]
WU, S., MANBER, U., AND MYERS, G. 1996. A subquadratic algorithm for approximate limited expression matching. Algorithmica 15, 50-67.
[23]
WRIGHT, A.H. 1994. Approximate string matching using within-word parallelism. Soft. Pract. Exper. 24, 337-362.
Published In
Journal of the ACM Volume 46, Issue 3
May 1999
113 pages
Copyright © 1999 ACM.
Publisher
Association for Computing Machinery
New York, NY, United States
Affiliations
Gene Myers