Fast Bit-Vector Algorithms for Approximate String Matching Under Indel Distance (original) (raw)

Faster Bit-Parallel Approximate String Matching

We present a new bit-parallel technique for approximate string matching. We build on two previous techniques. The first one [Myers, J. of the ACM, 1999], searches for a pattern of length m in a text of length n permitting k differences in O(mn/w) time, where w is the width of the computer word. The second one [Navarro and Raffinot, ACM JEA, 2000], extends a sublinear-time exact algorithm to approximate searching. The latter technique makes use of an O(kmn/w) time algorithm [Wu and Manber, Comm. ACM, 1992] for its internal workings. This algorithm is slow but flexible enough to support all the required operations. In this paper we show that the faster algorithm of Myers can be adapted to support all those operations. This involves extending it to compute edit distance, to search for any pattern suffix, and to detect in advance the impossibility of a later match. The result is an algorithm that performs better than the original version of Navarro and Raffinot and that is the fastest for several combinations of m, k and alphabet sizes that are useful, for example, in natural language searching and computational biology.

Improving the bit-parallel NFA of Baeza-Yates and Navarro for approximate string matching

Information Processing Letters, 2008

We propose a new variant of the bit-parallel NFA of Baeza-Yates and Navarro (BPD) for approximate string matching [R. Baeza-Yates, G. Navarro, Faster approximate string matching, Algorithmica 23 (1999) 127-158]. BPD is one of the most practical approximate string matching algorithms under moderate pattern lengths and error levels [G. Myers, A fast bitvector algorithm for approximate string matching based on dynamic programming, J. ACM 46 (3) 1989 395-415; G. Navarro, M. Raffinot, Flexible Pattern Matching in Strings-Practical On-line Search Algorithms for Texts and Biological Sequences, Cambridge University Press, Cambridge, UK, 2002]. Given a length-m pattern and an error threshold k, the original BPD requires (m − k)(k + 2) bits of space to represent an NFA with (m − k)(k + 1) states. In this paper we remove redundancy from the original NFA representation. Our variant requires (m − k)(k + 1) bits of space, which is optimal in the sense that exactly one bit per state is used. The space efficiency is achieved by using an alternative, but equally or even more efficient, simulation algorithm for the bit-parallel NFA. We also present experimental results to compare our modified NFA against the original BPD and its main competitors. Our new variant is more efficient than the original BPD, and it hence takes over/extends the role of the original BPD as one of the most practical approximate string matching algorithms under moderate values of k and m.

A Parallel Solution to the Approximate String Matching Problem

The Computer Journal, 1992

The approximate string matching problem (ASMP) consists of finding all the occurrences of a string of characters X of length m in another string Y of length n,m<£n, where some errors are allowed in these occurrences. A classical sequential algorithm based on dynamic programming solves the problem in time of O(mri). A parallelisation scheme for this algorithm is proposed, which applies to a very general set of errors, and allows to solve ASMP in time T with N processors, with AT of O(mn), thereby achieving optimal speedup. The scheme is suitable for VLSI implementation on a bounded degree network.

Increased Bit-Parallelism for Approximate String Matching

Lecture Notes in Computer Science, 2004

Bit-parallelism permits executing several operations simultaneously over a set of bits or numbers stored in a single computer word. This technique permits searching for the approximate occurrences of a pattern of length m in a text of length n in time O(⌈m/w⌉n), where w is the number of bits in the computer word. Although this is asymptotically the optimal speedup over the basic O(mn) time algorithm, it wastes bitparallelism's power in the common case where m is much smaller than w, since w − m bits in the computer words get unused. In this paper we explore different ways to increase the bit-parallelism when the search pattern is short. First, we show how multiple patterns can be packed in a single computer word so as to search for multiple patterns simultaneously. Instead of paying O(rn) time to search for r patterns of length m < w, we obtain O(⌈r/⌊w/m⌋⌉n) time. Second, we show how the mechanism permits boosting the search for a single pattern of length m < w, which can be searched for in time O(n/⌊w/m⌋) instead of O(n). Finally, we show how to extend these algorithms so that the time bounds essentially depend on k instead of m, where k is the maximum number of differences permitted. Our experimental results show that that the algorithms work well in practice, and are the fastest alternatives for a wide range of search parameters.

Increased bit-parallelism for approximate and multiple string matching

Journal of Experimental Algorithmics, 2005

Bit-parallelism permits executing several operations simultaneously over a set of bits or numbers stored in a single computer word. This technique permits searching for the approximate occurrences of a pattern of length m in a text of length n in time O(⌈m/w⌉n), where w is the number of bits in the computer word. Although this is asymptotically the optimal bit-parallel speedup over the basic O(mn) time algorithm, it wastes bit-parallelism's power in the common case where m is much smaller than w, since w − m bits in the computer words get unused. In this paper we explore different ways to increase the bit-parallelism when the search pattern is short. First, we show how multiple patterns can be packed into a single computer word so as to search for all them simultaneously. Instead of spending O(rn) time to search for r patterns of length m ≤ w/2, we need O(⌈rm/w⌉n) time. Second, we show how the mechanism permits boosting the search for a single pattern of length m ≤ w/2, which can be searched for in O(⌈n/⌊w/m⌋⌉) bit-parallel steps instead of O(n). Third, we show how to extend these algorithms so that the time bounds essentially depend on k instead of m, where k is the maximum number of differences permitted. Finally, we show how the ideas can be applied to other problems such as multiple exact string matching and one-against-all computation of edit distance and longest common subsequences. Our experimental results show that the new algorithms work well in practice, obtaining significant speedups over the best existing alternatives especially on short patterns and moderate number of differences allowed. This work fills an important gap in the field, where little work has focused on very short patterns.

Faster Approximate String Matching

Algorithmica, 1999

We present a new algorithm for on-line approximate string matching. The algorithm is based on the simulation of a nondeterministic finite automaton built from the pattern and using the text as input. This simulation uses bit operations on a RAM machine with word length w = (log n) bits, where n is the text size. This is essentially similar to the model used in Wu and Manber's work, although we improve the search time by packing the automaton states differently. The running time achieved is O(n) for small patterns (i.e., whenever mk = O(log n)), where m is the pattern length and k < m is the number of allowed errors. This is in contrast with the result of Wu and Manber, which is O(kn) for m = O(log n). Longer patterns can be processed by partitioning the automaton into many machine words, at O(mk/w n) search cost. We allow generalizations in the pattern, such as classes of characters, gaps, and others, at essentially the same search cost.

The Max-Shift Algorithm for Approximate String Matching

Lecture Notes in Computer Science, 2001

The approximate string matching problem is to find all locations which a pattern of length m matches a substring of a text of length n with at most k differences. The program agrep is a simple and practical bit-vector algorithm for this problem. In this paper we consider the following incremental version of the problem: given an appropriate encoding of a comparison between A and bB, can one compute the answer for A and B, and the answer for A and Bc with equal efficiency, where b and c are additional symbols? Here we present an elegant and very easy to implement bit-vector algorithm for answering these questions that requires only O(n m/w) time, where n is the length of A, m is the length of B and w is the number of bits in a machine word. We also present an O(nm h/w) algorithm for the fixed-length approximate string matching problem: given a text t, a pattern p and an integer h, compute the optimal alignment of all substrings of p of length h and a substring of t.

Tighter Packed Bit-Parallel NFA for Approximate String Matching

Lecture Notes in Computer Science, 2006

We propose a new variant of the bit-parallel NFA of Baeza-Yates and Navarro (BPD) for approximate string matching [1]. Given a length-m pattern and an error threshold k, the original BPD uses (m − k)(k + 2) bits of space. We decrease this to (m − k)(k + 1), and also give a slightly more efficient simulation algorithm for the NFA. In experiments our modified NFA is often noticeably more efficient than the original algorithm under moderate values of k and m.

Efficient Algorithms for Approximate String Matching with Swaps

Journal of Complexity, 1999

Most research on the edit distance problem and the k-differences problem considered the set of edit operations consisting of changes, insertions, and deletions. In this paper we include the swap operation that interchanges two adjacent characters into the set of allowable edit operations, and we present an O(t min(m, n))-time algorithm for the extended edit distance problem, where t is the edit distance between the given strings, and an O(kn)-time algorithm for the extended k-differences problem. That is, we add swaps into the set of edit operations without increasing the time complexities of previous algorithms that consider only changes, insertions, and deletions for the edit distance and k-differences problems.

A Comparison of Approximate String Matching Algorithms

Software: Practice and Experience, 1996

Experimental comparison of the running time of approximate string matching algorithms for the differences problem is presented. Given a pattern string, a text string, and integer , the task is to find all approximate occurrences of the pattern in the text with at most differences (insertions, deletions, changes). We consider seven algorithms based on different approaches including dynamic programming, Boyer-Moore string matching, suffix automata, and the distribution of characters. It turns out that none of the algorithms is the best for all values of the problem parameters, and the speed differences between the methods can be considerable.