Efficient Bit-Parallel Algorithms for (δ, α)-Matching
Related papers
Faster Bit-Parallel Approximate String Matching
We present a new bit-parallel technique for approximate string matching. We build on two previous techniques. The first one [Myers, J. of the ACM, 1999] searches for a pattern of length m in a text of length n permitting k differences in O(mn/w) time, where w is the width of the computer word. The second one [Navarro and Raffinot, ACM JEA, 2000] extends a sublinear-time exact algorithm to approximate searching. The latter technique makes use of an O(kmn/w) time algorithm [Wu and Manber, Comm. ACM, 1992] for its internal workings. This algorithm is slow but flexible enough to support all the required operations. In this paper we show that the faster algorithm of Myers can be adapted to support all those operations. This involves extending it to compute edit distance, to search for any pattern suffix, and to detect in advance the impossibility of a later match. The result is an algorithm that performs better than the original version of Navarro and Raffinot and that is the fastest for several combinations of m, k and alphabet sizes that are useful, for example, in natural language searching and computational biology.
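For reference, the core column update of Myers' algorithm that this paper adapts can be sketched as follows, assuming the pattern fits in a single 64-bit word (m ≤ w). The identifiers are illustrative, and the paper's extensions (edit-distance computation, searching for pattern suffixes, early abandoning) are not shown.

```c
/* Minimal sketch of the column update of Myers' bit-vector algorithm for
 * approximate searching (m <= w assumed, 64-bit word). Identifiers are
 * illustrative and do not reproduce any particular paper's code. */
#include <stdint.h>
#include <stdio.h>
#include <string.h>

void myers_search(const char *p, const char *t, int k) {
    int m = (int)strlen(p), n = (int)strlen(t);
    uint64_t Peq[256] = {0};
    for (int i = 0; i < m; i++)                 /* Peq[c]: bit i set iff p[i] == c */
        Peq[(unsigned char)p[i]] |= 1ULL << i;

    uint64_t Pv = ~0ULL, Mv = 0;                /* vertical +1 / -1 delta vectors  */
    int score = m;                              /* edit distance at the last row   */
    uint64_t high = 1ULL << (m - 1);

    for (int j = 0; j < n; j++) {
        uint64_t Eq = Peq[(unsigned char)t[j]];
        uint64_t Xv = Eq | Mv;
        uint64_t Xh = (((Eq & Pv) + Pv) ^ Pv) | Eq;
        uint64_t Ph = Mv | ~(Xh | Pv);          /* horizontal +1 deltas */
        uint64_t Mh = Pv & Xh;                  /* horizontal -1 deltas */
        if (Ph & high) score++;
        if (Mh & high) score--;
        Ph <<= 1;                               /* row-0 horizontal delta is 0 when searching */
        Mh <<= 1;
        Pv = Mh | ~(Xv | Ph);
        Mv = Ph & Xv;
        if (score <= k)
            printf("occurrence ends at text position %d (<= %d errors)\n", j, k);
    }
}
```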
Efficient bit-parallel multi-patterns approximate string matching algorithms
2011
The multi-pattern approximate string matching (MASM) problem is to find all occurrences of a set of patterns P0, P1, ..., Pr-1, r ≥ 1, in a given text T[0…n-1], allowing a limited number of errors in the matches. This problem has many applications in computational biology, e.g., finding DNA subsequences after possible mutations and locating the positions of a disease in a genome. The MASM problem was previously solved by Baeza-Yates and Navarro by extending the bit-parallel automaton (BPA) of approximate matching and using the concept of classes of characters. The drawbacks of this approach are: (a) it requires verification of the potential matches, and (b) it can only handle patterns whose length does not exceed the word length w of the computer used. In this paper, we propose two new bit-parallel algorithms to solve the same problem. The new algorithms require no verification and can handle patterns of length greater than w. Both techniques use the same BPA of approximate matching and concatenate the set of r patterns into a single pattern. We compare the performance of the new algorithms with existing ones and find that they have better running times than the previous algorithms.
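The "BPA of approximate matching" referred to here is the classical Wu–Manber bit-parallel update for a single pattern under k differences. A minimal sketch of that building block is given below, assuming k < m ≤ w; the multi-pattern constructions of the paper itself are not shown and the identifiers are illustrative.

```c
/* Sketch of the classical Wu-Manber bit-parallel update for one pattern with
 * at most k differences (insertions, deletions, substitutions). Assumes
 * k < m <= 64; this is only the single-pattern core, not the multi-pattern
 * algorithms of the paper. */
#include <stdint.h>
#include <stdio.h>
#include <string.h>

void wm_search(const char *p, const char *t, int k) {
    int m = (int)strlen(p), n = (int)strlen(t);
    uint64_t B[256] = {0};
    for (int i = 0; i < m; i++)                  /* B[c]: bit i set iff p[i] == c */
        B[(unsigned char)p[i]] |= 1ULL << i;

    uint64_t R[64], newR[64];                    /* R[d]: prefixes matching with <= d errors */
    for (int d = 0; d <= k; d++)
        R[d] = (1ULL << d) - 1;                  /* d leading deletions are free initially */
    uint64_t high = 1ULL << (m - 1);

    for (int j = 0; j < n; j++) {
        uint64_t c = B[(unsigned char)t[j]];
        newR[0] = ((R[0] << 1) | 1) & c;
        for (int d = 1; d <= k; d++) {
            newR[d] = ((R[d] << 1) & c)          /* match                       */
                    | R[d - 1]                   /* insertion into the pattern  */
                    | (R[d - 1] << 1)            /* substitution                */
                    | (newR[d - 1] << 1)         /* deletion from the pattern   */
                    | 1;                         /* a match may start anywhere  */
        }
        for (int d = 0; d <= k; d++) R[d] = newR[d];
        if (R[k] & high)
            printf("occurrence ends at text position %d (<= %d differences)\n", j, k);
    }
}
```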
Bit-parallel string matching under Hamming distance in O(n⌈m/w⌉) worst case time
Information Processing Letters, 2008
Given two strings, a pattern P of length m and a text T of length n over some alphabet Σ, we consider the string matching problem under k mismatches. The well-known Shift-Add algorithm (Baeza-Yates and Gonnet, 1992) solves the problem in O(n⌈m log(k)/w⌉) worst case time, where w is the number of bits in a computer word. We present two algorithms that improve this result to O(n⌈m log log(k)/w⌉) and O(n⌈m/w⌉), respectively. The algorithms make use of nested, varying-length bit-strings that represent the search state. We call these Matryoshka counters. The techniques we developed are of more general use for string matching problems.
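As background, the baseline Shift-Add idea being improved keeps one small counter per pattern position inside the word. The hedged sketch below uses fields of ⌈log2(m+1)⌉ bits so that no overflow handling is needed; this is simpler but wider than both the original algorithm (about log k bits per field plus an overflow mask) and the Matryoshka counters of this paper. It assumes the packed counters fit in one 64-bit word, and the identifiers are illustrative.

```c
/* Hedged sketch of the baseline Shift-Add idea for k mismatches: one counter
 * field per pattern position, wide enough (ceil(log2(m+1)) bits) to never
 * overflow. The original algorithm uses narrower fields plus an overflow
 * mask; the paper above packs the counters tighter still. */
#include <stdint.h>
#include <stdio.h>
#include <string.h>

void shift_add(const char *p, const char *t, int k) {
    int m = (int)strlen(p), n = (int)strlen(t);
    int ell = 1;
    while ((1 << ell) <= m) ell++;               /* ell = ceil(log2(m+1)) bits per counter */
    if (m * ell > 64) { fprintf(stderr, "pattern too long for one word\n"); return; }

    uint64_t ones = 0;                           /* the value 1 in every field */
    for (int i = 0; i < m; i++) ones |= 1ULL << (i * ell);
    uint64_t T[256];
    for (int c = 0; c < 256; c++) T[c] = ones;   /* start from "mismatch at every position" */
    for (int i = 0; i < m; i++)                  /* clear field i where p[i] == c           */
        T[(unsigned char)p[i]] &= ~(1ULL << (i * ell));

    uint64_t D = 0;                              /* field i = mismatches of p[0..i] vs text */
    uint64_t mask = ((1ULL << ell) - 1) << ((m - 1) * ell);
    for (int j = 0; j < n; j++) {
        D = (D << ell) + T[(unsigned char)t[j]]; /* shift counters up one position, add new mismatches */
        if (j >= m - 1) {
            int mism = (int)((D & mask) >> ((m - 1) * ell));
            if (mism <= k)
                printf("match ends at text position %d with %d mismatches\n", j, mism);
        }
    }
}
```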
Increased Bit-Parallelism for Approximate String Matching
Lecture Notes in Computer Science, 2004
Bit-parallelism permits executing several operations simultaneously over a set of bits or numbers stored in a single computer word. This technique permits searching for the approximate occurrences of a pattern of length m in a text of length n in time O(⌈m/w⌉n), where w is the number of bits in the computer word. Although this is asymptotically the optimal speedup over the basic O(mn) time algorithm, it wastes bit-parallelism's power in the common case where m is much smaller than w, since w − m bits in the computer words go unused. In this paper we explore different ways to increase the bit-parallelism when the search pattern is short. First, we show how multiple patterns can be packed in a single computer word so as to search for multiple patterns simultaneously. Instead of paying O(rn) time to search for r patterns of length m < w, we obtain O(⌈r/⌊w/m⌋⌉n) time. Second, we show how the mechanism permits boosting the search for a single pattern of length m < w, which can be searched for in time O(n/⌊w/m⌋) instead of O(n). Finally, we show how to extend these algorithms so that the time bounds essentially depend on k instead of m, where k is the maximum number of differences permitted. Our experimental results show that the algorithms work well in practice, and are the fastest alternatives for a wide range of search parameters.
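The simplest instance of the pattern-packing idea, exact Shift-And search for r equal-length patterns sharing one word, can be sketched as follows. This only illustrates the packing trick under the assumption r·m ≤ w; the approximate-matching variants described in the paper are not shown, and the identifiers are illustrative.

```c
/* Minimal sketch of packing r equal-length patterns into one computer word
 * for exact Shift-And searching. Assumes r * m <= 64. */
#include <stdint.h>
#include <stdio.h>
#include <string.h>

void packed_shift_and(const char **pats, int r, int m, const char *t) {
    if (r * m > 64) { fprintf(stderr, "patterns do not fit in one word\n"); return; }
    uint64_t B[256] = {0}, init = 0, top = 0;
    for (int i = 0; i < r; i++) {
        init |= 1ULL << (i * m);                 /* first bit of pattern i's field */
        top  |= 1ULL << (i * m + m - 1);         /* last bit of pattern i's field  */
        for (int j = 0; j < m; j++)              /* B[c]: match positions in every field */
            B[(unsigned char)pats[i][j]] |= 1ULL << (i * m + j);
    }
    uint64_t D = 0;
    int n = (int)strlen(t);
    for (int j = 0; j < n; j++) {
        /* the OR with init re-seeds every field, so a bit spilling across a
         * field boundary after the shift is harmless */
        D = ((D << 1) | init) & B[(unsigned char)t[j]];
        uint64_t hits = D & top;
        while (hits) {                           /* report which patterns matched */
            int b = __builtin_ctzll(hits);       /* GCC/Clang builtin: lowest set bit */
            printf("pattern %d ends at text position %d\n", b / m, j);
            hits &= hits - 1;
        }
    }
}
```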
Practical and Optimal String Matching
2005
We develop a new exact bit-parallel string matching algorithm, based on the Shift-Or algorithm (Baeza-Yates & Gonnet, 1992). Assuming that the pattern representation fits into a single computer word, this algorithm has optimal O(n log_σ m / m) average running time, as well as optimal O(n) worst case running time, where n, m and σ are the sizes of the text, the pattern, and the alphabet, respectively. We also study several implementation details. The experimental results show that our algorithm is the fastest in most of the cases where it can be applied, displacing even the long-standing BNDM (Navarro & Raffinot, 2000) family of algorithms. Finally, we show how to adapt our techniques for the Shift-Add algorithm (Baeza-Yates & Gonnet, 1992), obtaining optimal time for searching under Hamming distance.
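For context, the classical Shift-Or baseline that this algorithm builds on looks as follows; this is a minimal sketch assuming m ≤ w with illustrative identifiers, and the average-optimal skipping mechanism of the paper is not shown.

```c
/* The classical Shift-Or baseline (Baeza-Yates & Gonnet, 1992) -- not the
 * average-optimal variant itself. Assumes m <= 64. */
#include <stdint.h>
#include <stdio.h>
#include <string.h>

void shift_or(const char *p, const char *t) {
    int m = (int)strlen(p), n = (int)strlen(t);
    uint64_t B[256];
    for (int c = 0; c < 256; c++) B[c] = ~0ULL;  /* 1 = "pattern does not match here" */
    for (int i = 0; i < m; i++)
        B[(unsigned char)p[i]] &= ~(1ULL << i);  /* clear bit i where p[i] == c       */

    uint64_t D = ~0ULL;                          /* all states inactive (0 = active)  */
    uint64_t high = 1ULL << (m - 1);
    for (int j = 0; j < n; j++) {
        D = (D << 1) | B[(unsigned char)t[j]];   /* shift in an active empty prefix   */
        if ((D & high) == 0)
            printf("exact occurrence ends at text position %d\n", j);
    }
}
```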
Twenty Years of Bit-Parallelism in String Matching
2012
It has been twenty years since the publication of the two seminal papers of Baeza-Yates and Gonnet and of Wu and Manber in the September 1992 issue of the Communications of the ACM. The use of the intrinsic parallelism of the bit operations inside a computer word, the so-called bit-parallelism, allows the number of operations an algorithm performs to be cut down by a factor of up to ω, where ω is the number of bits in the computer word. This was achieved at the time by the Shift-Or and the Shift-And string matching algorithms. These two papers have inspired a great deal of work, and since 1992 a large number of papers have been published describing string matching algorithms that use this technique. In this survey we review these solutions for exact single string matching, exact multiple string matching, and approximate single string matching.
Increased bit-parallelism for approximate and multiple string matching
Journal of Experimental Algorithmics, 2005
Bit-parallelism permits executing several operations simultaneously over a set of bits or numbers stored in a single computer word. This technique permits searching for the approximate occurrences of a pattern of length m in a text of length n in time O(⌈m/w⌉n), where w is the number of bits in the computer word. Although this is asymptotically the optimal bit-parallel speedup over the basic O(mn) time algorithm, it wastes bit-parallelism's power in the common case where m is much smaller than w, since w − m bits in the computer words go unused. In this paper we explore different ways to increase the bit-parallelism when the search pattern is short. First, we show how multiple patterns can be packed into a single computer word so as to search for all of them simultaneously. Instead of spending O(rn) time to search for r patterns of length m ≤ w/2, we need O(⌈rm/w⌉n) time. Second, we show how the mechanism permits boosting the search for a single pattern of length m ≤ w/2, which can be searched for in O(⌈n/⌊w/m⌋⌉) bit-parallel steps instead of O(n). Third, we show how to extend these algorithms so that the time bounds essentially depend on k instead of m, where k is the maximum number of differences permitted. Finally, we show how the ideas can be applied to other problems such as multiple exact string matching and one-against-all computation of edit distance and longest common subsequences. Our experimental results show that the new algorithms work well in practice, obtaining significant speedups over the best existing alternatives, especially for short patterns and a moderate number of allowed differences. This work fills an important gap in the field, since little previous work has focused on very short patterns.
Optimal parallel pattern matching in strings
Lecture Notes in Computer Science
Given a text of length n and a pattern of length m, we present a parallel linear algorithm for finding all occurrences of the pattern in the text. The algorithm runs in O(n/p) time using any number p ≤ n/log m of processors on a concurrent-read concurrent-write parallel random-access machine (CRCW PRAM).
Faster Approximate String Matching
Algorithmica, 1999
We present a new algorithm for on-line approximate string matching. The algorithm is based on the simulation of a nondeterministic finite automaton built from the pattern and using the text as input. This simulation uses bit operations on a RAM machine with word length w = Ω(log n) bits, where n is the text size. This is essentially similar to the model used in Wu and Manber's work, although we improve the search time by packing the automaton states differently. The running time achieved is O(n) for small patterns (i.e., whenever mk = O(log n)), where m is the pattern length and k < m is the number of allowed errors. This is in contrast with the result of Wu and Manber, which is O(kn) for m = O(log n). Longer patterns can be processed by partitioning the automaton into many machine words, at O((mk/w) n) search cost. We allow generalizations in the pattern, such as classes of characters, gaps, and others, at essentially the same search cost.
Practical algorithms for transposition-invariant string-matching
Journal of Discrete Algorithms, 2005
We consider the problems of (1) the longest common subsequence (LCS) of two given strings in the case where the first may be shifted by some constant (that is, transposed) to match the second, and (2) transposition-invariant text searching using indel distance. These problems have applications in music comparison and retrieval. We introduce two novel techniques to solve these problems efficiently. The first is based on the branch and bound method, the second on bit-parallelism. Our branch and bound algorithm computes the longest common transposition-invariant subsequence (LCTS) in time O((m² + log log σ) log σ) in the best case and O((m² + log σ)σ) in the worst case, where m and σ, respectively, are the length of the strings and the size of the alphabet. On the other hand, we show that the same problem can be solved by using bit-parallelism, and thus obtain a speedup of O(w / log m) over the classical algorithms, where the computer word has w bits. The advantage of this latter algorithm over existing bit-parallel ones is that it allows the use of more complex distances, including general integer weights. Since our branch and bound method is very flexible, it can be further improved by combining it with other efficient algorithms such as our novel bit-parallel algorithm. We experiment with several combination possibilities and discuss which settings are best for each of those combinations. Our algorithms are easily extended to other musically relevant cases, such as δ-matching and polyphony (where there are several parallel texts to be considered). We also show how our bit-parallel algorithm is adapted to text searching and illustrate its effectiveness in complex cases where the only known competing method is the use of brute force.
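As background for the bit-parallel component, the plain (non-transposition-invariant) bit-parallel LCS-length recurrence in the style of Allison–Dix and Crochemore et al. can be sketched as follows, assuming the first string fits in one 64-bit word; the weighted and transposition-invariant extensions discussed above are not shown, and the identifiers are illustrative.

```c
/* Minimal sketch of the classical bit-parallel LCS-length recurrence that
 * bit-parallel LCS techniques build on -- not the transposition-invariant
 * algorithm itself. Assumes strlen(a) <= 64. */
#include <stdint.h>
#include <string.h>

int lcs_length(const char *a, const char *b) {
    int m = (int)strlen(a), n = (int)strlen(b);
    uint64_t M[256] = {0};
    for (int i = 0; i < m; i++)                  /* M[c]: bit i set iff a[i] == c */
        M[(unsigned char)a[i]] |= 1ULL << i;

    uint64_t V = ~0ULL;                          /* 0-bits mark rows where the LCS value grows */
    for (int j = 0; j < n; j++) {
        uint64_t P = M[(unsigned char)b[j]];
        V = (V + (V & P)) | (V & ~P);            /* one column of the DP in O(1) word operations */
    }
    int lcs = 0;                                 /* LCS length = number of 0-bits among the m low positions */
    for (int i = 0; i < m; i++)
        if (!(V & (1ULL << i))) lcs++;
    return lcs;
}
```

For example, lcs_length("ab", "ba") returns 1 and lcs_length("ab", "ab") returns 2; processing each text character costs a constant number of word operations, which is the source of the O(w / log m) speedup mentioned in the abstract.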