Bit-parallel string matching under Hamming distance in O(n[m/w]) worst case time (original) (raw)

Practical and Optimal String Matching

2005

We develop a new exact bit-parallel string matching algorithm, based on the Shift-Or algorithm (Baeza-Yates & Gonnet, 1992). Assuming that the pattern representation fits into a single computer word, this algorithm has optimal O(n logσ m / m) average running time, as well as optimal O(n) worst case running time, where n, m and σ are the sizes of the text, the pattern, and the alphabet, respectively. We also study several implementation details. The experimental results show that our algorithm is the fastest in most of the cases where it can be applied, displacing even the long-standing BNDM (Navarro & Raffinot, 2000) family of algorithms. Finally, we show how to adapt our techniques for the Shift-Add algorithm (Baeza-Yates & Gonnet, 1992), obtaining optimal time for searching under Hamming distance.

Nested Counters in Bit-Parallel String Matching

2009

Many algorithms, e.g. in the field of string matching, are based on handling many counters, which can be performed in parallel, even on a sequential machine, using bit-parallelism. The recently presented technique of nested counters (Matryoshka counters) [1] is to handle small counters most of the time, and refer to larger counters periodically, when the small counters may get full, to prevent overflow. In this work, we present several non-trivial applications of Matryoshka counters in string matching algorithms, improving their worst- or average-case time complexities. The set of problems comprises (δ,α)-matching, matching with k insertions, episode matching, and matching under Levenshtein distance.

Twenty Years of Bit-Parallelism in String Matching

2012

It has been twenty years since the publication of the two seminal papers of Baeza-Yates and Gonnet and of Wu and Manber in the September 1992 issue of the Communications of the ACM. The use of intrinsic parallelism of the bit operations inside a computer word, the so-called bit-parallelism, allows to cut down the number of operations that an algorithm performs by a factor up to ω, where ω is the number of bits in the computer word. This was then achieved by the Shift-Or and the Shift-And string matching algorithms. These two papers has inspired a lot of works and since 1992 a large number of papers were published describing string matching algorithms using this technique. In this survey we will review these solutions for exact single string matching, for exact multiple string matching and for approximate single string matching.

Faster Bit-Parallel Approximate String Matching

We present a new bit-parallel technique for approximate string matching. We build on two previous techniques. The first one [Myers, J. of the ACM, 1999], searches for a pattern of length m in a text of length n permitting k differences in O(mn/w) time, where w is the width of the computer word. The second one [Navarro and Raffinot, ACM JEA, 2000], extends a sublinear-time exact algorithm to approximate searching. The latter technique makes use of an O(kmn/w) time algorithm [Wu and Manber, Comm. ACM, 1992] for its internal workings. This algorithm is slow but flexible enough to support all the required operations. In this paper we show that the faster algorithm of Myers can be adapted to support all those operations. This involves extending it to compute edit distance, to search for any pattern suffix, and to detect in advance the impossibility of a later match. The result is an algorithm that performs better than the original version of Navarro and Raffinot and that is the fastest for several combinations of m, k and alphabet sizes that are useful, for example, in natural language searching and computational biology.

Increased Bit-Parallelism for Approximate String Matching

Lecture Notes in Computer Science, 2004

Bit-parallelism permits executing several operations simultaneously over a set of bits or numbers stored in a single computer word. This technique permits searching for the approximate occurrences of a pattern of length m in a text of length n in time O(⌈m/w⌉n), where w is the number of bits in the computer word. Although this is asymptotically the optimal speedup over the basic O(mn) time algorithm, it wastes bitparallelism's power in the common case where m is much smaller than w, since w − m bits in the computer words get unused. In this paper we explore different ways to increase the bit-parallelism when the search pattern is short. First, we show how multiple patterns can be packed in a single computer word so as to search for multiple patterns simultaneously. Instead of paying O(rn) time to search for r patterns of length m < w, we obtain O(⌈r/⌊w/m⌋⌉n) time. Second, we show how the mechanism permits boosting the search for a single pattern of length m < w, which can be searched for in time O(n/⌊w/m⌋) instead of O(n). Finally, we show how to extend these algorithms so that the time bounds essentially depend on k instead of m, where k is the maximum number of differences permitted. Our experimental results show that that the algorithms work well in practice, and are the fastest alternatives for a wide range of search parameters.

A Parallel Algorithm for Fixed-Length Approximate String-Matching with k-mismatches

Lecture Notes in Computer Science, 2010

This paper deals with the approximate string-matching problem with Hamming distance. The approximate string-matching with kmismatches problem is to find all locations at which a query of length m matches a factor of a text of length n with k or fewer mismatches. The approximate string-matching algorithms have both pleasing theoretical features, as well as direct applications, especially in computational biology. We consider a generalisation of this problem, the fixed-length approximate string-matching with k-mismatches problem: given a text t, a pattern x and an integer ℓ, search for all the occurrences in t of all factors of x of length ℓ with k or fewer mismatches with a factor of t. We present a practical parallel algorithm of comparable simplicity that requires only O(nm⌈ℓ/w⌉ p) time, where w is the word size of the machine (e.g. 32 or 64 in practice) and p the number of processors. Thus the algorithm's performance is independent of k and the alphabet size |Σ|. The proposed parallel algorithm makes use of message-passing parallelism model, and word-level parallelism for efficient approximate string-matching.

Bit Parallel String Matching Algorithms: A Survey

International Journal of Computer Applications, 2014

The intrinsic parallelism in bit operations like AND/OR inside a computer word is known as bit parallelism. Since 1992, this bit parallelism is directly used in string matching for matching efficiency improvement. Some of the popular bit parallel string matching algorithms Shift OR, Shift OR with Q-Gram, BNDM, TNDM, SBNDM, LBNDM, FBNDM, BNDMq, and Multiple pattern BNDM. This paper discusses the working of various bit parallel string matching algorithms with example. Here we present how bit parallelism is useful for efficiency improvement in various algorithms.

On string matching with k mismatches

In this paper we consider several variants of the pattern matching problem. In particular, we investigate the following problems: 1) Pattern matching with k mismatches; 2) Approximate counting of mismatches; and 3) Pattern matching with mismatches. The distance metric used is the Hamming distance. We present some novel algorithms and techniques for solving these problems. Both deterministic and randomized algorithms are offered. Variants of these problems where there could be wild cards in either the text or the pattern or both are considered. An experimental evaluation of these algorithms is also presented. The source code is available at http://www.engr.uconn.edu/∼man09004/kmis.zip.

Increased bit-parallelism for approximate and multiple string matching

Journal of Experimental Algorithmics, 2005

Bit-parallelism permits executing several operations simultaneously over a set of bits or numbers stored in a single computer word. This technique permits searching for the approximate occurrences of a pattern of length m in a text of length n in time O(⌈m/w⌉n), where w is the number of bits in the computer word. Although this is asymptotically the optimal bit-parallel speedup over the basic O(mn) time algorithm, it wastes bit-parallelism's power in the common case where m is much smaller than w, since w − m bits in the computer words get unused. In this paper we explore different ways to increase the bit-parallelism when the search pattern is short. First, we show how multiple patterns can be packed into a single computer word so as to search for all them simultaneously. Instead of spending O(rn) time to search for r patterns of length m ≤ w/2, we need O(⌈rm/w⌉n) time. Second, we show how the mechanism permits boosting the search for a single pattern of length m ≤ w/2, which can be searched for in O(⌈n/⌊w/m⌋⌉) bit-parallel steps instead of O(n). Third, we show how to extend these algorithms so that the time bounds essentially depend on k instead of m, where k is the maximum number of differences permitted. Finally, we show how the ideas can be applied to other problems such as multiple exact string matching and one-against-all computation of edit distance and longest common subsequences. Our experimental results show that the new algorithms work well in practice, obtaining significant speedups over the best existing alternatives especially on short patterns and moderate number of differences allowed. This work fills an important gap in the field, where little work has focused on very short patterns.

On String Matching with Mismatches

Algorithms, 2015

In this paper, we consider several variants of the pattern matching with mismatches problem. In particular, given a text T = t 1 t 2 • • • t n and a pattern P = p 1 p 2 • • • p m , we investigate the following problems: (1) pattern matching with mismatches: for every i, 1 ≤ i ≤ n − m + 1 output, the distance between P and t i t i+1 • • • t i+m−1 ; and (2) pattern matching with k mismatches: output those positions i where the distance between P and t i t i+1 • • • t i+m−1 is less than a given threshold k. The distance metric used is the Hamming distance. We present some novel algorithms and techniques for solving these problems. We offer deterministic, randomized and approximation algorithms. We consider variants of these problems where there could be wild cards in either the text or the pattern or both. We also present an experimental evaluation of these algorithms. The source code is available at http://www.engr.uconn.edu/∼man09004/kmis.zip.