A note on the longest common substring with k-mismatches problem


ABSTRACT The longest common substring with $k$-mismatches problem is to find, given two strings $S_1$ and $S_2$, a longest substring $A_1$ of $S_1$ and $A_2$ of $S_2$ such that the Hamming distance between $A_1$ and $A_2$ is $\le k$. We introduce a practical $O(nm)$ time and $O(1)$ space solution for this problem, where $n$ and $m$ are the lengths of $S_1$ and $S_2$, respectively. This algorithm can also be used to compute the matching statistics with $k$-mismatches of $S_1$ and $S_2$ in $O(nm)$ time and $O(m)$ space. Moreover, we also present a theoretical solution for the $k = 1$ case which runs in $O((n + m) \log (n + m))$ time and uses $O(n + m)$ space, improving over the existing $O(nm)$ time and $O(m)$ space bound of Babenko and Starikovskaya.
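To make the $O(nm)$ time and $O(1)$ space bound concrete, here is a minimal Python sketch of one way to reach it: scan every diagonal (every alignment of $S_2$ against $S_1$) with a two-pointer window that never holds more than $k$ mismatches. The function name and the `(length, start1, start2)` return convention are assumptions made for illustration; the abstract does not spell out the authors' procedure, so this should be read as a sketch in the same spirit rather than as their algorithm.

```python
def longest_common_substring_k_mismatches(s1, s2, k):
    """Return (length, start1, start2) for a longest pair of equal-length
    substrings of s1 and s2 whose Hamming distance is at most k (k >= 0).

    Hypothetical helper for illustration only.
    """
    n, m = len(s1), len(s2)
    best = (0, 0, 0)
    # Each diagonal d = i - j fixes one alignment of s2 against s1.
    for d in range(-(m - 1), n):
        i0, j0 = max(d, 0), max(-d, 0)   # first aligned pair on this diagonal
        length = min(n - i0, m - j0)     # number of aligned pairs on it
        left = 0                         # current window is [left, right]
        mismatches = 0
        for right in range(length):
            if s1[i0 + right] != s2[j0 + right]:
                mismatches += 1
            # Shrink from the left until at most k mismatches remain.
            while mismatches > k:
                if s1[i0 + left] != s2[j0 + left]:
                    mismatches -= 1
                left += 1
            width = right - left + 1
            if width > best[0]:
                best = (width, i0 + left, j0 + left)
    return best
```

There are $n + m - 1$ diagonals, and on each one both pointers only move forward, so the total work is $O(nm)$. Apart from the input strings, only a constant number of integers are kept.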

This paper investigates the approximability of the Longest Common Subsequence (LCS) problem. The fastest algorithm for solving the LCS problem exactly runs in essentially quadratic time in the length of the input, and it is known that under the Strong Exponential Time Hypothesis the quadratic running time cannot be beaten. There are no such limitations for the approximate computation of the LCS, however, except in some limited scenarios. There is also a scarcity of approximation algorithms. When the two given strings are over an alphabet of size k, returning the subsequence formed by the most frequent symbol occurring in both strings achieves a 1/k approximation for the LCS. It is an open problem whether a better than 1/k approximation can be achieved in truly subquadratic time (O(n^{2−δ}) time for constant δ > 0). A recent result [Rubinstein and Song, SODA 2020] showed that a 1/2 + ε approximation for the LCS over a binary alphabet is possible in truly subquadratic time, provided the...
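The 1/k baseline mentioned above is simple enough to sketch in Python. The function name is an assumption made for illustration. The guarantee follows because an LCS of length L over k symbols must use some symbol at least L/k times, and that symbol then occurs at least L/k times in both strings.

```python
from collections import Counter


def single_symbol_common_subsequence(a, b):
    """Return the longest common subsequence of a and b that consists of
    one repeated symbol (hypothetical helper for illustration)."""
    count_a, count_b = Counter(a), Counter(b)
    best_symbol, best_len = "", 0
    for symbol in count_a:
        # A symbol can be repeated as often as it occurs in both strings.
        common = min(count_a[symbol], count_b[symbol])
        if common > best_len:
            best_symbol, best_len = symbol, common
    return best_symbol * best_len
```

For example, with `a = "abab"` and `b = "bbba"` the symbol `b` occurs twice in `a` and three times in `b`, so the sketch returns `"bb"`. The running time is linear in the input length for a constant-sized alphabet.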

Given a set of $k$ strings $I$, their longest common subsequence (LCS) is the string with the maximum length that is a subsequence of all the strings in $I$. A data-structure for this problem preprocesses $I$ into a data-structure such that the LCS of a set of query strings $Q$ with the strings of $I$ can be computed faster. Since the problem is NP-hard for arbitrary $k$, we allow an error that allows some characters to be replaced by other characters. We define the approximation version of the problem with an extra input $m$, which is the length of the regular expression (regex) that describes the input, and the approximation factor is the logarithm of the number of possibilities in the regex returned by the algorithm, divided by the logarithm of the number of possibilities in the regex with the minimum number of possibilities.

We propose a new algorithm for computing the longest prefix of each suffix of a given string of length n over a constant-sized alphabet of size σ that occurs elsewhere in the string with Hamming distance at most k. Specifically, we show that the proposed algorithm requires time O(n(σR) log log n (log k + log log n)) on average, where R = ⌈(k + 2)(log_σ n + 1)⌉, and space O(n). This improves upon the state-of-the-art average-case time complexity for the case when k = 1 [Manzini, SPIRE 2015] by a factor of log n / log log n. In addition, we show how the proposed technique can be adapted and applied in order to compute the longest previous factors under the Hamming distance model within the same complexities. In terms of real-world applications, we show that our technique can be directly applied to the problem of genome mappability.
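The quantity being computed can be pinned down with a brute-force reference in Python. This is not the proposed algorithm and comes nowhere near its average-case bound; it only illustrates the definition. The function name is hypothetical, and the sketch assumes that "elsewhere" means any other starting position and that a match is cut off at the end of the string.

```python
def longest_prefixes_with_k_mismatches(s, k):
    """For each suffix s[i:], return the length of its longest prefix that
    also starts at some other position j != i with at most k mismatches.

    Brute-force reference for illustration only (cubic time in the worst case).
    """
    n = len(s)
    result = [0] * n
    for i in range(n):
        for j in range(n):
            if i == j:
                continue
            mismatches = 0
            length = 0
            # Extend while both positions stay inside the string.
            while i + length < n and j + length < n:
                if s[i + length] != s[j + length]:
                    if mismatches == k:
                        break  # a further mismatch would exceed the budget
                    mismatches += 1
                length += 1
            result[i] = max(result[i], length)
    return result
```

For `s = "abcabd"` the suffix starting at position 0 gets length 2 with `k = 0` (the exact repeat `ab`) and length 3 with `k = 1` (`abc` against `abd`).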