Kunihiko Sadakane | National Institute of Informatics
Papers by Kunihiko Sadakane
Journal of Discrete Algorithms, 2007
ACM Transactions on Algorithms, 2007
Theory of Computing Systems / Mathematical Systems Theory, 2007
Recent research in compressing suffix arrays has resulted in two breakthrough indexing data structures, namely, compressed suffix arrays (CSA) [7] and the FM-index [5]. Either of them makes it feasible to store a full-text index in main memory even for a piece of text data with a few billion characters (such as human DNA). However, constructing such indexing data structures with limited working memory (i.e., without constructing suffix arrays) is not a trivial task. This paper addresses this problem. Currently, only the CSA admits a space-efficient construction algorithm [15]. For a text T of length n over an alphabet Σ, this algorithm requires O(|Σ| n log n) time and (2H_0 + 1 + ε)n bits of working space, where H_0 is the 0th-order empirical entropy of T and ε is any positive constant. This algorithm is good enough when the alphabet size |Σ| is small, but it is not practical for text data containing protein, Chinese, or Japanese, where the alphabet may include up to a few thousand characters. The main contribution of this paper is a new algorithm that constructs the CSA in O(n log n) time using (H_0 + 2 + ε)n bits of working space. Note that the running time of our algorithm is independent of the alphabet size, and the space requirement is smaller since it is likely that H_0 > 1. This paper also contributes to the space-efficient construction of the FM-index: we show that the FM-index can be constructed directly from the CSA in O(n) time.
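To make the (H_0 + 2 + ε)n-bit bound concrete, the order-0 empirical entropy is computed directly from character frequencies. A minimal sketch in Python (the function name and example string are illustrative, not from the paper):

```python
import math
from collections import Counter

def empirical_entropy_h0(text: str) -> float:
    """Order-0 empirical entropy H_0 in bits per symbol:
    H_0 = sum over distinct characters c of (n_c / n) * log2(n / n_c),
    where n_c is the number of occurrences of c in the text."""
    n = len(text)
    return sum((c / n) * math.log2(n / c) for c in Counter(text).values())

# empirical_entropy_h0("mississippi") is about 1.823 bits per symbol, so the
# working space of the new algorithm would be roughly (1.823 + 2 + eps) * n bits.
```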
A compressed text database based on the compressed suffix array is proposed. The compressed suffix array of Grossi and Vitter occupies only O(n) bits for a text of length n; however, it also uses the text itself, which occupies $O(n\log|\Sigma|)$ bits for an alphabet Σ. Our data structure, on the other hand, does not use the text itself, and it supports the important operations for text databases: inverse, search, and decompress. Our algorithms can find occ occurrences of any substring P of the text in $O(|P|\log n + occ\,\log^{\varepsilon} n)$ time and decompress a part of the text of length l in $O(l + \log^{\varepsilon} n)$ time, for any given $0 < \varepsilon \le 1$. Our data structure occupies only $n\left(\frac{2}{\varepsilon}\left(\frac{3}{2} + H_0 + 2\log H_0\right) + 2 + \frac{4\log^{\varepsilon} n}{\log^{\varepsilon} n - 1}\right) + o(n) + O(|\Sigma|\log|\Sigma|)$ bits, where $H_0 \le \log|\Sigma|$ is the order-0 entropy of the text. We also show the relationship with the opportunistic data structure of Ferragina and Manzini.
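For intuition about the search operation, the sketch below answers the same substring query with a plain, uncompressed suffix array and binary search; the contribution of the compressed structure is to answer it in $O(|P|\log n + occ\,\log^{\varepsilon} n)$ time without storing either the text or the array. A toy Python illustration, not the paper's algorithm:

```python
def suffix_array(text: str) -> list[int]:
    """Toy O(n^2 log n) construction: lexicographically sorted
    pointers (starting positions) of all suffixes of text."""
    return sorted(range(len(text)), key=lambda i: text[i:])

def find_occurrences(text: str, sa: list[int], p: str) -> list[int]:
    """All starting positions of pattern p, via two binary searches
    locating the contiguous range of suffixes that start with p."""
    n, m = len(sa), len(p)
    lo, hi = 0, n
    while lo < hi:                            # leftmost suffix with prefix >= p
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] < p:
            lo = mid + 1
        else:
            hi = mid
    start, hi = lo, n
    while lo < hi:                            # just past the last prefix == p
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] <= p:
            lo = mid + 1
        else:
            hi = mid
    return sorted(sa[start:lo])

# find_occurrences("mississippi", suffix_array("mississippi"), "ssi") -> [2, 5]
```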
Theoretical Computer Science, 2005
Algorithmica, 2007
With the first human DNA being decoded into a sequence of about 2.8 billion characters, much biological research has been centered on analyzing this sequence. Theoretically speaking, it is now feasible to accommodate an index for human DNA in the main memory so that any pattern can be located efficiently. This is due to the recent breakthrough on compressed suffix arrays, which reduces the space requirement from O(n log n) bits to O(n) bits. However, constructing compressed suffix arrays is still not an easy task because we still have to compute suffix arrays first and need a working memory of O(n log n) bits (i.e., more than 13 gigabytes for human DNA). This paper initiates the study of constructing compressed suffix arrays directly from the text. The main contribution is a construction algorithm that uses only O(n) bits of working memory, and the time complexity is O(n log n). Our construction algorithm is also time and space efficient for texts with large alphabets such as Chinese or Japanese. Precisely, when the alphabet size is |Σ|, the working space is O(n log |Σ|) bits, and the time complexity remains O(n log n), which is independent of |Σ|.
With the first human DNA being decoded into a sequence of about 2.8 billion base pairs, much biological research has been centered on analyzing this sequence. Theoretically speaking, it is now feasible to accommodate an index for human DNA in main memory so that any pattern can be located efficiently. This is due to the recent breakthrough on compressed suffix arrays, which reduces the space requirement from O(n log n) bits to O(n) bits. However, constructing compressed suffix arrays is still not an easy task because we still have to compute suffix arrays first and need a working memory of O(n log n) bits (i.e., more than 13 gigabytes for human DNA). This paper initiates the study of constructing compressed suffix arrays directly from the text. The main contribution is a new construction algorithm that uses only O(n) bits of working memory and, more importantly, keeps the time complexity at O(n log n).
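The function at the heart of the CSA is Ψ, which the construction algorithms above build incrementally in compressed form rather than via the suffix array. A sketch that derives Ψ from an already-built suffix array, purely for illustration (avoiding exactly this detour is the point of the paper):

```python
def psi_from_sa(sa: list[int]) -> list[int]:
    """Psi[i] = SA^{-1}[SA[i] + 1]: the lexicographic rank of the suffix
    starting one position to the right of suffix SA[i].  A CSA stores Psi
    compressed into O(n) bits (it is piecewise increasing within character
    buckets) instead of the O(n log n)-bit suffix array itself."""
    n = len(sa)
    rank = [0] * n                  # inverse permutation of sa
    for r, pos in enumerate(sa):
        rank[pos] = r
    return [rank[(sa[i] + 1) % n] for i in range(n)]
```

Iterating i → Ψ[i] from the rank of text position 0 visits the suffixes in text order, which is how a CSA decodes the text without storing it.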
Given a set $\mathcal{T}$ of rooted, unordered trees, where each $T_i \in \mathcal{T}$ is distinctly leaf-labeled by a set $\Lambda(T_i)$ and where the sets $\Lambda(T_i)$ may overlap, the maximum agreement supertree problem (MASP) is to construct a distinctly leaf-labeled tree Q with leaf set $\Lambda(Q) \subseteq \bigcup_{T_i \in \mathcal{T}} \Lambda(T_i)$ such that $|\Lambda(Q)|$ is maximized and, for each $T_i \in \mathcal{T}$, the topological restriction of $T_i$ to $\Lambda(Q)$ is isomorphic to the topological restriction of Q to $\Lambda(T_i)$. Let $n = |\bigcup_{T_i \in \mathcal{T}} \Lambda(T_i)|$, $k = |\mathcal{T}|$, and $D = \max_{T_i \in \mathcal{T}} \{\deg(T_i)\}$. We first show that MASP with k = 2 can be solved in $O(\sqrt{D}\, n \log(2n/D))$ time, which is $O(n \log n)$ when D = O(1) and $O(n^{1.5})$ when D is unrestricted. We then present an algorithm for MASP with D = 2 whose running time is polynomial if k = O(1). On the other hand, we prove that MASP is NP-hard for any fixed k ≥ 3 when D is unrestricted, and also NP-hard for any fixed D ≥ 2 when k is unrestricted, even if each input tree is required to contain at most three leaves. Finally, we describe a polynomial-time (n/log n)-approximation algorithm for MASP.
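The recurring primitive in this definition is the topological restriction of a tree to a leaf subset. A minimal Python sketch, assuming a hypothetical (label, children) tuple encoding of rooted trees (not from the paper): keep only the selected leaves, then suppress internal nodes left with a single child.

```python
def restrict(tree, leaves):
    """Topological restriction of a rooted tree to the leaf set `leaves`.
    A tree is a (label, children) pair; a leaf has children == [].
    Returns None if no selected leaf survives in this subtree."""
    label, children = tree
    if not children:                         # leaf: keep iff selected
        return tree if label in leaves else None
    kept = [r for r in (restrict(c, leaves) for c in children) if r is not None]
    if not kept:
        return None
    if len(kept) == 1:                       # suppress unary internal node
        return kept[0]
    return (label, kept)

# restrict(("r", [("a", []), ("x", [("b", []), ("c", [])])]), {"a", "c"})
# -> ("r", [("a", []), ("c", [])])          # internal node x collapses away
```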
Computing Research Repository, 2006
Journal of Algorithms, 2003
The suffix array is a memory-efficient data structure for searching for any substring of a text. It is also used to define the Burrows-Wheeler transformation (BWT), which is the core of block sorting. When a compressed text is decoded, the inverse of the BWT, which is faster than the forward transformation, is performed, and in the process the suffix array of the text is also obtained. This means that we can compress and transfer a text together with its suffix array simply by using block sorting, a fact that can be used for creating large full-text databases. We propose a modified Burrows-Wheeler transformation. Using our transformation, we obtain from a compressed text a suffix array that can be used for case-insensitive searches; an exact query can then be done from the result of a case-insensitive search because we can decode the original text from the compressed text. The method applies to case-insensitive search and to more general character conversions. We call the conversion unification and the text after conversion the unified text. The proposed transformation is defined by the suffix array of the unified text: it is not a permutation of the alphabet followed by the original transformation, but a combination of unification and the original transformation. From a text compressed with our transformation we can obtain both the original text and the suffix array of the unified text, and after decoding we can perform ambiguous searches such as case-insensitive search using that suffix array. Experimental results show that our transformation decreases the compression ratio very little. Though decompression and search take more time than decoding with the original block sorting plus the grep command, finding the positions of keywords is quite fast, which is useful for advanced searches.
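A minimal Python sketch of the idea, assuming lowercasing as the unification (the paper allows more general character-wise conversions): the rotation order comes from the unified text, while the output column keeps the original characters, so inversion recovers the original text and the implied suffix array orders the unified text.

```python
def unified_bwt(text: str, unify=str.lower) -> str:
    """Modified BWT: sort rotations of the unified text, but emit the
    preceding character of the *original* text."""
    t = text + "\0"                                # sentinel sorts first
    u = unify(text) + "\0"
    n = len(u)
    sa = sorted(range(n), key=lambda i: u[i:])     # toy suffix sort of unified text
    return "".join(t[(i - 1) % n] for i in sa)

def inverse_unified_bwt(bwt: str, unify=str.lower) -> str:
    """Unifying the output column yields the ordinary BWT of the unified
    text; following its LF-mapping while emitting the original characters
    recovers the original text (unify must act character-wise)."""
    uni = unify(bwt)
    n = len(bwt)
    lf = [0] * n
    for f_row, l_row in enumerate(sorted(range(n), key=lambda i: (uni[i], i))):
        lf[l_row] = f_row           # stable sort of the last column gives LF
    out, r = [], 0                  # row 0 of the sorted rotations is the sentinel row
    for _ in range(n):
        out.append(bwt[r])
        r = lf[r]
    return "".join(reversed(out)).lstrip("\0")

# inverse_unified_bwt(unified_bwt("MISSISSIPPI river")) == "MISSISSIPPI river"
```

The row indices visited during inversion also yield the suffix array of the unified text, which is what makes ambiguous searches possible after decoding.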
Block sorting compression [1] has become common because of its good balance of compression ratio and speed. It has another nice feature: the relation between the encoding/decoding process and the suffix array. The suffix array [2] is a memory-efficient data structure for searching for any substring of a text. It is an array of lexicographically sorted pointers to the suffixes of the text.