Reducing the Space Requirement of LZ-Index

On Entropy-Compressed Text Indexing in External Memory

A new trend in the field of pattern matching is to design indexing data structures that take space very close to that required by the indexed text (in entropy-compressed form) while simultaneously achieving good query performance. Two popular indexes, namely the FM-index [Ferragina and Manzini, 2005] and the CSA [Grossi and Vitter, 2005], achieve this goal by exploiting the Burrows-Wheeler transform (BWT) [Burrows and Wheeler, 1994]. However, due to the intricate permutation structure of the BWT, no locality of reference can be guaranteed when we perform pattern matching with these indexes. Chien et al. [2008] gave an alternative text index based on sparsifying the traditional suffix tree and maintaining an auxiliary 2-D range-query structure. Given a text T of length n drawn from an alphabet of size σ, they achieved an O(n log σ)-bit index for T and showed that this index preserves locality in pattern matching and is hence amenable to use in external-memory settings. We improve upon this index and show how to apply entropy compression to reduce the index space. Our index takes O(n(H_k + 1)) + o(n log σ) bits of space, where H_k is the k-th order empirical entropy of the text. This is achieved by creating variable-length blocks of text using arithmetic coding.
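To make the space bound concrete, here is a minimal Python sketch (not taken from the paper; function names are illustrative) of the k-th order empirical entropy H_k that appears in the O(n(H_k + 1)) bound. H_k averages the zeroth-order entropy of the characters that follow each length-k context, weighted by how often the context occurs:

```python
from collections import Counter, defaultdict
from math import log2

def h0(s: str) -> float:
    """Zeroth-order empirical entropy of s, in bits per symbol."""
    if not s:
        return 0.0
    n = len(s)
    return sum((c / n) * log2(n / c) for c in Counter(s).values())

def hk(text: str, k: int) -> float:
    """k-th order empirical entropy H_k(text): weighted average of H_0 over
    the characters following each length-k context of the text."""
    n = len(text)
    if k == 0:
        return h0(text)
    followers = defaultdict(list)
    for i in range(n - k):
        followers[text[i:i + k]].append(text[i + k])
    return sum(len(f) * h0("".join(f)) for f in followers.values()) / n

# Higher-order entropy drops quickly on repetitive text.
t = "abracadabra" * 50
print(h0(t), hk(t, 1), hk(t, 2))
```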

Universal Compressed Text Indexing

2018

The rise of repetitive datasets has lately generated a lot of interest in compressed self-indexes based on dictionary compression, a rich and heterogeneous family of techniques that exploits text repetitions in different ways. For each such compression scheme, several different indexing solutions have been proposed in the last two decades. To date, the fastest indexes for repetitive texts are based on the run-length compressed Burrows–Wheeler transform (BWT) and on the Compact Directed Acyclic Word Graph (CDAWG). The most space-efficient indexes, on the other hand, are based on the Lempel–Ziv parsing and on grammar compression. Indexes for more universal schemes such as collage systems and macro schemes have not yet been proposed. Very recently, Kempa and Prezza [STOC 2018] showed that all dictionary compressors can be interpreted as approximation algorithms for the smallest string attractor, that is, a set of text positions capturing all distinct substrings. Starting from this obse...
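As a hedged illustration of the string-attractor notion cited above, the following brute-force Python checker (illustrative only, not Kempa and Prezza's algorithm) tests whether a given set of positions is a string attractor of a text, i.e. whether every distinct substring has at least one occurrence spanning an attractor position:

```python
def is_string_attractor(text: str, gamma: set) -> bool:
    """Brute-force check (roughly cubic time) that every distinct substring of
    `text` has an occurrence text[i..j] containing some 0-based position in `gamma`."""
    n = len(text)
    seen = set()      # all distinct substrings
    covered = set()   # distinct substrings with an occurrence crossing gamma
    for i in range(n):
        for j in range(i, n):
            sub = text[i:j + 1]
            seen.add(sub)
            if any(i <= p <= j for p in gamma):
                covered.add(sub)
    return seen == covered

t = "abracadabra"
print(is_string_attractor(t, set(range(len(t)))))  # True: every position, trivially
print(is_string_attractor(t, {0}))                 # False: "b" never spans position 0
```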

Fast Compressed Self-Indexes with Deterministic Linear-Time Construction

arXiv, 2017

We introduce a compressed suffix array representation that, on a text T of length n over an alphabet of size σ, can be built in O(n) deterministic time, within O(n log σ) bits of working space, and counts the number of occurrences of any pattern P in T in time O(|P| + log log_w σ) on a RAM machine with w = Ω(log n)-bit words. This new index outperforms all the other compressed indexes that can be built in linear deterministic time, and some others. The only faster indexes can be built in linear time only in expectation, or require Θ(n log n) bits. We also show that, by using O(n log σ) bits, we can build in linear time an index that counts in time O(|P|/log_σ n + log n (log log n)^2), which is RAM-optimal for w = Θ(log n) and sufficiently long patterns.
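For readers unfamiliar with counting queries, the sketch below shows textbook backward search over an explicit (uncompressed) Burrows-Wheeler transform in Python; it illustrates the counting functionality only and is not the paper's linear-time construction or its succinct data structures:

```python
def suffix_array(t: str) -> list:
    """Naive suffix array; fine for illustration."""
    return sorted(range(len(t)), key=lambda i: t[i:])

def bwt_from_sa(t: str, sa: list) -> str:
    """BWT[j] is the character preceding suffix sa[j] (cyclically)."""
    return "".join(t[i - 1] if i else t[-1] for i in sa)

def count_occurrences(text: str, pattern: str) -> int:
    """Count occurrences of `pattern` in `text` by backward search on the BWT."""
    t = text + "\0"                     # unique terminator, smaller than any symbol
    sa = suffix_array(t)
    bwt = bwt_from_sa(t, sa)
    C, total = {}, 0                    # C[c] = #symbols in t strictly smaller than c
    for c in sorted(set(t)):
        C[c] = total
        total += t.count(c)
    def occ(c, i):                      # occurrences of c in bwt[:i], computed naively
        return bwt[:i].count(c)
    lo, hi = 0, len(t)                  # current suffix-array interval [lo, hi)
    for c in reversed(pattern):
        if c not in C:
            return 0
        lo, hi = C[c] + occ(c, lo), C[c] + occ(c, hi)
        if lo >= hi:
            return 0
    return hi - lo

print(count_occurrences("mississippi", "ssi"))  # -> 2
```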

I/O-efficient Compressed Text Indexes: From Theory to Practice

Pattern matching on text data has been a fundamental field of Computer Science for nearly 40 years. Databases supporting full-text indexing functionality on text data are now widely used by biologists. In the theoretical literature, the most popular internal-memory index structures are the suffix tree and the suffix array, and the most popular external-memory index structure is the string B-tree. However, the practical applicability of these indexes has been limited mainly because of their space consumption and I/O issues. These structures use far more space (roughly 20 to 50 times more) than the original text data and are therefore often disk-resident. Ferragina and Manzini (2005) and Grossi and Vitter (2005) gave the first compressed text indexes with efficient query times in the internal-memory model. Recently, Chien et al. (2008) presented a compact text index for external memory based on the concept of the Geometric Burrows-Wheeler Transform. They also presented lower bounds suggesting that it may be hard to obtain a good index structure in external memory. In this paper, we investigate this issue from a practical point of view. On the positive side, we show an external-memory text-indexing structure (based on R-trees and KD-trees) that saves about an order of magnitude of space compared to the standard string B-tree. While saving space, these structures also maintain I/O efficiency comparable to that of the string B-tree. We also show various space versus I/O-efficiency trade-offs for our structures.
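To illustrate the Geometric Burrows-Wheeler Transform reduction mentioned above, here is a toy, in-memory Python sketch (an assumption-laden illustration, not the paper's external-memory R-tree/KD-tree structures). The text is cut into blocks of d characters, each block boundary becomes a 2-D point, and counting a pattern's occurrences becomes a union of orthogonal range-counting queries, one per possible offset of the first boundary inside an occurrence:

```python
from bisect import bisect_left

BIG = chr(0x10FFFF)  # sentinel larger than any character assumed to occur in the text

def gbwt_points(text: str, d: int):
    """One 2-D point per block boundary b = 0, d, 2d, ...:
       x = rank of the suffix text[b:] among all boundary suffixes,
       y = rank of the reversed prefix text[:b][::-1] among all reversed prefixes."""
    bounds = list(range(0, len(text), d))
    by_suffix = sorted(bounds, key=lambda b: text[b:])
    by_prefix = sorted(bounds, key=lambda b: text[:b][::-1])
    xrank = {b: i for i, b in enumerate(by_suffix)}
    yrank = {b: i for i, b in enumerate(by_prefix)}
    points = [(xrank[b], yrank[b]) for b in bounds]
    return points, [text[b:] for b in by_suffix], [text[:b][::-1] for b in by_prefix]

def prefix_rank_range(sorted_strs, q):
    """Half-open rank interval of the strings having q as a prefix."""
    return bisect_left(sorted_strs, q), bisect_left(sorted_strs, q + BIG)

def gbwt_count(text: str, pattern: str, d: int) -> int:
    """Count occurrences of `pattern` (assumed at least d characters long, so every
    occurrence contains a boundary). Each occurrence is charged to the first
    boundary it contains, i.e. to a unique offset j in [0, d)."""
    points, sorted_suf, sorted_pre = gbwt_points(text, d)
    total = 0
    for j in range(min(d, len(pattern))):
        # occurrences whose first boundary falls j characters into the pattern
        xlo, xhi = prefix_rank_range(sorted_suf, pattern[j:])
        ylo, yhi = prefix_rank_range(sorted_pre, pattern[:j][::-1])
        total += sum(xlo <= x < xhi and ylo <= y < yhi for x, y in points)
    return total

print(gbwt_count("mississippi", "issi", d=2))  # -> 2
```

In an actual external-memory index the brute-force scan over the points would be replaced by an I/O-efficient 2-D range-counting structure (R-trees and KD-trees in the paper); the sketch only shows the geometric reduction itself.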