Szymon Grabowski - Academia.edu
Papers by Szymon Grabowski
ArXiv, 2015
The FM-index is a well-known compressed full-text index, based on the Burrows-Wheeler transform (BWT). During a pattern search, the BWT sequence is accessed at "random" locations, which is cache-unfriendly. In this paper, we are interested in speeding up the FM-index by working on $q$-grams rather than individual characters, at the cost of using more space. The first presented variant is related to an inverted index on $q$-grams, yet the occurrence lists in our solution are in the sorted suffix order rather than text order in a traditional inverted index. This variant obtains $O(m/|CL| + \log n \log m)$ cache misses in the worst case, where $n$ and $m$ are the text and pattern lengths, respectively, and $|CL|$ is the CPU cache line size, in symbols (typically 64 in modern hardware). This index is often several times faster than the fastest known FM-indexes (especially for long patterns), yet the space requirements are enormous, $O(n\log^2 n)$ bits in theory and about $80n$...
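To make the cache-unfriendly access pattern mentioned in the abstract above concrete, here is a minimal sketch of the textbook character-by-character FM-index backward search, not the $q$-gram variants proposed in the paper. The names `bwt_index` and `fm_count` are hypothetical, the suffix array is built naively, and rank queries are answered by a linear scan purely for readability.

```python
def bwt_index(text):
    """Build the BWT and the C array for text (naive construction, for illustration)."""
    text += "\0"  # sentinel, assumed absent from text and smaller than every symbol
    sa = sorted(range(len(text)), key=lambda i: text[i:])
    # BWT symbol = character preceding each sorted suffix (wraps to the sentinel)
    bwt = "".join(text[i - 1] for i in sa)
    # C[c] = number of symbols in text that are smaller than c
    C, total = {}, 0
    for c in sorted(set(text)):
        C[c] = total
        total += text.count(c)
    return bwt, C


def fm_count(bwt, C, pattern):
    """Count occurrences of pattern by backward search over the BWT."""
    lo, hi = 0, len(bwt)
    for c in reversed(pattern):
        if c not in C:
            return 0
        # Each step issues two rank queries at positions lo and hi, which
        # jump around the BWT: this is the "random" access in question.
        lo = C[c] + bwt[:lo].count(c)
        hi = C[c] + bwt[:hi].count(c)
        if lo >= hi:
            return 0
    return hi - lo
```

A real FM-index replaces the `bwt[:k].count(c)` scans with a succinct rank structure; the one-step-per-pattern-symbol loop is the part that the $q$-gram approach shortens.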
Computing and Informatics, 2019
Computing and Informatics, 2019
Computing and Informatics, 2017
Fundamenta Informaticae, 2018
Discrete Applied Mathematics, 2016
Software: Practice and Experience, 2016
String Processing and Information Retrieval, 2015
We propose two suffix array inspired full-text indexes. One, called SA-hash, augments the suffix array with a hash table to speed up pattern searches by significantly narrowing the search interval before the binary search phase. The other, called FBCSA, is a compact data structure, similar to Mäkinen's compact suffix array, but working on fixed-size blocks, which allows the data to be arranged in multiples of 32 bits, beneficial for CPU access. Experimental results on the Pizza & Chili 200 MB datasets show that SA-hash is about 2.5-3 times faster in pattern searches (counts) than the standard suffix array, for the price of requiring $0.3n$ to $2.0n$ extra space, where $n$ is the text length, and setting a minimum pattern length. The latter limitation can be removed for the price of even more extra space. FBCSA is relatively fast in single cell accesses (a few times faster than related indexes at about the same or better compression), but not competitive if many consecutive ce...
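The SA-hash idea described above (a hash lookup on a pattern prefix narrows the suffix array interval, then binary search finishes the job) can be sketched as follows. This is an illustration under stated assumptions, not the paper's implementation: a Python `dict` stands in for the hash table, the prefix length `k` plays the role of the minimum pattern length, the suffix array is built naively, and the names `build_sa_hash` and `sa_hash_count` are hypothetical.

```python
def build_sa_hash(text, k):
    """Return a suffix array plus a table mapping each k-symbol prefix
    to its half-open rank interval [lo, hi) in the suffix array."""
    sa = sorted(range(len(text)), key=lambda i: text[i:])
    table = {}
    for rank, pos in enumerate(sa):
        prefix = text[pos:pos + k]
        if len(prefix) < k:
            continue  # suffix too short to have a k-symbol prefix
        lo, _ = table.get(prefix, (rank, rank))
        table[prefix] = (lo, rank + 1)  # suffixes sharing a prefix are contiguous
    return sa, table


def sa_hash_count(text, sa, table, k, pattern):
    """Count occurrences of pattern (length >= k) in text."""
    if len(pattern) < k:
        raise ValueError("pattern shorter than the minimum length k")
    m = len(pattern)
    lo, hi = table.get(pattern[:k], (0, 0))  # narrowed interval, or empty

    def bound(lo, hi, strict):
        # First rank in [lo, hi) whose m-symbol prefix is >= pattern
        # (or > pattern when strict is True).
        while lo < hi:
            mid = (lo + hi) // 2
            window = text[sa[mid]:sa[mid] + m]
            if window < pattern or (strict and window == pattern):
                lo = mid + 1
            else:
                hi = mid
        return lo

    first = bound(lo, hi, False)
    last = bound(first, hi, True)
    return last - first
```

The binary search only ever touches the ranks inside the interval returned by the table, which is where the speed-up over a plain suffix array search comes from; the table itself is the extra space the abstract refers to.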
We consider the classical exact multiple string matching problem. Our solution is based on $q$-grams combined with pattern superimposition, bit-parallelism and alphabet size reduction. We discuss the pros and cons of the various alternatives for achieving the best combination. Our method is closely related to previous work by Salmela et al. (2006). The experimental results show that our method performs well on different alphabet sizes and that it scales to large pattern sets.
PloS one, 2015
We propose a lightweight data structure for indexing and querying collections of NGS read data in main memory. The data structure supports the interface proposed in the pioneering work by Philippe et al. for counting and locating k-mers in sequencing reads. Our solution, PgSA (pseudogenome suffix array), based on finding overlapping reads, is competitive with the existing algorithms in space use, query times, or both. The main applications of our index include variant calling, error correction and analysis of reads from RNA-seq experiments.
Information Processing Letters, 2015
PLoS ONE, 2014
Computing Research Repository - CORR, 2011
Technological progress in DNA sequencing boosts genomic database growth at a faster and faster rate. Compression, accompanied by random access capabilities, is the key to maintaining those huge amounts of data. In this paper we present an LZ77-style compression scheme for relative compression of multiple genomes of the same species. While the solution bears similarity to known algorithms, it offers significantly higher compression ratios at a compression speed over an order of magnitude greater. One of the new successful ideas is augmenting the reference sequence with phrases from the other sequences, making more LZ-matches available.
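As a rough illustration of what relative LZ77-style compression means, the sketch below greedily factors a target sequence into (position, length) phrases that point into a reference sequence, falling back to literals when no match exists. It is a naive quadratic-time toy, not the scheme from the paper: it omits the reference augmentation mentioned above as well as any entropy coding, and the names `relative_lz` and `decode` are hypothetical.

```python
def relative_lz(reference, target):
    """Greedily factor target into (pos, length) matches against reference.

    Symbols with no match in the reference are emitted as 1-character literals.
    """
    phrases = []
    i = 0
    while i < len(target):
        best_pos, best_len = -1, 0
        # Brute-force longest match; real tools use an index over the reference.
        for start in range(len(reference)):
            length = 0
            while (start + length < len(reference)
                   and i + length < len(target)
                   and reference[start + length] == target[i + length]):
                length += 1
            if length > best_len:
                best_pos, best_len = start, length
        if best_len == 0:
            phrases.append(target[i])  # literal symbol
            i += 1
        else:
            phrases.append((best_pos, best_len))
            i += best_len
    return phrases


def decode(reference, phrases):
    """Rebuild the target from the reference and the phrase list."""
    out = []
    for p in phrases:
        if isinstance(p, tuple):
            pos, length = p
            out.append(reference[pos:pos + length])
        else:
            out.append(p)
    return "".join(out)
```

Because every phrase is resolved against the reference alone, any part of the target can be decoded without decompressing what precedes it, which is what makes random access practical in this setting.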
International Journal of Foundations of Computer Science, 2008
We propose new algorithms for $(\delta,\gamma,\alpha)$-matching. In this string matching problem we are given a pattern $P = p_0 p_1 \ldots p_{m-1}$ and a text $T = t_0 t_1 \ldots t_{n-1}$ over some integer alphabet $\Sigma = \{0 \ldots \sigma - 1\}$. The pattern symbol $p_i$ $\delta$-matches the text symbol $t_j$ iff $|p_i - t_j| \le \delta$. The pattern $P$ $(\delta,\gamma)$-matches some text substring $t_j \ldots t_{j+m-1}$ iff for all $i$ it holds that $|p_i - t_{j+i}| \le \delta$ and $\sum_i |p_i - t_{j+i}| \le \gamma$. Finally, in $(\delta,\gamma,\alpha)$-matching we also permit at most $\alpha$-symbol gaps between each matching text symbol. The only known previous algorithm runs in $O(nm)$ time. We give several algorithms that improve the average case up to $O(n)$ for small $\alpha$, and the worst case to [Formula: see text] or $O(nm \log(\gamma)/w)$, where [Formula: see text] and $w$ is the number of bits in a machine word. The proposed algorithms can be easily modified to solve several other related problems; we explicitly consider e.g. character classes (instead of $\delta$-matching), ($\Delta$-limited) $k$-mismatches (instead of $\gamma$-matching) and more general gaps, including ...
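The $(\delta,\gamma)$-matching definition above translates directly into a brute-force check. The sketch below is only that direct translation, running in O(nm) time and without the $\alpha$ gaps; it is not one of the algorithms proposed in the paper, and the function name `delta_gamma_matches` is hypothetical.

```python
def delta_gamma_matches(pattern, text, delta, gamma):
    """Return every start position j where pattern (delta, gamma)-matches
    text[j:j+m]: each symbol differs by at most delta and the differences
    sum to at most gamma. Symbols are integers; gaps are not allowed."""
    m, n = len(pattern), len(text)
    hits = []
    for j in range(n - m + 1):
        total = 0
        ok = True
        for i in range(m):
            diff = abs(pattern[i] - text[j + i])
            total += diff
            if diff > delta or total > gamma:
                ok = False  # this alignment violates delta or gamma
                break
        if ok:
            hits.append(j)
    return hits
```

For example, with pattern `[3, 5, 7]`, text `[3, 6, 7, 4, 5, 9, 1]`, `delta=2` and `gamma=3`, the alignments at positions 0 (differences 0, 1, 0) and 3 (differences 1, 0, 2) qualify; lowering `gamma` to 2 rules out position 3.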
Information Processing Letters, 2013
Information Processing Letters, 2013
Information Processing Letters, 2006