Szymon Grabowski - Academia.edu

Papers by Szymon Grabowski

New algorithms for binary jumbled

A bloated FM-index reducing the number of cache misses during the search

ArXiv, 2015

The FM-index is a well-known compressed full-text index, based on the Burrows-Wheeler transform (BWT). During a pattern search, the BWT sequence is accessed at "random" locations, which is cache-unfriendly. In this paper, we are interested in speeding up the FM-index by working on $q$-grams rather than individual characters, at the cost of using more space. The first presented variant is related to an inverted index on $q$-grams, yet the occurrence lists in our solution are in sorted suffix order rather than in text order, as in a traditional inverted index. This variant obtains $O(m/|CL| + \log n \log m)$ cache misses in the worst case, where $n$ and $m$ are the text and pattern lengths, respectively, and $|CL|$ is the CPU cache line size, in symbols (typically 64 in modern hardware). This index is often several times faster than the fastest known FM-indexes (especially for long patterns), yet the space requirements are enormous, $O(n\log^2 n)$ bits in theory and about $80n$...
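To make the cache-unfriendly access pattern concrete, here is a minimal sketch of the plain, character-at-a-time FM-index backward search that the abstract takes as its starting point. It is not the $q$-gram variant proposed in the paper. The function names, the naive construction by sorting suffixes, the `"\0"` sentinel, and the linear-time `occ` are all assumptions made for illustration; a real FM-index stores sampled rank structures instead.

```python
def build_fm_index(text):
    """Build a toy FM-index: the BWT string and the C array (hypothetical helper)."""
    text += "\0"  # assumption: sentinel smaller than every symbol, absent from text
    sa = sorted(range(len(text)), key=lambda i: text[i:])  # naive suffix sorting
    bwt = "".join(text[i - 1] for i in sa)  # symbol preceding each sorted suffix
    # C[c] = number of symbols in the text that are strictly smaller than c
    c_array, total = {}, 0
    for c in sorted(set(text)):
        c_array[c] = total
        total += text.count(c)
    return bwt, c_array


def occ(bwt, c, i):
    """Number of occurrences of c in bwt[0:i] (naive scan, for illustration only)."""
    return bwt.count(c, 0, i)


def fm_count(bwt, c_array, pattern):
    """Count occurrences of pattern by backward search over the BWT."""
    lo, hi = 0, len(bwt)
    for c in reversed(pattern):
        if c not in c_array:
            return 0
        # Each step queries the BWT at two positions, lo and hi, that move
        # unpredictably from one step to the next.
        lo = c_array[c] + occ(bwt, c, lo)
        hi = c_array[c] + occ(bwt, c, hi)
        if lo >= hi:
            return 0
    return hi - lo


bwt, c_array = build_fm_index("abracadabra")
print(fm_count(bwt, c_array, "abra"))
```

The search performs one pair of rank queries per pattern symbol, so a pattern of length $m$ touches on the order of $m$ unrelated BWT positions; processing several symbols per step, as the abstract proposes, is one way to reduce that number.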

Revisiting Multiple Pattern Matching

Computing and Informatics, 2019

Lightweight Fingerprints for Fast Approximate Keyword Matching Using Bitwise Operations

Computing and Informatics, 2019

A Practical Index for Approximate Dictionary Matching with Few Mismatches

Computing and Informatics, 2017

On Abelian Longest Common Factor with and without RLE

Fundamenta Informaticae, 2018

New tabulation and sparse dynamic programming based techniques for sequence similarity problems

Discrete Applied Mathematics, 2016

A Bloom filter based semi-index on q-grams

Software: Practice and Experience, 2016

Sampling the Suffix Array with Minimizers

String Processing and Information Retrieval, 2015

Two simple full-text indexes based on the suffix array

We propose two suffix array inspired full-text indexes. One, called SA-hash, augments the suffix array with a hash table to speed up pattern searches by significantly narrowing the search interval before the binary search phase. The other, called FBCSA, is a compact data structure, similar to Mäkinen's compact suffix array, but working on fixed-size blocks, which allows the data to be arranged in multiples of 32 bits, beneficial for CPU access. Experimental results on the Pizza & Chili 200 MB datasets show that SA-hash is about 2.5 to 3 times faster in pattern searches (counts) than the standard suffix array, at the price of requiring $0.3n$ to $2.0n$ extra space, where $n$ is the text length, and setting a minimum pattern length. The latter limitation can be removed at the price of even more extra space. FBCSA is relatively fast in single cell accesses (a few times faster than related indexes at about the same or better compression), but not competitive if many consecutive ce...
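The SA-hash idea described above can be sketched roughly as follows. This is an illustrative reading of the abstract, not the authors' implementation: the function names, the use of a Python `dict` as the hash table, the prefix length `k`, and the naive suffix sorting are all assumptions. The table maps each length-`k` prefix to the suffix array interval of suffixes starting with it, so the binary search only runs inside that interval, and patterns shorter than `k` are rejected (the minimum pattern length mentioned in the abstract).

```python
def build_sa_hash(text, k):
    """Build a suffix array plus a table from k-symbol prefixes to SA intervals."""
    sa = sorted(range(len(text)), key=lambda i: text[i:])  # naive construction
    table = {}
    for rank, pos in enumerate(sa):
        prefix = text[pos:pos + k]
        if len(prefix) < k:
            continue  # suffix too short to have a k-symbol prefix
        lo, _ = table.get(prefix, (rank, rank))
        table[prefix] = (lo, rank + 1)  # half-open interval [lo, hi)
    return sa, table


def count_occurrences(text, sa, table, k, pattern):
    """Count pattern occurrences, binary searching only the hashed interval."""
    m = len(pattern)
    if m < k:
        raise ValueError("pattern is shorter than the hashed prefix length k")
    if pattern[:k] not in table:
        return 0
    lo, hi = table[pattern[:k]]
    # Lower bound: first suffix whose m-symbol prefix is >= pattern.
    left, right = lo, hi
    while left < right:
        mid = (left + right) // 2
        if text[sa[mid]:sa[mid] + m] < pattern:
            left = mid + 1
        else:
            right = mid
    first = left
    # Upper bound: first suffix whose m-symbol prefix is > pattern.
    right = hi
    while left < right:
        mid = (left + right) // 2
        if text[sa[mid]:sa[mid] + m] <= pattern:
            left = mid + 1
        else:
            right = mid
    return left - first


text = "abracadabra"
sa, table = build_sa_hash(text, 2)
print(count_occurrences(text, sa, table, 2, "abra"))
```

With `k = 2` and the text `abracadabra`, the lookup for `ab` narrows the search from all 11 suffixes to an interval of 2 before any string comparison takes place.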

Multiple pattern matching revisited

We consider the classical exact multiple string matching problem. Our solution is based on $q$-grams combined with pattern superimposition, bit-parallelism and alphabet size reduction. We discuss the pros and cons of the various alternatives for achieving the best combination. Our method is closely related to previous work by Salmela et al. (2006). The experimental results show that our method performs well on different alphabet sizes and that it scales to large pattern sets.

Indexing Arbitrary-Length k-Mers in Sequencing Reads

PLoS ONE, 2015

We propose a lightweight data structure for indexing and querying collections of NGS read data in main memory. The data structure supports the interface proposed in the pioneering work by Philippe et al. for counting and locating k-mers in sequencing reads. Our solution, PgSA (pseudogenome suffix array), based on finding overlapping reads, is competitive with the existing algorithms in space use, query times, or both. The main applications of our index include variant calling, error correction and analysis of reads from RNA-seq experiments.

A note on the longest common substring with k-mismatches problem

Information Processing Letters, 2015

Indexes of Large Genome Collections on a PC

PLoS ONE, 2014

Motif matching using gapped patterns

Engineering Relative Compression of Genomes

Computing Research Repository - CORR, 2011

Technological progress in DNA sequencing makes genomic databases grow at an ever faster rate. Compression, accompanied by random access capabilities, is the key to maintaining those huge amounts of data. In this paper we present an LZ77-style compression scheme for relative compression of multiple genomes of the same species. While the solution bears similarity to known algorithms, it offers significantly higher compression ratios at a compression speed over an order of magnitude greater. One of the new successful ideas is augmenting the reference sequence with phrases from the other sequences, making more LZ-matches available.

EFFICIENT ALGORITHMS FOR (δ,γ,α) AND (δ,k_Δ,α)-MATCHING

International Journal of Foundations of Computer Science, 2008

We propose new algorithms for (δ,γ,α)-matching. In this string matching problem we are given a pattern $P = p_0 p_1 \ldots p_{m-1}$ and a text $T = t_0 t_1 \ldots t_{n-1}$ over some integer alphabet $\Sigma = \{0, \ldots, \sigma - 1\}$. The pattern symbol $p_i$ δ-matches the text symbol $t_j$ iff $|p_i - t_j| \le \delta$. The pattern $P$ (δ,γ)-matches some text substring $t_j \ldots t_{j+m-1}$ iff for all $i$ it holds that $|p_i - t_{j+i}| \le \delta$ and $\sum_i |p_i - t_{j+i}| \le \gamma$. Finally, in (δ,γ,α)-matching we also permit gaps of at most α symbols between consecutive matching text symbols. The only known previous algorithm runs in $O(nm)$ time. We give several algorithms that improve the average case up to $O(n)$ for small α, and the worst case to [Formula: see text] or $O(nm \log(\gamma)/w)$, where [Formula: see text] and $w$ is the number of bits in a machine word. The proposed algorithms can be easily modified to solve several other related problems; we explicitly consider, e.g., character classes (instead of δ-matching), (Δ-limited) k-mismatches (instead of γ-matching) and more general gaps, including ...
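The (δ,γ)-matching definition above can be checked directly with a naive $O(nm)$ scan. The sketch below is only a brute-force baseline for the gap-free case (α = 0), written to make the two conditions concrete; it is not one of the algorithms proposed in the paper, and the function name and the list-of-integers representation are assumptions.

```python
def delta_gamma_matches(pattern, text, delta, gamma):
    """Return all start positions j where pattern (delta, gamma)-matches text[j:j+m].

    Both conditions from the definition are checked: every symbol pair differs
    by at most delta, and the sum of the differences is at most gamma.
    """
    m, n = len(pattern), len(text)
    hits = []
    for j in range(n - m + 1):
        total = 0
        ok = True
        for i in range(m):
            diff = abs(pattern[i] - text[j + i])
            total += diff
            if diff > delta or total > gamma:
                ok = False  # one of the two conditions is violated
                break
        if ok:
            hits.append(j)
    return hits


pattern = [3, 5, 4]
text = [3, 6, 4, 2, 5, 5, 9]
print(delta_gamma_matches(pattern, text, delta=1, gamma=2))  # [0, 3]
```

At position 0 the differences are 0, 1, 0 (sum 1) and at position 3 they are 1, 0, 1 (sum 2), so both windows match with δ = 1 and γ = 2; lowering γ to 1 leaves only position 0.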

New algorithms for binary jumbled pattern matching

Information Processing Letters, 2013

Approximate pattern matching with k-mismatches in packed text

Information Processing Letters, 2013

A general compression algorithm that supports fast searching

Information Processing Letters, 2006
