Pattern Matching for DNA Sequencing Data Using Multiple Bloom Filters

Classification of DNA sequences using Bloom filters

Bioinformatics/Computer Applications in the Biosciences, 2010

Motivation: New generation sequencing technologies producing increasingly complex datasets demand new, efficient and specialized sequence analysis algorithms. Often, only the 'novel' sequences in a complex dataset are of interest and the superfluous sequences need to be removed. Results: A novel algorithm, fast and accurate classification of sequences (FACS), is introduced that can accurately and rapidly classify sequences as belonging or not belonging to a reference sequence. FACS was first optimized and validated using a synthetic metagenome dataset. An experimental metagenome dataset was then used to show that FACS achieves accuracy comparable to BLAT and SSAHA2 but is at least 21 times faster in classifying sequences. Availability: Source code for FACS, Bloom filters and the MetaSim dataset used is available at http://facs.biotech.kth.se. The Bloom::Faster 1.6 Perl module can be downloaded from CPAN at http://search.cpan.org/~palvaro/Bloom-Faster-1.6/ Contacts:
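The classification idea can be illustrated with a minimal Python sketch. It assumes that a read is assigned to the reference when a large enough fraction of its k-mers is found in a Bloom filter built from the reference. The abstract does not give FACS's actual scoring rule, so the `BloomFilter` class, the function names, the filter size and the `min_fraction` threshold are all hypothetical, and the code is not the Bloom::Faster API.

```python
import hashlib


class BloomFilter:
    """Minimal Bloom filter over strings (illustrative only)."""

    def __init__(self, num_bits, num_hashes):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, item):
        # Double hashing: derive all bit positions from one SHA-256 digest.
        digest = hashlib.sha256(item.encode()).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1
        return [(h1 + i * h2) % self.num_bits for i in range(self.num_hashes)]

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(item))


def kmers(seq, k):
    """Yield every overlapping k-mer of seq."""
    for i in range(len(seq) - k + 1):
        yield seq[i:i + k]


def build_reference_filter(reference, k, num_bits=1 << 16, num_hashes=4):
    """Insert all k-mers of the reference into a fresh Bloom filter."""
    bf = BloomFilter(num_bits, num_hashes)
    for kmer in kmers(reference, k):
        bf.add(kmer)
    return bf


def classify(read, bf, k, min_fraction=0.8):
    """Return True if at least min_fraction of the read's k-mers hit the filter.

    The threshold is an assumption made for this sketch.
    """
    read_kmers = list(kmers(read, k))
    if not read_kmers:  # read shorter than k: nothing to test
        return False
    hits = sum(1 for kmer in read_kmers if kmer in bf)
    return hits / len(read_kmers) >= min_fraction
```

Because a Bloom filter never yields false negatives, a read copied exactly from the reference always passes; false positives can only push unrelated reads toward the reference class, which is why a fraction threshold rather than a single hit is used here.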

Improving Bloom Filter Performance on Sequence Data Using k-mer Bloom Filters

Using a sequence's k-mer content rather than the full sequence directly has enabled significant performance improvements in several sequencing applications, such as metagenomic species identification, estimation of transcript abundances, and alignment-free comparison of sequencing data. As k-mer sets often reach hundreds of millions of elements, traditional data structures are often impractical for k-mer set storage, and Bloom filters (BFs) and their variants are used instead. BFs reduce the memory footprint required to store millions of k-mers while allowing for fast set containment queries, at the cost of a low false positive rate (FPR). We show that, because k-mers are derived from sequencing reads, the information about k-mer overlap in the original sequence can be used to reduce the FPR up to 30× with little or no additional memory and with set containment queries that are only 1.3-1.6 times slower. Alternatively, we can leverage k-mer overlap information to store k-mer sets in about half the space while maintaining the original FPR. We consider several variants of such k-mer Bloom filters (kBFs), derive theoretical upper bounds for their FPR, and discuss their range of applications and limitations.
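One way to picture the overlap idea: a k-mer that really comes from a read is stored alongside a neighbor that overlaps it by k-1 bases, whereas a random false positive usually is not. The Python sketch below assumes a query is accepted only when the k-mer itself and at least one of its eight possible left or right neighbors hit the filter. This is an illustrative reading of the abstract, not the paper's exact kBF variants; the `BloomFilter` class, the function names and the filter parameters are hypothetical, and every inserted sequence is assumed to be longer than k so that each stored k-mer has a stored neighbor.

```python
import hashlib

ALPHABET = "ACGT"


class BloomFilter:
    """Minimal Bloom filter over strings (illustrative only)."""

    def __init__(self, num_bits=1 << 16, num_hashes=4):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, item):
        # Double hashing: derive all bit positions from one SHA-256 digest.
        digest = hashlib.sha256(item.encode()).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1
        return [(h1 + i * h2) % self.num_bits for i in range(self.num_hashes)]

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(item))


def insert_sequence(bf, seq, k):
    """Store every overlapping k-mer of seq (assumes len(seq) > k)."""
    for i in range(len(seq) - k + 1):
        bf.add(seq[i:i + k])


def neighbors(kmer):
    """Return the k-mers that could directly precede or follow kmer."""
    left = [c + kmer[:-1] for c in ALPHABET]
    right = [kmer[1:] + c for c in ALPHABET]
    return left, right


def contains_with_overlap(bf, kmer):
    """Accept kmer only if it and at least one adjacent k-mer hit the filter."""
    if kmer not in bf:
        return False
    left, right = neighbors(kmer)
    return any(n in bf for n in left) or any(n in bf for n in right)
```

The extra neighbor lookups are the price of the lower false positive rate, which fits the abstract's remark that queries become somewhat slower.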

Sub-linear Sequence Search via a Repeated And Merged Bloom Filter (RAMBO)

arXiv: Genomics, 2019

Whole-genome shotgun sequencing (WGS), especially that of microbial genomes, has been the core of recent research advances in large-scale comparative genomics. The data deluge has resulted in exponential growth in genomic datasets over the past years and has shown no sign of slowing down. Several recent attempts have been made to tame the computational burden of read classification and sequence search on these ultra large-scale datasets, including both raw reads and assembled genomes. A notable recent method is BigSI. BigSI is based on Bloom filters and offers very efficient query sequence search times. However, querying with BigSI still requires probing Bloom filters (or sets of bitslices), which scales linearly with the number of datasets. As a result, scaling up BigSI for datasets with potentially millions of samples (or more) is likely prohibitive. In this paper, we propose RAMBO (Repeated and Merged Bloom Filter) where the number of Bloom filter probes is significantly less t...

kmtricks: Efficient and flexible construction of Bloom filters for large sequencing data collections

2021

When indexing large collections of short-read sequencing data, a common operation that has now been implemented in several tools (Sequence Bloom Trees and variants, BIGSI, ...) is to construct a collection of Bloom filters, one per sample. Each Bloom filter is used to represent a set of k-mers which approximates the desired set of all the non-erroneous k-mers present in the sample. However, this approximation is imperfect, especially in the case of metagenomics data. Erroneous but abundant k-mers are wrongly included, and non-erroneous but low-abundance ones are wrongly discarded. We propose kmtricks, a novel approach for generating Bloom filters from terabase-sized collections of sequencing data. Our main contributions are 1/ an efficient method for jointly counting k-mers across multiple samples, including a streamlined Bloom filter construction by directly counting, partitioning and sorting hashes instead of k-mers, which is approximately four times faster than state-of-the-art too...

Fast searching in biological sequences using multiple hash functions

2012

With the availability of large amounts of DNA data, exact matching of nucleotide sequences has become an important application in modern computational biology and in metagenomics. In this paper we present an efficient method based on multiple hashing functions which improves the performance of existing string matching algorithms when used for searching DNA sequences. Our experimental results show that the proposed technique leads to algorithms which are up to 8 times faster than the best algorithm known for matching multiple patterns. The gain in performance is also larger when searching for larger sets. Thus, considering that the number of reads produced by next generation sequencing equipment is ever growing, the new technique serves as a good basis for massive multiple long pattern search applications.

High performance pattern matching using Bloom-Bloomier Filter

International Conference on Electrical Engineering/Electronics Computer Telecommunications and Information Technology (ECTI-CON), 2010

In this paper, we propose a high performance architecture based on the combination of Bloom Filter and Bloomier Filter (BBF) to enhance the speed of the pattern matching process on the Clam Antivirus (ClamAV) database. BBF maintains a small on-chip memory and a low number of false positives, and can indicate which patterns are the candidate matches. The implementation results on a low-cost Altera Cyclone II show that our architecture can handle a ClamAV pattern set of 43,491 characters with only 9.5 bits per character and achieve a throughput of 1 gigabit per second (Gbps). Compared with previous systems, our memory utilization is up to 73% better.

A Review on Role of Bloom Filter on DNA Assembly

IEEE Access, 2019

Advances in DNA assembly techniques have greatly boosted bioinformatics research and discovery. DNA assembly has become popular because it makes it possible to decode the information hidden in DNA. DNA assembly is the process of finding the correct sequence of nucleotide bases in DNA. The key challenges are (a) the size of the genomic data and (b) the time needed to process it. Genomic data are voluminous and contain many repeated fragments, which makes DNA assembly a time-consuming process. To address the space and time complexity, Bloom filters are deployed in DNA assembly, where they also play a vital role in handling repeated data. A Bloom filter is a probabilistic data structure for membership testing that uses a small amount of memory to store information about the genomic data. DNA assembly, however, is a very memory-intensive process consisting of many stages, and repeated data must be handled in every stage; hence a Bloom filter is deployed in every stage. This paper presents the impact of Bloom filters on the DNA assembly process. It also explains every aspect of the DNA assembly process. The focus of this paper is to review the techniques that use Bloom filters. Index Terms: Bloom filter, DNA sequencing, DNA assembly, de novo, de Bruijn graph, bioinformatics, big data.
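The memory claim can be made concrete with the standard textbook approximation for a Bloom filter's false positive rate. The symbols are introduced here for illustration and do not come from the abstract: m is the number of bits, n the number of stored elements, h the number of hash functions (h is used instead of the usual k to avoid confusion with k-mer length), and p the false positive rate. The approximation assumes independent, uniformly distributed hash functions.

```latex
p \approx \left(1 - e^{-hn/m}\right)^{h},
\qquad
h_{\mathrm{opt}} = \frac{m}{n}\ln 2,
\qquad
\frac{m}{n} = \frac{-\ln p}{(\ln 2)^{2}} \approx 1.44\,\log_{2}\frac{1}{p}.
```

Under these assumptions, a target false positive rate of 1% needs about 9.6 bits per stored element regardless of element length, which is far less than storing the k-mers themselves.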

Fast detection of maximal exact matches via fixed sampling of query K-mers and Bloom filtering of index K-mers

Bioinformatics, 2019

Motivation: Detection of maximal exact matches (MEMs) between two long sequences is a fundamental problem in pairwise reference-query genome comparisons. To efficiently compare larger and larger genomes, reducing the number of indexed k-mers as well as the number of query k-mers has been adopted as a mainstream approach, which saves computational resources by avoiding a significant number of unnecessary matches. Results: Under this framework, we propose a new method to detect all MEMs from a pair of genomes. The method first performs a fixed sampling of k-mers on the query sequence, and adds these selected k-mers to a Bloom filter. Then all the k-mers of the reference sequence are tested by the Bloom filter. If a k-mer passes the test, it is inserted into a hash table for indexing. Compared with existing methods, far fewer query k-mers are generated and far fewer k-mers are inserted into the index to avoid unnecessary matches, leading to an efficient matching process...
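The seeding steps described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the `BloomFilter` class, the function names and the `step` parameter are hypothetical, and the extension of seeds into full MEMs is omitted. If only matches of length at least L are wanted, a step of at most L - k + 1 ensures that every such match contains at least one sampled query k-mer.

```python
import hashlib
from collections import defaultdict


class BloomFilter:
    """Minimal Bloom filter over strings (illustrative only)."""

    def __init__(self, num_bits=1 << 16, num_hashes=4):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, item):
        # Double hashing: derive all bit positions from one SHA-256 digest.
        digest = hashlib.sha256(item.encode()).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1
        return [(h1 + i * h2) % self.num_bits for i in range(self.num_hashes)]

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(item))


def sample_query_kmers(query, k, step):
    """Fixed sampling: the k-mer starting at every step-th query position."""
    return [(i, query[i:i + k]) for i in range(0, len(query) - k + 1, step)]


def build_filtered_index(reference, query, k, step):
    """Index only the reference k-mers that pass the query Bloom filter."""
    bf = BloomFilter()
    for _, kmer in sample_query_kmers(query, k, step):
        bf.add(kmer)
    index = defaultdict(list)
    for pos in range(len(reference) - k + 1):
        kmer = reference[pos:pos + k]
        if kmer in bf:  # may admit a few false positives
            index[kmer].append(pos)
    return index


def seed_matches(reference, query, k, step):
    """Return (reference_pos, query_pos) pairs of exact k-mer seed matches."""
    index = build_filtered_index(reference, query, k, step)
    return [(rpos, qpos)
            for qpos, kmer in sample_query_kmers(query, k, step)
            for rpos in index.get(kmer, [])]
```

Because the final lookup goes through an exact hash table keyed by the k-mer string, Bloom filter false positives only cost a little index space; they never produce a wrong seed.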

Fast and Efficient Hashing for Sequence Similarity Search using Substring Extraction in DNA Sequence Databases

Emerging interest in genomic research has resulted in the creation of huge biological sequence databases. However, searching and retrieving relevant information from these databases takes a lot of processing time when performed conventionally, because the databases containing DNA sequences are huge. Hence, providing an efficient searching mechanism is essential. In this paper we present an efficient search mechanism using hashing techniques. Initially, the data is hashed and indexed according to different window sizes. During this process, we eliminate redundancies, record only patterns with distinct elements, and assign them corresponding hash values. During the search phase, the search string is checked against the window size, and if it exceeds the maximum limit of 4, it is divided. The first part is taken as the search string and the search is made. After the index is confirmed, the strings that follow the current indexed string are matched with the search string and the final confirmation is made. The simulation results show that the proposed methodology provides faster results while occupying less memory.

Role of Bloom Filter in Big Data Research: A Survey

International Journal of Advanced Computer Science and Applications

Big Data is one of the most popular emerging trends; it has become a blessing for humankind and a necessity of day-to-day life, with Facebook as one example. Every person is involved in producing data, either directly or indirectly. Big Data is thus a high volume of varied data with an exponential growth rate. Big Data touches all fields, including the government sector, the IT industry, business, the economy, engineering, bioinformatics, and other basic sciences, and so forms a data silo. Most of the data are duplicated and unstructured. To deal with such a data silo, the Bloom Filter is a valuable resource for filtering out duplicate data. The Bloom Filter is also indispensable in a Big Data storage system for optimizing memory consumption: it uses a small amount of memory to filter a very large dataset and stores information about a large set of data. Although the functionality of the Bloom Filter is limited to membership filtering, it can be adapted to various applications. The Bloom Filter is also deployed in diverse fields and used in interdisciplinary research areas such as bioinformatics. In this article, we present the usefulness of the Bloom Filter in Big Data research.