Parallel Processing in Genome Mapping and Sequencing

Sequential and parallel algorithms for DNA sequencing

Bioinformatics, 1997

Reconstruction of the original DNA sequence in the sequencing by hybridization (SBH) approach requires computational support owing to the large number of possible combinations. Existing algorithms generally neither admit false-negative data nor return all possible solutions. Results: In this paper, a new method of sequencing is proposed. An algorithm based on this idea, covering the general case in which some data are missing, as in real experiments, has been implemented and tested. Authentic DNA sequences were used for testing. A parallel version of the algorithm has also been implemented and tested. The quality of the reconstruction is satisfactory for libraries of oligonucleotides of length 8 to 12 and for sequences 100, 200 and 300 bp long. A way to further decrease the computation time is also suggested.
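
As a rough illustration of the combinatorial search involved, the sketch below reconstructs every sequence consistent with an ideal (error-free) spectrum by depth-first search. It is not the paper's algorithm, which additionally admits false-negative data; all names and the toy spectrum are illustrative.

```python
# Minimal SBH sketch for an ideal spectrum: enumerate all sequences of
# length n whose l-mer multiset equals the spectrum. Illustrative only.
from collections import Counter

def sbh_reconstruct(spectrum, n):
    """Return all length-n sequences using each spectrum l-mer exactly once."""
    l = len(next(iter(spectrum)))
    remaining = Counter(spectrum)
    solutions = []

    def extend(seq):
        if len(seq) == n:
            if not +remaining:            # every l-mer consumed
                solutions.append(seq)
            return
        for base in "ACGT":
            lmer = seq[-(l - 1):] + base
            if remaining[lmer] > 0:
                remaining[lmer] -= 1
                extend(seq + base)
                remaining[lmer] += 1

    for start in list(remaining):         # try each l-mer as the prefix
        remaining[start] -= 1
        extend(start)
        remaining[start] += 1
    return solutions

# Example: the spectrum of ATGCA with l = 3
print(sbh_reconstruct({"ATG", "TGC", "GCA"}, 5))   # ['ATGCA']
```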

Improving Bioinformatics Analysis of Large Sequence Datasets Parallelizing Tools for Population Genomics

Euro-Par 2016: Parallel Processing Workshops, 2017

Next-generation sequencing (NGS) technologies initiated a revolution in genomics, producing massive amounts of biological data and a consequent need to adapt current computing infrastructures. Multiple genome alignment, variant analysis and phylogenetic tree construction, with quadratic complexity in the best case, are tools that can take days or weeks to complete on conventional computers. Most of these analyses, which involve several tools integrated into workflows, allow the computational load to be divided into independent tasks and thus executed in parallel. Adequate load balancing, data partitioning, granularity and I/O tuning are key factors for achieving suitable speedups. In this paper we present a coarse-grain parallelization of GH caller (Genotype/Haplotype caller), a tool used in population genomics workflows that performs probabilistic identification to account for the frequency of variants among the individuals of a population. It implements a master-worker model using the standard Message Passing Interface (MPI), concurrently and iteratively distributing subsets of the data among the available worker processes while leaving orchestration to the master process. Our results show a performance gain of 260x over the initial non-parallelized version, using 64 processes together with additional optimizations.
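
The master-worker pattern described in the abstract is straightforward to sketch with mpi4py. The skeleton below is a generic illustration, not the GH caller's actual code: process_chunk and the chunk list are placeholders for the genotype/haplotype workload. It assumes mpi4py is installed and would be launched with, e.g., mpiexec -n 8 python script.py.

```python
# Generic MPI master-worker skeleton: the master hands out chunks
# iteratively, workers process and return results. Placeholder workload.
from mpi4py import MPI

TAG_WORK, TAG_STOP = 1, 2

def process_chunk(chunk):
    return sum(chunk)            # stand-in for the real per-chunk analysis

comm = MPI.COMM_WORLD
rank, size = comm.Get_rank(), comm.Get_size()

if rank == 0:                    # master: orchestration only
    chunks = [list(range(i, i + 10)) for i in range(0, 100, 10)]
    results, pending = [], 0
    for worker in range(1, size):          # prime each worker once
        if chunks:
            comm.send(chunks.pop(), dest=worker, tag=TAG_WORK)
            pending += 1
    while pending:                         # refill workers as they finish
        status = MPI.Status()
        results.append(comm.recv(source=MPI.ANY_SOURCE, status=status))
        pending -= 1
        if chunks:
            comm.send(chunks.pop(), dest=status.Get_source(), tag=TAG_WORK)
            pending += 1
    for worker in range(1, size):
        comm.send(None, dest=worker, tag=TAG_STOP)
    print("combined result:", sum(results))
else:                            # worker: receive, compute, send back
    while True:
        status = MPI.Status()
        chunk = comm.recv(source=0, status=status)
        if status.Get_tag() == TAG_STOP:
            break
        comm.send(process_chunk(chunk), dest=0)
```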

Significance of Massively Parallel Sequencing Strategies and De Novo Assembly Algorithms in Whole Genome Sequencing

Massively parallel sequencing (MPS) is a novel approach for sequencing genomes that provides significantly higher throughput than conventional sequencing platforms. It has therefore become the better option for obtaining the genome sequences of particular organisms with greater accuracy and precision. Genome assembly is performed immediately after genome sequencing, and if the genome of a particular organism has not been sequenced previously, de novo assembly is the sole means of acquiring the complete genome. For a plausible model organism, it is crucial to procure the complete genome sequence in order to provide better biological insight into that organism. This article therefore begins by discussing the significance of acquiring the complete genome sequence of a plausible model organism, and then the major approaches available for genome sequencing, showing the suitability of massively parallel sequencing for this task. Thereafter, next-generation sequencing platforms and sequence assembly algorithms are compared, showing the importance of the de novo assembly approach in genome sequencing, while highlighting the importance of quality assessment and validation procedures for sequenced genomes. Finally, the challenges in whole genome sequencing of model organisms, and their countermeasures, are addressed.

Introduction

Genome sequencing is the process by which the nucleotide order of a particular genome is obtained in terms of the four nucleotides adenine, guanine, thymine and cytosine. There are numerous advantages to acquiring the complete genome sequence of an organism, especially for research purposes, since it represents the entire biological and biochemical insight into the organism of interest [1]. As a result, most biological and biochemical research projects based on animals or plants demand sequence information in order to ascertain the biological processes or biochemical pathways taking place within them.
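
To make the de novo assembly idea concrete, the toy sketch below builds a de Bruijn graph from read k-mers and spells out its non-branching paths as contigs. It assumes error-free reads and a tiny k; real assemblers add error correction, graph simplification and scaffolding on top of this skeleton.

```python
# Toy de Bruijn graph construction and contig extraction.
from collections import defaultdict

def build_graph(reads, k):
    """(k-1)-mer overlap graph; duplicate edges collapsed for simplicity."""
    graph = defaultdict(set)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            graph[kmer[:-1]].add(kmer[1:])
    return graph

def contigs(graph):
    """Spell out every maximal non-branching path as a contig."""
    indeg = defaultdict(int)
    for succs in graph.values():
        for s in succs:
            indeg[s] += 1

    def simple(n):                        # exactly one way in, one way out
        return indeg[n] == 1 and len(graph.get(n, ())) == 1

    out = []
    for v in list(graph):
        if simple(v):
            continue                      # contigs start at branching nodes
        for w in graph[v]:
            path = [v, w]
            while simple(w):
                w = next(iter(graph[w]))
                path.append(w)
            out.append(path[0] + "".join(n[-1] for n in path[1:]))
    return out

print(contigs(build_graph(["ATGGCGT", "GGCGTGC"], k=4)))  # ['ATGGCGTGC']
```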

The complete genome of an individual by massively parallel DNA sequencing

Nature, 2008

The association of genetic variation with disease and drug response, and improvements in nucleic acid technologies, have given great optimism for the impact of 'genomic medicine'. However, the formidable size of the diploid human genome [1], approximately 6 gigabases, has prevented the routine application of sequencing methods to deciphering complete individual human genomes. To realize the full potential of genomics for human health, this limitation must be overcome. Here we report the DNA sequence of a diploid genome of a single individual, James D. Watson, sequenced to 7.4-fold redundancy using massively parallel sequencing in picolitre-size reaction vessels. This sequence was completed in two months at approximately one-hundredth of the cost of traditional capillary electrophoresis methods. Comparison of the sequence to the reference genome led to the identification of 3.3 million single nucleotide polymorphisms, of which 10,654 cause amino-acid substitutions within the coding sequence. In addition, we accurately identified small-scale (2-40,000 base pair (bp)) insertion and deletion polymorphisms, as well as copy number variation resulting in the large-scale gain and loss of chromosomal segments ranging from 26,000 to 1.5 million base pairs. Overall, these results agree well with recent results of sequencing of a single individual [2] by traditional methods. However, in addition to being faster and significantly less expensive, this sequencing technology avoids the arbitrary loss of genomic sequences inherent in random shotgun sequencing by bacterial cloning, because it amplifies DNA in a cell-free system. As a result, we further demonstrate the acquisition of novel human sequence, including novel genes not previously identified by traditional genomic sequencing. This is the first genome sequenced by next-generation technologies, and is therefore a pilot for the future challenges of 'personalized genome sequencing'.

Whole genome comparison using commodity workstations

2003

Whole genome comparison consists of comparing or aligning two genome sequences in the hope that analogous functional or physical characteristics may be observed. Sequence comparison is done via a number of slow, rigorous algorithms or via faster heuristic approaches. However, due to the large size of genomic sequences, the capacity of current software is limited.
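
The speed gap between rigorous and heuristic approaches comes largely from seeding: heuristic tools index the k-mers of one genome and scan the other for exact seed matches to extend, rather than filling a full dynamic-programming matrix. The toy sketch below shows that seeding idea; it is not any particular tool, and the sequences are made up.

```python
# Index one sequence's k-mers, then stream the other for shared seeds
# that an aligner would extend into full alignments.
from collections import defaultdict

def seed_matches(a, b, k=11):
    index = defaultdict(list)
    for i in range(len(a) - k + 1):
        index[a[i:i + k]].append(i)
    hits = []
    for j in range(len(b) - k + 1):
        for i in index.get(b[j:j + k], ()):
            hits.append((i, j))          # candidate anchor for extension
    return hits

# Shared 5-mers anchor the two toy "genomes" at (2, 4) and (3, 5).
print(seed_matches("ACGATTACA", "TTTTGATTAC", k=5))
```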

Data Analysis for Next Generation Sequencing – Parallel Computing Approaches in de Novo Assembly Algorithms

The new parallel sequencing technologies produce gigabases of genome information in just a few days and bring with them new problems of data storage and processing. Sequencing technologies have applications in human, plant and animal genome studies, metagenomics, epigenetics, and the discovery of non-coding RNAs and protein binding sites. There are two major problems in next-generation sequencing (NGS) data processing: algorithms for the alignment of sequences (where a reference sequence exists) and algorithms for de novo genome (sequence) assembly (where no reference sequence is available). Several factors determine the choice of the better algorithmic solution: cost, read length, data volume and the rate of data generation. As a result, the particular bioinformatics solution depends on the biological application and on the type of sequencing technology used to generate the data. All the technologies have their strengths and weaknesses and limits to their performance for providing error-free sequenc...
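
One concrete, easily parallelizable step inside de novo assembly is k-mer counting: reads can be split across processes and the partial counts merged. The sketch below is a shared-memory toy using only the standard library; the reads and K are illustrative.

```python
# Parallel k-mer counting: split reads across worker processes,
# count locally, then merge the partial histograms.
from collections import Counter
from multiprocessing import Pool

K = 4

def count_kmers(reads):
    c = Counter()
    for read in reads:
        for i in range(len(read) - K + 1):
            c[read[i:i + K]] += 1
    return c

def parallel_kmer_counts(reads, workers=4):
    chunks = [reads[i::workers] for i in range(workers)]  # round-robin split
    with Pool(workers) as pool:
        partials = pool.map(count_kmers, chunks)
    total = Counter()
    for p in partials:
        total.update(p)                   # merge partial histograms
    return total

if __name__ == "__main__":
    reads = ["ATGGCGTA", "GGCGTACC", "CGTACCAT"]
    print(parallel_kmer_counts(reads, workers=2).most_common(3))
```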

A distributed scheme for efficient pair-wise comparison of complete genomes

ICIIS, 1999

Comparison of newly sequenced genomes against a genome whose gene functionality is known provides important clues to the structure and function of genes and to the identification of metabolic pathways in newly sequenced organisms. New and more complex organisms are being added to biological databases at an increasing rate, and time-efficient, automated computational methods are needed to analyze this growing amount of data in realistic time. This paper describes a distributed technique, and a CORBA-based implementation, for comparing and aligning gene sequences in large complete genomes using multiple heterogeneous processors on a distributed network. The performance evaluation suggests that the distributed technique can significantly reduce the computational time.
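
Stripped of the CORBA machinery, the scheme's core idea is that every gene of the new genome is compared against every gene of the annotated genome, and the independent pairwise jobs are farmed out to the available processors. In the sketch below, a process pool stands in for the paper's distributed middleware, a Jaccard k-mer score stands in for the actual alignment step, and the gene names are made up.

```python
# Fan out independent gene-pair comparisons to worker processes.
from concurrent.futures import ProcessPoolExecutor
from itertools import product

def kmers(seq, k=3):
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

def compare(pair):
    (na, a), (nb, b) = pair
    ka, kb = kmers(a), kmers(b)
    return na, nb, len(ka & kb) / len(ka | kb)    # Jaccard similarity

if __name__ == "__main__":
    new_genes   = {"geneX": "ATGGCGTACGT", "geneY": "TTTCCGGAAGT"}
    known_genes = {"dnaA": "ATGGCGTACCT", "recA": "TTGCCGGTAGT"}
    jobs = list(product(new_genes.items(), known_genes.items()))
    with ProcessPoolExecutor() as pool:
        for na, nb, score in pool.map(compare, jobs):
            print(f"{na} vs {nb}: {score:.2f}")
```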

Collaborative computing for gene mapping

1993

The authors are investigating mechanisms for exploiting advances in high-performance computing and alignment algorithm development that will allow the analysis of newly acquired sequence data in real time and eliminate the global alignment problems associated with existing datasets. The presence of repetitive DNA sequences in the human genome complicates the process of homology comparison. Three approaches have been used to address this problem. Two of them eliminate the repetitive elements, either by removing the repetitive element from the query or by scoring words arising from repetitive elements poorly, or not at all, during the alignment process. The third approach identifies repetitive elements in the query by comparison to a known repeat set prior to comparison against the large database; any homologies returned that fall within a previously identified repeat are then ignored unless the homology exceeds set quality parameters. The homologies whic...
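
The first approach reduces to masking: regions of the query that match a known repeat library are blanked out before the expensive database comparison, so repeat-driven hits never arise. The toy sketch below shows that idea; the repeat set, query and k are hypothetical, and real pipelines use curated repeat libraries (e.g., Alu elements).

```python
# Mask query regions that share k-mers with a known repeat library.
def mask_repeats(query, repeats, k=6):
    repeat_kmers = {r[i:i + k] for r in repeats
                    for i in range(len(r) - k + 1)}
    masked = list(query)
    for i in range(len(query) - k + 1):
        if query[i:i + k] in repeat_kmers:
            masked[i:i + k] = "N" * k        # blank out the repeat hit
    return "".join(masked)

query   = "ACGTAAAAAATTTGCGC"
repeats = ["AAAAAA"]                         # known repetitive element
print(mask_repeats(query, repeats))          # ACGTNNNNNNTTTGCGC
```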

Removing Sequential Bottlenecks in Analysis of Next-Generation Sequencing Data

2014 IEEE International Parallel & Distributed Processing Symposium Workshops, 2014

Throughput from sequencing instruments has been increasing at an unprecedented speed, leading to an explosion of next-generation sequencing (NGS) data and to challenges in storing, managing and analyzing these datasets. Parallelism is the key to handling large-scale data, and some progress has been made in parallelizing important steps such as sequence alignment. However, other major steps remain sequential, limiting the ability to handle massive datasets. In this paper, we focus on parallelizing algorithms from two areas. The first is efficient data format conversion among a wide variety of sequence data formats, which is important for cross-utilization of different analysis modules. The second is statistical analysis. Our parallel sequence data format converter allows sequence datasets in BAM/SAM format to be converted into multiple formats, including SAM/BAM, BED, FASTA, FASTQ, BEDGRAPH, JSON and YAML, using both shared-memory and distributed-memory parallelism. The converter currently comprises three instances: a SAM format converter, a BAM format converter and a preprocessing-optimized SAM format converter. It also supports partial format conversion, performing conversion only on a specified chromosome region. The statistical analysis module includes a parallelized non-local means (NLmeans) algorithm and false discovery rate (FDR) computation. Through extensive evaluation, we demonstrate the high scalability of our framework.
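
FDR computation in this setting is commonly the Benjamini-Hochberg procedure; the abstract does not spell out the exact computation, so the following is a minimal, assumed BH sketch that returns adjusted p-values.

```python
# Benjamini-Hochberg FDR: adjust each p-value by p * m / rank, then
# enforce monotonicity from the largest p-value downward.
def benjamini_hochberg(pvalues):
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    for rank in range(m, 0, -1):         # walk from largest p to smallest
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted

pvals = [0.001, 0.008, 0.039, 0.041, 0.20]
print([round(q, 3) for q in benjamini_hochberg(pvals)])
# [0.005, 0.02, 0.051, 0.051, 0.2]
```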