An integer programming approach to DNA sequence assembly
Related papers
Combinatorial algorithms for DNA sequence assembly
Algorithmica, 1995
The trend towards very large DNA sequencing projects, such as those being undertaken as part of the human genome initiative, necessitates the development of efficient and precise algorithms for assembling a long DNA sequence from the fragments obtained by shotgun sequencing or other methods. The sequence reconstruction problem that we take as our formulation of DNA sequence assembly is a variation of the shortest common superstring problem, complicated by the presence of sequencing errors and reverse complements of fragments. Since the simpler superstring problem is NP-hard, any efficient reconstruction procedure must resort to heuristics. In this paper, however, a four-phase approach based on rigorous design criteria is presented, and has been found to be very accurate in practice. Our method is robust in the sense that it can accommodate high sequencing error rates and list a series of alternate solutions in the event that several appear equally good. Moreover, it uses a limited form of multiple sequence alignment to detect, and often correct, errors in the data. Our combined algorithm has successfully reconstructed non-repetitive sequences of length 50,000 sampled at error rates as high as 10 percent.
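To make the superstring formulation above concrete, here is a minimal Python sketch of a greedy merge heuristic over exact suffix-prefix overlaps. It is not the four-phase method of the paper: it assumes error-free fragments and a single orientation, and the function names (`reverse_complement`, `overlap`, `greedy_superstring`) are illustrative choices, not taken from the source. The `reverse_complement` helper only shows the operation that makes the real problem harder; the merge itself does not try both orientations.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA string over A, C, G, T."""
    return seq.translate(COMPLEMENT)[::-1]


def overlap(a: str, b: str) -> int:
    """Length of the longest suffix of `a` that is also a prefix of `b`."""
    for k in range(min(len(a), len(b)), 0, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def greedy_superstring(fragments: list[str]) -> str:
    """Repeatedly merge the pair of fragments with the largest exact overlap.

    Illustrative heuristic only (assumption: no sequencing errors, all
    fragments given in the same orientation). The result contains every
    input fragment but is not guaranteed to be the shortest superstring.
    """
    # Remove duplicates, then fragments fully contained in another fragment.
    frags = list(dict.fromkeys(fragments))
    frags = [f for f in frags if not any(f != g and f in g for g in frags)]
    while len(frags) > 1:
        best_k, best_i, best_j = -1, 0, 1
        for i, a in enumerate(frags):
            for j, b in enumerate(frags):
                if i != j:
                    k = overlap(a, b)
                    if k > best_k:
                        best_k, best_i, best_j = k, i, j
        merged = frags[best_i] + frags[best_j][best_k:]
        frags = [f for idx, f in enumerate(frags)
                 if idx not in (best_i, best_j)] + [merged]
    return frags[0] if frags else ""


if __name__ == "__main__":
    reads = ["ATGCG", "GCGTA", "GTACC"]
    print(greedy_superstring(reads))
    print(reverse_complement("ATGCG"))
```

Handling both orientations would roughly double the candidate set (each fragment and its reverse complement), and tolerating sequencing errors would require approximate rather than exact overlaps; both are left out of this sketch.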
Accurate Reconstruction for DNA Sequencing by Hybridization Based on a Constructive Heuristic
IEEE/ACM Transactions on Computational Biology and Bioinformatics, 2011
Sequencing by hybridization is a promising cost-effective technology for high-throughput DNA sequencing via microarray chips. However, due to the effects of spectrum errors rooted in experimental conditions, an accurate and fast reconstruction of original sequences has become a challenging problem. In the last decade, a variety of analyses and designs have been tried to overcome this problem, where different strategies have different trade-offs in speed and accuracy. Motivated by the idea that the errors could be identified by analyzing the interrelation of spectrum elements, this paper presents a constructive heuristic algorithm, featuring an accurate reconstruction guided by a set of well-defined criteria and rules. Instead of directly reconstructing the original sequence, the new algorithm first builds several accurate short fragments, which are then carefully assembled into a whole sequence. The experiments on benchmark instance sets demonstrate that the proposed method can reconstruct long DNA sequences with higher accuracy than current approaches in the literature.
Sequencing-by-Hybridization at the Information-Theory Bound: An Optimal Algorithm
Journal of Computational Biology, 2000
In a recent paper (Preparata et al., 1999) we introduced a novel probing scheme for DNA sequencing by hybridization (SBH). The new gapped-probe scheme combines natural and universal bases in a well-defined periodic pattern. It has been shown (Preparata et al., 1999) that the performance of the gapped-probe scheme (in terms of the length of a sequence that can be uniquely reconstructed using a given size library of probes) is significantly better than the standard scheme based on oligomer probes. In this paper we present and analyze a new, more powerful, sequencing algorithm for the gapped-probe scheme. We prove that the new algorithm exploits the full potential of the SBH technology with high-confidence performance that comes within a small constant factor (about 2) of the information-theory bound. Moreover, this performance is achieved while maintaining running time linear in the target sequence length.
On the Complexity of Positional Sequencing by Hybridization
Journal of Computational Biology, 2001
In sequencing by hybridization (SBH), one has to reconstruct a sequence from its l-long substrings. SBH was proposed as an alternative to gel-based DNA sequencing approaches, but in its original form the method is not competitive. Positional SBH (PSBH) is a recently proposed enhancement of SBH in which one has additional information about the possible positions of each substring along the target sequence. We give a linear time algorithm for solving PSBH when each substring has at most two possible positions. On the other hand, we prove that the problem is NP-complete if each substring has at most three possible positions. We also show that PSBH is NP-complete if the set of allowed positions for each substring is an interval of length k and provide a fast algorithm for the latter problem when k is bounded.
Algorithms for optimizing production DNA sequencing
2000
We discuss the problem of optimally "finishing" a partially sequenced, reconstructed DNA segment. At first sight, this appears to be computationally hard. We construct a series of increasingly realistic models for the problem and show that all of these can in fact be solved to optimality in polynomial time, with near-optimal solutions available in linear time. Implementation of our algorithms could result in a substantial efficiency gain for automated DNA sequencing.
International Journal of …
Since the advent of rapid DNA sequencing methods in 1976, scientists have had the problem of inferring DNA sequences from sequenced fragments. Shotgun sequencing is a well-established biological and computational method used in practice. Many conventional algorithms for shotgun sequencing are based on the notion of pairwise fragment overlap. While shotgun sequencing infers a DNA sequence given the sequences of overlapping fragments, a recent and complementary method, called sequencing by hybridization (SBH), infers a DNA sequence given the set of oligomers that represents all subwords of some fixed length, k. In this paper, we propose a new computer algorithm for DNA sequence assembly that combines in a novel way the techniques of both shotgun and SBH methods. Based on our preliminary investigations, the algorithm promises to be very fast and practical for DNA sequence assembly [1].
DNA Sequencing by Hybridization via Genetic Search
Operations Research, 2006
An innovative approach to DNA sequencing by hybridization utilizes isothermic oligonucleotide libraries. In this paper, we demonstrate the utility of a genetic algorithm for the combinatorial portion of this new approach by incorporating characteristics of DNA sequencing by hybridization in addition to isothermic oligonucleotide libraries. Specialized crossover and mutation operators were developed for this purpose. After initial experiments for parameter adjustment, the performance of the genetic algorithm approach was evaluated with respect to previous methods in the literature. The results indicate that the proposed new approach is superior to previous approaches. The proposed new crossover operator that inherits some features of the structured weighted combinations might also be of value for some other combinatorial problems, including the traveling salesman problem.
Dealing with repetitions in sequencing by hybridization
Computational Biology and Chemistry, 2006
DNA sequencing by hybridization (SBH) induces errors in the biochemical experiment. Some of them are random and disappear when the experiment is repeated. Others are systematic, involving repetitions in the probes of the target sequence. A good method for solving SBH problems must deal with both types of errors. In this work we propose a new hybrid genetic algorithm for isothermic and standard sequencing that incorporates the concept of structured combinations. The algorithm is then compared with other methods designed for handling errors that arise in standard and isothermic SBH approaches. DNA sequences used for testing are taken from GenBank. The set of instances for testing was divided into two groups. The first group consisted of sequences containing positive and negative errors in the spectrum, at a rate of up to 20%, excluding errors coming from repetitions. The second group consisted of sequences containing repeated oligonucleotides, with additional errors of up to 5% added into the spectra. Our new method outperforms the best alternative procedures for both data sets. Moreover, the method produces solutions exhibiting an extremely high degree of similarity to the target sequences in the cases without repetitions, which is an important outcome for biologists. The spectra prepared from the sequences taken from GenBank are available on our website http://bio.cs.put.poznan.pl/.
Seeding strategies and recombination operators for solving the DNA fragment assembly problem
Information Processing Letters, 2008
The fragment assembly problem consists in building the DNA sequence from several hundred (or even thousands) of fragments obtained by biologists in the laboratory. This is an important task in any genome project since the rest of the phases depend on the accuracy of the results of this stage. Therefore, accurate and efficient methods for handling this problem are needed. Genetic Algorithms (GAs) have been proposed to solve this problem in the past, but a detailed analysis of their components is needed if we aim to create a GA capable of working in industrial applications. In this paper, we take a first step in this direction, and focus on two components of the GA: the initialization of the population and the recombination operator. We propose several alternatives for each one and analyze the behavior of the different variants. Results indicate that using a heuristically generated initial population and the Edge Recombination (ER) operator is the best approach for constructing accurate and efficient GAs to solve this problem.
Large Scale Sequencing by Hybridization
Journal of Computational Biology, 2002
Sequencing by Hybridization is a method for reconstructing a DNA sequence based on its k-mer content. This content, called the spectrum of the sequence, can be obtained from hybridization with a universal DNA chip. However, even with a sequencing chip containing all 4^9 9-mers and assuming no hybridization errors, only sequences of about 400 bases can be reconstructed unambiguously.
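The notions of spectrum and unambiguous reconstruction can be illustrated with a small Python sketch. It assumes an error-free spectrum, a known target length, and k of at least 2; the function names `spectrum` and `reconstructions` are hypothetical, and the brute-force search is only practical for toy inputs. It is not the method of any paper listed here.

```python
def spectrum(seq: str, k: int) -> set[str]:
    """Set of all distinct k-mers occurring in `seq`."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def reconstructions(spec: set[str], k: int, length: int) -> set[str]:
    """All sequences of the given length whose spectrum equals `spec`.

    Brute-force sketch (assumptions: k >= 2, no hybridization errors,
    target length known). Exponential in the worst case.
    """
    results: set[str] = set()

    def extend(seq: str) -> None:
        if len(seq) == length:
            if spectrum(seq, k) == spec:
                results.add(seq)
            return
        # Append a base only if the resulting last k-mer is in the spectrum.
        for base in "ACGT":
            if seq[-(k - 1):] + base in spec:
                extend(seq + base)

    for start in spec:
        extend(start)
    return results


if __name__ == "__main__":
    # A universal chip for k = 9 holds 4**9 = 262144 probes.
    print(4 ** 9)

    # No repeated (k-1)-mer: the spectrum determines the sequence.
    target = "ATGCGTA"
    print(reconstructions(spectrum(target, 3), 3, len(target)))

    # Repeated (k-1)-mers: several sequences share one spectrum.
    ambiguous = "ACATAGA"
    print(len(reconstructions(spectrum(ambiguous, 2), 2, len(ambiguous))) > 1)
```

In the second example the base A recurs, so the 2-mer spectrum of ACATAGA is also the spectrum of ATACAGA; this is the kind of ambiguity that limits the length of sequences a fixed-k chip can resolve.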