GRASP with path-relinking for some molecular biology consensus problems

On some optimization problems in molecular biology

Mathematical Biosciences, 2007

In the last two decades, the study of gene structure and function and molecular genetics have become some of the most prominent sub-fields of molecular biology. Computational molecular biology has emerged as one of the most exciting interdisciplinary fields, riding on the success of the ongoing Human Genome Project, which culminated in the 2001 announcement of the complete sequencing of the human genome. The field has benefited from concepts and theoretical results obtained by different scientific research communities, including genetics, biochemistry, and computer science. Only in the past few years has it been shown that a large number of molecular biology problems can be formulated as combinatorial optimization problems, including sequence alignment problems, genome rearrangement problems, string selection and comparison problems, and protein structure prediction and recognition. This paper provides a detailed description of some of the most interesting molecular biology problems that can be formulated as combinatorial optimization problems and proposes a new heuristic to find improved solutions for a particular class of them, known as the far from most string problem.
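
To make the far from most string problem concrete: given a set of equal-length strings and a distance threshold, the goal is to find a string whose Hamming distance to as many of the input strings as possible is at least the threshold. The Python sketch below only evaluates that objective for a given candidate. It is not the heuristic proposed in the paper, and the names `hamming`, `ffmsp_objective`, and `threshold` are illustrative assumptions.

```python
from typing import Sequence


def hamming(a: str, b: str) -> int:
    """Number of positions at which two equal-length strings differ."""
    if len(a) != len(b):
        raise ValueError("strings must have equal length")
    return sum(x != y for x, y in zip(a, b))


def ffmsp_objective(candidate: str, strings: Sequence[str], threshold: int) -> int:
    """Count the input strings that are at least `threshold` away from `candidate`.

    A far-from-most-string heuristic would try to maximize this count
    over all candidate strings of the same length (assumed setup).
    """
    return sum(hamming(candidate, s) >= threshold for s in strings)


if __name__ == "__main__":
    inputs = ["AAAA", "AACC", "CCCC"]
    print(ffmsp_objective("CCCC", inputs, threshold=3))
    print(ffmsp_objective("GGGG", inputs, threshold=3))
```

A search heuristic would call such a function repeatedly to compare candidate strings, so in practice it is worth keeping it cheap.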

Algorithms on strings, trees and sequences: computer science and computational biology

1997

Although I didn't know it at the time, I began writing this book in the summer of 1988 when I was part of a computer science research group at the Human Genome Center of Lawrence Berkeley Laboratory. Our group followed the standard assumption that biologically meaningful results could come from considering DNA as a one-dimensional character string, abstracting away the reality of DNA as a flexible three-dimensional molecule, interacting in a dynamic environment with protein and RNA, and repeating a life-cycle in which even the classic linear chromosome exists for only a fraction of the time. A similar, but stronger, assumption existed for protein, holding for example that all the information needed for correct three-dimensional folding is contained in the protein sequence itself, essentially independent of the biological environment the protein lives in. This assumption has recently been modified, but remains largely intact.

A linear programming approach for identifying a consensus sequence on DNA sequences

Bioinformatics, 2005

Motivation: Maximum-likelihood methods for solving the consensus sequence identification (CSI) problem on DNA sequences may only find a local optimum rather than the global optimum. Additionally, such methods do not allow logical constraints to be imposed on their models. This study develops a linear programming technique to solve CSI problems by finding an optimum consensus sequence. This method is computationally more efficient and is guaranteed to reach the global optimum. The developed method can also be extended to treat more complicated CSI problems with ambiguous conserved patterns. Results: A CSI problem is first formulated as a non-linear mixed 0-1 optimization program, which is then converted into a linear mixed 0-1 program. The proposed method provides the following advantages over maximum-likelihood methods: (1) It is guaranteed to find the global optimum. (2) It can embed various logical constraints into the corresponding model. (3) It is applicable to problems with many long sequences. (4) It can find the second and the third best solutions. An extension of the proposed linear mixed 0-1 program is also designed to solve CSI problems with an unknown spacer length between conserved regions. Two examples of searching for CRP-binding sites and for FNR-binding sites in the Escherichia coli genome are used to illustrate and test the proposed method.

Combinatorial algorithms for DNA sequence assembly

Algorithmica, 1995

The trend towards very large DNA sequencing projects, such as those being undertaken as part of the human genome initiative, necessitates the development of efficient and precise algorithms for assembling a long DNA sequence from the fragments obtained by shotgun sequencing or other methods. The sequence reconstruction problem that we take as our formulation of DNA sequence assembly is a variation of the shortest common superstring problem, complicated by the presence of sequencing errors and reverse complements of fragments. Since the simpler superstring problem is NP-hard, any efficient reconstruction procedure must resort to heuristics. In this paper, however, a four-phase approach based on rigorous design criteria is presented, and has been found to be very accurate in practice. Our method is robust in the sense that it can accommodate high sequencing error rates and list a series of alternate solutions in the event that several appear equally good. Moreover, it uses a limited form of multiple sequence alignment to detect, and often correct, errors in the data. Our combined algorithm has successfully reconstructed non-repetitive sequences of length 50,000 sampled at error rates as high as 10 percent.
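
As a rough illustration of the simpler shortest common superstring problem mentioned above, the following Python sketch uses the textbook greedy heuristic: repeatedly merge the two fragments with the largest suffix-prefix overlap. This is not the four-phase approach of the paper. It assumes error-free fragments and ignores reverse complements, and the function names are hypothetical.

```python
from typing import List


def overlap(a: str, b: str) -> int:
    """Length of the longest suffix of `a` that is also a prefix of `b`."""
    for k in range(min(len(a), len(b)), 0, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def greedy_superstring(fragments: List[str]) -> str:
    """Greedy heuristic: the result contains every fragment, but need not be shortest."""
    # Drop duplicates, then fragments fully contained in another fragment.
    frags = list(dict.fromkeys(fragments))
    frags = [f for f in frags if not any(f != g and f in g for g in frags)]
    while len(frags) > 1:
        best_k, best_i, best_j = -1, 0, 1
        for i, a in enumerate(frags):
            for j, b in enumerate(frags):
                if i != j:
                    k = overlap(a, b)
                    if k > best_k:
                        best_k, best_i, best_j = k, i, j
        # Merge the best pair, keeping the overlapping part only once.
        merged = frags[best_i] + frags[best_j][best_k:]
        frags = [f for idx, f in enumerate(frags) if idx not in (best_i, best_j)]
        frags.append(merged)
    return frags[0] if frags else ""


if __name__ == "__main__":
    print(greedy_superstring(["ABC", "BCD"]))
```

Ties between equally good overlaps are broken by iteration order here, which is one reason a real assembler needs more careful criteria than this sketch.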

Efficient Algorithms for Handling Molecular Weighted Sequences

IFIP International Federation for Information Processing, 2004

In this paper we introduce the Weighted Suffix Tree, an efficient data structure for computing string regularities in weighted sequences of molecular data. Molecular Weighted Sequences can model important biological processes such as the DNA Assembly Process or the DNA-Protein Binding Process. Thus, pattern matching or identification of repeated patterns in biological weighted sequences is a very important procedure in the translation of gene expression and regulation. We present time- and space-efficient algorithms for constructing the weighted suffix tree and some applications of the proposed data structure to problems taken from the Molecular Biology area such as pattern matching, repeats discovery, discovery of the longest common subsequence of two weighted sequences and computation of covers.

Graph search and variable neighborhood search for finding constrained longest common subsequences in artificial and real gene sequences

Applied Soft Computing, 2022

We consider the constrained longest common subsequence problem with an arbitrary set of input strings as well as an arbitrary set of pattern strings. This problem has applications, for example, in computational biology where it serves as a measure of similarity for sets of molecules with putative structures in common. We contribute in several ways. First, it is formally proven that finding a feasible solution of arbitrary length is, in general, NP-complete. Second, we propose several heuristic approaches: a greedy algorithm, a beam search aiming for feasibility, a variable neighborhood search, and a hybrid of the latter two approaches. An exhaustive experimental study shows the effectiveness of and differences between the proposed approaches with respect to finding a feasible solution, finding high-quality solutions, and runtime for both artificial and real-world instance sets. The latter are generated from a set of 12681 bacteria 16S rRNA gene sequences and consider 15 primer contigs as pattern strings.
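
The notion of a feasible solution used above can be sketched in a few lines of Python: a candidate is feasible if it is a subsequence of every input string and every pattern string is a subsequence of the candidate. The sketch only checks a given candidate, which is easy. The hard part discussed in the abstract is finding one. The function names are hypothetical and not taken from the paper.

```python
from typing import Iterable


def is_subsequence(small: str, big: str) -> bool:
    """True if `small` can be obtained from `big` by deleting characters."""
    it = iter(big)
    # `ch in it` advances the iterator, so matches must occur in order.
    return all(ch in it for ch in small)


def is_feasible(candidate: str, inputs: Iterable[str], patterns: Iterable[str]) -> bool:
    """Check the two feasibility conditions of the constrained problem."""
    common = all(is_subsequence(candidate, s) for s in inputs)
    constrained = all(is_subsequence(p, candidate) for p in patterns)
    return common and constrained


if __name__ == "__main__":
    inputs = ["ACGTAC", "AGCTAC"]
    patterns = ["AT"]
    print(is_feasible("ACTAC", inputs, patterns))
    print(is_feasible("ACAC", inputs, patterns))
```

A heuristic such as a greedy or beam search could use a check of this kind to filter the candidates it generates.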

An Efficient Combinatorial Approach for Solving the DNA Motif Finding Problem

2009

The detection of an over-represented subsequence in a set of (carefully chosen) DNA sequences is often the main clue leading to the investigation of a possible functional role for such a subsequence. Over-represented substrings (possibly with local mutations) in a biological string are termed motifs. A typical functional unit that can be modeled by a motif is a Transcription Factor Binding Site (TFBS), a portion of the DNA sequence to which a protein that participates in complex transcriptomic biochemical reactions can bind.

An integer programming approach to DNA sequence assembly

Computational Biology and Chemistry, 2011

De novo sequence assembly is a ubiquitous combinatorial problem in all DNA sequencing technologies. In the presence of errors in the experimental data, the assembly problem is computationally challenging, and its solution may not lead to a unique reconstruct. The enumeration of all alternative solutions is important in drawing a reliable conclusion on the target sequence, and is often overlooked in the heuristic approaches that are currently available. In this paper, we develop an integer programming formulation and global optimization solution strategy to solve the sequence assembly problem with errors in the data. We also propose an efficient technique to identify all alternative reconstructs. When applied to examples of sequencing-by-hybridization, our approach dramatically increases the length of DNA sequences that can be handled with a global optimality certificate to over 10,000, which is more than 10 times longer than previously reported. For some problem instances, the alternative solutions differed widely in their ability to reproduce the target DNA sequence. Therefore, it is important to utilize the methodology proposed in this paper in order to obtain all alternative solutions to reliably infer the true reconstruct. These alternative solutions can be used to refine the obtained results and guide the design of further experiments to correctly reconstruct the target DNA sequence.