Rumen Andonov - Academia.edu
Papers by Rumen Andonov
Algorithms for molecular biology, Feb 6, 2024
Background: Scaffolding is an intermediate stage of fragment assembly. It consists in orienting and ordering the contigs obtained by assembling the sequencing reads. In the general case, the problem has been widely studied using distance data between the contigs. Here we focus on scaffolding dedicated to chloroplast genomes. As these genomes are small, circular and contain few specific repeats, numerous approaches have been proposed to assemble them. However, their specificities have not been sufficiently exploited. Results: We give a new formulation of scaffolding for chloroplast genomes as a discrete optimisation problem, whose decision version we prove to be NP-complete. We take advantage of the knowledge of chloroplast genomes and succeed in expressing the relationships between a few specific genomic repeats as mathematical constraints. Our approach is independent of the distances and adopts a genomic regions view, giving priority to scaffolding the repeats first. In this way, we encode the structural haplotype issue in order to retrieve several genome forms that coexist in the same chloroplast cell. To solve the optimisation problem exactly, we develop an integer linear program that we implement in the Python3 package khloraascaf. We test it on synthetic data to investigate its performance and its robustness against several chosen difficulties. Conclusions: We succeed in modelling biological knowledge of genomic structures to scaffold chloroplast genomes. Our results suggest that modelling genomic regions is sufficient for scaffolding repeats and is suitable for finding several solutions corresponding to several genome forms.
HAL (Le Centre pour la Communication Scientifique Directe), Jun 18, 2018
Given a directed graph G = (V, E, l) with weights l_e ≥ 0 associated with arcs e ∈ E and a set of vertex pairs with distances between them (called distance constraints), the problem is to find an elementary path in G that satisfies a maximum number of distance constraints. We describe two MIP formulations for this problem and discuss their advantages.
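To make the objective of this problem concrete, the following minimal sketch counts how many distance constraints a given elementary path satisfies. It is only an illustration and not one of the MIP formulations from the paper. The abstract does not say when a constraint counts as satisfied, so the sketch assumes that a constraint (u, v, lo, hi) is satisfied when u precedes v on the path and the total arc weight between them lies in [lo, hi]. The function name `count_satisfied` and the interval form of the constraints are hypothetical.

```python
from itertools import accumulate


def count_satisfied(path, weights, constraints):
    """Count the distance constraints satisfied by an elementary path.

    path        -- list of vertices, each appearing at most once
    weights     -- dict mapping an arc (u, v) to its weight l_e >= 0
    constraints -- list of (u, v, lo, hi); assumed satisfied when u comes
                   before v on the path and lo <= distance(u, v) <= hi
    """
    if len(set(path)) != len(path):
        raise ValueError("path is not elementary: a vertex is repeated")

    # prefix[i] is the total weight of the arcs from path[0] to path[i].
    steps = [weights[(a, b)] for a, b in zip(path, path[1:])]
    prefix = [0] + list(accumulate(steps))
    position = {v: i for i, v in enumerate(path)}

    satisfied = 0
    for u, v, lo, hi in constraints:
        if u in position and v in position and position[u] < position[v]:
            distance = prefix[position[v]] - prefix[position[u]]
            if lo <= distance <= hi:
                satisfied += 1
    return satisfied


weights = {("a", "b"): 2, ("b", "c"): 3, ("c", "d"): 1}
constraints = [("a", "c", 4, 6), ("b", "d", 5, 5), ("a", "d", 1, 3)]
print(count_satisfied(["a", "b", "c", "d"], weights, constraints))
```

An exact method would search over all elementary paths for the one that maximises this count; the sketch only evaluates a single candidate path.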
HAL (Le Centre pour la Communication Scientifique Directe), Jul 5, 2022
The scaffolding step in genome assembly aims to determine the order and the orientation of a huge number of previously assembled genomic fragments (contigs/scaffolds). Here we introduce a particular case of this problem and call it Nested Inverted Fragments Scaffolding (NIFS). We formulate it as an optimisation problem in a particular kind of directed graph that we call the Multiplied Doubled Contigs Graph (MDCG). Furthermore, we prove that the NIFS problem is NP-hard. We also discuss how the chloroplast data have been generated by filtering the reads sequenced from both plants and chloroplasts. Moreover, we propose a graph structure to visualise the solution and to highlight the particular structure of chloroplast regions.
HAL (Le Centre pour la Communication Scientifique Directe), Jul 6, 2015
Almost 25% of proteins contain internal repeats, and these repeats may play a major role in protein function. Furthermore, some proteins, the solenoids, consist of the same substructure repeated many times. However, only a few repeat detection programs exist. We present here Kunoichi, a simple and efficient tool for discovering protein repeats. Kunoichi is based on protein fragment comparison and clique detection. As first results, we show that Kunoichi can find different levels of repetition and successfully identify protein tiles. Kunoichi is available on request from the authors.
HAL (Le Centre pour la Communication Scientifique Directe), 2008
Studying a genome in silico requires two steps: sequencing it, by cloning and cutting the genome into several reads, and then assembling the reads. It is well known that the number of sequencing errors is proportional to the read size. However, long reads can be an advantage when dealing with repeated regions of the genome. De novo assembly is an assembly method that does not use a reference. The tool described here, named LOREAS, is intended to be a de novo assembler working in two tasks: first, ordering the long reads, and then obtaining a consensus sequence of the ordered reads. Currently, only the first task has been implemented. While other de novo long-read assemblers use heuristics and De Bruijn graphs, LOREAS is based on overlap similarity between all the long reads. It uses integer linear programming to find the heaviest path in a graph G = (V, E, λ), where V is the set of vertices corresponding to the long reads and E is the set of edges associated with the overlaps between long reads, weighted by λ, the overlap length. When this graph is too large, the set of reads V is partitioned into several parts, which are then solved sequentially. Here we present the solution of the first task for ten bacterial genomes. Seven of them have been successfully solved in less than 12 minutes on a laptop.
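The heaviest-path objective on the overlap graph G = (V, E, λ) can be illustrated on a toy instance. The sketch below is not the integer linear program used by LOREAS: it swaps in a plain exhaustive depth-first search, which is only practical for a handful of reads. It also assumes that the path must be simple (each read used at most once), which the abstract does not state. The function name `heaviest_path` and the read identifiers are hypothetical.

```python
def heaviest_path(vertices, edges):
    """Exhaustively search for the heaviest simple path.

    vertices -- iterable of read identifiers (the set V)
    edges    -- dict mapping (u, v) to the overlap length (the weight λ)
    Returns (total_weight, path); (0, []) if the graph has no positive edge.
    """
    vertices = list(vertices)
    successors = {v: [] for v in vertices}
    for (u, v), weight in edges.items():
        successors[u].append((v, weight))

    best = (0, [])

    def extend(path, seen, weight):
        nonlocal best
        if weight > best[0]:
            best = (weight, list(path))
        for nxt, w in successors[path[-1]]:
            if nxt not in seen:
                # Try the read, recurse, then backtrack.
                path.append(nxt)
                seen.add(nxt)
                extend(path, seen, weight + w)
                path.pop()
                seen.remove(nxt)

    for start in vertices:
        extend([start], {start}, 0)
    return best


reads = ["r1", "r2", "r3", "r4"]
overlaps = {
    ("r1", "r2"): 5,
    ("r2", "r3"): 4,
    ("r1", "r3"): 7,
    ("r3", "r4"): 2,
    ("r2", "r4"): 1,
}
print(heaviest_path(reads, overlaps))
```

The search time grows factorially with the number of reads, which is one reason an exact approach on real data needs a dedicated formulation and, as the abstract notes, a partition of V when the graph is too large.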
Fold recognition methods are promising tools for capturing the structure of a protein from its amino acid sequence, but their use is still restricted by the need for huge computational resources and for suitably efficient algorithms. In the recent version of the FROST (Fold Recognition Oriented Search Tool) package, the most efficient algorithm for solving the Protein Threading Problem (PTP) is implemented, thanks to the strong collaboration between the SYMBIOSE group at IRISA and MIG in Jouy-en-Josas. In this paper, we present the diverse components of FROST, emphasizing the recent advances in formulating and solving new versions of the PTP and the way a million instances can be solved on a computer cluster in a reasonable time.
HAL (Le Centre pour la Communication Scientifique Directe), 2012
CSA is a web server for the comprehensive comparison of pairwise protein structure alignments. Its exact alignment engine computes either optimal, top-scoring alignments or heuristic alignments with quality guarantee for the inter-residue distance based scorings of contact map overlap, PAUL, DALI and MATRAS. These and additional, uploaded alignments are compared using a number of quality measures and intuitive visualizations. CSA brings new insight into the structural relationship of the protein pairs under investigation and is a valuable tool for studying structural similarities. It is available at http://csa.project.cwi.nl.
arXiv (Cornell University), Dec 13, 2017
We propose an optimization approach for determining both hardware and software parameters for the efficient implementation of a (family of) applications called dense stencil computations on programmable GPGPUs. We first introduce a simple, analytical model for the silicon area usage of accelerator architectures and a workload characterization of stencil computations. We combine this characterization with a parametric execution time model and formulate a mathematical optimization problem. That problem seeks to maximize a common objective function of all the hardware and software parameters. The solution to this problem therefore "solves" the codesign problem: simultaneously choosing software-hardware parameters to optimize total performance. We validate this approach by proposing architectural variants of the NVIDIA Maxwell GTX-980 (respectively, Titan X) specifically tuned to a predetermined workload of four common 2D stencils (Heat, Jacobi, Laplacian, and Gradient) and two 3D ones (Heat and Laplacian). Our model predicts that performance would potentially improve by 28% (respectively, 33%) with simple tweaks to the hardware parameters such as adapting coarse and fine-grained parallelism by changing the number of streaming multiprocessors and the number of compute cores each contains. We propose a set of Pareto-optimal design points to exploit the trade-off between performance and silicon area and show that by additionally eliminating GPU caches, we can get a further 2-fold improvement.
HAL (Le Centre pour la Communication Scientifique Directe), Sep 5, 2016
We develop a method for solving genome scaffolding as a problem of finding a long simple path in a graph defined by the contigs that satisfies additional constraints encoding the insert-size information. We then solve the resulting mixed integer linear program to optimality using the Gurobi solver. We test our algorithm on several chloroplast genomes and show that it is fast and outperforms other widely used assembly algorithms in the accuracy of its results.
HAL (Le Centre pour la Communication Scientifique Directe), Feb 23, 2022
HAL (Le Centre pour la Communication Scientifique Directe), Oct 14, 2022
Assembling DNA fragments based on their overlaps remains the main assembly paradigm for long-fragment DNA sequencing technologies, regardless of whether the aim is to resolve one or several haplotypes. Since an overlap can be seen as a succession relationship between two oriented fragments, the directed graph has emerged as an appropriate data structure for handling overlaps. However, this graph paradigm does not appear to take advantage of the reverse symmetry of the oriented fragments and their overlaps, which results from blind DNA double-strand sequencing. Thus, the bi-directed graph paradigm was introduced in 1995 to reduce the graph size by handling the reverse symmetry, and it has since become the main graph paradigm used in assembly/scaffolding methods. Nevertheless, the available graph paradigms have never been contrasted before, and no implementations have been described. Here we give a complete review of the existing overlap graph paradigms. Furthermore, we present suitable data structures and compare them theoretically in terms of time and memory consumption in the context of the design of some basic graph algorithms. We also show that each paradigm can be converted into another by slightly modifying its data structures.
HAL (Le Centre pour la Communication Scientifique Directe), Jul 12, 2021
HAL is a multidisciplinary open access archive for the deposit and dissemination of scientific research documents, whether they are published or not. The documents may come from teaching and research institutions in France or abroad, or from public or private research centers.
A new algorithm for finding the largest ordered common subgraph. Abstract: In this paper, we study the following problem: given the adjacency matrices of two simple graphs, find two principal submatrices (in fact, two vectors) with the largest scalar product. When used to compute the similarity of two protein structures, this problem is called Contact Map Overlap (CMO). We then present an exact branch and bound algorithm whose bounds are computed by solving the Lagrangian relaxation of this problem. The efficiency of this approach is demonstrated on a benchmark of real instances, in comparison with the best existing algorithm.
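The Contact Map Overlap objective can be illustrated on a toy instance. The sketch below is not the branch and bound algorithm with Lagrangian bounds from the paper: it swaps in a plain exhaustive search over order-preserving vertex selections, which is only feasible for very small graphs. It assumes that contact maps are given as symmetric 0/1 adjacency matrices (lists of lists); the function names `overlap` and `max_overlap` are hypothetical.

```python
from itertools import combinations


def overlap(a, b, rows, cols):
    """Count the contacts shared by two graphs under an ordered matching.

    rows and cols are increasing index tuples of equal length; vertex
    rows[i] of graph a is matched with vertex cols[i] of graph b.
    """
    return sum(
        1
        for i, j in combinations(range(len(rows)), 2)
        if a[rows[i]][rows[j]] and b[cols[i]][cols[j]]
    )


def max_overlap(a, b):
    """Exhaustively find an ordered matching with the most shared contacts.

    Returns (score, rows, cols); (0, (), ()) if no contact can be shared.
    """
    best = (0, (), ())
    for k in range(1, min(len(a), len(b)) + 1):
        for rows in combinations(range(len(a)), k):
            for cols in combinations(range(len(b)), k):
                score = overlap(a, b, rows, cols)
                if score > best[0]:
                    best = (score, rows, cols)
    return best


# Graph a is a path on four vertices; graph b is a triangle.
a = [
    [0, 1, 0, 0],
    [1, 0, 1, 0],
    [0, 1, 0, 1],
    [0, 0, 1, 0],
]
b = [
    [0, 1, 1],
    [1, 0, 1],
    [1, 1, 0],
]
score, rows, cols = max_overlap(a, b)
print(score, rows, cols)
```

Because the index tuples are increasing, every candidate matching preserves the vertex order, which is the "ordered" aspect of the common subgraph. The number of candidates grows combinatorially, which is why an exact method needs strong bounds to prune the search.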