Filippo Utro - Academia.edu
Papers by Filippo Utro
International Journal of Molecular Sciences
Papillomaviruses (PVs) are a heterogeneous group of DNA viruses that can infect fish, birds, reptiles, and mammals. PVs infecting humans (HPVs) phylogenetically cluster into five genera (Alpha-, Beta-, Gamma-, Mu- and Nu-PV), with differences in tissue tropism and carcinogenicity. The evolutionary features associated with the divergence of Papillomaviridae are not well understood. Using a combination of k-mer distributions, genetic metrics, and phylogenetic algorithms, we sought to evaluate the characteristics and differences of Alpha-, Beta- and Gamma-PVs constituting the majority of HPV genomes. A total of 640 PVs including 442 HPV types, 27 non-human primate PV types, and 171 non-primate animal PV types were evaluated. Our analyses revealed the highest genetic diversity amongst Gamma-PVs compared to the Alpha- and Beta-PVs, suggesting reduced selective pressures on Gamma-PVs. Using a sequence alignment-free trimer (k = 3) phylogeny algorithm, we reconstructed a phylogeny that grou...
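To make the alignment-free trimer idea concrete, the following is a minimal sketch of how two sequences might be compared through their k = 3 frequency profiles. It is an illustration under stated assumptions, not the paper's algorithm: the function names `trimer_profile` and `profile_distance` are hypothetical, and the Euclidean distance is an assumed choice, since the abstract does not say which distance or tree-building step the authors use.

```python
from collections import Counter
from itertools import product
from math import sqrt

ALPHABET = "ACGT"
# All 64 possible DNA trimers, in a fixed order so that profiles are comparable.
TRIMERS = ["".join(p) for p in product(ALPHABET, repeat=3)]


def trimer_profile(seq):
    """Return the relative frequency of each of the 64 trimers in seq.

    Windows containing symbols outside ACGT (for example N) are ignored.
    Hypothetical helper for illustration only.
    """
    seq = seq.upper()
    counts = Counter(seq[i:i + 3] for i in range(len(seq) - 2))
    total = sum(counts[t] for t in TRIMERS)
    if total == 0:
        return [0.0] * len(TRIMERS)
    return [counts[t] / total for t in TRIMERS]


def profile_distance(a, b):
    """Euclidean distance between two trimer profiles (an assumed metric)."""
    pa, pb = trimer_profile(a), trimer_profile(b)
    return sqrt(sum((x - y) ** 2 for x, y in zip(pa, pb)))
```

A matrix of such pairwise distances over a set of genomes could then be passed to any distance-based tree-building method; no sequence alignment is needed at any step, which is what makes the approach alignment-free.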
Bioinformatics, 2015
Simulating complex evolution scenarios of multiple populations is an important task for answering many basic questions relating to population genomics. Apart from the population samples, the underlying Ancestral Recombinations Graph (ARG) is an additional important means in hypothesis checking and reconstruction studies. Furthermore, complex simulations require a plethora of interdependent parameters, making even the scenario specification highly non-trivial. We present an algorithm, SimRA, that simulates a generic multiple-population evolution model with admixture. It is based on random graphs that dramatically improve on the time and space requirements of the classical algorithm for single populations. Using the underlying random graphs model, we also derive closed forms of expected values of the ARG characteristics, i.e., height of the graph, number of recombinations, number of mutations and population diversity, in terms of its defining parameters. This is crucial in aiding the user to specify meaningful parameters for the complex scenario simulations, not through trial-and-error based on raw compute power but through intelligent parameter estimation. To the best of our knowledge, this is the first time closed-form expressions have been computed for the ARG properties. We show through simulations that the expected values closely match the empirical values. Finally, we demonstrate that SimRA produces the ARG in compact forms without compromising any accuracy. We demonstrate the compactness and accuracy through extensive experiments. SimRA (Simulation based on Random graph Algorithms) source, executable, user manual and sample input-output sets are available for downloading at: https://github.com/ComputationalGenomics/SimRA Contact: parida@us.ibm.com Supplementary information: Supplementary data are available at Bioinformatics online.
Bioinformatics (Oxford, England), Jan 16, 2015
Thanks to research spanning nearly 30 years, two major models have emerged that account for nucleosome organization in chromatin: statistical and sequence specific. The first is based on elegant, easy to compute, closed-form mathematical formulas that make no assumptions about the physical and chemical properties of the underlying DNA sequence. Moreover, they need no training on the data for their computation. The latter is based on some sequence regularities but, as opposed to the statistical model, it lacks the same type of closed-form formulas that, in this case, should be based on the DNA sequence only. We help close this important methodological gap between the two models by providing three very simple formulas for the sequence specific one. They are all based on well known formulas in Computer Science and Bioinformatics, and they give different quantifications of how complex a sequence is. In view of how remarkably well they perform, it is very surprising that measures o...
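As an illustration of what a closed-form, training-free measure computed from the DNA sequence alone can look like, here is a minimal sketch of the empirical Shannon entropy of a sequence's k-mer composition. This is an assumed example of a well known complexity measure: the truncated abstract does not name the paper's three formulas, so this should not be read as one of them, and the function name `kmer_entropy` is hypothetical.

```python
from collections import Counter
from math import log2


def kmer_entropy(seq, k=1):
    """Empirical Shannon entropy (in bits) of the k-mer composition of seq.

    Low values indicate a repetitive, low-complexity sequence; the maximum
    for DNA is 2*k bits, reached when all 4**k k-mers are equally frequent.
    Hypothetical helper for illustration only.
    """
    seq = seq.upper()
    kmers = [seq[i:i + k] for i in range(len(seq) - k + 1)]
    if not kmers:
        return 0.0
    n = len(kmers)
    counts = Counter(kmers)
    return -sum((c / n) * log2(c / n) for c in counts.values())
```

Like the formulas the abstract describes, this quantity depends only on the sequence itself and requires no parameters fitted to data.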
Quantitative trait loci (QTL) mapping associates a phenotypic trait with a region of the genome. In standard QTL mapping, each marker locus is considered individually. Another approach is to associate the trait with haplotype blocks rather than individual markers. To this end, we have developed iXora (Identifying crossovers and recombining alleles), a robust and efficient method for extracting reliable haplotypes of a mapping population, including the parental haplotypes. Each allele in the progeny is assigned to a haplotype inherited from the parent, and this information is used to associate the parental haplotypes with traits in the progeny. We apply the iXora phasing and trait association framework to the avocado Florida mapping populations from the crosses Simmonds x Tonnage and Tonnage x Simmonds. The increased resolution from phasing is employed to locate the trait locus controlling flower type. We show that flower type is determined very accurately by the haplotype in a speci...
Inferring cluster structure in microarray datasets is a fundamental task for the -omic sciences. A fundamental question in Statistics, Data Analysis and Classification is the prediction of the number of clusters in a dataset, usually established via internal validation measures. Despite the wealth of internal measures available in the literature, new ones have been recently proposed, some of them specifically for microarray data. In this dissertation, a study of internal validation measures is given, paying particular attention to the stability-based ones. Indeed, this class of measures is particularly prominent and promising for obtaining a reliable estimate of the number of clusters in a dataset. For those measures, a new general algorithmic paradigm is proposed here that highlights the richness of measures in this class and accounts for the ones already available in the literature. Moreover, some of the most representative validation measures are also considered. Experiments on ...
Lecture Notes in Computer Science, 2012
Clustering is one of the most well known activities in scientific investigation and the object of research in many disciplines, ranging from Statistics to Computer Science. In this beautiful area, one of the most difficult challenges is the model selection problem, i.e., the identification of the correct number of clusters in a dataset. In the last decade, a few novel techniques for model selection, representing a sharp departure from previous ones in statistics, have been proposed and gained prominence for microarray data analysis. Among those, the stability-based methods are the most robust and best performing in terms of prediction, but the slowest in terms of time. Unfortunately, model selection, a fascinating and classic area of statistics with important practical applications, has received very little attention in terms of algorithmic design and engineering. In this paper, in order to partially fill this gap, we highlight: (A) the first general algorithmic paradigm for stability-based methods for model selection; (B) a novel algorithmic paradigm for the class of stability-based methods for cluster validity, i.e., methods assessing how statistically significant a given clustering solution is; (C) a general algorithmic paradigm that describes heuristic and very effective speed-ups known in the literature for stability-based model selection methods.
BMC Bioinformatics, 2008
BACKGROUND: Inferring cluster structure in microarray datasets is a fundamental task for the so-called -omic sciences. It is also a fundamental question in Statistics, Data Analysis and Classification, in particular with regard to the prediction of the number of clusters in a dataset, usually established via internal validation measures. Despite the wealth of internal measures available in the literature, new
Bioinformatics, 2015
Information-theoretic and compositional analysis of biological sequences, in terms of k-mer dictionaries, has a well established role in genomic and proteomic studies. This is much less so in epigenomics, although the role of k-mers in chromatin organization and nucleosome positioning is particularly relevant. Fundamental questions concerning the informational content and compositional structure of nucleosome favoring and disfavoring sequences with respect to their basic building blocks still remain open. We present the first analysis of the role of k-mers in the composition of nucleosome enriched and depleted genomic regions (NER and NDR for short) that is: (i) exhaustive and within the bounds dictated by the information-theoretic content of the sample sets we use and (ii) informative for comparative epigenomics. We analyze four different organisms and we propose a paradigmatic formalization of k-mer dictionaries, providing two different and complementary views of the k-mers involved in NER and NDR. The first extends well known studies in this area, its comparative nature being its major merit. The second, very novel, brings to light the rich variety of k-mers involved in influencing nucleosome positioning, for which an initial classification in terms of clusters is also provided. Although such a classification offers many insights, the following deserves to be singled out: short poly(dA:dT) tracts are reported in the literature as fundamental for nucleosome depletion; however, a global quantitative look reveals that their role is much less prominent than one would expect based on previous studies. Dictionaries, clusters and Supplementary Material are available online at http://math.unipa.it/rombo/epigenomics/. Contact: simona.rombo@unipa.it Supplementary information: Supplementary data are available at Bioinformatics online.
Lecture Notes in Computer Science, 2015
Open Source Software in Life Science Research, 2012
Chapman & Hall/CRC Mathematical & Computational Biology, 2009
BMC bioinformatics, 2012
The ancestral recombinations graph (ARG) is a topological structure that captures the relationship between the extant genomic sequences in terms of genetic events including recombinations. IRiS is a system that estimates the ARG on sequences of individuals, at genomic scales, capturing the relationship between these individuals of the species. Recently, this system was used to estimate the ARG of the recombining X Chromosome of a collection of human populations using relatively dense, bi-allelic SNP data. While the ARG is a natural model for capturing the inter-relationship between a single chromosome of the individuals of a species, it is not immediately apparent how the model can utilize whole-genome (across chromosomes) diploid data. Also, the sheer complexity of an ARG structure presents a challenge to graph visualization techniques. In this paper we examine the ARG reconstruction for (1) genome-wide or multiple chromosomes, (2) multi-allelic and (3) extremely sparse data. To aid in...
BMC genomics, 2014
Reed canary grass (Phalaris arundinacea) is an economically important forage and bioenergy grass of the temperate regions of the world. Despite its economic importance, it is lacking in public genomic data. We explore comparative exomics of the grass cultivars in the context of response to salt exposure. The limited data set poses challenges to the computational pipeline. As a prerequisite for the comparative study, we generate the Phalaris reference transcriptome sequence, one of the first steps in addressing the issue of paucity of processed genomic data in this species. In addition, the differential expression (DE) and active-but-stable genes for salt stress conditions were analyzed by a novel method that was experimentally verified on human RNA-seq data. For the comparative exomics, we focus on the DE and stable genic regions, with respect to salt stress, of the genome. In our comparative study, we find that phylogeny of the DE and stable genic regions of the Phalaris cultivars ...
Lecture Notes in Computer Science, 2011
Computer Methods and Programs in Biomedicine, 2015
The prediction of the number of clusters in a dataset, in particular microarrays, is a fundamental task in biological data analysis, usually performed via validation measures. Unfortunately, it has received very little attention and in fact there is a growing need for software tools/libraries dedicated to it. Here we present ValWorkBench, a software library consisting of eleven well known validation measures, together with novel heuristic approximations for some of them. The main objective of this paper is to provide the interested researcher with the full software documentation of an open source cluster validation platform having the main features of being easily extendible in a homogeneous way and of offering software components that can be readily re-used. Consequently, the focus of the presentation is on the architecture of the library, since it provides an essential map that can be used to access the full software documentation, which is available at the supplementary material website [1]. The mentioned main features of ValWorkBench are also discussed and exemplified, with emphasis on software abstraction design and re-usability. A comparison with existing cluster validation software libraries, mainly in terms of the mentioned features, is also offered. It suggests that ValWorkBench is a much needed contribution to the microarray software development/algorithm engineering community. For completeness, it is important to mention that previous accurate algorithmic experimental analysis of the relative merits of each of the implemented measures [19,23,25], carried out specifically on microarray data, gives useful insights on the effectiveness of ValWorkBench for cluster validation to researchers in the microarray community interested in its use for the mentioned task.
Chapman & Hall/CRC Data Mining and Knowledge Discovery Series, 2009
Biological Knowledge Discovery Handbook, 2013