Ruriko Yoshida | Naval Postgraduate School
Papers by Ruriko Yoshida
arXiv (Cornell University), Sep 29, 2022
Bioinformatics, 2020
Motivation: Due to new technology for efficiently generating genome data, machine learning methods are urgently needed to analyze large sets of gene trees over the space of phylogenetic trees. However, the space of phylogenetic trees is not Euclidean, so ordinary machine learning methods cannot be directly applied. In 2019, Yoshida et al. introduced the notion of tropical principal component analysis (PCA), a statistical method for visualization and dimensionality reduction using a tropical polytope with a fixed number of vertices that minimizes the sum of tropical distances between each data point and its tropical projection. However, their work focused on the tropical projective space rather than the space of phylogenetic trees. We focus here on tropical PCA for dimension reduction and visualization over the space of phylogenetic trees. Results: Our main results are 2-fold: (i) theoretical interpretations of the tropical principal components over the space of phylogenetic trees, nam...
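The objective named in this abstract (the sum of tropical distances between each data point and its projection onto a tropical polytope) can be illustrated with a minimal sketch. It assumes the max-plus convention, the standard tropical metric d(u, v) = max_i(u_i - v_i) - min_i(u_i - v_i), and the usual closed-form projection onto the tropical convex hull of given vertices. The function names are hypothetical and this is not the authors' implementation; in particular, it only evaluates the objective for fixed vertices and does not search for the optimal polytope.

```python
from typing import List, Sequence


def tropical_distance(u: Sequence[float], v: Sequence[float]) -> float:
    """Tropical metric: max minus min of the coordinate-wise differences."""
    diffs = [a - b for a, b in zip(u, v)]
    return max(diffs) - min(diffs)


def tropical_projection(x: Sequence[float],
                        vertices: Sequence[Sequence[float]]) -> List[float]:
    """Project x onto the max-plus tropical convex hull of `vertices`.

    For each vertex v, lam = min_i(x_i - v_i) is the largest shift with
    lam + v <= x coordinate-wise; the projection is the coordinate-wise
    maximum of the shifted vertices.
    """
    shifted = []
    for v in vertices:
        lam = min(xi - vi for xi, vi in zip(x, v))
        shifted.append([lam + vi for vi in v])
    return [max(column) for column in zip(*shifted)]


def sum_of_residuals(points: Sequence[Sequence[float]],
                     vertices: Sequence[Sequence[float]]) -> float:
    """Objective from the abstract, evaluated for a fixed set of vertices."""
    return sum(tropical_distance(p, tropical_projection(p, vertices))
               for p in points)
```

Because the tropical metric ignores a constant added to every coordinate, points are effectively compared in the tropical projective space; a vertex of the polytope projects to itself and contributes zero to the sum.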
arXiv (Cornell University), Nov 3, 2009
Discrete Optimization, Jun 1, 2005
Lecture Notes in Computer Science, 2004
Journal of Symbolic Computation, Oct 1, 2004
arXiv (Cornell University), Oct 10, 2008
Mathematics
When we apply comparative phylogenetic analyses to genome data, it poses a significant problem and challenge that some of the given species (or taxa) often have missing genes (i.e., data). In such a case, we have to impute a missing part of a gene tree from a sample of gene trees. In this short paper, we propose a novel method to infer the missing part of a phylogenetic tree using an analogue of a classical linear regression in the setting of tropical geometry. In our approach, we consider a tropical polytope, a convex hull with respect to the tropical metric closest to the data points. We show a condition that we can guarantee that an estimated tree from the method has at most a Robinson–Foulds (RF) distance of four from the ground truth, and computational experiments with simulated data and empirical data from Clavicipitaceae, which contains more than 4000 genes, show the method works well.
While the majority of gene histories found in a clade of organisms are expected to be generated by a common process (e.g. the coalescent process), it is well-known that numerous other coexisting processes (e.g. horizontal gene transfers, gene duplication and subsequent neofunctionalization) will cause some genes to exhibit a history quite distinct from those of the majority of genes. Such “outlying” gene trees are considered to be biologically interesting and identifying these genes has become an important problem in phylogenetics. In this paper we propose and implement kdetrees, a nonparametric method of estimating distributions of phylogenetic trees, with the goal of identifying trees which are significantly different from the rest of the trees in the sample. Our approach mimics the common statistical technique of kernel density estimation, using tree distances to define kernels. In contrast to parametric models, such as the coalescent, nonparametric approaches avoid t...
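The idea of using tree distances to define kernels can be sketched as follows. This is a simplified illustration and not the kdetrees implementation: it assumes a precomputed matrix of pairwise tree distances, a Gaussian kernel, and a single fixed bandwidth, all of which are assumptions made here. The function names are hypothetical, and the bandwidth selection and outlier cutoff used by kdetrees are not reproduced.

```python
import math
from typing import List, Sequence


def kde_scores(dist: Sequence[Sequence[float]], bandwidth: float) -> List[float]:
    """Leave-one-out kernel density score for each tree.

    dist[i][j] is an (assumed precomputed) distance between trees i and j.
    A Gaussian kernel of the distance is averaged over all other trees.
    """
    n = len(dist)
    scores = []
    for i in range(n):
        total = sum(math.exp(-0.5 * (dist[i][j] / bandwidth) ** 2)
                    for j in range(n) if j != i)
        scores.append(total / (n - 1))
    return scores


def lowest_density_tree(scores: Sequence[float]) -> int:
    """Index of the tree with the smallest score (the outlier candidate)."""
    return min(range(len(scores)), key=lambda i: scores[i])
```

A tree that is far from most of the sample receives a score near zero, which is what makes it a candidate outlier; how low a score must be to count as significant is a separate modeling choice.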
Mathematics
In this paper, we propose clustering methods for use on data described as tropically convex. Our approach is similar to clustering methods used in the Euclidean space, where we identify groupings of similar observations using tropical analogs of K-means and hierarchical clustering in the Euclidean space. We provide results from computational experiments on generic simulated data as well as an application to phylogeny using ultrametrics, demonstrating the efficacy of these methods.
Algorithms
Unweighted Pair Group Method with Arithmetic Mean (UPGMA) is one of the most popular distance-based methods to reconstruct an equidistant phylogenetic tree from a distance matrix computed from an alignment of sequences. Since we use equidistant trees as gene trees for phylogenomic analyses under the multi-species coalescent model and since an input distance matrix computed from an alignment of each gene in a genome is estimated via the maximum likelihood estimators, it is important to conduct a robust analysis on UPGMA. Stochastic safety radius, introduced by Steel and Gascuel, provides a lower bound for the probability that a phylogenetic tree reconstruction method returns the true tree topology from a given distance matrix. In this article, we compute the stochastic safety radius of UPGMA for a phylogenetic tree with n leaves. Computational experiments show an improved gap between empirical probabilities estimated from random samples and the true tree topology from UPGMA, increasi...
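For readers unfamiliar with UPGMA, the textbook procedure can be sketched as below: repeatedly merge the two closest clusters, place the new internal node at half their distance, and update distances by a size-weighted average. The function name and the nested-tuple tree representation are choices made for this illustration; the sketch covers only the reconstruction step and does not compute the stochastic safety radius studied in the article.

```python
from typing import Dict, Sequence, Tuple


def upgma(labels: Sequence[str], matrix: Sequence[Sequence[float]]):
    """Build an equidistant tree from a symmetric distance matrix.

    Leaves are label strings; an internal node is (left, right, height).
    """
    n = len(labels)
    # cluster id -> (subtree, number of leaves)
    clusters: Dict[int, Tuple[object, int]] = {i: (labels[i], 1) for i in range(n)}
    # distances keyed by (smaller id, larger id)
    d = {(i, j): float(matrix[i][j]) for i in range(n) for j in range(i + 1, n)}
    next_id = n

    def get(a: int, b: int) -> float:
        return d[(a, b)] if a < b else d[(b, a)]

    while len(clusters) > 1:
        i, j = min(d, key=d.get)          # closest pair of clusters
        height = d[(i, j)] / 2.0          # node height is half the distance
        (ti, ni), (tj, nj) = clusters.pop(i), clusters.pop(j)
        # size-weighted average distance to every remaining cluster
        merged = {k: (ni * get(i, k) + nj * get(j, k)) / (ni + nj)
                  for k in clusters}
        d = {p: v for p, v in d.items() if i not in p and j not in p}
        for k, v in merged.items():
            d[(k, next_id)] = v           # k < next_id always holds
        clusters[next_id] = ((ti, tj, height), ni + nj)
        next_id += 1

    tree, _ = next(iter(clusters.values()))
    return tree
```

On an ultrametric input the procedure recovers the generating tree exactly; the robustness question raised in the abstract concerns how often the topology is still recovered when the matrix is a noisy estimate.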
Linear Algebra and Its Applications with R, 2021
Discrete & Computational Geometry
We study the behavior of phylogenetic tree shapes in the tropical geometric interpretation of tree space. Tree shapes are formally referred to as tree topologies; a tree topology can also be thought of as a tree combinatorial type, which is given by the tree’s branching configuration and leaf labeling. We use the tropical line segment as a framework to define notions of variance as well as invariance of tree topologies: we provide a combinatorial search theorem that describes all tree topologies occurring along a tropical line segment, as well as a setting under which tree topologies do not change along a tropical line segment. Our study is motivated by comparison to the moduli space endowed with a geodesic metric proposed by Billera, Holmes, and Vogtmann (referred to as BHV space); we consider the tropical geometric setting as an alternative framework to BHV space for sets of phylogenetic trees. We give an algorithm to compute tropical line segments which is lower in computational ...
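As background for the tropical line segment, the following sketch computes its bend points with the standard textbook construction, which is not necessarily the algorithm proposed in the paper. Under the max-plus convention (an assumption here), the segment between u and v consists of the points max(u, lam + v) taken coordinate-wise, and its direction can change only where lam equals some u_i - v_i. The function name is hypothetical.

```python
from typing import List, Sequence


def tropical_line_segment(u: Sequence[float],
                          v: Sequence[float]) -> List[List[float]]:
    """Endpoints and bend points of the max-plus tropical segment from u to v.

    Each returned point is max(u, lam + v) coordinate-wise, with lam running
    over the sorted distinct values of u_i - v_i. The first point equals u and
    the last equals v up to adding a constant to every coordinate.
    """
    lambdas = sorted({ui - vi for ui, vi in zip(u, v)})
    return [[max(ui, lam + vi) for ui, vi in zip(u, v)] for lam in lambdas]
```

Consecutive returned points are joined by ordinary straight segments, so a tropical segment in n coordinates has at most n - 1 linear pieces.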
We have developed software called Phylotree as a toolkit for running experiments to study gene cophylogenies for genome evolution using distance-based methods. In particular, the toolkit has been instrumental in conducting processing-heavy experiments with the new “difference of means” statistical method. Phylotree was used to run experiments using simulated data as well as biological sequences of well-known host and parasite species, and is distributed with data and configuration files allowing these experiments to be reproduced.
Cornell University - arXiv, Nov 26, 2021