Bayesian parameter estimation for automatic annotation of gene functions using observational data and phylogenetic trees
George G Vega Yon et al. PLoS Comput Biol. 2021.
Abstract
Gene function annotation is important for a variety of downstream analyses of genetic data. But experimental characterization of function remains costly and slow, making computational prediction an important endeavor. Phylogenetic approaches to prediction have been developed, but implementation of a practical Bayesian framework for parameter estimation remains an outstanding challenge. We have developed a computationally efficient model of evolution of gene annotations using phylogenies based on a Bayesian framework using Markov Chain Monte Carlo for parameter estimation. Unlike previous approaches, our method is able to estimate parameters over many different phylogenetic trees and functions. The resulting parameters agree with biological intuition, such as the increased probability of function change following gene duplication. The method performs well on leave-one-out cross-validation, and we further validated some of the predictions in the experimental scientific literature.
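The core idea — a Markov model of function gain and loss along the branches of a phylogeny, with parameters estimated by Markov Chain Monte Carlo — can be illustrated with a minimal sketch. This is not the authors' aphylo implementation: the toy tree, the single gain/loss parameterization shared across branches, the flat prior, and the root frequency are all simplifying assumptions for illustration.

```python
import math
import random

# Hypothetical sketch (not the paper's aphylo code): a two-state
# (absent = 0 / present = 1) model of gene function evolution on a toy tree.
# Each branch shares one gain probability (0 -> 1) and one loss probability
# (1 -> 0); the likelihood of the observed leaf annotations is computed with
# Felsenstein's pruning algorithm, and a Metropolis sampler explores the
# posterior under a flat prior on [0, 1]^2.

TREE = {"root": ["a", "b"], "a": ["leaf1", "leaf2"], "b": ["leaf3", "leaf4"]}
LEAVES = {"leaf1": 1, "leaf2": 1, "leaf3": 0, "leaf4": 1}  # toy annotations

def transition(gain, loss):
    # Row = parent state, column = child state.
    return [[1 - gain, gain], [loss, 1 - loss]]

def pruning(node, gain, loss):
    """Return P(leaf data below `node` | node state) for states (0, 1)."""
    if node in LEAVES:
        obs = LEAVES[node]
        return [1.0 if s == obs else 0.0 for s in (0, 1)]
    P = transition(gain, loss)
    like = [1.0, 1.0]
    for child in TREE[node]:
        cl = pruning(child, gain, loss)
        for s in (0, 1):
            like[s] *= P[s][0] * cl[0] + P[s][1] * cl[1]
    return like

def log_likelihood(gain, loss, pi1=0.5):
    """Mix the root's conditional likelihoods with root frequency pi1."""
    l0, l1 = pruning("root", gain, loss)
    return math.log((1 - pi1) * l0 + pi1 * l1)

def metropolis(n_iter=2000, step=0.05, seed=1):
    """Random-walk Metropolis over (gain, loss); flat prior cancels."""
    random.seed(seed)
    gain, loss = 0.5, 0.5
    ll = log_likelihood(gain, loss)
    samples = []
    for _ in range(n_iter):
        g2 = min(max(gain + random.uniform(-step, step), 1e-6), 1 - 1e-6)
        l2 = min(max(loss + random.uniform(-step, step), 1e-6), 1 - 1e-6)
        ll2 = log_likelihood(g2, l2)
        if math.log(random.random()) < ll2 - ll:
            gain, loss, ll = g2, l2, ll2
        samples.append((gain, loss))
    return samples

samples = metropolis()
post_gain = sum(g for g, _ in samples) / len(samples)
post_loss = sum(l for l_, l in [(s, s[1]) for s in samples]) / len(samples) if False else sum(l for _, l in samples) / len(samples)
print(post_gain, post_loss)
```

The actual method additionally distinguishes duplication from speciation events (with separate transition probabilities) and pools information across many trees and functions; the sketch above keeps only the pruning-plus-MCMC skeleton.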
Conflict of interest statement
The authors have declared that no competing interests exist.
Figures
Fig 1. ROC curve for each estimation method.
As reflected in the parameter estimates, ROC curves are very similar across all four methods.
Fig 2. Sensitivity analysis.
Boxplots of MAEs as a function of single parameter updates. Sub-figures A through D show how the MAEs change as the given parameter is fixed at values ranging from 0 to 1. In each of these plots, the white box indicates the parameter value used to generate the data (i.e., the “correct” parameter value). The last boxplot, sub-figure E, shows the distribution of the MAEs as a function of the amount of available annotation data, that is, how prediction error changes as we randomly remove annotations from the available data.
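The Mean Absolute Error (MAE) used throughout these figures is the average absolute difference between the predicted posterior probability of function and the 0/1 truth; a minimal sketch (the function name and example values are ours, not from the paper):

```python
def mean_absolute_error(predicted, observed):
    """MAE between posterior probabilities (in [0, 1]) and 0/1 annotations."""
    if len(predicted) != len(observed):
        raise ValueError("predicted and observed must have the same length")
    return sum(abs(p - o) for p, o in zip(predicted, observed)) / len(predicted)

# Illustrative values only: three leaves, one well predicted, one badly.
print(mean_absolute_error([0.9, 0.2, 0.7], [1, 0, 0]))
```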
Fig 3. Number of annotated leaves for all 138 trees by type of annotation.
Most of the annotations available in GO are positive assertions of function, that is, 1s. Furthermore, as observed in the figure, all of the trees used in this study have at least two “Not” (i.e., 0) annotations, which is a direct consequence of the selection criteria used, as we explicitly set a minimum of two annotations of each type per tree+function.
Fig 4. Number of internal nodes by type of event.
Each bar represents one of the 138 trees used in the paper by type of event (mostly speciation). The embedded histogram shows the distribution of the prevalence of duplication events per tree. A minority of the events, about 20%, are duplications.
Fig 5. Mean Absolute Error versus number of annotations on 138 trees.
Each point represents a single tree+function colored by the number of negative annotations (“absent”). The x-axis is in log-scale, and the figure includes a locally estimated scatterplot smoothing [LOESS] curve with a 95% confidence interval.
Fig 6. Low MAE predictions.
The first set of annotations, first column, shows the experimental annotations of the term GO:0001730 for PTHR11258. The second column shows the 95% C.I. of the predicted annotations. The column ranges from 0 (left end) to 1 (right end). Bars closer to the left are colored red to indicate that lack of function is suggested, while bars closer to the right are colored blue to indicate function is suggested. Depth of color corresponds to strength of inference. The AUC for this analysis is 0.91 and the Mean Absolute Error is 0.34.
Fig 7. High MAE predictions.
The first set of annotations, first column, shows the experimental annotations of the term GO:0004571 for PTHR45679. The second column shows the 95% C.I. of the predicted annotations using leave-one-out. Bars closer to the left are colored red to indicate that lack of function is suggested, while bars closer to the right are colored blue to indicate function is suggested. Depth of color corresponds to strength of inference. The AUC for this analysis is 0.33 and the Mean Absolute Error is 0.52.
Fig 8. ROC curve for aphylo and SIFTER predictions.
This figure includes 184 annotations on 147 proteins. Of the 184 annotations, 18 were negative. The corresponding AUCs for these curves are 0.72 for aphylo, and 0.60 and 0.52 for SIFTER using truncation level 1 and truncation level 3, respectively.
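The AUC reported here can be read as the probability that a randomly chosen positive annotation receives a higher prediction score than a randomly chosen negative one (the Mann-Whitney formulation of the area under the ROC curve). A minimal sketch, with made-up scores rather than the paper's data:

```python
def auc(scores, labels):
    """Mann-Whitney AUC: P(score of a random positive > score of a random
    negative), counting ties as half a win."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Illustrative scores only (not the 184 annotations from the paper).
print(auc([0.9, 0.8, 0.4, 0.3, 0.1], [1, 1, 0, 1, 0]))
```

With heavily imbalanced labels (here, 18 negatives out of 184), this pairwise view makes clear why AUC is a more informative summary than raw accuracy.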
Fig 9. Comparison of our 220 proposed annotations to annotations currently in the GO database.
Of the 220, 46 were not found in the GO database, 8 of which we proposed as negative assertions. Of the remaining 174 which we found in the GO database, five were inconsistent (having both a present and absent annotation), two had an experimental evidence code, and the remainder, 167, corresponded to annotations without experimental evidence.
Fig 10. Distribution of AUCs and MAEs for the scenario with partially annotated trees and mislabeling.
The x-axis shows the proportion of missing annotations, while the y-axis shows the score (AUC or MAE).
Fig 11. Empirical bias distribution for the fully annotated scenario by type of prior, parameter, and number of leaves.
Fig 12. Empirical bias distribution for the partially annotated scenario by parameter and proportion of missing labels.
Fig 13. Difference in the number of input proteins used for predictions.
Each bar represents a single annotation (prediction) to be made. The y-axis shows the difference between the number of input proteins used by aphylo and by SIFTER. Negative values indicate that SIFTER included more proteins as input for making the prediction, whereas positive values indicate that aphylo included more proteins as input. A paired t-test shows that, on average, SIFTER included more proteins than aphylo for each one of its calculations.
Fig 14. Scatter plot comparing all 184 prediction scores between SIFTER with truncation level one, and SIFTER with truncation level 3.
Red triangles correspond to negative annotations, while blue triangles mark positive annotations. The six negative annotations with a black border highlight observations that were correctly classified with truncation level 1 but misclassified with truncation level 3. The coordinates were jittered to avoid overlapping.