mvPPT: A Highly Efficient and Sensitive Pathogenicity Prediction Tool for Missense Variants

Journal Article

Shi-Yuan Tong

Methodology, Software, Formal analysis, Investigation, Writing – original draft, Visualization

Jing’an District Central Hospital of Shanghai, State Key Laboratory of Medical Neurobiology, MOE Frontiers Center for Brain Science, Institutes of Brain Science, Fudan University

Shanghai 200032

China

Ke Fan

Validation, Formal analysis, Investigation, Writing – original draft

Jing’an District Central Hospital of Shanghai, State Key Laboratory of Medical Neurobiology, MOE Frontiers Center for Brain Science, Institutes of Brain Science, Fudan University

Shanghai 200032

China

Zai-Wei Zhou

Methodology, Investigation, Writing – original draft

Shanghai Xunyin Biotechnology Co., Ltd.

Shanghai 201802

China

Lin-Yun Liu

Validation, Writing – original draft

Jing’an District Central Hospital of Shanghai, State Key Laboratory of Medical Neurobiology, MOE Frontiers Center for Brain Science, Institutes of Brain Science, Fudan University

Shanghai 200032

China

Shu-Qing Zhang

Validation, Data curation, Writing – original draft

Jing’an District Central Hospital of Shanghai, State Key Laboratory of Medical Neurobiology, MOE Frontiers Center for Brain Science, Institutes of Brain Science, Fudan University

Shanghai 200032

China

Yinghui Fu

Validation, Data curation, Writing – original draft

Jing’an District Central Hospital of Shanghai, State Key Laboratory of Medical Neurobiology, MOE Frontiers Center for Brain Science, Institutes of Brain Science, Fudan University

Shanghai 200032

China

Guang-Zhong Wang

Writing – original draft, Data curation

CAS Key Laboratory of Computational Biology, Shanghai Institute of Nutrition and Health, University of Chinese Academy of Sciences, Chinese Academy of Sciences

Shanghai 200031

China

Ying Zhu

Conceptualization, Formal analysis, Resources, Writing – review & editing, Supervision, Project administration

Huashan Hospital, State Key Laboratory of Medical Neurobiology, MOE Frontiers Center for Brain Science, Institutes of Brain Science, Fudan University

Shanghai 200032

China

Yong-Chun Yu

Conceptualization, Resources, Writing – review & editing, Visualization, Supervision, Project administration

Jing’an District Central Hospital of Shanghai, State Key Laboratory of Medical Neurobiology, MOE Frontiers Center for Brain Science, Institutes of Brain Science, Fudan University

Shanghai 200032

China

Peer review under responsibility of Beijing Institute of Genomics, Chinese Academy of Sciences / China National Center for Bioinformation and Genetics Society of China.

Received:

14 September 2021

Revision received:

19 May 2022

Published:

05 August 2022

Cite

Shi-Yuan Tong, Ke Fan, Zai-Wei Zhou, Lin-Yun Liu, Shu-Qing Zhang, Yinghui Fu, Guang-Zhong Wang, Ying Zhu, Yong-Chun Yu, mvPPT: A Highly Efficient and Sensitive Pathogenicity Prediction Tool for Missense Variants, Genomics, Proteomics & Bioinformatics, Volume 21, Issue 2, April 2023, Pages 414–426, https://doi.org/10.1016/j.gpb.2022.07.005

Abstract

Next-generation sequencing technologies both boost the discovery of variants in the human genome and exacerbate the challenges of pathogenic variant identification. In this study, we developed Pathogenicity Prediction Tool for missense variants (mvPPT), a highly sensitive and accurate missense variant classifier based on gradient boosting. mvPPT adopts high-confidence training sets with a wide spectrum of variant profiles, and extracts three categories of features: scores from existing prediction tools, frequencies (allele frequencies, amino acid frequencies, and genotype frequencies), and genomic context. Compared with established predictors, mvPPT achieves superior performance in all test sets, regardless of data source. In addition, our study provides guidance for training set and feature selection strategies and reveals highly relevant features, which may provide further biological insights into variant pathogenicity. mvPPT is freely available at http://www.mvppt.club/.

Introduction

Whole-exome sequencing (WES) and whole-genome sequencing (WGS) enable large-scale parallel assessment of genetic variants and have been increasingly adopted in clinical diagnosis, which makes interpreting the effect of the identified variants a serious challenge [1,2]. Unlike synonymous variants and loss-of-function variants, for which the impact on the protein is relatively easy to predict, missense variants often lead to inconclusive genomic outcomes and remain a major challenge in pathogenicity interpretation. Compared with the reference genome, a human exome on average contains around 20,000 single nucleotide variants (SNVs), approximately half of which are missense variants [3,4]. Nevertheless, the effects of most missense variants on proteins are unclear, as experimental validation of large numbers of variants is limited by efficiency and cost. To address these limitations, many computational tools have been developed to predict the potential impact of variants [5–21]. Early prediction models compute pathogenicity scores based on a single property of variants, such as evolutionary conservation [8,10,15] or protein structure/function [16,17]. More recent ensemble methods achieve higher classification accuracy by integrating information from individual predictors [5–7,9,11–13,20]. Although these existing tools have made significant contributions to predicting the deleteriousness of genetic variants, the sensitivity of prediction still needs to be improved when assessing the pathogenicity of large numbers of variants in clinical scenarios.

While the existing tools provide positive predictive power, their prediction results are often inconsistent with each other [6,18]. It is believed that the predictive power of current ensemble methods is hampered by the lack of appropriate training data and incomplete features [7,8,22]. The widely adopted strategies for creating training sets include using variants from disease databases only [7,9] or using variants from both disease and population databases to balance the ratio of benign and pathogenic variants in the dataset [5,6]. However, there is no consensus on which strategy results in the best performance. Specifically, existing ensemble tools mostly train machine learning models on variants with known labels in disease databases such as ClinVar [23] and/or Human Gene Mutation Database (HGMD) [24]. However, variants in disease databases may cluster around well-described disease genes, i.e., the more a gene has been studied, the more variants on this gene are likely to be discovered. It is not yet clear whether the clustering of variants on certain genes introduces bias in the prediction of variant pathogenicity. Moreover, since each resource maintains different variant inclusion criteria, many variants in the ClinVar and HGMD databases are labeled with conflicting or even opposite clinical significance [25,26], which might reduce the prediction accuracy of the computational tools. To further expand the training data, some existing tools include sequence variants from population databases, such as Exome Sequencing Project (ESP) [6,11,27]. Most of these tools consider the sequence variants in general populations above a certain allele frequency (AF) as benign; however, how to choose a proper AF threshold for defining neutral training variants remains a question.

The common features adopted by most of the present ensemble models are scores computed by individual predictors based on amino acid or nucleotide conservation, and biochemical properties of the amino acid substitutions [5,6,9,12,13]. While these scores are proven to be highly relevant to variant deleteriousness, other features linked to variant pathogenicity have also shown strong correlations with human diseases. For example, AF has long been used in practice as an important criterion in selecting deleterious variants but has rarely been considered in ensemble models. ClinPred was the first to add AF as an input feature and was shown to be more effective than many other ensemble models [7]. Similarly, genotype frequency (GF) and amino acid frequency (AAF) contain hints of natural selection, which may provide extra information for pathogenicity inference. Additionally, recent studies have shown that intolerance to variation is a strong predictor of human disease relevance, emphasizing the role of genomic context in variant pathogenicity prediction [28–30].

As the algorithm is the “brain” of a machine learning model, the efficiency of a model is largely dependent on algorithm selection. A variety of machine learning approaches, such as logistic regression [9,13], support vector machine (SVM) [9], random forest [6,13], and boosting algorithms [5,7], have been implemented in variant classification. In general, tree-based approaches achieve higher accuracy and precision according to prior studies [5–7,19]; however, few studies have systematically evaluated the effects of different algorithms on pathogenic variant prioritization. LightGBM is a gradient boosting framework that uses tree-based learning algorithms [31]. Unlike random forests, where the component trees are trained independently, gradient boosting builds trees in a stepwise manner, with each successive tree optimized on the residuals of the prediction of the preceding trees. A previous study demonstrated that, compared with other gradient boosting frameworks such as XGBoost and CatBoost, LightGBM converges on a solution that generalizes better [32].

Considering the aforementioned observations, we introduce Pathogenicity Prediction Tool for missense variants (mvPPT), a novel gradient boosting machine for missense variant pathogenicity prediction. By selecting 62 features (including scores from individual predictors, AF/GF/AAF, and genomic context information) and adopting high-confidence variant training sets, mvPPT demonstrates a best-to-date performance in variant pathogenicity prediction, paving its way in molecular diagnosis and clinical scenario applications. mvPPT and pre-computed scores of missense variants in the human exome can be accessed through http://www.mvppt.club/.

Method

Web Resources

Web resources of all databases and software used in this study are listed in Table S1.

Missense Variant Annotation

Variants were annotated by the latest version of ANNOVAR software (version 2019Oct24) [33], with gene-based annotation set to ensGene (assembly version hg19). Variants whose functional consequences were marked as nonsynonymous SNVs were selected. Further removing loss-of-function (stop gain or stop loss) variants ensured that our model was trained and evaluated nearly exclusively on missense variants.

Training Set

Training set variants were collected from disease databases: ClinVar (2020.7), HGMD (Pro version 2020.3), and UniProt (2020.6) [34], as well as a population database, Genome Aggregation Database (gnomAD) genomes (version 2.1.1) [28]. Each variant in ClinVar has a review status tag reporting the level of review supporting the assertion of clinical significance, and a clinical significance tag labeling the variant as pathogenic, likely pathogenic, uncertain significance, likely benign, or benign. To select variants with reliable tags, we kept variants with a review status of “criteria provided” from submitters or “reviewed by expert panel”. The variants were further filtered according to their significance tag: variants that were categorized as (1) benign or likely benign and (2) pathogenic or likely pathogenic were selected as negative (benign) and positive (pathogenic) labels, respectively. The variants in HGMD are labeled with one of seven tags: disease-causing mutation (DM), disease-causing mutation? (DM?), disease-associated polymorphism (DP), disease-associated polymorphism with supporting functional evidence (DFP), in vitro/laboratory or in vivo functional polymorphism (FP), polymorphic or rare variants reported in the literature (FTV), and retired entry (R). The variants with the DM and DM? labels were reported to be disease-causing in the original literature report. The question mark denotes that a degree of doubt has been found regarding pathogenicity. We only kept the variants with the “DM” label in this study. For UniProt, there are three variant labels: Disease, Polymorphism, and Unclassified. These labels were curated from literature reports. We kept the variants labeled with “Disease” and “Polymorphism” in this study. All variants with conflicting labels in different databases were excluded. Population variants were obtained from gnomAD genomes (version 2.1.1), which combines variation data from 15,708 individuals. To avoid any bias, the population variants were further filtered to remove any variants present in disease databases. Ten-fold cross-validation was implemented through the Python package scikit-learn (version 0.23.2).
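The label-selection rules above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code: the function names, the exact tag spellings, and the (variant, label) input format are assumptions.

```python
# Hypothetical helpers; tag spellings and data layout are assumptions.
PATHOGENIC = {"pathogenic", "likely pathogenic", "pathogenic/likely pathogenic"}
BENIGN = {"benign", "likely benign", "benign/likely benign"}

def clinvar_label(review_status, significance):
    """Return 1 (pathogenic), 0 (benign), or None (variant not used)."""
    status = review_status.strip().lower()
    # startswith() rejects "no assertion criteria provided"
    if not status.startswith(("criteria provided", "reviewed by expert panel")):
        return None
    sig = significance.strip().lower()
    if sig in PATHOGENIC:
        return 1
    if sig in BENIGN:
        return 0
    return None

def hgmd_label(tag):
    """Only the DM tag is kept; DM? and all other tags are dropped."""
    return 1 if tag == "DM" else None

def uniprot_label(tag):
    """Disease -> pathogenic, Polymorphism -> benign, Unclassified -> dropped."""
    return {"Disease": 1, "Polymorphism": 0}.get(tag)

def merge_labels(calls):
    """calls: iterable of (variant_id, label); drop variants whose labels conflict."""
    seen = {}
    for variant, label in calls:
        if label is not None:
            seen.setdefault(variant, set()).add(label)
    return {v: next(iter(s)) for v, s in seen.items() if len(s) == 1}
```

A variant reported as pathogenic in one database and benign in another ends up with two labels in `seen` and is therefore excluded, which mirrors the conflict rule in the text.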

Cross-validation in Algorithm Selection and Feature Selection

To avoid overfitting, we designed the ten-fold cross-validation procedure as follows: (1) we divided the variants from disease and 1000 Genomes Project (1KGP) [35] databases into ten subsets; (2) in each round, we selected nine subsets of variants from disease databases to generate the training set, and combined the remaining one subset of variants from disease databases with one subset of 1KGP variants to form the test set.
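A minimal sketch of this fold construction, using only the Python standard library, is shown below. The shuffling step, the seed value, and the function name are assumptions made for illustration; the study itself used scikit-learn.

```python
import random

def ten_fold_rounds(disease, kgp, n_folds=10, seed=1):
    """Yield (train, test) pairs: train = nine disease folds,
    test = the held-out disease fold plus one 1KGP fold."""
    rng = random.Random(seed)          # seed value is an assumption
    disease, kgp = list(disease), list(kgp)
    rng.shuffle(disease)
    rng.shuffle(kgp)
    d_folds = [disease[i::n_folds] for i in range(n_folds)]
    k_folds = [kgp[i::n_folds] for i in range(n_folds)]
    for i in range(n_folds):
        train = [v for j, fold in enumerate(d_folds) if j != i for v in fold]
        test = d_folds[i] + k_folds[i]
        yield train, test
```

Note that 1KGP variants appear only in test sets in this procedure, so the training set in each round consists of disease-database variants alone.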

Cross-validation in Training Set Analysis

The same cross-validation procedure was conducted as above, except that in each round, the training set was generated by using variants from disease databases only, or by adding variants from gnomAD as benign variants. More specifically, we first divided the variants from disease databases into ten subsets. In each round, we selected 90% of variants from the disease databases and combined them with variants in gnomAD genomes passing different AF thresholds (all, removing singleton, AF > 0.0001, AF > 0.001, and AF > 0.01) to generate the training sets for comparison; the remaining 10% of variants from disease databases were then combined with 10% variants from 1KGP to form the test set.
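The five gnomAD inclusion strategies compared here can be expressed as simple predicates. The sketch below is illustrative only: the record fields (`id`, `ac` for allele count, `af`) and the strategy keys are assumed names, and a singleton is taken to mean a variant whose alternate allele was observed exactly once.

```python
# Strategy names mirror the thresholds in the text; record fields are assumptions.
STRATEGIES = {
    "all": lambda v: True,
    "no_singleton": lambda v: v["ac"] > 1,   # singleton = allele seen once
    "af>0.0001": lambda v: v["af"] > 0.0001,
    "af>0.001": lambda v: v["af"] > 0.001,
    "af>0.01": lambda v: v["af"] > 0.01,
}

def select_gnomad_benign(variants, strategy, disease_ids=frozenset()):
    """Keep gnomAD variants that pass the strategy and are absent from disease databases."""
    keep = STRATEGIES[strategy]
    return [v for v in variants if v["id"] not in disease_ids and keep(v)]
```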

Test Set

An independent test set was generated by combining variants from Vereniging Klinisch Genetische Laboratoriumdiagnostiek (VKGL, 2020.9) [36], VariSNP (2017.2.16) [37], Database of Curated Mutations (DoCM, version 3.2) [38], Database of Pathogenic Variants (DPV, 2020.12.29) [39], Consequence-Agnostic Pathogenicity Interpretation of Clinical Exome (CAPICE, version 4) test [19], and MetaLR/SVM_Test [9]. To guarantee a second independent test set with variants that have never been used in any tools’ training set, we used PubMed to search for papers reporting new genetic disease-causing genes. In total, we found five papers covering seven genes, and none of these genes include pathogenic variants in the training or the test sets in our study [40–44]. All missense variants on these genes reported in the literature were collected. Population variants were obtained from 1KGP. For each simulated exome, we randomly selected 1000 neutral variants from 1KGP without replacement and added one disease-causing variant. The random seed was set to 1. To validate the robustness of our model and avoid overfitting, variants that were used in any training set or our features’ training data were discarded from all test sets, and only variants with comprehensive scores required by all comparator models were included.
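The simulated-exome construction described above (1000 neutral variants drawn without replacement, plus one disease-causing variant, with the random seed set to 1) might look like the following. The function name and the use of Python's `random` module are assumptions; the source does not state which random number generator was used.

```python
import random

def simulate_exome(neutral_variants, causal_variant, n_neutral=1000, rng=None):
    """Draw n_neutral background variants without replacement and add one causal variant."""
    rng = rng or random.Random(1)      # seed 1, as stated in the text
    background = rng.sample(neutral_variants, n_neutral)
    return background + [causal_variant]
```

Passing an explicit `rng` lets several exomes be drawn from one seeded stream, so each exome differs while the whole run stays reproducible.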

Features

mvPPT adopted 62 features from three categories: (A) pathogenicity likelihood scores assessed by different component tools, including Sorting Tolerant From Intolerant (SIFT) [10], MutationAssessor [15], Protein Variation Effect Analyzer (PROVEAN) [45], GERP++RS [46], phyloP [47], phastCons [48], and SIte-specific PHYlogenetic analysis (SiPhy) [49]; (B) AFs, GFs, and AAFs of variants estimated from 125,748 exomes in gnomAD (version 2.1.1); and (C) genomic context of the variant, i.e., region/gene-based information from Gene Variation Intolerance Rank (GeVIR) [29], VIRLoF [29], oe_mis_upper (from gnomAD), Haploinsufficiency Predictions (HIP) [50], Constrained Coding Regions (CCRs) [51], Interpro domain [52], and amino acid sequences before and after mutation. To avoid overfitting, the seven tools we used in category A did not generate scores based on machine learning algorithms. We annotated datasets with ANNOVAR using the database for nonsynonymous SNPs’ functional predictions (dbNSFP, v.4.1a) [53,54] to generate some of the required prediction scores from different component tools, including Interpro domain, MutationAssessor, phyloP, GERP++RS, phastCons, PROVEAN, and SiPhy. Mutations located in the Interpro domains were recorded as 1 and the rest were recorded as 0. AFs, GFs, and AAFs of each variant in different populations were obtained from the gnomAD exomes. If a variant was not present in the database, its AFs, AAFs, homozygous frequencies (HomFs), and heterozygous frequencies (HetFs) were assigned 0, and its wild-type frequencies (WtFs) were assigned 1. The GeVIR, VIRLoF, oe_mis_upper, HIP, and CCRs scores were downloaded from their respective websites. One-hot encoding was applied to the amino acid sequence, representing each amino acid with a binary vector of length 20 with a single non-zero value. All the features were selected to provide complementary information, and they either did not require training or their training data are publicly available to allow exclusion from our data.
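As an illustration of the encoding rules in this section, the sketch below builds the one-hot amino acid columns, the Interpro domain flag, and the default frequency values for variants absent from gnomAD. Column names follow Table 1; the function names and the input record layout are assumptions.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"   # order of the RefAA_/AltAA_ columns in Table 1

# Values assigned when a variant is not present in gnomAD exomes.
ABSENT_DEFAULTS = {
    "gnomAD_AFs": 0.0, "gnomAD_AAFs": 0.0,
    "gnomAD_HomFs": 0.0, "gnomAD_HetFs": 0.0,
    "gnomAD_WtFs": 1.0,
}

def one_hot(prefix, residue):
    """Binary vector of length 20 with a single non-zero entry."""
    return {f"{prefix}_{aa}": int(aa == residue) for aa in AMINO_ACIDS}

def frequency_features(record):
    """record: dict of gnomAD frequencies, or None if the variant is absent."""
    if record is None:
        return dict(ABSENT_DEFAULTS)
    return {name: record[name] for name in ABSENT_DEFAULTS}

def context_features(ref_aa, alt_aa, in_interpro_domain):
    """Interpro flag (1 inside a domain, 0 otherwise) plus RefAA/AltAA one-hot columns."""
    feats = {"Interpro_domain": int(in_interpro_domain)}
    feats.update(one_hot("RefAA", ref_aa))
    feats.update(one_hot("AltAA", alt_aa))
    return feats
```

The two one-hot vectors contribute 40 columns, which together with the 11 category A scores, 5 frequency features, and 6 remaining category C scores listed in Table 1 gives the 62 features.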

Outlier Detection and Gene Ontology Enrichment Analysis

The interquartile range (IQR) was used to identify outliers. The IQR criterion is summarized as follows:

1. Compute the first and third quartiles, $Q1_j$ and $Q3_j$, for each peptide $j$, and then its IQR: $\mathrm{IQR}_j = Q3_j - Q1_j$.
2. For each peptide $j$, observation $y_{ij}$ is flagged as an outlier if $y_{ij} < Q1_j - k \times \mathrm{IQR}_j$ or $y_{ij} > Q3_j + k \times \mathrm{IQR}_j$, where $k = 1.5$.
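The IQR criterion can be sketched with the Python standard library as follows. The quartile convention is not stated in the text, so the `"inclusive"` method used here is an assumption, as is the function name.

```python
import statistics

def iqr_outliers(values, k=1.5):
    """Return the observations lying outside [Q1 - k*IQR, Q3 + k*IQR]."""
    # Quartile convention is an assumption; other conventions shift the cut-offs slightly.
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [y for y in values if y < low or y > high]
```

For example, `iqr_outliers([1, 2, 3, 4, 5, 6, 7, 8, 100])` gives Q1 = 3 and Q3 = 7, so the upper fence is 13 and only the value 100 is flagged.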

Gene Ontology (GO) enrichment analysis was performed with the R package clusterProfiler [55–58].

Metrics for Performance Evaluation

We used 11 different metrics to evaluate the performance of the prediction tools. A detailed description of the metrics is provided in Table S2.

mvPPT Training

mvPPT was trained using the Python package LightGBM (version 2.3.1) [31], and parameters were tuned by Bayesian optimization (version 1.2.0). The random state was set to 1 throughout the model training process. For the Bayesian optimization process, the number of iterations was set to 100 (n_iter = 100) and the number of steps of random exploration was set to 15 (init_points = 15). The ranges of the LightGBM hyperparameters for Bayesian optimization were set as follows: num_leaves (24, 45), feature_fraction (0.1, 0.9), bagging_fraction (0.8, 1), max_depth (5, 8.99), lambda_l1 (0, 5), lambda_l2 (0, 3), min_split_gain (0.001, 0.1), and min_child_weight (5, 50). After the parameter optimization process, the final values of the parameters were as follows: num_leaves = 45, min_child_weight = 6.163, learning_rate = 0.01, bagging_fraction = 0.870, feature_fraction = 0.632, lambda_l1 = 0.921, lambda_l2 = 0.193, min_gain_to_split = 0.039, and max_depth = 9.
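For readers who want to reuse these settings, the search ranges and final values can be written down as plain Python dictionaries. This is a sketch only: the helper `to_lightgbm_params`, the use of rounding to obtain integer-valued parameters, and the `"objective": "binary"` setting are assumptions, since the text does not say how continuous suggestions from the optimizer were converted.

```python
# Search ranges and final values as reported in the text.
SEARCH_BOUNDS = {
    "num_leaves": (24, 45), "feature_fraction": (0.1, 0.9),
    "bagging_fraction": (0.8, 1), "max_depth": (5, 8.99),
    "lambda_l1": (0, 5), "lambda_l2": (0, 3),
    "min_split_gain": (0.001, 0.1), "min_child_weight": (5, 50),
}
FINAL_PARAMS = {
    "num_leaves": 45, "min_child_weight": 6.163, "learning_rate": 0.01,
    "bagging_fraction": 0.870, "feature_fraction": 0.632,
    "lambda_l1": 0.921, "lambda_l2": 0.193,
    "min_gain_to_split": 0.039,   # LightGBM alias of min_split_gain
    "max_depth": 9,
}
INTEGER_PARAMS = ("num_leaves", "max_depth")

def to_lightgbm_params(suggestion, learning_rate=0.01, seed=1):
    """Turn a continuous optimizer suggestion into a LightGBM-style parameter dict."""
    params = dict(suggestion)
    for name in INTEGER_PARAMS:
        params[name] = int(round(params[name]))   # rounding is an assumption
    params.update({"learning_rate": learning_rate, "seed": seed,
                   "objective": "binary"})        # binary objective is an assumption
    return params
```

A Bayesian optimizer proposes floating-point values, while `num_leaves` and `max_depth` must be integers, which is why a conversion step of some kind is needed before training.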

Scores from Existing Tools

The scores for REVEL [6], ClinPred [7], PrimateAI [8], MetaSVM/MetaLR [9], VEST4 [11], MVP [12], PolyPhen-2 [16], and FATHMM-XF [20] were obtained from dbNSFP v4.1a. The scores for Mendelian Clinically Applicable Pathogenicity (M-CAP) [5], MISsense deleTeriousness predICtor (MISTIC) [13], CAPICE [19], Combined Annotation Dependent Depletion (CADD) [21], and ReVe [22] were downloaded from their respective websites.

Statistical Analysis

Wilcoxon matched-pairs signed-rank test was conducted using the stats module in SciPy Python package (version 1.5.4). Adjusted P value in GO enrichment analysis was calculated by the R package clusterProfiler. All the metrics in this study were calculated based on the scikit-learn Python package.

Results

The Prediction Model was Refined with Various Algorithm and Feature Selection

The performance of a machine learning model is mainly determined by the algorithm, the features, and the training set used. Therefore, we designed mvPPT by careful selection of the algorithm, features, and training set (Figure 1).

Figure 1

mvPPT workflow. The detailed mvPPT construction process is shown. High-confidence variant sets were extracted from ClinVar, HGMD, and UniProt. Models were trained using LightGBM with parameters tuned by Bayesian global optimization. Ten-fold cross-validation was carried out to verify the effectiveness of the prediction model. mvPPT was built after algorithm selection, feature selection, and training set analysis. mvPPT, Pathogenicity Prediction Tool for missense variants; HGMD, Human Gene Mutation Database; 1KGP, 1000 Genomes Project; SVM, support vector machine; GBM, gradient boosting machine; gnomAD, Genome Aggregation Database; AF, allele frequency.

We first benchmarked the performance of ten commonly used algorithms, including SVM, naive Bayes, logistic regression, decision tree, random forest, extra forest, gradient boosting machine (GBM), AdaBoost, LightGBM, and bagging, using the cross-validation procedure described above. The performance of each algorithm was evaluated using the area under the receiver operating characteristic curve (AUROC) and the area under the precision-recall curve (AUPRC). The ensemble learning algorithms (random forest, extra forest, GBM, AdaBoost, LightGBM, and bagging) achieved higher AUROC and AUPRC than the other algorithms (Figure 2). Among them, LightGBM had the highest AUROC (0.970 ± 0.001) and AUPRC (0.952 ± 0.002) (Figure 2).
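To make the two evaluation metrics concrete, the following dependency-free sketch computes AUROC as the fraction of correctly ordered pathogenic/benign pairs, and average precision as one common estimator of AUPRC. The study computed its metrics with scikit-learn; the function names here and the choice of average precision as the AUPRC estimator are assumptions, and tied scores are not given special treatment in `average_precision`.

```python
def auroc(labels, scores):
    """Probability that a random positive scores higher than a random negative."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def average_precision(labels, scores):
    """Mean of the precision values at the rank of each positive."""
    ranked = sorted(zip(scores, labels), key=lambda t: -t[0])
    hits, total = 0, 0.0
    for rank, (_, y) in enumerate(ranked, start=1):
        if y == 1:
            hits += 1
            total += hits / rank
    return total / hits
```

With labels `[1, 1, 0, 1, 0]` and scores `[0.9, 0.8, 0.7, 0.6, 0.1]`, five of the six positive/negative pairs are ordered correctly, so AUROC is 5/6, and the average precision is (1 + 1 + 0.75) / 3.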

Figure 2

Performance comparison of different algorithms. A. The ROC curves for models trained with different algorithms. B. The PRCs for models trained with different algorithms. Below: bar plot quantifications of the AUROC and the AUPRC values obtained by ten-fold cross-validation. P value was calculated by Wilcoxon matched-pairs signed-rank test. ROC, receiver operating characteristic; PRC, precision-recall curve; AUROC, area under the receiver operating characteristic curve; AUPRC, area under the precision-recall curve.

To obtain a proper feature space, we extracted three types of features: prediction scores (category A), frequency features (category B), and region-based features (category C) (Figure 1; Table 1). We evaluated the performance of models with the following combinations of feature categories: A, A + B, A + C, and A + B + C (Table 1). Overall, adding category B or category C greatly boosted the performance compared with the model including only category A, which is the case for most previous ensemble models. Specifically, the model including A + B + C achieved the highest performance (Figure 3A and B). As pathogenic variants in databases may be enriched with low-frequency variants, we also assessed the performance of different models on rare variants. When all the variants in the test set were rare (AF = 0, based on gnomAD exomes), the main contributions came from category A and category C, and adding category B had only a small positive impact (Figure 3C and D). Altogether, both category B and category C improved prediction accuracy and were thus included in mvPPT.

Figure 3

The prediction power of features by category. A. AUROC obtained from models trained on each combination of categories. B. AUPRC obtained from models trained on each combination of categories. C. AUROC evaluating the performance of models trained on different combinations of categories when all the variants in the test set were rare (AF = 0, based on gnomAD exomes). D. AUPRC evaluating the performance of models trained on different combinations of categories when all the variants in the test set were rare (AF = 0, based on gnomAD exomes). AUROC and AUPRC from ten-fold cross-validation were plotted. P value was calculated by Wilcoxon matched-pairs signed-rank test.

Table 1

Features associated with missense variants mined in this work

Category	Feature	Definition
A	MutationAssessor, SIFT, PROVEAN, GERP++RS, phyloP100way_vertebrate, phyloP30way_mammalian, phyloP17way_primate, phastCons100way_vertebrate, phastCons30way_mammalian, phastCons17way_primate, SiPhy_29way_logOdds	Pathogenicity likelihood scores generated by 7 non-machine learning-based component tools
B	gnomAD_WtFs, gnomAD_AFs, gnomAD_HetFs, gnomAD_HomFs, gnomAD_AAFs	AAFs, AFs, and GFs of each variant estimated from 125,748 exomes in gnomAD (version 2.1.1)
C	GeVIR, VIRLoF, HIP, CCRs, Interpro_domain, oe_mis_upper, RefAA_A, RefAA_C, RefAA_D, RefAA_E, RefAA_F, RefAA_G, RefAA_H, RefAA_I, RefAA_K, RefAA_L, RefAA_M, RefAA_N, RefAA_P, RefAA_Q, RefAA_R, RefAA_S, RefAA_T, RefAA_V, RefAA_W, RefAA_Y, AltAA_A, AltAA_C, AltAA_D, AltAA_E, AltAA_F, AltAA_G, AltAA_H, AltAA_I, AltAA_K, AltAA_L, AltAA_M, AltAA_N, AltAA_P, AltAA_Q, AltAA_R, AltAA_S, AltAA_T, AltAA_V, AltAA_W, AltAA_Y	Region/gene-based scores

Note: gnomAD_WtFs, wild-type frequencies estimated from all exomes in gnomAD; gnomAD_AFs, allele frequencies estimated from all exomes in gnomAD; gnomAD_HetFs, heterozygous frequencies estimated from all exomes in gnomAD; gnomAD_HomFs, homozygous frequencies estimated from all exomes in gnomAD; gnomAD_AAFs, amino acid frequencies estimated from all exomes in gnomAD; RefAA, reference amino acid; AltAA, alternate amino acid; AAF, amino acid frequency; AF, allele frequency; GF, genotype frequency.

As population variants tend to have higher AFs than pathogenic variants, models including AFs as features may perform better on test sets that include population variants. Therefore, we excluded 1KGP variants from the test set and repeated the comparison; the models displayed similar performance on test sets with or without population variants (Figure S1).

Training Data Prefiltration Improves the Model Performance

We observed that genomic locations of variants recorded in disease databases are likely to be biased by the interests of the research field, i.e., variants in the databases are likely to be enriched in “hotspots” of the human genome. To evaluate the enrichment pattern of variants from different databases across the genome, we calculated the ratio between the number of missense variants in each gene and the length of the gene’s protein-coding sequence (VPR). We found that VPR in disease databases is much more variable than that in gnomAD, with a coefficient of variation (CV) of 0.815% for gnomAD and 2.35% for disease databases (Figure 4A–C). Outlier detection based on IQR detected 1.12% of the genes as outliers (VPR > mean + 1.5 × IQR) in gnomAD, but 13.48% of the genes as outliers in disease databases (Figure 4D; Table S3). Likewise, when plotting the number of variants against the length of the coding sequence, we observed a significant positive correlation for variants in gnomAD, but not for variants in the pathogenic databases (Figure 4E and F). GO enrichment analysis revealed that outlier genes in gnomAD are enriched in pathways associated with immune response, which are known to be hotspots of positive selection (Figure 4G). In contrast, the top enriched pathways of outlier genes in pathogenic databases include gland development (adjusted P = 1.33E−12), regulation of body fluid levels (adjusted P = 8.61E−12), and response to an inorganic substance (adjusted P = 9.70E−11) (Figure 4H), indicating that variant enrichment patterns differ between the disease and population databases.
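The two quantities used in this comparison, VPR and CV, are simple to compute. The sketch below is illustrative: the function names are hypothetical, and using the population standard deviation (rather than the sample standard deviation) for the CV is an assumption, since the text does not specify it.

```python
import statistics

def vpr(n_missense, cds_length):
    """Number of missense variants in a gene divided by its coding-sequence length."""
    return n_missense / cds_length

def coefficient_of_variation(values):
    """Standard deviation divided by the mean (population SD is an assumption)."""
    return statistics.pstdev(values) / statistics.fmean(values)
```

For two genes with 10 and 30 missense variants over 1000 coding bases each, the VPRs are 0.01 and 0.03, giving a mean of 0.02, a standard deviation of 0.01, and a CV of 0.5.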

Figure 4

The enrichment pattern of variants from different databases and AFs selected. The VPR was calculated. A. After being centered, VPR in gnomAD genomes was plotted against the ranking of protein-coding sequence length. B. After being centered, VPR in pathogenic databases was plotted against the ranking of protein-coding sequence length. C. The CV for VPR in gnomAD genomes and pathogenic databases. D. The percentage of outlier genes related to gnomAD genomes and pathogenic databases. E. Variant number profile across all genes in gnomAD genomes. F. Variant number profile across all genes in pathogenic databases. R² corresponds to the coefficient of determination. G. Bubble chart showing GO enrichment analysis of variants in gnomAD genomes. H. Bubble chart showing GO enrichment analysis of variants in pathogenic databases. I. AUROC obtained from models trained on training sets combining gnomAD variants selected according to different AF thresholds. J. AUPRC obtained from models trained on training sets combining gnomAD variants selected according to different AF thresholds. “Control” represents the model without incorporating gnomAD variants, “No singleton” represents the model incorporating gnomAD variants with singletons removed, and “All” represents the model with all gnomAD variants incorporated. VPR, ratio between the number of missense variants in each gene and the length of the gene’s protein-coding sequence; CV, coefficient of variation; GO, Gene Ontology.

To further test whether the aggregation of variants on particular genes impairs model performance, we down-sampled pathogenic variants on genes carrying large numbers of them. Specifically, for genes with a VPR greater than a set threshold (0.008, 0.015, 0.040, 0.065, or 0.090), we randomly selected a fixed number of variants and combined them with the variants from all other genes to form the down-sampled training sets (Figure S2A). The down-sampled training sets were then fed to models using different feature combinations (A + B and A + B + C). To avoid a similar variant enrichment pattern in the test set, we randomly selected one pathogenic variant from each gene to form a test set. Ten test sets were created in each round of the ten-fold cross-validation. We found that the predictive power of the models decreased with down-sampling (Figure S2B and C), possibly because down-sampling greatly reduces the number of variants available for learning. Adding category C slowed this decline. Overall, however, down-sampling appears to attenuate model performance and is therefore disfavored (Figure S2B and C).
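The down-sampling procedure can be sketched as follows. This is an illustration only: the function names, the `(gene, variant_id)` tuple layout, and the value of `n_keep` (the text says only "a fixed number") are assumptions, not the authors' implementation.

```python
import random


def group_by_gene(variants):
    """Group (gene, variant_id) tuples into a dict: gene -> list of variant ids."""
    by_gene = {}
    for gene, vid in variants:
        by_gene.setdefault(gene, []).append(vid)
    return by_gene


def downsample_by_gene(variants, vpr_by_gene, vpr_threshold, n_keep, seed=0):
    """Keep at most n_keep random variants on genes whose VPR exceeds the threshold."""
    rng = random.Random(seed)
    kept = []
    for gene, vids in group_by_gene(variants).items():
        if vpr_by_gene[gene] > vpr_threshold and len(vids) > n_keep:
            vids = rng.sample(vids, n_keep)
        kept.extend((gene, v) for v in vids)
    return kept


def one_variant_per_gene(variants, seed=0):
    """Build a test set with one randomly chosen variant per gene."""
    rng = random.Random(seed)
    return [(g, rng.choice(vids)) for g, vids in sorted(group_by_gene(variants).items())]


# Hypothetical toy data: gene A is heavily annotated, gene B is not.
variants = [("A", f"a{i}") for i in range(50)] + [("B", "b0"), ("B", "b1")]
vpr_by_gene = {"A": 0.05, "B": 0.002}

train = downsample_by_gene(variants, vpr_by_gene, vpr_threshold=0.015, n_keep=5)
test_set = one_variant_per_gene(variants)
print(len(train), len(test_set))
```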

Next, we compared the performance of models trained on six training sets generated by different strategies. We found that adding variants from the population database skewed the distribution of benign variant AFs toward zero, making it similar to that of pathogenic variants (Figure S3). Including neutral variants from the population database has a positive impact on the model, with the set that excludes singletons giving the highest performance (Figure 4I and J). In contrast, adding the full set of gnomAD variants impairs model performance, possibly because the population database includes variants that are not truly benign (Figure 4I and J). To further investigate whether this improvement is due to the correlation of AFs between training and test sets, we divided the original test sets into bins based on AFs (from gnomAD exomes) and tested the performance of the models on variants within a specific AF range. We found that including population variants with singletons removed slightly enhances the performance of the model in most of the bins, especially the low-AF bins, where most of the pathogenic variants are located, likely because it expands the training set while excluding unreliable (singleton) samples (Table S4). In contrast, the performance of the model drops as the AF cutoff increases, probably due to the lack of rare benign variants in the training set.
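A rough sketch of how the training-set strategies compared above ("Control", "No singleton", "All", and AF-cutoff variants) might be assembled is given below. The dictionary keys `ac` (allele count) and `af`, the reading of a singleton as a variant with an allele count of 1, and the direction of the AF filter (keeping variants at or above the cutoff) are assumptions made for illustration.

```python
def select_population_variants(gnomad_variants, strategy, af_cutoff=None):
    """Choose which population variants to add as putatively benign examples."""
    if strategy == "control":          # no population variants added
        return []
    if strategy == "all":              # every population variant added
        return list(gnomad_variants)
    if strategy == "no_singleton":     # drop variants observed only once
        return [v for v in gnomad_variants if v["ac"] > 1]
    if strategy == "af_cutoff":        # keep variants at or above an AF cutoff
        if af_cutoff is None:
            raise ValueError("af_cutoff is required for the 'af_cutoff' strategy")
        return [v for v in gnomad_variants if v["af"] >= af_cutoff]
    raise ValueError(f"unknown strategy: {strategy}")


def build_training_set(disease_variants, gnomad_variants, strategy, af_cutoff=None):
    """Combine labeled disease-database variants with selected population variants."""
    added = select_population_variants(gnomad_variants, strategy, af_cutoff)
    return list(disease_variants) + [dict(v, label=0) for v in added]


# Hypothetical toy records
toy_gnomad = [
    {"id": "v1", "ac": 1, "af": 0.000004},   # singleton
    {"id": "v2", "ac": 3, "af": 0.00001},
    {"id": "v3", "ac": 500, "af": 0.002},
]
toy_disease = [{"id": "d1", "label": 1}, {"id": "d2", "label": 0}]

sizes = {
    s: len(build_training_set(toy_disease, toy_gnomad, s))
    for s in ("control", "no_singleton", "all")
}
print(sizes)
```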

Based on the aforementioned analyses, mvPPT was finally trained using LightGBM (tuned by Bayesian optimization) on variants from three disease databases and gnomAD with singletons removed, using all features from categories A, B, and C. The correlation among the individual features and relative importance of these features are shown in Figure S4 and the description of the training set is shown in Table S5.

mvPPT Outperforms Existing Prediction Tools

For assessment, we collected variants from VariSNP, VKGL, DPV, DoCM, MetaLR/SVM_Test, and CAPICE_Test to generate an independent test set. Variants that were used in any training set or in the training data of our features were discarded from the test set, and only variants with complete scores from all comparators were included. In total, the test set contains 175,144 variants, comprising 168,222 benign variants and 6,922 pathogenic variants (Table S6).

Using the new test set, the performance of mvPPT was benchmarked against 15 prediction tools that are widely used and readily implemented, including MVP, CAPICE, FATHMM-XF, REVEL, M-CAP (version 1.4), ClinPred, ReVe, PrimateAI, MetaSVM, MetaLR, MISTIC, CADD (version 1.4), PolyPhen-2 HDIV, PolyPhen-2 HVAR, and VEST (version 4). Among all these tools, mvPPT has the highest AUROC of 0.960 and the highest AUPRC of 0.719 (Figure 5; Table 2). MISTIC has the second-best overall performance, with an AUROC of 0.920 and an AUPRC of 0.565. We also calculated other metrics, including accuracy, precision [also known as positive predictive value (PPV)], sensitivity [also known as true positive rate (TPR)], F1 score, log loss, Matthews correlation coefficient (MCC), true negative rate (TNR; also known as specificity), false positive rate (FPR), and diagnostic odds ratio (DOR) (Table 2). M-CAP has the highest sensitivity of 0.953 (second-best: mvPPT, 0.888), but this comes at the cost of a low precision of 0.078 (best: mvPPT, 0.323). PrimateAI has the highest specificity of 0.925 (second-best: mvPPT, 0.923) and the lowest FPR of 0.075 (second-best: mvPPT, 0.077). mvPPT has the highest accuracy of 0.922, the highest F1 score of 0.473, the highest DOR of 95.102, the highest MCC value of 0.508, the highest precision of 0.323, and the lowest log loss of 2.697 (Table 2).

Figure 5

Performance comparison of mvPPT and existing prediction tools. A. ROCs of mvPPT and 15 established prediction methods when tested on an independent test set. B. PRCs of mvPPT and 15 established prediction methods.

Table 2

Overview of the performance of mvPPT in comparison to other tools in an independent test set

Predictor	AUROC	AUPRC	Accuracy	Precision	Sensitivity	F1 score	Log loss	MCC	TNR	FPR	DOR
mvPPT	0.960	0.719	0.922	0.323	0.888	0.473	2.697	0.508	0.923	0.077	95.102
MISTIC	0.920	0.565	0.863	0.203	0.839	0.327	4.718	0.371	0.864	0.136	33.238
M-CAP	0.901	0.360	0.551	0.078	0.953	0.144	15.507	0.190	0.534	0.466	23.381
MVP	0.896	0.323	0.773	0.134	0.869	0.232	7.853	0.285	0.769	0.231	22.063
MetaLR	0.890	0.418	0.830	0.160	0.782	0.266	5.889	0.303	0.831	0.169	17.726
REVEL	0.888	0.342	0.849	0.171	0.733	0.278	5.204	0.305	0.854	0.146	16.087
MetaSVM	0.886	0.320	0.847	0.175	0.772	0.286	5.271	0.320	0.851	0.149	19.266
ClinPred	0.846	0.289	0.683	0.096	0.829	0.171	10.942	0.208	0.677	0.323	10.157
ReVe	0.838	0.282	0.608	0.080	0.856	0.147	13.545	0.179	0.598	0.402	8.817
CAPICE	0.812	0.334	0.733	0.101	0.731	0.178	9.224	0.200	0.733	0.267	7.457
VEST	0.792	0.189	0.634	0.079	0.780	0.144	12.638	0.163	0.628	0.372	5.997
FATHMM-XF	0.755	0.165	0.586	0.069	0.761	0.127	14.285	0.134	0.579	0.421	4.388
Polyphen2_HVAR	0.724	0.099	0.687	0.079	0.651	0.141	10.809	0.141	0.689	0.311	4.128
PrimateAI	0.711	0.124	0.901	0.145	0.308	0.197	3.426	0.164	0.925	0.075	5.505
CADD	0.706	0.087	0.417	0.054	0.835	0.102	20.153	0.093	0.399	0.601	3.354
Polyphen2_HDIV	0.684	0.073	0.545	0.061	0.737	0.114	15.711	0.107	0.537	0.463	3.250

Note: mvPPT, Pathogenicity Prediction Tool for missense variants; AUROC, area under the receiver operating characteristic curve; AUPRC, area under the precision-recall curve; MCC, Matthews correlation coefficient; TNR, true negative rate; FPR, false positive rate; DOR, diagnostic odds ratio. The best scores in each column are bolded.
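The threshold-based metrics in Table 2 all follow from the four cells of a confusion matrix. The sketch below uses the standard textbook definitions and hypothetical counts; it is not the authors' evaluation script, and log loss is omitted because it depends on how scores are converted to probabilities.

```python
from math import sqrt


def classification_metrics(tp, fp, tn, fn):
    """Standard confusion-matrix metrics (pathogenic = positive class)."""
    sensitivity = tp / (tp + fn)              # TPR, recall
    tnr = tn / (tn + fp)                      # specificity
    precision = tp / (tp + fp)                # PPV
    return {
        "accuracy": (tp + tn) / (tp + fp + tn + fn),
        "precision": precision,
        "sensitivity": sensitivity,
        "f1": 2 * precision * sensitivity / (precision + sensitivity),
        "mcc": (tp * tn - fp * fn)
        / sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)),
        "tnr": tnr,
        "fpr": 1 - tnr,
        "dor": (tp * tn) / (fp * fn),         # diagnostic odds ratio
    }


# Hypothetical counts, chosen only to make the arithmetic easy to follow
m = classification_metrics(tp=8, fp=4, tn=86, fn=2)
for name, value in m.items():
    print(f"{name}: {value:.3f}")
```

Because the test set is strongly imbalanced (6,922 pathogenic vs. 168,222 benign variants), an FPR of 0.077 still corresponds to roughly 13,000 false positives, which is why precision in Table 2 stays far below sensitivity even for the best-performing tool.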

To further evaluate the robustness of mvPPT, we performed repeated random sampling: in each of 20 rounds, 20% of the variants in the independent test set were sampled. The results showed that mvPPT displayed the highest efficiency and robustness (Figure S5). Since mvPPT includes AFs as features, we then tested the performance of our model when AF information is lacking. We compared the predictive power of mvPPT with that of existing methods on variants with different AF levels (based on gnomAD exomes). As shown in Figure S6, mvPPT performed the best on variants across the different AF levels.
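The repeated-sampling check can be sketched as below: AUROC is recomputed on 20 random 20% subsets, and its spread indicates robustness. The pairwise AUROC function is a simple reference implementation suited to small inputs, and the labels and scores are synthetic toy values rather than data from the study.

```python
import random
from statistics import mean, pstdev


def auroc(labels, scores):
    """Probability that a random positive outscores a random negative (ties count 0.5)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def subsample_auroc(labels, scores, rounds=20, fraction=0.2, seed=0):
    """AUROC on `rounds` random subsets, each holding `fraction` of the test set."""
    rng = random.Random(seed)
    n = len(labels)
    k = max(2, int(n * fraction))
    results = []
    while len(results) < rounds:
        idx = rng.sample(range(n), k)
        ys = [labels[i] for i in idx]
        if 0 < sum(ys) < k:  # a subset needs both classes to define AUROC
            results.append(auroc(ys, [scores[i] for i in idx]))
    return results


# Synthetic toy test set: 80 benign (0) and 20 pathogenic (1) variants
rng = random.Random(42)
labels = [0] * 80 + [1] * 20
scores = [rng.gauss(0.3, 0.15) for _ in range(80)] + [rng.gauss(0.7, 0.15) for _ in range(20)]

aucs = subsample_auroc(labels, scores)
print(f"AUROC over {len(aucs)} rounds: mean {mean(aucs):.3f}, SD {pstdev(aucs):.3f}")
```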

For further assessment, we assembled a test set with pathogenic variants from DoCM, a highly curated database of known disease-causing mutations in cancer derived from the literature, and benign variants randomly selected from VariSNP and VKGL. mvPPT again achieved the best performance on this test set (Figure S7).

Performance of mvPPT on Pathogenic Variants within Novel Disease-causing Genes

To further evaluate the performance of our predictor on variants in new disease-causing genes (i.e., genes recently reported for the first time as causative for Mendelian diseases but not yet included in disease databases), we collected seven disease-causing genes from five recent publications [40–44], and 62 missense variants with complete scores from all comparators were retained (Table S7). We simulated 62 exomes of Mendelian diseases, each by selecting one disease-causing variant and randomly selecting 1000 neutral variants from 1KGP. For each simulated exome, we calculated the percentage of variants predicted as pathogenic by the different predictors, according to the authors’ recommended thresholds (Figure 6A; Table S8). The ranking of pathogenic variants among all variants in each simulated exome is presented in Figure 6B. Among all predictors, PrimateAI generated the shortest list of pathogenic variants, followed by MISTIC and mvPPT. However, only 41 and 44 of the 62 variants were predicted as pathogenic by PrimateAI and MISTIC, respectively (Table S8). CADD identified 100% of these 62 variants as pathogenic, but may produce plenty of false positives (PPV = 0.615 ± 0.002) (Figure 6A; Table S8). In contrast, mvPPT performed relatively well in both sensitivity (60/62) and PPV (0.137 ± 0.001) (Figure 6A; Table S8). To further evaluate the ability of each predictor to prioritize the pathogenic variants, we computed the ranking of pathogenic variants among all variants in the simulated exomes (Figure 6B). Pathogenic variants showed the best ranking with mvPPT (29 ± 9), significantly better than with the rest of the tools (second-best: ClinPred, 75 ± 11; P = 2.12E−08, Wilcoxon matched-pairs signed-rank test), further demonstrating the advantage of mvPPT in detecting pathogenic variants within novel disease-causing genes.

Figure 6

Evaluation of the different prediction tools using simulated disease exomes A. Distribution of the percentage of predicted pathogenic variants in the simulated disease exomes. B. Ranking of the pathogenic variants in the simulated disease exomes.
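The simulated-exome evaluation can be illustrated with a short sketch: each exome holds one disease-causing variant plus 1000 neutral variants, and a predictor is judged by how many variants it calls pathogenic and where the true variant ranks. The scores, the threshold, and the convention that higher scores mean more pathogenic are assumptions for illustration.

```python
import random


def simulate_exome(pathogenic_score, neutral_pool, n_neutral=1000, rng=random):
    """Pair one pathogenic score with n_neutral scores drawn from a neutral pool."""
    return pathogenic_score, rng.sample(neutral_pool, n_neutral)


def rank_of_pathogenic(pathogenic_score, neutral_scores):
    """1-based rank by descending score; ties are resolved against the pathogenic variant."""
    return 1 + sum(s >= pathogenic_score for s in neutral_scores)


def fraction_predicted_pathogenic(pathogenic_score, neutral_scores, threshold):
    """Share of all variants in the exome scoring at or above the threshold."""
    calls = [pathogenic_score >= threshold] + [s >= threshold for s in neutral_scores]
    return sum(calls) / len(calls)


# Hypothetical predictor scores: neutral variants mostly score low
rng = random.Random(1)
neutral_pool = [rng.betavariate(1, 6) for _ in range(5000)]

patho, neutrals = simulate_exome(0.9, neutral_pool, rng=rng)
rank = rank_of_pathogenic(patho, neutrals)
frac = fraction_predicted_pathogenic(patho, neutrals, threshold=0.5)
print(f"rank of pathogenic variant: {rank} of {len(neutrals) + 1}; called pathogenic: {frac:.1%}")
```

A shorter list of predicted pathogenic variants is only useful if the true variant is on it, which is why the text reports both the number of the 62 variants recovered and their rank.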

Discussion

Missense variants, the most common category of SNVs, have important implications for human genetic diseases. Although a variety of variant pathogenicity assessment tools have been established and have made important contributions to genetic variant evaluation, there is still room for improvement in prediction accuracy and precision, which is of great importance for the interpretation of the tremendous number of genetic variants. In this study, we present a comprehensive prediction tool, mvPPT, and demonstrate that its performance is superior to that of existing comparators in AUROC, AUPRC, F1 score, and many other metrics. The improvement in prediction probably resulted from the careful selection of the algorithm, the features, and the training sets.

Boosting algorithms are widely adopted by ensemble models of variant classification to improve the accuracy of prediction. mvPPT adopted LightGBM, a recently developed algorithm reported to outperform existing boosting frameworks in both efficiency and accuracy. Besides boosting algorithms, we also tried a deep neural network framework and found that our model gave higher accuracy and precision than 10-layer, 15-layer, and 20-layer fully connected neural networks (data not shown). Compared with traditional machine learning algorithms, deep learning performs better as the scale of data increases [59]. We expect more accurate prediction models based on deep neural networks in the near future, with the accumulation of larger amounts of high-confidence training data. Recently developed deep learning models such as PrimateAI, SpliceAI [60], and EVE [61], which use surrounding DNA sequences as input without requiring explicit features, have provided new perspectives on variant pathogenicity prediction. These approaches could possibly be further improved with larger training data, as well as by incorporating sequence conservation, constraints, and protein structure information.

In this study, mvPPT adopted 62 features belonging to three categories. In addition to commonly used features extracted from previous predictors (category A), mvPPT included two categories of features associated with allele/genotype/amino acid frequencies (category B) and the genetic constraint of adjacent regions (category C). Our benchmarking studies revealed that category B and category C contributed significantly to the prediction. Features in category C contributed the most as a whole, and adding category B further improved the performance (Figure S4). The high predictive power of features in categories B and C could be explained by the fact that natural selection constantly eliminates deleterious variants during evolution, and thus deleterious variants tend to be located in more conserved and intraspecies-constrained genomic regions and to have lower AF/GF/AAF in human populations than neutral mutations. Considering that the pathogenic variants in disease databases are likely to have small AF and GF due to existing pathogenic variant selection criteria, we further tested our model on rare variants with AF = 0 and confirmed that the model works well even when AF does not function as a feature. Beyond the features we used, protein structural changes corresponding to changes in amino acid sequence may also be critical predictors of variant pathogenicity. Newly developed protein structure prediction tools, such as AlphaFold2 [62,63], have made it possible to include protein structure information in future tools.

In addition to the algorithm and feature selection, the improvement in the performance of mvPPT also came from a cautious selection of the training set. While most of the prevailing variant classifiers trained their models on single databases, previous studies have uncovered considerable disagreement among databases [7,9]. Some tools also enrolled variants in general population databases (e.g., Exome Aggregation Consortium [64]) as benign variants to increase the size of the training set. However, this setting may add noise to the training data, as not all variants in population databases are truly benign. In this study, we found that adding the full set of variants from gnomAD attenuates the predictive performance of our model. In contrast, applying appropriate filters to population variants ensures high-quality training data and improves the predictive ability of our model. A possible explanation of this observation is that singletons are more likely than non-singletons to be contaminated with falsely benign or low-confidence variants. To examine this, we looked up in ClinVar the labels of the gnomAD singletons that were added to the training set. Among the gnomAD singletons with labels in ClinVar, 8.1% were benign (labeled as “benign”, “likely benign”, or “benign/likely benign”), 78.5% were labeled as “uncertain”, 7.4% were labeled as “conflicting interpretations of pathogenicity”, and 6.0% were pathogenic (labeled as “pathogenic”, “likely pathogenic”, or “pathogenic/likely pathogenic”). In contrast, among the overlapping non-singleton missense variants, 47.8% were benign, 34.8% were labeled as “uncertain”, 15.3% were labeled as “conflicting interpretations of pathogenicity”, and 2.1% were pathogenic. Overall, our observation highlights the importance of maintaining a balance between the size and purity of the training set, and provides practical guidance for training set selection.

As reported by previous studies, we observed that the pathogenic variants in disease databases are enriched in certain genes. However, our down-sampling experiments suggest that the aggregation of variants has little effect on pathogenicity prediction. This can possibly be explained by the fact that most features of variants are independent of genes. Furthermore, we found that adding region/gene-based information (category C) dampened the effects of down-sampling, suggesting that incorporating genomic context further lessens the impact of variant aggregation (Figure S2B and C).

In conclusion, we developed an ensemble classifier, mvPPT, for predicting the pathogenicity of missense variants, and demonstrated that mvPPT achieved superior performance compared with other established prediction tools. Particularly, in clinical data, mvPPT showed the highest accuracy and robustness in classifying variants associated with both Mendelian diseases and cancer. Therefore, mvPPT promises to facilitate a better clinical interpretation of missense variants with uncertain significance. For convenient use, we built a searchable website and all pre-computed mvPPT scores are available at http://www.mvppt.club/.

Code Availability

The mvPPT scores for potential missense variants in the human genome are available at http://www.mvppt.club/. The mvPPT code is available at https://ngdc.cncb.ac.cn/biocode/tools/BT007292 as well as at https://github.com/tongshiyuan/mvPPT for noncommercial purposes.

Competing Interests

Zai-Wei Zhou is a current employee of Shanghai Xunyin Biotechnology Co., Ltd. All the other authors have declared no competing interests.

CRediT Authorship Contribution Statement

Shi-Yuan Tong: Methodology, Software, Formal analysis, Investigation, Writing – original draft, Visualization. Ke Fan: Validation, Formal analysis, Investigation, Writing – original draft. Zai-Wei Zhou: Methodology, Investigation, Writing – original draft. Lin-Yun Liu: Validation, Writing – original draft. Shu-Qing Zhang: Validation, Data curation, Writing – original draft. Yinghui Fu: Validation, Data curation, Writing – original draft. Guang-Zhong Wang: Writing – original draft, Data curation. Ying Zhu: Conceptualization, Formal analysis, Resources, Writing – review & editing, Supervision, Project administration. Yong-Chun Yu: Conceptualization, Resources, Writing – review & editing, Visualization, Supervision, Project administration. All authors have read and approved the final manuscript.

Supplementary Material

Supplementary data to this article can be found online at https://doi.org/10.1016/j.gpb.2022.07.005.

Acknowledgments

This work is supported by the National Key R&D Program of China (Grant No. 2021ZD0202500), the Shanghai Natural Science Foundation, China (Grant No. 20ZR1403800), the National Natural Science Foundation of China (Grant Nos. 31900476, 82071259, 31930044, and 31725012), the Shanghai Municipal Science and Technology Major Project (Grant No. 2018SHZDZX01) and ZJ Lab, the Shanghai Center for Brain Science and Brain-Inspired Technology, China, the Foundation of Shanghai Municipal Education Commission, China (Grant No. 2019-01-07-00-07-E00062), and the Collaborative Innovation Program of Shanghai Municipal Health Commission, China (Grant No. 2020CXJQ01). We are thankful to Dr. Yvette Chin for English language editing.

References

[1] Lee, Deignan J.L., Dorrani, Strom S.P., Kantarci, Quintero-Rivera, et al. Clinical exome sequencing for genetic identification of rare Mendelian disorders. JAMA 2014;312:1880–1887.

[2] Yang, Muzny D.M., Reid J.G., Bainbridge M.N., Willis, Ward P.A., et al. Clinical whole-exome sequencing for the diagnosis of Mendelian disorders. N Engl J Med 2013;369:1502–1511.

[3] Shihab H.A., Gough, Mort, Cooper D.N., Day I.N., Gaunt T.R. Ranking non-synonymous single nucleotide polymorphisms based on disease concepts. Hum Genomics 2014.

[4] P.C., Levy, Huang, Stockwell T.B., Walenz B.P., et al. Genetic variation in an individual human exome. PLoS Genet 2008;e1000160.

[5] Jagadeesh K.A., Wenger A.M., Berger M.J., Guturu, Stenson P.D., Cooper D.N., et al. M-CAP eliminates a majority of variants of uncertain significance in clinical exomes at high sensitivity. Nat Genet 2016;1581–1586.

[6] Ioannidis N.M., Rothstein J.H., Pejaver, Middha, McDonnell S.K., Baheti, et al. REVEL: an ensemble method for predicting the pathogenicity of rare missense variants. Am J Hum Genet 2016;877–885.

[7] Alirezaie, Kernohan K.D., Hartley, Majewski, Hocking T.D. ClinPred: prediction tool to identify disease-relevant nonsynonymous single-nucleotide variants. Am J Hum Genet 2018;103:474–483.

[8] Sundaram, Gao, Padigepati S.R., McRae J.F., Kosmicki J.A., et al. Predicting the clinical impact of human mutation with deep neural networks. Nat Genet 2018;1161–1170.

[9] Dong, Wei, Jian, Gibbs, Boerwinkle, Wang, et al. Comparison and integration of deleteriousness prediction methods for nonsynonymous SNVs in whole exome sequencing studies. Hum Mol Genet 2015;2125–2137.

[10] Kumar, Henikoff, P.C. Predicting the effects of coding non-synonymous variants on protein function using the SIFT algorithm. Nat Protoc 2009;1073–1081.

[11] Carter, Douville, Stenson P.D., Cooper D.N., Karchin. Identifying Mendelian disease genes with the variant effect scoring tool. BMC Genomics 2013.

[12] Zhang, Zhao, Chen, Long J.J., Chung W.K., et al. MVP predicts the pathogenicity of missense variants by deep learning. Nat Commun 2021;510.

[13] Chennen, Weber, Lornage, Kress, Bohm, Thompson, et al. MISTIC: a prediction tool to reveal disease-relevant deleterious missense variants. PLoS One 2020;e0236962.

[14] Chapman, Winlaw, Dunwoodie S.L., Giannoulatou. VPOT: a customizable variant prioritization ordering tool for annotated variants. Genomics Proteomics Bioinformatics 2019;540–545.

[15] Reva, Antipin, Sander. Predicting the functional impact of protein mutations: application to cancer genomics. Nucleic Acids Res 2011;e118.

[16] Adzhubei I.A., Schmidt, Peshkin, Ramensky V.E., Gerasimova, Bork, et al. A method and server for predicting damaging missense mutations. Nat Methods 2010;248–249.

[17] Schwarz J.M., Cooper D.N., Schuelke, Seelow. MutationTaster2: mutation prediction for the deep-sequencing age. Nat Methods 2014;361–362.

[18] Liu, Gibbs R.A., Boerwinkle, Polychronakos, H.Q. Gene-specific function prediction for non-synonymous mutations in monogenic diabetes genes. PLoS One 2014;e104452.

[19] van der Velde K.J., de Ridder, van Dijk A.D.J., Soudis, Zwerwer L.R., et al. CAPICE: a computational method for Consequence-Agnostic Pathogenicity Interpretation of Clinical Exome variations. Genome Med 2020.

[20] Rogers M.F., Shihab H.A., Mort, Cooper D.N., Gaunt T.R., Campbell. FATHMM-XF: accurate prediction of pathogenic point mutations via extended features. Bioinformatics 2018;511–513.

[21] Rentzsch, Witten, Cooper G.M., Shendure, Kircher. CADD: predicting the deleteriousness of variants throughout the human genome. Nucleic Acids Res 2019;D886–D894.

[22] Zhao, Zhang, Shi, Chen, et al. Performance evaluation of pathogenicity-computation methods for missense variants. Nucleic Acids Res 2018;7793–7804.

[23] Landrum M.J., Lee J.M., Benson, Brown G.R., Chao, Chitipiralla, et al. ClinVar: improving access to variant interpretations and supporting evidence. Nucleic Acids Res 2018;D1062–D1067.

[24] Stenson P.D., Mort, Ball E.V., Evans, Hayden, Heywood, et al. The Human Gene Mutation Database: towards a comprehensive repository of inherited mutation data for medical research, genetic diagnosis and next-generation sequencing studies. Hum Genet 2017;136:665–677.

[25] Peterson T.A., Doughty, Kann M.G. Towards precision medicine: advances in computational approaches for the analysis of human variants. J Mol Biol 2013;425:4047–4063.

[26] Salnikova L.E., Kolobkov D.S., Sviridova D.A., Abilev S.K. An overview of germline variations in genes of primary immunodeficiences through integrative analysis of ClinVar, HGMD and dbSNP databases. Hum Genet 2021;140:1379–1393.

[27] Tennessen J.A., Bigham A.W., O’Connor T.D., Kenny E.E., Gravel, et al. Evolution and functional impact of rare coding variation from deep sequencing of human exomes. Science 2012;337.

[28] Karczewski K.J., Francioli L.C., Tiao, Cummings B.B., Alfoldi, Wang, et al. The mutational constraint spectrum quantified from variation in 141,456 humans. Nature 2020;581:434–443.

[29] Abramovs, Brass, Tassabehji. GeVIR is a continuous gene-level metric that uses variant distribution patterns to prioritize disease candidate genes. Nat Genet 2020.

[30] Vitsios, Dhindsa R.S., Middleton, Gussow A.B., Petrovski. Prioritizing non-coding regions based on human genomic constraint and sequence context with deep learning. Nat Commun 2021;1504.

[31] Meng, Finley, Wang, Chen, et al. LightGBM: a highly efficient gradient boosting decision tree. 31st Annual Conference on Neural Information Processing Systems 2017:3149–3157.

[32] Anghel, Papandreou, Parnell, De Palma, Pozidis. Benchmarking and optimization of gradient boosting decision tree algorithms. arXiv 2018;1809.04559.

[33] Wang, Hakonarson. ANNOVAR: functional annotation of genetic variants from high-throughput sequencing data. Nucleic Acids Res 2010;e164.

[34] UniProt Consortium. UniProt: a worldwide hub of protein knowledge. Nucleic Acids Res 2019;D506–D515.

[35] 1000 Genomes Project Consortium, Auton, Brooks L.D., Durbin R.M., Garrison E.P., Kang H.M., et al. A global reference for human genetic variation. Nature 2015;526.

[36] Fokkema I.F.A.C., van der Velde K.J., Slofstra M.K., Ruivenkamp C.A.L., Vogel M.J., Pfundt, et al. Dutch genome diagnostic laboratories accelerated and improved variant interpretation and increased accuracy by sharing data. Hum Mutat 2019;2230–2238.

[37] Schaafsma G.C., Vihinen. VariSNP, a benchmark database for variations from dbSNP. Hum Mutat 2015;161–166.

[38] Ainscough B.J., Griffith, Coffman A.C., Wagner A.H., Kunisaki, Choudhary M.N., et al. DoCM: a database of curated mutations in cancer. Nat Methods 2016;806–807.

[39] Suzuki, Kurosawa, Fukuda, Ijima, Sumazaki, Saito, et al. Japanese pathogenic variant database: DPV. Transl Sci Rare Dis 2018;133–137.

[40] Fliedner, Kirchner, Wiesener, van de Beek, Waisfisz, van Haelst, et al. Variants in SCAF4 cause a neurodevelopmental disorder and are associated with impaired mRNA processing. Am J Hum Genet 2020;107:544–554.

[41] Palencia-Campos, Aoto P.C., Machal E.M.F., Rivera-Barahona, Soto-Bielicka, Bertinetti, et al. Germline and mosaic variants in PRKACA and PRKACB cause a multiple congenital malformation syndrome. Am J Hum Genet 2020;107:977–988.

[42] Tsai M.H., Muir A.M., Wang W.J., Kang Y.N., Yang K.C., Chao N.H., et al. Pathogenic variants in CEP85L cause sporadic and familial posterior predominant lissencephaly. Neuron 2020;106:237–245.

[43] Hadjadj, Castro C.N., Tusseau, Stolzenberg M.C., Mazerolles, Aladjidi, et al. Early-onset autoimmunity associated with SOCS1 haploinsufficiency. Nat Commun 2020;5341.

[44] Lessel, Zeitler D.M., Reijnders M.R.F., Kazantsev, Nia F.H., Bartholomaus, et al. Germline AGO2 mutations impair RNA interference and human neurological development. Nat Commun 2020;5797.

[45] Choi, Sims G.E., Murphy, Miller J.R., Chan A.P. Predicting the functional effect of amino acid substitutions and indels. PLoS One 2012;e46688.

[46] Davydov E.V., Goode D.L., Sirota, Cooper G.M., Sidow, Batzoglou. Identifying a high fraction of the human genome to be under selective constraint using GERP++. PLoS Comput Biol 2010;e1001025.

[47] Pollard K.S., Hubisz M.J., Rosenbloom K.R., Siepel. Detection of nonneutral substitution rates on mammalian phylogenies. Genome Res 2010;110–121.

[48] Siepel, Bejerano, Pedersen J.S., Hinrichs A.S., Hou, Rosenbloom, et al. Evolutionarily conserved elements in vertebrate, insect, worm, and yeast genomes. Genome Res 2005;1034–1050.

[49] Garber, Guttman, Clamp, Zody M.C., Friedman, Xie. Identifying novel constrained elements by exploiting biased substitution patterns. Bioinformatics 2009;i54–i62.

[50] Huang, Lee, Marcotte E.M., Hurles M.E. Characterising and predicting haploinsufficiency in the human genome. PLoS Genet 2010;e1001154.

[51] Havrilla J.M., Pedersen B.S., Layer R.M., Quinlan A.R. A map of constrained coding regions in the human genome. Nat Genet 2019.

[52] Jones, Binns, Chang H.Y., Fraser, McAnulla, et al. InterProScan 5: genome-scale protein function classification. Bioinformatics 2014;1236–1240.

[53] Liu, Jian, Boerwinkle. dbNSFP: a lightweight database of human nonsynonymous SNPs and their functional predictions. Hum Mutat 2011;894–899.

[54] Liu, Mou, Dong. dbNSFP v4: a comprehensive database of transcript-specific functional predictions and annotations for human nonsynonymous and splice-site SNVs. Genome Med 2020;103.

[55] Ashburner, Ball C.A., Blake J.A., Botstein, Butler, Cherry J.M., et al. Gene Ontology: tool for the unification of biology. Nat Genet 2000.

[56] Muruganujan, Ebert, Huang, Thomas P.D. PANTHER version 14: more genomes, a new PANTHER GO-slim and improvements in enrichment analysis tools. Nucleic Acids Res 2019;D419–D426.

[57] Gene Ontology Consortium. The Gene Ontology resource: enriching a GOld mine. Nucleic Acids Res 2021;D325–D334.

[58] Wang L.G., Han, Q.Y. clusterProfiler: an R package for comparing biological themes among gene clusters. OMICS 2012;284–287.

[59] Cao, Liu, Tan, Song, Shu, et al. Deep learning and its applications in biomedicine. Genomics Proteomics Bioinformatics 2018.

[60] Jaganathan, Panagiotopoulou S.K., McRae J.F., Darbandi S.F., Knowles, Y.I., et al. Predicting splicing from primary sequence with deep learning. Cell 2019;176:535–548.

[61] Frazer, Notin, Dias, Gomez, Min J.K., Brock, et al. Disease variant prediction with deep generative models of evolutionary data

Nature

2021

;

599

–

Jumper

Evans

Pritzel

Green

Figurnov

Ronneberger

et al.

Highly accurate protein structure prediction with AlphaFold

Nature

2021

;

596

583

–

589

Varadi

Anyango

Deshpande

Nair

Natassia

Yordanova

et al.

AlphaFold Protein Structure Database: massively expanding the structural coverage of protein-sequence space with high-accuracy models

Nucleic Acids Res

2022

;

D439

–

D444

Lek

Karczewski

K.J.

Minikel

E.V.

Samocha

K.E.

Banks

Fennell

et al.

Analysis of protein-coding genetic variation in 60,706 humans

Nature

2016

;

536

285

–

291

Author notes

Peer review under responsibility of Beijing Institute of Genomics, Chinese Academy of Sciences / China National Center for Bioinformation and Genetics Society of China.

© 2023 The Authors. Published by Elsevier B.V. and Science Press on behalf of Beijing Institute of Genomics, Chinese Academy of Sciences / China National Center for Bioinformation and Genetics Society of China.

This is an open access article under the terms of the CC BY-NC-ND 4.0 License (http://creativecommons.org/licenses/by-nc-nd/4.0/), which permits use and distribution in any medium, provided the original work is properly cited, the use is non-commercial, and no modifications or adaptations are made.
