CADD: predicting the deleteriousness of variants throughout the human genome

Philipp Rentzsch et al. Nucleic Acids Res. 2019.

Abstract

Combined Annotation-Dependent Depletion (CADD) is a widely used measure of variant deleteriousness that can effectively prioritize causal variants in genetic analyses, particularly highly penetrant contributors to severe Mendelian disorders. CADD is an integrative annotation built from more than 60 genomic features, and can score human single nucleotide variants and short insertions and deletions anywhere in the reference assembly. CADD uses a machine learning model trained on a binary distinction between simulated de novo variants and variants that have arisen and become fixed in human populations since the split between humans and chimpanzees; the former are free of selective pressure and may thus include both neutral and deleterious alleles, while the latter are overwhelmingly neutral (or, at most, weakly deleterious) by virtue of having survived millions of years of purifying selection. Here we review the latest updates to CADD, including the most recent version, 1.4, which supports the human genome build GRCh38. We also present updates to our website that include simplified variant lookup, extended documentation, an Application Program Interface (API) and improved mechanisms for integrating CADD scores into other tools or applications. CADD scores, software and documentation are available at https://cadd.gs.washington.edu.

Figures

Figure 1.

The CADD framework. (A) Training a CADD model requires the identification of variants that are fixed or nearly fixed in human populations, but are absent in the inferred genome sequence of the human-ape ancestor (proxy-neutral variants). The sequence composition of this variant set is used to draw a matching set of proxy-deleterious variants. Using more than 60 diverse annotations, a machine learning model is trained to classify variants as proxy-neutral versus proxy-deleterious. All potential SNVs of the human reference genome are annotated using the same features, and raw CADD scores are calculated. A PHRED conversion table is derived from the relative ranking of these model scores. (B) Users provide variant sets in VCF format, and CADD uses the chromosome, position, reference allele and alternative allele columns from these files. Scores are either retrieved from pre-scored files, or else variants are fully annotated and the CADD score is calculated. The PHRED-scaled score is then looked up in the conversion table, and both scores are returned to the user. Users may request output files containing variant annotations.
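The rank-based PHRED conversion described in panel (A) can be illustrated with a minimal sketch. It assumes the usual PHRED-style definition, -10 * log10(rank / N), where rank 1 is the highest raw score among N scored variants; the function name `phred_scale` and the raw scores are hypothetical, and ties are not handled specially.

```python
import math

def phred_scale(raw_scores):
    """Map each raw score to -10*log10(rank/N); rank 1 is the highest raw score.

    Illustrative only: ties simply receive consecutive ranks.
    """
    n = len(raw_scores)
    # Indices of the scores, ordered from highest to lowest raw score.
    order = sorted(range(n), key=lambda i: raw_scores[i], reverse=True)
    scaled = [0.0] * n
    for rank, idx in enumerate(order, start=1):
        scaled[idx] = -10.0 * math.log10(rank / n)
    return scaled

# Hypothetical raw model scores for ten variants.
raw = [2.5, -0.3, 0.1, 1.2, -1.0, 0.4, 0.0, -0.5, 0.9, 3.1]
scaled = phred_scale(raw)
for r, s in zip(raw, scaled):
    print(f"raw={r:5.1f}  scaled={s:5.2f}")
```

Under this definition, a scaled score of 10 corresponds to the top 10% of ranked variants and a score of 20 to the top 1%.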

Figure 2.

Performance of CADD in comparison to other scores. Different scores are compared by area under the receiver operating characteristic curve (AUROC) in terms of how well they separate known pathogenic variants (ClinVar pathogenic) from frequent exome variants (ExAC, minor allele frequency >5%, assumed to be neutral): (A) All variants of the two sets, and (B) missense variants only, with matching genes between the two sets. PolyPhen2 and PROVEAN, two dedicated protein missense variant scores, perform on par with CADD and Eigen, while all other scores have a lower AUROC. The performance of CADD GRCh38-v1.4 is not significantly different from the other CADD releases. The results for more missense scores and non-coding variants are shown in Supplementary Figure S1.
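For readers unfamiliar with the metric, AUROC can be read as the probability that a randomly chosen pathogenic variant scores higher than a randomly chosen neutral one. The sketch below computes it by direct pairwise comparison; the function name `auroc` and all scores are made up for illustration and are not taken from the paper's benchmark.

```python
def auroc(pos_scores, neg_scores):
    """Fraction of (positive, negative) pairs ranked correctly; ties count 0.5."""
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos_scores) * len(neg_scores))

# Hypothetical scaled scores, for illustration only.
pathogenic = [25.1, 18.3, 31.0, 12.4]
benign = [3.2, 14.0, 0.8, 9.5]
result = auroc(pathogenic, benign)
print(result)  # 15 of 16 pairs are ordered correctly -> 0.9375
```

The pairwise form is quadratic in the number of variants; for large benchmark sets a rank-based computation would be the more practical choice.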

Figure 3.

Comparison of CADD v1.3 and v1.4 in the UCSC Genome Browser: CADD GRCh38-v1.4 scores (light blue) in comparison to lifted scores of the models of CADD v1.3 (pink) and v1.4 (gray) originally obtained for the GRCh37 genome build. Each browser track shows the maximum CADD score of the three possible SNVs at each genomic position.
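The per-position summary used for the browser tracks (the maximum over the three possible SNVs at each position) can be sketched as follows. The record layout, the function name `max_score_per_position` and the example scores are assumptions made for illustration.

```python
def max_score_per_position(records):
    """Return {(chrom, pos): highest score among the SNVs at that position}."""
    best = {}
    for chrom, pos, ref, alt, score in records:
        key = (chrom, pos)
        if key not in best or score > best[key]:
            best[key] = score
    return best

# Hypothetical scores for the three possible SNVs at two positions.
records = [
    ("1", 10001, "T", "A", 4.1),
    ("1", 10001, "T", "C", 3.9),
    ("1", 10001, "T", "G", 3.7),
    ("1", 10002, "A", "C", 7.2),
    ("1", 10002, "A", "G", 8.0),
    ("1", 10002, "A", "T", 7.5),
]
track = max_score_per_position(records)
print(track)
```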

Figure 4.

Available CADD services. (A) The web server https://cadd.gs.washington.edu provides a rich resource for obtaining CADD scores and the underlying annotations on which they are based, as well as scripts and documentation. (B) There are several ways to obtain CADD scores. First, CADD scores can be calculated for SNVs and short InDels using offline scripts or our website. Second, pre-scored SNVs and InDels can be retrieved from indexed files via the graphical website interface, the API or tabix.
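The pre-scored retrieval path can be illustrated with a small in-memory sketch: variants are keyed by chromosome, position, reference allele and alternative allele, and any variant without a pre-computed entry is set aside for full annotation. The six-column layout of the pre-scored lines (Chrom, Pos, Ref, Alt, RawScore, PHRED), the helper names and all values are assumptions for illustration; a real workflow would query the indexed files with tabix or the website rather than load them into a dictionary.

```python
def parse_vcf_line(line):
    """Extract (chrom, pos, ref, alt) from a VCF data line (columns 1, 2, 4, 5)."""
    fields = line.rstrip("\n").split("\t")
    chrom, pos, _variant_id, ref, alt = fields[:5]
    return chrom, int(pos), ref, alt

def load_prescored(lines):
    """Build {(chrom, pos, ref, alt): (raw, phred)} from tab-separated lines.

    Assumed column order: Chrom, Pos, Ref, Alt, RawScore, PHRED.
    """
    table = {}
    for line in lines:
        if line.startswith("#"):
            continue  # skip header lines
        chrom, pos, ref, alt, raw, phred = line.rstrip("\n").split("\t")[:6]
        table[(chrom, int(pos), ref, alt)] = (float(raw), float(phred))
    return table

# Hypothetical pre-scored entries and VCF input.
prescored = load_prescored([
    "#Chrom\tPos\tRef\tAlt\tRawScore\tPHRED",
    "1\t10001\tT\tA\t0.088\t4.066",
    "1\t10001\tT\tC\t0.105\t4.353",
])
vcf_lines = [
    "1\t10001\t.\tT\tA\t.\t.\t.",
    "1\t10002\t.\tA\tG\t.\t.\t.",
]

hits, needs_annotation = {}, []
for line in vcf_lines:
    key = parse_vcf_line(line)
    if key in prescored:
        hits[key] = prescored[key]
    else:
        needs_annotation.append(key)  # would be annotated and scored from scratch

print(hits)
print(needs_annotation)
```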
