Effective variant filtering and expected candidate variant yield in studies of rare human disease
Brent S Pedersen, Joe M Brown, Harriet Dashnow, Amelia D Wallace, Matt Velinder, Martin Tristani-Firouzi, Joshua D Schiffman, Tatiana Tvrdik, Rong Mao, D Hunter Best, Pinar Bayrak-Toydemir, Aaron R Quinlan
- PMID: 34267211
- PMCID: PMC8282602
- DOI: 10.1038/s41525-021-00227-3
Brent S Pedersen et al. NPJ Genom Med. 2021.
Abstract
In studies of families with rare disease, it is common to screen for de novo mutations, as well as recessive or dominant variants that explain the phenotype. However, the filtering strategies and software used to prioritize high-confidence variants vary from study to study. In an effort to establish recommendations for rare disease research, we explore effective guidelines for variant (SNP and INDEL) filtering and report the expected number of candidates for de novo dominant, recessive, and autosomal dominant modes of inheritance. We derived these guidelines using two large family-based cohorts that underwent whole-genome sequencing, as well as two family cohorts with whole-exome sequencing. The filters are applied to common attributes, including genotype-quality, sequencing depth, allele balance, and population allele frequency. The resulting guidelines yield ~10 candidate SNP and INDEL variants per exome, and 18 per genome for recessive and de novo dominant modes of inheritance, with substantially more candidates for autosomal dominant inheritance. For family-based, whole-genome sequencing studies, this number includes an average of three de novo, ten compound heterozygous, one autosomal recessive, four X-linked variants, and roughly 100 candidate variants following autosomal dominant inheritance. The slivar software we developed to establish and rapidly apply these filters to VCF files is available at https://github.com/brentp/slivar under an MIT license, and includes documentation and recommendations for best practices for rare disease analysis.
© 2021. The Author(s).
Conflict of interest statement
R.M., D.H.B., and P.B.-T. are employees of ARUP. The authors declare no competing interests.
Figures
Fig. 1. Evaluation of the impact of allele-balance and genotype-quality cutoffs on Mendelian violation rates for trio exomes.
We measured the number of Mendelian violations (_x_-axis) and transmissions (_y_-axis) as we varied allele balance within each plot. The genotype-quality cutoff applied is increased from 5 (A) to 10 (B) to 20 (C). The line in each plot is drawn by varying the allele-balance cutoff and counting the number of variants that are predicted to be transmitted or that are apparent Mendelian violations. Dots in each plot indicate the exact rates at a given threshold. The chosen cutoff, marked with an asterisk, required a genotype quality of at least 20 and an allele balance between 0.2 and 0.8. The false negative rate (FNR) for the allele-balance cutoff of 0.2–0.8 (in purple) is annotated for each genotype-quality cutoff.
Fig. 2. The effect of combined filters on the number of predicted de novo mutations in exome studies.
The number of candidate de novo variants for each of 149 exome trios. In each column, a point represents the number of de novo mutations per trio. Moving right along the plot, each column adds filters to the column that precedes it. The first column uses only the sample information derived above, where AB is allele balance (alternate reads/(alternate reads + reference reads)) and GQ is genotype quality. The second column adds filters on gnomAD allele frequency (AF); this reduces the average number of candidates. The third column further requires that the variant is “impactful,” according to slivar.
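The allele-balance definition in this caption, combined with the sample-level cutoffs named in Fig. 1 (genotype quality of 20, allele balance between 0.2 and 0.8), can be sketched in a few lines of Python. This is an illustrative sketch only, not slivar's implementation: the function names, the inclusive bounds, and the handling of zero-depth sites are assumptions made for the example.

```python
def allele_balance(ref_reads: int, alt_reads: int) -> float:
    """AB = alternate reads / (alternate reads + reference reads)."""
    total = ref_reads + alt_reads
    if total == 0:
        # Assumption: a site with no reads has no defined allele balance.
        return float("nan")
    return alt_reads / total


def passes_sample_filters(ref_reads: int, alt_reads: int, gq: int,
                          min_gq: int = 20,
                          ab_range: tuple = (0.2, 0.8)) -> bool:
    """Hypothetical helper: check the GQ and AB cutoffs for a heterozygous call.

    Bounds are treated as inclusive here, which is an assumption.
    """
    ab = allele_balance(ref_reads, alt_reads)
    # Comparisons with NaN are False, so zero-depth sites fail the filter.
    return gq >= min_gq and ab_range[0] <= ab <= ab_range[1]
```

For example, a call with 10 reference and 10 alternate reads has an allele balance of 0.5 and passes at GQ 30, while a call with 19 reference reads and 1 alternate read (allele balance 0.05) does not.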
Fig. 3. The number of candidate variants that follow different inheritance modes per exome.
The numbers of candidate variants for 149 exome trios are separated by inheritance mode and colored by variant class. Only variants deemed impactful by slivar, using annotations from VEP, snpEff, and bcftools, are shown. Counts for autosomal dominant variants are shown in a separate plot due to the much larger numbers. Each point represents the number of candidate variants for a single family (_y_-axis) passing the inheritance mode (_x_-axis), genotype-quality, population allele-frequency, and allele-balance filters. Gray bars indicate the mean number for each class and inheritance mode. Points are jittered slightly to allow viewing more samples simultaneously.
Fig. 4. Candidate autosomal de novo variants per genome identified by GATK and DeepVariant outside of low-complexity regions.
A cohort of 94 WGS trios from the Rare Genomes Project was screened for candidate de novo mutations using GATK (A) and DeepVariant (B). Variants lying in low-complexity regions were excluded. The leftmost boxplot within each subplot requires a depth ≥10, an allele balance between 0.2 and 0.8, and a genotype quality (GQ) ≥20. Lines within the boxplot are determined from the quartiles of the data. The next box requires that the allele frequency in gnomAD is <0.01. The third box lowers the gnomAD allele-frequency cutoff to <0.001. The final box excludes candidate de novo variants where the allele balance (of the homozygous call) in the parent is ≥2%. Supplementary Fig. 13 presents the analogous plots when including low-complexity regions.
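The cascade of filters described in this caption can be illustrated with a minimal Python sketch. It is not the slivar implementation: the `Call` class and `is_candidate_de_novo` function are hypothetical names, depth is approximated as reference plus alternate reads, the depth and GQ cutoffs are assumed to apply to all three trio members, and the genotypes (heterozygous child, homozygous-reference parents) are assumed to have been checked already.

```python
from dataclasses import dataclass


@dataclass
class Call:
    """Hypothetical per-sample summary of one variant call."""
    ref_reads: int
    alt_reads: int
    gq: int

    @property
    def depth(self) -> int:
        # Assumption: depth is approximated by ref + alt read counts.
        return self.ref_reads + self.alt_reads

    @property
    def ab(self) -> float:
        # Allele balance: alt / (alt + ref); undefined (NaN) at zero depth.
        return self.alt_reads / self.depth if self.depth else float("nan")


def is_candidate_de_novo(kid: Call, mom: Call, dad: Call, gnomad_af: float,
                         max_af: float = 0.001,
                         max_parent_ab: float = 0.02) -> bool:
    """Apply the caption's filters in order; genotypes are assumed pre-checked."""
    # Depth >= 10 and GQ >= 20 (assumed here to apply to every trio member).
    if any(c.depth < 10 or c.gq < 20 for c in (kid, mom, dad)):
        return False
    # Allele balance of the child's heterozygous call between 0.2 and 0.8.
    if not (0.2 <= kid.ab <= 0.8):
        return False
    # gnomAD allele frequency below the cutoff (<0.01, then <0.001).
    if gnomad_af >= max_af:
        return False
    # Exclude sites where either parent's allele balance is >= 2%.
    return all(p.ab < max_parent_ab for p in (mom, dad))
```

Passing `max_af=0.01` reproduces the looser second box, and the default `max_af=0.001` with `max_parent_ab=0.02` corresponds to the final box. A single alternate read among 30 in a parent (allele balance of about 0.033) is enough to reject the candidate under this sketch.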
Fig. 5. The number of candidate variants that follow different inheritance modes per genome using two different variant callers.
A Only “impactful” variants as determined by slivar using annotations from VEP, snpEff, or bcftools are shown. B The set of variants is extended to include synonymous, UTR, and conserved intron regions (but not all intronic). Counts for autosomal dominant variants are shown in a separate plot due to the much larger numbers. Each dot represents the number of candidate variants (_y_-axis) passing the inheritance mode (_x_-axis), genotype-quality, population allele-frequency, and allele-balance filters for a single family. Gray bars indicate the mean number for each class and inheritance mode. We show Fig. 5A for the sarcoma replication cohort in Supplementary Fig. 14.
Grants and funding
- UM1 HG008900/HG/NHGRI NIH HHS/United States
- UM1 HG006504/HG/NHGRI NIH HHS/United States
- UL1 TR001863/TR/NCATS NIH HHS/United States
- T32 HG008962/HG/NHGRI NIH HHS/United States
- R01 GM124355/GM/NIGMS NIH HHS/United States
- U24 HG008956/HG/NHGRI NIH HHS/United States