Effective variant filtering and expected candidate variant yield in studies of rare human disease

doi: 10.1038/s41525-021-00227-3.

Joe M Brown, Harriet Dashnow, Amelia D Wallace, Matt Velinder, Martin Tristani-Firouzi, Joshua D Schiffman, Tatiana Tvrdik, Rong Mao, D Hunter Best, Pinar Bayrak-Toydemir, Aaron R Quinlan

Brent S Pedersen et al. NPJ Genom Med. 2021.

Abstract

In studies of families with rare disease, it is common to screen for de novo mutations, as well as recessive or dominant variants that explain the phenotype. However, the filtering strategies and software used to prioritize high-confidence variants vary from study to study. In an effort to establish recommendations for rare disease research, we explore effective guidelines for variant (SNP and INDEL) filtering and report the expected number of candidates for de novo dominant, recessive, and autosomal dominant modes of inheritance. We derived these guidelines using two large family-based cohorts that underwent whole-genome sequencing, as well as two family cohorts with whole-exome sequencing. The filters are applied to common attributes, including genotype-quality, sequencing depth, allele balance, and population allele frequency. The resulting guidelines yield ~10 candidate SNP and INDEL variants per exome, and 18 per genome for recessive and de novo dominant modes of inheritance, with substantially more candidates for autosomal dominant inheritance. For family-based, whole-genome sequencing studies, this number includes an average of three de novo, ten compound heterozygous, one autosomal recessive, four X-linked variants, and roughly 100 candidate variants following autosomal dominant inheritance. The slivar software we developed to establish and rapidly apply these filters to VCF files is available at https://github.com/brentp/slivar under an MIT license, and includes documentation and recommendations for best practices for rare disease analysis.

© 2021. The Author(s).

Conflict of interest statement

R.M., D.H.B., and P.B.-T. are employees of ARUP. The authors declare no competing interests.

Figures

Fig. 1. Evaluation of the impact of allele-balance and genotype-quality cutoffs on Mendelian violation rates for trio exomes.

We measured the number of Mendelian violations (_x_-axis) and transmissions (_y_-axis) as we varied the allele-balance cutoff within each plot. The genotype-quality cutoff increases from 5 (A) to 10 (B) to 20 (C) across the plots. The line in each plot is drawn by varying the allele-balance cutoff and counting the number of variants that are predicted to be transmitted or that are apparent Mendelian violations. Dots in each plot indicate the exact rates at a given threshold. The chosen cutoff, marked with an asterisk, required a genotype quality of at least 20 and an allele balance between 0.2 and 0.8. The false negative rate (FNR) for the allele-balance cutoff of 0.2–0.8 (in purple) is annotated for each genotype-quality cutoff.
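The sweep described in this caption can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code: the `Call` record, the `sweep_allele_balance` function, and the use of a symmetric window (`lo` to `1 - lo`) are assumptions made for the example, and the calls are made up.

```python
from typing import Iterable, List, NamedTuple, Tuple


class Call(NamedTuple):
    """One heterozygous call in a child (hypothetical record for this sketch)."""
    allele_balance: float   # alternate reads / (alternate + reference reads)
    genotype_quality: int
    is_violation: bool      # True if the call is an apparent Mendelian violation


def sweep_allele_balance(
    calls: Iterable[Call], gq_min: int, lower_cutoffs: Iterable[float]
) -> List[Tuple[float, int, int]]:
    """For each lower cutoff, count (violations, transmissions) that pass.

    Assumes a symmetric allele-balance window [lo, 1 - lo], as in 0.2 to 0.8.
    """
    calls = list(calls)
    rows = []
    for lo in lower_cutoffs:
        hi = 1.0 - lo
        kept = [
            c for c in calls
            if c.genotype_quality >= gq_min and lo <= c.allele_balance <= hi
        ]
        violations = sum(c.is_violation for c in kept)
        rows.append((lo, violations, len(kept) - violations))
    return rows


# Made-up calls, for illustration only.
example_calls = [
    Call(0.50, 30, False),
    Call(0.45, 25, False),
    Call(0.15, 40, True),
    Call(0.50, 8, True),
    Call(0.30, 22, True),
]
curve = sweep_allele_balance(example_calls, gq_min=20, lower_cutoffs=[0.1, 0.2])
```

Each row of `curve` corresponds to one dot on a plot: tightening the window removes apparent violations faster than transmitted variants when the violations are mostly artifacts.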

Fig. 2. The effect of combined filters on the number of predicted de novo mutations in exome studies.

The number of candidate de novo variants is shown for each of 149 exome trios. In each column, a point represents the number of de novo mutations per trio. Moving right along the plot, each column adds filters to the column that precedes it. The first column uses only the sample information derived above, where AB is allele balance (alternate reads/(alternate reads + reference reads)) and GQ is genotype quality. The second column adds filters on gnomAD allele frequency (AF); this reduces the average number of candidates. The third column further requires that the variant is “impactful,” according to slivar.
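As a small worked example of the allele-balance definition in this caption, the sketch below computes AB from read counts and applies the sample-level thresholds given in Fig. 1 (GQ of at least 20, AB between 0.2 and 0.8). The function names and the handling of zero-depth sites are assumptions for illustration, not slivar's API.

```python
import math


def allele_balance(ref_reads: int, alt_reads: int) -> float:
    """AB = alternate reads / (alternate reads + reference reads)."""
    total = ref_reads + alt_reads
    if total == 0:
        # No reads at the site: AB is undefined (assumption: report NaN).
        return math.nan
    return alt_reads / total


def passes_sample_filters(
    ref_reads: int,
    alt_reads: int,
    genotype_quality: int,
    ab_range: tuple = (0.2, 0.8),
    gq_min: int = 20,
) -> bool:
    """Keep a heterozygous call only if GQ and AB are within the thresholds."""
    ab = allele_balance(ref_reads, alt_reads)
    # Comparisons with NaN are False, so zero-depth sites are rejected.
    return genotype_quality >= gq_min and ab_range[0] <= ab <= ab_range[1]
```

For instance, a call with 12 reference and 8 alternate reads has AB = 8/20 = 0.4 and passes at GQ 35, while a call with 19 reference reads and 1 alternate read has AB = 0.05 and is rejected.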

Fig. 3. The number of candidate variants that follow different inheritance modes per exome.

Candidate variants for 149 exome trios are separated by inheritance mode and colored by variant class. Variants were deemed impactful by slivar using annotations from VEP, snpEff, and bcftools. Counts for autosomal dominant variants are shown in a separate plot due to the much larger numbers. Each point represents the number of candidate variants for a single family (_y_-axis) passing the inheritance mode (_x_-axis), genotype-quality, population allele-frequency, and allele-balance filters. Gray bars indicate the mean number for each class and inheritance mode. Points are jittered slightly to allow viewing more samples simultaneously.

Fig. 4. Candidate autosomal de novo variants per genome identified by GATK and DeepVariant outside of low-complexity regions.

A cohort of 94 WGS trios from the Rare Genomes Project was screened for candidate de novo mutations using GATK (A) and DeepVariant (B). Variants lying in low-complexity regions were excluded. The leftmost boxplot within each subplot requires a depth ≥10, an allele balance between 0.2 and 0.8, and a genotype quality (GQ) ≥20. Lines within the boxplot are determined from the quartiles of the data. The next box requires that the allele frequency in gnomAD is <0.01. The third box lowers the gnomAD allele-frequency cutoff to <0.001. The final box excludes candidate de novo variants where the allele balance (of the homozygous call) in the parent is ≥2%. Supplementary Fig. 13 presents the analogous plots when including low-complexity regions.
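The four successive boxes in this caption amount to a cumulative filter cascade, which can be sketched as follows. The `TrioCandidate` record and `cumulative_counts` function are hypothetical names for this example. The caption does not state which trio members the depth, AB, and GQ thresholds apply to; this sketch applies them to the child only, which is an assumption.

```python
from dataclasses import dataclass
from typing import Iterable, List, Tuple


@dataclass
class TrioCandidate:
    """One candidate de novo variant (hypothetical record for this sketch)."""
    kid_depth: int
    kid_ab: float          # allele balance of the child's heterozygous call
    kid_gq: int
    gnomad_af: float       # population allele frequency in gnomAD
    max_parent_ab: float   # highest alternate-read fraction in either parent


# Thresholds are taken from the caption; each step is applied on top of the last.
STEPS = [
    ("depth>=10, 0.2<=AB<=0.8, GQ>=20",
     lambda v: v.kid_depth >= 10 and 0.2 <= v.kid_ab <= 0.8 and v.kid_gq >= 20),
    ("gnomAD AF < 0.01", lambda v: v.gnomad_af < 0.01),
    ("gnomAD AF < 0.001", lambda v: v.gnomad_af < 0.001),
    ("parent AB < 0.02", lambda v: v.max_parent_ab < 0.02),
]


def cumulative_counts(candidates: Iterable[TrioCandidate]) -> List[Tuple[str, int]]:
    """Return the number of candidates remaining after each successive step."""
    remaining = list(candidates)
    counts = []
    for label, keep in STEPS:
        remaining = [v for v in remaining if keep(v)]
        counts.append((label, len(remaining)))
    return counts


# Made-up candidates for one trio, for illustration only.
example = [
    TrioCandidate(30, 0.50, 60, 0.0, 0.0),
    TrioCandidate(8, 0.50, 60, 0.0, 0.0),
    TrioCandidate(30, 0.45, 40, 0.005, 0.0),
    TrioCandidate(30, 0.40, 40, 0.05, 0.0),
    TrioCandidate(30, 0.50, 50, 0.0001, 0.05),
]
per_step = cumulative_counts(example)
```

Running this per trio and collecting the counts at each step would give the values summarized by each box in a plot of this kind.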

Fig. 5. The number of candidate variants that follow different inheritance modes per genome using two different variant callers.

(A) Only “impactful” variants, as determined by slivar using annotations from VEP, snpEff, or bcftools, are shown. (B) The set of variants is extended to include synonymous, UTR, and conserved intronic variants (but not all intronic variants). Counts for autosomal dominant variants are shown in a separate plot due to the much larger numbers. Each dot represents the number of candidate variants (_y_-axis) passing the inheritance mode (_x_-axis), genotype-quality, population allele-frequency, and allele-balance filters for a single family. Gray bars indicate the mean number for each class and inheritance mode. We show Fig. 5A for the sarcoma replication cohort in Supplementary Fig. 14.
