
Review

Genotype and SNP calling from next-generation sequencing data

Rasmus Nielsen et al. Nat Rev Genet. 2011 Jun.

Abstract

Meaningful analysis of next-generation sequencing (NGS) data, which are produced extensively by genetics and genomics studies, relies crucially on the accurate calling of SNPs and genotypes. Recently developed statistical methods both improve genotype calling and quantify its considerable uncertainty, and will especially benefit the growing number of studies using low- to medium-coverage data. We review these methods and provide a guide for their use in NGS studies.

Conflict of interest statement

The authors declare no competing financial interests.

Figures

Figure 1

Figure 1. Steps for converting raw next-generation sequencing data into a final set of SNP or genotype calls

Pre-processing steps (shown in yellow) transform the raw data from next-generation sequencing technology into a set of aligned reads that have a measure of confidence, or quality score, associated with the bases of each read. The per-base quality scores produced by base-calling algorithms may need to be recalibrated to accurately reflect the true error rates. Depending on the number of samples and the depth of coverage, either a multi-sample calling procedure (green) or a single-sample calling procedure (orange) may then be applied to obtain SNP or genotype calls and associated quality scores. Note that the multi-sample procedure may include a linkage-based analysis, which can substantially improve the accuracy of SNP or genotype calls. Finally, post-processing (purple) uses both known data and simple heuristics to filter the set of SNPs and/or improve the associated quality scores. Optional, although recommended, steps are shown in dashed lines.
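The per-site calling step in this pipeline can be illustrated with the standard biallelic genotype-likelihood model. This is a minimal sketch, not any particular caller's implementation; the function names and the simple error model (errors spread uniformly over the three other bases) are illustrative assumptions:

```python
def phred_to_error(q):
    """Convert a Phred quality score Q to the base-error probability 10^(-Q/10)."""
    return 10 ** (-q / 10)

def genotype_likelihoods(bases, quals, ref="A", alt="G"):
    """Likelihoods of the genotypes carrying 0, 1 or 2 copies of `alt`,
    given the read bases and Phred qualities observed at one site."""
    liks = []
    for alt_copies in (0, 1, 2):
        p_alt = alt_copies / 2        # probability a read samples the alt allele
        lik = 1.0
        for b, q in zip(bases, quals):
            e = phred_to_error(q)
            # P(observed base | sampled allele); errors spread over the other 3 bases
            p_if_ref = (1 - e) if b == ref else e / 3
            p_if_alt = (1 - e) if b == alt else e / 3
            lik *= (1 - p_alt) * p_if_ref + p_alt * p_if_alt
        liks.append(lik)
    return liks

# Four reads (two ref, two alt) at Q30 strongly favour the heterozygote.
liks = genotype_likelihoods(list("AAGG"), [30, 30, 30, 30])
best_genotype = max(range(3), key=lambda g: liks[g])
```

Multi-sample callers combine such per-individual likelihoods with a shared prior estimated across samples, which is why pooled calling in the figure outperforms single-sample calling at low coverage.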

Figure 2

Figure 2. A comparison of three genotype callers

A subset of the data (chromosome 20, bases 20,000,000–25,000,000) for the 62 CEU individuals in both the HapMap Public Release no. 28 and the 1000 Genomes Pilot Project was genotype-called using the following methods: GATK Unified Genotyper, applied to each individual independently (blue); GATK Unified Genotyper applied to all individuals collectively (red); and GATK Unified Genotyper applied to all individuals collectively, followed by Beagle using linkage disequilibrium (LD) information for genotype calling (black). For each of several quality thresholds, genotype calls with quality above the threshold were compared to HapMap data; each threshold therefore yields both a call rate (the proportion of HapMap genotypes called) and a concordance with HapMap. For high call rates, genotyping the individuals collectively and using the LD-based method Beagle provided marked improvements.
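The call-rate/concordance trade-off underlying this comparison can be sketched as a simple threshold sweep against a truth set (a toy illustration with made-up genotype tuples, not the actual evaluation code):

```python
def callrate_concordance(calls, thresholds):
    """For each genotype-quality threshold, return (call rate, concordance)
    of the retained calls against a truth set.
    `calls` is a list of (quality, called_genotype, true_genotype) tuples."""
    curve = []
    for t in thresholds:
        kept = [(c, truth) for q, c, truth in calls if q > t]
        rate = len(kept) / len(calls)
        conc = sum(c == truth for c, truth in kept) / len(kept) if kept else float("nan")
        curve.append((rate, conc))
    return curve

# Toy truth set: raising the threshold drops the one discordant, low-quality call.
demo = [(50, "AG", "AG"), (40, "AA", "AA"), (10, "GG", "AG"), (30, "AG", "AG")]
curve = callrate_concordance(demo, thresholds=[0, 20])
```

Sweeping the threshold traces out exactly the kind of accuracy-versus-call-rate curve plotted in the figure.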

Figure 3

Figure 3. The power of association mapping for next-generation sequencing data

Simulations of the power to detect association at the 5% significance level (dashed line) using various approaches to genotype calling. For each effect size, 50,000 simulations were performed for 1,000 cases and 1,000 controls, assuming a population minor allele frequency (MAF) of 1% and a disease prevalence of 10%. The individual depth was simulated assuming a Poisson distribution with mean coverage of 4×, and the sequence reads were sampled from the true genotypes assuming an error rate of 1%. Genotype probabilities were calculated either by assuming a uniform genotype prior (red and light green) or by using the inferred MAF and Hardy–Weinberg equilibrium (purple and orange). Genotypes were called based on either the highest genotype probability (red and purple) or only for genotypes with a posterior genotype probability (PP) >95% (light green). The called genotypes were tested using logistic regression, whereas the score statistic uses the genotype probabilities directly and therefore effectively integrates over the uncertainty in the genotype calls.
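The Hardy–Weinberg prior used in the purple/orange setting can be sketched as follows. This is a minimal illustration of combining genotype likelihoods with an HWE prior built from the MAF; the function name and example numbers are assumptions, not the simulation code:

```python
def genotype_posteriors(liks, maf):
    """Combine per-genotype likelihoods with a Hardy-Weinberg prior built
    from the minor allele frequency p: P(g) = (1-p)^2, 2p(1-p), p^2 for
    0, 1 and 2 copies of the minor allele."""
    p = maf
    prior = [(1 - p) ** 2, 2 * p * (1 - p), p ** 2]
    joint = [l * pr for l, pr in zip(liks, prior)]
    total = sum(joint)
    return [j / total for j in joint]

# With a 1% MAF, the prior discounts weak evidence for minor-allele genotypes.
post = genotype_posteriors([0.01, 0.2, 0.79], maf=0.01)
```

A score test that weights each individual by these posterior probabilities, rather than by a single hard call, is what lets the method integrate over genotype uncertainty.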

Figure 4

Figure 4. The site frequency spectrum in next-generation sequencing data

Fifty megabases of sequence were simulated for 50 individuals assuming a mean (Poisson-distributed) sequencing depth of 4× per individual, a per-site error rate of 0.003 and that 2% of all sites were variable. Because of the presence of missing data for one method, the data for all methods were subsampled down to a sample size of 20 chromosomes, using only called genotypes. Panels a and b show the site frequency spectrum (SFS) for all sites and for sites with a probability of harbouring a SNP of >95%, respectively. Both panels show the true SFS (True), the SFS using genotype calls (GC) obtained by always choosing the genotype with the highest posterior probability (Max (GC)), and the SFS obtained when only calling genotypes with a posterior probability of >95% (GC >0.95). Notice that genotype-based inferences tend to overestimate the number of singletons. The excess of singletons can be reduced or eliminated by using priors, or filtering processes, that are biased against singletons. However, such procedures will typically tend to introduce other biases.
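Tabulating an SFS from called genotypes is straightforward, which is why call errors propagate so directly into it: every false heterozygote at a monomorphic site becomes a spurious singleton. A minimal sketch (illustrative function, not the simulation pipeline):

```python
from collections import Counter

def sfs_from_genotypes(site_genotypes, n_chrom):
    """Tabulate the site frequency spectrum from called genotypes:
    for each site, sum the alternative-allele copies across individuals
    and bin the variable sites (monomorphic and fixed sites are dropped)."""
    counts = Counter()
    for gts in site_genotypes:      # gts: alt-allele copies (0/1/2) per individual
        k = sum(gts)
        if 0 < k < n_chrom:
            counts[k] += 1
    return [counts[i] for i in range(1, n_chrom)]

# Three diploids (6 chromosomes): one singleton, one doubleton, one fixed site.
sites = [[1, 0, 0], [1, 1, 0], [2, 2, 2], [0, 0, 0]]
sfs = sfs_from_genotypes(sites, n_chrom=6)
```

Methods that estimate the SFS directly from genotype likelihoods, rather than from hard calls, avoid the singleton inflation shown in the figure.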
