SNP detection and genotyping from low-coverage sequencing data on multiple diploid samples

Si Quang Le et al. Genome Res. 2011 Jun.

Abstract

Reductions in the cost of sequencing have enabled whole-genome sequencing to identify sequence variants segregating in a population. An efficient approach is to sequence many samples at low coverage, then to combine data across samples to detect shared variants. Here, we present methods to discover and genotype single-nucleotide polymorphism (SNP) sites from low-coverage sequencing data, making use of shared haplotype (linkage disequilibrium) information. For each population, we first collect SNP candidates based on independent sequence calls per site. We then use MARGARITA with genotype or phased haplotype data from the same samples to collect 20 ancestral recombination graphs (ARGs). We refine the posterior probability of SNP candidates by considering possible mutations at internal branches of the 40 marginal ancestral trees inferred from the 20 ARGs at the left and right flanking genotype sites. Using a population genetic prior distribution on tree-branch length and Bayesian inference, we determine a posterior probability of the SNP being real and also the most probable phased genotype call for each individual. We present experiments on both simulated data and real data from the 1000 Genomes Project to demonstrate the applicability of the methods. We also explore the relative tradeoff between sequencing depth and the number of sequenced samples.

Figures

Figure 1.

Discovery and false-positive rates of QCALL for 400 samples with 4.0× coverage sequencing data. LDA and LDA,FW10 denote linkage disequilibrium analysis without and with FW10, respectively. The same notation applies to NLDA.

Figure 2.

SNP discovery power for different sequencing strategies, all using 1600× data, plotted as a function of the number of non-reference alleles present in the sequenced samples.

Figure 3.

SNP discovery power for different sequencing strategies as a function of the non-reference allele frequency in the population. The continuous lines show empirical results from the simulation with the allele frequency estimated from all 3000 simulated haplotypes, and the dashed lines present calculations based on sampling with marginal discovery rates per sample from Figure 2.
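
The dashed-line calculation described in this caption can be sketched as follows. This is an illustration only, not the authors' code: it assumes that the non-reference allele count among the 2n sampled haplotypes is binomially distributed given the population frequency, and that discovery power is the expectation of the marginal discovery rate over that count. The function names and the toy rate curve are hypothetical; the real marginal rates would be read off Figure 2.

```python
from math import exp, lgamma, log


def binomial_pmf(k, n, p):
    """Binomial probability of k successes in n trials (requires 0 < p < 1)."""
    log_coeff = lgamma(n + 1) - lgamma(k + 1) - lgamma(n - k + 1)
    return exp(log_coeff + k * log(p) + (n - k) * log(1 - p))


def discovery_power(freq, n_samples, marginal_rate):
    """Expected discovery rate for a variant at population frequency `freq`.

    marginal_rate(k) is the probability of discovering a SNP that has k
    non-reference alleles among the sequenced samples (assumed input).
    """
    n_hap = 2 * n_samples  # diploid samples carry two haplotypes each
    power = 0.0
    for k in range(1, n_hap + 1):  # k = 0 means the variant was not sampled
        power += binomial_pmf(k, n_hap, freq) * marginal_rate(k)
    return power


def toy_rate(k):
    """Hypothetical marginal rate curve, NOT taken from the paper."""
    return min(1.0, 0.3 * k)


if __name__ == "__main__":
    for freq in (0.001, 0.01, 0.05):
        print(freq, discovery_power(freq, n_samples=400, marginal_rate=toy_rate))
```

With a marginal rate of 1 for every k, the sum reduces to the probability that the variant is present in the sample at all, 1 - (1 - f)^(2n), which gives a quick sanity check on the calculation.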

Figure 4.

Marginal discovery rates as a function of non-reference allele count in 43 samples, from the CEU simulation and from 1000 Genomes Project data evaluated at HapMap 2 sites not in HapMap 3, on the 43 sequenced samples overlapping HapMap 2.

Figure 5.

An illustrative example of a coalescent tree for four samples (eight haplotypes). Given a value at the root, A in this example, and a mutation from A to C in this example, we can infer genotypes for the four samples and, hence, compute the probability of data D conditional on this configuration. We estimate the likelihood of D given a tree t, p(D | t), by summing over all possible root values and mutations in t.
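
The summation in this caption can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the QCALL implementation: the tree is reduced to a list of branches (each a set of descendant haplotype indices plus a length), exactly one mutation is placed on the tree, the root allele is uniform over A, C, G, T, the mutated branch is chosen in proportion to its length, the derived allele is uniform over the other three bases, and reads follow a simple symmetric error model. All names and the error rate are hypothetical.

```python
BASES = "ACGT"


def read_likelihood(reads, hap_a, hap_b, err=0.01):
    """p(reads | diploid genotype hap_a/hap_b) under a symmetric error model."""
    lik = 1.0
    for base in reads:
        p_a = 1 - err if base == hap_a else err / 3
        p_b = 1 - err if base == hap_b else err / 3
        lik *= 0.5 * (p_a + p_b)  # each read comes from either haplotype
    return lik


def tree_likelihood(branches, reads_per_sample, err=0.01):
    """Approximate p(D | t) by summing over root alleles and single mutations.

    branches: list of (set of descendant haplotype indices, branch length).
    reads_per_sample: one string of observed bases per diploid sample;
    sample i owns haplotypes 2*i and 2*i + 1 (an assumed layout).
    """
    n_hap = 2 * len(reads_per_sample)
    total_length = sum(length for _, length in branches)
    total = 0.0
    for root in BASES:
        for descendants, length in branches:
            for derived in BASES:
                if derived == root:
                    continue
                # Haplotypes below the mutated branch carry the derived allele.
                haps = [derived if h in descendants else root for h in range(n_hap)]
                config_lik = 1.0
                for i, reads in enumerate(reads_per_sample):
                    config_lik *= read_likelihood(reads, haps[2 * i], haps[2 * i + 1], err)
                # Assumed prior: uniform root, branch by length, uniform derived base.
                weight = (1 / 4) * (length / total_length) * (1 / 3)
                total += weight * config_lik
    return total


if __name__ == "__main__":
    # Two samples (four haplotypes); toy tree with two mutable branches.
    toy_branches = [({0}, 1.0), ({2, 3}, 2.0)]
    print(tree_likelihood(toy_branches, ["ACAC", "AAAA"]))
```

Because the read model is symmetric in the two haplotypes of a sample, a mutation on either of that sample's two terminal branches gives the same contribution, which is the ambiguity illustrated in Figure 6.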

Figure 6.

Two mutations on the two edges of a singleton (the edges connected to the fourth or the eighth haplotype) lead to the same genotype configuration.
