The characterization of twenty sequenced human genomes - PubMed

PLoS Genet. 2010 Sep 9;6(9):e1001111.

doi: 10.1371/journal.pgen.1001111.

Kimberly Pelak, Kevin V Shianna, Dongliang Ge, Jessica M Maia, Mingfu Zhu, Jason P Smith, Elizabeth T Cirulli, Jacques Fellay, Samuel P Dickson, Curtis E Gumbs, Erin L Heinzen, Anna C Need, Elizabeth K Ruzzo, Abanish Singh, C Ryan Campbell, Linda K Hong, Katharina A Lornsen, Alexander M McKenzie, Nara L M Sobreira, Julie E Hoover-Fong, Joshua D Milner, Ruth Ottman, Barton F Haynes, James J Goedert, David B Goldstein

Kimberly Pelak et al. PLoS Genet. 2010.

Abstract

We present the analysis of twenty human genomes to evaluate the prospects for identifying rare functional variants that contribute to a phenotype of interest. We sequenced at high coverage ten "case" genomes from individuals with severe hemophilia A and ten "control" genomes. We summarize the number of genetic variants emerging from a study of this magnitude, and provide a proof of concept for the identification of rare and highly-penetrant functional variants by confirming that the cause of hemophilia A is easily recognizable in this data set. We also show that the number of novel single nucleotide variants (SNVs) discovered per genome seems to stabilize at about 144,000 new variants per genome, after the first 15 individuals have been sequenced. Finally, we find that, on average, each genome carries 165 homozygous protein-truncating or stop loss variants in genes representing a diverse set of pathways.

Conflict of interest statement

The authors have declared that no competing interests exist.

Figures

Figure 1. Average per-genome overlap between SNVs in genomic databases and SNVs identified by whole-genome sequencing.

On average, 3,473,639 SNVs were observed in each genome (Table S2). A per-genome average of 87.28% of these SNVs were present in the dbSNP database (version 129, validated) (Table S3).
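As a rough worked example, the two caption figures can be combined to estimate how many SNVs per genome fall inside and outside dbSNP. This is only an approximation: the 87.28% is a per-genome average of percentages, so multiplying it by the average SNV count need not reproduce the values in Tables S2 and S3. The variable names are illustrative.

```python
# Figures quoted in the caption above.
avg_snvs_per_genome = 3_473_639   # average SNVs per genome (Table S2)
dbsnp_fraction = 0.8728           # average share found in dbSNP 129 (Table S3)

# Approximate split; assumes the average percentage applies to the average count.
in_dbsnp = round(avg_snvs_per_genome * dbsnp_fraction)
not_in_dbsnp = avg_snvs_per_genome - in_dbsnp

print(f"approx. in dbSNP:     {in_dbsnp:,}")
print(f"approx. not in dbSNP: {not_in_dbsnp:,}")
```

Under that assumption, roughly 3.03 million SNVs per genome are already catalogued and roughly 0.44 million are not.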

Figure 2. Concordance between sequencing and genotyping calls.

The sequenced samples were also run on either the Illumina Human 1M-Duo v3 BeadChip or the Illumina 610-Quad BeadChip. The concordance rate between the sequencing and the Illumina BeadChip genotype calls is plotted against sequencing coverage of the autosomes. A data point is plotted for each of the twenty genomes.
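A concordance rate of this kind can be sketched as the fraction of sites called on both platforms where the genotypes agree. The function below is a minimal illustration, not the authors' pipeline: the dictionary layout, the site keys, and the choice to skip sites missing on either platform are all assumptions.

```python
def normalize(genotype):
    """Treat a genotype as an unordered pair of alleles, so 'AG' equals 'GA'."""
    return tuple(sorted(genotype))


def concordance_rate(seq_calls, chip_calls):
    """Fraction of shared, non-missing sites where both platforms agree.

    Both arguments map a site identifier to a two-letter genotype string,
    or to None when no call was made (hypothetical layout).
    """
    shared = [
        site for site in seq_calls
        if seq_calls[site] is not None and chip_calls.get(site) is not None
    ]
    if not shared:
        raise ValueError("no sites were called on both platforms")
    matches = sum(
        1 for site in shared
        if normalize(seq_calls[site]) == normalize(chip_calls[site])
    )
    return matches / len(shared)


# Toy data: four shared sites, one discordant, one missing on the chip.
seq = {"rs1": "AG", "rs2": "CC", "rs3": "TT", "rs4": "AC", "rs5": "GG"}
chip = {"rs1": "GA", "rs2": "CC", "rs3": "CT", "rs4": "AC", "rs5": None}
rate = concordance_rate(seq, chip)
```

Plotting one such rate per genome against that genome's autosomal coverage would give a figure of the kind described.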

Figure 3. Coding indel length distribution.

Shown is a side-by-side comparison of the lengths of the coding indels in this study and in a previous publication. (A) Indel lengths observed in J.C. Venter's exome versus (B) indel lengths observed in this study. The data from our study have been restricted to the canonical genes or transcripts that are captured by the Agilent SureSelect Targeted Enrichment system. Indels whose length is a multiple of 3 bp are marked in green.
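The multiple-of-3 distinction matters because a coding indel whose length is divisible by 3 keeps the reading frame, while any other length shifts it. A minimal sketch of that classification follows; the signed-length convention (negative for deletions) and the function names are assumptions made for illustration.

```python
from collections import Counter


def classify_coding_indel(length):
    """Label a coding indel by its effect on the reading frame.

    `length` is in base pairs; negative values denote deletions
    (hypothetical convention). Zero is not a valid indel length.
    """
    if length == 0:
        raise ValueError("indel length must be non-zero")
    return "in-frame" if abs(length) % 3 == 0 else "frameshift"


def summarize(lengths):
    """Count in-frame versus frameshift indels in a list of lengths."""
    return Counter(classify_coding_indel(n) for n in lengths)


# Toy lengths only; these are not data from the study.
summary = summarize([1, -1, 3, -3, 2, 6, -4])
```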

Figure 4. Rank of the F8 gene as the number of control genomes increases.

The gene ranking was ordered by the number of case genomes that carried protein-truncating or stop loss variants, in homozygous form or on the X-chromosome, that were not present in control genomes in homozygous form. Ranking was performed with a “gene prioritization” function implemented in the SVA software tool (Text S1). Protein-truncating variants were defined as SNVs that cause a premature stop codon, and insertions or deletions that cause a frameshift coding change. The ranks represent an average taken from five permutations. When comparing 10 hemophilia cases to just one control, F8 ranks in the top 40 genes. Once 5 or more controls are available, it ranks in the top 5 genes.
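The ranking idea in this caption can be sketched as follows: discard qualifying variants that also appear in a control, then order genes by how many case genomes still carry at least one. This is a toy illustration and not the SVA "gene prioritization" function; the input layout, the variant identifiers, and the tie handling (tied genes share the best rank) are assumptions.

```python
from collections import defaultdict


def rank_genes(cases, controls):
    """Rank genes by the number of case genomes with a case-only variant.

    Each genome is a dict mapping a gene name to the set of qualifying
    variant IDs it carries (homozygous or X-linked protein-truncating or
    stop-loss variants). Returns a dict of gene -> rank, where 1 is best.
    """
    control_variants = set()
    for genome in controls:
        for variants in genome.values():
            control_variants |= variants

    case_counts = defaultdict(int)
    for genome in cases:
        for gene, variants in genome.items():
            if variants - control_variants:   # at least one case-only variant
                case_counts[gene] += 1

    # Competition ranking: 1 + number of genes with a strictly higher count.
    return {
        gene: 1 + sum(1 for other in case_counts.values() if other > count)
        for gene, count in case_counts.items()
    }


# Toy data: three cases carry different F8 variants; GENE_X is hypothetical.
cases = [
    {"F8": {"F8:a"}, "GENE_X": {"X:1"}},
    {"F8": {"F8:b"}, "GENE_X": {"X:1"}},
    {"F8": {"F8:c"}},
]
ranks_without_controls = rank_genes(cases, [])
ranks_with_control = rank_genes(cases, [{"GENE_X": {"X:1"}}])
```

In the toy data, a single control carrying the GENE_X variant removes that gene from the ranking. Counting at the gene level lets cases with different variants in the same gene reinforce one another.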

Figure 5. Number of novel SNVs and novel knocked-out genes as the number of genomes increases.

The total number of novel variants and the total number of novel genes containing protein-truncating or stop loss variants continue to drop as additional genomes are added to the analysis. Shown are the numbers of unique SNVs (A) and unique genes carrying a homozygous protein-truncating or stop loss variant (B) per genome, as a function of the number of genomes already considered. The genomes were added in a random order in both analyses, and 1000 permutations were performed and averaged.
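The permutation procedure described here can be sketched as: shuffle the genomes, count how many variants each newly added genome contributes that were not seen in earlier ones, and average those counts by position over many shuffles. The code below is a minimal illustration with genomes represented as sets of variant IDs; the function names and the fixed seed are assumptions, not the authors' implementation.

```python
import random


def novel_counts_one_order(genomes):
    """Number of previously unseen variants contributed by each genome, in order."""
    seen = set()
    counts = []
    for variants in genomes:
        counts.append(len(variants - seen))
        seen |= variants
    return counts


def mean_novel_counts(genomes, n_permutations=1000, seed=0):
    """Average novel-variant count per position over random genome orderings."""
    rng = random.Random(seed)          # seeded so the sketch is reproducible
    order = list(genomes)
    totals = [0] * len(order)
    for _ in range(n_permutations):
        rng.shuffle(order)
        for position, count in enumerate(novel_counts_one_order(order)):
            totals[position] += count
    return [total / n_permutations for total in totals]


# Toy genomes with overlapping variants; real inputs would hold millions of SNVs.
toy_genomes = [{1, 2, 3}, {2, 3, 4}, {3, 4, 5}]
averages = mean_novel_counts(toy_genomes, n_permutations=200)
```

The same routine applies to panel B if each set holds gene names rather than variant IDs.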
