Characterization of missing human genome sequences and copy-number polymorphic insertions

Nature Methods volume 7, pages 365–371 (2010)

Abstract

The extent of human genomic structural variation suggests that there must be portions of the genome yet to be discovered, annotated and characterized at the sequence level. We present a resource and analysis of 2,363 new insertion sequences corresponding to 720 genomic loci. We found that a substantial fraction of these sequences are either missing, fragmented or misassigned when compared to recent de novo sequence assemblies from short-read next-generation sequence data. We determined that 18–37% of these new insertions are copy-number polymorphic, including loci that show extensive population stratification among Europeans, Asians and Africans. Complete sequencing of 156 of these insertions identified new exons and conserved noncoding sequences not yet represented in the reference genome. We developed a method to accurately genotype these new insertions by mapping next-generation sequencing datasets to the breakpoint, thereby providing a means to characterize copy-number status for regions previously inaccessible to single-nucleotide polymorphism microarrays.
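The breakpoint-based genotyping idea mentioned at the end of the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes that reads have already been mapped and counted at the junction of the reference (empty-site) allele and at the two junctions of the insertion allele, and the names `BreakpointCounts` and `call_genotype`, as well as the thresholds `min_reads`, `het_low` and `het_high`, are hypothetical values chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass
class BreakpointCounts:
    """Read counts at one insertion locus (hypothetical container)."""
    ref_reads: int        # reads spanning the empty-site (reference) junction
    ins_left_reads: int   # reads spanning the left insertion junction
    ins_right_reads: int  # reads spanning the right insertion junction


def call_genotype(counts, min_reads=4, het_low=0.2, het_high=0.8):
    """Call a genotype from breakpoint-spanning read counts.

    The insertion allele has two junctions and the reference allele one,
    so the two insertion counts are averaged before comparison.
    Thresholds are illustrative assumptions, not values from the paper.
    """
    ins_reads = (counts.ins_left_reads + counts.ins_right_reads) / 2
    total = counts.ref_reads + ins_reads
    if total < min_reads:
        return "no-call"  # too little evidence at the breakpoints
    ins_fraction = ins_reads / total
    if ins_fraction < het_low:
        return "ref/ref"
    if ins_fraction > het_high:
        return "ins/ins"
    return "ref/ins"


if __name__ == "__main__":
    print(call_genotype(BreakpointCounts(10, 9, 11)))
```

A real pipeline would also need to handle mapping quality, repeats near the breakpoint and sequencing depth, which this sketch ignores.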

Accession codes

Accessions

Gene Expression Omnibus

Acknowledgements

We thank C. Campbell, G. Cooper and T. Marques-Bonet for thoughtful discussion, P. Sudmant for assistance with Illumina sequence data, and members of the University of Washington and Washington University Genome Centers for assistance with data generation. J.M.K. is supported by a US National Science Foundation Graduate Research Fellowship. This work was supported by the US National Institutes of Health grant HG004120 to E.E.E. E.E.E. receives funds as an Investigator of the Howard Hughes Medical Institute.

Author information

Authors and Affiliations

  1. Department of Genome Sciences, University of Washington School of Medicine, Seattle, Washington, USA
    Jeffrey M Kidd, Francesca Antonacci, Hillary S Hayden, Can Alkan, Maika Malig, Rajinder Kaul & Evan E Eichler
  2. Agilent Laboratories, Santa Clara, California, USA
    Nick Sampas, Paige Anderson, Anya Tsalenko, N Alice Yamada, Peter Tsang & Laurakay Bruhn
  3. Washington University Genome Sequencing Center, School of Medicine, St. Louis, Missouri, USA
    Tina Graves, Robert Fulton, Joelle Kallicki & Richard K Wilson
  4. Department of Genetics and Microbiology, University of Bari, Bari, Italy
    Mario Ventura & Giuliana Giannuzzi
  5. Howard Hughes Medical Institute, Seattle, Washington, USA
    Evan E Eichler

Authors

  1. Jeffrey M Kidd
  2. Nick Sampas
  3. Francesca Antonacci
  4. Tina Graves
  5. Robert Fulton
  6. Hillary S Hayden
  7. Can Alkan
  8. Maika Malig
  9. Mario Ventura
  10. Giuliana Giannuzzi
  11. Joelle Kallicki
  12. Paige Anderson
  13. Anya Tsalenko
  14. N Alice Yamada
  15. Peter Tsang
  16. Rajinder Kaul
  17. Richard K Wilson
  18. Laurakay Bruhn
  19. Evan E Eichler

Contributions

J.M.K., N.S., F.A., A.T., R.K. and E.E.E. analyzed data. N.S., P.A., A.T., N.A.Y., P.T. and L.B. performed array CGH and copy-number analysis. F.A., M.V. and G.G. performed FISH experiments. C.A. assembled contigs. T.G., R.F., H.S.H., M.M., J.K., R.K. and R.K.W. performed clone characterization and sequencing. J.M.K., R.K., L.B. and E.E.E. designed the study. J.M.K. and E.E.E. wrote the paper with contributions from the other authors.

Corresponding author

Correspondence to Evan E Eichler.

Ethics declarations

Competing interests

N.S., P.A., A.T., N.A.Y., P.T. and L.B. are employees of Agilent Technologies. E.E.E. is a scientific advisory board member for Pacific Biosciences.

Cite this article

Kidd, J., Sampas, N., Antonacci, F. et al. Characterization of missing human genome sequences and copy-number polymorphic insertions. Nat Methods 7, 365–371 (2010). https://doi.org/10.1038/nmeth.1451
