Characterization of missing human genome sequences and copy-number polymorphic insertions

Nature Methods volume 7, pages 365–371 (2010)



The extent of human genomic structural variation suggests that there must be portions of the genome yet to be discovered, annotated and characterized at the sequence level. We present a resource and analysis of 2,363 new insertion sequences corresponding to 720 genomic loci. We found that a substantial fraction of these sequences are either missing, fragmented or misassigned when compared to recent de novo sequence assemblies from short-read next-generation sequence data. We determined that 18–37% of these new insertions are copy-number polymorphic, including loci that show extensive population stratification among Europeans, Asians and Africans. Complete sequencing of 156 of these insertions identified new exons and conserved noncoding sequences not yet represented in the reference genome. We developed a method to accurately genotype these new insertions by mapping next-generation sequencing datasets to the breakpoint, thereby providing a means to characterize copy-number status for regions previously inaccessible to single-nucleotide polymorphism microarrays.
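The breakpoint-based genotyping idea in the last sentence can be illustrated with a small sketch. This is not the authors' method: it assumes reads have already been counted as supporting either the reference (empty-site) junction or the insertion breakpoint, and it applies a simple binomial likelihood over three diploid genotypes. The function name `genotype_insertion` and the `error` rate are hypothetical choices made for illustration only.

```python
from math import comb, log


def genotype_insertion(ref_reads, ins_reads, error=0.01):
    """Return the most likely insertion copy number (0, 1 or 2).

    ref_reads: reads spanning the reference (no-insertion) junction.
    ins_reads: reads spanning the insertion breakpoint.
    error:     assumed fraction of reads supporting the wrong allele
               (hypothetical value, not taken from the paper).
    """
    n = ref_reads + ins_reads
    if n == 0:
        return None  # no informative reads at this locus

    # Expected fraction of insertion-supporting reads for each genotype.
    expected = {0: error, 1: 0.5, 2: 1.0 - error}

    def log_likelihood(p):
        # Binomial log-likelihood of seeing ins_reads out of n reads.
        return log(comb(n, ins_reads)) + ins_reads * log(p) + ref_reads * log(1.0 - p)

    # Pick the genotype whose expected fraction best explains the counts.
    return max(expected, key=lambda g: log_likelihood(expected[g]))
```

For example, `genotype_insertion(10, 10)` favors a single-copy (heterozygous) call, while a locus with no reads across either junction returns `None` rather than a guess.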

Accession codes


Gene Expression Omnibus


Acknowledgements

We thank C. Campbell, G. Cooper and T. Marques-Bonet for thoughtful discussion, P. Sudmant for assistance with Illumina sequence data, and members of the University of Washington and Washington University Genome Centers for assistance with data generation. J.M.K. is supported by a US National Science Foundation Graduate Research Fellowship. This work was supported by the US National Institutes of Health grant HG004120 to E.E.E. E.E.E. receives funds as an Investigator of the Howard Hughes Medical Institute.

Author information

Authors and Affiliations

  1. Department of Genome Sciences, University of Washington School of Medicine, Seattle, Washington, USA
    Jeffrey M Kidd, Francesca Antonacci, Hillary S Hayden, Can Alkan, Maika Malig, Rajinder Kaul & Evan E Eichler
  2. Agilent Laboratories, Santa Clara, California, USA
    Nick Sampas, Paige Anderson, Anya Tsalenko, N Alice Yamada, Peter Tsang & Laurakay Bruhn
  3. Washington University Genome Sequencing Center, School of Medicine, St. Louis, Missouri, USA
    Tina Graves, Robert Fulton, Joelle Kallicki & Richard K Wilson
  4. Department of Genetics and Microbiology, University of Bari, Bari, Italy
    Mario Ventura & Giuliana Giannuzzi
  5. Howard Hughes Medical Institute, Seattle, Washington, USA
    Evan E Eichler


  1. Jeffrey M Kidd
  2. Nick Sampas
  3. Francesca Antonacci
  4. Tina Graves
  5. Robert Fulton
  6. Hillary S Hayden
  7. Can Alkan
  8. Maika Malig
  9. Mario Ventura
  10. Giuliana Giannuzzi
  11. Joelle Kallicki
  12. Paige Anderson
  13. Anya Tsalenko
  14. N Alice Yamada
  15. Peter Tsang
  16. Rajinder Kaul
  17. Richard K Wilson
  18. Laurakay Bruhn
  19. Evan E Eichler

Contributions

J.M.K., N.S., F.A., A.T., R.K. and E.E.E. analyzed data. N.S., P.A., A.T., N.A.Y., P.T. and L.B. performed array CGH and copy-number analysis. F.A., M.V. and G.G. performed FISH experiments. C.A. assembled contigs. T.G., R.F., H.S.H., M.M., J.K., R.K. and R.K.W. performed clone characterization and sequencing. J.M.K., R.K., L.B. and E.E.E. designed the study. J.M.K. and E.E.E. wrote the paper with contributions from the other authors.

Corresponding author

Correspondence to Evan E Eichler.

Ethics declarations

Competing interests

N.S., P.A., A.T., N.A.Y., P.T. and L.B. are employees of Agilent Technologies. E.E.E. is a scientific advisory board member for Pacific Biosciences.
