SOAPindel: Efficient identification of indels from short paired reads

  1. Ruiqiang Li1,7,
  2. Heng Li3,
  3. Jianliang Lu1,
  4. Yingrui Li1,
  5. Lars Bolund1,4,
  6. Mikkel H. Schierup2,8 and
  7. Jun Wang1,5,6,8

  1 BGI Shenzhen, Shenzhen 518000, China;
  2 Bioinformatics Research Centre, Aarhus University, DK 8000 Aarhus C, Denmark;
  3 Broad Institute, Cambridge, Massachusetts 02142, USA;
  4 Human Genetics, Aarhus University, DK 8000 Aarhus C, Denmark;
  5 The Novo Nordisk Foundation Center for Basic Metabolic Research, University of Copenhagen, 2200 Copenhagen, Denmark;
  6 Department of Biology, University of Copenhagen, 2200 Copenhagen, Denmark

Abstract

We present a new approach to indel calling that explicitly exploits the fact that indel differences between a reference and a sequenced sample make read mapping less efficient. We assign all unmapped reads with a mapped partner to their expected genomic positions and then perform extensive de novo assembly on the regions with many unmapped reads to resolve homozygous, heterozygous, and complex indels by exhaustive traversal of the de Bruijn graph. The method is implemented in the software SOAPindel and provides a list of candidate indels with quality scores. We compare SOAPindel to Dindel, Pindel, and GATK on simulated data and find similar or better performance for short indels (<10 bp) and higher sensitivity and specificity for long indels. A validation experiment suggests that SOAPindel has a false-positive rate of ∼10% for long indels (>5 bp), while still providing many more candidate indels than other approaches.
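
To make the idea concrete, the following Python sketch illustrates the three steps named above on a toy example: placing an unmapped read from its mapped partner, building a de Bruijn graph from the reads of a region, and exhaustively enumerating paths between two anchor k-mers. It is not the SOAPindel implementation. All function names, the forward/reverse read orientation, the definition of insert size as the outer fragment length, the choice k = 5, and the toy sequences are assumptions made for illustration only.

```python
from collections import defaultdict


def expected_mate_start(mapped_start, mapped_is_forward, insert_size, read_len):
    """Estimate the leftmost position of an unmapped mate from its mapped partner.

    Assumption: forward/reverse pairs, insert_size = outer fragment length.
    """
    if mapped_is_forward:
        return mapped_start + insert_size - read_len
    return mapped_start + read_len - insert_size


def build_de_bruijn(reads, k):
    """Map each (k-1)-mer to the set of (k-1)-mers that follow it in some read."""
    graph = defaultdict(set)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            graph[kmer[:-1]].add(kmer[1:])
    return graph


def enumerate_paths(graph, source, sink, max_len):
    """Exhaustively list the sequences spelled by paths from source to sink.

    max_len bounds the spelled sequence so that cycles cannot loop forever.
    """
    results = []
    stack = [(source, source)]
    while stack:
        node, seq = stack.pop()
        if node == sink:
            results.append(seq)
            continue
        if len(seq) >= max_len:
            continue
        for nxt in sorted(graph.get(node, ())):
            stack.append((nxt, seq + nxt[-1]))
    return results


def call_indel(ref, alt):
    """Describe alt relative to ref by trimming the shared prefix, then suffix.

    Returns (kind, offset, ref_part, alt_part), or None if alt equals ref.
    No left-normalization is attempted; "complex" also covers substitutions.
    """
    if ref == alt:
        return None
    shortest = min(len(ref), len(alt))
    p = 0
    while p < shortest and ref[p] == alt[p]:
        p += 1
    s = 0
    while s < shortest - p and ref[-1 - s] == alt[-1 - s]:
        s += 1
    ref_part = ref[p:len(ref) - s]
    alt_part = alt[p:len(alt) - s]
    if not ref_part:
        kind = "insertion"
    elif not alt_part:
        kind = "deletion"
    else:
        kind = "complex"
    return (kind, p, ref_part, alt_part)


# Toy heterozygous sample: one haplotype matches the reference window,
# the other carries a 3-bp deletion. Reads are error-free 8-mers.
ref = "ACGTTGCAGGATCCATGA"
alt = "ACGTTGCATCCATGA"
reads = [h[i:i + 8] for h in (ref, alt) for i in range(len(h) - 7)]

k = 5
graph = build_de_bruijn(reads, k)
# Anchor the traversal on the first and last (k-1)-mers of the reference window.
haplotypes = enumerate_paths(graph, ref[:k - 1], ref[-(k - 1):], len(ref) + 10)
calls = [c for c in (call_indel(ref, h) for h in sorted(haplotypes)) if c]
# Simplified rule: a surviving reference path suggests a heterozygous site.
zygosity = "heterozygous" if ref in haplotypes else "homozygous"
```

In this sketch, every path between the two anchors is treated as a candidate haplotype and compared with the reference window. A real caller would additionally need to handle sequencing errors, repeats, and read depth, and to attach the quality scores mentioned in the abstract; none of that is modeled here.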

Footnotes

This article is distributed exclusively by Cold Spring Harbor Laboratory Press for the first six months after the full-issue publication date (see http://genome.cshlp.org/site/misc/terms.xhtml). After six months, it is available under a Creative Commons License (Attribution-NonCommercial 3.0 Unported License), as described at http://creativecommons.org/licenses/by-nc/3.0/.