SHRiMP: accurate mapping of short color-space reads

Stephen M Rumble et al. PLoS Comput Biol. 2009 May.

Abstract

The development of Next Generation Sequencing technologies, capable of sequencing hundreds of millions of short reads (25-70 bp each) in a single run, is opening the door to population genomic studies of non-model species. In this paper we present SHRiMP - the SHort Read Mapping Package: a set of algorithms and methods to map short reads to a genome, even in the presence of a large amount of polymorphism. Our method is based upon a fast read mapping technique, separate thorough alignment methods for regular letter-space as well as AB SOLiD (color-space) reads, and a statistical model for false positive hits. We use SHRiMP to map reads from a newly sequenced Ciona savignyi individual to the reference genome. We demonstrate that SHRiMP can accurately map reads to this highly polymorphic genome, while confirming high heterozygosity of C. savignyi in this second individual. SHRiMP is freely available at http://compbio.cs.toronto.edu/shrimp.

Conflict of interest statement

The authors have declared that no competing interests exist.

Figures

Figure 1. Data flow and processing within SHRiMP.

Candidate mapping locations are first discovered by the seed scanner and then validated by the vectorized Smith-Waterman algorithm, which computes only a score. Top-scoring hits are then fully aligned by a platform-specific algorithm (i.e. letter-space for Solexa data and color-space for SOLiD data). Statistical confidence for the final mappings is then computed using the PROBCALC utility.

Figure 2. Two representations of the color-space (dibase) encoding used by the AB SOLiD sequencing system.

A: The standard representation, with the first and second letter of the queried pair along the horizontal and vertical axes, respectively. B: The equivalent Finite State Automaton representation, with edges labelled with the readouts and nodes corresponding to the basepairs of the underlying genome.
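The dibase encoding described in this caption can be sketched in a few lines of Python. This is a minimal illustration, not SHRiMP code: it assumes the commonly used convention in which the bases A, C, G, T are numbered 0 to 3 and the color of a pair is the XOR of the two numbers. The names encode_colors and decode_colors are hypothetical.

```python
# Assumed convention: A=0, C=1, G=2, T=3; the color of a base pair is the
# XOR of the two base codes (identical bases give 0, and so on).
BASE_TO_BITS = {"A": 0, "C": 1, "G": 2, "T": 3}
BITS_TO_BASE = "ACGT"


def encode_colors(seq):
    """Return one color (0-3) for every pair of adjacent bases in seq."""
    return [BASE_TO_BITS[a] ^ BASE_TO_BITS[b] for a, b in zip(seq, seq[1:])]


def decode_colors(first_base, colors):
    """Walk the automaton: each color moves from the current base to the next."""
    bases = [first_base]
    for color in colors:
        bases.append(BITS_TO_BASE[BASE_TO_BITS[bases[-1]] ^ color])
    return "".join(bases)


colors = encode_colors("ATGGCA")
print(colors)                      # [3, 1, 0, 3, 1]
print(decode_colors("A", colors))  # ATGGCA
```

Decoding needs the first base as an anchor, because the same color string corresponds to four different letter sequences, one per starting base.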

Figure 3. Various mutation and error events, and their effects on the color-code readouts.

The reference genome is labeled G and the read R. A: A perfect alignment; B: In case of a sequencing error (the 2 should have been read as a 0) the rest of the read no longer matches the genome in letter-space; C: In case of a SNP two adjacent colors do not match the genome, but all subsequent letters do match. However, D: only 3 of the 9 possible color changes represent valid SNPs; E: the rules for deciding which insertion and deletion events are valid are even more complex, as indels can also change adjacent color readouts.
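The difference between a sequencing error and a SNP can be checked with a small Python sketch. It assumes the XOR form of the dibase code (A=0, C=1, G=2, T=3), and the example sequence and all function names are made up for illustration. Under that assumption, a single-base change alters two adjacent colors while leaving their XOR unchanged, which is why only 3 of the 9 possible double color changes are valid SNPs.

```python
from itertools import product

BITS = {"A": 0, "C": 1, "G": 2, "T": 3}  # assumed base numbering


def encode(seq):
    """Color of each adjacent base pair (XOR of the two base codes)."""
    return [BITS[a] ^ BITS[b] for a, b in zip(seq, seq[1:])]


def decode(first, colors):
    """Translate colors back to letters, starting from a known first base."""
    out = [first]
    for c in colors:
        out.append("ACGT"[BITS[out[-1]] ^ c])
    return "".join(out)


def valid_snp_pairs(c1, c2):
    """Pairs (d1, d2) that differ from (c1, c2) in both colors and are
    consistent with changing only the single base shared by the two pairs."""
    return [(d1, d2) for d1, d2 in product(range(4), repeat=2)
            if d1 != c1 and d2 != c2 and (d1 ^ d2) == (c1 ^ c2)]


genome = "TACGGTCA"
colors = encode(genome)                 # [3, 1, 3, 0, 1, 2, 1]

# Sequencing error: one color is misread; every later letter is wrong.
misread = colors.copy()
misread[2] ^= 2
print(decode("T", misread))             # TACAACTG

# SNP: one base changes; exactly two adjacent colors change.
snp_colors = encode("TACAGTCA")
print([i for i, (a, b) in enumerate(zip(colors, snp_colors)) if a != b])  # [2, 3]
print(len(valid_snp_pairs(colors[2], colors[3])))                         # 3
```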

Figure 4. Color-space (dibase) sequence alignment.

A: The Dynamic Programming (DP) representation, B: recurrences, and C: alignment of a letter-space sequence to a color-space read with a sequencing error. Within the DP matrix we simultaneously align all four possible translations (vertical) to the reference genome (horizontal); however, the alignment can transition between translations by paying the crossover penalty. This is illustrated by the fourth recurrence, where the third index corresponds to the translation currently being used. In the alignment (C), after the sequencing error the original translation of the read (starting from a T) no longer matches, but a different one (starting from a C) does.
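The crossover idea can be illustrated with a deliberately simplified Python sketch. It is not the recurrence from the figure: indels are left out entirely, the read is aligned end to end at a fixed offset, and the scores (match, mismatch, crossover) are arbitrary values chosen for the example. It also assumes the XOR form of the dibase code. What it does keep is the core mechanism: all four translations are scored in parallel, and the path may switch translation by paying a penalty.

```python
BITS = {"A": 0, "C": 1, "G": 2, "T": 3}  # assumed base numbering


def translations(colors):
    """Decode the color read four times, once per possible starting base."""
    result = []
    for first in "ACGT":
        seq = [first]
        for c in colors:
            seq.append("ACGT"[BITS[seq[-1]] ^ c])
        result.append("".join(seq))
    return result


def ungapped_crossover_score(ref, colors, match=1, mismatch=-1, crossover=-2):
    """Best score over paths that may switch translation between positions.

    Simplification: no insertions or deletions, so ref must have exactly
    one more letter than the read has colors.
    """
    assert len(ref) == len(colors) + 1
    trans = translations(colors)

    def letter_score(k, i):
        return match if trans[k][i] == ref[i] else mismatch

    prev = [letter_score(k, 0) for k in range(4)]
    for i in range(1, len(ref)):
        cur = []
        for k in range(4):
            stay = prev[k]
            # Switching translation models a sequencing error in the colors.
            switch = max(prev[j] for j in range(4) if j != k) + crossover
            cur.append(max(stay, switch) + letter_score(k, i))
        prev = cur
    return max(prev)


ref = "TACGGTCA"
print(ungapped_crossover_score(ref, [3, 1, 3, 0, 1, 2, 1]))  # 8, error-free read
print(ungapped_crossover_score(ref, [3, 1, 1, 0, 1, 2, 1]))  # 6, one color misread
```

In the second call the translation starting from T matches the first three letters, and the one starting from C matches the remaining five, so the best path scores 3 + 5 minus one crossover penalty.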

Figure 5. Size distribution of indels (A) and distance between adjacent SNPs (B) detected by SHRiMP.

The distance between adjacent SNPs shows a clear 3-periodicity, because a significant fraction of the non-repetitive C. savignyi genome is coding.

Figure 6. SHRiMP Hashing technique & Vectorized Alignment algorithm.

A: Overview of the k-mer filtering stage within SHRiMP: A window is moved along the genome. If a particular read has a preset number of k-mers within the window, the vectorized Smith-Waterman stage is run to align the read to the genome. B: Schematic of the vectorized implementation of the Needleman-Wunsch algorithm. The red cells are the vector being computed, on the basis of the vectors computed in the last step (yellow) and the next-to-last step (blue). The match/mismatch vector for the diagonal is determined by comparing one sequence with the other one reversed (indicated by the red arrow below). To obtain the set of match/mismatch positions for the next diagonal, the lower sequence needs to be shifted to the right.
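The windowed k-mer filter in panel A can be approximated with the following Python sketch. It is an assumption-laden toy, not SHRiMP's implementation: it uses contiguous exact k-mers, a plain dictionary index of the reads, and illustrative values for k, the window length and the hit threshold. The function name candidate_windows is hypothetical.

```python
from collections import defaultdict, deque


def candidate_windows(genome, reads, k=4, window=12, min_hits=2):
    """Map read id -> genome positions where a window with enough k-mer hits starts.

    Reads are indexed by k-mer, then the genome is scanned once. For each read
    we keep only the hit positions that fall inside the current window.
    """
    index = defaultdict(set)                      # k-mer -> ids of reads containing it
    for rid, read in enumerate(reads):
        for i in range(len(read) - k + 1):
            index[read[i:i + k]].add(rid)

    recent = defaultdict(deque)                   # read id -> hit positions in window
    candidates = defaultdict(set)
    for pos in range(len(genome) - k + 1):
        for rid in index.get(genome[pos:pos + k], ()):
            hits = recent[rid]
            hits.append(pos)
            while hits[0] <= pos - window:        # drop hits that left the window
                hits.popleft()
            if len(hits) >= min_hits:
                # A full pipeline would hand this region to the alignment stage.
                candidates[rid].add(hits[0])
    return {rid: sorted(starts) for rid, starts in candidates.items()}


genome = "ACGTTGCAACGGTCATTGACCA"
reads = ["GCAACGGT", "TTTTTTTT"]
print(candidate_windows(genome, reads))           # {0: [5]}
```

Only the first read shares enough k-mers with any genome window, so only it would be passed on for scoring; the second read is filtered out without any alignment work.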
