SHRiMP: accurate mapping of short color-space reads
Stephen M Rumble et al. PLoS Comput Biol. 2009 May.
Abstract
The development of Next Generation Sequencing technologies, capable of sequencing hundreds of millions of short reads (25-70 bp each) in a single run, is opening the door to population genomic studies of non-model species. In this paper we present SHRiMP - the SHort Read Mapping Package: a set of algorithms and methods to map short reads to a genome, even in the presence of a large amount of polymorphism. Our method is based upon a fast read mapping technique, separate thorough alignment methods for regular letter-space as well as AB SOLiD (color-space) reads, and a statistical model for false positive hits. We use SHRiMP to map reads from a newly sequenced Ciona savignyi individual to the reference genome. We demonstrate that SHRiMP can accurately map reads to this highly polymorphic genome, while confirming high heterozygosity of C. savignyi in this second individual. SHRiMP is freely available at http://compbio.cs.toronto.edu/shrimp.
Conflict of interest statement
The authors have declared that no competing interests exist.
Figures
Figure 1. Data flow and processing within SHRiMP.
Candidate mapping locations are first discovered by the seed scanner and then validated by the vectorized Smith-Waterman algorithm, which computes only a score. Top-scoring hits are then fully aligned by a platform-specific algorithm (i.e. letter-space for Solexa data and color-space for SOLiD data). Statistical confidence for the final mappings is then computed using the PROBCALC utility.
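The first two stages of this data flow can be sketched in a few lines of Python. This is a minimal illustration, not SHRiMP's implementation: the function names (`sw_score`, `candidate_windows`, `map_read`), the scoring values, the k-mer size and the window size are all assumptions chosen for the example, and the Smith-Waterman routine is plain scalar code rather than vectorized. It only shows the idea of a cheap seed filter followed by a score-only local alignment that keeps two rows of the matrix and therefore produces no traceback.

```python
def sw_score(a, b, match=2, mismatch=-1, gap=-2):
    """Smith-Waterman local alignment score with linear gap costs.

    Only two DP rows are kept, so a score is returned but no alignment.
    """
    prev = [0] * (len(b) + 1)
    best = 0
    for ca in a:
        cur = [0]
        for j, cb in enumerate(b, start=1):
            diag = prev[j - 1] + (match if ca == cb else mismatch)
            cur.append(max(0, diag, prev[j] + gap, cur[j - 1] + gap))
        best = max(best, max(cur))
        prev = cur
    return best


def candidate_windows(read, genome, k=4, window=16, min_hits=2):
    """Start positions of genome windows sharing >= min_hits k-mers with the read."""
    read_kmers = {read[i:i + k] for i in range(len(read) - k + 1)}
    hit = [genome[i:i + k] in read_kmers for i in range(len(genome) - k + 1)]
    starts = []
    for s in range(len(genome) - window + 1):
        # Count only k-mers that lie entirely inside genome[s:s + window].
        if sum(hit[s:s + window - k + 1]) >= min_hits:
            starts.append(s)
    return starts


def map_read(read, genome, k=4, window=16, min_hits=2):
    """Score every candidate window; return (best_score, window_start) or None."""
    scored = [(sw_score(read, genome[s:s + window]), s)
              for s in candidate_windows(read, genome, k, window, min_hits)]
    return max(scored) if scored else None


# Toy data (hypothetical): the read is genome[12:24] with one substitution.
genome = "TTGACCATGGCAATCGTACGGATTACAGCTTAGGCCTAAT"
read = "ATCGTCCGGATT"
best = map_read(read, genome)
```

In a full pipeline only the top-scoring windows returned by such a filter would be passed on to the more expensive full alignment step.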
Figure 2. Two representations of the color-space (dibase) encoding used by the AB SOLiD sequencing system.
A: The standard representation, with the first and second letter of the queried pair along the horizontal and vertical axes, respectively. B: The equivalent Finite State Automaton representation, with edges labelled with the readouts and nodes corresponding to the basepairs of the underlying genome.
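Both representations can be expressed compactly in code. The sketch below assumes the usual 2-bit numbering A=0, C=1, G=2, T=3, under which the color of a pair of adjacent bases is the XOR of their codes (panel A), and decoding is a walk through a four-state automaton in which each color moves the current base to the next one (panel B). The names `encode` and `decode` are illustrative.

```python
BASES = "ACGT"
CODE = {b: i for i, b in enumerate(BASES)}  # assumed numbering: A=0, C=1, G=2, T=3


def encode(seq):
    """Letter-space to color-space: one color per pair of adjacent bases."""
    return [CODE[x] ^ CODE[y] for x, y in zip(seq, seq[1:])]


def decode(first_base, colors):
    """Color-space to letter-space, walking the automaton from a known first base."""
    state = CODE[first_base]
    letters = [first_base]
    for c in colors:
        state ^= c  # follow the edge labelled c
        letters.append(BASES[state])
    return "".join(letters)


colors = encode("ATGGC")          # [3, 1, 0, 3]
letters = decode("A", colors)     # "ATGGC"
```

Decoding needs the first base to be known; starting the walk from a different base yields a different letter sequence for the same colors.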
Figure 3. Various mutation and error events, and their effects on the color-code readouts.
The reference genome is labeled G and the read R. A: A perfect alignment; B: In case of a sequencing error (the 2 should have been read as a 0) the rest of the read no longer matches the genome in letter-space; C: In case of a SNP two adjacent colors do not match the genome, but all subsequent letters do match. However, D: only 3 of the 9 possible color changes represent valid SNPs; E: the rules for deciding which insertion and deletion events are valid are even more complex, as indels can also change adjacent color readouts.
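The difference between cases B, C and D can be checked with a short experiment. The sketch below again assumes the numbering A=0, C=1, G=2, T=3 with XOR as the color of a base pair; the toy genome and the helper name `snp_color_pairs` are made up for illustration. A single misread color corrupts every decoded letter after it, whereas a SNP changes exactly the two colors that share the substituted base, and both colors change by the same XOR offset, which is why only 3 of the 9 ways of changing two adjacent colors are consistent with a SNP.

```python
BASES = "ACGT"
CODE = {b: i for i, b in enumerate(BASES)}  # assumed numbering: A=0, C=1, G=2, T=3


def encode(seq):
    return [CODE[x] ^ CODE[y] for x, y in zip(seq, seq[1:])]


def decode(first_base, colors):
    state = CODE[first_base]
    letters = [first_base]
    for c in colors:
        state ^= c
        letters.append(BASES[state])
    return "".join(letters)


def snp_color_pairs(c1, c2):
    """Color pairs that can replace (c1, c2) when the base they share is substituted."""
    return {(c1 ^ d, c2 ^ d) for d in (1, 2, 3)}


genome = "TACGGATCA"  # hypothetical example sequence
colors = encode(genome)

# Sequencing error: the color at index 3 is misread.
misread = colors.copy()
misread[3] ^= 2
decoded = decode(genome[0], misread)  # agrees with genome only up to index 3

# SNP: the base at index 4 is substituted (G -> T).
snp_genome = genome[:4] + "T" + genome[5:]
changed = [i for i, (p, q) in enumerate(zip(colors, encode(snp_genome))) if p != q]
```

Here `changed` contains the two adjacent color positions on either side of the substituted base, while `decoded` diverges from `genome` at every letter after the misread color.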
Figure 4. Color-space (dibase) sequence alignment.
A: The Dynamic Programming (DP) representation, B: recurrences, and C: alignment of a letter-space sequence to a color-space read with a sequencing error. Within the DP matrix we simultaneously align all four possible translations of the read (vertical) to the reference genome (horizontal); however, the alignment can transition between translations by paying the crossover penalty. This is illustrated by the fourth recurrence, where the third index corresponds to the translation currently being used. In the alignment (C), after the sequencing error the original translation of the read (starting from a T) no longer matches, but a different one (starting from a C) does.
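The role of the crossover penalty can be seen in a deliberately simplified version of this idea. The sketch below is not the recurrence from the figure: it omits gaps entirely and aligns the read column by column against a reference segment of matching length, so that the only choices are which of the four translations to follow and where to switch. The function name `align_translations` and the score values are assumptions made for the example.

```python
BASES = "ACGT"
CODE = {b: i for i, b in enumerate(BASES)}  # assumed numbering: A=0, C=1, G=2, T=3


def encode(seq):
    return [CODE[x] ^ CODE[y] for x, y in zip(seq, seq[1:])]


def align_translations(ref, colors, match=1, mismatch=-1, crossover=-2):
    """Ungapped alignment of a color read to len(colors) + 1 reference letters.

    Returns (score, path); path[i] is the first letter of the translation
    used at column i. Switching translation costs `crossover`.
    """
    if len(ref) != len(colors) + 1:
        raise ValueError("ref must be one letter longer than the color read")
    prefix = [0]  # prefix[i] = XOR of the first i colors
    for c in colors:
        prefix.append(prefix[-1] ^ c)

    score = [[0] * 4 for _ in ref]
    back = [[0] * 4 for _ in ref]
    for i, g in enumerate(ref):
        for k in range(4):
            # Letter of translation k at column i is k XOR prefix[i].
            emit = match if (k ^ prefix[i]) == CODE[g] else mismatch
            if i == 0:
                score[i][k], back[i][k] = emit, k
                continue
            options = [(score[i - 1][j] + (0 if j == k else crossover), j)
                       for j in range(4)]
            best, back[i][k] = max(options)
            score[i][k] = best + emit

    k = max(range(4), key=lambda t: score[-1][t])
    total = score[-1][k]
    path = []
    for i in range(len(ref) - 1, -1, -1):
        path.append(BASES[k])
        k = back[i][k]
    return total, "".join(reversed(path))


ref = "TACGGATCA"        # hypothetical reference segment
colors = encode(ref)
colors[3] ^= 2           # introduce one sequencing error
result = align_translations(ref, colors)
```

With these toy inputs the best path follows the translation starting from T up to the error and the translation starting from C afterwards, paying the crossover penalty once instead of accumulating mismatches for the rest of the read.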
Figure 5. Size distribution of indels (A) and distance between adjacent SNPs (B) detected by SHRiMP.
The distance between adjacent SNPs shows a clear 3-periodicity, because a significant fraction of the non-repetitive C. savignyi genome is coding.
Figure 6. SHRiMP Hashing technique & Vectorized Alignment algorithm.
A: Overview of the k-mer filtering stage within SHRiMP: a window is moved along the genome. If a particular read has a preset number of k-mers within the window, the vectorized Smith-Waterman stage is run to align the read to the genome. B: Schematic of the vectorized implementation of the Needleman-Wunsch algorithm. The red cells are the vector being computed, on the basis of the vectors computed in the last step (yellow) and the next-to-last step (blue). The match/mismatch vector for the diagonal is determined by comparing one sequence with the other one reversed (indicated by the red arrow below). To obtain the set of match/mismatch positions for the next diagonal, the lower sequence needs to be shifted to the right.
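The anti-diagonal ordering described in panel B can be illustrated without any SIMD instructions. The sketch below is plain Python, so it gains no speed; it only shows why the layout is vectorizable. Every cell on an anti-diagonal depends solely on the two previous anti-diagonals, and the match/mismatch flags for a diagonal come from comparing one sequence against the reversed other sequence at an offset that grows by one per diagonal. The function names and the scoring values are assumptions, and a conventional row-by-row version is included only as a cross-check.

```python
def nw_score_antidiagonal(a, b, match=1, mismatch=-1, gap=-1):
    """Needleman-Wunsch global score, filled one anti-diagonal at a time."""
    n, m = len(a), len(b)
    b_rev = b[::-1]
    neg = float("-inf")
    prev2 = [neg] * (n + 1)  # diagonal d - 2, indexed by row i
    prev1 = [neg] * (n + 1)  # diagonal d - 1, indexed by row i
    for d in range(n + m + 1):
        cur = [neg] * (n + 1)
        # Cells (i, j) with i + j == d are independent of one another, so this
        # inner loop is what a single vector operation would replace.
        for i in range(max(0, d - m), min(n, d) + 1):
            j = d - i
            if i == 0 or j == 0:
                cur[i] = d * gap  # first row or column
            else:
                # b[j - 1] == b_rev[m - j] == b_rev[i + (m - d)]:
                # the offset into the reversed sequence shifts by one per diagonal.
                same = a[i - 1] == b_rev[m - j]
                cur[i] = max(prev2[i - 1] + (match if same else mismatch),
                             prev1[i - 1] + gap,
                             prev1[i] + gap)
        prev2, prev1 = prev1, cur
    return prev1[n]


def nw_score_rowwise(a, b, match=1, mismatch=-1, gap=-1):
    """Conventional row-by-row fill, used here only to check the result."""
    prev = [j * gap for j in range(len(b) + 1)]
    for i, ca in enumerate(a, start=1):
        cur = [i * gap]
        for j, cb in enumerate(b, start=1):
            cur.append(max(prev[j - 1] + (match if ca == cb else mismatch),
                           prev[j] + gap,
                           cur[j - 1] + gap))
        prev = cur
    return prev[-1]
```

Both functions fill the same matrix in a different order, so they return the same score for any pair of sequences.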
Similar articles
- SHRiMP2: sensitive yet practical SHort Read Mapping.
David M, Dzamba M, Lister D, Ilie L, Brudno M. Bioinformatics. 2011 Apr 1;27(7):1011-2. doi: 10.1093/bioinformatics/btr046. Epub 2011 Jan 28. PMID: 21278192
- Fast and accurate short read alignment with Burrows-Wheeler transform.
Li H, Durbin R. Bioinformatics. 2009 Jul 15;25(14):1754-60. doi: 10.1093/bioinformatics/btp324. Epub 2009 May 18. PMID: 19451168 Free PMC article.
- ComB: SNP calling and mapping analysis for color and nucleotide space platforms.
Souaiaia T, Frazier Z, Chen T. J Comput Biol. 2011 Jun;18(6):795-807. doi: 10.1089/cmb.2011.0027. Epub 2011 May 12. PMID: 21563978 Free PMC article.
- A survey of sequence alignment algorithms for next-generation sequencing.
Li H, Homer N. Brief Bioinform. 2010 Sep;11(5):473-83. doi: 10.1093/bib/bbq015. Epub 2010 May 11. PMID: 20460430 Free PMC article. Review.
- Mapping RNA-seq Reads with STAR.
Dobin A, Gingeras TR. Curr Protoc Bioinformatics. 2015 Sep 3;51:11.14.1-11.14.19. doi: 10.1002/0471250953.bi1114s51. PMID: 26334920 Free PMC article. Review.
Cited by
- Synergistic response to climate stressors in coral is associated with genotypic variation in baseline expression.
Dilworth J, Million WC, Ruggeri M, Hall ER, Dungan AM, Muller EM, Kenkel CD. Proc Biol Sci. 2024 Mar 27;291(2019):20232447. doi: 10.1098/rspb.2023.2447. Epub 2024 Mar 27. PMID: 38531406 Free PMC article.
- Aberrant miR-29 is a predictive feature of severe phenotypes in pediatric Crohn's disease.
Shumway AJ, Shanahan MT, Hollville E, Chen K, Beasley C, Villanueva JW, Albert S, Lian G, Cure MR, Schaner M, Zhu LC, Bantumilli S, Deshmukh M, Furey TS, Sheikh SZ, Sethupathy P. JCI Insight. 2024 Feb 22;9(4):e168800. doi: 10.1172/jci.insight.168800. PMID: 38385744 Free PMC article.
- Integrative genome-scale analyses reveal post-transcriptional signatures of early human small intestinal development in a directed differentiation organoid model.
Hung YH, Capeling M, Villanueva JW, Kanke M, Shanahan MT, Huang S, Cubitt R, Rinaldi VD, Schimenti JC, Spence JR, Sethupathy P. BMC Genomics. 2023 Oct 26;24(1):641. doi: 10.1186/s12864-023-09743-1. PMID: 37884859 Free PMC article.
- BLEND: a fast, memory-efficient and accurate mechanism to find fuzzy seed matches in genome analysis.
Firtina C, Park J, Alser M, Kim JS, Cali DS, Shahroodi T, Ghiasi NM, Singh G, Kanellopoulos K, Alkan C, Mutlu O. NAR Genom Bioinform. 2023 Jan 20;5(1):lqad004. doi: 10.1093/nargab/lqad004. eCollection 2023 Mar. PMID: 36685727 Free PMC article.
- Repeat infections with chlamydia in women may be more transcriptionally active with lower responses from some immune genes.
Huston WM, Lawrence A, Wee BA, Thomas M, Timms P, Vodstrcil LA, McNulty A, McIvor R, Worthington K, Donovan B, Phillips S, Chen MY, Fairley CK, Hocking JS. Front Public Health. 2022 Oct 10;10:1012835. doi: 10.3389/fpubh.2022.1012835. eCollection 2022. PMID: 36299763 Free PMC article.