Using quality scores and longer reads improves accuracy of Solexa read mapping
Comparative Study
Andrew D Smith et al. BMC Bioinformatics. 2008.
Abstract
Background: Second-generation sequencing has the potential to revolutionize genomics and impact all areas of biomedical science. New technologies will make re-sequencing widely available for such applications as identifying genome variations or interrogating the oligonucleotide content of a large sample (e.g., ChIP-sequencing). The increase in speed, sensitivity and availability of sequencing technology brings demand for advances in computational technology to perform associated analysis tasks. The Solexa/Illumina 1G sequencer can produce tens of millions of reads, ranging in length from approximately 25 to 50 nt, in a single experiment. Accurately mapping the reads back to a reference genome is a critical task in almost all applications. Two sources of information that are often ignored when mapping reads from the Solexa technology are the 3' ends of longer reads, which contain a much higher frequency of sequencing errors, and the base-call quality scores.
Results: To investigate whether these sources of information can be used to improve accuracy when mapping reads, we developed the RMAP tool, which can map reads having a wide range of lengths and allows base-call quality scores to determine which positions in each read are more important when mapping. We applied RMAP to analyze data re-sequenced from two human BAC regions for varying read lengths, and varying criteria for use of quality scores. RMAP is freely available for downloading at http://rulai.cshl.edu/rmap/.
Conclusion: Our results indicate that significant gains in Solexa read mapping performance can be achieved by considering the information in 3' ends of longer reads, and appropriately using the base-call quality scores. The RMAP tool we have developed will enable researchers to effectively exploit this information in targeted re-sequencing projects.
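The idea of letting base-call quality scores decide which read positions matter can be illustrated with a minimal Python sketch. It is not the RMAP implementation: the function names, the brute-force scan, and the convention that a base is "high quality" when its score is greater than or equal to the cutoff are all assumptions made for illustration.

```python
from typing import List, Optional, Sequence


def count_mismatches(read: str, window: str,
                     quals: Optional[Sequence[int]] = None,
                     cutoff: Optional[int] = None) -> int:
    """Count mismatches between a read and an equal-length reference window.

    If both `quals` and `cutoff` are given, only mismatches at positions whose
    quality score is >= cutoff are counted (assumed convention); mismatches at
    low-quality positions are ignored.
    """
    if len(read) != len(window):
        raise ValueError("read and reference window must have equal length")
    if quals is not None and len(quals) != len(read):
        raise ValueError("need exactly one quality score per read base")
    use_quality = quals is not None and cutoff is not None
    mismatches = 0
    for i, (base, ref_base) in enumerate(zip(read, window)):
        if base == ref_base:
            continue
        if not use_quality or quals[i] >= cutoff:
            mismatches += 1
    return mismatches


def find_hits(read: str, reference: str, max_mismatches: int,
              quals: Optional[Sequence[int]] = None,
              cutoff: Optional[int] = None) -> List[int]:
    """Return every 0-based reference position where the read maps.

    Brute-force scan for clarity only; a real mapper would use an index.
    """
    n = len(read)
    return [pos for pos in range(len(reference) - n + 1)
            if count_mismatches(read, reference[pos:pos + n], quals, cutoff)
            <= max_mismatches]


if __name__ == "__main__":
    reference = "AACCGGTTACGT"            # hypothetical reference
    read = "CCGGTA"                       # last base is a sequencing error
    quals = [30, 30, 30, 30, 30, 2]       # low score at the 3' end
    print(find_hits(read, reference, 0))                   # plain mismatch count
    print(find_hits(read, reference, 0, quals, cutoff=8))  # quality-aware
```

In this toy case the read has no exact match, but it maps once the low-quality 3' base is excluded from the mismatch count, which is the kind of effect the abstract describes.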
Figures
Figure 1
Comparison of mapping accuracy of RMAPM criterion under different parameter combinations. Comparison of mapping accuracy for reads of different lengths, and allowing different numbers of mismatches without using quality scores. Both the target (BAC) region coverage (a) and the mapping selectivity (b) are displayed. The mean of these two measures is presented in (c) as mapping accuracy. Standard error of displayed values was always ≤ 1.0% and usually < 0.1%, as estimated by mapping reads obtained from the second lane of the same sequencing run of the same BAC regions (this applies also to values in Figure 2).
Figure 2
Mapping accuracy of RMAPQ criterion under varying parameters. Reads with length from 25–36 nt were mapped and 0, 1, or 2 mismatches were allowed at high quality bases defined by quality score cutoffs of 4 (d) or 8 (a-d). For reference, mapping performance of RMAPM criterion with at most 2 mismatches is also shown. (a) The BAC coverage; (b) the mapping selectivity; (c) the overall mapping accuracy (equal to the mean of the BAC coverage and selectivity). (d) 2-D performance comparison in both BAC coverage and selectivity of RMAPM and RMAPQ.
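Both captions define mapping accuracy as the mean of BAC coverage and mapping selectivity. The short Python sketch below shows only that arithmetic. The function name is hypothetical, the inputs are assumed to be fractions between 0 and 1, and how coverage and selectivity are themselves computed is not stated on this page, so they are taken as given.

```python
def mapping_accuracy(coverage: float, selectivity: float) -> float:
    """Mean of target-region coverage and mapping selectivity.

    Both inputs are assumed to be fractions in [0, 1]; their own definitions
    are not given here and are treated as externally computed values.
    """
    for name, value in (("coverage", coverage), ("selectivity", selectivity)):
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} must be a fraction between 0 and 1")
    return (coverage + selectivity) / 2.0


if __name__ == "__main__":
    # Hypothetical values, not results from the paper.
    print(mapping_accuracy(0.75, 0.5))
```

Because the two measures are weighted equally, a gain in coverage that costs the same amount of selectivity leaves this accuracy unchanged.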
Similar articles
- Updates to the RMAP short-read mapping software.
  Smith AD, Chung WY, Hodges E, Kendall J, Hannon G, Hicks J, Xuan Z, Zhang MQ. Bioinformatics. 2009 Nov 1;25(21):2841-2. doi: 10.1093/bioinformatics/btp533. Epub 2009 Sep 7. PMID: 19736251. Free PMC article.
- The GNUMAP algorithm: unbiased probabilistic mapping of oligonucleotides from next-generation sequencing.
  Clement NL, Snell Q, Clement MJ, Hollenhorst PC, Purwar J, Graves BJ, Cairns BR, Johnson WE. Bioinformatics. 2010 Jan 1;26(1):38-45. doi: 10.1093/bioinformatics/btp614. Epub 2009 Oct 27. PMID: 19861355. Free PMC article.
- Re-alignment of the unmapped reads with base quality score.
  Peng X, Wang J, Zhang Z, Xiao Q, Li M, Pan Y. BMC Bioinformatics. 2015;16 Suppl 5(Suppl 5):S8. doi: 10.1186/1471-2105-16-S5-S8. Epub 2015 Mar 18. PMID: 25860434. Free PMC article.
- The Genome Sequencer FLX System--longer reads, more applications, straight forward bioinformatics and more complete data sets.
  Droege M, Hill B. J Biotechnol. 2008 Aug 31;136(1-2):3-10. doi: 10.1016/j.jbiotec.2008.03.021. Epub 2008 Jun 21. PMID: 18616967. Review.
- De novo sequencing of plant genomes using second-generation technologies.
  Imelfort M, Edwards D. Brief Bioinform. 2009 Nov;10(6):609-18. doi: 10.1093/bib/bbp039. PMID: 19933209. Review.
Cited by
- BLEND: a fast, memory-efficient and accurate mechanism to find fuzzy seed matches in genome analysis.
  Firtina C, Park J, Alser M, Kim JS, Cali DS, Shahroodi T, Ghiasi NM, Singh G, Kanellopoulos K, Alkan C, Mutlu O. NAR Genom Bioinform. 2023 Jan 20;5(1):lqad004. doi: 10.1093/nargab/lqad004. eCollection 2023 Mar. PMID: 36685727. Free PMC article.
- Bioinformatics and Machine Learning Approaches to Understand the Regulation of Mobile Genetic Elements.
  Giassa IC, Alexiou P. Biology (Basel). 2021 Sep 10;10(9):896. doi: 10.3390/biology10090896. PMID: 34571773. Free PMC article. Review.
- Boosting the power of transcriptomics by developing an efficient gene expression profiling approach.
  Wang J, Xu J, Yang X, Xu S, Zhang M, Lu F. Plant Biotechnol J. 2022 Jan;20(1):201-210. doi: 10.1111/pbi.13706. Epub 2021 Sep 23. PMID: 34510693. Free PMC article.
- Technology dictates algorithms: recent developments in read alignment.
  Alser M, Rotman J, Deshpande D, Taraszka K, Shi H, Baykal PI, Yang HT, Xue V, Knyazev S, Singer BD, Balliu B, Koslicki D, Skums P, Zelikovsky A, Alkan C, Mutlu O, Mangul S. Genome Biol. 2021 Aug 26;22(1):249. doi: 10.1186/s13059-021-02443-7. PMID: 34446078. Free PMC article. Review.
- Levenshtein Distance, Sequence Comparison and Biological Database Search.
  Berger B, Waterman MS, Yu YW. IEEE Trans Inf Theory. 2021 Jun;67(6):3287-3294. doi: 10.1109/tit.2020.2996543. Epub 2020 May 21. PMID: 34257466. Free PMC article.