SSAHA: a fast search method for large DNA databases
Z Ning et al. Genome Res. 2001 Oct.
Abstract
We describe an algorithm, SSAHA (Sequence Search and Alignment by Hashing Algorithm), for performing fast searches on databases containing multiple gigabases of DNA. Sequences in the database are preprocessed by breaking them into consecutive k-tuples of k contiguous bases and then using a hash table to store the position of each occurrence of each k-tuple. Searching for a query sequence in the database is done by obtaining from the hash table the "hits" for each k-tuple in the query sequence and then performing a sort on the results. We discuss the effect of the tuple length k on the search speed, memory usage, and sensitivity of the algorithm and present the results of computational experiments which show that SSAHA can be three to four orders of magnitude faster than BLAST or FASTA, while requiring less memory than suffix tree methods. The SSAHA algorithm is used for high-throughput single nucleotide polymorphism (SNP) detection and very large scale sequence assembly. Also, it provides Web-based sequence search facilities for Ensembl projects.
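The preprocessing and search steps described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the function names `build_index` and `search`, the `min_run` parameter, the use of a Python dictionary as the hash table, and the choice to group sorted hits by their shift (database offset minus query position) are all assumptions made for the example.

```python
from collections import defaultdict


def build_index(sequences, k):
    """Map each k-tuple to the (sequence index, offset) pairs where it occurs."""
    index = defaultdict(list)
    for seq_id, seq in enumerate(sequences):
        # Assumption: "consecutive" k-tuples are taken without overlap (step k).
        for offset in range(0, len(seq) - k + 1, k):
            index[seq[offset:offset + k]].append((seq_id, offset))
    return index


def search(index, query, k, min_run=2):
    """Return (seq_id, shift, hit_count) for groups of hits that agree on
    both the database sequence and the shift between database and query."""
    hits = []
    # Every overlapping k-tuple of the query is looked up in the hash table.
    for q_pos in range(len(query) - k + 1):
        for seq_id, offset in index.get(query[q_pos:q_pos + k], ()):
            hits.append((seq_id, offset - q_pos, offset))
    # Sorting brings hits with the same sequence and shift next to each other.
    hits.sort()
    matches = []
    start = 0
    for i in range(1, len(hits) + 1):
        if i == len(hits) or hits[i][:2] != hits[start][:2]:
            if i - start >= min_run:
                matches.append((hits[start][0], hits[start][1], i - start))
            start = i
    return matches


# Toy database and query, invented for illustration only.
database = ["ACGTACGGTTCA", "TTGACCAGTAAC"]
index = build_index(database, k=4)
print(search(index, "ACGGTTCA", k=4))  # [(0, 4, 2)]
```

In this toy run the query matches sequence 0 at a shift of 4, supported by two k-tuple hits. The trade-off around k is visible even here: with a four-letter alphabet there are 4^k possible k-tuples, so a larger k means a larger table and fewer chance hits, while a smaller k tolerates shorter exact matches at the cost of more hits to sort.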
Figures
Figure 1
Percentage of data remaining as the cutoff threshold N is varied, for different values of the word size k.