Indexing strategies for rapid searches of short words in genome sequences
Christian Iseli et al. PLoS One. 2007.
Abstract
Searching for matches between large collections of short (14-30 nucleotides) words and sequence databases comprising full genomes or transcriptomes is a common task in biological sequence analysis. We investigated the performance of simple indexing strategies for handling such tasks and developed two programs, fetchGWI and tagger, that index either the database or the query set. Either strategy outperforms megablast for searches with more than 10,000 probes. FetchGWI is shown to be a versatile tool for rapidly searching multiple genomes, whose performance is limited in most cases by the speed of access to the filesystem. We have made publicly available a Web interface for searching the human, mouse, and several other genomes and transcriptomes with oligonucleotide queries.
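The two strategies named in the abstract (indexing the database versus indexing the query set) can be illustrated with a minimal in-memory sketch. This is not the fetchGWI or tagger implementation: the function names are hypothetical, all words are assumed to share one fixed length `k`, and the sketch ignores reverse-complement strands, mismatches, index compression, and on-disk storage.

```python
from collections import defaultdict


def index_database(genome: str, k: int) -> dict:
    """Database-side index: map every k-mer in the genome to its start positions."""
    index = defaultdict(list)
    for i in range(len(genome) - k + 1):
        index[genome[i:i + k]].append(i)
    return index


def search_with_database_index(index: dict, queries: list) -> dict:
    """Look each query word up in a prebuilt database index (exact matches only)."""
    return {q: list(index.get(q, [])) for q in queries}


def search_with_query_index(genome: str, queries: list, k: int) -> dict:
    """Query-side index: hash the query words, then scan the genome once."""
    hits = {q: [] for q in queries}
    for i in range(len(genome) - k + 1):
        word = genome[i:i + k]
        if word in hits:
            hits[word].append(i)
    return hits


if __name__ == "__main__":
    genome = "ACGTACGTTA"          # toy sequence, assumed for illustration
    queries = ["ACGT", "CGTT", "GGGG"]
    k = 4
    db_index = index_database(genome, k)
    print(search_with_database_index(db_index, queries))
    print(search_with_query_index(genome, queries, k))
```

In this sketch the database index is built once and reused across query sets, whereas the query index is rebuilt per query set and costs one pass over the database per search. Both functions return the same exact-match positions.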
Conflict of interest statement
Competing Interests: The authors have declared that no competing interests exist.
Figures
Figure 1. Word duplication in several genomes.
Analysis of the percentage of unique sequence tags in several genomes as a function of tag length.
Figure 2. Runtime comparisons on the human genome.
Runtime comparisons between fetchGWI using plain and compressed index files, tagger, and megablast. Each point is computed from the average of three runs on the human genome with different input data, except the last run, which was done on the whole dataset. Only perfect matches were sought, except for the two experiments explicitly noted, where 2 mismatches were allowed.
Figure 3. Runtime comparisons on multiple genomes.
Runtime comparisons between fetchGWI (using a compressed index file) and megablast. Each point is computed from the average of three runs on the combined genome of 9 species (human, mouse, honey bee, cattle, dog, drosophila, zebrafish, chimp, and rat) with different input data, except the last run done on the whole dataset. Only perfect matches were sought.
Figure 4. Runtime comparisons on combined index files.
Runtime comparisons of fetchGWI when using either multiple index files or a single combined index file. Each point is computed from the average of three runs on the human and mouse genomes with different input data, except the last run done on the whole dataset. Only perfect matches were sought.
Figure 5. Runtime comparisons on different filesystems.
Runtime comparisons of fetchGWI when the index file is stored either on a local-disk filesystem or on an SFS cluster filesystem. Each point is computed from the average of three runs with different input data, except the last run done on the whole dataset. Only perfect matches were sought.