Indexing strategies for rapid searches of short words in genome sequences

Christian Iseli et al. PLoS One. 2007.

Abstract

Searching for matches between large collections of short (14-30 nucleotides) words and sequence databases comprising full genomes or transcriptomes is a common task in biological sequence analysis. We investigated the performance of simple indexing strategies for handling such tasks and developed two programs, fetchGWI and tagger, that index either the database or the query set. Either strategy outperforms megablast for searches with more than 10,000 probes. FetchGWI is shown to be a versatile tool for rapidly searching multiple genomes, whose performance is limited in most cases by the speed of access to the filesystem. We have made publicly available a Web interface for searching the human, mouse, and several other genomes and transcriptomes with oligonucleotide queries.
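The two strategies named in the abstract, indexing the database or indexing the query set, can be illustrated with a minimal Python sketch. This is not the fetchGWI or tagger implementation: the function names are hypothetical, all probes are assumed to share one length k, and only perfect matches on the forward strand are considered.

```python
from collections import defaultdict


def index_database(sequence, k):
    """Map every k-letter word in the sequence to its start positions."""
    index = defaultdict(list)
    for i in range(len(sequence) - k + 1):
        index[sequence[i:i + k]].append(i)
    return index


def search_with_database_index(index, queries):
    """Strategy 1 (assumed shape): look each probe up in a prebuilt index."""
    return {q: index.get(q, []) for q in queries}


def search_with_query_index(sequence, queries, k):
    """Strategy 2 (assumed shape): index the probes, then scan the database once."""
    wanted = set(queries)
    hits = {q: [] for q in wanted}
    for i in range(len(sequence) - k + 1):
        word = sequence[i:i + k]
        if word in wanted:
            hits[word].append(i)
    return hits


# Toy data, made up for illustration only.
genome = "ACGTACGTGACG"
probes = ["ACGT", "GACG", "TTTT"]

db_hits = search_with_database_index(index_database(genome, 4), probes)
query_hits = search_with_query_index(genome, probes, 4)
```

Both functions return the same positions. The difference lies in where the cost falls: the database index is built once and reused across probe sets, while the query index is rebuilt for each probe set and requires a full pass over the database.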

Conflict of interest statement

Competing Interests: The authors have declared that no competing interests exist.

Figures

Figure 1. Word duplication in several genomes.

Analysis of the percentage of unique sequence tags in several genomes as a function of tag length.
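A rough Python sketch of how such a percentage could be computed for one sequence and one tag length follows. The definition used here is an assumption, since the caption does not give one: a tag position counts as unique when its word occurs exactly once in the sequence. The function name is hypothetical and the reverse strand is ignored.

```python
from collections import Counter


def unique_tag_percentage(sequence, k):
    """Percentage of k-letter tag positions whose word occurs exactly once.

    Assumed definition; the caption does not say how uniqueness is counted.
    """
    counts = Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))
    total = sum(counts.values())
    if total == 0:
        return 0.0  # sequence shorter than k: no tags to count
    unique = sum(1 for c in counts.values() if c == 1)
    return 100.0 * unique / total


# Toy sequence, made up for illustration only.
toy = "ACGTACGTGACG"
by_length = {k: unique_tag_percentage(toy, k) for k in (2, 4, 6)}
```

On real genomes the same loop would be run per tag length to obtain one curve per genome, although counting every word in memory this way would not scale to a full genome without a more compact representation.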

Figure 2. Runtime comparisons on the human genome.

Runtime comparisons between fetchGWI using plain and compressed index files, tagger, and megablast. Each point is computed from the average of three runs on the human genome with different input data, except the last run, done on the whole dataset. Only perfect matches were sought, except in the two experiments explicitly noted, where 2 mismatches were allowed.

Figure 3. Runtime comparisons on multiple genomes.

Runtime comparisons between fetchGWI (using a compressed index file) and megablast. Each point is computed from the average of three runs on the combined genome of 9 species (human, mouse, honey bee, cattle, dog, drosophila, zebrafish, chimp, and rat) with different input data, except the last run done on the whole dataset. Only perfect matches were sought.

Figure 4. Runtime comparisons on combined index files.

Runtime comparisons of fetchGWI when using either multiple index files or a single, combined index file. Each point is computed from the average of three runs on the human and mouse genomes with different input data, except the last run, done on the whole dataset. Only perfect matches were sought.

Figure 5. Runtime comparisons on different filesystems.

Runtime comparisons of fetchGWI when the index file is stored either on a filesystem on local disks or on an SFS cluster filesystem. Each point is computed from the average of three runs with different input data, except the last run, done on the whole dataset. Only perfect matches were sought.
