Indexing strategies for rapid searches of short words in genome sequences

Christian Iseli et al. PLoS One. 2007.

Abstract

Searching for matches between large collections of short (14-30 nucleotides) words and sequence databases comprising full genomes or transcriptomes is a common task in biological sequence analysis. We investigated the performance of simple indexing strategies for handling such tasks and developed two programs, fetchGWI and tagger, that index either the database or the query set. Either strategy outperforms megablast for searches with more than 10,000 probes. FetchGWI is shown to be a versatile tool for rapidly searching multiple genomes, whose performance is limited in most cases by the speed of access to the filesystem. We have made publicly available a Web interface for searching the human, mouse, and several other genomes and transcriptomes with oligonucleotide queries.
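The two strategies named in the abstract, indexing the database or indexing the query set, can be illustrated with a minimal Python sketch. This is not the fetchGWI or tagger implementation: the function names, the in-memory dictionaries, and the toy sequence are assumptions made for illustration, and the sketch handles only exact matches on one strand.

```python
from collections import defaultdict


def build_database_index(sequence, word_length):
    """Map every word of the given length in the sequence to its start positions."""
    index = defaultdict(list)
    for start in range(len(sequence) - word_length + 1):
        index[sequence[start:start + word_length]].append(start)
    return index


def search_with_database_index(index, queries):
    """Strategy 1 (hypothetical helper): the database is indexed once,
    then each query word is a single lookup."""
    return {query: list(index.get(query, [])) for query in queries}


def search_with_query_index(sequence, queries, word_length):
    """Strategy 2 (hypothetical helper): the query words are stored in a
    hash table and the database is scanned once against it."""
    hits = {query: [] for query in queries}
    for start in range(len(sequence) - word_length + 1):
        word = sequence[start:start + word_length]
        if word in hits:
            hits[word].append(start)
    return hits


# Toy data; real inputs would be whole genomes and 14-30 nucleotide words.
genome = "ACGTACGTTGCA"
probes = ["ACGT", "TTGC", "GGGG"]
by_database = search_with_database_index(build_database_index(genome, 4), probes)
by_queries = search_with_query_index(genome, probes, 4)
```

Both functions return the same mapping from each probe to its match positions. In this sketch the database index costs memory proportional to the genome but is reusable across query sets, while the query index costs memory proportional to the probe set but needs a full scan of the database for every search.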

Conflict of interest statement

Competing Interests: The authors have declared that no competing interests exist.

Figures

Figure 1. Word duplication in several genomes.

Analysis of the percentage of unique sequence tags in several genomes as a function of tag length.
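A measurement of this kind can be sketched in a few lines of Python. The definition used here, the share of distinct words of a given length that occur exactly once, is an assumption; the caption does not say whether the percentage is taken over distinct words or over positions, nor how the reverse strand is treated. The function name is hypothetical.

```python
from collections import Counter


def percent_unique_words(sequence, word_length):
    """Percentage of distinct words of the given length that occur exactly once.

    Assumed definition for illustration only; one strand, exact words.
    """
    counts = Counter(
        sequence[start:start + word_length]
        for start in range(len(sequence) - word_length + 1)
    )
    if not counts:
        return 0.0
    unique = sum(1 for count in counts.values() if count == 1)
    return 100.0 * unique / len(counts)


# Toy sequence; on this input the percentage rises as the word length grows.
toy = "ACGTACGTTGCA"
profile = {length: percent_unique_words(toy, length) for length in (2, 4)}
```

On real genomes a counter held in memory would not scale to long words, so this is only a statement of what is being measured, not of how to measure it efficiently.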

Figure 2. Runtime comparisons on the human genome.

Runtime comparisons between fetchGWI using plain and compressed index files, tagger, and megablast. Each point is computed from the average of three runs on the human genome with different input data, except the last run, which was done on the whole dataset. Only perfect matches were sought, except in the two experiments explicitly noted, where 2 mismatches were allowed.

Figure 3. Runtime comparisons on multiple genomes.

Runtime comparisons between fetchGWI (using a compressed index file) and megablast. Each point is computed from the average of three runs on the combined genome of 9 species (human, mouse, honey bee, cattle, dog, drosophila, zebrafish, chimp, and rat) with different input data, except the last run done on the whole dataset. Only perfect matches were sought.

Figure 4. Runtime comparisons on combined index files.

Runtime comparisons of fetchGWI when using either multiple index files or a single, combined index file. Each point is computed from the average of three runs on the human and mouse genomes with different input data, except the last run, which was done on the whole dataset. Only perfect matches were sought.

Figure 5. Runtime comparisons on different filesystems.

Runtime comparisons of fetchGWI when the index file is stored either on a filesystem on local disks or on an SFS cluster filesystem. Each point is computed from the average of three runs with different input data, except the last run, which was done on the whole dataset. Only perfect matches were sought.

Cited by

Detection of genomic variation by selection of a 9 mb DNA region and high throughput sequencing.
Nikolaev SI, Iseli C, Sharp AJ, Robyr D, Rougemont J, Gehrig C, Farinelli L, Antonarakis SE. PLoS One. 2009 Aug 17;4(8):e6659. doi: 10.1371/journal.pone.0006659. PMID: 19684856. Free PMC article.
One-step generation of phenotype-expressing triple-knockout mice with heritable mutated alleles by the CRISPR/Cas9 system.
Fujii W, Onuma A, Sugiura K, Naito K. J Reprod Dev. 2014;60(4):324-7. doi: 10.1262/jrd.2013-139. Epub 2014 May 4. PMID: 25110137. Free PMC article.
Effect of C-type lectin 16 on dengue virus infection in Aedes aegypti salivary glands.
Chang YC, Liu WL, Fang PH, Li JC, Liu KL, Huang JL, Chen HW, Kao CF, Chen CH. PNAS Nexus. 2024 May 16;3(5):pgae188. doi: 10.1093/pnasnexus/pgae188. eCollection 2024 May. PMID: 38813522. Free PMC article.
GapmeR cellular internalization by macropinocytosis induces sequence-specific gene silencing in human primary T-cells.
Fazil MH, Ong ST, Chalasani ML, Low JH, Kizhakeyil A, Mamidi A, Lim CF, Wright GD, Lakshminarayanan R, Kelleher D, Verma NK. Sci Rep. 2016 Nov 24;6:37721. doi: 10.1038/srep37721. PMID: 27883055. Free PMC article.
Functional analysis of the ABCs of eye color in Helicoverpa armigera with CRISPR/Cas9-induced mutations.
Khan SA, Reichelt M, Heckel DG. Sci Rep. 2017 Jan 5;7:40025. doi: 10.1038/srep40025. PMID: 28053351. Free PMC article.
