SPLASH: structural pattern localization analysis by sequential histograms
A Califano. Bioinformatics. 2000 Apr.
Abstract
Motivation: The discovery of sparse amino acid patterns that match repeatedly in a set of protein sequences is an important problem in computational biology. Statistically significant patterns, that is, patterns that occur more frequently than expected, may identify regions that have been preserved by evolution and which may therefore play a key functional or structural role. Sparseness can be important because a handful of non-contiguous residues may play a key role, while others, in between, may be changed without significant loss of function or structure. Similar arguments may be applied to conserved DNA patterns. Available sparse pattern discovery algorithms are either inefficient or impose limitations on the type of patterns that can be discovered.
Results: This paper introduces a deterministic pattern discovery algorithm, called Splash, which can find sparse amino or nucleic acid patterns matching identically or similarly in a set of protein or DNA sequences. Sparse patterns of any length, up to the size of the input sequence, can be discovered without significant loss in performance. Splash is extremely efficient and embarrassingly parallel by nature. Large databases, such as a complete genome or the non-redundant SWISS-PROT database, can be processed in a few hours on a typical workstation. Alternatively, a protein family or superfamily, with low overall homology, can be analyzed to discover common functional or structural signatures. Some examples of biologically interesting motifs discovered by Splash are reported for the histone I and for the G-Protein Coupled Receptor families. Due to its efficiency, Splash can be used to systematically and exhaustively identify conserved regions in protein family sets. These can then be used to build accurate and sensitive PSSM or HMM models for sequence analysis.
Availability: Splash is available to non-commercial research centers upon request, conditional on the signing of a test field agreement.
Contact: acal@us.ibm.com, Splash main page http://www.research.ibm.com/splash
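Illustration: the abstract does not describe how Splash works internally, so the following is only a minimal sketch of what it means for a sparse pattern to match repeatedly in a set of sequences. It is a naive scan, not the Splash algorithm. The pattern notation (a dot for a "don't care" position), the function names, and the toy sequences are all assumptions made for this example.

```python
import re

# Toy sequences invented for this illustration; they are not real proteins.
TOY_SEQUENCES = {
    "seq1": "MKAAAGPKLT",
    "seq2": "TTKQWAHPEE",
    "seq3": "GGGKRRALPK",
    "seq4": "MSTNPKPQRK",
}


def sparse_pattern_to_regex(pattern):
    """Compile a sparse pattern such as 'K..A.P' into a regular expression.

    Letters must match exactly; '.' is a wildcard for any single residue.
    The pattern is wrapped in a lookahead so overlapping hits are all found.
    """
    body = "".join("." if ch == "." else re.escape(ch) for ch in pattern)
    return re.compile("(?=" + body + ")")


def find_occurrences(pattern, sequences):
    """Return, for each sequence name, the start offsets where the pattern matches."""
    regex = sparse_pattern_to_regex(pattern)
    return {
        name: [m.start() for m in regex.finditer(seq)]
        for name, seq in sequences.items()
    }


def support(pattern, sequences):
    """Count how many sequences contain at least one occurrence of the pattern."""
    hits = find_occurrences(pattern, sequences)
    return sum(1 for offsets in hits.values() if offsets)


if __name__ == "__main__":
    pattern = "K..A.P"  # three fixed residues separated by wildcard positions
    print(find_occurrences(pattern, TOY_SEQUENCES))
    print("support:", support(pattern, TOY_SEQUENCES))
```

In this toy set the three fixed residues K, A and P recur at the same relative spacing in three of the four sequences while the residues in between differ, which is the kind of non-contiguous conservation the Motivation paragraph describes. Deciding whether such a pattern is statistically significant would additionally require comparing its observed count with the count expected by chance, which this sketch does not attempt.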