Empirical statistical estimates for sequence similarity searches
Comparative Study
J Mol Biol. 1998 Feb 13;276(1):71-84.
doi: 10.1006/jmbi.1997.1525.
- PMID: 9514730
W R Pearson. J Mol Biol. 1998.
Abstract
The FASTA package of sequence comparison programs has been modified to provide accurate statistical estimates for local sequence similarity scores with gaps. These estimates are derived using the extreme value distribution from the mean and variance of the local similarity scores of unrelated sequences after the scores have been corrected for the expected effect of library sequence length. This approach allows accurate estimates to be calculated for both FASTA and Smith-Waterman similarity scores for protein/protein, DNA/DNA, and protein/translated-DNA comparisons. The accuracy of the statistical estimates is summarized for 54 protein families using FASTA and Smith-Waterman scores. Probability estimates calculated from the distribution of similarity scores are generally conservative, as are probabilities calculated using the Altschul-Gish lambda, kappa, and eta parameters. The performance of several alternative methods for correcting similarity scores for library-sequence length was evaluated using 54 protein superfamilies from the PIR39 database and 110 protein families from the Prosite/SwissProt rel. 34 database. Both regression-scaled and Altschul-Gish scaled scores perform significantly better than unscaled Smith-Waterman or FASTA similarity scores. When the Prosite/SwissProt test set is used, regression-scaled scores perform slightly better; when the PIR database is used, Altschul-Gish scaled scores perform best. Thus, length-corrected similarity scores improve the sensitivity of database searches. Statistical parameters that are derived from the distribution of similarity scores from the thousands of unrelated sequences typically encountered in a database search provide accurate estimates of statistical significance that can be used to infer sequence homology.
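The regression-scaled estimate described in the abstract can be illustrated with a minimal Python sketch. It is not the FASTA implementation: the abstract gives no formulas, so the sketch assumes a plain least-squares fit of score against the natural log of library-sequence length, a residual variance shared across all lengths, and the textbook extreme value (Gumbel) distribution standardized to mean 0 and variance 1. The function names (`fit_length_regression`, `z_score`, `evd_p_value`, `e_value`) are hypothetical, and the sketch omits any step for excluding related (high-scoring) sequences before fitting.

```python
import math

EULER_GAMMA = 0.5772156649015329  # Euler-Mascheroni constant


def fit_length_regression(scores, lengths):
    """Least-squares fit of score = a + b * ln(length).

    Returns (a, b, residual_variance). Assumes the input scores come
    from unrelated library sequences only.
    """
    xs = [math.log(n) for n in lengths]
    k = len(xs)
    mean_x = sum(xs) / k
    mean_y = sum(scores) / k
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, scores))
    b = sxy / sxx
    a = mean_y - b * mean_x
    residuals = [y - (a + b * x) for x, y in zip(xs, scores)]
    variance = sum(r * r for r in residuals) / (k - 2)
    return a, b, variance


def z_score(score, length, a, b, variance):
    """Length-corrected score in standard-deviation units."""
    expected = a + b * math.log(length)
    return (score - expected) / math.sqrt(variance)


def evd_p_value(z):
    """P(Z >= z) for a Gumbel distribution with mean 0 and variance 1."""
    # Scale is sqrt(6)/pi and location is -gamma*scale for that standardization.
    exponent = -(math.pi / math.sqrt(6.0)) * z - EULER_GAMMA
    return -math.expm1(-math.exp(exponent))  # 1 - exp(-exp(exponent))


def e_value(z, library_size):
    """Expected number of library sequences scoring at least z by chance."""
    return library_size * evd_p_value(z)
```

Under these assumptions, a library score is first converted with `z_score` using the fitted `a`, `b`, and `variance`, and `e_value(z, library_size)` then scales the single-comparison probability by the number of library sequences searched.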