Assessing annotation transfer for genomics: quantifying the relations between protein sequence, structure and function through traditional and probabilistic scores
J Mol Biol. 2000 Mar 17;297(1):233-49.
doi: 10.1006/jmbi.2000.3550.
- PMID: 10704319
Free article
C A Wilson et al. J Mol Biol. 2000.
Abstract
Measuring in a quantitative, statistical sense the degree to which structural and functional information can be "transferred" between pairs of related protein sequences at various levels of similarity is an essential prerequisite for robust genome annotation. To this end, we performed pairwise sequence, structure and function comparisons on approximately 30,000 pairs of protein domains with known structure and function. Our domain pairs, which are constructed according to the SCOP fold classification, range in similarity from just sharing a fold, to being nearly identical. Our results show that traditional scores for sequence and structure similarity have the same basic exponential relationship as observed previously, with structural divergence, measured in RMS, being exponentially related to sequence divergence, measured in percent identity. However, as the scale of our survey is much larger than any previous investigations, our results have greater statistical weight and precision. We have been able to express the relationship of sequence and structure similarity using more "modern scores," such as Smith-Waterman alignment scores and probabilistic P-values for both sequence and structure comparison. These modern scores address some of the problems with traditional scores, such as determining a conserved core and correcting for length dependency; they enable us to phrase the sequence-structure relationship in more precise and accurate terms. We found that the basic exponential sequence-structure relationship is very general: the same essential relationship is found in the different secondary-structure classes and is evident in all the scoring schemes. To relate function to sequence and structure we assigned various levels of functional similarity to the domain pairs, based on a simple functional classification scheme. 
This scheme was constructed by combining and augmenting annotations in the enzyme and fly functional classifications and comparing subsets of these to the Escherichia coli and yeast classifications. We found sigmoidal relationships between similarity in function and sequence, with clear thresholds for different levels of functional conservation. For pairs of domains that share the same fold, precise function appears to be conserved down to approximately 40% sequence identity, whereas broad functional class is conserved to approximately 25%. Interestingly, percent identity is more effective at quantifying functional conservation than the more modern scores (e.g. P-values). Results of all the pairwise comparisons and our combined functional classification scheme for protein structures can be accessed from a web database at http://bioinfo.mbb.yale.edu/align. Copyright 2000 Academic Press.
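The two identity thresholds reported in the abstract can be read as a rough rule of thumb for annotation transfer between domains that share a fold. The Python sketch below is illustrative only and is not the authors' method: the function name, the constants and the hard cutoffs are assumptions made here, whereas the abstract describes sigmoidal relationships with approximate thresholds.

```python
# Approximate thresholds quoted in the abstract (percent sequence identity).
# Treating them as hard cutoffs is a simplification made for this sketch.
PRECISE_FUNCTION_MIN_IDENTITY = 40.0
BROAD_CLASS_MIN_IDENTITY = 25.0


def transferable_annotation(percent_identity: float, same_fold: bool = True) -> str:
    """Hypothetical helper: which level of functional annotation might be
    carried over between two domains, given their percent identity."""
    if not 0.0 <= percent_identity <= 100.0:
        raise ValueError("percent_identity must be between 0 and 100")
    # The abstract's thresholds apply to domain pairs that share the same fold.
    if not same_fold:
        return "none"
    if percent_identity >= PRECISE_FUNCTION_MIN_IDENTITY:
        return "precise function"
    if percent_identity >= BROAD_CLASS_MIN_IDENTITY:
        return "broad functional class"
    return "none"
```

For example, under these assumptions a pair at 32% identity would be assigned "broad functional class", while a pair at 55% would be assigned "precise function". A real pipeline would more likely report a probability of conservation than a single label, since the underlying relationship is sigmoidal.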