Inference of functional regions in proteins by quantification of evolutionary constraints

Alexander L Simon et al. Proc Natl Acad Sci U S A. 2002.

Abstract

Likelihood estimates of local rates of evolution within proteins reveal that selective constraints on structure and function are quantitatively stable over billions of years of divergence. The stability of constraints produces an intramolecular clock that gives each protein a characteristic pattern of evolutionary rates along its sequence. This pattern allows the identification of constrained regions and, because the rate of evolution is a quantitative measure of the strength of the constraint, of their functional importance. We show that results from such analyses, which require only sequence alignments, are consistent with experimental and mutational data. The methodology has significant predictive power and may be used to guide structure–function studies for any protein represented by a modest number of homologs in sequence databases.

Figures

Figure 1

(A and B) Overlays of rate plots (RPs) from homologous proteins. x axis, position in sequence alignment; y axis, relative rate, calculated as the ratio of the substitutions within each window to the total number of substitutions in the alignment. Rates were first calculated and normalized independently, and plots were then overlaid in register with the sequence alignment. Small insertions in single sequences were removed before analysis. (C) Overlay of the rate plot of p53 (eleven sequences) and the frequency of missense mutations isolated from somatic tumors. Blue Roman numerals denote the canonical domains as described in the p53 literature (17); tet, tetramerization domain. Red numbers are the inferred ECRs, ranked according to the rate of the slowest evolving window. (D) Correlation of the average rate of evolution with the density of point mutations in the canonical domains (blue) or ECRs (red).
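The y-axis quantity in this caption can be illustrated with a short sketch. It is a simple counting approximation, not the likelihood estimation the abstract refers to: the input list of per-column substitution counts, the function name `relative_rates`, and the window size are all assumptions made for illustration.

```python
def relative_rates(subs_per_column, window=5):
    """Sliding-window relative rates along an alignment.

    subs_per_column: assumed input, one substitution count per alignment column.
    window: assumed window width in columns (the caption does not state one).
    Returns one value per window start: substitutions in the window divided by
    the total number of substitutions in the whole alignment.
    """
    total = sum(subs_per_column)
    if total == 0:
        raise ValueError("alignment has no substitutions")
    rates = []
    for start in range(len(subs_per_column) - window + 1):
        in_window = sum(subs_per_column[start:start + window])
        rates.append(in_window / total)
    return rates


# Toy counts for eight alignment columns (made-up numbers).
counts = [0, 0, 1, 4, 3, 0, 0, 2]
rates = relative_rates(counts, window=3)
```

Because each plot is normalized by its own alignment total, plots from different protein families can be overlaid on a common scale once they are placed in register with the sequence alignment.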

Figure 2

Visualization of relative rates in structures using a color spectrum; blue represents the slowest rates, orange and red the highest. This figure was prepared using RASMAC Version 2.6 (ref. ; http://www.umass.edu/microbio/rasmol/) and modified Protein Data Bank (PDB) files (ref. ; ht) in which the temperature field was substituted with the relative rate scaled linearly to span the range of available colors. (A) GAPDH tetramer (PDB ID code 1GD1). (B) The p53 core bound to DNA (PDB ID code 1TUP). (C) Virtual fusion of the N terminus of mouse Sonic Hedgehog (PDB ID code 1VHH) with the C terminus of Drosophila Hedgehog (PDB ID code 1AT0). Only three amino acids (3 aa) in the alignment separate the structures of the two domains.
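The file modification described in this caption can be sketched as follows. The sketch assumes standard fixed-column PDB ATOM/HETATM records (residue number in columns 23-26, temperature factor in columns 61-66); the function names, the residue-to-rate mapping, and the output range of 0 to 99 are hypothetical choices, since the caption does not give the scaling range that was used.

```python
def write_rates_to_bfactor(pdb_lines, rate_by_residue, lo=0.0, hi=99.0):
    """Replace the temperature-factor field with linearly scaled relative rates.

    rate_by_residue: assumed mapping {residue number: relative rate}.
    lo, hi: assumed output range; the smallest rate maps to lo, the largest to hi.
    Lines for residues without a rate, and non-atom lines, pass through unchanged.
    """
    rmin = min(rate_by_residue.values())
    rmax = max(rate_by_residue.values())
    span = (rmax - rmin) or 1.0  # avoid division by zero if all rates are equal
    out = []
    for line in pdb_lines:
        if line.startswith(("ATOM  ", "HETATM")) and len(line) >= 66:
            resseq = int(line[22:26])  # columns 23-26: residue sequence number
            if resseq in rate_by_residue:
                frac = (rate_by_residue[resseq] - rmin) / span
                scaled = lo + frac * (hi - lo)
                # columns 61-66: temperature factor, written as %6.2f
                line = line[:60] + f"{scaled:6.2f}" + line[66:]
        out.append(line)
    return out


def atom_line(serial, name, resname, chain, resseq, x, y, z, occ=1.0, b=0.0):
    """Hypothetical helper that builds a fixed-column ATOM record for the demo."""
    return (f"ATOM  {serial:5d} {name:<4s} {resname:>3s} {chain}{resseq:4d}    "
            f"{x:8.3f}{y:8.3f}{z:8.3f}{occ:6.2f}{b:6.2f}")


# Two made-up atoms and made-up rates for residues 1 and 2.
demo = [
    atom_line(1, "CA", "MET", "A", 1, 27.340, 24.430, 2.614, b=9.67),
    atom_line(2, "CA", "GLY", "A", 2, 26.266, 25.413, 2.842, b=10.38),
    "END",
]
recolored = write_rates_to_bfactor(demo, {1: 0.2, 2: 0.8})
```

A molecular viewer that colors atoms by temperature factor would then display the slowest-evolving residues at one end of its spectrum and the fastest at the other.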

Figure 3

RPs for five case studies. x axis, position in sequence alignment; y axis, smoothed relative rates for emphasis on detection of ECRs. Bars at y = 1 indicate regions of uncertain alignment. Blue and/or red bars underneath ECRs indicate the rate of the most slowly evolving window in the ECR (position on y axis) and the extent of the inferred ECR (extent along x axis). ECRs whose troughs are entirely contained within a region of uncertain alignment should be disregarded, but note that the resolution of the graphic is limited. (A) Notch; ECRs at the start and end of the repeat regions are labeled with the number of the repeat to which they correspond. The most slowly evolving repeats are in italics. The gap just N-terminal to the PEST ECR corresponds to 180 highly divergent positions. (B) β-catenin/armadillo; arrowhead points to a quickly evolving insertion in repeat 10. (C) SMC1/cohesin with the five known regions labeled. Note that each region contains several ECRs. (D) Overlay of the N-terminal RPs of Delta and Serrate prepared as described in Fig. 1. Gaps in the plots are due to alignment of the RPs to each other. Novel predicted domains are labeled in bold. (E) Overlay of Wnt1/wg and Wnt5a/b plots. The ranks of the three most slowly evolving ECRs are in the color corresponding to the paralog to which they belong. The sequences from human Wnt1 that correspond to the most slowly evolving windows in ECRs 1, 2, and 3 are, respectively, CKCHGMSGSCTVRTCWMRL, VNRGCRETAFIFAITSAGV, and CNSSSPALDGCELLCCGRG.
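The bars under each ECR encode two things: the extent of the region and the rate of its slowest window, which also determines its rank. A minimal sketch of that bookkeeping is given below. The captions do not state how an ECR boundary is decided, so the fixed `threshold` used here, along with the function name and the output fields, is an assumption for illustration only.

```python
def find_constrained_regions(rates, threshold):
    """Find contiguous troughs in a rate plot and rank them by their minimum.

    rates: smoothed relative rates, one per window position.
    threshold: assumed cutoff; positions with a rate below it count as constrained.
    Returns a list of dicts, slowest region first (rank 1).
    """
    regions = []
    start = None
    for i, r in enumerate(rates):
        if r < threshold and start is None:
            start = i                       # a trough opens
        elif r >= threshold and start is not None:
            regions.append((start, i - 1))  # the trough closed at the previous position
            start = None
    if start is not None:                   # trough runs to the end of the plot
        regions.append((start, len(rates) - 1))

    # Rank by the rate of the most slowly evolving window in each region.
    ranked = sorted(regions, key=lambda se: min(rates[se[0]:se[1] + 1]))
    return [
        {"rank": k + 1, "start": s, "end": e, "min_rate": min(rates[s:e + 1])}
        for k, (s, e) in enumerate(ranked)
    ]


# Made-up rate plot with two troughs.
plot = [1.2, 0.4, 0.3, 1.1, 0.9, 0.1, 0.2, 1.3]
ecrs = find_constrained_regions(plot, threshold=0.5)
```

In this toy plot the trough at positions 5 to 6 ranks first because its slowest window (0.1) is lower than that of the trough at positions 1 to 2 (0.3). Regions falling entirely inside stretches of uncertain alignment would still need to be discarded, as the caption notes.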

Figure 4

The Myb case study, illustrating the inference of evolved differences between paralogs. (A) Differences in relative rates between A-myb and B-myb (red), and C-myb and B-myb (green). Purple bar delineates the region C-terminal to the myb repeats that contains the acidic activator domain in A- and C-myb, comprising alignment positions 200–372. (B) Trees for each paralog relating the orthologs from human (H), mouse (M), chick (C), and Xenopus (X), with branch lengths proportional to the number of substitutions. The left set of trees is calculated from the entire alignment; the right set is calculated from just the positions corresponding to the region indicated in purple in A. This region is constrained equally in A and C, but in B it evolves much faster by comparison. (C) Most likely position of the origin of the constraint, which is shared between A and C, but not present in B and invertebrates. Yellow diamonds are gene duplications; vertebrate subtrees are simplified for display.

References

    1. Kimura M. The Neutral Theory of Molecular Evolution. Cambridge, U.K.: Cambridge Univ. Press; 1983.
    2. Li WH. Molecular Evolution. Sunderland, MA: Sinauer; 1997.
    3. Uzzell T, Corbin KW. Science. 1971;172:1089–1096.
    4. Yang Z. Mol Biol Evol. 1993;10:1396–1401.
    5. Felsenstein J, Churchill GA. Mol Biol Evol. 1996;13:93–104.
