Adaptive evolution in humans revealed by the negative correlation between the polymorphism and fixation phases of evolution

Abstract

The selective forces acting on amino acid substitutions may be different in the two phases of molecular evolution: polymorphism and fixation. Negative selection and genetic drift may dominate the first phase, whereas positive selection may become much more significant in the second phase. However, the conventional dichotomy of synonymous vs. nonsynonymous changes does not offer the resolution needed to study the dynamics of these two phases. Following previously published methods, we separated amino acid changes into 75 elementary types (1-bp substitution between their respective codons). The likelihood of each type of amino acid change becoming polymorphic (PI, which stands for “polymorphism index”), relative to synonymous changes, can then be calculated. Similarly, the likelihood of fixation (FI, for “fixation index”), conditional on common polymorphisms, is also calculated. Using Perlegen and HapMap data on human polymorphisms and the chimpanzee sequences as the outgroup, we compared the evolutionary dynamics of the 75 elementary changes in the two phases. We found a strong “L-shaped” negative correlation (P < 0.001) between FI and PI. Only those changes with low PIs show FI > 1, which is often a signature of adaptive evolution. These patterns suggest that negative and positive selection operate more effectively on the same set of amino acid changes and that ≈10–13% of amino acid substitutions between humans and chimpanzees may be adaptive.

Keywords: amino acid substitution, deleterious mutation, positive selection, selective constraint


How much of coding sequence evolution is adaptive remains a central question in molecular evolution (ref. 1 and see the refs. in ref. 2). Many studies have attempted to address this issue by considering levels of within-species polymorphism or between-species divergence (3, 4). One may gain further resolution by dividing the process of mutation substitution into two separate phases, polymorphism and fixation. When a new mutation arises in a population, it exists as a low-frequency polymorphism. In this first phase of evolution, the dominant forces are genetic drift and negative selection. If this mutation can reach a substantial frequency (say, 20% in humans), then it is unlikely to be strongly deleterious. Hence, the second phase of evolution, the process of fixation, is governed mainly by drift and positive selection. Whether these two phases are positively or negatively correlated will inform us about how natural selection operates. In this report, we shall address these questions by using human data, but the approach should be general.

To investigate the evolutionary dynamics and correlation between these two phases, we need to be able to classify mutations in coding regions into more categories than the traditional nonsynonymous and synonymous dichotomy. We follow the procedure developed by Tang et al. (refs. 5 and 6, see also ref. 7), who classified amino acid changes into 75 elementary types. An elementary amino acid change is one that can be reached by a 1-bp substitution in the codons (Phe to Leu, for example). The remaining 115 types are all composites of two or three elementary changes (Phe to Pro, for example). One can then ask how these 75 elementary changes differ in their likelihoods of becoming polymorphic, or fixed, and how the dynamics in these two phases are correlated.

Here, we describe an analysis of human polymorphism in relation to human–chimpanzee divergence by using a variety of publicly available data sets. Human polymorphism data are often collected in a way that results in strong ascertainment biases, which we demonstrate favor nonsynonymous over synonymous SNPs. (The bias potentially makes the estimation of positive selection conservative.) The comparisons among different amino acid changes help to ameliorate this bias, thereby allowing meaningful inferences on the forces shaping patterns of protein evolution to be made.

Results

Data and Tests.

We used four different data sets to address the question of adaptive evolution in coding regions of the human genome. The Perlegen (8) and HapMap data sets (9) are large collections of human SNPs. HapMap is an SNP-discovery project that we show preferentially focused on nonsynonymous SNPs, whereas Perlegen used all available SNPs. In addition, we analyzed data derived from the SeattleSNPs (http://pga.gs.washington.edu) (10) and National Institute of Environmental Health Sciences (NIEHS) database (http://egp.gs.washington.edu) (11), which are based on full resequencing data. The SeattleSNPs and NIEHS data were combined in our analysis because neither was collected with the knowledge of polymorphism from prior smaller-scale surveys. It is important to note that the SeattleSNPs and NIEHS data sets are relatively small and hence may not be truly representative of the genome.

Because of these different strategies for genotyping SNPs among databases, we first studied patterns of polymorphism and divergence for nonsynonymous and synonymous changes in each database to investigate possible ascertainment bias introduced by variable SNP discovery strategies.

Contrasting Divergence and Polymorphism for Nonsynonymous (A) Vs. Synonymous (S) Changes.

All SNPs are classified as either A (which stands for “amino acid altering”) or S. We polarized them into ancestral vs. derived states on the basis of the chimpanzee sequence, and our analysis is based on derived allele frequencies. The A/S ratios across the frequency spectrum, together with those for fixed changes between human and chimpanzee, are shown in Fig. 1 and supporting information (SI) Fig. 4.

Fig. 1.


A/S ratios partitioned by the frequency of the SNPs. The observed A and S as well as the expected S are also given. In the far right is the divergence between human and chimpanzee. The expected S for each frequency class is calculated on the assumption of neutral equilibrium, with the total being equal to the observed. In neutral equilibrium, the expected number of mutations that are observed i times in the sample is proportional to 1/i. (a) Perlegen data. (b) HapMap data.
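The neutral-equilibrium expectation described in the caption can be sketched as follows. This is a minimal illustration, not the authors' code: the sample size of 40 chromosomes and the five frequency bins are hypothetical choices, and the total of 6,101 is simply the sum of the rare and common synonymous counts reported for Perlegen in Table 1.

```python
def expected_neutral_counts(n, total):
    """Expected number of sites whose derived allele is seen i times
    (i = 1 .. n-1) in a sample of n chromosomes under neutral equilibrium.
    Counts are proportional to 1/i and scaled so that they sum to `total`."""
    weights = [1.0 / i for i in range(1, n)]
    norm = sum(weights)
    return [total * w / norm for w in weights]


def bin_by_frequency(counts, n, edges):
    """Pool per-count expectations into derived-frequency classes (lo, hi]."""
    binned = [0.0] * (len(edges) - 1)
    for i, count in enumerate(counts, start=1):
        freq = i / n
        for k in range(len(edges) - 1):
            if edges[k] < freq <= edges[k + 1]:
                binned[k] += count
                break
    return binned


n = 40                      # hypothetical sample size (chromosomes)
total_s = 3019 + 3082       # observed synonymous SNPs, Perlegen (Table 1)
expected = expected_neutral_counts(n, total_s)
classes = bin_by_frequency(expected, n, [0.0, 0.2, 0.4, 0.6, 0.8, 1.0])
```

Comparing `classes` with the observed synonymous counts per frequency class is one way to see whether low-frequency SNPs are underrepresented relative to the neutral expectation.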

As expected, in the Perlegen and HapMap data, low-frequency synonymous SNPs (<10%) are underrepresented, whereas the combined data of SeattleSNPs and NIEHS do not show such a trend. The demography of human populations suggests that low-frequency neutral mutations should equal or exceed the neutral equilibrium level (12). Our results are consistent with the expectation that there is a bias toward common SNPs in the Perlegen and HapMap data (13). This bias is stronger for nonsynonymous SNPs, especially in the HapMap data (see “HapMap Data”).

Perlegen Data.

Similar to many previous studies (1, 2, 14), the Perlegen data show a significantly larger A/S ratio for the lowest frequency class (≤20%) (χ2 = 47.2, P < 0.001; Fig. 1a). The highest frequency class has a reduced A/S ratio, but that is perhaps due to the small sample size in that class. A straightforward explanation for the excess in A/S in the lowest frequency class is the presence of slightly deleterious amino acid polymorphisms (15, 16). These amino acid variants can rise to a modest frequency before selection overcomes the effect of genetic drift. We used 20% as the frequency cutoff for common and rare SNPs. This cutoff value is reasonable, because deleterious mutations rarely reach the frequency of 20% unless |2Ns| < 10, where s denotes selective coefficient and N denotes effective population size (ref. 14, and see SI Fig. 5). Note that the results presented are insensitive to the choice of cutoff (see details in Discussion).

In Table 1, we summarize the observed and expected SNP patterns. The ratio between the first two A/S ratios (rare polymorphism over new mutation) is referred to as the polymorphism index (PI), defined as [A_rare/S_rare]/[A_mutation/S_mutation]. PI is the likelihood that a new amino acid substitution will become polymorphic, relative to that for a synonymous mutation. We also defined the fixation index (FI = [A_divergence/S_divergence]/[A_common/S_common]) as the likelihood that an amino acid variant that exists at moderate to high frequency (>20%) will become fixed, relative to that for a synonymous variant at the same frequency. This is a modified McDonald and Kreitman test (MK test) (3), which includes all of the variants. In general, FI > 1 has been accepted as a likely indication of positive selection in action. Note that the ascertainment bias toward middle frequency variants, especially the nonsynonymous ones, would lead to the underestimation of adaptive evolution. As a result, our conclusion would be a conservative one.

Table 1.

Summary of the polymorphism and divergence data from Perlegen and HapMap

Data set   Class                           A        S        A/S
Perlegen   New mutation                    9,742    3,019    3.227
Perlegen   Polymorphism (rare, ≤20%)       2,929    3,019    0.970
Perlegen   Polymorphism (common, >20%)     2,306    3,082    0.748
Perlegen   Divergence (fixed)              26,505   32,582   0.813
Perlegen   FI: observed = 1.087, expected = 1.128
HapMap     New mutation                    5,934    1,849    3.209
HapMap     Polymorphism (rare, ≤20%)       3,772    1,849    2.041
HapMap     Polymorphism (common, >20%)     2,937    2,073    1.417
HapMap     Divergence (fixed)              30,158   35,502   0.849
HapMap     FI: observed = 0.600, expected = 1.023
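The indices in Table 1 follow directly from the A and S counts. The short sketch below recomputes them from the tabulated values; the dictionary layout and function names are illustrative choices, not part of the original analysis.

```python
# (A, S) counts copied from Table 1.
TABLE1 = {
    "Perlegen": {
        "new": (9742, 3019),
        "rare": (2929, 3019),
        "common": (2306, 3082),
        "fixed": (26505, 32582),
    },
    "HapMap": {
        "new": (5934, 1849),
        "rare": (3772, 1849),
        "common": (2937, 2073),
        "fixed": (30158, 35502),
    },
}


def a_over_s(counts):
    """A/S ratio for one (A, S) pair."""
    a, s = counts
    return a / s


def polymorphism_index(data):
    """PI = [A_rare/S_rare] / [A_mutation/S_mutation]."""
    return a_over_s(data["rare"]) / a_over_s(data["new"])


def fixation_index(data):
    """FI = [A_divergence/S_divergence] / [A_common/S_common]."""
    return a_over_s(data["fixed"]) / a_over_s(data["common"])


for name, data in TABLE1.items():
    print(name,
          "PI = %.3f" % polymorphism_index(data),
          "FI = %.3f" % fixation_index(data))
```

The same two functions apply unchanged to any single class of amino acid change, provided the synonymous counts are kept as the denominator.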

The Perlegen data show an FI value of 1.087. Although the number is significantly larger than 1 (χ2 = 8.45, P < 0.05), the magnitude seems too small to be biologically meaningful. Furthermore, when data from multiple loci are combined, Shapiro et al. (2) have recently shown that the neutral expectation for the FI is not necessarily equal to 1. For example, following the procedure outlined by Shapiro et al. (2), we calculated the neutral value of the FI for the Perlegen data to be 1.128, which is indeed larger than the observed value. Hence, there is no evidence of positive selection by this analysis.
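The χ2 values quoted for the Perlegen data are consistent with an ordinary 2 × 2 contingency test on the Table 1 counts. The sketch below assumes a Pearson statistic without continuity correction, which is our reading rather than something the text states; it gives values close to the reported 47.2 (rare vs. common) and 8.45 (common vs. fixed).

```python
def chi_square_2x2(a, b, c, d):
    """Pearson chi-square statistic (1 df, no continuity correction)
    for the 2 x 2 table [[a, b], [c, d]]."""
    n = a + b + c + d
    return n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))


# Perlegen counts from Table 1, ordered as (A, S) per row.
rare_vs_common = chi_square_2x2(2929, 3019, 2306, 3082)
common_vs_fixed = chi_square_2x2(2306, 3082, 26505, 32582)
print(round(rare_vs_common, 2), round(common_vs_fixed, 2))
```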

HapMap Data.

The HapMap SNP patterns are summarized in Table 1. The A/S ratios of new mutations and divergence are comparable with those of the Perlegen data. However, the A/S ratios of both rare (≤20%) and common (>20%) SNPs are larger than those in the Perlegen data. The expected neutral FI is 1.023, which is greater than the observed FI, 0.600. The observed PI is 0.636 (2.041/3.209), which is also substantially higher than the estimate from Perlegen data.

As can be seen in Fig. 1b, the A/S ratios are very similar up to the highest level of polymorphism (>90%), but there is a steep drop in divergence that cannot be accounted for by demography, and no other data set shows this pattern. Nor can it be explained by the presence of slightly deleterious mutations. The most logical explanation is the strong bias in the inclusion of nonsynonymous polymorphisms over synonymous ones in the HapMap data.

Seattle + NIEHS Data.

The SNP patterns are given in SI Table 3. The combined SeattleSNPs and NIEHS data sets show two distinctive features. First, the A/S ratio in the new mutation class is unusually low (2.3 vs. 3.2 in the larger data sets). Second, the A/S ratio for the highest frequency class is more than twice that for other common polymorphisms (1.333 vs. 0.589). This result may be due to the small number of genes, which were chosen for their possible implications in immunity-related diseases (17). Fay and Wu's H test (18) is significant for many genes in the Seattle + NIEHS data set, indicating the excess of very high-frequency variants. Whether this excess is an indication of hitchhiking with advantageous mutation is not the focus of this study. We therefore excluded the 80–100% frequency class of SNPs from the MK test. The divergence A/S ratio (0.747) is indeed significantly higher than that of the common polymorphism (0.589). More accurately, the observed FI (1.205, based on polymorphisms between 20–80%) is larger than expected (1.079).

Summary of the MK Test Contrasting A and S Changes.

Although the combined SeattleSNPs and NIEHS data sets show some evidence of adaptive protein evolution by the MK test, the set of genes appears to be somewhat uncommon in their function (17). Indeed, the small number of genes chosen for specific purposes makes the extension of this result to the whole genome quite uncertain. Between the two large “genome-scale” data sets, ascertainment may have upwardly biased the polymorphic A/S ratio, wiping out any potential signal of positive selection detectable by the MK test. The following section is designed to see whether the signal of positive selection is indeed absent in the human–chimpanzee comparison or whether it has merely been obscured by ascertainment bias.

Contrasting Divergence and Polymorphism for the 75 Elementary Amino Acid Changes.

The potential biases toward collecting common amino acid polymorphisms complicate inferences of adaptive evolution and attenuate the power of statistical methods designed for complete sequence data. Therefore, we compared different classes of amino acid changes. Among the 190 (20 × 19/2) possible amino acid changes, 75 are referred to as elementary amino acid changes, which differ by 1 bp in their codons (5). We assume that there is much less ascertainment bias among the 75 elementary changes than between A and S changes. The observations below indeed support this assumption. The justification of adapting the framework of the MK test to analyzing different classes of amino acid changes is given in SI Table 4.

Perlegen Data Set.

With large data sets, we can calculate FI for each elementary change (see Materials and Methods). Under strict neutrality, the ratio of polymorphism to divergence (P/D) should be the same across all 75 elementary changes, much like the conventional MK test between A and S changes (Table 2). Again, we used common SNPs (>20%) to calculate FIs. The P/D ratios among the 75 classes in Table 2 are highly heterogeneous (χ2 = 186.4, P < 0.001), indicating variation in FI among classes. By sequentially removing each class in the descending order of FI, we found that the 41 classes with the lower FI values are homogeneous, with an average FI of 0.948.
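A heterogeneity test of this kind can be sketched as a Pearson χ2 on a k × 2 table of polymorphism and divergence counts. The snippet is an illustration only: it uses six of the classes listed in Table 2 rather than all 75, so its statistic is not comparable to the reported 186.4, and it does not reproduce the sequential-removal step.

```python
def chi_square_heterogeneity(rows):
    """Pearson chi-square for a k x 2 table.

    rows: list of (common_polymorphism, divergence) counts, one per class.
    Returns (statistic, degrees_of_freedom). Under strict neutrality the
    P/D ratio is the same in every class and the statistic stays small.
    """
    total_p = sum(p for p, _ in rows)
    total_d = sum(d for _, d in rows)
    grand = total_p + total_d
    stat = 0.0
    for p, d in rows:
        row_total = p + d
        for observed, col_total in ((p, total_p), (d, total_d)):
            expected = row_total * col_total / grand
            stat += (observed - expected) ** 2 / expected
    return stat, len(rows) - 1


# (common polymorphism, divergence) for a few Perlegen classes in Table 2.
subset = [
    (1, 43),      # Lys-Ile
    (3, 85),      # Lys-Met
    (7, 196),     # Asn-His
    (97, 896),    # Arg-Gln
    (174, 1445),  # Val-Ile
    (82, 537),    # Thr-Met
]
stat, df = chi_square_heterogeneity(subset)
```

A P value would require the χ2 distribution with `df` degrees of freedom, which is omitted here to keep the sketch free of external dependencies.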

Table 2.

Summary of polymorphism and divergence for the 75 elementary amino acid changes

Data set   AA1 / AA2           New mutation   Rare polymorphism (≤20%)   PI      Common polymorphism (>20%)   Divergence (fixed)   FI
Perlegen   Leu / Trp           39             3                          0.077   0                            54.5                 >5.0
Perlegen   Lys / Ile           54             2                          0.037   1                            43                   4.067
Perlegen   Lys / Met           80             8                          0.1     3                            85                   2.68
Perlegen   Asn / His           95             10                         0.105   7                            196                  2.649
Perlegen   Ser / Ile           104            12                         0.116   4                            108.5                2.566
Perlegen   Tyr / Phe           96             8                          0.084   6                            156                  2.459
Perlegen   Arg / Ile           32             4                          0.127   1                            25.5                 2.412
Perlegen   Gln / Leu           137            7                          0.051   5                            121                  2.289
Perlegen   Ile / Leu           154            18                         0.117   12                           281.5                2.219
Perlegen   Ile / Phe           111            9                          0.081   6                            128                  2.018
Perlegen   Arg / Gln           96             140                        1.456   97                           896                  0.874
Perlegen   Pro / Leu           201            78                         0.388   80                           734.5                0.868
Perlegen   Ser / Leu           60             43                         0.717   29                           263                  0.858
Perlegen   Val / Met           71             97                         1.362   74                           646.5                0.826
Perlegen   Ser / Phe           105            29                         0.276   29                           249.5                0.814
Perlegen   Arg / His           58             125                        2.141   90                           754                  0.792
Perlegen   Val / Ile           118            233                        1.974   174                          1,445                0.786
Perlegen   Arg / Thr           69             14                         0.204   15                           108                  0.681
Perlegen   Arg / Trp           69             49                         0.710   41                           282.5                0.652
Perlegen   Thr / Met           40             76                         1.900   82                           537                  0.619
Perlegen   Top 34 classes      4,457          622                        0.14    419                          7,589                1.713
Perlegen   Bottom 41 classes   5,286          2,307                      0.436   1,887                        18,916               0.948
Perlegen   Synonymous          3,019          3,019                              3,082                        32,582
HapMap     Top 30 classes      2,260          734                        0.325   437                          7,168                0.958
HapMap     Bottom 45 classes   3,673          3,038                      0.827   2,500                        22,990               0.537
HapMap     Synonymous          1,849          1,849                              2,073                        35,502

What types of amino acid changes may have high or low FIs? It has been suggested that positive selection tends to operate on the more conservative changes (19), whereas others argued that it works more often on the radical ones (20, 21). There are two commonly used indices for the physicochemical differences between amino acids: Grantham's distance, which takes into account the volume, polarity, and carbon composition of the side chain of each amino acid (22), and Miyata's distance, which measures volume and polarity (23). Conservative changes have smaller distances than radical ones by either measure. As can be seen in Fig. 2 and SI Fig. 6, there is no correlation between FIs and either measure of amino acid properties.

Fig. 2.


The observed FIs for the 75 elementary amino acid changes in the Perlegen data as a function of the physicochemical distance (Grantham's distance, ref. 22) between amino acid pairs. Note the absence of any correlation.

Is there a good predictor of FIs? Fig. 3 shows the correlation between FI and PI among the 75 elementary changes based on the Perlegen data. The plot of FI against PI shows an L-shaped distribution; FI is thus negatively correlated with PI (correlation coefficient r = −0.43, P < 0.001). Notably, amino acid changes with high FI values all have fairly low PI values. In other words, amino acid changes that are less likely to become polymorphic are much more likely to become fixed once they become polymorphic. This result is robust against the choice of cutoff for medium polymorphism. The results with a cutoff of 10% or 30% are shown in SI Fig. 7.
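The direction of this relationship can be checked on the classes that Table 2 lists individually. The sketch below computes a Pearson correlation on those 19 rows (Leu–Trp is left out because its FI is given only as >5.0). Because it covers a subset of the 75 classes, the coefficient will differ from the reported r = −0.43; only the sign is expected to agree.

```python
from math import sqrt

# (PI, FI) pairs for the Perlegen classes listed in Table 2.
PI_FI = [
    (0.037, 4.067), (0.1, 2.68), (0.105, 2.649), (0.116, 2.566),
    (0.084, 2.459), (0.127, 2.412), (0.051, 2.289), (0.117, 2.219),
    (0.081, 2.018), (1.456, 0.874), (0.388, 0.868), (0.717, 0.858),
    (1.362, 0.826), (0.276, 0.814), (2.141, 0.792), (1.974, 0.786),
    (0.204, 0.681), (0.710, 0.652), (1.900, 0.619),
]


def pearson_r(pairs):
    """Pearson correlation coefficient for a list of (x, y) pairs."""
    n = len(pairs)
    mean_x = sum(x for x, _ in pairs) / n
    mean_y = sum(y for _, y in pairs) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in pairs)
    var_x = sum((x - mean_x) ** 2 for x, _ in pairs)
    var_y = sum((y - mean_y) ** 2 for _, y in pairs)
    return cov / sqrt(var_x * var_y)


r = pearson_r(PI_FI)
print("r on the listed classes: %.2f" % r)
```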

Fig. 3.


Correlation between FI and PI among the 75 elementary amino acid changes. Points inside the dashed box have FI values that are statistically higher than the rest, which have the same FI, indicated by the horizontal dashed line. (a) Perlegen data. (b) HapMap data. Note that only changes with low PI have high FI values.

A possible explanation for the negative correlation between the two phases of evolution, polymorphism and fixation, is also the simplest one: Positive and negative selection are both more effective on the same subset of amino acid changes. If amino acid 1↔amino acid 2 changes are more likely to be deleterious than other types of changes, then they are also more likely to be advantageous. Thus, although negative selection against amino acid 1↔amino acid 2 would often prevent the changes from becoming polymorphic, positive selection is also more effective in driving them to fixation when they do become polymorphic.

The opposite dynamics of becoming polymorphic and becoming fixed also explain the lack of power in predicting FI by the conventional measures (see Fig. 2 for examples). Most measures attempt to predict the long-term evolutionary dynamics of amino acid changes, or evolutionary index (EI), in the terminology of Tang et al. (5). Because EI is proportional to PI × FI, a good predictor of PI would obviously be a bad one for FI and vice versa. Therefore, the best predictors from those attempts are probably the ones that do not do particularly well (nor particularly poorly) in either the polymorphism or fixation phase. This compromise applies to other measures of evolutionary dynamics such as percent accepted mutation (PAM) (24) and blocks substitution matrix (Blosum) (25). For similar reasons, physicochemical distance measures of amino acids have not been very successful in predicting EI (5), because most of these measures rely on some evolutionary indices as well.

Finally, if common polymorphisms are all neutral, then FIs among classes should not be statistically different, but Fig. 3 and Table 2 reveal a subset of elementary changes with unusually large FIs that stand apart from the homogeneous group. The dashed line of Fig. 3 separates those that have higher-than-expected FIs (above the dashed line) from the homogeneous group (below the dashed line). The latter all have an FI value of ≈0.948, as opposed to the range of values from 1.38 to 4.1 in the former group with an average of 1.71. If we use the MK test on the former group as a whole (see Table 2), then the number of amino acid changes between species in excess of expectation is 3,160 (= 7,589 − [419 × 32,582/3,082]), which is 11.9% (3,160/[7,589 + 18,916]) of the total amino acid substitutions. This proportion is often interpreted to be due to positive selection, although considerable caution has to be exercised in this interpretation (see Discussion).

HapMap Data Set.

The main difference between HapMap and Perlegen appears to be a greater bias toward nonsynonymous polymorphisms in the former compared with the latter. By the χ2 test, the top 30 classes have significantly higher FI values (mean = 0.958) than those of the remaining classes (mean = 0.537; see bottom of Table 2). When FI is plotted against PI, the HapMap data behave qualitatively like the Perlegen data (Fig. 3b). The difference is that the FI values stabilize at 0.537 for HapMap. Because this low FI value is true even for very high frequency SNPs (see Fig. 1b), we suggest that this may be close to the neutral FI value, which is much less than 1 because of ascertainment bias, as discussed earlier.

Given the ascertainment bias, it would not be possible to estimate adaptive evolution by the conventional MK test, which contrasts A and S changes. However, if we assume that the homogeneous group, which consists of the bottom 45 classes of amino acid changes, represents neutral variants, then they may be substituted for silent changes in the MK test (see bottom of Table 2). We may thus calculate the excess in amino acid substitutions among the top 30 classes as 3,149.3 (= 7,168 − [437 × 22,990/2,500]), which is 10.4% (3,149.3/[7,168 + 22,990]) of the total. (If we use the same procedure of contrasting low and high FI amino acid changes on the Perlegen data, then the percentage of excess is 12.8%.) The estimated proportion of adaptive changes is surprisingly similar between the two data sets despite their very different absolute values.
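The excess calculations quoted for the two data sets share one formula: the divergence expected for the high-FI classes is their common polymorphism count scaled by the D/P ratio of a reference class assumed to be neutral, and the excess is the observed divergence minus that expectation. A minimal sketch using the Table 2 totals follows; the function name and argument order are illustrative.

```python
def adaptive_excess(d_test, p_test, d_ref, p_ref, d_total):
    """MK-style excess of substitutions in a test class.

    d_test, p_test: divergence and common polymorphism in the test class.
    d_ref, p_ref:   the same counts in the reference class assumed neutral.
    d_total:        all amino acid substitutions (denominator of the fraction).
    Returns (excess, excess / d_total).
    """
    expected = p_test * d_ref / p_ref
    excess = d_test - expected
    return excess, excess / d_total


# Perlegen, synonymous changes as the neutral reference.
perlegen_syn = adaptive_excess(7589, 419, 32582, 3082, 7589 + 18916)
# HapMap, bottom 45 amino acid classes as the neutral reference.
hapmap_low = adaptive_excess(7168, 437, 22990, 2500, 7168 + 22990)
# Perlegen, bottom 41 amino acid classes as the neutral reference.
perlegen_low = adaptive_excess(7589, 419, 18916, 1887, 7589 + 18916)

for label, (excess, fraction) in [("Perlegen (S ref)", perlegen_syn),
                                  ("HapMap (low-FI ref)", hapmap_low),
                                  ("Perlegen (low-FI ref)", perlegen_low)]:
    print("%s: excess = %.1f, fraction = %.1f%%" % (label, excess, 100 * fraction))
```

The three fractions correspond to the 11.9%, 10.4%, and 12.8% figures given in the text.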

Discussion

It seems possible that, in the search for disease-associated SNPs, common nonsynonymous ones would be preferentially included, relative to synonymous changes. As a result, the A/S ratios are likely to be inflated for common polymorphisms, relative to divergence (as well as low-frequency SNPs). This potential bias has rendered human SNP data in coding regions difficult to interpret with respect to adaptive evolution. Despite this inherent ascertainment bias, the large data sets can still be informative by comparing different types of amino acid changes. In general, there should be much less bias to preferentially include, for instance, Glu–Leu over Ala–Pro changes in SNP discovery.

A different and perhaps more important reason to classify amino acid changes into the 75 elementary types is to compare their evolutionary dynamics in the polymorphism and fixation phases of evolution. Because selection operating in these two phases is likely to be different (for example, negative selection is unlikely to play a major role in the fixation phase), the distinction should provide a better resolution for measuring selective pressure.

Another issue is the assignment of the ancestral vs. derived state for any human SNP. The parsimonious assignment using a chimpanzee sequence as the outgroup has an inherent error rate. In SI Fig. 8, we used the macaque sequence as a second outgroup and estimated the error rate in polarizing the ancestral vs. derived variant to be ≈0.65%. In other words, of every 100 SNPs, <1 site is expected to have its derived state assigned incorrectly.

The finding of an L-shaped negative correlation between FI and PI has a simple interpretation: Amino acid changes that experience stronger negative selection are also more likely to experience stronger positive selection. This finding would imply that dissimilar amino acids are not only more likely to be deleterious but also more likely to be advantageous. The latter implication, that advantageous amino acid changes tend not to be of the conservative kind, has been a point of contention in the literature (26, 27). Indeed, both the neutral theory and the neo-Darwinian view seem to suggest otherwise. For example, Kimura (28) stated a rule of molecular evolution as such: “Those mutant substitutions that are less disruptive to the existing structure and function of a molecule (conservative substitutions) occur more frequently in evolution than more disruptive ones.” In the neo-Darwinian view, adaptive evolution is believed to take small incremental steps (29), and conservative amino acid changes certainly fit the bill well. The observation of Fig. 3 is thus somewhat surprising, because almost all adaptive changes have low PIs and, hence, are nonconservative (if we measure the conservativeness by PI). Strictly speaking, conservative vs. radical changes should be defined in physicochemical terms. However, because there are many such terms including molecular weight, volume, surface area, polarity, AWR, pI, aliphatic, aromatic and so on, their relative importance is usually determined by how well each is correlated with evolutionary rate. Thus, conservative measures cannot be decoupled from evolutionary dynamics.

Indeed, physicochemical distances are usually developed to predict the evolutionary dynamics of amino acid substitutions. Although both of the two commonly used similarity measures (Grantham's and Miyata's) have some power in predicting substitutions between species (5, 19, 30, 31), neither has any power in predicting the fixation probability (Fig. 2 and SI Fig. 6). Most similarity measures attempt to predict the likelihood of substitution between species (i.e., EI; see ref. 5), which is the product of two negatively correlated quantities, PI and FI. As explained earlier, no measure is likely to predict either PI or FI particularly well because of this negative correlation. In light of the observations of Fig. 3, we suggest that new measures be developed by fitting the predictions to PI and FI separately. PI alone is certainly a better measure than EI for the conservativeness of amino acid changes. The higher the PI, the more likely the amino acid changes can become polymorphic and the changes can be said to be “more conservative.”

We estimated the proportion of adaptive amino acid changes between human and chimpanzee to be 10.4–12.8%. This range is close to estimates in previous studies (32) that used entirely different approaches. Our estimate is accurate only if the FI value of the homogeneous group of Fig. 3 represents the true neutral value. This value is close to 1 in Fig. 3a, but in Fig. 3b, it is much less than 1 (presumably due to ascertainment bias in HapMap data). The assumption is reasonable as long as SNPs >20% in human populations are neutral variants and unlikely to be deleterious (see SI Fig. 5 for justification). If they are advantageous, then our estimate of adaptive evolution would be conservative. Other general caveats against the adaptive interpretation of the MK test (3, 33) of course apply.

In summary, by dividing amino acid changes into the 75 elementary classes, which is increasingly feasible with large genomic data sets, we gain insight into molecular evolutionary processes both within and between species. Coding regions in humans are found to be under both strong positive and negative selection by this type of analysis.

Materials and Methods

Data Collection.

Human polymorphism data were collected from four SNP databases: Perlegen Sciences (8), The International HapMap Project (phase one freeze data, www.hapmap.org), SeattleSNPs (http://pga.mbt.washington.edu), and NIEHS SNPs (http://egp.gs.washington.edu). Our study was focused on autosomal coding SNPs.

For the SeattleSNPs and NIEHS data, DNA and protein sequences of the corresponding human genes to the genotyped SNPs were provided by the databases. For Perlegen and HapMap data, the annotations of genotyped SNPs and the corresponding DNA and protein sequences of human genes were obtained from the Single Nucleotide Polymorphism Database of the National Center for Biotechnology Information (www.ncbi.nlm.nih.gov/SNP) by using the reference numbers (rs numbers) of SNPs.

The cDNA and protein sequences of human and chimpanzee were obtained from Ensembl database (www.ensembl.org/info/data/download.html). The putative orthologous protein sequences of chimpanzee to human were determined by reciprocal BLAST top hits (34). The putative orthologous pairs of human and chimpanzee protein sequences were aligned by CLUSTALW (35) and the corresponding cDNA sequences were aligned according to these protein alignments by the tranalign program, which is included in the EMBOSS package (36).
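The reciprocal-top-hit criterion can be illustrated without running BLAST itself. In the sketch below, the two dictionaries stand in for the best hit of each query in each search direction; the identifiers are hypothetical placeholders, and parsing of actual BLAST output is not shown.

```python
def reciprocal_best_hits(human_to_chimp, chimp_to_human):
    """Return (human, chimp) pairs that are each other's top hit.

    human_to_chimp: best chimpanzee hit for each human protein.
    chimp_to_human: best human hit for each chimpanzee protein.
    """
    return sorted(
        (human, chimp)
        for human, chimp in human_to_chimp.items()
        if chimp_to_human.get(chimp) == human
    )


# Hypothetical identifiers, for illustration only.
human_best = {"hsa_P1": "ptr_P1", "hsa_P2": "ptr_P9", "hsa_P3": "ptr_P3"}
chimp_best = {"ptr_P1": "hsa_P1", "ptr_P9": "hsa_P7", "ptr_P3": "hsa_P3"}

orthologs = reciprocal_best_hits(human_best, chimp_best)
```

Here `hsa_P2` is dropped because its best chimpanzee hit points back to a different human protein.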

The human gene from Ensembl database and the gene from SNP databases are considered to be identical when they have 100% match for the entire coding sequence. Five thousand eight human–chimpanzee orthologs were identified for the Perlegen data, 5,535 were identified for the HapMap data, and 274 were identified for the SeattleSNPs + NIEHS data.

All of the human SNPs we used were polarized into ancestral and derived alleles according to parsimony referring to chimpanzee DNA sequences. SNPs that were unable to be polarized or SNPs that possessed more than three alleles were not used in this study. Furthermore, SNPs whose corresponding human gene does not have a chimpanzee ortholog were also not considered.
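Parsimony polarization with a single outgroup reduces to a simple rule: the human allele that matches the chimpanzee base is treated as ancestral and the other as derived. The sketch below is a simplified illustration that handles only biallelic SNPs and returns `None` when the chimpanzee base matches neither allele; the function names and the allele-count input format are assumptions.

```python
def polarize(human_alleles, chimp_base):
    """Assign (ancestral, derived) alleles by parsimony.

    Returns None when the SNP is not biallelic or when the chimpanzee
    base matches neither human allele (cannot be polarized).
    """
    alleles = {a.upper() for a in human_alleles}
    chimp = chimp_base.upper()
    if len(alleles) != 2 or chimp not in alleles:
        return None
    derived = (alleles - {chimp}).pop()
    return chimp, derived


def derived_frequency(allele_counts, derived):
    """Derived allele frequency from a dict of allele -> chromosome count."""
    total = sum(allele_counts.values())
    return allele_counts.get(derived, 0) / total


state = polarize(("A", "G"), "G")               # G ancestral, A derived
freq = derived_frequency({"A": 12, "G": 36}, "A")
is_common = freq > 0.20                          # 20% cutoff used in the text
```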

The Expected Number of New Mutations.

We used the same method in Tang et al. (5) to obtain the expected number of new mutations. When one nucleotide substitution is allowed, one codon can change in nine different ways. Some of the changes may result in amino acid changes (A sites) and others may not (S sites; refs. 37 and 38). Once a nucleotide sequence is given, the ratio of S sites/A sites can be calculated. We included the transition/transversion ratio in this calculation, and we estimated this ratio as 2.4. This ratio was estimated by the fourfold degenerate sites of 5,535 human genes. The expected number of nonsynonymous changes was obtained by setting the expected number of synonymous changes equal to the observed.
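The nine single-nucleotide neighbors of a codon can be enumerated directly from the standard genetic code. The sketch below weights each change by whether it is a transition or a transversion and sums the weights into synonymous and nonsynonymous totals. It rests on two assumptions that the text does not spell out: the value 2.4 is applied as the weight of one transition relative to one transversion, and changes to stop codons are skipped. It should therefore be read as an illustration of the idea rather than as the procedure of Tang et al. (5).

```python
BASES = "TCAG"
# Standard genetic code, first base varying slowest (T, C, A, G order).
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}
TRANSITIONS = {("A", "G"), ("G", "A"), ("C", "T"), ("T", "C")}


def site_weights(codon, ts_weight=2.4):
    """Weighted (synonymous, nonsynonymous) totals over the nine
    single-nucleotide changes of a sense codon.

    ts_weight is the assumed weight of a transition relative to a
    transversion; changes that create a stop codon are ignored.
    """
    syn = nonsyn = 0.0
    for pos in range(3):
        for base in BASES:
            if base == codon[pos]:
                continue
            mutant = codon[:pos] + base + codon[pos + 1:]
            if CODON_TABLE[mutant] == "*":
                continue
            weight = ts_weight if (codon[pos], base) in TRANSITIONS else 1.0
            if CODON_TABLE[mutant] == CODON_TABLE[codon]:
                syn += weight
            else:
                nonsyn += weight
    return syn, nonsyn


met = site_weights("ATG")   # no synonymous neighbors
gly = site_weights("GGG")   # fourfold degenerate third position
```

Summing these weights over every codon of a gene gives an S/A ratio of new mutations, from which the expected number of nonsynonymous changes follows once the expected synonymous count is set equal to the observed.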

We also used the concept of elementary amino acid changes, which is suggested by Tang et al. (5). Nonsynonymous changes can be classified into 75 kinds of elementary changes, which are caused by one nucleotide change in a codon. The expected number of amino acid changes for each class can be calculated by distinguishing nonsynonymous changes according to the elementary changes (5).
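The count of 75 elementary changes can be verified by enumerating all single-nucleotide neighbors in the standard genetic code and collecting the unordered amino acid pairs they connect. The following sketch does this; it ignores stop codons and treats each pair as undirected, matching the 20 × 19/2 = 190 counting used in the text.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code, first base varying slowest (T, C, A, G order).
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def elementary_pairs():
    """Unordered amino acid pairs whose codons differ by a single base."""
    pairs = set()
    for codon, aa in CODON_TABLE.items():
        if aa == "*":
            continue
        for pos, base in product(range(3), BASES):
            if base == codon[pos]:
                continue
            other = CODON_TABLE[codon[:pos] + base + codon[pos + 1:]]
            if other != "*" and other != aa:
                pairs.add(frozenset((aa, other)))
    return pairs


pairs = elementary_pairs()
print(len(pairs), "elementary pairs out of", 20 * 19 // 2)
```

Phe–Leu appears in the resulting set, whereas Phe–Pro does not, in agreement with the examples given in the introduction.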

Supplementary Material

Supporting Information

Acknowledgments

We thank Joshua Shapiro and Kai Zeng for helpful suggestions during the course of this study, Shintaroh Ueda for the suggestion to use macaques, two anonymous reviewers for thoughtful comments, and the HapMap Consortium for making HapMap data available. This work was supported by National Institutes of Health Grant GM076036 (to C.-I.W. and J.M.A.).

Abbreviations

A: nonsynonymous substitutions

EI: evolutionary index

FI: fixation index

MK test: McDonald and Kreitman test

PI: polymorphism index

S: synonymous substitutions

Footnotes

The authors declare no conflict of interest.

This article is a PNAS direct submission.

