Estimating Ancestral Population Sizes and Divergence Times (original) (raw)

Journal Article

Department of Human Genetics

, University of Chicago, Chicago, Illinois 60637

Search for other works by this author on:

Received:

12 February 2002

Accepted:

14 October 2002

Published:

01 January 2003

Navbar Search Filter Mobile Enter search term Search

Abstract

This article presents a new method for jointly estimating species divergence times and ancestral population sizes. The method improves on previous ones by explicitly incorporating intragenic recombination, by utilizing orthologous sequence data from closely related species, and by using a maximum-likelihood framework. The latter allows for efficient use of the available information and provides a way of assessing how much confidence we should place in the estimates. I apply the method to recently collected intergenic sequence data from humans and the great apes. The results suggest that the human-chimpanzee ancestral population size was four to seven times larger than the current human effective population size and that the current human effective population size is slightly >10,000. These estimates are similar to previous ones, and they appear relatively insensitive to assumptions about the recombination rates or mutation rates across loci.

THE effective population size (_N_e) of a species has a direct effect on the amount and the pattern of DNA sequence variation. Researchers have therefore used sequence polymorphism data to estimate _N_e (e.g., Kreitman 1983; Takahata 1993; Nachman and Crowell 2000). The amount of observed diversity can be used to estimate the population mutation parameter θ= 4_N_eμ, and the per generation mutation rate μ can be estimated either directly (Harada et al. 1993; Giannelli et al. 1999) or indirectly (e.g., Kimura 1983; Satta et al. 1993; Kumar and Hedges 1998; Nachman and Crowell 2000) from divergence data (given assumptions about the divergence date and the average generation time).

Most estimates of _N_e for humans are ∼10,000-15,000 (e.g., Takahata 1993; Harding et al. 1997). While there are many possible reasons why the effective population size may be quite different from the census population size (Caballero 1994), it remains surprising that the human _N_e is so low, especially given humans’ large range over the past 1-2 million years (MY; e.g., Swisher et al. 1994; Gabunia and Vekua 1995). In particular, great ape species historically have had much smaller ranges, but have _N_e two to three times larger than the human _N_e. Did some event associated with the founding of the genus Homo (Takahata 1993) or some other particular event in human history lead to a sharp reduction in effective population size? It is difficult to answer this without knowing how _N_e has varied over evolutionary time. Recently, progress has been made in estimating the effective population size of the population directly ancestral to two extant daughter species.

A few main methods exist for estimating _N_a (see Takahata and Satta 2002 for a more in-depth discussion). (We refer to the ancestor’s population size as the “ancestral _N_e,” or _N_a.) One method, referred to as the trichotomy method, requires orthologous sequence data from three closely related species (Nei 1987; Wu 1991). This approach uses a single orthologous sequence from each of three species and assumes a simple model of population history where at fixed times different species become isolated with no further admixture (cf. Hey 1994). Random mating is assumed within each population. If the time between the two speciation events is small, the gene tree for a particular region will not always match the species tree (Nei 1987; see Figure 1). The probability that this happens depends in part on _N_a for the ancestral population of species 1 and 2. In particular, a necessary (but not sufficient) condition for the gene tree and the species tree to be incompatible is that the species 1 and 2 lineages do not coalesce between the two speciation events. If _N_a is larger, the probability of a common ancestor before time _T_2 is reduced, leading to a greater chance that the gene tree and the species tree do not match. The trichotomy method uses orthologous data from many unlinked loci, infers the gene tree at each locus, calculates the proportion of loci where the inferred gene tree does not match the species tree, and then uses this proportion to estimate _N_a. Application of the trichotomy method to human and great ape sequence data has led to estimates of _N_a (for the human-chimpanzee ancestral population) substantially larger than the current _N_e for humans (Ruvolo 1997; Chen and Li 2001; Takahata and Satta 2002). Chen and Li (2001) estimate, for example, _N_a = 52,000-96,000, or roughly five to nine times larger than the current human _N_e.

Figure 1.

—Two possible gene trees given a particular species tree. _T_2 is the time when the first speciation event occurs. In A, the gene tree and species tree are compatible, while in B they are incompatible. Divergence between single orthologous sequences from two species (species 1 and 2) consists of two parts: time when the species are separated (y) and the time when the two sequences segregate in the ancestral population (x).

Another method for estimating N_a requires divergence data from two or three species (Takahata 1986; Takahata et al. 1995; Yang 1997, 2002). Here, I describe the two-species method, because that is what is generally used. Given two orthologous sequences, one each from a pair of species, they will coalesce at some time that predates the species divergence time (see Figure 1). For the autosomes, the time spent in the ancestral population before coalescence (x in Figure 1) is exponentially distributed, with mean 2_N_a_g (where g is the average generation time and _N_a is the diploid ancestral _N_e). In contrast, the postspeciation branch lengths (y in Figure 1) are fixed. Given data from multiple unlinked loci and assumptions about g and μ, one can use maximum likelihood to jointly estimate the speciation time and _N_a (Takahata et al. 1995; Yang 1997). The general idea is that large values of _N_a correspond to greater variability in the coalescence time of two orthologous sequences and thus greater variance in the observed divergences across loci. Using human and chimpanzee divergence data, estimates of the human-chimpanzee _N_a are ∼5-10 times the current _N_e (Takahata and Satta 1997; Takahata 2001).

Finally, two other methods require intraspecific polymorphism data from two species and use either a moment-based (Wakeley and Hey 1997) or a maximum-likelihood (Nielsen and Wakeley 2001) approach to estimate model parameters (including in particular _N_a and the divergence time). Both of these methods are well suited for species that have diverged relatively recently, but less so for species such as humans and chimpanzees that share very little ancestral polymorphism. In any case, at the present they cannot be used to estimate the human-chimp _N_a because of a lack of chimpanzee polymorphism data.

Large estimates of the human-chimpanzee _N_a are concordant with a study of Mhc that used the high levels of diversity there to estimate a long-term (i.e., over the past 10-20 million years) average effective population size of ∼105 (Takahata 1991). However, it should be noted that the large estimates of _N_a are difficult to reconcile with human-chimpanzee divergence times estimated from molecular data. Most recent estimates of the divergence time fall between 4 and 6 million years ago (MYA; e.g., Horai et al. 1995; Easteal and Herbert 1997; Kumar and Hedges 1998; Kumar and Subramanian 2002). These estimates are for a single human and a single chimpanzee sequence; they reflect both divergence between species and segregation in the ancestral population (x and y in Figure 1). If _N_a is large, then x must be large; if x is large and x + y is fixed, then y must be small. Suppose, for example, that (x + y) = 5.5 MY, as estimated by Kumar and Hedges (1998), and that _N_a = 52,000-96,000 (cf. Chen and Li 2001). Then, if the average generation time is 20 years, y would be 1.7-3.4 MY. If the average generation time were 25 years (see discussion), then y = 0.7-2.9 MY. These estimates postdate many well-documented australopithecine fossils and are therefore dubious estimates of the time since speciation.

One possible explanation is that _N_a has been consistently overestimated. Indeed, both the trichotomy method and the two-species maximum-likelihood method have been criticized (Hudson 1992; Takahata et al. 1995; Satta et al. 2000; Takahata and Satta 2002), and it is not clear how accurate the estimates are. For one, the trichotomy method assumes one already knows the time between the two speciation events, but this is generally not known a priori. Furthermore, it assumes one can correctly infer the phylogeny for any particular locus. Errors in phylogenetic inference arise when analyzing actual data, and the whole endeavor does not make sense in the presence of intragenic recombination (cf. Nordborg 2001). With three closely related species, the true phylogenies for nearby sites are not always the same. So, when loci are analyzed, they are often an amalgamation of sites with different phylogenies. Trying to infer a single phylogeny from such data is clearly not appropriate (Satta et al. 2000). Even if the problems of recombination and phylogenetic reconstruction were ignored, the trichotomy method does not make an efficient use of the available information. Data from each locus are summarized into a single binary variable, depending on whether the inferred locus phylogeny agrees or disagrees with the species tree.

The maximum-likelihood methods are more rigorous and efficient, but they too have two main drawbacks. As with the trichotomy method and the method of Nielsen and Wakeley (2001), intragenic recombination is ignored. In addition, the methods of Takahata et al. (1995) are highly sensitive to variation in μ among loci. The problem is that variation in μ leads to greater variance in observed divergences across loci, which inflates the estimate of _N_a. Although variation in μ can be explicitly modeled in the analyses (Yang 1997; Takahata and Satta 2002), it is difficult to know whether a particular model of rate variation is appropriate, especially given data from only two species (but see Yang 2002).

In this article, I present a new method for estimating _N_a. The method requires orthologous sequence data from three or more species (two plus one or more outgroups) and jointly estimates _N_a and species divergence times using a summary maximum-likelihood approach. Unlike the previous maximum-likelihood methods, intragenic recombination is incorporated, and likelihoods are estimated from coalescent simulations. Also, the model can account for variation in mutation rates across loci. Although the method can be used on data from any taxonomic group (as long as at least one outgroup species is available), I concentrate here on analyzing human and great ape sequence data. The maximum-likelihood framework allows for the estimation of confidence intervals; this, along with a more realistic model, allows us to assess with greater rigor whether the human-chimp _N_a was much larger than the current human _N_e, as previous studies have claimed. I apply the method to the orthologous data from 53 intergenic regions reported in Chen and Li (2001) and generate both point estimates and approximate confidence intervals for _N_a and species divergence times. Intergenic sequence data are preferable to data from genes (even synonymous sites or introns) because they are less likely to have been affected by natural selection at closely linked sites.

METHODS

I describe the model in which there are orthologous sequence data from four species. The case in which there are three species (or five or more) follows analogously.

Suppose we have four species with a known phylogeny. We assume a null model of speciation (cf. Hey 1994; Satta et al. 2000) whereby a panmictic ancestral population splits at a fixed time into two panmictic descendant populations, with no subsequent migration between the descendant populations. The scaled mutation and recombination parameters are θ (= 4_N_hμ) and ρ (= 4_N_h_r_), where μ is the mutation rate per site per generation and r is the recombination rate per site per generation. Label the species H, C, G, and O, with current diploid effective population sizes _N_h, _N_c, _N_g, and _N_o, respectively. Suppose H and C split at time _T_1, H and G at time _T_2, and H and O at time _T_3, with _T_1 < _T_2 < _T_3 (see Figure 2). _T_1, _T_2, and _T_3 are scaled in units of 4_N_h generations. From time _T_1 until _T_3 both the H-C ancestral population and the H-C-G ancestral population have effective size _N_a, while the H-C-G-O ancestral population has effective size _N_o. The results are similar if the latter ancestral population has effective size _N_a (results not shown). Finally, define _n_s as the number of contiguous nucleotide sites in the simulation. There are a total of 11 parameters in the model, listed in Table 1. We assume that the generation time and mutation rate do not vary across species. These assumptions are reasonable when the species considered have similar life-history traits and are closely related. There are three possible (unrooted) gene trees, with H and C, H and G, or C and G as sibling species.

—Model of species history considered. There are four species with known branching order, labeled H, C, G, and O. The three speciation events (starting from the present) occur at times T1, T2, and T3, where time is scaled in units of 4Nh generations. See methods for more details.

Figure 2.

—Model of species history considered. There are four species with known branching order, labeled H, C, G, and O. The three speciation events (starting from the present) occur at times _T_1, _T_2, and _T_3, where time is scaled in units of 4_N_h generations. See methods for more details.

Now, suppose we have a single orthologous sequence from each species. For each site that is “segregating” (i.e., is not identical across all species), we can infer that one or more mutations happened on certain branches in the unrooted tree. We do this assuming the fewest number of mutations that can explain the data. For example, if the H, C, G, and O sequences have A, G, A, and A, respectively, then we infer that a mutation happened on the branch leading to species C. All biallelic segregating sites fall into seven categories, resulting from mutations on seven different branches of an unrooted tree. These seven branches have H, C, G, O, HC, HG, or CG as descendants and are referred to as the seven types of branches. Note that for any particular gene tree there are only five possible branches, four external ones (with a single species as a descendant) and one internal one. Any site may have one of three possible gene trees, leading to seven possible branches over all possible gene trees (the four external branches that are common to each gene tree and one internal branch from each gene tree). For sites with three segregating nucleotides, we assume that the two species with the same base share the ancestral state and that the two other bases each arose from a single mutation. The Chen and Li (2001) data do not contain any sites where each species has a different nucleotide, so we do not consider this possibility.

Parameter	Definition
θ	= 4_N_hμ, where μ is the mutation rate per site per generation.
ρ	= 4_N_h_r_, where r is the recombination rate per site per generation.
_N_h	The current effective population size of humans.
_N_c	The current effective population size of chimpanzees.
_N_g	The current effective population size of gorillas.
_N_o	The current effective population size of orangutans.
_N_a	The ancestral population size for the human-chimp and human-gorilla ancestors. _T_1 The time (in units of 4_N_h generations) of the human-chimp split.
_T_2	The time (in units of 4_N_h generations) of the human-gorilla split.
_T_3	The time (in units of 4_N_h generations) of the human-orang split.
_n_s	Number of contiguous sites in a locus.

Parameter	Definition
θ	= 4_N_hμ, where μ is the mutation rate per site per generation.
ρ	= 4_N_h_r_, where r is the recombination rate per site per generation.
_N_h	The current effective population size of humans.
_N_c	The current effective population size of chimpanzees.
_N_g	The current effective population size of gorillas.
_N_o	The current effective population size of orangutans.
_N_a	The ancestral population size for the human-chimp and human-gorilla ancestors. _T_1 The time (in units of 4_N_h generations) of the human-chimp split.
_T_2	The time (in units of 4_N_h generations) of the human-gorilla split.
_T_3	The time (in units of 4_N_h generations) of the human-orang split.
_n_s	Number of contiguous sites in a locus.

Parameter	Definition
θ	= 4_N_hμ, where μ is the mutation rate per site per generation.
ρ	= 4_N_h_r_, where r is the recombination rate per site per generation.
_N_h	The current effective population size of humans.
_N_c	The current effective population size of chimpanzees.
_N_g	The current effective population size of gorillas.
_N_o	The current effective population size of orangutans.
_N_a	The ancestral population size for the human-chimp and human-gorilla ancestors. _T_1 The time (in units of 4_N_h generations) of the human-chimp split.
_T_2	The time (in units of 4_N_h generations) of the human-gorilla split.
_T_3	The time (in units of 4_N_h generations) of the human-orang split.
_n_s	Number of contiguous sites in a locus.

Parameter	Definition
θ	= 4_N_hμ, where μ is the mutation rate per site per generation.
ρ	= 4_N_h_r_, where r is the recombination rate per site per generation.
_N_h	The current effective population size of humans.
_N_c	The current effective population size of chimpanzees.
_N_g	The current effective population size of gorillas.
_N_o	The current effective population size of orangutans.
_N_a	The ancestral population size for the human-chimp and human-gorilla ancestors. _T_1 The time (in units of 4_N_h generations) of the human-chimp split.
_T_2	The time (in units of 4_N_h generations) of the human-gorilla split.
_T_3	The time (in units of 4_N_h generations) of the human-orang split.
_n_s	Number of contiguous sites in a locus.

The sequence data for the 53 intergenic regions reported in Chen and Li (2001) were kindly provided by the authors and aligned by eye. All indels were excluded. From the remaining sequence, we count the inferred number of mutations that happened on each of the seven branch types. (No distinction is made between transitions and transversions.) For a given region, denote these numbers of inferred mutations as b = (_b_1, _b_2,... _b_7). For given values of M = (θ, ρ, _N_h, _N_c, _N_g, _N_o, _N_a, _T_1, _T_2, _T_3, _n_s) we estimate the likelihood of observing the vector b using Monte Carlo simulations.

The population model in Figure 2 is simulated using a modification of the coalescent with recombination (Hudson 1983). For each site in each replicate, we classify all the branches in the genealogy as one of the seven types of branches (i.e., with descendants H, C, G, O, HC, HG, or CG in the unrooted tree). Since mutations happen at rate μ per site per generation, we can tabulate from the total branch lengths the expected number of mutations that lie on each of the types of branches. For the _j_th replicate, denote these expected values as Bj = (_B_1, _B_2,... _B_7). The probability of observing b given Bj is then

Pr(b∣Bj)=∏i=17e−BiBibibi!.

To estimate the likelihood of M, we just average this probability over many replicates,

lik(M∣b)∝Pr(b∣M)≈1x∑j=1xPr(b∣Bj),

where x is large. The Chen and Li (2001) data consist of two sequences from each species (a single diploid sequence), while the method requires a single sequence. Intraspecies polymorphism may add to the bi values, depending on which chromosome is considered. In these cases, we take the average likelihood over the different possible bi values.

The above equation describes how to estimate the likelihood of M for a single locus. Define **M**′ as a vector containing the first 10 values of M. Estimation of the likelihood of M′ over multiple loci is straightforward. Given a collection of k loci, define {Mi}i=1k as a collection of corresponding M vectors, where the Mi are identical except for _n_s (which is calculated for each locus). Define bi as the vector b for the _i_th locus. Then, since unlinked loci are evolutionarily independent, we can estimate the likelihood of M′ over multiple loci simply by taking the product of the individual lik(M|b) estimates:

We have taken the approach of summarizing the data by b before performing maximum likelihood. Summary-likelihood methods have been quite useful in other situations (e.g., Weiss and Von Haeseler 1998; Wall 2000; Fearnhead and Donnelly 2002) and are generally computationally much simpler than full-likelihood methods. For the case of estimating _N_a, a full-likelihood approach including intragenic recombination does not look to be computationally feasible at this time.

Of the 11 parameters that make up the model M, only 9 can freely vary. _n_s is fixed from the actual data, while _N_h is relevant only indirectly; it turns out that the simulations use only the ratios of the effective population sizes (_i.e., N_a/_N_h, _N_c/_N_h, etc.), not their actual values. The actual _N_h comes into play when interpreting the simulation results (e.g., translating from scaled time to actual time). Ideally, one would like to let θ, ρ, _N_c/_N_h, _N_g/_N_h, _N_o/_N_h, _N_a/_N_h, _T_1, _T_2, and _T_3 vary freely and determine which combination of parameter values maximizes the likelihood of observing the actual data. However, this is computationally prohibitive, so we fix those values for which we have prior information and let the others vary: _N_a/_N_h, _T_1, _T_2, and _T_3 vary freely (at increments of 1.0, 0.25, 0.5, and 1.0, respectively), and we consider the following four schemes for the other parameters:

Model 1: θ=ρ= 0.001/bp; _N_c = _N_g = _N_o = 3_N_h.
Model 2: θ_n_s for each locus is proportional to the total inferred number of mutations (across all four species), and the average θ/bp (over all loci) is 0.001; ρ= 0.001/bp; _N_c = _N_g = _N_o = 3_N_h.
Model 3: Same as model 2, but all CpG sites were excluded, and the average θ/bp (over all loci) is 0.00075.
Model 4: θ= 0.001/bp; ρ= 0.002/bp; _N_c = _N_g = _N_o = 3_N_h.
Model 5: θ=ρ= 0.001/bp; _N_c = _N_g = _N_o = 6_N_h.

θ can be easily estimated from human sequence polymorphism data (e.g., Watterson 1975). Putatively neutral sites from recent resequencing studies of human variation suggest that θ= 0.001/bp is a good ballpark figure for the autosomes and that roughly one-fourth of all segregating mutations occur at CpG sites (e.g., Nachman and Crowell 2000; Przeworski et al. 2000; Templeton et al. 2000; Ebersberger et al. 2001; Frisse et al. 2001). Model 2 tests how sensitive the results are to variation in θ across loci by taking the same average θ as model 1, but assuming that θ_n_s for each locus is proportional to the observed number of inferred mutations. (This is equivalent to estimating θ using Watterson 1975, assuming an average of θ= 0.001/bp.) The genome-wide average rate of crossing over in humans is r = 1.3 × 10-8/bp (Yu et al. 2001). If _N_h ≈ 104, then ρ ≈ 5.2 × 10-4/bp. We take slightly larger ρ values to account for the unknown contribution of gene conversion to overall rates of recombination (see, e.g., Frisse et al. 2001; Przeworski and Wall 2001). Finally, levels of nonhuman great ape diversity seem to be substantially higher than human diversity levels (Deinard and Kidd 1999; Kaessmann et al. 1999, 2001), but not enough data have been gathered to accurately estimate _N_c/_N_h, _N_g/_N_h, or _N_o/_N_h. We have chosen values that might plausibly reflect the total species diversity in chimps, gorillas, and orangs. Models 3-5 were chosen to explore the sensitivity of the results to the presence of hypermutable CpG sites, assumptions about the recombination rate, and assumptions about great ape population sizes, respectively.

In addition to using the parameter combination that maximizes the likelihood as a point estimate, it would be useful to determine how much confidence we should place in the estimated values. To get a sense of how the likelihood varies as a function of _T_1, for example, I calculate the (approximate) profile likelihood:

likT1(T)=sup∏i=1klik(Mi∣bi,T1=T).

Approximate 95% confidence intervals are found by using the standard χ2 approximation for the likelihood-ratio statistic 2 ln(_L_0/_L_1) (where _L_0 is the maximum likelihood and _L_1 is the profile likelihood at an alternative point). The likelihood functions calculated are not true profile likelihoods, since some of the nuisance parameters are not allowed to vary freely. So, it is not clear whether the standard χ2 approximation is appropriate. Approximate profile likelihoods are calculated for _N_a/_N_h and _T_1, and linear interpolation is used to estimate the log-likelihood for parameter values that are not directly estimated by simulation.

To verify the accuracy of the method, I run coalescent simulations with known _T_1, _T_2, _T_3, and _N_a/_N_h values; then, I use the new method on the simulated data to estimate parameters and to compare the estimated values with the actual ones. These simulations modeled 50 loci of 500 bp each, with θ =ρ= 0.001/bp, _N_c = _N_g = _N_o = 3_N_h, _T_1 = 5.0, _T_2 = 8.0, _T_3 = 14.0, and _N_a/_N_h = 5.0. The parameter values were chosen to roughly match both the Chen and Li (2001) data and our a priori knowledge about species divergence times and ancestral population sizes. Five replicates were run; I analyzed each one under the assumptions of model 1 (see above). Note that this assumes an idealized situation, where the nuisance parameters are known exactly.

All programs were written in C and are available from the author on request. A total of 5 × 104 replicates were run for each model and parameter combination. To give a sense of the computational efficiency, the total simulations took 5 months to run on a pair of 1.7 GHz Pentium 4 processors.

RESULTS

The maximum-likelihood estimates for _T_1, _T_2, _T_3, and _N_a/_N_h are presented in Table 2. The estimates across the five models are broadly similar; all of them estimate an ancestral population size five to six times larger than the current human effective population size, in keeping with previous studies (Takahata and Satta 1997; Chen and Li 2001; Takahata 2001). The estimates of _T_1, the human-chimpanzee divergence time, are also roughly in line with expectations. If we assume that g = 25 years and _N_h = 104 (or that g = 20 years and _N_h = 12,500), then these estimates range from 3.5 to 4.0 MYA. In contrast, the paleontological record suggests that uniquely human ancestors were around at least 4-4.5 MYA (White et al. 1994; Leakey et al. 1995) and perhaps much earlier (Haile-Selassie 2001; Brunet et al. 2002). This disparity can easily be reconciled if both the average generation time and the current human effective population size are on the larger side of previous estimates (e.g., g = 25 years and _N_h = 15,000). Given our uncertainty in parameter estimates, these values are quite plausible. If instead we were to assume the generation time and species divergence time were known, then we could use the results to estimate the current human effective population size. If _T_1 = 6 MYA and g = 25 years, then the point estimates of _N_h range from 15,000 to 17,100. The other species divergence times are also on the recent side; assuming once again that g = 25 years and _N_h = 10,000, the estimated human-gorilla divergence time ranges from 5.0 to 5.5 MYA, while the estimated human-orangutan divergence time ranges from 12 to 13 MYA.

TABLE 2

Parameter estimates and confidence intervals for the Chen and Li (2001) data

MLEa	Intervalsb
_T_1	_T_2	_T_3	_N_a/_N_h	_T_1	_N_a/_N_h
Model 1	3.5	5.0	12.0	6.0	2.8-4.3	4.3-7.0
Model 2	4.0	5.5	13.0	5.0	3.0-4.5	3.8-6.5
Model 3	4.0	5.5	13.0	5.0	2.9-4.8	3.5-6.5
Model 4	3.5	5.0	12.0	6.0	2.7-4.2	4.3-7.1
Model 5	3.5	5.0	12.0	6.0	2.8-4.3	4.3-7.0

MLEa	Intervalsb
_T_1	_T_2	_T_3	_N_a/_N_h	_T_1	_N_a/_N_h
Model 1	3.5	5.0	12.0	6.0	2.8-4.3	4.3-7.0
Model 2	4.0	5.5	13.0	5.0	3.0-4.5	3.8-6.5
Model 3	4.0	5.5	13.0	5.0	2.9-4.8	3.5-6.5
Model 4	3.5	5.0	12.0	6.0	2.7-4.2	4.3-7.1
Model 5	3.5	5.0	12.0	6.0	2.8-4.3	4.3-7.0

MLE, maximum-likelihood estimate. Parameter values that maximize the likelihood of the data are shown.

Approximate 95% confidence intervals. See methods for details.

TABLE 2

Parameter estimates and confidence intervals for the Chen and Li (2001) data

MLEa	Intervalsb
_T_1	_T_2	_T_3	_N_a/_N_h	_T_1	_N_a/_N_h
Model 1	3.5	5.0	12.0	6.0	2.8-4.3	4.3-7.0
Model 2	4.0	5.5	13.0	5.0	3.0-4.5	3.8-6.5
Model 3	4.0	5.5	13.0	5.0	2.9-4.8	3.5-6.5
Model 4	3.5	5.0	12.0	6.0	2.7-4.2	4.3-7.1
Model 5	3.5	5.0	12.0	6.0	2.8-4.3	4.3-7.0

MLEa	Intervalsb
_T_1	_T_2	_T_3	_N_a/_N_h	_T_1	_N_a/_N_h
Model 1	3.5	5.0	12.0	6.0	2.8-4.3	4.3-7.0
Model 2	4.0	5.5	13.0	5.0	3.0-4.5	3.8-6.5
Model 3	4.0	5.5	13.0	5.0	2.9-4.8	3.5-6.5
Model 4	3.5	5.0	12.0	6.0	2.7-4.2	4.3-7.1
Model 5	3.5	5.0	12.0	6.0	2.8-4.3	4.3-7.0

MLE, maximum-likelihood estimate. Parameter values that maximize the likelihood of the data are shown.

Approximate 95% confidence intervals. See methods for details.

To assess how much confidence we should place in the point estimates, I calculated approximate profile-likelihood curves and estimated ∼95% confidence intervals. The intervals for _T_1 and _N_a/_N_h are listed in Table 2. For both _T_1 and _N_a/_N_h the intervals are quite narrow, which suggests that the estimates are precise. All four models exclude _N_a/_N_h ≤ 3.5 and _N_a/_N_h ≥ 7.1 from the approximate confidence intervals. For _T_1, the lower boundaries range from 2.7 to 3.0 and the upper boundaries range from 4.2 to 4.8. If as before we take g = 25 years and N = 104, the upper boundaries range from 4.2 to 4.8 MYA; these times are still more recent than the paleontological record would suggest. As mentioned above, a small increase in _N_h is sufficient to reconcile the time estimates with the paleontological record. Figure 3 shows the profile-likelihood functions of _N_a/_N_h and _T_1 for model 2. The curves quickly become quite steep, suggesting that the range of plausible values is not that large. So, even if the approximate confidence intervals were nonconservative, it is likely that conservative ones would not differ much from the intervals listed in Table 2. The corresponding likelihood curves for the other models are qualitatively similar to those in Figure 3.

To verify the accuracy of the method, I applied it on five simulated data sets with known parameter values (see methods). Each one had actual values of _T_1 = 5.0, _T_2 = 8.0, _T_3 = 14.0, and _N_a/_N_h = 5. The estimated parameter values, along with the confidence intervals for _T_1 and _N_a/_N_h, are given in Table 3. The means of the parameter estimates are 5.0, 8.1, 14.0, and 4.8 for _T_1, _T_2, _T_3, and _N_a/_N_h, respectively, which suggests that the method has no or low bias. In addition, the confidence intervals for _T_1 and _N_a/_N_h contain the true value all five times. Due to the large computational burden, it was not possible to run enough replicates to accurately estimate the coverage properties of the confidence intervals.

Comparing the different rows in Table 2 can give us some idea of how sensitive the results are to assumptions about the nuisance parameters (i.e., θ, ρ, _N_c/_N_h, _N_g/_N_h, and _N_o/_N_h). Since the results from all of the models are very similar, it appears that the particular assumptions made do not appear to be very important. In particular, unlike the two-species maximum-likelihood method of Takahata et al. (1995), the results do not seem to be very sensitive to variation in mutation rates across loci. This may be due to the information about locus-specific mutation rates contained in the outgroup species or because the actual data have very little variation in mutation rates across loci.

DISCUSSION

Estimating ancestral population sizes has been an active research area for several years. The work presented here improves on previous efforts by explicitly incorporating intragenic recombination (see also Satta et al. 2000) and by efficiently utilizing data from outgroup species. The estimates of the human-chimpanzee _N_a are five to six times larger than the current human effective population size (see Table 2). Although most previous studies came to similar conclusions, it was not clear how much confidence to place in these estimates because of unrealistic assumptions, such as no recombination or no variation in mutation rates (Takahata and Satta 2002). The narrow confidence intervals and simulation results presented here (Tables 2 and 3) provide additional evidence that ancestral population sizes were substantially larger than the current human effective population size.

—Approximate profile-likelihood curves under model 2. (A) The curve for Na/Nh; (B) the curve for T1. In both cases, the y-axis is the maximal log-likelihood given a particular value of the parameter (see methods for details). The shaded horizontal line shows the cutoff for the ∼95% confidence intervals.

Figure 3.

—Approximate profile-likelihood curves under model 2. (A) The curve for _N_a/_N_h; (B) the curve for _T_1. In both cases, the _y_-axis is the maximal log-likelihood given a particular value of the parameter (see methods for details). The shaded horizontal line shows the cutoff for the ∼95% confidence intervals.

One recent study that came to a different conclusion (namely, that _N_a is roughly as small as _N_h) incorporated variation in mutation rates across loci but not intragenic recombination (Yang 2002). Recombination tends to decrease the variance in estimated branch lengths across loci; because of this, models that assume no recombination tend to underestimate _N_a (Takahata and Satta 2002). Further work must be done to quantify how model assumptions (both here and in other studies) affect estimates of the ancestral population size.

Although this application focuses on the human-chimpanzee _N_a, the same method can be used to estimate _N_a from other taxa, as long as there are orthologous sequence data from three or more species (including at least one outgroup) at multiple unlinked loci. Below, I discuss issues that might affect the general applicability of the method.

Likelihood model: One possible criticism of the model is that the relative locations of the segregating mutations are ignored. However, this is not likely to be very important, since the number and the pattern of segregating mutations are far more informative. Incorporating the segregating site locations may lead to narrower confidence intervals and more accurate estimation of the likelihood function, but excluding them is not expected to bias the results in either direction. Given the results, it does not seem to be worth the substantial computational burden to consider the full-likelihood model.

Mutational model: The mutational model that was adopted makes no distinction between transitions and transversions and assumes the mutation rate at each site in a locus is the same. However, some sites have higher mutation rates than others (Nachman and Crowell 2000; Templeton et al. 2000), which would increase the number of sites experiencing multiple mutations. Those multiply hit sites with fewer than three segregating nucleotides would then be misclassified by the model. In primates, the transition rate away from CpG sites is thought to be elevated by more than an order of magnitude due to methylated-cytosine mutagenesis (e.g., Jones et al. 1992; Giannelli et al. 1999). To test whether homoplasies from multiply hit CpG sites affected the parameter estimates, I reran model 2 excluding all CpG sites (i.e., model 3). Both the maximum-likelihood estimate and the shape of the profile-likelihood curves are almost identical (Table 2; results not shown), suggesting that the results presented here are relatively insensitive to the effects of multiple mutations at CpG sites. Calculations suggest that other proposed sites with elevated mutation rates, such as mononucleotide runs or DNA polymerase α-arrest sites (Krawczak and Cooper 1991; Templeton et al. 2000), are too rare to appreciably increase the expected number of homoplasies (results not shown). For studies of species with larger levels of divergence, the effects of homoplasies may be a more serious concern. Future work will concentrate on implementing a finite-site mutation model in the maximum-likelihood scheme described here.

TABLE 3

Parameter estimates and confidence intervals for simulated data

MLEa	Intervalsb
_T_1	_T_2	_T_3	_N_a/_N_h	_T_1	_N_a/_N_h
True value:c	5.0	8.0	14.0	5.0
Trial 1	5.0	8.0	15.0	4.0	4.2-5.7	3.2-5.4
Trial 2	4.5	8.0	14.0	5.0	3.5-5.5	3.5-6.4
Trial 3	4.5	7.5	14.0	5.0	3.4-5.2	3.9-6.7
Trial 4	5.0	8.0	14.0	5.0	4.3-6.1	3.5-6.2
Trial 5	6.0	9.0	13.0	5.0	4.8-6.7	3.8-6.9

MLEa	Intervalsb
_T_1	_T_2	_T_3	_N_a/_N_h	_T_1	_N_a/_N_h
True value:c	5.0	8.0	14.0	5.0
Trial 1	5.0	8.0	15.0	4.0	4.2-5.7	3.2-5.4
Trial 2	4.5	8.0	14.0	5.0	3.5-5.5	3.5-6.4
Trial 3	4.5	7.5	14.0	5.0	3.4-5.2	3.9-6.7
Trial 4	5.0	8.0	14.0	5.0	4.3-6.1	3.5-6.2
Trial 5	6.0	9.0	13.0	5.0	4.8-6.7	3.8-6.9

All trials were analyzed under model 1. See methods for full details.

Parameter values that maximize the likelihood of the data.

Approximate 95% confidence intervals. See methods for details.

Parameter values used in the simulations.

TABLE 3

Parameter estimates and confidence intervals for simulated data

MLEa	Intervalsb
_T_1	_T_2	_T_3	_N_a/_N_h	_T_1	_N_a/_N_h
True value:c	5.0	8.0	14.0	5.0
Trial 1	5.0	8.0	15.0	4.0	4.2-5.7	3.2-5.4
Trial 2	4.5	8.0	14.0	5.0	3.5-5.5	3.5-6.4
Trial 3	4.5	7.5	14.0	5.0	3.4-5.2	3.9-6.7
Trial 4	5.0	8.0	14.0	5.0	4.3-6.1	3.5-6.2
Trial 5	6.0	9.0	13.0	5.0	4.8-6.7	3.8-6.9

MLEa	Intervalsb
_T_1	_T_2	_T_3	_N_a/_N_h	_T_1	_N_a/_N_h
True value:c	5.0	8.0	14.0	5.0
Trial 1	5.0	8.0	15.0	4.0	4.2-5.7	3.2-5.4
Trial 2	4.5	8.0	14.0	5.0	3.5-5.5	3.5-6.4
Trial 3	4.5	7.5	14.0	5.0	3.4-5.2	3.9-6.7
Trial 4	5.0	8.0	14.0	5.0	4.3-6.1	3.5-6.2
Trial 5	6.0	9.0	13.0	5.0	4.8-6.7	3.8-6.9

All trials were analyzed under model 1. See methods for full details.

Parameter values that maximize the likelihood of the data.

Approximate 95% confidence intervals. See methods for details.

Parameter values used in the simulations.

Molecular clock: The method described here assumes that the rate of mutation per unit time is the same on all branches. This is likely a reasonable assumption for the data considered here. Noncoding regions are less likely to be affected by natural selection than are the coding regions analyzed in other studies. Also, there is no reason to assume substantial differences in mutation rates (per generation) between humans and great apes. The data on generation times are sparse; Eyre-Walker and Keightley (1999) cite a time of g = 23 years in chimpanzees, while estimates of current human generation times are ∼30 years (Siguroardóttir et al. 2000; Tremblay and Vézina 2000). Average human generation times (over the last several million years) may be substantially smaller. Indeed, the Chen and Li (2001) data show no evidence for more mutations on the chimpanzee branch than on the human branch (Chen and Li 2001; results not shown), suggesting that the long-term average generation times for humans and chimpanzees are quite similar.

For other taxa, the clock assumption may not be appropriate. It would be straightforward to generalize the model to have different rates of evolution on different branches and to estimate these as well as species divergence times and ancestral population sizes. More sequence data and more computational time would be required to accurately estimate the additional parameters, and the method (with variable rates) may not be feasible with more than three species.

Nuisance parameters: Although the goal of this article is to estimate ancestral population sizes and species divergence times, the model presented here also includes other parameters, such as θ, ρ, or _N_c/_N_h. The reason ρ is included is not to estimate the recombination rate from divergence data (which would be somewhat challenging). Rather, the values of parameters like ρ affect the likelihoods, so some assumptions must be made about them. In the interest of computational tractability, I have chosen plausible values for θ, ρ, _N_c/_N_h, _N_g/_N_h, and _N_o/_N_h. Comparing model 1 with models 4 and 5 suggests that the choice of particular values for these other parameters may not affect the estimates of the parameters of interest. Further simulations show that this is true for a wider range of values (ρ= 0.0005-0.003/bp; _N_h ≤ _N_c, _N_g, _N_o ≤ 6_N_h), although it should be pointed out that assuming no recombination (as previous methods do) leads to a likelihood of 0, due to the presence of several incompatibilities within loci. So, the estimates of _N_a/_N_h, _T_1, _T_2, and _T_3 are robust to the assumptions made about the other parameters.

Speciation model: Estimates of _N_a/_N_h and _T_1 provide information about the mean and the variance of the distribution of coalescent times of a single human and a single chimpanzee sequence. Under the simple speciation model considered here, greater variances in coalescent times must be the result of larger ancestral population sizes. Some researchers have suggested that there is often gene flow between “incipient species” (e.g., Wu 2001). A model of limited gene flow (prior to strict isolation) will lead to a greater variance in coalescent times. In particular, if there were gene flow after the initial divergence of the human and chimpanzee lines, then the ancestral population size (before the initial divergence) would be overestimated. Further work must be done to develop methods that can distinguish between limited gene flow and large ancestral population sizes using orthologous sequence data.

Acknowledgement

I thank M. Hare, M. Przeworski, N. Takahata, J. Wakeley, and an anonymous reviewer for comments on an earlier version of this manuscript. J.D.W. was supported in part by a National Science Foundation Postdoctoral Fellowship in Bioinformatics.

Footnotes

Communicating editor: N. Takahata

LITERATURE CITED

Brunet

Guy

Pilbeam

Mackaye

H T

Likius

et al. ,

2002

A new hominid from the Upper Miocene of Chad, Central Africa

Nature

418

145

–

151

Caballero

1994

Developments in the prediction of effective population size

Heredity

657

–

679

Chen

F-C

W-H

2001

Genomic divergences between humans and other hominoids and the effective population size of the common ancestor of humans and chimpanzees

Am. J. Hum. Genet.

444

–

456

Deinard

Kidd

1999

Evolution of a HOXB6 intergenic region within the great apes and humans

J. Hum. Evol.

687

–

703

Easteal

Herbert

1997

Molecular evidence from the nuclear genome for the time frame of human evolution

J. Mol. Evol.

S121

–

S132

Ebersberger

Metzler

Schwarz

Pääbo

2001

Genome-wide comparison of DNA sequences between humans and chimpanzees

Am. J. Hum. Genet.

1490

–

1497

Eyre-Walker

Keightley

P D

1999

High genomic deleterious mutation rates in hominids

Nature

397

344

–

347

Fearnhead

Donnelly

2002

Approximate likelihood methods for estimating local recombination rates. J. R. Stat. Soc

657

–

680

Frisse

Hudson

R R

Bartoszewicz

Wall

J D

Donfack

et al. ,

2001

Gene conversion and different population histories may explain the contrast between polymorphism and linkage disequilibrium levels

Am. J. Hum. Genet.

831

–

843

Gabunia

Vekua

1995

A Plio-Pleistocene hominid from Dmanisi, East Georgia, Caucasus

Nature

373

509

–

512

Giannelli

Anagnostopolous

Green

P M

1999

Mutation rates in humans

II. Sporadic mutation-specific rates and rate of detrimental human mutations inferred from hemophilia B. Am. J. Hum. Genet.

1580

–

1587

Haile-Selassie

2001

Late Miocene hominids from the Middle Awash, Ethiopia

Nature

412

178

–

181

Harada

Kusakabe

Yamazaki

Mukai

1993

Spontaneous mutation rates in null and band-morph mutations of enzyme loci in Drosophila melanogaster

Jpn. J. Genet.

605

–

616

Harding

R M

Fullerton

S M

Griffiths

R C

Bond

Cox

M J

et al. ,

1997

Archaic African and Asian lineages in the genetic ancestry of modern humans

Am. J. Hum. Genet.

772

–

789

Hey

1994

Bridging phylogenetics and population genetics with gene tree models

, pp.

435

–

449

Molecular Ecology and Evolution: Approaches and Applications

, edited by

Schierwater

Streit

Wagner

G P

Desalle

Birkhäuser Verlag

Basel, Switzerland

Horai

Hayasaka

Kondo

Tsugane

Takahata

1995

Recent African origin of modern humans revealed by complete sequences of hominoid mitochondrial DNAs. Proc. Natl. Acad. Sci

USA

532

–

536

Hudson

R R

1983

Properties of a neutral allele model with intragenic recombination

Theor. Popul. Biol.

183

–

201

Hudson

R R

1992

Gene trees, species trees and the segregation of ancestral alleles

Genetics

131

509

–

512

Jones

P A

Rideout

W M

Shen

J C

Spruck

C H

Tsai

Y C

1992

Methylation, mutation and cancer

Bioessays

–

Kaessmann

Wiebe

Pääbo

1999

Extensive nuclear DNA sequence diversity among chimpanzees

Science

286

1159

–

1161

Kaessmann

Wiebe

Weiss

Pääbo

2001

Great ape DNA sequences reveal a reduced diversity and an expansion in humans

Nat. Genet.

155

–

156

Kimura

1983

The Neutral Theory of Molecular Evolution.

Cambridge University Press

Cambridge, UK

Krawczak

Cooper

D N

1991

Gene deletions causing human genetic disease: mechanisms of mutagenesis and the role of the local DNA sequence environment

Hum. Genet.

425

–

441

Kreitman

1983

Nucleotide polymorphism at the alcohol dehydrogenase locus of Drosophila melanogaster

Nature

304

412

–

417

Kumar

Hedges

1998

A molecular timescale for vertebrate evolution

Nature

392

917

–

920

Kumar

Subramanian

2002

Mutation rates in mammalian genomes. Proc. Natl. Acad. Sci

USA

803

–

808

Leakey

M G

Feibel

C S

Mcdougall

Walker

1995

New four-million-year-old hominid species from Kanapoi and Allia Bay, Kenya

Nature

376

565

–

571

Nachman

M W

Crowell

S L

2000

Estimate of the mutation rate per nucleotide in humans

Genetics

156

297

–

304

Nei

1987

Molecular Evolutionary Genetics

Columbia University Press

New York

Nielsen

Wakeley

2001

Distinguishing migration from isolation: a Markov chain Monte Carlo approach

Genetics

158

885

–

896

Nordborg

2001

Coalescent theory

, pp.

179

–

212

Handbook of Statistical Genetics

, edited by

Balding

Bishop

Cannings

Wiley

Chichester, UK

Przeworski

Wall

J D

2001

Why is there so little intragenic linkage disequilibrium in humans? Genet

Res.

143

–

151

Przeworski

Hudson

R R

Di Rienzo

2000

Adjusting the focus on human variation

Trends Genet.

296

–

302

Ruvolo

1997

Molecular phylogeny of the hominoids: inferences from multiple independent DNA sequence data sets

Mol. Biol. Evol.

248

–

265

Satta

O’hUigin

Takahata

Klein

1993

The synonymous substitution rate of the major histocompatibility complex loci in primates. Proc. Natl. Acad. Sci

USA

7480

–

7484

Satta

Klein

Takahata

2000

DNA archives and our nearest relative: the trichotomy problem revisited

Mol. Phylogenet. Evol.

259

–

275

Siguroardóttir

Helgason

Gulcher

J R

Stefansson

Donnelly

2000

The mutation rate in the human mtDNA control region

Am. J. Hum. Genet.

1599

–

1609

Swisher

C C

Curtis

G H

Jacob

Getty

A G

Suprijo

et al. ,

1994

Age of the earliest known hominids in Java, Indonesia

Science

263

1118

–

1121

Takahata

1986

An attempt to estimate the effective size of the ancestral species common to two extant species from which homologous genes are sequenced

Genet. Res.

187

–

190

Takahata

1991

Trans-species polymorphism of HLA molecules, founder principle, and human evolution

, pp.

–

Molecular Evolution of the Major Histocompatibility Complex

, edited by

Klein

Springer

Heidelberg, Germany

Takahata

1993

Allelic genealogy and human evolution

Mol. Biol. Evol.

–

Takahata

2001

Molecular phylogeny and demographic history of humans

, pp.

299

–

305

Humanity From African Naissance to Coming Millennia—Colloquia in Human Biology and Palaeoanthropology

, edited by

Tobias

P V

Raath

M A

Moggi-Cecchi

Doyle

G A

Firenze University Press

Firenze, Italy

Takahata

Satta

1997

Evolution of the primate lineage leading to modern humans: phylogenetic and demographic inferences from DNA sequences. Proc. Natl. Acad. Sci

USA

4811

–

4815

Takahata

Satta

2002

Pre-speciation coalescence and the effective size of ancestral populations

, pp.

–

Modern Developments in Theoretical Population Genetics

, edited by

Slatkin

Veuille

Oxford University Press

Oxford

Takahata

Satta

Klein

1995

Divergence time and population size in the lineage leading to modern humans

Theor. Popul. Biol.

198

–

221

Templeton

A R

Clark

A G

Weiss

K M

Nickerson

D A

Boerwinkle

et al. ,

2000

Recombinational and mutational hotspots within the human lipoprotein lipase gene

Am. J. Hum. Genet.

–

Tremblay

Vézina

2000

New estimates of intergenera-tional time intervals for the calculation of age and origins of mutations

Am. J. Hum. Genet.

651

–

658

Wakeley

Hey

1997

Estimating ancestral population parameters

Genetics

145

847

–

855

Wall

J D

2000

A comparison of estimators of the population recombination rate

Mol. Biol. Evol.

156

–

163

Watterson

G A

1975

On the number of segregating sites in genetical models without recombination

Theor. Popul. Biol.

256

–

276

Weiss

Von Haeseler

1998

Inference of population history using a likelihood approach

Genetics

149

1539

–

1546

White

T D

Suwa

Asfaw

1994

Australopithecus ramidus, a new species of early hominid from Aramis, Ethiopia

Nature

371

306

–

312

C-I

1991

Inferences of species phylogeny in relation to segregation of ancient polymorphisms

Genetics

127

429

–

435

C-I

2001

The genic view of the process of speciation

J. Evol. Biol.

851

–

866

Yang

1997

On the estimation of ancestral population sizes of modern humans

Genet. Res.

111

–

116

Yang

2002

Likelihood and Bayes estimation of ancestral population sizes in hominoids using data from multiple loci

Genetics

162

1811

–

1823

Zhao

Fan

Jang

Mungall

A J

et al. ,

2001

Comparison of human genetic and sequence-based physical maps

Nature

409

951

–

953

Citations

Views

Altmetric

Metrics

Total Views 549

416 Pageviews

133 PDF Downloads

Since 2/1/2021

Month:	Total Views:
February 2021	1
March 2021	1
April 2021	9
May 2021	18
June 2021	17
August 2021	6
September 2021	4
October 2021	7
November 2021	16
December 2021	6
January 2022	16
February 2022	5
March 2022	7
April 2022	12
May 2022	11
June 2022	15
July 2022	7
August 2022	12
September 2022	16
October 2022	14
November 2022	9
December 2022	9
January 2023	10
February 2023	4
March 2023	13
April 2023	18
May 2023	12
June 2023	3
July 2023	26
August 2023	16
September 2023	16
October 2023	13
November 2023	9
December 2023	19
January 2024	15
February 2024	25
March 2024	14
April 2024	23
May 2024	13
June 2024	18
July 2024	12
August 2024	27
September 2024	16
October 2024	9

Estimating Ancestral Population Sizes and Divergence Times (original) (raw)

Abstract

METHODS

RESULTS

DISCUSSION

Acknowledgement

Footnotes

LITERATURE CITED

Citations

Views

Altmetric

Email alerts

Citing articles via

Latest

Most Cited

Estimating Ancestral Population Sizes and Divergence Times (original) (raw)

Abstract

METHODS

RESULTS

DISCUSSION

Acknowledgement

Footnotes

LITERATURE CITED

Citations

Views

Altmetric

Email alerts

Citing articles via

Latest

Most Read

Most Cited