Theoretical Foundation of the RelTime Method for Estimating Divergence Times from Variable Evolutionary Rates (original) (raw)

Abstract

RelTime estimates divergence times by relaxing the assumption of a strict molecular clock in a phylogeny. It shows excellent performance in estimating divergence times for both simulated and empirical molecular sequence data sets in which evolutionary rates varied extensively throughout the tree. RelTime is computationally efficient and scales well with increasing size of data sets. Until now, however, RelTime has not had a formal mathematical foundation. Here, we show that the basis of the RelTime approach is a relative rate framework (RRF) that combines comparisons of evolutionary rates in sister lineages with the principle of minimum rate change between evolutionary lineages and their respective descendants. We present analytical solutions for estimating relative lineage rates and divergence times under RRF. We also discuss the relationship of RRF with other approaches, including the Bayesian framework. We conclude that RelTime will be useful for phylogenies with branch lengths derived not only from molecular data, but also morphological and biochemical traits.

Introduction

The inference of divergence times is accomplished by assuming a constant rate throughout the tree (a strict molecular clock) or by relaxing the strict molecular clock (Ho and Duchêne 2014; dos Reis et al. 2016; Kumar and Hedges 2016). Bayesian approaches are widely applied to relaxed clock dating analyses. They require specification of a probability distribution of evolutionary rates in the tree (e.g., lognormal distribution), as well as an assumption of the presence or absence of rate autocorrelation among branches. In contrast, the RelTime approach does not require specification of such priors, and produces relative node ages that can then be transformed into absolute dates by using calibration constraints for one or more nodes (Tamura et al. 2012, 2013). RelTime performs well for estimating divergence times in analyses of many large empirical data sets (Mello et al. 2017) and simulated data sets (Tamura et al. 2012; Filipski et al. 2014).

RelTime’s computational speed (fig. 1) and accuracy have led to its use for estimating divergence time for large data sets (Tamura et al. 2012; Mahler et al. 2013; Bond et al. 2014; Bonaldo et al. 2016). However, a mathematical foundation for the RelTime method has not yet been provided, which is needed not only to understand basic properties and assumptions of RelTime, but also to reveal its relationship with other molecular dating methods (Ho and Duchêne 2014; dos Reis et al. 2016; Kumar and Hedges 2016). In the following, we present the theoretical foundation of the RelTime method. We also assess the absolute performance of RelTime and compare it with other methods by analyzing data generated using computer simulations in which sequences were evolved according to three different branch rate models: independent (Drummond et al. 2006), autocorrelated (Kishino et al. 2001), and hybrid (Beaulieu et al. 2015).

Fig. 1.

(a) Computational time taken by RelTime and MCMCTree (Bayesian method) to estimate divergence times for data sets containing increasing number of sequences (n). The tested sequence alignment consisted of 4,493 sites in which sequence evolution was simulated with extensive rate variation (RR50 data from Tamura et al. 2012). RelTime’s speed advantage increases with data volume by O(_n_2). (b) Distribution of computation speed ratio of RelTime to MCMCTree for 70 data sets, each containing 100 ingroup sequences that were simulated as described in the Materials and Methods.

Mathematical Theory

Theoretical Analysis for a Phylogeny with Three Taxa and an Outgroup

We begin with the simplest case in which the phylogeny contains a clade with three ingroup taxa (subtree at node 5) and one outgroup taxon (fig. 2a). In this tree, branch lengths _b_1 and _b_2 represent the amount of evolutionary change that has occurred in lineages emanating from node 4 and leading to taxon 1 and taxon 2, respectively. We assume that taxon 1 and 2 are sampled at the same evolutionary time (_t_1 = 0 and _t_2 = 0), which is usually the case in phylogenetic analysis of data sampled from living species. The contemporaneous sampling of data produces sampling times equal to 0, which serve as calibration points (Tamura et al. 2012). By using branch lengths in this phylogeny (_b_’s), we can estimate the relative evolutionary rates (_r_’s) for all the lineages as well as relative divergence times (_t_’s). Here, a lineage refers to a branch and all the taxa (and branches) in the descendant subtree. For example, lineage a in figure 2a contains two taxa (1 and 2) and three branches with lengths _b_1, _b_2, and _b_4.

Fig. 2.

The relative rate framework for RelTime method. (a) A tree containing three ingroup sequences with an outgroup. Branch lengths are _b_i’s and lineage rates are _r_i’s. _L_a = _b_4 + ½(_b_1 + _b_2). (b) The case of four ingroup sequences with an outgroup. Here, _L_a = _b_5 + ½(_b_1 + _b_2). _L_b = _b_6 + ½(_b_3 + _b_4), when using the arithmetic mean. See figure 3 and its legend for a simple procedure outlining the calculation of relative rates using RRF.

The following system of equations formalizes the RelTime approach mathematically by linking relative rates for lineages (_r_i) with branch lengths (_b_i) in figure 2a. Here,,

where _L_a is depth of node 5 on lineage a that contains taxon 1 and 2; _L_a = _b_4 + ½(_b_1 + _b_2).

The setting of equalities in equations (3) and (4) leads to preference for the minimum rate change between the lineage originating at node 5 and the descendant lineage originating at node 4. The selection of lineage rate _r_a will be constrained by rate ratios imposed by equations (1) and (2), which relaxes the strict molecular clock.

Because all the lineage rates are relative, one unknown is reduced by setting the group rate at the most recent common ancestor of the ingroup node to be 1, that is,

We solve for each _r_i by using [equations (1–5)](#E1 E2 E3 E4 E5) and get:

r1=4b1(b1+b2+2b4)/[(b1+b2)(b1+b2+2b3+2b4)],

(6)

r2=4b2(b1+b2+2b4)/[(b1+b2)(b1+b2+2b3+2b4)],

(7)

ra=2(b1+b2+2b4)/(b1+b2+2b3+2b4).

(9)

The estimates of relative lineage rates and lengths produce an ultrametric tree with relative times for node 4 (_t_4) and node 5 (_t_5):

t4=(b1+b2)(b1+b2+2b3+2b4)/4(b1+b2+2b4),

(10)

The above [equations (6–11](#E6 E7 E8 E9 E10 E11)) constitute the relative rate framework (RRF) for phylogenies with three taxa and one outgroup. These equations yield point estimates for lineage rates and node ages. Because branch lengths have variances, the resulting lineage rate estimates have variances. The variance of relative node ages will be a function of branch length variances and the variance contributed by evolutionary rate differences among branches (see Discussion).

Theoretical Analysis for Phylogenies Containing Four Taxa and an Outgroup

To estimate six evolutionary rates (_r_1–_r_4, _r_a, and _r_b) using six branch length estimates (_b_1–_b_6) for a phylogeny containing four ingroup taxa (fig. 2b) we write a set of seven equations:

Here, La=b5+½(b1+b2) and Lb=b6+½(b3+b4)⁠. [Equations (14–16)](#E14 E15 E16) above lead to a preference for minimum changes in rates between ancestor–descendant lineage pairs. In this system, _r_i ’s are not required to be equal to one another and the rate assignments are constrained by the rate ratios among lineages (eqs. 12, 13, and 17), which relaxes the strict molecular clock.

We solve for each _r_i from [equations (12–18)](#E12 E13 E14 E15 E16 E17 E18) to estimate relative rates and node ages.

r1=4b1(b1+b2+2b5)/[(b1+b2)(b1+b2+b3+b4+2b5+2b6)],

(19)

r2=4b2(b1+b2+2b5)/[(b1+b2)(b1+b2+b3+b4+2b5+2b6)],

(20)

r3=4b3(b3+b4+2b6)/[(b3+b4)(b1+b2+b3+b4+2b5+2b6)],

(21)

r4=4b4(b3+b4+2b6)/[(b3+b4)(b1+b2+b3+b4+2b5+2b6)],

(22)

ra=2(b1+b2+2b5)/(b1+b2+b3+b4+2b5+2b6),

(23)

rb=2(b3+b4+2b6)/(b1+b2+b3+b4+2b5+2b6).

(24)

The estimates of relative node ages _t_5, _t_6, _t_7 for nodes 5, 6, and 7, respectively, are:

t5=(b1+b2)(b1+b2+b3+b4+2b5+2b6)/[4(b1+b2+2b5)],

(25)

t6=(b3+b4)(b1+b2+b3+b4+2b5+2b6)/[4(b3+b4+2b6)],

(26)

t7=(b1+b2+b3+b4+2b5+2b6)/4.

(27)

The above [equations (19–27](#E19 E20 E21 E22 E23 E24 E25 E26 E27)) establish RRF of the RelTime approach for a tree containing four taxa and one outgroup. As previously mentioned, point estimates of node ages and lineage rates have variances because branch lengths have variances and because evolutionary rates are not equal among lineages (see Discussion).

Relative Rate Framework with Geometric Means

In both the original RelTime approach (Tamura et al. 2012) and the mathematical formulations above, we considered an arithmetic mean when averaging branch lengths to minimize evolutionary rate changes. This approach does not assume an equal rate, but is rather a natural way to calculate node depths by averaging branch lengths. We have now developed analytical formulas for an alternative RRF using the geometric mean that balances the rate changes between two descendant lineages. For example, if _b_1 = 1 and _b_2 = 4 in figure 2a, then the arithmetic mean will give 2.5. Thus, evolutionary rate _r_1 is 2.5 times slower and _r_2 is 1.6 times faster when compared with the average rate. The difference in rate change (between 2.5 and 1.6, in the present case) becomes larger as the difference between _b_1 and _b_2 becomes larger when using the arithmetic mean. In contrast, the geometric mean would give 2.0, which results in two times slower rate in _b_1 and two times faster rate in _b_2, as compared with the average rate. That is, the difference in rate between the ancestral and descendant lineages is always equal for sister lineages when using the geometric mean, which is not the case if the arithmetic mean is used.

Using a geometric means approach, we obtain the following analytical formulas for a phylogeny containing three ingroup taxa (fig. 2a):

For the 4-taxon case in figure 2b, the equations are as follows:

Relative Rate Framework for a General Case

Next, we consider a general case of a phylogeny with more than four ingroup taxa. In this case, RelTime applies RRF with a bottom-up approach, starting from the tips (external branches) of the phylogeny and moving toward the root. In the first step, we generate local relative lineage rates for the subtrees containing three and four taxa (subtree x and y, respectively) in a phylogeny containing eight taxa and an outgroup (fig. 3). Subtree y contains four taxa, so we apply [equations (34–39)](#E34 E35 E36 E37 E38 E39) to generate relative rates. [Equations (28–31)](#E28 E29 E30 E31) will be used to estimate rates for clade x which contains three ingroup taxa.

Fig. 3.

Calculating relative lineage rates in a phylogeny. (a) A phylogeny containing eight ingroup sequences (taxa) that consists of three taxa nodes (x and w) and four taxa nodes (y and z). (b) Reduced phylogeny and branch lengths after applying RRF to nodes x and y. (c) Final phylogeny after applying RRF to node z in panel b, which produces node w in a three taxa configuration. After applying RRF to node w, a preorder traversal scales descendant lineage rates by multiplying them by their ancestral lineage rates to generate final relative rates. The ingroup root node (w) has an average relative lineage rate equal to 1. Multisequence taxa are indicated by Roman numerals.

In the next step, we consider the parent clade (z) that has four taxa: composite taxon I consisting of two taxa (1 and 2), taxon II consisting of one taxon (3), composite taxon III consisting of two taxa (4 and 5), and composite taxon IV containing two taxa (6 and 7) (fig. 3b). We estimate branch lengths (_b_I, _b_II, _b_III, and _b_IV) by using the geometric means: bI=b1b2+b9⁠, bII=b3⁠, bIII=b4b5+b10⁠, and bIV=b6b7+b11⁠. Then, we use [equations (34–39)](#E34 E35 E36 E37 E38 E39) to compute relative rates for all the lineages using these branch lengths. In bigger phylogenies, this process is carried out for every internal node in a postorder traversal. In the current example, node w is the common ancestor of all the ingroup taxa and is in a 3-taxon configuration (fig. 3c). We now have bV=bIbII+bx and bVI=bIIIbIV+by and apply [equations (28–31)](#E28 E29 E30 E31). At this stage, we have local (relative) lineage rate estimates for nodes in the ingroup tree. Finally, all the rates in the tree are computed by multiplying descendant lineage rates by their respective ancestral lineage rates in preorder traversal to generate final relative rates such that the ingroup root node (e.g., node w in fig. 3c) has an average relative lineage rate equal to 1.

Results

We evaluated the performance of RRF for correctly estimating lineage rates and divergence times by analyzing data generated using computer simulation in which sequences were evolved according to the autocorrelated rate (AR) model (Kishino et al. 2001), the independent rate (IR) model (Drummond et al. 2006), as well as a model that contains multiple distributions of rates (hybrid rates [HR]). We present results from the analysis of a collection of small data sets (three ingroup sequences; AR and IR models) and two collections of large data sets: one containing 100 ingroup sequences (AR and IR models) and another containing 91-ingroup sequences (HR model).

Analysis of Small Data Sets (Phylogeny with Three Ingroup Sequences)

We first tested the accuracy of divergence times estimated via RRF by analyzing data sets containing three ingroup sequences. We used RRF with geometric means and compared the modeled (true) values with estimated lineage rates as well as node ages (fig. 4). The lineage rates produced by RRF were similar to the true rates for data sets that were evolved under the AR model (fig. 4a); the relationship showed a linear regression slope of 0.99. RRF estimates for external lineages, were similarly good (blue circles in fig. 4a, _r_2 = 0.68). The relationship was also strong for internal branches, but with greater dispersion (red circles in fig. 4a, _r_2 = 0.29). RRF performance for AR data sets was similar to that observed for IR data sets (fig. 4b). Overall, the success in estimating lineage-specific evolutionary rates translates into robust estimates of relative times for AR (fig. 4c) and IR data sets (fig. 4d).

Fig. 4.

Performance of RRF and MCMCTree Bayesian analyses for three ingroup sequences with an outgroup (topology in fig. 2a). RRF lineage rate estimates are compared with the true lineage rate estimates for sequences that evolved under (a) autocorrelated and (b) independent rate models. Blue circles represent external lineages (single taxon, _r_1, _r_2, and _r_3) and red circles represent the internal lineage (_r_a). RRF estimates of divergence times for (c) autocorrelated rate and (d) independent rate data sets. Bayesian (MCMCTree) estimates of branch rates are compared with the true branch rates for sequences evolved under (e) autocorrelated and (f) independent rate models. Blue circles indicate external branches (_r_1, _r_2, and _r_3) and red circles show the internal branch (_r_4). Bayesian estimates of divergence times for (g) autocorrelated rate and (h) independent rate data sets. Each panel contains results from 50 simulated data sets. All rates and divergence time estimates are normalized to allow direct comparison between true and estimated values. Regression slope and correlation coefficient (_r_2) are shown for each panel.

Bayesian methods produce branch-specific rates under a given statistical distribution of rates (e.g., lognormal), which differs from RRF where relative lineage rates are considered. Therefore, we compared the true values of branch rates with the branch rates derived using a Bayesian method. We provided correct priors (based on simulation parameters) and conducted the analyses using MCMCTree software (Yang 2007). Bayesian branch rate estimates showed a more diffuse relationship with true rates in internal branches for AR data sets (fig. 4e;_r_2 = 0.00) as compared with RRF (fig. 4a;_r_2 = 0.29). Fundamental reasons underlying these patterns are presented in the Discussion section. These patterns notwithstanding, Bayesian estimates of times for AR data sets showed a slope of 0.99 with true time estimates (fig. 4g), which means that node age estimates are generally robust to difficulties in estimating branch-specific rates. This robustness was also seen for IR data sets, where Bayesian branch rates showed a more diffuse relationship with the true rates (fig. 4f), but estimated times showed a slope close to 1 with a high _r_2 (0.91). Overall, both Bayesian and RRF approaches showed similar or lower accuracy in estimation of rates and node ages for IR data sets as compared with AR data sets, potentially because rate independence requires the estimation of a greater number of free parameters.

Analysis of Large Data Sets (Phylogeny with 100 Ingroup Sequences)

We next analyzed data sets consisting of 100 ingroup sequences, which were evolved over a range of empirical rate variation parameters. As observed for data sets containing only three ingroup sequences, RRF lineage rate estimates were highly correlated with the true rates for AR data sets (fig. 5a). Lineage rate correlations were generally lower for IR data sets and these correlations were higher for external (tip) lineages (fig. 5b). Importantly, distributions of the slopes of RelTime node age estimates and true times were centered at close to 1 for both AR and IR data sets (fig. 5c).

Fig. 5.

Performance of RRF in the analysis of data sets with 100 ingroup sequences and an outgroup. Fraction of data sets for which RRF inferred lineage rates are correlated with true rates at different levels of correlation for data sets simulated with (a) autocorrelated rates and (b) independent rates. Dotted lines represent external branches and solid lines indicate the internal branches. (c) Distribution of the linear regression slopes of RRF estimates and true times for different data sets. The results presented are from analyses of 35 data sets in which sequences evolved with autocorrelated branch rates (blue lines) and another 35 data sets that were evolved with independent branch rates (red lines). In regression analysis, the intercept was set to zero, because the estimated node age is expected to be zero when the true node age is zero. The regression slopes generated with and without this assumption produced similar patterns.

Analysis of Hybrid Rate Data Sets (Phylogeny Containing 91 Ingroup Sequences)

We also examined the performance of RRF in an analysis of simulated data from Beaulieu et al. (2015), who simulated two lognormal distributions (hybrid rate model) for an angiosperm phylogeny in which herbaceous clades exhibited higher and more variable evolutionary rates than woody clades (fig. 6a). They reported that one rate model Bayesian methods produced considerably more ancient date estimates for the divergence of herbaceous and woody clades. This overestimation of divergence time became more severe as the difference between the two rates increased (fig. 6b). Application of RRF produced divergence time estimates that were much closer to true times (fig. 6c and d), which shows that RRF can be useful in cases where the rate distribution differs among clades (Smith and Donoghue 2008; Dornburg et al. 2012; Beaulieu et al. 2015) or when clocks are local (Drummond and Suchard 2010; Crisp et al. 2014).

Fig. 6.

(a) Hybrid distribution of rates for branches leading to woody taxa (brown) and herbaceous taxa (green), with the former evolving three times (3×) slower than the latter. (b) Bayesian estimates reported by Beaulieu et al. (2015) when the rate difference between clades was 3-fold (3×, solid line) and 6-fold (6×, dashed line), with the simulated angiosperm age of 140 Ma shown by a red line. RelTime estimates of angiosperm age for Beaulieu et al. (2015)’s alignments with (c) 3× mean rate difference and (d) 6× mean rate difference. Median and SD for age estimates are shown. Beaulieu et al. (2015) simulated 100 replicates (1,000 bases) under a GTR model for each scenario. Bayesian analyses were conducted using a single uncorrelated lognormal rate prior in Beaulieu et al. (2015). The same alignments, topology, and ingroup calibrations were used in RRF analyses.

Furthermore, Tamura et al. (2012) found that RelTime produced accurate time estimates in simulations with a very large number of sequences even when one clade possessed accelerated evolutionary rates, where penalized likelihood methods did not perform as well. In general, we expect that the limitations of on rate model Bayesian analyses will be overcome by local clock methods (Drummond and Suchard 2010; Höhna et al. 2016; Lartillot et al. 2016), but the computational time required to analyze even modestly sized data sets via these approaches can be prohibitive. So, the current RRF approach, which does not assume a specific model for rate variation, may be suitable for such data in its current implementation or as a foundation for future methodological refinements.

Discussion

We have presented a mathematical foundation for the relative rate framework (RRF) underlying the RelTime method. In the following, we compare RRF with other approaches for estimating divergence times and present the motivation behind RRF.

Lineage Rates versus Branch Rates

In RRF, evolutionary rate heterogeneity in a phylogeny is considered by comparing rates between sister lineages emanating from internal nodes in a phylogeny. A lineage rate is an average of all the lineage rates that belong to that lineage, including all the descendants (e.g., node x in fig. 3). RRF’s estimation of evolutionary lineage rates is fundamentally different from the comparison and modeling of branch rates in other approaches. For example, the penalized likelihood methods consider differences in rates between ancestral and descendant branches (Sanderson 1997, 2003), and Bayesian methods model branch rates to share a probabilistic distribution (e.g., lognormal distribution).

The distinction between lineage and branch rates complicates direct comparisons of evolutionary rates produced by using RelTime and Bayesian methods, except for external (tip) branches for which the lineages consist of only one branch. In this case, RelTime and Bayesian estimates of rates show similar trends (fig. 4, blue circles). This trend was also observed in the analysis of 100 sequence data sets, where the correlation of the estimated external branch rates with true branch rates was high for both RelTime (median correlation = 0.76 and 0.69) and Bayesian (median correlation = 0.72 and 0.86) analyses of AR and IR data sets, respectively.

For internal branches, computer simulations showed greater similarity between RRF estimates of lineage rates and the true lineage rates (fig. 4a and b) as compared with the similarity observed between Bayesian estimates of branch rates and the true branch rates (fig. 4e and f, respectively). This was also the case in the analysis of 100 sequence data sets: internal branch rates from Bayesian methods were less strongly correlated with the true rates (median correlation = 0.42 and 0.36 for AR and IR data sets, respectively), as compared with those observed for lineage rates from RRF (median correlation = 0.79 and 0.55 for AR and IR data sets, respectively).

These trends are due to the fact that the estimate of a branch rate is a function of two time estimates: one for the ancestral node and another for the descendant node. For example, the variance of the rate on branch with length _b_4 in figure 2a is a function of the variance of two time estimates (_t_4 and _t_5), in addition to the variance of _b_4. In contrast, the variance of a lineage rate (e.g., _r_a in fig. 2a) is a function of the variance of only one time estimate (_t_5), in addition to the lineage depth (_L_a), because the other time point is zero in contemporary sequence sampling. Thus, branch rates are estimated with greater variance than lineage rates, which results in lower correlations seen for Bayesian approaches.

Underlying Evolutionary Rate Model in RRF

RRF exploits the fact that the estimation of the ratio of lineage rates at any node in a phylogeny is independent of that node’s age. For example, the ratio of evolutionary rates at node 4, _r_1/_r_2, does not depend on _t_4 (Fig. 7a). It is clear that the rate of evolution is higher in the lineage leading from node 4 to taxon 2 than to taxon 1 (_r_2 > _r_1), because _b_2 is longer than _b_1 in figure 7a. In fact, we can estimate _r_1/_r_2 (= _b_1/_b_2) without knowing anything about the probability distribution of evolutionary rates throughout the tree. Similarly, the other rate ratio in this tree does not depend on knowledge of the distribution of rates among branches or lineages, it is simply [(_b_1 + _b_2)/2 + _b_4]/_b_3 when using the arithmetic means and (b1b2+b4)/b3when using the geometric means.

A phylogenetic tree of three taxa (1, 2, and 3). (a) Original phylogenetic tree with the observed branch lengths (b’s), which are necessary to estimate node times (t’s) shown in panel (b). Evolutionary trees where the rate for the subtree containing taxon 1 and 2 (s4) is much (c) faster or (d) slower than that of its ancestral branch (r4).

Fig. 7.

A phylogenetic tree of three taxa (1, 2, and 3). (a) Original phylogenetic tree with the observed branch lengths (_b_’s), which are necessary to estimate node times (_t_’s) shown in panel (b). Evolutionary trees where the rate for the subtree containing taxon 1 and 2 (_s_4) is much (c) faster or (d) slower than that of its ancestral branch (_r_4).

However, the node-by-node specification of relative lineage rates is not sufficient to estimate relative times _t_4 and _t_5. For that, we need to know the relationship of subtree rate _s_4 and branch rate _r_4, where _s_4 is the overall evolutionary rate of the subtree originating at node 4 (contains taxon 1 and 2) and _r_4 is the evolutionary rate on branch _b_4 (fig. 7a). Without assuming a specific distribution of rates, _s_4/_r_4 cannot be determined uniquely and _t_4 can be at any point between 0 and _t_5. Figure 7c and d present two extreme possibilities. In one, if the subtree rate (_s_4) is much higher after the divergence event at node 4 (_s_4 ≫ _r_4), then the estimate of _t_4 will be small and the divergence event recent (fig. 7c). Alternately, if the subtree rate is much slower after the divergence event at node 4 (_s_4 ≪ _r_4), then _t_4 will be much more ancient (fig. 7d).

In its mathematical formulation, RRF considers the best estimate of the rate of evolution of an ancestral lineage to be the average of the rate of evolution of its two descendant lineages (e.g., eqs. 3, 14, 15, and 16), while accommodating the relative rates among lineages. In the current example, this would result in the timetree shown in figure 7b. This is the principle of minimum rate change from the ancestor to its immediate descendants, where we do not favor extreme rate assignments, for example, those in figure 7c and d. Probabilities of such extreme rate assignments are also low in commonly used branch rate distributions (e.g., lognormal, normal, and exponential distributions), so Bayesian methods also tend to favor the smallest rate change needed to explain the data.

Relationship of RRF with Other Molecular Clock Methods

The treatment of evolutionary rates is conceptually different between Bayesian methods and RRF, because RRF does not assume a specific statistical distribution for modeling (lineage) rate variation at the outset and Bayesian methods model branch rates rather than the lineage rates. The application of the principle of minimum rate change in RRF is different from nonparametric and semiparametric approaches based on the idea of Sanderson (1997), because RRF minimizes lineage rate changes rather than the branch rate changes from an ancestor to its immediate descendants. In addition, RRF does not attempt to estimate a universal penalty for the speed of rate change throughout the tree.

RRF is distinct from strict clock approaches because strict clock methods only apply rate averaging (e.g., eqs. 3 and 4) at each node in the tree, but RRF imposes an additional constraint that the ratio of sister lineage rates be the ratio of their lineage lengths (e.g., eqs. 1 and 2). These additional constraints relax the strict clock and allow rates to vary throughout the tree. For this reason, RelTime is also different from some molecular clock methods that assume the ratio of ages between two nodes to be proportional to the ratio of their average node-to-tip distances (Purvis 1995; Britton et al. 2002,, 2007). This assumption is tantamount to assuming equality of rates among lineages, and, thus, a strict molecular clock. For this reason, these approaches require pruning of taxa or lineages for which the rate equality does not apply (Takezaki et al. 1995). RRF does not assume equality among any lineage rates at any time, and it does not require the removal of rate heterogeneous lineages.

Statistical Distribution of Relative Lineage Rates

Relative lineage rates produced by RRF will show extensive correlation, because the evolutionary rate of a lineage is a function of evolutionary rates of all its descendant lineages. This would result in both local and global correlation, which is expected to be present in phylogenies with autocorrelated as well as independent branch rates. As expected, the analysis of 100 sequence data sets showed correlation between ancestral and descendant lineage rates when branch rates were autocorrelated (median correlation = 0.88) or independent (median correlation = 0.77). Therefore, RRF is fundamentally different from the autocorrelated (branch) rate model of Kishino et al. (2001) as well as the independent (branch) rate model of Drummond et al. (2006), as they deal with branches rather than lineages, as defined here. For this reason, the lineage rates produced by RRF are not directly comparable with those observed for branch rates produced by Bayesian methods.

Even though RRF does not require a statistical distribution of lineage rates at the outset, the resulting estimates of lineage rates may follow a statistical distribution. We examined this relationship in an analysis of 100 ingroup sequence data sets in which branch rates were simulated with lognormal, exponential, or uniform distributions. When branch rates followed a lognormal distribution, the distribution of true lineage rates was also lognormal, as was the distribution of RRF lineage rate estimates (fig. 8a–d). When the branch rates followed an exponential distribution, the RRF and true lineage rates showed a similar distribution (fig. 8e). In the case of a uniform distribution of branch rates, the lineage rates showed a normal-like distribution and the RRF rate estimates were lognormally distributed (fig. 8f). All of these results suggest that a flexible lognormal distribution will generally fit the distribution of RRF lineage rates. Importantly, time estimates showed a linear relationship with the true times, with slopes close to 1.0 (fig. 8g–i).

Fig. 8.

Distributions of true and RRF-derived estimates of lineage rates. Branch rates were simulated under autocorrelated lognormal rate models with (a) low and (b) high dispersions, independent lognormal rate models with (c) low and (d) high dispersions, and independent rate models with (e) exponential and (f) uniform distributions. Green lines represent the fitted curves of the true lineage rate distributions and blue bars show the distributions of RRF rates. Rate unit is substitutions per site per million years. (g_–_i) Relationships between true times and RRF times in six rate scenarios. Black circles and lines represent the average times of 5 replicates simulated under rate scenarios in (a), (c), and (e), and gray triangles and lines represent the average times of 5 replicates simulated under rate scenarios in (b), (d), and (f). All times are normalized to the sum of ingroup divergence times. Regression slopes and correlation coefficients (_r_2) are shown.

Point Estimates and Their Variances

RRF yields point estimates for (relative) lineage rates and divergence times based on branch lengths. As is the common practice in classical statistics, estimates of dispersion such as SEs and confidence intervals accompany all estimates. The variance of a lineage rate estimate is a function of branch length variances. It can be obtained analytically by using the equations for lineage rate (e.g., [eqs. 28–31](#E28 E29 E30 E31) for the case of three taxa with outgroup by the delta method) or simply by using a bootstrap sampling procedure. However, the estimation of confidence intervals around the node ages depends on branch length variances as well as the degree of inequality of evolutionary rates among lineages (Kumar and Hedges 2016). Tamura et al. (2013) proposed a method to estimate confidence intervals within RelTime, which produces rather wide confidence intervals. We are currently investigating an advanced approach to narrow confidence intervals while maintaining appropriate coverage probabilities, but this subject is beyond the scope of the current article.

Computational Speed of RelTime

RRF scales well with increasing numbers of sequences and is much faster than Bayesian methods for analyzing of molecular sequence data (fig. 1). The fast computational speed is due to the innovation that RRF uses all the data first to map a large alignment onto a phylogeny, and then it uses the resulting branch lengths to generate relative divergence times and evolutionary lineage rates. The computational time taken by RRF is the sum of time taken to generate maximum likelihood estimates of branch lengths for a given sequence alignment and phylogeny and the time taken to estimate rates and dates using RRF. The latter is negligible compared with the former, because of the analytical nature of RRF. In comparison, Bayesian methods are computationally demanding because they require a substantial exploration of likelihood space using prior distributions to generate posterior estimates of rates and divergence times.

RelTime for Phylogenies with Branch Lengths

The above decomposition has a positive side effect, in addition to making RelTime computationally speedy. RRF may be applied to any phylogeny where branch lengths reflect the amount of change. For example, RRF is directly applicable when branch lengths are estimated by using pairwise evolutionary distances and a least squares approach for a given tree topology (Rzhetsky and Nei 1993). In addition to multiple sequence alignments, such distances can come from unaligned and locally aligned genome or genomic segments (Otu and Sayood 2003; Henz et al. 2005; Auch et al. 2006; Deng et al. 2006; Gao and Qi 2007; Lin et al. 2009; Xu and Hao 2009).

In fact, RRF can be applied to any phylogeny where branch lengths are generated using other types of molecular data (e.g., gene expression patterns and breakpoint distances) or nonmolecular data (e.g., morphological and life history traits) (Herniou et al. 2001; Gramm and Niedermeier 2002; King et al. 2016; Cooney et al. 2017). Of course, the accuracy of the relative rate and time inferences made for such data depends directly on the accuracy of the phylogenetic tree and the branch lengths, so the biological interpretations of the results obtained will require utmost care.

Usefulness of Relative Node Ages

RRF’s accurate estimation of relative node ages without assuming a speciation model or calibration priors can benefit many applications (Tamura et al. 2012). For example, relative node ages from molecular data are directly comparable with those from fossil data. This allows evaluation of biological hypotheses without the circularity created by the use of calibration priors and densities inferred from molecular data (Battistuzzi et al. 2015; Gold et al. 2017). Along these lines, RRF has been used to develop a protocol to identify calibration priors that have the strongest influence on the final time estimates in Bayesian dating (Battistuzzi et al. 2015), because the cross-validation methods are unlikely to be effective (Warnock et al. 2012,, 2015).

Inferring Absolute Times from Relative Node Ages

By placing calibration constraints on one or more nodes in the tree, we can generate an absolute timetree from the ultrametric tree containing relative node ages. Tamura et al. (2013) presented an algorithmic approach for adjusting relative rates to ensure that the estimated times for calibrated nodes are within the boundaries. This process uses the maximum and minimum boundaries only for calibration, which is preferable when the uncertainty distribution of calibrations is not known precisely. Otherwise, there is high probability of biased time estimation (Hedges and Kumar 2004; Ho and Phillips 2009; Inoue et al. 2010; Heath et al. 2014; Ho and Duchêne 2014; dos Reis et al. 2015). Tamura et al.’s approach worked well in the analysis of large data sets, because Bayesian time estimates reported in multiple large-scale studies are similar to those produced by RRF using ultrametric trees with relative times that were transformed into timetrees using many calibration constraints (Mello et al. 2017).

Conclusions

We have presented a mathematical foundation for the RelTime method and elucidated its relationship with other relaxed and strict clock methods. We have shown that the relative rate framework (RRF) produces excellent estimates of rates and divergence times for evolutionary lineages. It is, however, important to note that estimates of divergence times in a phylogeny are only biologically meaningful when the reconstruction of evolutionary relationships is robust. Therefore, the best practice is to first obtain a reliable phylogeny and then estimate divergence times. We must also consider the confidence intervals associated with node ages to assess the precision of time estimates prior to making biological inferences.

Materials and Methods

Computer Simulations and Analysis

We simulated 200 multisequence alignments: 50 each for two models of evolutionary rates (independent and autocorrelated among branches) for two model topologies containing three- and four-ingroup taxa (fig. 2a and b, respectively). The node height of the ingroup subtree was set to be 10 time units, while the node heights of all descendent subtrees varied randomly between 0 and 10 time units. For each resulting model timetree, branch rates were sampled from a lognormal distribution, where the mean rate was drawn randomly from an empirical distribution (Rosenberg and Kumar 2003) and the SD varied from 0.25 to 0.75 for all branches independently. For the autocorrelated rates, the initial rate was drawn randomly from an empirical distribution (Rosenberg and Kumar 2003) and the autocorrelation parameter was varied from 0.1 to 0.3. This rate sampling resulted in a phylogram with branch lengths that could be used as input for SeqGen (Grassly et al. 1997). We used the Hasegawa–Kinshino–Yano (HKY) model (Hasegawa et al. 1985) with 4 gamma categories and empirically derived GC content and transition/transversion ratio (Rosenberg and Kumar 2003) as simulation parameters. We generated simulated multispecies alignments where each sequence was 3, 000 base pairs long. The results from three-ingroup sequence analysis (fig. 4) were similar to those from the analysis of four-ingroup sequences (not shown).

Using the same simulation strategy, we created 35 alignments each under independent and autocorrelated rate scenarios following a master phylogeny of 100 taxa that was sampled from the bony-vertebrate clade in the Timetree of Life (Hedges and Kumar 2009). In the independent rate case, the SD varied from 0.3 to 0.5. In the autocorrelated rate case, the autocorrelation parameter varied from 0.01 to 0.04. All other simulation parameters (GC contents, transition/transversion ratio, and sequence length) were derived from empirical distributions (Rosenberg and Kumar 2003).

Using the same 100 ingroup taxon master phylogeny of bony vertebrates, we simulated five additional sequence data sets under (1) an autocorrelated lognormal rate model with low dispersion, where the initial rate and the autocorrelation parameter were set to be r and 0.02, respectively; (2) an autocorrelated lognormal rate model with high dispersion, where the initial rate and the autocorrelation parameter were set to be r and 0.05, respectively; (3) an independent lognormal rate model with low dispersion, where the mean rate and the SD (in log scale) were set to be r and 0.25, respectively; (4) an independent lognormal rate model with high dispersion, where the mean rate and the SD (in log scale) were set to be r and 0.75, respectively; (5) an independent exponential rate model, where the mean rate was set to be r; and (6) an independent uniform rate model, where the branch-specific rates were sampled from an uniform distribution from 0 to 2_r_. GC contents, transition/transversion ratio, sequence length, and evolutionary rate r were derived from empirical distributions (Rosenberg and Kumar 2003).

MEGA software (Kumar et al. 2012, 2016) was used to obtain maximum likelihood estimates of branch lengths from simulated sequence alignments, where the correct substitution model and tree topology were used. RRF ([eqs. 28–42](#E28 E29 E30 E31 E32 E33 E34 E35 E36 E37 E38 E39 E40 E41 E42)) was applied to the resulting phylogram with branch lengths, and relative lineage rates and times were obtained. One calibration (true age ± 10 Ma) at the crown node of the ingroup was used to convert relative time estimates for comparison with true times (fig. 8). No calibrations were used in other RRF analyses.

All Bayesian analyses were conducted using MCMCTree (Yang 2007) using correct priors; two independent runs of five million generations were carried out. Results were checked for convergence using Tracer (Rambaut et al. 2014). ESS values were higher than 200 after removing 10% burn-in samples in each run. MCMCTree analyses used one root calibration (true age ± 0.1 time units).

Analysis of Hybrid Rate Models

Simulated data sets and BEAST results were provided by Beaulieu et al. (2015), or retrieved from the Dryad Repository. All outgroup and root calibrations are automatically disregarded in RelTime, because the assumption of equal rates of evolution between the ingroup and outgroup sequences is not testable (Kumar et al. 2016). Lognormal distributions with fixed median values of “true ages” were used as calibration densities in the original study (Beaulieu et al. 2015). Because RelTime does not require specific density distributions for calibrations, we used true age ± 5 Ma for all 15 ingroup calibrated nodes in the reanalysis to directly compare RelTime divergence time estimates with those from BEAST. Calibrations employed in RelTime (true age ± 5 Ma) had boundaries similar to 99% probability densities of lognormal distributions originally employed as calibrations. The same alignments, topology, and ingroup calibrations were used in RRF analyses. Estimates of angiosperm age were obtained by summarizing estimates of 100 data sets each in which the herbaceous clades have 3 times (3×) and 6 times (6×) higher rates than those of woody clades.

Acknowledgments

We are thankful to Heather Rowe for editorial comments and Beatriz Mello for many helpful discussions. We thank Jeremy Beaulieu for providing simulated data and Bayesian results. This research was supported by grants from NIH (R01GM126567-01), National Aeronautics and Space Administration (NASA, NNX16AJ30G), and Tokyo Metropolitan University (DB105).

References

Auch

Henz

Holland

Göker

2006

Genome BLAST distance phylogenies inferred from whole plastid and whole mitochondrion genome sequences

BMC Bioinformatics

Battistuzzi

Billing-Ross

Murillo

Filipski

Kumar

2015

A protocol for diagnosing the effect of calibration priors on posterior time estimates: a case study for the Cambrian explosion of animal phyla

Mol Biol Evol

1907

–

1912

Beaulieu

O’Meara

Crane

Donoghue

MJ.

2015

Heterogeneous rates of molecular evolution and diversification could explain the Triassic age estimate for angiosperms

Syst Biol

869

–

878

Bonaldo

Ribeiro

Lima

Dos Santos

AAC

Menezes

LSR

da Cruz

SOD

de Mello

Furtado

de Moura

Damasceno

et al. , .

2016

Isolation of infective Zika virus from urine and saliva of patients in Brazil

PLoS Negl Trop Dis

e0004816.

Bond

Garrison

Hamilton

Godwin

Hedin

Agnarsson

2014

Phylogenomics resolves a spider backbone phylogeny and rejects a prevailing paradigm for orb web evolution

Curr Biol

1765

–

1771

Britton

Anderson

Jacquet

Lundqvist

Bremer

2007

Estimating divergence times in large phylogenetic trees

Syst Biol

741

–

752

Britton

Oxelman

Vinnersten

Bremer

2002

Phylogenetic dating with confidence intervals using mean path lengths

Mol Phylogenet Evol

–

Cooney

Bright

Capp

EJR

Chira

Hughes

Moody

CJA

Nouri

Varley

Thomas

GH.

2017

Mega-evolutionary dynamics of the adaptive radiation of birds

Nature

542

7641

344

–

347

Crisp

Hardy

Cook

LG.

2014

Clock model makes a large difference to age estimates of long-stemmed clades with no internal calibration: a test using Australian grasstrees

BMC Evol Biol

263

–

279

Deng

Huang

Wang

Huang

Yang

Feng

Wang

2006

PTreeRec: Phylogenetic Tree Reconstruction based on genome BLAST distance

Comput Biol Chem

300

–

302

Dornburg

Brandley

McGowen

Near

TJ.

2012

Relaxed clocks and inferences of heterogeneous patterns of nucleotide substitution and divergence time estimates across whales and dolphins (Mammalia: cetacea)

Mol Biol Evol

:721–243.

dos Reis

Donoghue

Yang

2016

Bayesian molecular clock dating of species divergences in the genomics era

Nat Rev Genet

–

dos Reis

Thawornwattana

Angelis

Telford

Donoghue

Yang

2015

Uncertainty in the timing of origin of animals and the limits of precision in molecular timescales

Curr Biol

:2939–2912.

Drummond

SYW

Phillips

Rambaut

2006

Relaxed phylogenetics and dating with confidence

PLoS Biol

e88

–

e99

Drummond

Suchard

MA.

2010

Bayesian random local clocks, or one rate to rule them all

BMC Biol

114

–

125

Filipski

Murillo

Freydenzon

Tamura

Kumar

2014

Prospects for building large timetrees using molecular data with incomplete gene coverage among species

Mol Biol Evol

2542

–

2550

Gao

2007

Whole genome molecular phylogeny of large dsDNA viruses using composition vector method

BMC Evol Biol

Gold

Caron

Fournier

Summons

RE.

2017

Paleoproterozoic sterol biosynthesis and the rise of oxygen

Nature

543

7645

420

–

423

Gramm

Niedermeier

2002

Breakpoint medians and breakpoint phylogenies: a fixed-parameter approach

Bioinformatics

(

Suppl 2

S128

–

S139

Grassly

Adachi

Rambaut

1997

Seq-Gen: an application for the Monte Carlo simulation of protein sequence evolution along phylogenetic trees

Comput Appl Biosci

235

–

238

Hasegawa

Kishino

Yano

1985

Dating of the human-ape splitting by a molecular clock of mitochondrial DNA

J Mol Evol

160

–

174

Heath

Huelsenbeck

Stadler

2014

The fossilized birth-death process for coherent calibration of divergence-time estimates

Proc Natl Acad Sci U S A

111

E2957

–

E2966

Hedges

Kumar

2004

Precision of molecular time estimates

Trends Genet

242

–

247

Hedges

Kumar

2009

The timetree of life

New York

Oxford University Press

Henz

Huson

Auch

Nieselt-Struwe

Schuster

SC.

2005

Whole-genome prokaryotic phylogeny

Bioinformatics

2329

–

2335

Herniou

Luque

Chen

Vlak

Winstanley

Cory

O’Reilly

DR.

2001

Use of whole genome sequence data to infer baculovirus phylogeny

J Virol

8117

–

8126

Duchêne

2014

Molecular-clock methods for estimating evolutionary rates and timescales

Mol Ecol

5947

–

5965

SYW

Phillips

MJ.

2009

Accounting for calibration uncertainty in phylogenetic estimation of evolutionary divergence times

Syst Biol

367

–

380

Höhna

Landis

Heath

Boussau

Lartillot

Moore

Huelsenbeck

Ronquist

2016

RevBayes: bayesian phylogenetic inference using graphical models and an interactive model-specification language

Syst Biol

726

–

736

Inoue

Donoghue

PCJ

Yang

2010

The impact of the representation of fossil calibrations on Bayesian estimation of species divergence times

Syst Biol

–

King

Qiao

Lee

MSY

Zhu

Long

JA.

2016

Bayesian morphological clock methods resurrect placoderm monophyly and reveal rapid early evolution in jawed vertebrates

Syst Biol

499

–

516

Kishino

Thorne

Bruno

WJ.

2001

Performance of a divergence time estimation method under a probabilistic model of rate evolution

Mol Biol Evol

352

–

361

Kumar

Hedges

SB.

2016

Advances in time estimation methods for molecular data

Mol Biol Evol

863

–

869

Kumar

Stecher

Peterson

Tamura

2012

MEGA-CC: computing core of molecular evolutionary genetics analysis program for automated and iterative data analysis

Bioinformatics

2685

–

2686

Kumar

Stecher

Tamura

2016

MEGA7: Molecular Evolutionary Genetics Analysis version 7.0 for bigger datasets

Mol Biol Evol

1870

–

1874

Lartillot

Phillips

Ronquist

2016

A mixed relaxed clock model

Philos Trans R Soc B

371

1699

Lin

Cai

Lin

Chakraborty

2009

ComPhy: prokaryotic composite distance phylogenies inferred from whole-genome gene sets

BMC Bioinformatics

(

Suppl 1

S5.

Mahler

Ingram

Revell

Losos

JB.

2013

Exceptional convergence on the macroevolutionary landscape in island lizard radiations

Science

341

6143

292

–

295

Mello

Tao

Tamura

Kumar

2017

Fast and accurate estimates of divergence times from big data

Mol Biol Evol

–

Otu

Sayood

2003

A new sequence distance measure for phylogenetic tree construction

Bioinformatics

2122

–

2130

Purvis

1995

A composite estimate of primate phylogeny

Philos Trans R Soc B

348

1326

405

–

421

Rambaut

Suchard

Xie

Drummond

2014

. Tracer v1.6. Institute of Evolutionary Biology, University of Edinburgh, UK. Available from: http://beast.bio.ed.ac.uk/Tracer, last accessed March 29, 2018.

Rosenberg

Kumar

2003

Heterogeneity of nucleotide frequencies among evolutionary lineages and phylogenetic inference

Mol Biol Evol

610

–

621

Rzhetsky

Nei

1993

Theoretical foundation of the minimum-evolution method of phylogenetic inference

Mol Biol Evol

1073

–

1095

Sanderson

MJ.

1997

A nonparametric approach to estimating divergence times in the absence of rate constancy

Mol Biol Evol

1218

–

1231

Sanderson

MJ.

2003

r8s: inferring absolute rates of molecular evolution and divergence times in the absence of a molecular clock

Bioinformatics

301

–

302

Smith

Donoghue

MJ.

2008

Rates of molecular evolution are linked to life history in flowering plants

Science

322

5898

–

Takezaki

Rzhetsky

Nei

1995

Phylogenetic test of the molecular clock and linearized trees

Mol Biol Evol

823

–

833

Tamura

Battistuzzi

Billing-Ross

Murillo

Filipski

Kumar

2012

Estimating divergence times in large molecular phylogenies

Proc Natl Acad Sci U S A

109

19333

–

19338

Tamura

Stecher

Peterson

Filipski

Kumar

2013

MEGA6: molecular evolutionary genetics analysis version 6.0

Mol Biol Evol

2725

–

2729

Warnock

Parham

Joyce

Lyson

Donoghue

PC.

2015

Calibration uncertainty in molecular dating analyses: there is no substitute for the prior evaluation of time priors

Proc R Soc B

282

1798

Warnock

Yang

Donoghue

PC.

2012

Exploring uncertainty in the calibration of the molecular clock

Biol Lett

156

–

159

Hao

2009

CVTree update: a newly designed phylogenetic study platform using composition vectors and whole genomes

Nucleic Acids Res

(

Web Server

W174

–

W178

Yang

2007

PAML 4: phylogenetic analysis by maximum likelihood

Mol Biol Evol

1586

–

1591

This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/by-nc/4.0/), which permits non-commercial re-use, distribution, and reproduction in any medium, provided the original work is properly cited. For commercial re-use, please contact journals.permissions@oup.com