Reconstruction of highly heterogeneous gene-content evolution across the three domains of life (original) (raw)

Abstract

Motivation: Reconstruction of gene-content evolutionary history is fundamental in studying the evolution of genomes and biological systems. To reconstruct plausible evolutionary history, rates of gene gain/loss should be estimated by considering the high level of heterogeneity: e.g. genome duplication and parasitization, respectively, result in high rates of gene gain and loss. Gene-content evolution reconstruction methods that consider this heterogeneity and that are both effective in estimating the rates of gene gain and loss and sufficiently efficient to analyze abundant genomic data had not been developed.

Results: An effective and efficient method for reconstructing heterogeneous gene-content evolution was developed. This method comprises analytically integrable modeling of gene-content evolution, analytical formulation of expectation-maximization and efficient calculation of marginal likelihood using an inside-outside-like algorithm. Simulation tests on the scale of hundreds of genomes showed that both the gene gain/loss rates and evolutionary history were effectively estimated within a few days of computational time. Subsequently, this algorithm was applied to an actual data set of nearly 200 genomes to reconstruct the heterogeneous gene-content evolution across the three domains of life. The reconstructed history, which contained several features consistent with biological observations, showed that the trends of gene-content evolution were not only drastically different between prokaryotes and eukaryotes, but were highly variable within each form of life. The results suggest that heterogeneity should be considered in studies of the evolution of gene content, genomes and biological systems.

Availability: An R script that implements the algorithm is available upon request.

Contact:iwasaki@cb.k.u-tokyo.ac.jp

1 INTRODUCTION

Genome projects have determined the sequence of more than 500 genomes and more than 1800 projects are still ongoing, reporting more than one genome sequence per week on average. The abundant genome sequences provide us with opportunities to study genome evolution at the universal-tree scale. Since the most fundamental information representing the genome sequences is gene content (the set of genes that each organism possesses), the reconstruction of gene-content evolutionary history is essential for studying the evolution of genomes, as well as the evolution of biological systems encoded in these genomes, such as metabolic pathways and signaling cascades.

A serious problem in gene-content evolution reconstruction is that the rates of gene gain and loss are not known in advance, in contrast to the well-established transition matrices available for nucleotide and amino acid evolution. To make matters more difficult, biological observations suggest that the rates of gene gain and loss should not be the same across a phylogenetic tree. For example, parasitization frequently results in massive gene loss (Nakabachi et al., 2006) and genome duplication results in massive gene gain (Dehal and Boore, 2005). Thus, the heterogeneity that exists in the rates of gene-content evolution must be considered. Hereafter, we use the term heterogeneity to represent the non-uniformity of gene gain/loss rates among different edges of a phylogenetic tree.

To date, various methods for reconstructing gene-content evolution have been proposed (Csűrös and Miklós, 2006; Hahn et al., 2005; Kunin and Ouzounis, 2003; Mirkin et al., 2003; Snel et al., 2002), in parallel with elaborate studies on the mathematical aspects of gene-content evolution (Dutih et al., 2004; Gu and Zhang, 2004; Huson and Steel, 2004; Karev et al., 2002, 2003, 2004). In these methods, the absence/presence or the number of genes are formulated as states of each ortholog group, and gene gain and loss are treated as transitions among these states, so that the maximum-parsimony or the maximum-likelihood estimation of ancestral states is applied to the reconstruction of the gene-content evolution. The ortholog-based approach is selected because it is practically impossible to make reliable molecular phylogenetic trees for all ortholog groups, both because of computational costs and for statistical reasons: multiple alignments of many sequences often result in small numbers of unambiguously aligned positions that are insufficient to give significant trees.

To the best of our knowledge, methods for reconstructing gene-content evolution that account for heterogeneity, effectively estimate the evolutionary rates, and are applicable to hundreds of genomes have not been developed. The pioneering maximum-parsimony-based methods (Kunin and Ouzounis, 2003; Mirkin et al., 2003; Snel et al., 2002) not only assume homogeneous evolution, but also require arbitrary parameters or criteria to estimate evolutionary history. Although the subsequent maximum-likelihood-based method (Hahn et al., 2005) introduced the concept of time, this method also assumes homogeneity. Recently, a highly sophisticated three-parameter model, which can encompass heterogeneity, has been proposed (Csűrös and Miklós, 2006). The elaborate mathematics of the model enables efficient calculation of likelihood if the heterogeneous parameters are obtained in advance. However, partly due to the mathematical complexity of the model, an efficient and effective algorithm for parameter estimation has not been proposed.

In the present article, we present a novel method for reconstructing gene-content evolution, which is the first method to allow effective and efficient reconstruction while taking heterogeneity into account. Simulation tests were used to demonstrate the effectiveness and efficiency of the method, which was subsequently applied to the data from about 200 genomes. The reconstructed history shows that heterogeneity occurs widely across the three domains of life and should be considered in studies on the evolution of gene content, genomes and biological systems.

2 METHODS

2.1 Input data and problem definition

Suppose that genome sequences from N organisms (⁠ formula ⁠) are given. The proposed method requires two types of data precomputed from these N genomes, i.e. ortholog groups and phylogenetic topology (note that the term ortholog is inappropriate in the strict sense; however, we use the term by convention, to describe genes with similar functions).

For the first input, existing well-studied methods are used for creating ortholog groups, such as COG (Tatusov et al., 1997) and MBGD (Uchiyama, 2003). Let K be the number of ortholog groups calculated by some of those methods for genes in the N genomes. Then, the gene content of formula can be represented by a _K_-length vector formula ⁠, where formula is the number of genes belonging to the _k_-th ortholog group on the genome of formula ⁠. In the present study, fourth or more gene copies are not counted, so that formula or 3 for formula ⁠. This is because high-copy number genes often have special characteristics (e.g. transposons) that would cause deviation from the general evolutionary mode of the whole gene content.

The second input is phylogenetic topology, which is represented by a binary-rooted tree T that has N leaves and N − 1 internal nodes. These internal nodes correspond to the ancestral species, denoted by formula ⁠, so that formula if formula is the parent of formula ⁠, which means that the root of T is formula ⁠. To estimate T, both molecular phylogenetic trees for concatenated highly conserved genes, such as ribosomal RNAs and proteins, and so-called genome trees calculated by sophisticated methods (Dutilh et al., 2004) can be used. Since the main goal is not the estimation of phylogenetic relationships, T is supposed to be given based on existing methods.

With these input data, the reconstruction of the gene-content evolution problem is defined as the estimation of likely formula ⁠, given T and formula ⁠.

2.2 Gene-content evolution model

Gene-content evolution is formulated as a finite-state, continuous-time Markov process. In the process, gene gain/losses are represented as transitions among the four states. Consider an ortholog group that evolves along an edge formula i.e. from formula to formula ⁠. A two-parameter birth-and-death model is utilized in which gene gain and gene loss occur with probabilities formula and formula ⁠, respectively, in a short time formula ⁠. This means that, e.g. a high value for α n indicates massive gene gain on formula by gene genesis, horizontal gene transfers, gene duplications and/or other mechanisms. The values α n and β n are different among the edges, which implies heterogeneity of the gene-content evolution.

Let formula be the formula transition probability matrix, where the (i, j)-th element formula represents the transition probability from the state i to j in time t along formula ⁠. Then, formula follows a differential equation:

formula

(1)

where

formula

(2)

This equation has a solution of formula ⁠, where formula is the matrix exponential. Since the above definition enables the diagonalization of formula ⁠, formula has an analytical solution of

formula

(3)

where by omitting subscripts n of α n and β n for clarity,

formula

(4)

formula

(5)

formula

(6)

Ortholog groups were assumed to evolve independently, as is the case in the typical point-mutation models used to analyze sequence evolution. In addition, to focus on the general evolutionary mode of the whole gene content and to avoid overfitting of the parameters, we hypothesized that along formula ⁠, any ortholog group follows the same parameters α n and β n; however, if the ortholog groups are classified, different parameters can easily be assigned to each class.

Hereafter, t is defined as a relative time scale so that the length of any edge is 1. That is, the overall transition matrix along formula is formula (abbreviated as formula ⁠) and α n and β n are the gene gain and loss rates, respectively, per proportion of the edge. This definition was used because of the mathematical redundancy and the ambiguity associated with estimating evolutionary time. Alternatively, if given the evolutionary time tn for formula ⁠, dividing α n and β n by tn simply gives the evolutionary rate in time units.

2.3 Maximum-likelihood estimation of ancestral states using an expectation-maximization algorithm

If the model parameters θ (i.e. formula and the prior probability distribution at the last common ancestor formula ⁠) are given in advance, the most likely evolutionary history of gene content across the entire tree can be calculated. However, these parameters are not known in practice, so they must be estimated from the observed data formula ⁠. In the present article, an efficient and effective algorithm based on the expectation-maximization algorithm (Dempster et al., 1977) is presented for estimating these parameters. For the detailed formulation, we referred to the related method for training amino acid substitution matrices while considering the structural context of proteins (Holmes and Rubin, 2002).

The expectation-maximization algorithm iteratively estimates more likely parameters with tentative parameters (denoted by primes hereafter) by maximizing

formula

(7)

This function gives the expectation value of formula with respect to formula under θ′. Thus, if given formula and formula ⁠, which are the expected numbers of ortholog groups of formula and formula ⁠, respectively, we can write

formula

(8)

where by letting formula be a particular evolutionary history that starts from the state i, evolves along formula ⁠, and ends with j,

formula

(9)

Since the logarithm of the probability for the continuous Markov process is given by an integral of

formula

(10)

by changing the order of the integral and the summation, we derive

formula

(11)

This equation is transformed as follows:

formula

(12)

where

formula

(13)

Here, formula and formula are formula and formula ⁠, respectively. The infinite negative formula means that any particular history has the probability of zero. Since this term can be ignored for the comparison of parameters, we ignore formula hereinafter. In addition, the Taylor expansion of formula gives formula ⁠. Thus, we get

formula

(14)

where by defining formula to be formula ⁠,

formula

(15)

formula

(16)

formula

(17)

formula

(18)

then,

formula

(19)

where

formula

(20)

Therefore, we can directly calculate formula ⁠, formula ⁠, formula and formula with formula and formula ⁠, utilizing equations (4), (6) and (15–20). Finally, we have the formula -function in the following form:

formula

(21)

Next, we solve θ that maximize formula ⁠, with the condition of formula ⁠. By solving the partial differentiation of the Lagrangian form formula ⁠, we get

formula

(22)

formula

(23)

formula

(24)

Therefore, through iterative updates of the parameters following the above equations, the most likely parameters formula can be estimated. To iterate the computation to convergence, we must efficiently calculate both formula and formula ⁠. The iteration can be carried out efficiently with the time complexity of O(nk), using an algorithm similar to the inside-outside algorithm for stochastic context-free grammars (Lari and Young, 1990), which is a special example of the belief propagation algorithm (Pearl, 1986).

Next, let formula be the conditional probability that formula is i, given the outside of formula (i.e. T except for formula and its descendants). Again, we have the recursive formulae in the descending order (⁠ formula ⁠):

formula

(26)

where formula is the sibling of formula ⁠.

Finally, we have

formula

(27)

formula

(28)

Since formula and formula can be calculated with the time complexity of O(n), the time complexity for formula and formula becomes O(nk).

For the initial parameters, the uniform rates were adopted. Alternatively, any parameter set can be used if some previous knowledge is available.

Consequently, using the estimated parameters formula ⁠, we can calculate the most likely gene content of ancestral organisms:

formula

(29)

formula

(30)

This calculation can also be performed with the time complexity of O(nk), using the dynamic programming that resembles the Viterbi algorithm for the hidden Markov models (for the implementation, refer to the similar algorithm used to reconstruct ancestral biological sequences (Pupko et al., 2000)).

3 RESULTS AND DISCUSSION

3.1 Tests on simulation data sets

We first tested the effectiveness of the proposed algorithm on simulation data sets. Gene content that comprised 20 000 ortholog groups evolved along equilibrated binary phylogenetic trees, according to the Markov process. We tested a wide range of tree sizes with 32, 64, 128, 256, 512 and 1024 leaves. For the evolutionary processes, different gene gain/loss rates in the range of 0.1–0.0001 were assigned to different edges. This corresponds to ∼2–2000 gene gain/losses per edge, which appears to be biologically reasonable. The parameters were randomly determined so that common logarithms of the parameters were equally distributed between −4 and −1. This variability among the parameters should reflect the heterogeneity of the actual gene-content evolution, during which massive gene gains and losses sometimes occur, whereas the edges between close organisms usually accompany little gene-content evolution. Then, by using only the gene content at the leaves of the tree, we estimated the evolutionary history of the gene content and compared it with the ‘actual’ history.

The algorithm successfully estimated the parameters with mean relative errors that ranged from 20 to 30%, with somewhat large values of the SDs (Table 1, the two columns farthest to the left) despite the formula orders of the parameter distribution and the limitation on the observable data. In addition, since the parameters for older periods were expected to be more difficult to predict, the mean relative errors for each level of the tree were examined (Table 1). Remarkably, the estimations of the older edges showed accuracies comparable to those of the most recent ones, which means that the method also allows estimations of the gene gain/loss rates of older periods. A notable exception was the oldest edge, which was probably due to difficulties in predicting the states of the last common ancestor. These difficulties include the necessity of predicting the prior distribution and ambiguousness in specifying the position of the last common ancestor between its two children. However, considering again the formula orders of the parameter distribution, the accuracy of the estimated parameters would be sufficient to reconstruct most of the gene-content evolution, as described subsequently.

Table 1.

Mean relative errors of estimated parameters

Number of leaves	All edges	Distance from leaves
1	2	3	4	5	6	7	8	9	10
32	0.23(0.87)	0.20(0.31)	0.22(0.48)	0.22(0.41)	0.20(0.28)	0.99(4.34)	—	—	—	—	—
64	0.22(0.75)	0.20(0.35)	0.21(0.38)	0.23(0.42)	0.23(0.51)	0.27(0.89)	1.15(4.95)	—	—	—	—
128	0.22(0.52)	0.20(0.34)	0.23(0.44)	0.22(0.41)	0.23(0.46)	0.21(0.33)	0.24(0.58)	1.04(3.74)	—	—	—
256	0.22(0.63)	0.20(0.35)	0.22(0.42)	0.23(0.43)	0.23(0.50)	0.23(0.42)	0.22(0.36)	0.26(0.48)	1.79(7.71)	—	—
512	0.22(0.46)	0.20(0.35)	0.23(0.47)	0.23(0.45)	0.23(0.49)	0.22(0.47)	0.22(0.37)	0.21(0.35)	0.26(0.53)	1.26(4.52)	—
1024	0.22(0.44)	0.20(0.35)	0.23(0.53)	0.23(0.48)	0.23(0.48)	0.24(0.50)	0.23(0.44)	0.24(0.41)	0.26(0.45)	0.21(0.39)	0.93(2.87)

Number of leaves	All edges	Distance from leaves
1	2	3	4	5	6	7	8	9	10
32	0.23(0.87)	0.20(0.31)	0.22(0.48)	0.22(0.41)	0.20(0.28)	0.99(4.34)	—	—	—	—	—
64	0.22(0.75)	0.20(0.35)	0.21(0.38)	0.23(0.42)	0.23(0.51)	0.27(0.89)	1.15(4.95)	—	—	—	—
128	0.22(0.52)	0.20(0.34)	0.23(0.44)	0.22(0.41)	0.23(0.46)	0.21(0.33)	0.24(0.58)	1.04(3.74)	—	—	—
256	0.22(0.63)	0.20(0.35)	0.22(0.42)	0.23(0.43)	0.23(0.50)	0.23(0.42)	0.22(0.36)	0.26(0.48)	1.79(7.71)	—	—
512	0.22(0.46)	0.20(0.35)	0.23(0.47)	0.23(0.45)	0.23(0.49)	0.22(0.47)	0.22(0.37)	0.21(0.35)	0.26(0.53)	1.26(4.52)	—
1024	0.22(0.44)	0.20(0.35)	0.23(0.53)	0.23(0.48)	0.23(0.48)	0.24(0.50)	0.23(0.44)	0.24(0.41)	0.26(0.45)	0.21(0.39)	0.93(2.87)

Mean relative error is (⁠ formula ⁠. Standard deviations are shown in parentheses. Each of the trees with 32, 64, 128 and 256 leaves was generated 100 times. Owing to computational time constraints, trees with 512 and 1024 leaves were generated 61 and 30 times, respectively.

Table 1.

Mean relative errors of estimated parameters

Number of leaves	All edges	Distance from leaves
1	2	3	4	5	6	7	8	9	10
32	0.23(0.87)	0.20(0.31)	0.22(0.48)	0.22(0.41)	0.20(0.28)	0.99(4.34)	—	—	—	—	—
64	0.22(0.75)	0.20(0.35)	0.21(0.38)	0.23(0.42)	0.23(0.51)	0.27(0.89)	1.15(4.95)	—	—	—	—
128	0.22(0.52)	0.20(0.34)	0.23(0.44)	0.22(0.41)	0.23(0.46)	0.21(0.33)	0.24(0.58)	1.04(3.74)	—	—	—
256	0.22(0.63)	0.20(0.35)	0.22(0.42)	0.23(0.43)	0.23(0.50)	0.23(0.42)	0.22(0.36)	0.26(0.48)	1.79(7.71)	—	—
512	0.22(0.46)	0.20(0.35)	0.23(0.47)	0.23(0.45)	0.23(0.49)	0.22(0.47)	0.22(0.37)	0.21(0.35)	0.26(0.53)	1.26(4.52)	—
1024	0.22(0.44)	0.20(0.35)	0.23(0.53)	0.23(0.48)	0.23(0.48)	0.24(0.50)	0.23(0.44)	0.24(0.41)	0.26(0.45)	0.21(0.39)	0.93(2.87)

Number of leaves	All edges	Distance from leaves
1	2	3	4	5	6	7	8	9	10
32	0.23(0.87)	0.20(0.31)	0.22(0.48)	0.22(0.41)	0.20(0.28)	0.99(4.34)	—	—	—	—	—
64	0.22(0.75)	0.20(0.35)	0.21(0.38)	0.23(0.42)	0.23(0.51)	0.27(0.89)	1.15(4.95)	—	—	—	—
128	0.22(0.52)	0.20(0.34)	0.23(0.44)	0.22(0.41)	0.23(0.46)	0.21(0.33)	0.24(0.58)	1.04(3.74)	—	—	—
256	0.22(0.63)	0.20(0.35)	0.22(0.42)	0.23(0.43)	0.23(0.50)	0.23(0.42)	0.22(0.36)	0.26(0.48)	1.79(7.71)	—	—
512	0.22(0.46)	0.20(0.35)	0.23(0.47)	0.23(0.45)	0.23(0.49)	0.22(0.47)	0.22(0.37)	0.21(0.35)	0.26(0.53)	1.26(4.52)	—
1024	0.22(0.44)	0.20(0.35)	0.23(0.53)	0.23(0.48)	0.23(0.48)	0.24(0.50)	0.23(0.44)	0.24(0.41)	0.26(0.45)	0.21(0.39)	0.93(2.87)

The percentage of correctly reconstructed fractions of the gene-content evolutionary history was examined using the estimated parameter sets (Fig. 1). The proposed method reconstructed the complete evolutionary history of over 80% of the ortholog groups for the trees with ∼256 leaves. Moreover, this method succeeded in reconstructing the history of almost half of the ortholog groups for the tree with 1024 leaves. Notably, both results exceeded those obtained by the traditional maximum-parsimony method, which is also applicable to large data sets but assumes homogeneous evolution.

Fig. 1.

Percentage of correctly estimated gene-content evolutionary history. The proportions of the ortholog groups whose evolutionary history were perfectly reconstructed are shown. Error bars indicate SDs. Each of the trees with 32, 64, 128 and 256 leaves was generated 100 times. Owing to computational time constraints, trees with 512 and 1024 leaves were generated 61 and 30 times, respectively. For comparison, the results of the maximum-parsimony reconstruction are shown.

3.2 Reconstruction of heterogeneous gene-content evolution across the three domains of life

One of the major advantages of the proposed algorithm is its efficiency. Then, the algorithm was applied to universal tree-scale genomic data. The expanded COG ortholog table from the STRING database (von Mering et al., 2005), which contains 26 200 ortholog groups from 179 genomes covering the three domains of life, was utilized. For the phylogenetic topology, the highly resolved tree of life created from concatenated well-conserved genes (Ciccarelli et al., 2006) was used. With implementation in the R language on a Linux machine with a 3 GHz Intel Pentium 4 processor, parameter estimation and subsequent reconstruction of the evolutionary history occupied memory space of ∼300 MB and required 19.3 h of computation. The Akaike information criterion was formula and smaller than that of the uniform rate model (⁠ formula ⁠), which indicates that the present model better reflects the gene-content evolution.

The reconstructed gene-content evolutionary history shows that heterogeneity is common throughout the universal tree of life (Fig. 2). In particular, eukaryotes and prokaryotes exhibit drastically different trends in terms of gene-content evolution: the former shows excessive gene gain whereas the latter shows superiority of gene loss. As a simple interpretation, these trends seem to reflect differences in general survival tactics between the two forms of life; i.e. prokaryotes with small genomes have evolved to proliferate rapidly, whereas eukaryotes have increased their genome sizes to acquire new gene sets to evolve various survival strategies (e.g. cell communication genes required for multicellularization).

Fig. 2.

Reconstructed heterogeneous gene-content evolutionary history across the three domains of life. Edge length represents the numbers of gained or lost genes on each edge. Edge colors indicate the ratio between gain and loss: red edges denote active gene gains and blue edges represent gene losses. Higher taxonomic classifications are taken from NCBI Taxonomy. Circled letters are provided for the convenience of the reader.

Not only are eukaryotes and prokaryotes characterized by different trends in gene-content evolution, significant variations are also observed within each form of life. The gene gain trend in eukaryotic genome evolution is prominent in the large-scale duplications. Genomic studies have revealed that multiple rounds of genome-scale duplications have occurred in the ancient species of vertebrates (Dehal and Boore, 2005, two rounds) and plants (Blanc et al., 2003, at least two rounds). Both events correspond to high rates of gene gain in the estimation (Fig. 2, circled a and b, respectively). On the other hand, the edge that leads to the last common ancestor of animals and fungi is remarkably short (Fig. 2, circled c), possibly because the diversification of animals, fungi and plants occurred within a short period of time (Douzery et al., 2004).

In contrast to that of eukaryotes, prokaryotic evolution is characterized by the general trend of gene loss. Among the prokaryotes, γ-, β- and α-Proteobacteria are estimated to have experienced significant gene gains (Fig. 2, circled d-g). This observation is consistent with a large-scale molecular phylogenetic study suggesting that frequent horizontal gene transfers have increased the gene repertories of these three proteobacterial subclasses (Beiko et al., 2005). Another striking exception is the ancestor of the genus Streptomyces (Fig. 2, circled h). Streptomyces species, which produce natural antibiotics for medical treatments, are known for their large genome sizes, with more than 7500 ORFs (Bentley et al., 2002; Ikeda et al., 2003), which are larger than those of some eukaryotic species. In fact, a Streptomyces mutant species with an almost duplicated genome, such as seen in eukaryotic genomes, has been identified (Wenner et al., 2003). Thus, Streptomyces species are expected to be highly exceptional with respect to prokaryotic gene-content evolution.

Note that one major difficulty with the present method is also illustrated in Fig. 2. The difficulty is that the estimated history must inherit errors in preparing input data. For instance, the massive gene loss toward Gallus gallus is likely due to gene annotation problems. Likewise, the present method cannot overcome errors in detecting similar sequences, splitting multi-domain genes or inferring phylogenetic trees. Since these problems have been rigorously examined by the experts on each technique, we do not discuss them later; however, these possibilities should be noted when evaluating estimated results of the present method.

3.3 An example of reconstructed evolutionary history

To show an example of reconstructed history, we examined the evolution of gene families that function in RNA-mediated gene silencing across the eukaryotic domain. This process is widely conserved among eukaryotes, whereby double-stranded RNA induces the inactivation of the cognate sequences (Zamore and Haley 2005). Genes assigned to the keyword RNA-mediated gene silencing in the UniProt database (Wu et al., 2006) were collected, and the reconstructed history of the sixteen ortholog groups that include these genes was investigated (Fig. 3). We used the phylogenetic tree by Ciccarelli et al. (Ciccarelli et al., 2006). Note, however, that the tree of eukaryotes is controversial, especially concerning the positions of nematodes (Aguinaldo et al., 1997; Blair et al., 2002; Halanych, 2004) and protists (Keeling et al., 2005).

Most likely evolutionary history of the gene families that function in RNA-mediated gene silencing across the eukaryotic domain. Superscript numbers denote possible alternative states estimated by at least one maximum-parsimony history that minimizes the number of transitions.

Fig. 3.

The reconstructed history illustrates some general features of the evolution of this biological process. As estimated here, the key components of the process, RNA-dependent RNAP and Argonaute, are believed to have already existed in the last common ancestor of eukaryotes (⁠ formula ⁠) in defensive roles against transposons and viruses (Cerutti and Casas-Mollano, 2006). In parallel with later acquisitions of the accessory genes, as shown in Figure 3, each gene is believed to have multiplicated so as to subfunctionalize; thus, this machinery fulfills various roles, such as miRNA-mediated gene regulation, especially along the lineages of animals and plants (e.g. Yigit et al., 2006). In addition, this process has disappeared along the lineages of the Saccharomycetaceae and Plasmodium (Cerutti and Casas-Mollano, 2006; Nakayashiki et al., 2006), reflecting the loss of the key genes Dicer and Argonaute/Piwi (Fig. 3; light-orange and yellow-green circles respectively) in the process.

Some differences were found between the history estimated by the proposed algorithm and the traditional maximum-parsimony method (Fig. 3; superscripts). The most notable difference in the estimation by the maximum-parsimony method was the absence of the key components, RNA-dependent RNAP and Argonaute, at formula ⁠. This difference arose because, by assuming the homogeneous evolution model, the maximum-parsimony estimation was strongly influenced by the gene content of Plasmodium falciparum, which has the distance of one edge to formula ⁠. Instead, the proposed algorithm considered the gene loss mode toward Plasmodium and estimated the presence of both genes. In fact, these two genes are believed to have been present at formula (Cerutti and Casas-Mollano, 2006), which demonstrates the advantage of the new method. Another notable difference is seen in the evolutionary history of the exonuclease Mut-7. The new method estimated the presence of this gene from formula to the last common ancestor of Coelomata, whereas the maximum-parsimony method could estimate independent acquisitions of this gene along several lineages. Of course, the correct scenario cannot be determined because the actual evolutionary history is unknown. However, since Mut-7 family genes are known to have significant similarities in terms of both sequence and function across the eukaryotic domain (Glazov et al., 2003), the common origin scenario carries some credibility.

The apparent error in the estimation performed by the present method (as well as by the maximum-parsimony method) was the absence of Dicer at formula ⁠. This error would also be caused by the absence of this gene at Plasmodium. This genus is exceptional among the protists, since other protists have this gene and should have the RNA-mediated gene silencing system (Cerutti and Casas-Mallano, 2006), and this is exactly the reason that formula is believed to have possessed this gene. Therefore, the main reason for this error is the bias in the utilized genomic information. More genome sequences from the ongoing genome projects are needed to correct these errors.

4 CONCLUSION AND PERSPECTIVES

The proposed algorithm is the first gene-content evolution reconstruction method that accounts for heterogeneity with effective and efficient parameter estimations. The application of this method to an actual data set across the three domains of life showed that the modes of gene-content evolution are actually never uniform, reflecting the highly variable sizes (160 kb to 3 Gb) of the sequenced genomes. This result suggests that heterogeneity should be considered in studies of gene-content evolution. Since the rates and history of gene-content evolution estimated in the present study have a mathematical foundation with maximum likelihood, they should provide objective bases for studying the heterogeneous evolution of gene content across the universal tree of life.

Four major future directions are envisioned based on the present study. The first direction is the refinement of the evolution model. We adopted a two-parameter model, which is simpler than a three-parameter model of duplication, loss and horizontal transfer (Csűrös and Miklós, 2006), for one biological and two mathematical reasons. First, estimating the causes of gene birth (gene genesis, duplication, horizontal gene transfer or other processes) often requires detailed sequence information and individual analyses. Notably, new ortholog groups can emerge by duplication, in addition to gene genesis or horizontal gene transfer (e.g. different aminoacyl-tRNA synthetases that have significant sequence similarity are usually classified into different ortholog groups (Tatusov et al., 1997)). Hence, changing the gene gain/loss rates depending on current gene counts does not necessarily reflect actual evolution processes. Second, as Csűrös and Miklós have pointed out, models with too many parameters lead to parameter overfitting. The heterogeneity consideration already requires several parameters, the number of which is proportional to the tree size. Thus, more complex models could result in a smaller than expected improvement in the estimation. Third, the simpler definition enables diagonalization of the transition rate matrix and effective computation of the algorithm. We anticipate extending the model to include additional parameters once the above obstacles have been overcome.

The second direction is to utilize the information available on gene numbers greater than three. This information was ignored for the biological reason outlined in the Methods section. However, when concentrated analyses of the eukaryotic gene-content evolution are considered, in which gene multiplications have great significance, this type of extension may be preferable. In terms of mathematics, loosening this limitation is not a difficult task. For example, a formula version of formula can also be diagonalized analytically, so that formula has an analytical solution in much more complex equations, and a similar argument holds true for the expectation-maximization algorithm. Thus, the proposed algorithm can easily be modified to utilize highly multiplicated genes. In fact, as far as formula is defined to enable its diagonalization, the new algorithm can utilize any evolution model to optimize the parameters and to estimate the most likely ancestral states. Moreover, using a numerical diagonalization approach, every evolution model, including the sophisticated duplication, loss and horizontal transfer model (Csűrös and Miklós, 2006), can be utilized; however, computational efficiency would be impaired. These derivative methods should be tested and compared in individual cases, since the best method should depend on both input data and goals of analyses.

The third direction is to apply the present method to the evolution of phenotypes. Reconstruction of ancestral states of phenotypic characters has a long history. In particular, some methods (Pagel, 1999; Schluter et al., 1997) also utilize maximum-likelihood-based estimation, but neither the edge-specific rates nor the ordered states. This is partly because much less data are available on the evolution of phenotypes than are available for gene content; however, by limiting the numbers of freely variable rates or focusing on specific phenotypes (e.g. ones that have many states or ones that have many related phenotypes whose evolutionary modes can be assumed to be the same), the present method could be applied to these estimations.

The fourth and final direction is the most challenging, in that it involves the expansion of the evolution model to consider the interactions among genes. We hypothesized that different ortholog groups would evolve independently; however, this is apparently not the case. Thus, the interactions among different gene families in the course of heterogeneous gene-content evolution should be taken into account. To date, various mathematical analyses have been performed to clarify the evolution of biological systems by assuming the homogeneity of gene-content evolution, e.g. the preferential attachment model for scale-free, power-law networks (Barabasi and Albert, 1999). The directions for future research will deepen our understanding of the effects of heterogeneity on the evolution of intricately interacting gene sets, such as gene networks, metabolic pathways and biological modules.

ACKNOWLEDGEMENTS

We are grateful to Drs Y. Yamamoto, S. Dohkan, T. Kato and three anonymous reviewers for valuable comments on the manuscript. This work was supported by KAKENHI (Grant-in-Aid for Scientific Research) on Priority Areas ‘Systems Genomics’ from the Ministry of Education, Culture, Sports, Science and Technology of Japan. W. I. was supported by Research Fellowships from the Japan Society for the Promotion of Science for Young Scientists.

Conflict of Interest: none declared.

REFERENCES

et al.

Evidence for a clade of nematodes, arthropods and other moulting animals

Nature

1997

, vol.

387

(pg.

489

493

)

Emergence of scaling in random networks

Science

1999

, vol.

286

(pg.

509

512

)

et al.

Highways of gene sharing in prokaryotes

Proc. Natl. Acad. Sci. U. S. A

2005

, vol.

102

(pg.

14332

14337

)

et al.

Complete genome sequence of the model actinomycete Streptomyces coelicolor

Nature

2002

, vol.

417

A3(2)

(pg.

141

147

)

et al.

The evolutionary position of nematodes

BMC Evol. Biol

2002

, vol.

pg.

et al.

A recent polyploidy superimposed on older large-scale duplications in the Arabidopsis genome

Genome Res

2003

, vol.

(pg.

137

144

)

On the origin and functions of RNA-mediated silencing: from protists to man

Curr. Genet

2006

, vol.

(pg.

)

et al.

Toward automatic reconstruction of a highly resolved tree of life

Science

2006

, vol.

311

(pg.

1283

1287

)

A probabilistic model for gene content evolution with duplication, loss, and horizontal transfer

RECOMB 2006, LNBI 3909

2006

(pg.

206

220

)

Two rounds of whole genome duplication in the ancestral vertebrate

PLoS Biol

2005

, vol.

pg.

e314

et al.

Maximum likelihood from incomplete data via the EM algorithm

J. R. Stat. Soc. Ser. B-Stat. Methodol

1977

, vol.

(pg.

)

et al.

The timing of eukaryotic evolution: does a relaxed molecular clock reconcile proteins and fossils?

Proc. Natl. Acad. Sci. U. S. A

2004

, vol.

101

(pg.

15386

15391

)

et al.

The consistent phylogenetic signal in genome trees revealed by reducing the impact of noise

J. Mol. Evol

2004

, vol.

(pg.

527

539

)

et al.

A gene encoding an RNase D exonuclease-like protein is required for post-transcriptional silencing in Arabidopsis

Plant J

2003

, vol.

(pg.

342

349

)

Genome phylogenetic analysis based on extended gene contents

Mol. Biol. Evol

2004

, vol.

(pg.

1401

1408

)

et al.

Estimating the tempo and mode of gene family evolution from comparative genomic data

Genome Res

2005

, vol.

(pg.

1153

1160

)

The new view of animal phylogeny

Annu. Rev. Ecol. Evol. Syst

2004

, vol.

(pg.

229

256

)

An expectation maximization algorithm for training hidden substitution models

J. Mol. Biol

2002

, vol.

317

(pg.

753

764

)

Phylogenetic trees based on gene content

Bioinformatics

2004

, vol.

(pg.

2044

2049

)

et al.

Complete genome sequence and comparative analysis of the industrial microorganism Streptomyces avermitilis

Nat. Biotechnol

2003

, vol.

(pg.

526

531

)

et al.

Birth and death of protein domains: a simple model of evolution explains power law behavior

BMC Evol. Biol

2002

, vol.

pg.

et al.

Simple stochastic birth and death models of genome evolution: was there enough time for us to evolve?

Bioinformatics

2003

, vol.

(pg.

1889

1900

)

et al.

Gene family evolution: an in-depth theoretical and simulation analysis of non-linear birth-death-innovation models

BMC Evol. Biol

2004

, vol.

pg.

et al.

The tree of eukaryotes

Trends Ecol. Evol

2005

, vol.

(pg.

670

676

)

GeneTRACE-reconstruction of gene content of ancestral species

Bioinformatics

2003

, vol.

(pg.

1412

1416

)

The estimation of stochastic context-free grammars using the inside-outside algorithm

Comput. Speech Lang

1990

, vol.

(pg.

)

et al.

Algorithms for computing parsimonious evolutionary scenarios for genome evolution, the last universal common ancestor and dominance of horizontal gene transfer in the evolution of prokaryotes

BMC Evol. Biol

2003

, vol.

pg.

et al.

The 160-kilobase genome of the bacterial endosymbiont Carsonella

Science

2006

, vol.

314

pg.

267

et al.

Evolution and diversification of RNA silencing proteins in fungi

J. Mol. Evol

2006

, vol.

(pg.

127

135

)

The maximum likelihood approach to reconstructing ancestral character states of discrete characters on phylogenies

Syst. Biol

1999

, vol.

(pg.

612

622

)

Fusion, propagation, and structuring in belief networks

Artif. Intell

1986

, vol.

(pg.

241

288

)

et al.

A fast algorithm for joint reconstruction of ancestral amino acid sequences

Mol. Biol. Evol

2000

, vol.

(pg.

890

896

)

et al.

Likelihood of ancestor states in adaptive radiation

Evolution

1997

, vol.

(pg.

1699

1711

)

et al.

Genomes in flux: the evolution of archaeal and proteobacterial gene content

Genome Res

2002

, vol.

(pg.

)

et al.

A genomic perspective on protein families

Science

1997

, vol.

278

(pg.

631

637

)

MBGD: microbial genome database for comparative analysis

Nucleic Acids Res

2003

, vol.

(pg.

)

et al.

STRING: known and predicted protein-protein associations, integrated and transferred across organisms

Nucleic Acids Res

2005

, vol.

(pg.

D433

D437

)

et al.

End-to-end fusion of linear deleted chromosomes initiates a cycle of genome instability in Streptomyces ambofaciens

Mol. Microbiol

2003

, vol.

(pg.

411

425

)

et al.

The Universal Protein Resource (UniProt): an expanding universe of protein information

Nucleic Acids Res

2006

, vol.

(pg.

D187

D191

)

et al.

Analysis of the C. elegans Argonaute family reveals that distinct Argonautes act sequentially during RNAi

Cell

2006

, vol.

127

(pg.

747

757

)

Ribo-gnome: the big world of small RNAs

Science

2005

, vol.

309

(pg.

1519

1524

)

This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/by-nc/2.0/uk/) which permits unrestricted non-commercial use, distribution, and reproduction in any medium, provided the original work is properly cited.