Modelling prokaryote gene content

Matthew Spencer et al. Evol Bioinform Online. 2007.

Abstract

The patchy distribution of genes across the prokaryotes may be caused by multiple gene losses or lateral transfer. Probabilistic models of gene gain and loss are needed to distinguish between these possibilities. Existing models allow only single genes to be gained and lost, despite the empirical evidence for multi-gene events. We compare birth-death models (currently the only widely-used models, in which only one gene can be gained or lost at a time) to blocks models (allowing gain and loss of multiple genes within a family). We analyze two pairs of genomes: two E. coli strains, and the distantly-related Archaeoglobus fulgidus (archaea) and Bacillus subtilis (gram positive bacteria). Blocks models describe the data much better than birth-death models. Our models suggest that lateral transfers of multiple genes from the same family are rare (although transfers of single genes are probably common). For both pairs, the estimated median time that a gene will remain in the genome is not much greater than the time separating the common ancestors of the archaea and bacteria. Deep phylogenetic reconstruction from sequence data will therefore depend on choosing genes likely to remain in the genome for a long time. Phylogenies based on the blocks model are more biologically plausible than phylogenies based on the birth-death model.
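As a rough illustration of the distinction drawn in the abstract, the sketch below builds a rate (generator) matrix over gene family sizes for a birth-death process, in which only transitions between adjacent family sizes have a nonzero rate. The linear form of the rates, the parameter names `gain`, `dup` and `loss`, and the truncation at `max_size` are all assumptions made for illustration; they are not the parametrization used in the paper.

```python
def birth_death_generator(max_size, gain, dup, loss):
    """Rate matrix Q over family sizes 0..max_size (illustrative assumption).

    From size i, the family grows to i+1 at rate gain + dup*i
    and shrinks to i-1 at rate loss*i. No other transitions are allowed.
    """
    n = max_size + 1
    q = [[0.0] * n for _ in range(n)]
    for i in range(n):
        if i < max_size:
            q[i][i + 1] = gain + dup * i   # single-gene gain
        if i > 0:
            q[i][i - 1] = loss * i         # single-gene loss
        # Diagonal is still 0.0 here, so this is minus the off-diagonal sum.
        q[i][i] = -sum(q[i])
    return q


# Hypothetical parameter values, chosen only to make the example concrete.
Q = birth_death_generator(max_size=5, gain=0.1, dup=0.5, loss=1.0)
```

In this sketch every entry more than one step from the diagonal is zero. A blocks-style model would instead place nonzero rates on some of those entries, so that several members of a family can be gained or lost in one event.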

Keywords: gene content; lateral transfer; likelihood; phylogenetics.

Figures

Figure 1

Conditional probabilities of each possible ancestral state, given 10 members of a gene family in two taxa, each separated from a common ancestor by an edge of length 0.01 (a) or 1 (b) expected changes. Calculated under the blocks model with parameters (other than edge lengths) from Table 3.

Figure 2

Performance of blocks and birth-death models for two E. coli strains, K12 and O157:H7 EDL933 (a: blocks model, b: birth-death model), and for Archaeoglobus fulgidus and Bacillus subtilis (c: blocks model, d: birth-death model). The data are $\tilde{n}_{ij} \left( \log \hat{p}_{ij}(\text{model}) - \log(\tilde{n}_{ij}/n) \right)$, the contribution of each pattern to the log likelihood ratio between a given model and the best possible model. Here $\tilde{n}_{ij}$ is the LOWESS imputed count of state i (row) in the first species and state j (column) in the second species, $\hat{p}_{ij}(\text{model})$ is the model-predicted relative frequency of pattern ij, and $\tilde{n}_{ij}/n$ is the LOWESS imputed relative frequency. States are ordered from 0 to ≥20 family members in both rows and columns. Cells are red where the model predicts too high a frequency and blue where it predicts too low a frequency. White cells are patterns for which there were no observations (these make no contribution to the likelihood).
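The per-pattern quantity plotted in Figure 2 can be sketched as follows: each cell is the imputed count multiplied by the difference between the log of the model-predicted relative frequency and the log of the imputed relative frequency. The function name, the nested-list layout, and the toy numbers are assumptions for illustration. Treating any cell with a zero count as empty is a simplification of the caption's "no observations" rule.

```python
import math


def llr_contributions(imputed_counts, model_probs):
    """Per-pattern contribution to the log likelihood ratio.

    imputed_counts[i][j]: imputed count of pattern (i, j)
    model_probs[i][j]:    model-predicted relative frequency of pattern (i, j)
    Cells with a zero count return None (no contribution).
    """
    n = sum(sum(row) for row in imputed_counts)
    result = []
    for count_row, prob_row in zip(imputed_counts, model_probs):
        out_row = []
        for count, p in zip(count_row, prob_row):
            if count == 0:
                out_row.append(None)
            else:
                out_row.append(count * (math.log(p) - math.log(count / n)))
        result.append(out_row)
    return result


# Toy 2x2 example with made-up numbers (not data from the paper).
counts = [[6, 2], [2, 0]]
probs = [[0.6, 0.3], [0.1, 0.0]]
contrib = llr_contributions(counts, probs)
```

A positive entry means the model predicts too high a frequency for that pattern (red in the figure), and a negative entry means it predicts too low a frequency (blue).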

Figure 3

Marginal distributions of gene family size for single species. Symbols are the imputed counts used as data, and lines are predictions from the stationary distributions of the models, with parameters estimated from pairs of species. a: E. coli strains K12 (circles) and O157:H7 EDL933 (squares), b: Archaeoglobus fulgidus (circles) and Bacillus subtilis (squares). In both panels, the blocks model is the solid line and the birth-death model is the dashed line. The vertical axis is on a logarithmic scale, so we plot (frequency + 1) to allow zero frequencies to be represented.

Figure 4

Phylogeny based on birth-death distances for all 66 genomes in the COG database, estimated by least squares with inverse square weighting (three equally good topologies were found, but they differed only in the arrangement of clades separated by zero-length edges). The tree is rooted with all the archaea except Methanosarcina acetivorans as an outgroup. Edge lengths are expected numbers of gene events per gene family. The weighted sum of squares was 830.

Figure 5

Phylogeny based on blocks model distances for all 66 genomes in the COG database, estimated by least-squares with inverse square weighting. The tree is rooted with the archaea as an outgroup. Edge lengths are expected numbers of gene events per gene family (note the difference in scale from Figure 4). The weighted sum of squares was 157.
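The "weighted sum of squares" reported in the captions of Figures 4 and 5 can be illustrated with a minimal sketch, assuming that inverse square weighting means each squared residual is divided by the square of the observed distance (as in the Fitch-Margoliash criterion). The function name, the dictionary layout, and the toy distances are hypothetical. Skipping zero distances is also an assumption, made only because the weight is undefined there.

```python
def weighted_sum_of_squares(observed, fitted):
    """Sum over taxon pairs of (d - t)^2 / d^2.

    observed[pair]: pairwise distance d estimated from the data
    fitted[pair]:   path length t between the same pair on the tree
    """
    total = 0.0
    for pair, d in observed.items():
        if d == 0:
            continue  # weight 1/d^2 is undefined; skipped by assumption
        total += (d - fitted[pair]) ** 2 / d ** 2
    return total


# Toy distances for three taxa (made-up numbers, not from the paper).
observed = {("A", "B"): 0.2, ("A", "C"): 0.5, ("B", "C"): 0.4}
fitted = {("A", "B"): 0.2, ("A", "C"): 0.45, ("B", "C"): 0.5}
wss = weighted_sum_of_squares(observed, fitted)
```

Under this weighting, a given absolute error counts for more on a short distance than on a long one. A least-squares tree search would look for the topology and edge lengths that minimise this total.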

Figure 6

Relationship between the number of genomes in which a gene family is found (horizontal axis, $n_g$) and the number of observations of a category in the focal pair of genomes (vertical axis, $n_{..}(n_g)$), where $n_{..}$ is one of the categories AA (a, b), AP (c, d), PA (e, f) and PP (g, h). A indicates absent and P present in each member of the focal pair. Focal pairs are E. coli strains K12 and O157:H7 EDL933 (a, c, e, g), and Archaeoglobus fulgidus and Bacillus subtilis (b, d, f, h). Dots are observations, and solid lines are LOWESS curves with the span (proportion of points used in each local regression) indicated on each panel. The vertical axis scale is fifteen times larger in a and b than in the other panels.

Figure 7

Double logarithmic plots of observed counts ($n_{kl}$) versus imputed counts ($\tilde{n}_{kl}$) for (a) E. coli strains K12 and O157:H7 EDL933, and (b) Archaeoglobus fulgidus and Bacillus subtilis. The line indicates equality. The upper right-hand point is the (0, 0) pattern (absent from both members of the pair) for both pairs of taxa.
