Phylogenetic Tracings of Proteome Size Support the Gradual Accretion of Protein Structural Domains and the Early Origin of Viruses from Primordial Cells

Arshan Nasir et al. Front Microbiol. 2017.

Abstract

Untangling the origin and evolution of viruses remains a challenging proposition. We recently studied the global distribution of protein domain structures in thousands of completely sequenced viral and cellular proteomes with comparative genomics, phylogenomics, and multidimensional scaling methods. A tree of life describing the evolution of proteomes revealed viruses emerging from the base of the tree as a fourth supergroup of life. A tree of domains indicated an early origin of modern viral lineages from ancient cells that co-existed with the cellular ancestors. However, it was recently argued that the rooting of our trees and the basal placement of viruses was artifactually induced by small genome (proteome) size. Here we show that these claims arise from misunderstanding and misinterpretations of cladistic methodology. Trees are reconstructed unrooted, and thus, their topologies cannot be distorted a posteriori by the rooting methodology. Tracing proteome size in trees and multidimensional views of evolutionary relationships as well as tests of leaf stability and exclusion/inclusion of taxa demonstrated that the smallest proteomes were neither attracted toward the root nor caused any topological distortions of the trees. Simulations confirmed that taxa clustering patterns were independent of proteome size and were determined by the presence of known evolutionary relatives in data matrices, highlighting the need for broader taxon sampling in phylogeny reconstruction. Instead, phylogenetic tracings of proteome size revealed a slowdown in innovation of the structural domain vocabulary and four regimes of allometric scaling that reflected a Heaps law. These regimes explained increasing economies of scale in the evolutionary growth and accretion of kernel proteome repertoires of viruses and cellular organisms that resemble growth of human languages with limited vocabulary sizes. 
Results reconcile dynamic and static views of domain frequency distributions that are consistent with the axiom of spatiotemporal continuity, which is a tenet of evolutionary thinking.

Keywords: Heaps law; origin of viruses; phylogenomics; protein structure; proteome growth; tree of life.

Figures

Figure 1

Comparing the indirect outgroup comparison method of rooting trees and the direct generality criterion. Rooting involves orienting an unrooted tree and pulling down a branch that will hold the ancestor of all taxa examined. In outgroup comparison, sister (outgroup) taxa external to the study group (ingroup taxa) are identified a priori as being of ancestral origin, and the branch that is closest to the ingroup is pulled down. This creates a new outgroup node for rooting the phylogeny. The outgroup node adds a character state vector that includes character state o, which is diagnostic of the outgroup and is assumed to be ancestral and absent in the ingroup. Once the outgroup is made ancestral, the tree is rooted and character state i is shared and derived, making it a synapomorphy. In Weston's generality criterion (Weston, 1988, 1994), the character state distributions in the phylogeny are used to polarize character transformations. Character state z is less widely distributed than y within the ingroup (it is present only in a minority subset of taxa) and is therefore considered shared and derived. The figure was modified from Bryant (2001).

Figure 2

FSF use (occurrence) and reuse (abundance) are strongly correlated. Log-log scatter plots reveal a strong correlation between FSF use and FSF reuse for total (A) and universal ABEV FSF (B) sets for 368-taxon trees (Nasir and Caetano-Anollés, 2015). Viruses (266), Archaea (34), Bacteria (34), and Eukarya (34) are colored red, black, blue, and green, respectively. Each of these supergroups has its own power law regime that complies with a four-regime Heaps law of vocabulary growth. Individual regimes are indicated with numbers and follow V ~ N^β relationships, with V representing FSF vocabulary size (use) and N representing FSF database size (reuse) in proteomes. Their fits to linear regression models using ordinary least squares and the estimation of the Heaps exponent β are described in Figure S2.
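The relationship V ~ N^β becomes a straight line on log-log axes, so β can be estimated as the slope of an ordinary least squares fit of log V on log N. The following is a minimal sketch of that idea, not the authors' procedure: the function name `fit_heaps_exponent` is hypothetical and the data are synthetic values generated from a known power law, not FSF counts from the study.

```python
import math


def fit_heaps_exponent(reuse, use):
    """Estimate beta and K in V = K * N**beta by OLS on log10-transformed data.

    reuse: values of N (database size); use: values of V (vocabulary size).
    """
    xs = [math.log10(n) for n in reuse]
    ys = [math.log10(v) for v in use]
    count = len(xs)
    mean_x = sum(xs) / count
    mean_y = sum(ys) / count
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    beta = sxy / sxx                    # slope on log-log axes
    log_k = mean_y - beta * mean_x      # intercept on log-log axes
    return beta, 10 ** log_k


# Synthetic illustration only: points generated from V = 3 * N**0.5.
reuse = [10, 100, 1000, 10000]
use = [3.0 * n ** 0.5 for n in reuse]
beta, k = fit_heaps_exponent(reuse, use)
```

A value of β below 1 corresponds to sublinear vocabulary growth, which is the kind of slowdown in innovation that the abstract describes.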

Figure 3

Trees of proteomes are robust and insensitive to the effects of genome size but sensitive to holobiont relationships defining taxa. (A) The single most parsimonious tree (taxa = 368; characters = 442; length = 45,935, retention index = 0.83, g1 = −0.31) describing the evolution of 102 cellular organisms (34 each from Archaea, Bacteria, and Eukarya) and 266 viruses (at least five viruses sampled from each family/order) (Nasir and Caetano-Anollés, 2015). The smallest proteomes for cells (I. hospitalis and A. gossypii; black and green asterisks) and viruses (bat cycloviruses; red asterisk) are indicated. The names of taxa are not shown because they would not be legible. Instead, the positions of terminals are colored according to supergroup: green (Eukarya), blue (Bacteria), black (Archaea), and red (viruses). (B) A strict consensus of two most parsimonious trees (length = 46,781, retention index = 0.83, g1 = −19.81 and −19.82) built using phylogenomic data from the 368 proteomes of panel (A) plus the proteomes of the two extremely reduced organisms R. prowazekii and N. equitans (gray circles and asterisks). While no major topological distortions are observed, the consensus tree loses resolution at its base.

Figure 4

Scatter plots describe the relationship between ABEV FSF use (A) and reuse (B) and node distance (nd) for the 368-taxon ToL (Nasir and Caetano-Anollés, 2015). Data points for different supergroups are colored green (Eukarya), blue (Bacteria), black (Archaea), and red (viruses). The black line describes the nature of the relationship, as determined by the Locally Weighted Regression Scatter Plot Smoothing (LOWESS) method, which obtains a smoothed curve by fitting successive regression functions (q = 0.1, i = 100). The plots reveal high scatter, especially toward smaller nd values, and clustering of bacterial and eukaryal taxa in the same nd range despite large differences in their FSF use and reuse.
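The "successive regression functions" in LOWESS are weighted linear fits computed in a sliding window around each point. The sketch below shows only that core step (a local linear fit with tricube weights) and omits the robustness iterations of full LOWESS; the function names are hypothetical, the `frac` parameter is assumed to play the role of the window fraction, and the data are synthetic.

```python
import math


def tricube(u):
    """Tricube kernel: (1 - |u|^3)^3 inside the window, 0 outside."""
    u = abs(u)
    return (1.0 - u ** 3) ** 3 if u < 1.0 else 0.0


def local_linear_smooth(xs, ys, frac=0.5):
    """Fit a tricube-weighted straight line around each x and evaluate it there."""
    n = len(xs)
    k = min(n, max(2, math.ceil(frac * n)))  # number of neighbors in each window
    fitted = []
    for x0 in xs:
        # Bandwidth: distance from x0 to its k-th nearest point (itself included).
        h = sorted(abs(x - x0) for x in xs)[k - 1]
        if h > 0:
            ws = [tricube((x - x0) / h) for x in xs]
        else:
            ws = [1.0 if x == x0 else 0.0 for x in xs]
        sw = sum(ws)
        mx = sum(w * x for w, x in zip(ws, xs)) / sw
        my = sum(w * y for w, y in zip(ws, ys)) / sw
        sxx = sum(w * (x - mx) ** 2 for w, x in zip(ws, xs))
        if sxx == 0.0:
            fitted.append(my)  # degenerate window: fall back to the weighted mean
        else:
            sxy = sum(w * (x - mx) * (y - my) for w, x, y in zip(ws, xs, ys))
            fitted.append(my + (sxy / sxx) * (x0 - mx))
    return fitted


# Synthetic illustration only: an exactly linear trend is reproduced unchanged.
xs = [float(i) for i in range(10)]
ys = [2.0 * x + 1.0 for x in xs]
smoothed = local_linear_smooth(xs, ys, frac=0.5)
```

A small window fraction makes the curve follow local structure closely, which is why the smoothed line can still look uneven where the scatter is high.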

Figure 5

Testing the small genome attraction (SGA) artifact with the Siddall and Whiting (1999) approach. A single most parsimonious phylogenomic tree (a) describes the evolutionary relationships between four proteomes sampled each from viruses, Archaea, Bacteria, and Eukarya. Taxa are colored as previously described. Numbers on branches indicate bootstrap (BS) support values (%). Single most parsimonious trees b through e were recovered after successive elimination of the smallest viral proteomes. TL, tree length; RI, retention index.

Figure 6

Cellular endosymbionts differ from free-living organisms and viruses in their FSF composition profiles. Annotation of FSF domains into one of the seven major functional categories (Metabolism, Information, Intracellular Processes, Extracellular Processes, Regulation, General, and Other) for archaeal, bacterial, eukaryal, and viral proteomes sampled in our study (Nasir and Caetano-Anollés, 2015) and for nine viral and three extremely reduced cellular proteomes included by Harish et al. (2016) in their reconstructions. Cand. Nasuia deltocephalinicola was not part of our reconstructions (it encodes only 55 universal FSFs). Obligate endosymbionts or parasites often increase the repertoire of informational FSF domains, as showcased by Cand. Tremblaya included by Harish et al. (2016), and by 311 other known obligate and facultative parasitic organisms (Figure 3 in Nasir et al., 2011). The functional scheme is as defined by Christine Vogel in the SUPERFAMILY database (http://supfam.org/SUPERFAMILY/function.html). Category Other includes proteins with either unknown or viral functions. General includes proteins involved in binding to small molecules, ligands, and lipids, and structural proteins. Numbers in parentheses indicate the total number of proteomes included in the FSF profile representation.

Figure 7

Obligate parasitic taxa destabilize leaves of trees. (A) Leaf stabilities (LS maximum) were calculated with RadCon (Thorley and Page, 2000) from 2,000 unrooted BS trees. LS values are ordered in the table (A) according to the most informative strict reduced consensus (SRC) tree (33.54 bits) out of a set of 5 SRC trees, which matches the strict component consensus (consensus efficiency = 0.555) derived from the unrooted trees. (B) LS values are visualized as violin plots. A violin plot combines a box plot (the black rectangle, with a white circle representing the group median) with a density trace (yellow) plotted symmetrically on each side to reflect the data distribution. The spread of LS values was calculated for the control set (C) and all possible permutations of free-living Acidobacterium capsulatum (A1–A5) and the obligate endoparasite R. prowazekii (R1–R5) with individual taxa of the corresponding bacterial superkingdoms (identified with numbers following taxon labels). Asterisks mark distributions significantly different from control C (Wilcoxon rank sum test, two-tailed, P < 0.01).
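A two-tailed Wilcoxon rank sum test of the kind named in the caption pools both samples, ranks them, and asks whether the rank sum of one sample departs from its expectation under the null hypothesis. The sketch below uses the large-sample normal approximation without tie or continuity corrections, which is an assumption made here for brevity and not necessarily what the authors' software did; the function names are hypothetical and the numbers are synthetic, not LS values from the study.

```python
import math


def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2.0 + 1.0  # mean of the 1-based ranks i+1 .. j+1
        for t in range(i, j + 1):
            ranks[order[t]] = avg
        i = j + 1
    return ranks


def rank_sum_test(a, b):
    """Return (W, z, two-tailed p) using the normal approximation."""
    n1, n2 = len(a), len(b)
    ranks = average_ranks(list(a) + list(b))
    w = sum(ranks[:n1])                          # rank sum of the first sample
    mean_w = n1 * (n1 + n2 + 1) / 2.0            # expected rank sum under H0
    sd_w = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12.0)
    z = (w - mean_w) / sd_w
    p = math.erfc(abs(z) / math.sqrt(2.0))       # two-tailed normal p-value
    return w, z, p


# Synthetic illustration only: fully separated samples versus interleaved samples.
w_sep, z_sep, p_sep = rank_sum_test(list(range(1, 11)), list(range(11, 21)))
w_mix, z_mix, p_mix = rank_sum_test([1, 4], [2, 3])
```

For small samples or heavily tied data, an exact test or a tie-corrected variance would be the safer choice.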

Figure 8

Viruses stabilize leaves of trees. (A) A single most parsimonious phylogenomic tree (length = 13,004, retention index = 0.61) reconstructed from the genomic abundance census of 442 universal FSFs (432 parsimony informative characters) in 24 proteomes selected equally from Archaea (black), Bacteria (blue), and Eukarya (green) (the 8880 dataset). The most stable taxa in each superkingdom, as indicated by TII values (Table 2), are labeled with an asterisk. (B) A single most parsimonious phylogenomic tree (length = 12,033, retention index = 0.70) reconstructed from the genomic abundance census of 442 universal FSFs (428 parsimony informative characters) in 24 proteomes selected equally from viruses (red), Archaea (black), Bacteria (blue), and Eukarya (green) after replacing the most stable cellular taxa in (A) with viruses (the 6666 dataset). (C) A comparison of various LS statistics between the 8880 and 6666 BS tree datasets, as displayed by violin plots. None of the comparisons were statistically significant (Wilcoxon rank sum test, two-tailed). (D) Comparison of the TII distribution for the 8880 dataset against the 6666 dataset, as displayed by violin plots. Inclusion of viral taxa significantly reduces overall tree instability. The asterisk indicates a significant mean difference (Wilcoxon rank sum test, two-tailed, P < 0.01).

Figure 9

The space of ages of FSF structural domains reveals supergroups as distinct clouds and global evolutionary tendencies of growth in proteomes. (A) An evolutionary principal coordinate (evoPCO) analysis plot portrays in its first three axes (85% variability explained) the evolutionary distances between cellular and viral proteomes [taxa = 368, characters = 442 universal FSFs, character states = occurrence * (1−nd)]. (B) The most important evoPCO component plotted against universal ABEV FSF reuse on a logarithmic scale. The reconstructed proteome of the last common ancestor of modern cells was added as a reference to infer the direction of evolutionary change (Kim and Caetano-Anollés, 2011). a, Lassa virus; b, Ancestor; c, Pandoravirus salinus; d, Pandoravirus dulcis; e, Acanthamoeba polyphaga mimivirus; f, Megavirus chilensis; g, Megavirus iba; h, Ignicoccus hospitalis; i, Haloarcula marismortui; j, Lactobacillus delbrueckii; k, Sorangium cellulosum; l, Ashbya gossypii; m, Emiliania huxleyi.
