Clusters of orthologous genes for 41 archaeal genomes and implications for evolutionary genomics of archaea

Kira S Makarova et al. Biol Direct. 2007.

Abstract

Background: An evolutionary classification of genes from sequenced genomes that distinguishes between orthologs and paralogs is indispensable for genome annotation and evolutionary reconstruction. Shortly after multiple genome sequences of bacteria, archaea, and unicellular eukaryotes became available, an attempt at such a classification was implemented in Clusters of Orthologous Groups of proteins (COGs). Rapid accumulation of genome sequences creates opportunities for refining COGs but also represents a challenge because of error amplification. One of the practical strategies involves construction of refined COGs for phylogenetically compact subsets of genomes.

Results: New Archaeal Clusters of Orthologous Genes (arCOGs) were constructed for 41 archaeal genomes (13 Crenarchaeota, 27 Euryarchaeota and one Nanoarchaeon) using an improved procedure that employs a similarity tree between smaller, group-specific clusters, semi-automatically partitions orthology domains in multidomain proteins, and uses profile searches for identification of remote orthologs. The annotation of arCOGs is a consensus between three assignments based on the COGs, the CDD database, and the annotations of homologs in the NR database. The 7538 arCOGs, on average, cover approximately 88% of the genes in a genome compared to approximately 76% coverage in COGs. The finer granularity of ortholog identification in the arCOGs is apparent from the fact that 4538 arCOGs correspond to 2362 COGs; approximately 40% of the arCOGs are new. The archaeal gene core (protein-coding genes found in all 41 genomes) consists of 166 arCOGs. The arCOGs were used to reconstruct gene loss and gene gain events during archaeal evolution and gene sets of ancestral forms. The Last Archaeal Common Ancestor (LACA) is conservatively estimated to possess 996 genes compared to 1245 and 1335 genes for the last common ancestors of Crenarchaeota and Euryarchaeota, respectively. It is inferred that LACA was a chemoautotrophic hyperthermophile that, in addition to the core archaeal functions, encoded more idiosyncratic systems, e.g., the CASS systems of antivirus defense and some toxin-antitoxin systems.
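
The two summary statistics quoted in the Results, genome coverage and the size of the gene core, can be illustrated with a minimal sketch. The cluster names, genome labels, gene identifiers, and helper functions below are hypothetical toy values chosen for illustration; they are not data or code from the arCOG release.

```python
# Hypothetical toy data: cluster id -> set of genomes with at least one member.
MEMBERSHIP = {
    "cluster_A": {"g1", "g2", "g3"},
    "cluster_B": {"g1", "g2"},
    "cluster_C": {"g1", "g2", "g3"},
    "cluster_D": {"g3"},
}
GENOMES = {"g1", "g2", "g3"}


def core_clusters(membership, genomes):
    """Return the clusters that are represented in every genome."""
    return sorted(c for c, present in membership.items() if genomes <= present)


def coverage(gene_to_cluster):
    """Fraction of a genome's genes that are assigned to some cluster.

    gene_to_cluster maps each gene id to a cluster id, or to None
    when the gene is not a member of any cluster.
    """
    assigned = sum(1 for cluster in gene_to_cluster.values() if cluster is not None)
    return assigned / len(gene_to_cluster)


if __name__ == "__main__":
    print(core_clusters(MEMBERSHIP, GENOMES))
    # Hypothetical genome with four genes, three of them in clusters.
    print(coverage({"gene1": "cluster_A", "gene2": None,
                    "gene3": "cluster_B", "gene4": "cluster_C"}))
```

In these terms, the archaeal gene core corresponds to the clusters returned by `core_clusters` for all 41 genomes, and the reported 88% is the mean of the per-genome `coverage` values.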

Conclusion: The arCOGs provide a convenient, flexible framework for functional annotation of archaeal genomes, comparative genomics and evolutionary reconstructions. Genomic reconstructions suggest that the last common ancestor of archaea might have been (nearly) as advanced as the modern archaeal hyperthermophiles. ArCOGs and related information are available at: ftp://ftp.ncbi.nih.gov/pub/koonin/arCOGs/.

Figures

Figure 1

A flow chart of the procedure employed for the construction of the arCOGs. See Materials and Methods for the description of each step.

Figure 2

Coverage of archaeal genomes with arCOGs and COGs. Cyan, arCOGs; purple, COGs. Abbreviations are as in Table 1.

Figure 3

Distribution of the number of species in arCOGs: three classes of archaeal genes. A semi-logarithmic plot fitted with a sum of three exponents.
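
As a reading aid, the fitted curve can be written in the generic form below. The symbols are assumptions introduced here for illustration: k is the number of species represented in an arCOG, N(k) is the number of arCOGs with that many species, and a_i and b_i are fitted constants whose values and signs are not given in this record.

```latex
\[
  N(k) \;\approx\; \sum_{i=1}^{3} a_i \, e^{b_i k}
\]
% Over a range of k where one term dominates, the curve is close to a
% straight line on a semi-logarithmic plot:
\[
  \ln N(k) \;\approx\; \ln a_i + b_i k
\]
```

Each exponential term then appears as a roughly linear segment on the semi-logarithmic plot, which is how three terms can be read as three classes of genes.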

Figure 4

Distribution of phyletic patterns by the number of arCOGs. A log-log plot.
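
A phyletic pattern is the presence/absence profile of a cluster across the genome set. The sketch below shows one way to tabulate such a distribution; the toy clusters, genome labels, and function names are hypothetical and do not come from the arCOG data.

```python
from collections import Counter

# Hypothetical toy data: cluster id -> set of genomes with at least one member.
MEMBERSHIP = {
    "cluster_A": {"g1", "g2", "g3"},
    "cluster_B": {"g1", "g2"},
    "cluster_C": {"g1", "g2", "g3"},
    "cluster_D": {"g3"},
}
GENOME_ORDER = ["g1", "g2", "g3"]


def pattern_counts(membership, genome_order):
    """Count how many clusters share each presence/absence pattern.

    A pattern is encoded as a string of '1' (present) and '0' (absent),
    one character per genome, in the order given by genome_order.
    """
    counts = Counter()
    for present in membership.values():
        pattern = "".join("1" if g in present else "0" for g in genome_order)
        counts[pattern] += 1
    return counts


def pattern_size_distribution(counts):
    """Map n -> number of distinct patterns shared by exactly n clusters."""
    return Counter(counts.values())


if __name__ == "__main__":
    counts = pattern_counts(MEMBERSHIP, GENOME_ORDER)
    print(dict(counts))
    print(dict(pattern_size_distribution(counts)))
```

Plotting the second table on log-log axes would give a figure of the kind described in the caption.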

Figure 5

Functional breakdown of the entire set of arCOGs and the three core sets. EA, Euryarchaea; CA, Crenarchaea.

Figure 6

The gene-content tree of archaea constructed on the basis of the phyletic patterns of arCOGs. The species abbreviations are as in Table 1. Cren, Crenarchaeota; Eury, Euryarchaeota.
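
A gene-content tree starts from a measure of how different two genomes are in their cluster repertoires. The caption does not state which measure the authors used, so the sketch below uses the Jaccard distance purely as an assumed stand-in; the toy data and function names are hypothetical.

```python
from itertools import combinations

# Hypothetical toy data: cluster id -> set of genomes with at least one member.
MEMBERSHIP = {
    "cluster_A": {"g1", "g2", "g3"},
    "cluster_B": {"g1", "g2"},
    "cluster_C": {"g1", "g2", "g3"},
    "cluster_D": {"g3"},
}


def genome_contents(membership):
    """Invert the table: genome -> set of clusters present in it."""
    contents = {}
    for cluster, genomes in membership.items():
        for genome in genomes:
            contents.setdefault(genome, set()).add(cluster)
    return contents


def jaccard_distance(a, b):
    """1 - |a & b| / |a | b|; two empty sets are treated as identical."""
    union = a | b
    if not union:
        return 0.0
    return 1.0 - len(a & b) / len(union)


def distance_matrix(contents):
    """Pairwise distances keyed by (genome, genome) in sorted order."""
    return {
        (x, y): jaccard_distance(contents[x], contents[y])
        for x, y in combinations(sorted(contents), 2)
    }


if __name__ == "__main__":
    for pair, dist in distance_matrix(genome_contents(MEMBERSHIP)).items():
        print(pair, dist)
```

A distance matrix of this kind could then be passed to any standard distance-based tree-building method such as neighbor joining.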

Figure 7

A reconstruction of gene gain and loss in archaea. Each branch is labeled by 3 numbers: black, the (inferred) number of arCOGs in the node to which the given branch leads; blue, number of arCOGs lost along the branch; red, number of arCOGs gained along the branch. The red circles on branches denote hyperthermophiles, and blue circles denote mesophiles and moderate thermophiles.
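
The three numbers on each branch (content of the node, losses, gains) can be illustrated with a small sketch. It uses Dollo parsimony, in which each family is gained exactly once and may be lost any number of times. This is an assumed simplification for illustration, not necessarily the reconstruction method used by the authors. The tree, family names, and function name are hypothetical, and the sketch assumes every internal node has at least two children and every family is present in at least one leaf.

```python
# Hypothetical toy tree as a child -> parent map, rooted at "LACA".
PARENT = {
    "Cren": "LACA", "Eury": "LACA",
    "c1": "Cren", "c2": "Cren",
    "e1": "Eury", "e2": "Eury",
}
# Hypothetical phyletic patterns: family -> set of leaves that carry it.
PATTERNS = {
    "famA": {"c1", "c2", "e1", "e2"},
    "famB": {"c1", "e1"},
    "famC": {"e1", "e2"},
    "famD": {"c2"},
}


def dollo_reconstruction(parent, root, patterns):
    """Return (content, gains, losses), each a dict keyed by node.

    gains[n] and losses[n] count events on the branch leading to node n;
    content[n] is the number of families inferred to be present at n.
    """
    children = {}
    for child, par in parent.items():
        children.setdefault(par, []).append(child)

    below = {}  # node -> frozenset of leaves in its subtree

    def collect(node):
        kids = children.get(node, [])
        if kids:
            below[node] = frozenset().union(*(collect(k) for k in kids))
        else:
            below[node] = frozenset([node])
        return below[node]

    collect(root)
    content = {n: 0 for n in below}
    gains = {n: 0 for n in below}
    losses = {n: 0 for n in below}

    for present in patterns.values():
        present = frozenset(present)
        # Single gain at the deepest node whose subtree holds all carriers.
        origin = min((n for n in below if present <= below[n]),
                     key=lambda n: len(below[n]))
        gains[origin] += 1
        stack = [origin]
        while stack:
            node = stack.pop()
            content[node] += 1
            for kid in children.get(node, []):
                if below[kid] & present:
                    stack.append(kid)   # still present further down
                else:
                    losses[kid] += 1    # lost on the branch to this child
    return content, gains, losses


if __name__ == "__main__":
    content, gains, losses = dollo_reconstruction(PARENT, "LACA", PATTERNS)
    for node in sorted(content):
        print(node, content[node], losses[node], gains[node])
```

Because Dollo parsimony forbids a second gain, it tends to push families toward deep ancestors and to count many losses; methods that assign separate weights to gains and losses would give different ancestral gene counts.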

Figure 8

Low-bound reconstructions for ancestral archaeal forms: genomes close in size to modern hyperthermophiles. Each column shows the total number of annotated protein-coding genes in the respective archaeal species; the colored portions (green for Crenarchaeota, blue for Euryarchaeota, and cyan for Nanoarchaeota) show genes included in arCOGs. The hatched columns show the number of arCOGs assigned to LACA, the Last CrenArchaeal Common Ancestor (LCACA) and the Last EuryArchaeal Common Ancestor (LEACA).

Figure 9

Taxonomic affinities of arCOGs with bacteria and eukaryotes. For the criteria of taxonomic assignments, see Materials and Methods. A, archaea; B, bacteria; E, eukaryotes.
