KEGG OC: a large-scale automatic construction of taxonomy-based ortholog clusters

Nucleic Acids Res. 2013 Jan;41(Database issue):D353-7.

doi: 10.1093/nar/gks1239. Epub 2012 Nov 27.

Akihiro Nakaya, Toshiaki Katayama, Masumi Itoh, Kazushi Hiranuka, Shuichi Kawashima, Yuki Moriya, Shujiro Okuda, Michihiro Tanaka, Toshiaki Tokimatsu, Yoshihiro Yamanishi, Akiyasu C Yoshizawa, Minoru Kanehisa, Susumu Goto

Akihiro Nakaya et al. Nucleic Acids Res. 2013 Jan.

Abstract

Identifying orthologous genes across the growing number of fully sequenced genomes is a challenging problem in genome science. Here we present KEGG OC (http://www.genome.jp/tools/oc/), a novel database of ortholog clusters (OCs). The current version of KEGG OC contains 1 176 030 OCs, obtained by clustering 8 357 175 genes in 2112 complete genomes (153 eukaryotes, 1830 bacteria and 129 archaea). The OCs were constructed by applying the quasi-clique-based clustering method to all possible protein-coding genes in all complete genomes, based on their amino acid sequence similarities. The OCs are computationally efficient to calculate, which allows the contents to be updated regularly. KEGG OC has the following two features: (i) it covers all complete genomes of a wide variety of organisms from the three domains of life, and the number of organisms is the largest among the existing databases; and (ii) it is compatible with the KEGG database, sharing the same sets of genes and identifiers, which allows seamless integration of OCs with useful components in KEGG such as biological pathways, pathway modules, functional hierarchy, diseases and drugs. The KEGG OC resources are accessible via OC Viewer, which provides an interactive visualization of OCs at different taxonomic levels.
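The abstract names quasi-clique-based clustering but does not spell out the procedure. As a rough illustration only, and not the authors' algorithm, the sketch below greedily groups genes so that every member of a cluster is linked to at least a fraction `gamma` of the other members. The function names, the `gamma` threshold, the greedy first-fit strategy and the toy gene IDs are all assumptions made for this example; the similarity links stand in for thresholded amino acid sequence similarities.

```python
from collections import defaultdict


def is_quasi_clique(members, neighbors, gamma):
    """True if every member is linked to at least gamma * (n - 1) other members."""
    need = gamma * (len(members) - 1)
    return all(len(neighbors[g] & members) >= need for g in members)


def greedy_quasi_cliques(genes, similar_pairs, gamma=0.6):
    """Assign each gene to the first cluster it can join without breaking
    the quasi-clique condition; otherwise start a new cluster.

    Hypothetical helper for illustration: `similar_pairs` is an iterable of
    (gene_a, gene_b) tuples judged similar by some upstream comparison.
    """
    neighbors = defaultdict(set)
    for a, b in similar_pairs:
        neighbors[a].add(b)
        neighbors[b].add(a)

    clusters = []
    for gene in genes:
        for cluster in clusters:
            if is_quasi_clique(cluster | {gene}, neighbors, gamma):
                cluster.add(gene)
                break
        else:
            # No existing cluster accepts the gene: it starts its own cluster.
            clusters.append({gene})
    return clusters


# Toy input: g1..g4 form a near-clique (g1-g4 link missing), g5 is unlinked.
toy_genes = ["g1", "g2", "g3", "g4", "g5"]
toy_pairs = [("g1", "g2"), ("g1", "g3"), ("g2", "g3"), ("g2", "g4"), ("g3", "g4")]
toy_clusters = greedy_quasi_cliques(toy_genes, toy_pairs, gamma=0.6)
```

With these toy inputs, the four mutually similar genes end up in one cluster even though one pairwise link is missing, and the unlinked gene remains a singleton. Tolerating a few missing links is the property that distinguishes a quasi-clique from a strict clique.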

Figures

Figure 1.

Distribution of OCs in KEGG OC across the three domains: eukaryotes, bacteria and archaea. Each number gives the count of OCs consisting of multiple genes, whereas the number in parentheses gives the count of singletons (OCs consisting of a single gene).

Figure 2.

An example OC Viewer output page for the query ‘eco:b0002’ (a KEGG GENES ID for a gene of E. coli K-12 MG1655). The PC column shows the PCs (eco.14, ecj.17, ecd.113, etc.). These PCs are aggregated into a TC named Escherichia_col.10890 at the higher taxonomic level shown in the fifth column. As the aggregation of TCs is iterated from the fifth column to the second column of the OC table, these PCs are merged into the top-level cluster OC.149602. Using the slider at the bottom left, one can focus on an arbitrary depth in the taxonomic tree shown at the bottom right.
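The aggregation described in this caption can be pictured as a chain of parent links from each PC up to the top-level OC. The sketch below is a hypothetical illustration, not the OC Viewer's data model: the cluster IDs eco.14, ecj.17, ecd.113, Escherichia_col.10890 and OC.149602 come from the caption, while the two intermediate entries are placeholders invented here (the caption does not name them, and the number of intermediate levels is an assumption).

```python
def lineage(cluster_id, parent):
    """Return the chain of cluster IDs from `cluster_id` up to the top-level cluster."""
    chain = [cluster_id]
    while chain[-1] in parent:
        chain.append(parent[chain[-1]])
    return chain


# Child -> parent links. Only the IDs quoted in the caption are real;
# the HYPOTHETICAL_* entries are placeholders for unnamed intermediate TCs.
PARENT = {
    "eco.14": "Escherichia_col.10890",
    "ecj.17": "Escherichia_col.10890",
    "ecd.113": "Escherichia_col.10890",
    "Escherichia_col.10890": "HYPOTHETICAL_TC_column4",
    "HYPOTHETICAL_TC_column4": "HYPOTHETICAL_TC_column3",
    "HYPOTHETICAL_TC_column3": "OC.149602",
}

eco_chain = lineage("eco.14", PARENT)
top_levels = {lineage(pc, PARENT)[-1] for pc in ("eco.14", "ecj.17", "ecd.113")}
```

Truncating such a chain at a chosen position corresponds loosely to moving the slider: a shallower cut shows the coarse top-level cluster, a deeper cut shows the finer taxonomy-specific groups.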
