From genomics to chemical genomics: new developments in KEGG

Journal Article

Received: 14 September 2005

Revision received: 17 October 2005

Accepted: 17 October 2005

Published: 01 January 2006

Minoru Kanehisa, Susumu Goto, Masahiro Hattori, Kiyoko F. Aoki-Kinoshita, Masumi Itoh, Shuichi Kawashima, Toshiaki Katayama, Michihiro Araki, Mika Hirakawa, From genomics to chemical genomics: new developments in KEGG, Nucleic Acids Research, Volume 34, Issue suppl_1, 1 January 2006, Pages D354–D357, https://doi.org/10.1093/nar/gkj102

Abstract

The increasing amount of genomic and molecular information is the basis for understanding higher-order biological systems, such as the cell and the organism, and their interactions with the environment, as well as for medical, industrial and other practical applications. The KEGG resource (http://www.genome.jp/kegg/) provides a reference knowledge base for linking genomes to biological systems, categorized as building blocks in the genomic space (KEGG GENES) and the chemical space (KEGG LIGAND), and wiring diagrams of interaction networks and reaction networks (KEGG PATHWAY). A fourth component, KEGG BRITE, has been formally added to the KEGG suite of databases. This reflects our attempt to computerize functional interpretations as part of the pathway reconstruction process based on the hierarchically structured knowledge about the genomic, chemical and network spaces. In accordance with the new chemical genomics initiatives, the scope of KEGG LIGAND has been significantly expanded to cover both endogenous and exogenous molecules. Specifically, RPAIR contains curated chemical structure transformation patterns extracted from known enzymatic reactions, which would enable analysis of genome-environment interactions, such as the prediction of new reactions and new enzyme genes that would degrade new environmental compounds. Additionally, drug information is now stored separately and linked to new KEGG DRUG structure maps.

INTRODUCTION

While traditional genomics and other types of omics approaches have contributed to our knowledge on the genomic space of possible genes and proteins that make up the biological system, the new chemical genomics initiatives will give us a glimpse of the chemical space of possible chemical substances that exist as an interface between the biological world and the natural world. The KEGG database project was initiated in 1995, the last year of the first 5-year phase of the Japanese Human Genome Programme (1). After 10 years of development in parallel with the growing number of completely sequenced genomes and increased activities in post-genomic research, the KEGG project has entered a new phase in accordance with the chemical genomics initiatives.

KEGG is a database resource for understanding higher-order functions and utilities of the biological system, such as the cell or the organism, from genomic and molecular information. In fact, we consider KEGG as a computer representation of the biological system, consisting of building blocks and wiring diagrams, which can be used for modeling and simulation as well as for browsing and retrieval (2). Originally, the wiring diagrams involved endogenous molecules, both those that are directly encoded in the genome (proteins and RNAs) and those that are indirectly encoded through biosynthetic/biodegradation pathways (metabolites, glycans and so on). Now we are extending these wiring diagrams to include exogenous molecules. This will help us understand interactions between the biological system and the natural environment, and would eventually lead to representation and reconstruction of another higher-level biological system, the biological world. Here we report new developments in KEGG in this direction.

THE KEGG RESOURCE

Overview

KEGG consists of four main databases. As illustrated in Figure 1, they are categorized as building blocks in the genomic space (GENES databases) and the chemical space (LIGAND database), wiring diagrams in the network space (PATHWAY database) and ontologies for pathway reconstruction (BRITE database). BRITE had been a separate database for many years, but it was formally included in KEGG in release 34.0 (April 2005) to establish a logical foundation for the KEGG Project. The URLs for accessing KEGG are summarized in Table 1.

Biological systems are represented in KEGG by two types of graphs, called nested graphs and line graphs in theoretical computer science. The nested graph is a graph whose nodes can themselves be graphs. It is used for representing KEGG network hierarchy and for pathway reconstruction and functional inference. The line graph is a graph derived by interchanging nodes and edges of another graph. It represents the inherent complementarity of the metabolic pathway, which can be viewed either as a network of genes (enzymes) or as a network of compounds, meaning that one can be generated from the other by the line graph transformation. Thus, the line graph is the basis for integrated analysis of genomic and chemical information.
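The line graph transformation described above can be illustrated with a minimal sketch. It assumes an undirected toy network in which compounds are nodes and enzymes label the edges; the names `line_graph`, `compound_network` and the labels `E1`, `A` and so on are hypothetical and are not KEGG identifiers or KEGG code.

```python
from itertools import combinations


def line_graph(edges):
    """Return the line graph of an undirected graph.

    `edges` maps an edge label to the pair of nodes it connects.
    In the result, each original edge becomes a node, and two such
    nodes are connected when the original edges share an endpoint.
    The shared endpoint(s) are kept as the label of the new edge.
    """
    result = {}
    for (e1, nodes1), (e2, nodes2) in combinations(sorted(edges.items()), 2):
        shared = set(nodes1) & set(nodes2)
        for node in sorted(shared):
            result.setdefault((e1, e2), []).append(node)
    return result


# Toy compound network (assumption): compounds are nodes, enzymes are edges.
compound_network = {
    "E1": ("A", "B"),
    "E2": ("B", "C"),
    "E3": ("C", "D"),
}

# Enzyme network: enzymes are nodes, shared compounds label the edges.
enzyme_network = line_graph(compound_network)
print(enzyme_network)
```

In this toy case the enzymes `E1` and `E2` become adjacent because both act on compound `B`, which is the sense in which the gene (enzyme) view and the compound view of a metabolic pathway can be generated from each other.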

BRITE database

KEGG BRITE is a collection of hierarchies and binary relations with two inter-related objectives corresponding to the two types of graphs: to automate functional interpretations associated with the KEGG pathway reconstruction and to assist discovery of empirical rules involving genome-environment interactions. Currently, we focus on hierarchical structuring of our knowledge on functional aspects of the genomic and chemical spaces (Table 2), including the KEGG orthology (KO) system for ortholog/paralog gene groups, the reaction classification (RC) system for biochemical reactions, and other classifications for compounds and drugs tentatively called chemical ontology as shown in Figure 1. We plan to extend the KO system to include the definition of functional modules in the KEGG pathways and to develop ontologies for computational inference of higher-order functions.

PATHWAY database

The KEGG PATHWAY database is a collection of manually drawn pathway maps for metabolism, genetic information processing, environmental information processing such as signal transduction, various other cellular processes and human diseases. During the past 2 years we have significantly increased the number of pathway maps for regulatory pathways including signal transduction, ligand–receptor interaction and cell communication, all based on extensive survey of published literature. For metabolic pathways we created two new sections, ‘Glycan Biosynthesis and Metabolism’ and ‘Biosynthesis of Polyketides and Nonribosomal Peptides’. The XML version of the pathway maps is available for both metabolic and regulatory pathways. These KEGG Markup Language (KGML) files provide graph information that can be used to computationally reproduce and manipulate KEGG pathway maps.
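As a rough illustration of how graph information might be read from a KGML file, the sketch below parses a small hand-written XML fragment with the Python standard library. The fragment is only written in the spirit of KGML: the element and attribute names used here (`entry`, `relation`, `entry1`, `entry2`, `type`) and all identifiers are assumptions for illustration and should be checked against the KGML documentation before use on real files.

```python
import xml.etree.ElementTree as ET

# Hand-written fragment for illustration only; identifiers are placeholders.
KGML_SAMPLE = """
<pathway name="path:map00000" title="Toy pathway">
  <entry id="1" name="ko:K00001" type="ortholog"/>
  <entry id="2" name="ko:K00002" type="ortholog"/>
  <entry id="3" name="cpd:C00001" type="compound"/>
  <relation entry1="1" entry2="2" type="ECrel">
    <subtype name="compound" value="3"/>
  </relation>
</pathway>
"""


def read_pathway(xml_text):
    """Return (entries, edges) extracted from a KGML-like document.

    entries: entry id -> entry name
    edges:   list of (source name, target name, relation type)
    """
    root = ET.fromstring(xml_text.strip())
    entries = {e.get("id"): e.get("name") for e in root.iter("entry")}
    edges = []
    for rel in root.iter("relation"):
        source = entries[rel.get("entry1")]
        target = entries[rel.get("entry2")]
        edges.append((source, target, rel.get("type")))
    return entries, edges


entries, edges = read_pathway(KGML_SAMPLE)
print(entries)
print(edges)
```

The resulting node table and edge list are enough to rebuild the pathway as a graph in any graph library, which is the kind of computational manipulation the XML files are meant to support.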

GENES database

The KEGG GENES database is a collection of gene catalogs for all complete genomes and some partial genomes (31 eukaryotes, 235 bacteria and 23 archaea as of September 12, 2005), generated from publicly available resources, mostly NCBI RefSeq (3). All genomes in KEGG GENES are subject to SSDB computation and given manual KO assignments as described below. There are auxiliary collections of gene catalogs: DGENES for draft genomes (21 eukaryotes) and EGENES for expressed sequence tag consensus contigs (25 plants). These are meant to supplement the repertoire of KEGG organisms, and all are given automatic KO assignments using GENES as a reference dataset. Each GENES entry contains cross-reference information to outside databases, including NCBI gi numbers, Entrez Gene IDs and UniProt accession numbers. Starting with KEGG release 37.0 (January 2006), automatic ID conversion is implemented, enabling use of such outside identifiers to access KEGG GENES and then the other KEGG databases.

KEGG orthology

There are over one million genes in total in KEGG GENES, representing a tiny but well-characterized part of the genomic space that makes up the biological world. From this part we organize knowledge about orthologous genes and paralogous genes, which, we hope, can be generalized for understanding the entire genomic space. This knowledge is stored in the KO system, a pathway-based classification of orthologous genes, including orthologous relationships of paralogous gene groups. The KO identifier, or the K number, is a common identifier for linking genomic information in the GENES database with network information in the PATHWAY database. The pathway nodes represented by rectangles in the KEGG reference pathway maps are given KO identifiers, so that organism-specific pathways can be computationally generated once each genome is annotated with KOs. This annotation, or KO assignment, is done manually for KEGG GENES with the help of the GFIT tool using best-hit relations in pairwise genome comparisons stored in the SSDB database (4).
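The idea that an organism-specific pathway follows from a reference map once genes carry K numbers can be sketched as a simple lookup. This is a minimal illustration under stated assumptions, not the KEGG implementation: the function name `organism_pathway`, the node labels, the gene identifiers and the K numbers below are all hypothetical.

```python
def organism_pathway(reference_nodes, gene_to_ko):
    """Map each reference pathway node to the organism's genes.

    reference_nodes: node label -> K number on the reference map
    gene_to_ko:      organism gene id -> assigned K number
    Returns node label -> sorted list of genes. Nodes with no gene
    in this organism map to an empty list.
    """
    ko_to_genes = {}
    for gene, ko in gene_to_ko.items():
        ko_to_genes.setdefault(ko, []).append(gene)
    return {
        node: sorted(ko_to_genes.get(ko, []))
        for node, ko in reference_nodes.items()
    }


# Hypothetical reference map with three rectangles (nodes).
reference_nodes = {"step1": "K90001", "step2": "K90002", "step3": "K90003"}

# Hypothetical KO assignments for one genome.
gene_to_ko = {"org:g1": "K90001", "org:g2": "K90001", "org:g7": "K90003"}

pathway = organism_pathway(reference_nodes, gene_to_ko)
print(pathway)
```

Here `step2` has no gene in the toy genome, so it would appear as a missing step in the organism-specific view, while `step1` is covered by two genes that share a K number.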

Because the number of ortholog groups that can be linked to pathways is limited, we have introduced two additional ways to define KOs. One is to use COG (5) to cover a broad range of possible ortholog groups. The other is to rely on experts' classifications of protein families, which tend to be more functionally oriented, resulting in narrowly defined KOs. A growing number of protein families are being added to the KO system, and they are shown in separate hierarchies different from the KEGG network hierarchy. The KO system can be best viewed from the KEGG BRITE database (Table 2).

LIGAND database

Originally, the LIGAND database consisted of just two components: ENZYME for enzyme nomenclature and COMPOUND for chemical compound structures (6). It later successively included additional components: REACTION for chemical reaction formulas, GLYCAN for glycan structures, RPAIR for reactant pair transformation patterns and DRUG for drug information. This expansion of the LIGAND collection represents our expanded efforts for understanding the chemical space that is part of the biological world.

The KEGG DRUG database is a new addition from KEGG release 36.1 (December 2005). It contains chemical structures and additional information such as therapeutic categories and target molecules. A distinctive feature of KEGG DRUG is a collection of drug structure maps, which graphically illustrate, in a manner similar to KEGG pathway maps, our knowledge of groups of chemical structural patterns, therapeutic categories, their relationships and the chronology of drug development if known.

Reaction classification

The RC system in the chemical space is a counterpart of the KO system in the genomic space (Figure 1). It represents our attempt to organize knowledge on chemical reactions by categorizing chemical structure transformation patterns. The REACTION database contains individual reaction formulas taken from the ENZYME database. Each reaction formula is split into a set of substrate-product pairs, and the chemical structure comparison program SIMCOMP is applied to obtain an optimal alignment. This comparison is based on atom typing, which is the conversion of regular atomic (C, N, O, S, P and so on) representation to what we call KCF representation that consists of 68 atom types distinguishing functional groups and atomic environments (7). The chemical structure alignment generated by SIMCOMP is used to define the R atom for the reaction center, the D atom(s) for adjacent atom(s) in the mismatched region and the M atom(s) for adjacent atom(s) in the matched region (8). This is first done computationally and is followed by extensive manual curation.

The RPAIR database is still under development, but it is the basis for the RC system categorizing curated RDM patterns. Since an enzymatic reaction usually involves multiple substrates and products, one EC number corresponds to a combination of RDM patterns. The RC system has enabled automatic assignment of EC numbers from a set of substrate and product structures (8) and will further enable exploration of unknown reactions by generating plausible combinations of RDM patterns, which may then be related to possible paralogs of enzyme genes.
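The statement that one EC number corresponds to a combination of RDM patterns can be pictured as a lookup keyed by a set of patterns. The sketch below is only that picture: the pattern labels (`rdm1` and so on), the class labels (`EC-A`, `EC-B`) and the function `assign_ec` are hypothetical placeholders, and the actual assignment method is the one described in reference (8), which is not reproduced here.

```python
def assign_ec(observed_patterns, ec_table):
    """Look up the class for a combination of RDM patterns.

    observed_patterns: iterable of pattern labels derived from the
                       substrate-product pairs of one reaction
    ec_table:          frozenset of pattern labels -> class label
    Returns the class label, or None when the combination is unknown.
    """
    return ec_table.get(frozenset(observed_patterns))


# Hypothetical table: each class is defined by a combination of patterns.
ec_table = {
    frozenset({"rdm1", "rdm2"}): "EC-A",
    frozenset({"rdm3"}): "EC-B",
}

print(assign_ec(["rdm2", "rdm1"], ec_table))
print(assign_ec(["rdm1"], ec_table))
```

A combination that is absent from the table, such as `rdm1` alone in this toy example, is the kind of case where plausible but unknown reactions could be explored.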

Glycosyltransferase reactions

Functional glycomics has been a most successful area for integrated analysis of genomic and chemical information (9). The carbohydrate sequence of glycans is determined by a specific set of biosynthetic reactions catalyzed by different types of glycosyltransferases. Thus, once we know the repertoire of glycosyltransferases in the genome or in the transcriptome, it should in principle be possible to predict the repertoire of glycan structures. Conversely, the knowledge about glycan structures can be used to search and annotate new glycosyltransferases. Composite Structure Map in KEGG GLYCAN is a tool for converting genomic or transcriptomic data to glycan structure variations based on a curated set of known glycosyltransferase reactions.
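The reasoning from a glycosyltransferase repertoire to a repertoire of glycan structures can be sketched as a reachability search. This is a deliberately simplified illustration and not the Composite Structure Map: it assumes linear chains without linkage information (real glycans are branched), and the enzyme names, rules and the function `reachable_glycans` are hypothetical.

```python
def reachable_glycans(start, rules, enzymes, max_len=4):
    """Enumerate linear chains reachable from `start`.

    start:   tuple of residues, e.g. ("Glc",)
    rules:   enzyme -> (required terminal residue, residue added)
    enzymes: enzymes present in the genome or transcriptome
    max_len: upper bound on chain length, to keep the search finite
    """
    seen = {start}
    frontier = [start]
    while frontier:
        chain = frontier.pop()
        if len(chain) >= max_len:
            continue
        for enzyme in enzymes:
            if enzyme not in rules:
                continue
            acceptor, added = rules[enzyme]
            if chain[-1] == acceptor:
                extended = chain + (added,)
                if extended not in seen:
                    seen.add(extended)
                    frontier.append(extended)
    return sorted(seen)


# Hypothetical transferase rules: which residue is added to which terminus.
rules = {
    "GT_a": ("Glc", "Gal"),
    "GT_b": ("Gal", "GlcNAc"),
    "GT_c": ("Gal", "Neu5Ac"),
}

structures = reachable_glycans(("Glc",), rules, {"GT_a", "GT_b"})
print(structures)
```

Adding `GT_c` to the enzyme set would add one more predicted structure, which mirrors the converse use mentioned above: an observed structure that cannot be reached points to a transferase that is still missing from the annotation.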

ACCESSING KEGG

Web and FTP

KEGG is the major component of the Japanese GenomeNet, which is served by the Kyoto University Bioinformatics Center. The other GenomeNet services including DBGET and BLAST/FASTA searches are now primarily developed and used to support KEGG. The official URL for GenomeNet has been modified to http://www.genome.jp/, but the former URL http://www.genome.ad.jp/ will still be made available (Table 1). To download the KEGG data, academic users may use the GenomeNet FTP site.

KEGG API

The KEGG API service has become an increasingly popular mode of access. It is the SOAP/WSDL interface to KEGG, enabling users to write their own programs to access, customize and utilize KEGG.

KegArray and KegDraw

KegArray and KegDraw are standalone Java applications that make use of the KEGG resources. KegArray is for microarray data analysis in conjunction with KEGG pathways and genomes. KegDraw is for drawing glycan structures and chemical compound structures, which can then be used to query against KEGG and PubChem databases. Both are freely available to academic and non-academic users.

Figure 1

The overall architecture of KEGG, which now consists of four main components. KEGG BRITE has been formally added to establish a logical foundation for inference of higher-order functions.

Table 1

URLs for the KEGG resource

Table 2

Functional hierarchies in KEGG BRITE

Network hierarchy
  KO
Protein families
  Enzymes
  Transcription factors
  Ribosome
  Translation factors
  ABC transporters
  G-protein-coupled receptors
  Ion channels
  Cytokines
  Cytokine receptors
  Cell adhesion molecules (CAMs)
  CAM ligands
  CD molecules
  Bacterial motility proteins
Compounds
  Compounds with biological roles
  Lipids
  Phytochemical compounds
Compound interactions
  Ion channel agonists/antagonists
  Cytochrome P450 substrates
Drugs
  Therapeutic category of drugs
  Drug classification
Diseases
  Disease genes, genomes and pathways
Organisms
  KEGG organisms

As of September 12, 2005.

The KEGG project is supported by the Institute for Bioinformatics Research and Development of the Japan Science and Technology Agency, the 21st Century COE program ‘Genome Science’, and a grant-in-aid for scientific research on the priority area from the Ministry of Education, Culture, Sports, Science and Technology of Japan. The computational resources were provided by the Bioinformatics Center, Institute for Chemical Research, Kyoto University. Funding to pay the Open Access publication charges for this article was provided by the grant-in-aid for scientific research.

Conflict of interest statement. None declared.

REFERENCES

1. Kanehisa, M. (1997) A database for post-genome analysis. Trends Genet., 13, 375–376.

2. Kanehisa, M., Goto, S., Kawashima, S., Okuno, Y., Hattori, M. (2004) The KEGG resource for deciphering the genome. Nucleic Acids Res., 32, D277–D280.

3. Pruitt, K.D., Tatusova, T., Maglott, D.R. (2005) NCBI Reference Sequence (RefSeq): a curated non-redundant sequence database of genomes, transcripts and proteins. Nucleic Acids Res., 33, D501–D504.

4. Kanehisa, M., Goto, S., Kawashima, S., Nakaya, A. (2002) The KEGG databases at GenomeNet. Nucleic Acids Res., 30, 42–46.

5. Tatusov, R.L., Natale, D.A., Garkavtsev, I.V., Tatusova, T.A., Shankavaram, U.T., Rao, B.S., Kiryutin, B., Galperin, M.Y., Fedorova, N.D., Koonin, E.V. (2001) The COG database: new developments in phylogenetic classification of proteins from complete genomes. Nucleic Acids Res., 29, 22–28.

6. Goto, S., Nishioka, T., Kanehisa, M. (1998) LIGAND: chemical database for enzyme reactions. Bioinformatics, 14, 591–599.

7. Hattori, M., Okuno, Y., Goto, S., Kanehisa, M. (2003) Development of a chemical structure comparison method for integrated analysis of chemical and genomic information in the metabolic pathways. J. Am. Chem. Soc., 125, 11853–11865.

8. Kotera, M., Okuno, Y., Hattori, M., Goto, S., Kanehisa, M. (2004) Computational assignment of the EC numbers for genomic-scale analysis of enzymatic reactions. J. Am. Chem. Soc., 126, 16487–16498.

9. Hashimoto, K., Goto, S., Kawano, S., Aoki-Kinoshita, K.F., Ueda, N., Hamajima, M., Kawasaki, T., Kanehisa, M. (2005) KEGG as a glycome informatics resource. Glycobiology, in press.

© The Author 2006. Published by Oxford University Press. All rights reserved. The online version of this article has been published under an open access model. Users are entitled to use, reproduce, disseminate, or display the open access version of this article for non-commercial purposes provided that: the original authorship is properly and fully attributed; the Journal and Oxford University Press are attributed as the original place of publication with the correct citation details given; if an article is subsequently reproduced or disseminated not in its entirety but only in part or as a derivative work this must be clearly indicated. For commercial re-use, please contact journals.permissions@oxfordjournals.org
