From genomics to chemical genomics: new developments in KEGG
Journal Article
Received:
14 September 2005
Revision received:
17 October 2005
Accepted:
17 October 2005
Published:
01 January 2006
Minoru Kanehisa, Susumu Goto, Masahiro Hattori, Kiyoko F. Aoki-Kinoshita, Masumi Itoh, Shuichi Kawashima, Toshiaki Katayama, Michihiro Araki, Mika Hirakawa, From genomics to chemical genomics: new developments in KEGG, Nucleic Acids Research, Volume 34, Issue suppl_1, 1 January 2006, Pages D354–D357, https://doi.org/10.1093/nar/gkj102
Abstract
The increasing amount of genomic and molecular information is the basis for understanding higher-order biological systems, such as the cell and the organism, and their interactions with the environment, as well as for medical, industrial and other practical applications. The KEGG resource (http://www.genome.jp/kegg/) provides a reference knowledge base for linking genomes to biological systems, categorized as building blocks in the genomic space (KEGG GENES) and the chemical space (KEGG LIGAND), and wiring diagrams of interaction networks and reaction networks (KEGG PATHWAY). A fourth component, KEGG BRITE, has been formally added to the KEGG suite of databases. This reflects our attempt to computerize functional interpretations as part of the pathway reconstruction process based on the hierarchically structured knowledge about the genomic, chemical and network spaces. In accordance with the new chemical genomics initiatives, the scope of KEGG LIGAND has been significantly expanded to cover both endogenous and exogenous molecules. Specifically, RPAIR contains curated chemical structure transformation patterns extracted from known enzymatic reactions, which would enable analysis of genome-environment interactions, such as the prediction of new reactions and new enzyme genes that would degrade new environmental compounds. Additionally, drug information is now stored separately and linked to new KEGG DRUG structure maps.
INTRODUCTION
While traditional genomics and other types of omics approaches have contributed to our knowledge on the genomic space of possible genes and proteins that make up the biological system, the new chemical genomics initiatives will give us a glimpse of the chemical space of possible chemical substances that exist as an interface between the biological world and the natural world. The KEGG database project was initiated in 1995, the last year of the first 5-year phase of the Japanese Human Genome Programme (1). After 10 years of development in parallel with the growing number of completely sequenced genomes and increased activities in post-genomic research, the KEGG project has entered a new phase in accordance with the chemical genomics initiatives.
KEGG is a database resource for understanding higher-order functions and utilities of the biological system, such as the cell or the organism, from genomic and molecular information. In fact, we consider KEGG as a computer representation of the biological system, consisting of building blocks and wiring diagrams, which can be used for modeling and simulation as well as for browsing and retrieval (2). Originally, the wiring diagrams involved endogenous molecules, both those that are directly encoded in the genome (proteins and RNAs) and those that are indirectly encoded through biosynthetic/biodegradation pathways (metabolites, glycans and so on). Now we are extending these wiring diagrams to include exogenous molecules. This will help understand interactions between the biological system and the natural environment, and would eventually lead to representation and reconstruction of another higher-level biological system, the biological world. Here we report new developments in KEGG towards this direction.
THE KEGG RESOURCE
Overview
KEGG consists of four main databases. As illustrated in Figure 1, they are categorized as building blocks in the genomic space (GENES databases) and the chemical space (LIGAND database), wiring diagrams in the network space (PATHWAY database) and ontologies for pathway reconstruction (BRITE database). BRITE had been a separate database for many years, but it was formally included in KEGG in release 34.0 (April 2005) to establish a logical foundation for the KEGG project. The URLs for accessing KEGG are summarized in Table 1.
Biological systems are represented in KEGG by two types of graphs, called nested graphs and line graphs in theoretical computer science. The nested graph is a graph whose nodes can themselves be graphs. It is used for representing KEGG network hierarchy and for pathway reconstruction and functional inference. The line graph is a graph derived by interchanging nodes and edges of another graph. It represents the inherent complementarity of the metabolic pathway, which can be viewed either as a network of genes (enzymes) or as a network of compounds, meaning that one can be generated from the other by the line graph transformation. Thus, the line graph is the basis for integrated analysis of genomic and chemical information.
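The line graph transformation described above can be sketched in a few lines. This is a minimal illustration, not KEGG code: the function name `line_graph`, the dictionary layout and the compound and enzyme labels are all assumptions made for the example.

```python
from itertools import combinations


def line_graph(edges):
    """Return the line graph of an undirected graph given as {edge_label: (u, v)}.

    Each edge of the input becomes a node of the output; two such nodes are
    joined when the original edges share an endpoint.
    """
    result = {label: set() for label in edges}
    for (a, ends_a), (b, ends_b) in combinations(edges.items(), 2):
        if set(ends_a) & set(ends_b):
            result[a].add(b)
            result[b].add(a)
    return result


# Compound network: nodes are compounds, edges are enzymes (made-up labels).
compound_network = {
    "enzyme1": ("A", "B"),
    "enzyme2": ("B", "C"),
    "enzyme3": ("C", "D"),
}

# Enzyme network: nodes are enzymes, joined when they share a compound.
enzyme_network = line_graph(compound_network)
```

In this toy pathway, `enzyme2` ends up adjacent to both `enzyme1` and `enzyme3` because it shares compound B with the former and compound C with the latter.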
BRITE database
KEGG BRITE is a collection of hierarchies and binary relations with two inter-related objectives corresponding to the two types of graphs: to automate functional interpretations associated with the KEGG pathway reconstruction and to assist discovery of empirical rules involving genome-environment interactions. Currently, we focus on hierarchical structuring of our knowledge on functional aspects of the genomic and chemical spaces (Table 2), including the KEGG orthology (KO) system for ortholog/paralog gene groups, the reaction classification (RC) system for biochemical reactions, and other classifications for compounds and drugs tentatively called chemical ontology as shown in Figure 1. We plan to extend the KO system to include the definition of functional modules in the KEGG pathways and to develop ontologies for computational inference of higher-order functions.
PATHWAY database
The KEGG PATHWAY database is a collection of manually drawn pathway maps for metabolism, genetic information processing, environmental information processing such as signal transduction, various other cellular processes and human diseases. During the past 2 years we have significantly increased the number of pathway maps for regulatory pathways including signal transduction, ligand–receptor interaction and cell communication, all based on extensive survey of published literature. For metabolic pathways we created two new sections, ‘Glycan Biosynthesis and Metabolism’ and ‘Biosynthesis of Polyketides and Nonribosomal Peptides’. The XML version of the pathway maps is available for both metabolic and regulatory pathways. These KEGG Markup Language (KGML) files provide graph information that can be used to computationally reproduce and manipulate KEGG pathway maps.
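As a hedged sketch of how the graph information in a KGML file might be read, the following uses Python's standard XML parser. The sample document is invented for illustration, and the element and attribute names (`entry`, `relation`, `id`, `name`, `type`, `entry1`, `entry2`) are assumptions about the KGML layout that should be checked against the KGML documentation before use.

```python
import xml.etree.ElementTree as ET

# A made-up, minimal KGML-like document; identifiers are placeholders.
SAMPLE_KGML = """
<pathway name="path:ko00000" org="ko" number="00000" title="Toy pathway">
  <entry id="1" name="ko:K00001" type="ortholog"/>
  <entry id="2" name="ko:K00002" type="ortholog"/>
  <entry id="3" name="cpd:C00001" type="compound"/>
  <relation entry1="1" entry2="2" type="ECrel">
    <subtype name="compound" value="3"/>
  </relation>
</pathway>
"""


def kgml_to_graph(xml_text):
    """Extract nodes and edges from a KGML-like XML string."""
    root = ET.fromstring(xml_text)
    # Nodes: entry id -> (entry type, entry name)
    nodes = {e.get("id"): (e.get("type"), e.get("name")) for e in root.iter("entry")}
    # Edges: (source entry id, target entry id, relation type)
    edges = [
        (r.get("entry1"), r.get("entry2"), r.get("type"))
        for r in root.iter("relation")
    ]
    return nodes, edges


nodes, edges = kgml_to_graph(SAMPLE_KGML)
```

The resulting `nodes` and `edges` can be passed to any graph library to redraw or manipulate the pathway.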
GENES database
The KEGG GENES database is a collection of gene catalogs for all complete genomes and some partial genomes (31 eukaryotes, 235 bacteria and 23 archaea as of September 12, 2005), generated from publicly available resources, mostly NCBI RefSeq (3). All genomes in KEGG GENES are subject to SSDB computation and given manual KO assignments as described below. There are auxiliary collections of gene catalogs: DGENES for draft genomes (21 eukaryotes) and EGENES for expressed sequence tag consensus contigs (25 plants). These are meant to supplement the repertoire of KEGG organisms, and all are given automatic KO assignments using GENES as a reference dataset. Each GENES entry contains cross-reference information to outside databases, including NCBI gi numbers, Entrez Gene IDs and UniProt accession numbers. Starting with KEGG release 37.0 (January 2006), automatic ID conversion is implemented, enabling use of such outside identifiers to access KEGG GENES and then the other KEGG databases.
KEGG orthology
There is a total of over one million genes in KEGG GENES, representing a tiny, but well-characterized part of the genomic space that makes up the biological world. From this part we organize knowledge about orthologous genes and paralogous genes, which, we hope, can be generalized for understanding the entire genomic space. This knowledge is stored in the KO system, a pathway-based classification of orthologous genes, including orthologous relationships of paralogous gene groups. The KO identifier, or the K number, is a common identifier for linking genomic information in the GENES database with network information in the PATHWAY database. The pathway nodes represented by rectangles in the KEGG reference pathway maps are given KO identifiers, so that organism-specific pathways can be computationally generated once each genome is annotated with KO's. This annotation or the KO assignment is done manually for KEGG GENES with the help of the GFIT tool using best-hit relations in pairwise genome comparisons stored in the SSDB database (4).
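The way K numbers turn a reference pathway into an organism-specific one can be illustrated with a small sketch. The function name `organism_pathway`, the gene names and the K numbers used here are placeholders chosen for the example, not actual KEGG annotations.

```python
def organism_pathway(reference_nodes, gene_to_ko):
    """Map each reference-pathway node (a K number) to the organism's genes."""
    ko_to_genes = {}
    for gene, ko in gene_to_ko.items():
        ko_to_genes.setdefault(ko, []).append(gene)
    # Nodes without any annotated gene stay in the map with an empty list.
    return {ko: sorted(ko_to_genes.get(ko, [])) for ko in reference_nodes}


# Rectangles of a hypothetical reference pathway map, identified by K numbers.
reference_nodes = ["K00001", "K00002", "K00003"]

# Hypothetical KO assignments for one genome.
gene_to_ko = {
    "geneA": "K00001",
    "geneB": "K00001",  # paralog assigned to the same KO
    "geneC": "K00003",
    "geneD": "K09999",  # KO that does not appear in this pathway
}

pathway = organism_pathway(reference_nodes, gene_to_ko)
present = [ko for ko, genes in pathway.items() if genes]
```

Nodes listed in `present` would be the ones highlighted on the organism-specific map; `K00002` has no gene in this genome and would be left unmarked.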
Because the number of ortholog groups that can be linked to pathways is limited, we have introduced two additional ways to define KO's. One is to use COG (5) to cover a broad range of possible ortholog groups. The other is to rely on experts' classifications of protein families, which tend to be more functionally oriented, resulting in narrowly defined KO's. A growing number of protein families are being added to the KO system, and they are shown in separate hierarchies different from the KEGG network hierarchy. The KO system can be best viewed from the KEGG BRITE database (Table 2).
LIGAND database
Originally, the LIGAND database consisted of just two components: ENZYME for enzyme nomenclature and COMPOUND for chemical compound structures (6). It later successively included additional components: REACTION for chemical reaction formulas, GLYCAN for glycan structures, RPAIR for reactant pair transformation patterns and DRUG for drug information. This expansion of the LIGAND collection represents our expanded efforts for understanding the chemical space that is part of the biological world.
The KEGG DRUG database is a new addition from KEGG release 36.1 (December 2005). It contains chemical structures and additional information such as therapeutic categories and target molecules. A unique feature of KEGG DRUG is a collection of drug structure maps, which graphically illustrate, in a manner similar to KEGG pathway maps, our knowledge on groups of chemical structural patterns, therapeutic categories, their relationships and the chronology of drug development if known.
Reaction classification
The RC system in the chemical space is a counterpart of the KO system in the genomic space (Figure 1). It represents our attempt to organize knowledge on chemical reactions by categorizing chemical structure transformation patterns. The REACTION database contains individual reaction formulas taken from the ENZYME database. Each reaction formula is split into a set of substrate-product pairs, and the chemical structure comparison program SIMCOMP is applied to obtain an optimal alignment. This comparison is based on atom typing, which is the conversion of regular atomic (C, N, O, S, P and so on) representation to what we call KCF representation that consists of 68 atom types distinguishing functional groups and atomic environments (7). The chemical structure alignment generated by SIMCOMP is used to define the R atom for the reaction center, the D atom(s) for adjacent atom(s) in the mismatched region and the M atom(s) for adjacent atom(s) in the matched region (8). This is first done computationally and is followed by extensive manual curation.
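The R, D and M labelling can be illustrated with a toy sketch. It is not SIMCOMP and does not reproduce the curated procedure: it assumes the alignment is already given as a set of matched atoms, looks at one molecule only, and uses plain element symbols rather than the 68 KCF atom types. The function name `label_rdm` and the example molecule are hypothetical.

```python
def label_rdm(atom_types, bonds, matched):
    """Label reaction-centre (R), difference (D) and matched (M) atoms.

    atom_types: {atom_id: type label}
    bonds: iterable of (atom_id, atom_id) pairs
    matched: set of atom ids that the alignment pairs with the other molecule
    """
    neighbours = {a: set() for a in atom_types}
    for a, b in bonds:
        neighbours[a].add(b)
        neighbours[b].add(a)

    patterns = []
    for atom in sorted(atom_types):
        if atom not in matched:
            continue
        d_atoms = sorted(n for n in neighbours[atom] if n not in matched)
        if not d_atoms:
            continue  # no mismatched neighbour, so not a reaction centre
        m_atoms = sorted(n for n in neighbours[atom] if n in matched)
        patterns.append({
            "R": atom_types[atom],
            "D": [atom_types[n] for n in d_atoms],
            "M": [atom_types[n] for n in m_atoms],
        })
    return patterns


# Toy substrate: a three-carbon chain with an oxygen on atom 2.
atom_types = {1: "C", 2: "C", 3: "C", 4: "O"}  # plain elements, not KCF types
bonds = [(1, 2), (2, 3), (2, 4)]
matched = {1, 2, 3}  # the oxygen has no counterpart in the product

patterns = label_rdm(atom_types, bonds, matched)
```

Here atom 2 is the only matched atom with a neighbour in the mismatched region, so it becomes the R atom, with the oxygen as its D atom and the two flanking carbons as its M atoms.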
The RPAIR database is still under development, but it is the basis for the RC system categorizing curated RDM patterns. Since an enzymatic reaction usually involves multiple substrates and products, one EC number corresponds to a combination of RDM patterns. The RC system has enabled automatic assignment of EC numbers from a set of substrate and product structures (8) and will further enable exploration of unknown reactions by generating plausible combinations of RDM patterns, which may then be related to possible paralogs of enzyme genes.
Glycosyltransferase reactions
Functional glycomics has been a most successful area for integrated analysis of genomic and chemical information (9). The carbohydrate sequence of glycans is determined by a specific set of biosynthetic reactions catalyzed by different types of glycosyltransferases. Thus, once we know the repertoire of glycosyltransferases in the genome or in the transcriptome, it should in principle be possible to predict the repertoire of glycan structures. Conversely, the knowledge about glycan structures can be used to search and annotate new glycosyltransferases. Composite Structure Map in KEGG GLYCAN is a tool for converting genomic or transcriptomic data to glycan structure variations based on a curated set of known glycosyltransferase reactions.
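The idea of predicting a glycan repertoire from a glycosyltransferase repertoire can be sketched as a reachability search. This is a deliberately simplified illustration and not the Composite Structure Map tool: glycans are treated as linear chains rather than branched trees, and the enzyme names (`GT1` to `GT3`) and their rules are made up for the example.

```python
def reachable_glycans(start, rules, enzymes_present, max_length=4):
    """Enumerate linear glycan chains reachable from `start`.

    rules: {enzyme: (terminal_residue_required, residue_added)}
    enzymes_present: enzymes found in the genome or transcriptome
    """
    found = {start}
    frontier = [start]
    while frontier:
        chain = frontier.pop()
        if len(chain) >= max_length:
            continue
        for enzyme in enzymes_present:
            required, added = rules[enzyme]
            if chain[-1] == required:
                extended = chain + (added,)
                if extended not in found:
                    found.add(extended)
                    frontier.append(extended)
    return found


# Hypothetical glycosyltransferase rules: (acceptor terminal, sugar added).
rules = {
    "GT1": ("GlcNAc", "Gal"),
    "GT2": ("Gal", "Neu5Ac"),
    "GT3": ("Gal", "Fuc"),
}

# Only GT1 and GT2 are assumed to be present in this genome.
glycans = reachable_glycans(("GlcNAc",), rules, enzymes_present={"GT1", "GT2"})
```

Because `GT3` is absent from `enzymes_present`, no fucosylated chain appears in `glycans`; adding it to the set would extend the predicted repertoire accordingly.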
ACCESSING KEGG
Web and FTP
KEGG is the major component of the Japanese GenomeNet, which is served by the Kyoto University Bioinformatics Center. The other GenomeNet services including DBGET and BLAST/FASTA searches are now primarily developed and used to support KEGG. The official URL for GenomeNet has been modified to http://www.genome.jp/, but the former URL http://www.genome.ad.jp/ will still be made available (Table 1). To download the KEGG data, academic users may use the GenomeNet FTP site.
KEGG API
The KEGG API service has become an increasingly popular mode of access. It is the SOAP/WSDL interface to KEGG, enabling users to write their own programs to access, customize and utilize KEGG.
KegArray and KegDraw
KegArray and KegDraw are standalone Java applications that make use of the KEGG resources. KegArray is for microarray data analysis in conjunction with KEGG pathways and genomes. KegDraw is for drawing glycan structures and chemical compound structures, which can then be used to query against KEGG and PubChem databases. Both are freely available to academic and non-academic users.
Figure 1
The overall architecture of KEGG, which now consists of four main components. KEGG BRITE has been formally added to establish a logical foundation for inference of higher-order functions.
Table 1
URLs for the KEGG resource
Table 2
Functional hierarchies in KEGG BRITE
Category | Hierarchy |
---|---|
Network hierarchy | KO |
Protein families | Enzymes |
 | Transcription factors |
 | Ribosome |
 | Translation factors |
 | ABC transporters |
 | G-protein-coupled receptors |
 | Ion channels |
 | Cytokines |
 | Cytokine receptors |
 | Cell adhesion molecules (CAMs) |
 | CAM ligands |
 | CD molecules |
 | Bacterial motility proteins |
Compounds | Compounds with biological roles |
 | Lipids |
 | Phytochemical compounds |
Compound interactions | Ion channel agonists/antagonists |
 | Cytochrome P450 substrates |
Drugs | Therapeutic category of drugs |
 | Drug classification |
Diseases | Disease genes, genomes and pathways |
Organisms | KEGG organisms |
As of September 12, 2005.
The KEGG project is supported by the Institute for Bioinformatics Research and Development of the Japan Science and Technology Agency, the 21st Century COE program ‘Genome Science’, and a grant-in-aid for scientific research on the priority area from the Ministry of Education, Culture, Sports, Science and Technology of Japan. The computational resources were provided by the Bioinformatics Center, Institute for Chemical Research, Kyoto University. Funding to pay the Open Access publication charges for this article was provided by the grant-in-aid for scientific research.
Conflict of interest statement. None declared.
REFERENCES
1. Kanehisa, M. (1997) A database for post-genome analysis. Trends Genet., 13, 375–376.
2. Kanehisa, M., Goto, S., Kawashima, S., Okuno, Y., Hattori, M. (2004) The KEGG resource for deciphering the genome. Nucleic Acids Res., 32, D277–D280.
3. Pruitt, K.D., Tatusova, T., Maglott, D.R. (2005) NCBI Reference Sequence (RefSeq): a curated non-redundant sequence database of genomes, transcripts and proteins. Nucleic Acids Res., 33, D501–D504.
4. Kanehisa, M., Goto, S., Kawashima, S., Nakaya, A. (2002) The KEGG databases at GenomeNet. Nucleic Acids Res., 30, 42–46.
5. Tatusov, R.L., Natale, D.A., Garkavtsev, I.V., Tatusova, T.A., Shankavaram, U.T., Rao, B.S., Kiryutin, B., Galperin, M.Y., Fedorova, N.D., Koonin, E.V. (2001) The COG database: new developments in phylogenetic classification of proteins from complete genomes. Nucleic Acids Res., 29, 22–28.
6. Goto, S., Nishioka, T., Kanehisa, M. (1998) LIGAND: chemical database for enzyme reactions. Bioinformatics, 14, 591–599.
7. Hattori, M., Okuno, Y., Goto, S., Kanehisa, M. (2003) Development of a chemical structure comparison method for integrated analysis of chemical and genomic information in the metabolic pathways. J. Am. Chem. Soc., 125, 11853–11865.
8. Kotera, M., Okuno, Y., Hattori, M., Goto, S., Kanehisa, M. (2004) Computational assignment of the EC numbers for genomic-scale analysis of enzymatic reactions. J. Am. Chem. Soc., 126, 16487–16498.
9. Hashimoto, K., Goto, S., Kawano, S., Aoki-Kinoshita, K.F., Ueda, N., Hamajima, M., Kawasaki, T., Kanehisa, M. (2005) KEGG as a glycome informatics resource. Glycobiology, in press.
© The Author 2006. Published by Oxford University Press. All rights reserved. The online version of this article has been published under an open access model. Users are entitled to use, reproduce, disseminate, or display the open access version of this article for non-commercial purposes provided that: the original authorship is properly and fully attributed; the Journal and Oxford University Press are attributed as the original place of publication with the correct citation details given; if an article is subsequently reproduced or disseminated not in its entirety but only in part or as a derivative work this must be clearly indicated. For commercial re-use, please contact journals.permissions@oxfordjournals.org