SMART 6: recent updates and new developments

Ivica Letunic, Tobias Doerks, Peer Bork*

EMBL, Meyerhofstrasse 1, 69012 Heidelberg, Germany

*To whom correspondence should be addressed. Tel: +49 6221 387 8526; Fax: +49 6221 387 8517; Email: bork@embl.de

Received: 15 September 2008. Revision received: 09 October 2008. Accepted: 10 October 2008. Published: 31 October 2008.

Ivica Letunic, Tobias Doerks, Peer Bork, SMART 6: recent updates and new developments, Nucleic Acids Research, Volume 37, Issue suppl_1, 1 January 2009, Pages D229–D232, https://doi.org/10.1093/nar/gkn808

Abstract

Simple modular architecture research tool (SMART) is an online tool (http://smart.embl.de/) for the identification and annotation of protein domains. It provides a user-friendly platform for the exploration and comparative study of domain architectures in both proteins and genes. The current release of SMART contains manually curated models for 784 protein domains. Recent developments have focused on further data integration and improved user friendliness. The underlying protein database, based on completely sequenced genomes, was greatly expanded and now includes 630 species, compared to 191 in the previous release. As an initial step towards integrating information on biological pathways into SMART, our domain annotations were extended with data on metabolic pathways and links to several pathway resources. The interaction network view was completely redesigned and is now available for more than 2 million proteins. In addition to the standard web access to the database, users can now query SMART using the distributed annotation system (DAS) or through a web service based on the simple object access protocol (SOAP).

INTRODUCTION

Protein domain databases remain important annotation and research tools. Simple modular architecture research tool (SMART) is one of the earliest such databases and was originally focused on mobile domains (1). It contains manually curated hidden Markov models for many domains, which are accessible via a web interface; the data can also be downloaded. SMART remains popular and is heavily used by the general scientific community. Here we summarize the major changes and new features that have been introduced since our last report (2).

EXPANDED DOMAIN COVERAGE

Although SMART was not intended to be exhaustive, its domain coverage continues to expand. The current release introduces 120 new domains, around 10% of which are unique to SMART, bringing the total number close to 800. Even though the rate of discovery of novel domains is falling (3,4), the annotation of domains is far from finished: many known domain families have suboptimal definitions because they were created with automatic or semiautomatic methods. High-quality alignments and proper functional annotation require expertise and a great amount of manual work. This is illustrated by the creation of new sequence profiles for a number of characteristic domains of a subfamily of polyketide biosynthesis proteins, the type I polyketide synthases (PKS I). This protein family synthesizes a highly diverse group of secondary metabolites that cover many biological functions and have considerable medical relevance (5). PKS I multidomain proteins contain several predominantly enzymatic domains, which act in repetitive steps, for example in the synthesis of antibiotics. PKS I proteins usually contain at least an acyltransferase (PKS_AT) domain, a ketoacylsynthase (PKS_KS) domain and an acyl carrier protein (PKS_PP) domain. Additionally, ketoreductase (PKS_KR), dehydratase (PKS_DH), enoylreductase (PKS_ER), methyltransferase (PKS_MT) and thioesterase (PKS_TE) domains can be found. As PKS I proteins are homologous to several enzymes in fatty acid biosynthesis, existing profiles are not able to distinguish between the two functionalities. Our new, hand-adjusted multiple sequence alignments and the hidden Markov models derived from them, combined with manually established cut-offs, selectively identify PKS I proteins above the background of many related enzymes such as fatty acid synthases. The selection of cut-offs for individual domains was based on a sophisticated tree-building procedure (6).
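The role of per-domain cut-offs can be illustrated with a minimal sketch. It assumes that profile searches have already produced (domain, bit score) hits for a protein; the cut-off values and the function names below are hypothetical and are not the values used in SMART.

```python
# Hypothetical per-domain bit-score cut-offs (illustrative values only;
# the manually established SMART cut-offs are not given in the text).
CUTOFFS = {"PKS_AT": 150.0, "PKS_KS": 200.0, "PKS_PP": 30.0, "PKS_KR": 120.0}

# Domains that the text says a PKS I protein usually contains at least.
CORE_DOMAINS = {"PKS_AT", "PKS_KS", "PKS_PP"}


def accepted_domains(hits, cutoffs=CUTOFFS):
    """Return the set of domains whose hit score reaches the domain's cut-off.

    hits: iterable of (domain_name, bit_score) pairs.
    """
    return {name for name, score in hits
            if name in cutoffs and score >= cutoffs[name]}


def looks_like_pks_i(hits, cutoffs=CUTOFFS):
    """True if all core PKS I domains are present above their cut-offs."""
    return CORE_DOMAINS <= accepted_domains(hits, cutoffs)


# A strong candidate versus a weaker, fatty-acid-synthase-like hit pattern.
candidate = [("PKS_KS", 310.2), ("PKS_AT", 180.5), ("PKS_PP", 42.0)]
fas_like = [("PKS_KS", 95.0), ("PKS_AT", 160.0), ("PKS_PP", 35.0)]
```

In this sketch the second protein is rejected because its PKS_KS score stays below the domain-specific cut-off, which is the kind of separation from related enzymes that the cut-offs are meant to provide.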

NEW AND UPDATED PROTEIN DATABASES

Protein database redundancy creates significant difficulties in protein domain architecture analyses. Users looking at genome-wide domain counts often end up with wrong and highly inflated numbers. To remedy this problem, the previous release of SMART (2) introduced a ‘genomic’ analysis mode, which uses only proteins from completely sequenced genomes. In that initial release, the protein database included 170 genomes that were available in SWISS-PROT (7) and Ensembl (8). With the new release of SMART, we have greatly expanded this database, which now contains proteins from 630 completely sequenced genomes (55 Eukaryota, 46 Archaea and 529 Bacteria).

In addition to the expanded genomic mode protein database, SMART uses a new procedure to create the default nonredundant protein database used in the ‘normal’ analysis mode. The main source of protein sequences is UniProt (9), complemented with the full set of stable genomes from Ensembl. To reduce the high redundancy inherent in these databases, we have implemented a per-species protein clustering procedure. All proteins are first separated into species-specific databases. Each of these databases is clustered separately using the CD-HIT algorithm (10) with a 96% identity cut-off. The longest member of each cluster is used as its ‘representative’; only these representatives and the non-clustered proteins are included in the database. This procedure significantly improves the results of all domain architecture queries and brings the domain counts down to levels comparable to the genomic mode database.
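The per-species procedure can be sketched as follows. This is an illustration only: it replaces CD-HIT with a simple greedy, longest-first clustering and uses a toy identity measure based on Python's `difflib`, so the function names and the identity calculation are assumptions rather than the actual SMART pipeline.

```python
from collections import defaultdict
from difflib import SequenceMatcher


def identity(a, b):
    """Toy stand-in for alignment identity: matched residues / shorter length."""
    blocks = SequenceMatcher(None, a, b, autojunk=False).get_matching_blocks()
    return sum(block.size for block in blocks) / min(len(a), len(b))


def representatives(proteins, cutoff=0.96):
    """Pick one representative per cluster, clustering each species separately.

    proteins: iterable of (protein_id, species, sequence) tuples.
    Returns a list of (protein_id, species) for the retained proteins.
    """
    # Step 1: separate proteins into species-specific sets.
    by_species = defaultdict(list)
    for protein_id, species, sequence in proteins:
        by_species[species].append((protein_id, sequence))

    kept = []
    for species, members in by_species.items():
        species_reps = []
        # Step 2: process longest first, so that the representative of each
        # cluster is its longest member.
        for protein_id, sequence in sorted(
                members, key=lambda m: len(m[1]), reverse=True):
            redundant = any(identity(sequence, rep_seq) >= cutoff
                            for _, rep_seq in species_reps)
            if not redundant:
                species_reps.append((protein_id, sequence))
        kept.extend((protein_id, species) for protein_id, _ in species_reps)
    return kept


example = [
    ("A1", "human", "MKT" * 20),          # full-length protein
    ("A2", "human", ("MKT" * 20)[:50]),   # fragment of A1, same species
    ("B1", "human", "GGSW" * 15),         # unrelated protein
    ("A3", "mouse", "MKT" * 20),          # identical sequence, other species
]
```

Here the fragment `A2` is absorbed into the cluster represented by `A1`, while `A3` is kept because clustering never crosses species boundaries.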

INTEGRATION OF BIOLOGICAL PATHWAYS DATA

In the current release, we have started integrating biological pathway information into SMART. Initially, this is limited to metabolic pathways, with further expansions planned for future releases. We have mapped the complete genomic mode protein database to the KEGG (11) orthologous groups and their corresponding metabolic pathways. This information is available directly in the protein annotation pages, for more than 1 million proteins (Figure 1). Additionally, this information was used to generate an overview of the presence of each domain in different parts of metabolism. Each domain's annotation page includes a new ‘Metabolic pathways’ entry, which lists the pathways where the domain is present (Figure 2). In addition to the basic statistics, the metabolic pathways information for both proteins and domains is also displayed on the global overview map of the metabolism (11), with an interactive version of the maps provided by iPath, the interactive Pathways Explorer (12).
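The step from protein-level pathway mappings to a per-domain overview can be sketched as a simple aggregation. All identifiers in the example data are placeholders, and the function name and data layout are assumptions made for illustration.

```python
from collections import Counter, defaultdict


def domain_pathway_counts(protein_domains, protein_group, group_pathways):
    """Count, for every domain, the proteins carrying it in each pathway.

    protein_domains: protein id -> list of domain names found in the protein
    protein_group:   protein id -> orthologous group id (if mapped)
    group_pathways:  orthologous group id -> list of metabolic pathways
    """
    counts = defaultdict(Counter)
    for protein, domains in protein_domains.items():
        group = protein_group.get(protein)
        if group is None:
            continue  # protein has no orthologous group assignment
        for pathway in group_pathways.get(group, ()):
            # set() so that a repeated domain counts once per protein
            for domain in set(domains):
                counts[domain][pathway] += 1
    return counts


# Placeholder data (not real SMART or KEGG identifiers).
protein_domains = {
    "prot1": ["DomA", "DomB", "DomA"],
    "prot2": ["DomA"],
    "prot3": ["DomC"],
}
protein_group = {"prot1": "group1", "prot2": "group2"}
group_pathways = {
    "group1": ["pathway X"],
    "group2": ["pathway X", "pathway Y"],
}

overview = domain_pathway_counts(protein_domains, protein_group, group_pathways)
```

The resulting per-domain counters correspond to the kind of list shown in a domain's pathway entry: each pathway together with the number of proteins that contain the domain.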

Figure 1. Protein annotation page for Mus musculus phospholipase C, delta 1. More than 600 000 proteins are linked to the KEGG metabolic pathways and orthologous groups, and can be displayed in the interactive Pathways Explorer (12), which provides an interactive, global overview map of the metabolism. Interaction network information has been greatly expanded, and is available for about 2.5 million proteins.

Figure 2. SMART annotation page for the HDc domain. A new feature of SMART domain annotation pages is the ‘Metabolism’ entry. Based on the mapping of SMART genomic protein database to KEGG orthologous groups, it gives an overview of the domain's presence in various metabolic pathways. Matching pathways are also displayed on the global metabolism overview map (11), with a link to the interactive version, provided by the interactive Pathways Explorer.

EXPANDED PROTEIN INTERACTION DATA

The expansion of the protein database based on completely sequenced genomes allowed SMART to significantly extend the information on putative protein interaction partners. This data is now available for about 2.5 million proteins, compared to 350 000 in the previous release. Interaction network data has been expanded and updated, and is displayed using completely redesigned summary graphics, which are easier to read and interpret. The data has been imported from the STRING database (13), and is synchronized with its version 8 release.

DATABASE AND WEB SERVER OPTIMIZATIONS

With the ever-increasing amount of sequence information available, domain annotation tools such as SMART constantly face new challenges in providing fast and user-friendly interfaces to the underlying data. The core of SMART is a relational database management system (RDBMS), which stores the annotation of all SMART domains and the pre-calculated protein analyses for the complete UniProt (9) and Ensembl (8) sequence databases. To keep the response times of the server acceptable, many parts of the database access code have been greatly optimized and the database itself has been restructured. Additionally, the server was distributed onto a hardware cluster with different tasks assigned to dedicated machines, resulting in a greatly expanded load capacity.

USER INTERFACE IMPROVEMENTS AND TECHNICAL CHANGES

Many parts of SMART's web interface have been updated and streamlined. Protein analysis pages now include extended information on all detected SMART domains, which is dynamically loaded on user request. In addition to SMART domains, we now also display basic annotation for all detected Pfam (14) domains, such as the InterPro (15) abstract and annotated Gene Ontology (16) terms.

Domain annotation pages have also been redesigned and updated. Information on domain presence in 3D structures has been expanded and includes PDB (17) titles and the basic graphical representation of the structure.

With version 6, SMART offers two new modes of database access, oriented towards advanced users. The distributed annotation system (DAS) (18) allows access to sequence annotation data on an as-needed basis and offers users an easy way of integrating multiple annotation sources in a single client-side interface. SMART domain annotations for the complete UniProt and Ensembl protein databases are accessible as DAS XML at the URL http://smart.embl.de/smart/das.
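A DAS features request and the handling of its XML response might look roughly like the sketch below. The base URL is the one given above; the data source name `smart`, the protein identifier and the sample response are assumptions made for illustration (the sample follows the general DASGFF layout of the DAS 1 specification and is not real SMART output). No network request is made.

```python
import xml.etree.ElementTree as ET
from urllib.parse import quote

DAS_BASE = "http://smart.embl.de/smart/das"  # URL given in the text
DAS_SOURCE = "smart"  # assumption: the data source name is not stated


def features_url(protein_id, base=DAS_BASE, source=DAS_SOURCE):
    """Build a DAS 'features' request URL for one protein (segment)."""
    return f"{base}/{source}/features?segment={quote(protein_id)}"


def parse_features(xml_text):
    """Extract (label, start, end) tuples from a DASGFF features document."""
    root = ET.fromstring(xml_text)
    return [
        (feature.get("label"),
         int(feature.findtext("START")),
         int(feature.findtext("END")))
        for feature in root.iter("FEATURE")
    ]


# Illustrative response only; identifiers and coordinates are made up.
SAMPLE_RESPONSE = """<DASGFF>
  <GFF version="1.0">
    <SEGMENT id="EXAMPLE_PROTEIN" start="1" stop="400">
      <FEATURE id="f1" label="DOMAIN_A">
        <TYPE id="domain">domain</TYPE>
        <START>10</START>
        <END>120</END>
      </FEATURE>
      <FEATURE id="f2" label="DOMAIN_B">
        <TYPE id="domain">domain</TYPE>
        <START>150</START>
        <END>390</END>
      </FEATURE>
    </SEGMENT>
  </GFF>
</DASGFF>"""

domains = parse_features(SAMPLE_RESPONSE)
```

A DAS client would fetch the document at `features_url(...)` and pass the response body to `parse_features`; the available data source names should be checked against the server before relying on the assumed value.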

In addition to DAS, SMART can also be accessed through a web service, with a web service definition language (WSDL) service description file available at http://smart.embl.de/webservice. The SMART web service uses the simple object access protocol (SOAP) for all input and output messages and accepts both protein sequence identifiers and raw amino acid sequences.
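The general shape of a SOAP request that carries either a protein identifier or a raw sequence can be sketched as follows. The envelope namespace is the standard SOAP 1.1 one; the service namespace, the operation name and the parameter names are hypothetical placeholders, since the actual names are defined by the WSDL file and are not given in the text.

```python
import xml.etree.ElementTree as ET

SOAP_ENV = "http://schemas.xmlsoap.org/soap/envelope/"  # SOAP 1.1 envelope
SERVICE_NS = "urn:example:smart"   # hypothetical; the real one is in the WSDL
OPERATION = "annotateSequence"     # hypothetical operation name


def build_request(sequence=None, protein_id=None):
    """Build a SOAP envelope carrying either a raw sequence or a protein id."""
    if (sequence is None) == (protein_id is None):
        raise ValueError("provide exactly one of sequence or protein_id")

    ET.register_namespace("soapenv", SOAP_ENV)
    envelope = ET.Element(f"{{{SOAP_ENV}}}Envelope")
    body = ET.SubElement(envelope, f"{{{SOAP_ENV}}}Body")
    call = ET.SubElement(body, f"{{{SERVICE_NS}}}{OPERATION}")

    # Parameter element names are placeholders as well.
    if sequence is not None:
        ET.SubElement(call, f"{{{SERVICE_NS}}}sequence").text = sequence
    else:
        ET.SubElement(call, f"{{{SERVICE_NS}}}proteinId").text = protein_id
    return ET.tostring(envelope, encoding="unicode")


request_xml = build_request(sequence="MKTAYIAKQR")
```

In practice a SOAP client library would generate such messages directly from the WSDL description, which is the more robust route once the real operation names are known.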

These new access modes simplify the integration of SMART annotation data into other resources and make the analysis of large datasets easier.

CONCLUSION

Since the initial conception of SMART in the mid-1990s, our goal has been to provide a useful biological web resource, characterized by high-quality underlying data and a powerful yet simple user interface. We continue to modestly expand our coverage and implement new features to make using SMART a better and more enjoyable experience for both existing and new users.

FUNDING

Funding for open access charge: EMBL (European Molecular Biology Laboratory).

Conflict of interest statement. None declared.

REFERENCES

1. SMART, a simple modular architecture research tool: identification of signaling domains, Proc. Natl Acad. Sci. USA, 1998, vol. 95 (pg. 5857-5864)

2. SMART 5: domains in the context of genomes and networks, Nucleic Acids Res., 2006, vol. 34 (pg. D257-D260)

3. Protein domain analysis in the era of complete genomes, FEBS Lett., 2002, vol. 513 (pg. 129-134)

4. Exhaustive enumeration of protein domain families, J. Mol. Biol., 2003, vol. 328 (pg. 749-767)

5. Polyketide biosynthesis: a millennium review, Nat. Prod. Rep., 2001, vol. 18 (pg. 380-416)

6. A computational screen for type I polyketide synthases in metagenomics shotgun data, PLoS ONE, 2008, vol. 3 pg. e3515, doi:10.1371/journal.pone.0003515

7. et al. The SWISS-PROT protein knowledgebase and its supplement TrEMBL in 2003, Nucleic Acids Res., 2003, vol. 31 (pg. 365-370)

8. et al. Ensembl 2008, Nucleic Acids Res., 2008, vol. 36 (pg. D707-D714)

9. The UniProt Consortium. The universal protein resource (UniProt), Nucleic Acids Res., 2008, vol. 36 (pg. D190-D195)

10. Cd-hit: a fast program for clustering and comparing large sets of protein or nucleotide sequences, Bioinformatics, 2006, vol. 22 (pg. 1658-1659)

11. et al. KEGG for linking genomes to life and the environment, Nucleic Acids Res., 2008, vol. 36 (pg. D480-D484)

12. iPath: interactive exploration of biochemical pathways and networks, Trends Biochem. Sci., 2008, vol. 33 (pg. 101-103)

13. STRING 7–recent developments in the integration and prediction of protein interactions, Nucleic Acids Res., 2007, vol. 35 (pg. D358-D362)

14. et al. The Pfam protein families database, Nucleic Acids Res., 2008, vol. 36 (pg. D281-D288)

15. et al. New developments in the InterPro database, Nucleic Acids Res., 2007, vol. 35 (pg. D224-D228)

16. Gene Ontology Consortium. The Gene Ontology (GO) project in 2006, Nucleic Acids Res., 2006, vol. 34 (pg. D322-D326)

17. The Protein Data Bank and structural genomics, Nucleic Acids Res., 2003, vol. 31 (pg. 489-491)

18. The distributed annotation system, BMC Bioinformatics, 2001, vol. 2 pg. 7

© 2008 The Author(s)

This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/by-nc/2.0/uk/) which permits unrestricted non-commercial use, distribution, and reproduction in any medium, provided the original work is properly cited.
