SMART 6: recent updates and new developments
Ivica Letunic, Tobias Doerks, Peer Bork*
EMBL, Meyerhofstrasse 1, 69012 Heidelberg, Germany
*To whom correspondence should be addressed. Tel: +49 6221 387 8526; Fax: +49 6221 387 8517; Email: bork@embl.de
Received: 15 September 2008
Revision received: 09 October 2008
Accepted: 10 October 2008
Published: 31 October 2008
Ivica Letunic, Tobias Doerks, Peer Bork, SMART 6: recent updates and new developments, Nucleic Acids Research, Volume 37, Issue suppl_1, 1 January 2009, Pages D229–D232, https://doi.org/10.1093/nar/gkn808
Abstract
Simple modular architecture research tool (SMART) is an online tool (http://smart.embl.de/) for the identification and annotation of protein domains. It provides a user-friendly platform for the exploration and comparative study of domain architectures in both proteins and genes. The current release of SMART contains manually curated models for 784 protein domains. Recent developments were focused on further data integration and improved user friendliness. The underlying protein database based on completely sequenced genomes was greatly expanded and now includes 630 species, compared to 191 in the previous release. As an initial step towards integrating information on biological pathways into SMART, our domain annotations were extended with data on metabolic pathways and links to several pathway resources. The interaction network view was completely redesigned and is now available for more than 2 million proteins. In addition to the standard web access to the database, users can now query SMART using the distributed annotation system (DAS) or through a simple object access protocol (SOAP)-based web service.
INTRODUCTION
Protein domain databases remain important annotation and research tools. The simple modular architecture research tool (SMART) is one of the earliest and was originally focused on mobile domains (1). It contains manually curated hidden Markov models for many domains, which are accessible via a web interface and can also be downloaded. SMART remains popular and is heavily used by the general scientific community. Here we summarize the major changes and new features that have been introduced since our last report (2).
EXPANDED DOMAIN COVERAGE
Although SMART was not intended to be exhaustive, it continues to expand its domain coverage. The current release introduces 120 new domains, around 10% of which are unique to SMART, bringing the total close to 800. Even though the rate of discovery of novel domains is falling (3,4), domain annotation is far from finished: many known domain families have suboptimal definitions because they were created with automatic or semiautomatic methods. Reaching a high quality of the underlying alignments requires expertise and a great amount of manual work for proper functional annotation. This is illustrated by the creation of new sequence profiles for a number of characteristic domains of a subfamily of polyketide biosynthesis proteins, the type I polyketide synthases (PKS I). This protein family synthesizes a highly diverse group of secondary metabolites that cover many biological functions and have considerable medical relevance (5). PKS I multidomain proteins contain several predominantly enzymatic domains, used for example in the synthesis of antibiotics through different repetitive steps. PKS I proteins usually contain at least an acyltransferase (PKS_AT) domain, a ketoacylsynthase (PKS_KS) domain and an acyl carrier protein (PKS_PP) domain. Additionally, ketoreductase (PKS_KR), dehydratase (PKS_DH), enoylreductase (PKS_ER), methyltransferase (PKS_MT) and thioesterase (PKS_TE) domains can be found. As PKS I proteins are homologous to several enzymes in fatty acid biosynthesis, existing profiles are not able to distinguish between the two functionalities. Our new, hand-adjusted multiple sequence alignments and the hidden Markov models derived from them, combined with manually established cut-offs, allow the selective identification of PKS I proteins above the background of many related enzymes such as fatty acid synthases. The selection of cut-offs for individual domains was based on a sophisticated tree-building procedure (6).
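The domain composition described above can be expressed as a simple architecture check. The following Python sketch is illustrative only: the function name `classify_pks1_architecture` and its input format (a list of domain names already predicted for one protein) are assumptions, and it presupposes that the hits come from the PKS-specific models with their curated cut-offs, which the code does not attempt to reproduce.

```python
# Domain names as given in the text; the split into "core" and "optional"
# follows the description of PKS I proteins above.
CORE_DOMAINS = {"PKS_AT", "PKS_KS", "PKS_PP"}
OPTIONAL_DOMAINS = {"PKS_KR", "PKS_DH", "PKS_ER", "PKS_MT", "PKS_TE"}


def classify_pks1_architecture(domains):
    """Compare one protein's predicted domains with the PKS I pattern.

    `domains` is a list of predicted domain names (hypothetical input
    format). Returns the missing core domains, the optional domains that
    were found, and a flag that is True when all core domains are present.
    """
    found = set(domains)
    missing_core = sorted(CORE_DOMAINS - found)
    optional_found = sorted(OPTIONAL_DOMAINS & found)
    return {
        "is_candidate": not missing_core,
        "missing_core": missing_core,
        "optional_found": optional_found,
    }
```

For example, `classify_pks1_architecture(["PKS_KS", "PKS_AT", "PKS_DH", "PKS_KR", "PKS_PP"])` flags the protein as a candidate, whereas a protein with only PKS_KS and PKS_KR hits is reported as lacking PKS_AT and PKS_PP.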
NEW AND UPDATED PROTEIN DATABASES
Protein database redundancy creates significant difficulties in protein domain architecture analyses. Users looking at genome-wide domain counts often end up with highly inflated numbers. To remedy this problem, the previous release of SMART (2) introduced a ‘genomic’ analysis mode, which uses only proteins from completely sequenced genomes. In the initial release, this protein database included 170 genomes, which were available in SWISS-PROT (7) and ENSEMBL (8). With the new release of SMART, we have greatly expanded this database, which now contains proteins from 630 completely sequenced genomes (55 Eukaryota, 46 Archaea and 529 Bacteria).
In addition to the expanded genomic mode protein database, SMART uses a new procedure to create the default nonredundant protein database that is used in the ‘normal’ analysis mode. The main source of protein sequences is UniProt (9), complemented with the full set of stable genomes from ENSEMBL. To reduce the high redundancy that is inherently present in these databases, we have implemented a per-species protein clustering procedure. All proteins are initially separated into species-specific databases. Each of these databases is clustered separately using the CD-HIT algorithm (10) with a 96% identity cutoff. The longest member of each cluster is used as its ‘representative’; only these representatives, together with non-clustered proteins, are included in the database. This procedure significantly improves the results of all domain architecture queries and brings the domain counts to lower levels, comparable to the genomic mode database.
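The per-species procedure can be sketched as follows. This is not CD-HIT itself: it is a toy stand-in loosely modelled on CD-HIT's greedy, longest-first strategy, and the `identity` function (based on Python's `difflib`) is only a rough approximation of sequence identity. The function names and the `(protein_id, species, sequence)` input format are assumptions made for illustration.

```python
from collections import defaultdict
from difflib import SequenceMatcher


def identity(a, b):
    """Toy identity: matched characters divided by the shorter length."""
    if not a or not b:
        return 0.0
    matcher = SequenceMatcher(None, a, b, autojunk=False)
    matched = sum(block.size for block in matcher.get_matching_blocks())
    return matched / min(len(a), len(b))


def cluster_per_species(proteins, cutoff=0.96):
    """Return species -> list of representative protein ids.

    `proteins` is an iterable of (protein_id, species, sequence) tuples.
    Sequences are first split by species; within each species they are
    visited from longest to shortest, and a sequence becomes a new
    representative only if it is below the cutoff to every existing one.
    """
    by_species = defaultdict(list)
    for protein_id, species, sequence in proteins:
        by_species[species].append((protein_id, sequence))

    representatives = {}
    for species, members in by_species.items():
        reps = []
        for protein_id, sequence in sorted(
            members, key=lambda m: len(m[1]), reverse=True
        ):
            if not any(identity(sequence, rep) >= cutoff for _, rep in reps):
                reps.append((protein_id, sequence))
        representatives[species] = [protein_id for protein_id, _ in reps]
    return representatives
```

Because sequences are visited longest first, each cluster is represented by its longest member, and identical sequences from different species are never merged, since clustering happens within one species at a time.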
INTEGRATION OF BIOLOGICAL PATHWAYS DATA
In the current release, we have started integrating biological pathway information into SMART. Initially, this is limited to metabolic pathways, with further expansions planned for future releases. We have mapped the complete genomic mode protein database to the KEGG (11) orthologous groups and their corresponding metabolic pathways. This information is available directly in the protein annotation pages for more than 1 million proteins (Figure 1). Additionally, this information was used to generate an overview of the presence of each domain in different parts of metabolism. Each domain's annotation page includes a new ‘Metabolic pathways’ entry, which lists the pathways where the domain is present (Figure 2). In addition to the basic statistics, the metabolic pathway information for both proteins and domains is also displayed on the global overview map of metabolism (11), with an interactive version of the maps provided by iPath, the interactive Pathways Explorer (12).
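The step from a protein-level mapping to a per-domain pathway overview can be illustrated with a small Python sketch. The data structures, the function name `domains_to_pathways` and all identifiers in the usage note are placeholders chosen for illustration; they are not taken from SMART or KEGG.

```python
from collections import defaultdict


def domains_to_pathways(protein_domains, protein_group, group_pathways):
    """Count, for every domain, the proteins carrying it in each pathway.

    protein_domains: protein id -> set of domain names
    protein_group:   protein id -> orthologous group id (unmapped
                     proteins are simply absent)
    group_pathways:  orthologous group id -> set of pathway names
    """
    counts = defaultdict(lambda: defaultdict(int))
    for protein, domains in protein_domains.items():
        group = protein_group.get(protein)
        if group is None:
            continue  # protein is not mapped to an orthologous group
        for pathway in group_pathways.get(group, ()):
            for domain in domains:
                counts[domain][pathway] += 1
    # Convert the nested defaultdicts to plain dicts for easier inspection.
    return {domain: dict(per_pathway) for domain, per_pathway in counts.items()}
```

With three placeholder proteins, of which two are mapped to groups, a domain present in both mapped proteins is counted once per protein in every pathway those groups belong to, while the unmapped protein contributes nothing.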
Figure 1. Protein annotation page for Mus musculus phospholipase C, delta 1. More than 600 000 proteins are linked to the KEGG metabolic pathways and orthologous groups, and can be displayed in the interactive Pathways Explorer (12), which provides an interactive, global overview map of the metabolism. Interaction network information has been greatly expanded, and is available for about 2.5 million proteins.
Figure 2. SMART annotation page for the HDc domain. A new feature of SMART domain annotation pages is the ‘Metabolism’ entry. Based on the mapping of SMART genomic protein database to KEGG orthologous groups, it gives an overview of the domain's presence in various metabolic pathways. Matching pathways are also displayed on the global metabolism overview map (11), with a link to the interactive version, provided by the interactive Pathways Explorer.
EXPANDED PROTEIN INTERACTION DATA
The expansion of the protein database based on completely sequenced genomes allowed SMART to significantly extend the information on putative protein interaction partners. This data is now available for about 2.5 million proteins, compared to 350 000 in the previous release. Interaction network data has been expanded and updated, and is displayed using completely redesigned summary graphics, which are easier to read and interpret. The data has been imported from the STRING database (13), and is synchronized with its version 8 release.
DATABASE AND WEB SERVER OPTIMIZATIONS
With the ever-increasing amount of sequence information available, domain annotation tools such as SMART face constant new challenges in providing fast and user-friendly interfaces to the underlying data. The core of SMART is a relational database management system (RDBMS), which stores the annotation of all SMART domains and the pre-calculated protein analyses for the complete UniProt (9) and Ensembl (8) sequence databases. To keep the response times of the server acceptable, many parts of the database access code have been greatly optimized, and the database itself has been restructured. Additionally, the server was distributed onto a hardware cluster with different tasks assigned to dedicated machines, resulting in a greatly expanded load capacity.
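As a rough illustration of why pre-calculated analyses in a relational database keep lookups fast, the Python sketch below stores domain hits in an in-memory SQLite database and retrieves a protein's domain architecture with one indexed query. The table and column names are hypothetical and do not reflect SMART's actual schema or database engine.

```python
import sqlite3


def build_demo_database():
    """Create an in-memory database with a hypothetical two-table layout."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE domain (
            name        TEXT PRIMARY KEY,
            description TEXT
        );
        CREATE TABLE domain_hit (
            protein_id TEXT    NOT NULL,
            domain     TEXT    NOT NULL REFERENCES domain(name),
            start_pos  INTEGER NOT NULL,
            end_pos    INTEGER NOT NULL
        );
        -- Indexes let both lookup directions avoid full table scans.
        CREATE INDEX idx_hit_protein ON domain_hit(protein_id);
        CREATE INDEX idx_hit_domain  ON domain_hit(domain);
        """
    )
    return conn


def architecture(conn, protein_id):
    """Return the stored domains of one protein, ordered along the sequence."""
    rows = conn.execute(
        "SELECT domain FROM domain_hit WHERE protein_id = ? ORDER BY start_pos",
        (protein_id,),
    )
    return [row[0] for row in rows]
```

Since the domain hits are computed once and stored, answering a request reduces to an index lookup rather than a new profile search.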
USER INTERFACE IMPROVEMENTS AND TECHNICAL CHANGES
Many parts of SMART's web interface have been updated and streamlined. Protein analysis pages now include extended information on all detected SMART domains, which is dynamically loaded on user request. In addition to SMART domains, we now also display the basic annotation for all detected Pfam (14) domains, such as the InterPro (15) abstract and annotated Gene Ontology (16) terms.
Domain annotation pages have also been redesigned and updated. Information on domain presence in 3D structures has been expanded and includes PDB (17) titles and the basic graphical representation of the structure.
With version 6, SMART offers two new modes of database access, oriented towards advanced users. The distributed annotation system (DAS; 18) allows access to sequence annotation data on an as-needed basis, and offers users an easy way of integrating multiple annotation sources in a single client-side interface. SMART domain annotations for the complete UniProt and Ensembl protein databases are accessible as DAS XML at the URL http://smart.embl.de/smart/das.
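A DAS client typically issues a `features` request for a sequence identifier and parses the returned DASGFF XML. The Python sketch below shows both steps without contacting any server. The data source name passed to `features_url` and the sample document are assumptions made for illustration; the source names actually offered by the SMART DAS server should be taken from the server itself.

```python
import xml.etree.ElementTree as ET
from urllib.parse import quote, urlencode

DAS_BASE = "http://smart.embl.de/smart/das"  # URL given in the text


def features_url(source, segment_id, base=DAS_BASE):
    """Build a DAS 'features' request URL for one sequence identifier."""
    query = urlencode({"segment": segment_id})
    return f"{base}/{quote(source)}/features?{query}"


def parse_features(dasgff_xml):
    """Extract (label, type, start, end) tuples from a DASGFF document."""
    root = ET.fromstring(dasgff_xml)
    features = []
    for feature in root.iter("FEATURE"):
        features.append(
            (
                feature.get("label"),
                feature.findtext("TYPE"),
                int(feature.findtext("START")),
                int(feature.findtext("END")),
            )
        )
    return features


# Hand-written sample in the DASGFF layout; this is not real server output.
SAMPLE = """<?xml version="1.0"?>
<DASGFF>
  <GFF version="1.0" href="http://example.org/das/demo/features">
    <SEGMENT id="EXAMPLE_PROTEIN" start="1" stop="300">
      <FEATURE id="f1" label="PH">
        <TYPE id="domain">domain</TYPE>
        <METHOD id="demo">demo</METHOD>
        <START>20</START>
        <END>130</END>
      </FEATURE>
    </SEGMENT>
  </GFF>
</DASGFF>"""
```

Calling `parse_features(SAMPLE)` yields one tuple per annotated feature, which a client could then merge with features from other DAS sources for the same sequence.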
In addition to DAS, SMART can also be accessed through a web service, with a web service definition language (WSDL) service description file available at http://smart.embl.de/webservice. The SMART web service uses the simple object access protocol (SOAP) for all input and output messages and accepts both protein sequence identifiers and raw amino acid sequences.
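To give an impression of what a SOAP message looks like, the Python sketch below builds a SOAP 1.1 request envelope with the standard library. The service namespace `urn:example:smart-service` and any operation or parameter names passed to `build_request` are placeholders; the real names are defined in the WSDL file and are not reproduced here.

```python
import xml.etree.ElementTree as ET

SOAP_ENV = "http://schemas.xmlsoap.org/soap/envelope/"  # SOAP 1.1 namespace
SERVICE_NS = "urn:example:smart-service"  # placeholder, not the real namespace


def build_request(operation, **params):
    """Return a SOAP 1.1 envelope (as bytes) calling `operation`.

    Each keyword argument becomes one child element of the operation
    element, with the argument value as its text content.
    """
    ET.register_namespace("soapenv", SOAP_ENV)
    envelope = ET.Element(f"{{{SOAP_ENV}}}Envelope")
    body = ET.SubElement(envelope, f"{{{SOAP_ENV}}}Body")
    call = ET.SubElement(body, f"{{{SERVICE_NS}}}{operation}")
    for name, value in params.items():
        ET.SubElement(call, f"{{{SERVICE_NS}}}{name}").text = str(value)
    return ET.tostring(envelope, encoding="utf-8")
```

A call such as `build_request("analyzeSequence", sequence="MKV")`, with a hypothetical operation name, produces an envelope whose body carries the raw amino acid sequence; a request by sequence identifier would differ only in the parameter passed.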
These new access modes simplify the integration of SMART annotation data into other resources and make it easier to analyze large datasets.
CONCLUSION
Since the initial conception of SMART in the mid 1990s, our goal has been to provide a useful biological web resource, characterized by high quality of underlying data and a powerful, simple user interface. We continue to modestly expand our coverage and implement new features to make using SMART a better and more enjoyable experience for both existing and new users.
FUNDING
Funding for open access charge: EMBL (European Molecular Biology Laboratory).
Conflict of interest statement. None declared.
REFERENCES
1. SMART, a simple modular architecture research tool: identification of signaling domains, Proc. Natl Acad. Sci. USA, 1998, vol. 95 (pg. 5857-5864)
2. SMART 5: domains in the context of genomes and networks, Nucleic Acids Res., 2006, vol. 34 (pg. D257-D260)
3. Protein domain analysis in the era of complete genomes, FEBS Lett., 2002, vol. 513 (pg. 129-134)
4. Exhaustive enumeration of protein domain families, J. Mol. Biol., 2003, vol. 328 (pg. 749-767)
5. Polyketide biosynthesis: a millennium review, Nat. Prod. Rep., 2001, vol. 18 (pg. 380-416)
6. A computational screen for type I polyketide synthases in metagenomics shotgun data, PLoS ONE, 2008, vol. 3 pg. e3515, doi:10.1371/journal.pone.0003515
7. et al. The SWISS-PROT protein knowledgebase and its supplement TrEMBL in 2003, Nucleic Acids Res., 2003, vol. 31 (pg. 365-370)
8. et al. Ensembl 2008, Nucleic Acids Res., 2008, vol. 36 (pg. D707-D714)
9. The UniProt Consortium. The universal protein resource (UniProt), Nucleic Acids Res., 2008, vol. 36 (pg. D190-D195)
10. Cd-hit: a fast program for clustering and comparing large sets of protein or nucleotide sequences, Bioinformatics, 2006, vol. 22 (pg. 1658-1659)
11. et al. KEGG for linking genomes to life and the environment, Nucleic Acids Res., 2008, vol. 36 (pg. D480-D484)
12. iPath: interactive exploration of biochemical pathways and networks, Trends Biochem. Sci., 2008, vol. 33 (pg. 101-103)
13. STRING 7–recent developments in the integration and prediction of protein interactions, Nucleic Acids Res., 2007, vol. 35 (pg. D358-D362)
14. et al. The Pfam protein families database, Nucleic Acids Res., 2008, vol. 36 (pg. D281-D288)
15. et al. New developments in the InterPro database, Nucleic Acids Res., 2007, vol. 35 (pg. D224-D228)
16. Gene Ontology Consortium. The Gene Ontology (GO) project in 2006, Nucleic Acids Res., 2006, vol. 34 (pg. D322-D326)
17. The Protein Data Bank and structural genomics, Nucleic Acids Res., 2003, vol. 31 (pg. 489-491)
18. The distributed annotation system, BMC Bioinformatics, 2001, vol. 2 pg. 7
© 2008 The Author(s)
This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/by-nc/2.0/uk/) which permits unrestricted non-commercial use, distribution, and reproduction in any medium, provided the original work is properly cited.