STRING 7—recent developments in the integration and prediction of protein interactions

Journal Article

Christian von Mering (1,2,*), Lars J. Jensen (1), Michael Kuhn (1), Samuel Chaffron (1,2), Tobias Doerks (1), Beate Krüger (1), Berend Snel (3), Peer Bork (1,4)

(1) European Molecular Biology Laboratory, Meyerhofstrasse 1, 69117 Heidelberg, Germany

(2) University of Zurich, Winterthurerstrasse 190, 8057 Zurich, Switzerland

(3) Utrecht University, Padualaan 8, 3584 CH Utrecht, The Netherlands

(4) Max-Delbrück-Centre for Molecular Medicine, Robert-Rössle-Str. 10, 13092 Berlin, Germany

*To whom correspondence should be addressed. Tel: +41 44 6353147; Fax: +41 44 6356864; Email: mering@molbio.unizh.ch

The authors wish it to be known that, in their opinion, the first two authors should be regarded as joint First Authors

Received: 15 September 2006

Revision received: 05 October 2006

Accepted: 05 October 2006

Published: 10 November 2006

Christian von Mering, Lars J. Jensen, Michael Kuhn, Samuel Chaffron, Tobias Doerks, Beate Krüger, Berend Snel, Peer Bork, STRING 7—recent developments in the integration and prediction of protein interactions, Nucleic Acids Research, Volume 35, Issue suppl_1, 1 January 2007, Pages D358–D362, https://doi.org/10.1093/nar/gkl825

Abstract

Information on protein–protein interactions is still mostly limited to a small number of model organisms, and originates from a wide variety of experimental and computational techniques. The database and online resource STRING generalizes access to protein interaction data, by integrating known and predicted interactions from a variety of sources. The underlying infrastructure includes a consistent body of completely sequenced genomes and exhaustive orthology classifications, based on which interaction evidence is transferred between organisms. Although primarily developed for protein interaction analysis, the resource has also been successfully applied to comparative genomics, phylogenetics and network studies, which are all facilitated by programmatic access to the database backend and the availability of compact download files. As of release 7, STRING has almost doubled to 373 distinct organisms, and contains more than 1.5 million proteins for which associations have been pre-computed. Novel features include AJAX-based web-navigation, inclusion of additional resources such as BioGRID, and detailed protein domain annotation. STRING is available at Author Webpage

INTRODUCTION

A fully comprehensive view of all functionally relevant protein interactions is still not available for any species, not even for relatively simple, single-celled model organisms. However, this information is essential for a systems-level understanding of cellular behavior, and it is needed in order to place the molecular functions of individual proteins into their cellular context.

For detecting direct physical binding between proteins, numerous small-scale and high-throughput experiments have been undertaken, and most of their reported interactions are available from dedicated interaction databases (1–4), as well as from multipurpose databases centered on specific model organisms (5–7). However, the growth of interaction data is severely lagging behind the pace of genome sequencing, so that for most genomes and proteins known to date no interaction data are available. Furthermore, proteins interact not only physically: indirect associations such as genetic interactions or shared pathway memberships are equally important for a complete understanding of cellular function, but are for the most part not stored in interaction databases. Instead, they are available from a variety of pathway databases (8,9) and from the scientific literature.

The database STRING (‘Search Tool for the Retrieval of Interacting Genes/Proteins’) aims to collect, predict and unify most types of protein–protein associations, including direct and indirect associations. In order to cover organisms not yet addressed experimentally, STRING runs a set of prediction algorithms (10), and transfers known interactions from model organisms to other species based on predicted orthology of the respective proteins (11). STRING has grown from a purely predictive resource covering mainly prokaryotes (12) to a comprehensive tool integrating protein association information from all domains of life (Figure 1). Each interaction in the database is annotated with a benchmarked numerical confidence score, which can be used to filter the interaction network at any desired stringency. All data in STRING are stored in relational database tables. The interaction information is freely available for download, but download of the entire database content requires a license agreement to prevent redistribution (free for academic users who only access the previous version number).
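Filtering a network by confidence score can be illustrated with a minimal Python sketch. It assumes that interactions are held as (protein, protein, score) tuples with scores between 0 and 1; the protein names, the scores and the helper `filter_by_confidence` are hypothetical and are not part of STRING.

```python
from typing import Iterable, List, Tuple

Interaction = Tuple[str, str, float]  # (protein_a, protein_b, confidence score)


def filter_by_confidence(interactions: Iterable[Interaction],
                         min_score: float) -> List[Interaction]:
    """Keep only interactions whose score is at least min_score."""
    return [edge for edge in interactions if edge[2] >= min_score]


# Made-up example data (assumption); real scores come from the database.
network = [
    ("protA", "protB", 0.95),
    ("protA", "protC", 0.42),
    ("protB", "protD", 0.88),
]
high_confidence = filter_by_confidence(network, 0.7)
```

Raising `min_score` trades coverage for reliability: fewer links remain, but each remaining link is better supported.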

Figure 1

Protein interaction network in STRING. Screenshot from STRING showing a network of Saccharomyces cerevisiae proteins [the exosome complex, upper right, is seen weakly associated with proteins from nuclear transport, lower left, see also Ref. (26)]. The inset shows the context menu available for all STRING proteins—in the context menu, annotation and domain architecture are shown directly, and links to other databases and tools are available (22,23). In the network, links between proteins signify the various interaction data supporting the network, colored by evidence type (see STRING website for color legend).

KNOWN AND PREDICTED INTERACTIONS

Known interactions in STRING are primarily imported from existing excellent interaction databases (1–5,8,9), and are complemented by automated text mining of PubMed abstracts and several other bodies of scientific text [such as from Ref. (6)]. As is the case for all interactions in STRING, imported interactions are mapped onto a consistent set of proteins and identifiers, thereby facilitating comparison between datasets. STRING does not store specific details regarding splicing isoforms or post-translational modifications, but instead reduces protein isoforms to a single protein per locus (usually as defined by the longest known protein-coding transcript). This level of resolution enables efficient storage and is compatible with most prediction/transfer algorithms, which usually operate only at the level of the gene locus.
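The reduction of isoforms to one protein per locus could be sketched as follows. This is an illustrative simplification that assumes each isoform is described by a locus identifier, an isoform identifier and a protein length; the record layout and the function name are hypothetical, and the sketch uses protein length as a stand-in for "longest known protein-coding transcript".

```python
from typing import Dict, Iterable, Tuple

Isoform = Tuple[str, str, int]  # (locus_id, isoform_id, protein length)


def longest_isoform_per_locus(isoforms: Iterable[Isoform]) -> Dict[str, str]:
    """Map each locus to the identifier of its longest isoform."""
    best: Dict[str, Tuple[str, int]] = {}
    for locus, isoform_id, length in isoforms:
        # Replace the stored isoform only if this one is strictly longer,
        # so the first-seen isoform wins ties.
        if locus not in best or length > best[locus][1]:
            best[locus] = (isoform_id, length)
    return {locus: iso for locus, (iso, _) in best.items()}


# Made-up isoform records (assumption).
isoforms = [
    ("locus1", "locus1.a", 310),
    ("locus1", "locus1.b", 452),
    ("locus2", "locus2.a", 120),
]
representatives = longest_isoform_per_locus(isoforms)
```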

Known interactions are further complemented by de novo interaction predictions derived from several comparative genomics prediction algorithms that are mainly applicable to prokaryotes (13–19). These algorithms systematically compare genomes, searching for frequently observed gene neighborhoods, gene fusion events and similarities in gene occurrence across genomes. For each prediction algorithm, dedicated viewers of the genomic evidence are available in STRING.
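To make the idea of "similarities in gene occurrence across genomes" concrete, the sketch below compares presence/absence profiles of gene families. It uses the Jaccard index as a simple stand-in similarity measure; this is an assumption for illustration and not the scoring scheme used in STRING. Family and genome names are hypothetical.

```python
from typing import Dict, Set


def profile_similarity(genomes_a: Set[str], genomes_b: Set[str]) -> float:
    """Jaccard index of two presence/absence profiles (sets of genomes)."""
    union = genomes_a | genomes_b
    if not union:
        return 0.0
    return len(genomes_a & genomes_b) / len(union)


# Hypothetical gene families and the genomes in which each occurs.
occurrence: Dict[str, Set[str]] = {
    "famA": {"g1", "g2", "g3", "g5"},
    "famB": {"g1", "g2", "g3", "g5"},
    "famC": {"g2", "g4"},
}
same = profile_similarity(occurrence["famA"], occurrence["famB"])
different = profile_similarity(occurrence["famA"], occurrence["famC"])
```

Families that are gained and lost together across genomes (here famA and famB) obtain a high similarity, which is the signal that co-occurrence methods interpret as evidence for a functional association.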

Interaction evidence from model organisms is often useful for other organisms as well, especially when orthologs of interacting proteins can be clearly identified in the second organism. STRING systematically executes such orthology transfers, using both precomputed orthologs from the COG database (20) and a homology-based orthology scheme computed de novo (11). STRING can thus immediately predict a large number of interactions for any newly sequenced genome, as soon as it is included in the system. The combination of known, predicted and transferred interactions is unique, making STRING the most comprehensive interaction resource available to date, especially for organisms not addressed experimentally.
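The principle of an orthology transfer can be sketched in a few lines of Python. The sketch assumes a precomputed mapping from source-organism proteins to their orthologs in a target organism and ignores confidence scores altogether; the function name and all identifiers are hypothetical, and the procedure actually used in STRING is the one described in Ref. (11).

```python
from itertools import product
from typing import Dict, List, Set, Tuple


def transfer_interactions(source_edges: List[Tuple[str, str]],
                          orthologs: Dict[str, List[str]]) -> Set[Tuple[str, str]]:
    """Map each source interaction onto ortholog pairs in the target organism."""
    transferred: Set[Tuple[str, str]] = set()
    for a, b in source_edges:
        # Every ortholog of a is paired with every ortholog of b;
        # edges with an unmapped partner yield nothing.
        for ta, tb in product(orthologs.get(a, []), orthologs.get(b, [])):
            if ta != tb:
                transferred.add(tuple(sorted((ta, tb))))
    return transferred


# Made-up interactions in a source organism and a made-up ortholog mapping.
source_edges = [("s1", "s2"), ("s2", "s3")]
orthologs = {"s1": ["t1"], "s2": ["t2", "t2b"]}
predicted = transfer_interactions(source_edges, orthologs)
```

In this example the interaction between s2 and s3 is not transferred, because s3 has no ortholog in the target organism.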

The homology data stored in STRING form the basis for the interaction transfers, and are the result of more than 7 × 10¹¹ pairwise protein comparisons using the sensitive Smith–Waterman dynamic programming algorithm. This dataset is a very useful asset in itself [see also (21)], and can be accessed independently of the protein interaction networks by locally installing the STRING database files. Users of the website can also browse all of the homologs detected for any protein of interest, and can inspect alignments with very fast response times (Figure 2).
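For readers unfamiliar with the Smith–Waterman algorithm, the following minimal sketch computes the best local alignment score of two sequences. It assumes simple match/mismatch scores and a linear gap penalty; these parameters are illustrative assumptions, and production protein searches normally use substitution matrices and affine gap penalties instead.

```python
def smith_waterman_score(seq_a: str, seq_b: str,
                         match: int = 2, mismatch: int = -1, gap: int = -2) -> int:
    """Best local alignment score of two sequences (linear gap penalty)."""
    prev = [0] * (len(seq_b) + 1)  # previous row of the scoring matrix
    best = 0
    for ch_a in seq_a:
        curr = [0]
        for j, ch_b in enumerate(seq_b, start=1):
            diag = prev[j - 1] + (match if ch_a == ch_b else mismatch)
            # Local alignment: a cell never drops below zero.
            score = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            curr.append(score)
            best = max(best, score)
        prev = curr
    return best


# Two made-up sequences that share the local segment "MKVLA".
score = smith_waterman_score("GGMKVLAGG", "PPMKVLAPP")
```

Only two rows of the dynamic programming matrix are kept in memory, which is sufficient when the score alone is needed and no traceback of the alignment is required.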

Figure 2

Precomputed homology relations and alignments. For most genomes contained in STRING, sensitive all-against-all homology searches using the Smith–Waterman algorithm are included. These form the basis for assigning orthologs and transferring interaction information, but are also available directly to the user. Because they are stored in a relational database, access to homologs and alignments for any protein of interest is possible without the usual waiting time.

NEW FEATURES AND IMPROVEMENTS IN STRING 7

The network viewer in STRING (Figure 1) is the central information source and navigation hub for the user. It has been extended through a context-sensitive menu-box, which displays associated information for any protein in the network. This menu includes a graphical summary of protein domains and features, and allows the user to link out to external resources such as the motif discovery tool DILIMOT (22). STRING is now also tightly integrated with the SMART protein architecture research tool (23). With the latter it shares a common set of genomes and proteins, for which consistent results are pre-computed and stored. This enables automatic interlinking between both resources (SMART includes interaction previews, and STRING includes domain architecture previews). The topology and evolution of interaction networks can thus be studied both at the level of proteins and at the level of individual domains.

Since the last update (11), STRING has grown substantially both in terms of data sources and number of organisms covered. Five new databases are included [MINT, HPRD, BioGRID, DIP and Reactome (2–5,8)], as well as 194 new organisms. Mainly because of this increase in completely sequenced organisms, the architecture of STRING had to be substantially upgraded so that it can accommodate present and future growth. With respect to the user interface, this required changes in the viewers for the genomic context data, which could no longer show all of the genomes simultaneously by default. Instead, STRING uses a phylogenetic tree of species to collapse redundant genomes; this tree has been derived from concatenated alignments of a small number of universal protein families (24). Users can navigate the tree by expanding or collapsing its sub-branches, thus choosing which organisms to focus on. AJAX technology (‘Asynchronous JavaScript and XML’) is then used to fetch the requested information into the existing, pre-loaded browser page, thus increasing usability and speed.

With respect to the underlying database structure, changes were necessary in the way homology data and interaction transfers are stored. Both can no longer be computed and stored in an ‘all-against-all’ fashion, because of their quadratic scaling with the number of genomes. Beginning with version 7, STRING therefore adopts a two-layered approach when accommodating fully sequenced genomes (Figure 3): important model organisms and those for which experimental data are available form the ‘core genomes’, all other genomes form the periphery. Within the core, homology searches and interaction transfers are still executed in an all-against-all fashion, whereas for peripheral genomes only searches against the core are included. These and other changes in STRING dramatically improve the scalability of the resource, leading to faster update cycles even if the number of sequenced genomes increases as fast as currently projected. Together with future plans to increase the scope and specificity of the stored interaction information, STRING should thus continue to facilitate not only network research but also wider projects that range from phylogenetics to metagenomics (24,25).
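The effect of the two-layered design on the number of genome-versus-genome comparisons can be estimated with simple arithmetic. The sketch below counts unordered pairs of distinct genomes; the total of 373 genomes is taken from the text, whereas the core size of 100 is a purely hypothetical value chosen for illustration, since the actual number of core genomes is not stated here.

```python
def all_against_all_pairs(n_genomes: int) -> int:
    """Number of unordered genome pairs, n * (n - 1) / 2."""
    return n_genomes * (n_genomes - 1) // 2


def two_layer_pairs(n_core: int, n_peripheral: int) -> int:
    """Core versus core, plus every peripheral genome against every core genome."""
    return all_against_all_pairs(n_core) + n_core * n_peripheral


total_genomes = 373   # from the text
assumed_core = 100    # hypothetical core size (assumption)
full = all_against_all_pairs(total_genomes)
layered = two_layer_pairs(assumed_core, total_genomes - assumed_core)
```

For a fixed core size, each additional peripheral genome adds only as many comparisons as there are core genomes, so the cost grows linearly rather than quadratically with the number of newly sequenced genomes.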

Figure 3

Organisms covered by STRING. STRING currently contains 373 fully sequenced organisms. These are divided into ‘Core Organisms’ and ‘Peripheral Organisms’. The former include all important model organisms for which experimental data are available, as well as selected representatives for cases of redundant genome sequencing (e.g. when several closely related strains of a bacterial species have been sequenced, only one strain is included). The ‘Peripheral Organisms’ form the remainder; they tend to be somewhat redundant, and usually have little more than genomic sequence information annotated. For the core organisms, homology relations and interaction transfers are fully computed, whereas the peripheral organisms are only connected to the core but not among themselves (the graphic shows only a small selection of organisms; lines indicate homology searches and interaction transfers). This architecture allows STRING to encompass all sequenced genomes, while still keeping database size and computation time within reasonable limits.

ACKNOWLEDGEMENTS

The authors wish to thank Dianna Fisk from the Saccharomyces Genome Database for access to the Gene Summary Paragraphs, and Toby Gibson, Martijn Huynen, Victor Neduva, Rune Linding and members of the Bork group for continued feedback and discussions. This work was supported in part by grants from the Bundesministerium für Forschung und Bildung, Germany, as well as through the ADIT Integrated Project, contract number LSHB-CT-2005-511065, and through the BioSapiens Network of Excellence, contract number LSHG-CT-2003-503265, both funded by the European Commission FP6 Programme. Funding to pay the Open Access publication charges for this article was provided by the University of Zurich, through its Research Priority Program ‘Systems Biology and Functional Genomics’.

Conflict of interest statement. None declared.

REFERENCES

1. The biomolecular interaction network database and related tools 2005 update, Nucleic Acids Res., 2005, vol. 33 (pg. D418-D424)

2. The Database of Interacting Proteins: 2004 update, Nucleic Acids Res., 2004, vol. 32 (pg. D449-D451)

3. MINT: a Molecular INTeraction database, FEBS Lett., 2002, vol. 513 (pg. 135-140)

4. BioGRID: a general repository for interaction datasets, Nucleic Acids Res., 2006, vol. 34 (pg. D535-D539)

5. Human protein reference database—2006 update, Nucleic Acids Res., 2006, vol. 34 (pg. D411-D414)

6. Genome Snapshot: a new resource at the Saccharomyces Genome Database (SGD) presenting an overview of the Saccharomyces cerevisiae genome, Nucleic Acids Res., 2006, vol. 34 (pg. D442-D445)

7. WormBase: better software, richer content, Nucleic Acids Res., 2006, vol. 34 (pg. D475-D478)

8. Reactome: a knowledgebase of biological pathways, Nucleic Acids Res., 2005, vol. 33 (pg. D428-D432)

9. From genomics to chemical genomics: new developments in KEGG, Nucleic Acids Res., 2006, vol. 34 (pg. D354-D357)

10. STRING: a database of predicted functional associations between proteins, Nucleic Acids Res., 2003, vol. 31 (pg. 258-261)

11. STRING: known and predicted protein–protein associations, integrated and transferred across organisms, Nucleic Acids Res., 2005, vol. 33 (pg. D433-D437)

12. STRING: a web-server to retrieve and display the repeatedly occurring neighbourhood of a gene, Nucleic Acids Res., 2000, vol. 28 (pg. 3442-3444)

13. Computational methods for the prediction of protein interactions, Curr. Opin. Struct. Biol., 2002, vol. 12 (pg. 368-373)

14. Measuring genome evolution, Proc. Natl Acad. Sci. USA, 1998, vol. 95 (pg. 5849-5856)

15. Assigning protein functions by comparative genome analysis: protein phylogenetic profiles, Proc. Natl Acad. Sci. USA, 1999, vol. 96 (pg. 4285-4288)

16. Protein interaction maps for complete genomes based on gene fusion events, Nature, 1999, vol. 402 (pg. 86-90)

17. Detecting protein function and protein–protein interactions from genome sequences, Science, 1999, vol. 285 (pg. 751-753)

18. Conservation of gene order: a fingerprint of proteins that physically interact, Trends Biochem. Sci., 1998, vol. 23 (pg. 324-328)

19. The use of gene clusters to infer functional coupling, Proc. Natl Acad. Sci. USA, 1999, vol. 96 (pg. 2896-2901)

20. The COG database: an updated version includes eukaryotes, BMC Bioinformatics, 2003, vol. 4 (pg. 41)

21. SIMAP: the similarity matrix of proteins, Nucleic Acids Res., 2006, vol. 34 (pg. D252-D256)

22. DILIMOT: discovery of linear motifs in proteins, Nucleic Acids Res., 2006, vol. 34 (pg. W350-W355)

23. SMART 5: domains in the context of genomes and networks, Nucleic Acids Res., 2006, vol. 34 (pg. D257-D260)

24. Toward automatic reconstruction of a highly resolved tree of life, Science, 2006, vol. 311 (pg. 1283-1287)

25. Comparative metagenomics of microbial communities, Science, 2005, vol. 308 (pg. 554-557)

26. Nucleocytoplasmic transport: integrating mRNA production and turnover with export through the nuclear pore, Mol. Cell. Biol., 2004, vol. 24 (pg. 3069-3076)

© 2006 The Author(s)

This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/by-nc/2.0/uk/) which permits unrestricted non-commercial use, distribution, and reproduction in any medium, provided the original work is properly cited.
