Touring protein fold space with Dali/FSSP

Journal Article

European Molecular Biology Laboratory, European Bioinformatics Institute, Genome Campus, Cambridge CB10 1SD, UK

Received: 10 October 1997

Accepted: 15 October 1997

Published: 01 January 1998

Abstract

The FSSP database and its new supplement, the Dali Domain Dictionary, present a continuously updated classification of all known 3D protein structures. The classification is derived using an automatic structure alignment program (Dali) for the all-against-all comparison of structures in the Protein Data Bank. From the resulting enumeration of structural neighbours (which form a surprisingly continuous distribution in fold space) we derive a discrete fold classification in three steps: (i) sequence-related families are covered by a representative set of protein chains; (ii) protein chains are decomposed into structural domains based on the recurrence of structural motifs; (iii) folds are defined as tight clusters of domains in fold space. The fold classification, domain definitions and test sets for sequence-structure alignment (threading) are accessible on the web at www.embl-ebi.ac.uk/dali. The web interface provides a rich network of links between neighbours in fold space, between domains and proteins, and between structures and sequences leading, for example, to a database of explicit multiple alignments of protein families in the twilight zone of sequence similarity. The Dali/FSSP organization of protein structures provides a map of the currently known regions of the protein universe that is useful for the analysis of folding principles, for the evolutionary unification of protein families and for maximizing the information return from experimental structure determination.

Introduction

The number of three-dimensional protein structures in the Protein Data Bank (PDB; 1) has been doubling approximately every 18 months. This acceleration means that automatic methods are increasingly important for efforts to organize the data. The FSSP database (2), established in 1992, and its new supplement, the Dali Domain Dictionary, are produced using the Dali program for structural alignment (3) to automatically and continuously process the new structures released by the Protein Data Bank (Fig. 1). The information derived as a result includes the description of protein domain architecture, the definition of structural neighbours around each known structure, the definition of structurally conserved cores and explicit multiple alignments of distantly related protein families; these are made available on the web.

Figure 1

Flowchart of the processing of protein structures in PDB. The high redundancy of biological databases presents a number of problems in practical use. To overcome these problems, it is essential to derive representative subsets and/or classify the data. Our structural classification starts from extracting all structures (chains) from the PDB (left). Based on all-on-all structure comparison, we define a representative set of structures which is free of sequence redundancy (middle bottom). Each structure is decomposed into domains (upper right). Folds are defined by clustering domains based on structural similarities. As a result, all known protein structures can be completely described in terms of 526 fold types (bottom right; the numbers refer to April 1997). The arrows in the middle column put the fold classification in context with the world of sequence analysis via the HSSP database of structure-sequence alignments (15). About one quarter of all sequences in the SWISS-PROT database (13) are clearly homologous to proteins of known structure.

There are a number of other classification schemes for protein structures available on the web. Although they are based on the same data, the presentations differ in their basic philosophy regarding automation and organization (4–9). For example, MMDB from NCBI (US National Center for Biotechnology Information) provides a fish-eye view of structural neighbours around any PDB structure based on precalculated all-on-all structure comparisons using the VAST algorithm (4). Scop (5) and CATH (6) are strictly hierarchical classifications based on the abstractions of class (4–10 categories at the top of the hierarchy), architecture/topology or fold, and superfamily (519 in scop). Both classifications are curated by experts, with emphasis in scop on the definition of functionally related superfamilies and in CATH on the definition of architectural types. Dali/FSSP is a fully automatic classification based on the concept of neighbourhoods in fold space, of which it aims to provide useful views at both coarse-grained and fine-grained resolution. In the near neighbour range, the quantitative structural relationships between domains are described in terms of hierarchical clustering (dendrograms, similar to scop and CATH) and in terms of neighbour lists (similar to VAST). In recognition of the continuous rather than discrete distribution of domains in fold space, the global overview of structural relationships between domains is presented in terms of 2D ‘roadmaps’ of fold space. At all levels, representative sets are used for clarity, removing obvious redundancy of information. Many of the finer branches of the fold dendrograms correspond to evolutionarily related, functionally conserved superfamilies. We are currently developing tools for automatically annotating functional evidence of plausible evolutionary relationships (10).

Figure 2

Hierarchical clustering of folds. Hierarchical clustering yields a convenient view (dendrogram) of fold neighbours at different levels of structural similarity (Z-scores). In this example, five domains (columns a–e) belong to the same fold class. Based on the topology of the dendrogram, domains d and e are siblings (same parent node), domains a and b are cousins (same grandparent node), domains c and d are second cousins (same great-grandparent node), and so on. To ease navigation, the user is presented with a uniform summary for each node in the dendrogram. The idea is to choose a central member of the cluster as a representative (3D template) onto which structural or sequence variability can be mapped based on the multiple alignment of cluster members. For example, domain a represents the whole class {a,b,c,d,e}, and the link d→c means that domain c is used to represent the set {c,d,e}. The fold...domain levels are based on structure similarity. Sequence families around proteins of known structure (bottom row) are defined by sequence similarity (14). Exploiting links involving structure alignments leads to accurate multiple alignments of distantly related protein families. Currently, the naming of structural similarity levels is not a statement about evolutionary relationships. However, we regularly observe that remote relatives are more similar to each other than to other proteins in the database, so in favourable cases examination of the fold dendrogram can lead to biological discoveries. For example, {a,b} and {d,e} including their associated sequence families are likely candidates for unification into a functionally conserved superfamily.

The structural classification is explicitly linked (11) to sequence families with associated functional annotation, resulting in a rich network of biologically interesting relationships that can be browsed online. In particular, structure-based alignments increase our understanding of the more distant evolutionary relationships. For example, the discovery of remarkable structural similarity between histidine triad (HIT) proteins and galactose-1-phosphate uridylyltransferase (GalT) pointed to a conserved biochemical function in an emerging superfamily (12). The interconnection of structural classification with sequence families also opens the door to studies of structure-sequence-function relationships from a global perspective, for example: ‘which folds support function X?’, ‘which functions have evolved on the framework of fold Y?’, ‘do protein families in region Z of fold space diverge faster/more slowly than average?’.

Form and Content of the Database

The Protein Data Bank (PDB) is highly redundant in terms of sequence and structure similarities. Our aim is a complete and economical description of structural data (Fig. 1). The first reduction step is the generation of a sequence-unique set. No pair of proteins within this set is more than 25% identical in sequence, and every removed structure is more than 25% identical to a representative. To avoid the removal of unique domains next to more common domains, the percentage used here is calculated as the number of residue identities in the structurally aligned region, divided by the average length of the two proteins (not by the length of the aligned region). The second step is to describe the structural neighbourhood around each sequence-unique representative chain, in the form of structural alignments. The FSSP database (DaliFSSP) has one entry per representative, reporting the structural alignments with the representative's sequence homologs (same family, membership detectable by sequence methods) and with other members of the representative set (related families, relationship difficult or impossible to detect by sequence methods). The Dali Domain Dictionary (DaliDD) is a new complement to the FSSP database that has the same format but one entry per structural domain. In other words, DaliFSSP is about proteins, or protein chains, while DaliDD is about structural domains.
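The identity measure and the 25% rule described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names, the toy sequences and the greedy first-match selection order are hypothetical, since the text does not say how representatives are chosen among the members of a family.

```python
from typing import Callable, Dict, List, Sequence, Tuple


def percent_identity(seq_a: str, seq_b: str,
                     aligned_pairs: Sequence[Tuple[int, int]]) -> float:
    """Identities in the structurally aligned region, normalized by the
    average length of the two chains (not by the aligned length)."""
    identities = sum(1 for i, j in aligned_pairs if seq_a[i] == seq_b[j])
    average_length = (len(seq_a) + len(seq_b)) / 2.0
    return 100.0 * identities / average_length


def select_representatives(names: Sequence[str],
                           identity: Callable[[str, str], float],
                           cutoff: float = 25.0) -> Dict[str, List[str]]:
    """Greedy pass (an assumption, not the published procedure): a chain
    joins the first representative it matches above the cutoff, otherwise
    it becomes a new representative."""
    families: Dict[str, List[str]] = {}
    for name in names:
        for rep in families:
            if identity(rep, name) > cutoff:
                families[rep].append(name)
                break
        else:
            families[name] = [name]
    return families


# Toy data: a 10-residue domain that also occurs inside a 100-residue chain.
domain = "ACDEFGHIKL"
two_domain_chain = domain + "M" * 90
pairs = [(i, i) for i in range(len(domain))]
print(round(percent_identity(domain, two_domain_chain, pairs), 1))
```

In this toy case the aligned region is 100% identical, yet the length-normalized value is about 18%, so the longer chain with its additional unique region would stay in the representative set.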

Figure 3

Touring fold space. The dictionary is based on the quantification of structural similarities by all-on-all comparison of known structures. Using the pairwise similarities, each structure can be positioned in an abstract high-dimensional fold space. The overall distribution of domains into general architectural types is visualized using 2D projections of fold space (‘roadmaps’) generated by multivariate scaling methods (3). Within fold space, there are tight clusters of domains that have the same fold, i.e., similar overall arrangement of secondary structure elements. The structural relationships between instances (member domains) of a fold are visualized using dendrograms (explained in Fig. 2). The WWW interface allows the database of structural neighbours to be queried in a variety of ways with dynamic views generated on the fly. In this example, clicking in the lower right corner of the 2D map (top left) leads to a table view (middle) of folds occupying this region of fold space. Click on ‘details’ for a representative domain to identify structural neighbours that form bridges between the fold clusters and can be used for 3D superimposition. In this case, superimposition reveals a shared motif consisting of two crossed β-hairpins (upper right, the numbers above the ribbon diagrams refer to fold class). To analyse a fold cluster in more detail, the user can expand or contract the fold tree (click on a node, e.g., 21.1.1.) and invoke different graphical views of selected subsets that highlight conserved sequence features and structural elements (bottom).

For many types of analysis, it is useful to work within a discrete classification framework, although the data do not easily lend themselves to disjoint clustering. To produce a discrete classification of domains, the all-on-all structure comparison is used to derive a fold tree (dendrogram) by a simple hierarchical clustering procedure using average linkage. Folds are then defined by cutting the fold tree at an empirically chosen cutoff such that most secondary structure elements are structurally equivalent between members of a cluster, i.e., they have the same fold. To ease navigation, subclusters that group together domains with similarities of architectural detail are obtained by cutting the tree at higher levels of structural similarity (Fig. 2).
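A minimal sketch of average-linkage clustering with a similarity cutoff follows. The domain labels, the pairwise scores and the cutoff values are all hypothetical toy data, and the text does not state the cutoff actually used. With average linkage, merge scores never increase from one step to the next, so stopping once the best average similarity falls below the cutoff gives the same clusters as cutting the complete tree at that level.

```python
from itertools import combinations
from typing import Dict, FrozenSet, List, Sequence


def average_similarity(c1: Sequence[str], c2: Sequence[str],
                       sim: Dict[FrozenSet[str], float]) -> float:
    """Mean pairwise similarity between the members of two clusters."""
    total = sum(sim[frozenset((a, b))] for a in c1 for b in c2)
    return total / (len(c1) * len(c2))


def cluster_average_linkage(items: Sequence[str],
                            sim: Dict[FrozenSet[str], float],
                            cutoff: float) -> List[List[str]]:
    """Merge the most similar pair of clusters until none reaches the cutoff."""
    clusters: List[List[str]] = [[x] for x in items]
    while len(clusters) > 1:
        best_score, (i, j) = max(
            (average_similarity(clusters[i], clusters[j], sim), (i, j))
            for i, j in combinations(range(len(clusters)), 2)
        )
        if best_score < cutoff:
            break
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return clusters


# Hypothetical similarity scores between five toy domains a..e.
raw = {("a", "b"): 12.0, ("a", "c"): 3.0, ("a", "d"): 2.5, ("a", "e"): 2.0,
       ("b", "c"): 3.5, ("b", "d"): 2.0, ("b", "e"): 2.0,
       ("c", "d"): 8.0, ("c", "e"): 7.0, ("d", "e"): 15.0}
toy_sim = {frozenset(pair): score for pair, score in raw.items()}

folds = cluster_average_linkage("abcde", toy_sim, cutoff=5.0)
subclusters = cluster_average_linkage("abcde", toy_sim, cutoff=10.0)
print(sorted(sorted(c) for c in folds))
print(sorted(sorted(c) for c in subclusters))
```

With these toy scores the lower cutoff yields two clusters, {a,b} and {c,d,e}, while the higher cutoff splits c away from {d,e}, which mirrors how subclusters arise from cutting at higher similarity.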

The distribution of representative structures in folds is highly uneven. The largest fold has >100 member domains, and the four dominant folds [αβ domains, immunoglobulin-like domains, (αβ)₈ barrels, helical bundles] comprise one quarter of the number of secondary structure elements in the representative set. For book-keeping purposes, we have chosen to index folds in order of decreasing population; these indices have no intrinsic meaning and may change as more structures are solved.
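The book-keeping convention can be illustrated with a short sketch. The fold contents below are made-up labels, and starting the index at 1 is an assumption; the point is only that an index assigned by population can shift when a fold gains members.

```python
from typing import Dict, List, Sequence


def index_folds_by_population(folds: Sequence[List[str]]) -> Dict[int, List[str]]:
    """Number folds 1, 2, 3, ... in order of decreasing member count.
    Python's sort is stable, so ties keep their input order."""
    ordered = sorted(folds, key=len, reverse=True)
    return {index: members for index, members in enumerate(ordered, start=1)}


# Hypothetical folds with made-up domain labels.
before = index_folds_by_population([["x1"], ["a1", "a2", "a3"], ["d1", "d2"]])
# Two new structures join the former singleton fold, which moves it up.
after = index_folds_by_population([["x1", "x2", "x3", "x4"],
                                   ["a1", "a2", "a3"], ["d1", "d2"]])
print(before[1], after[1])
```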

Uses of the Database

The web service provides graphical and tabular views of the data so that the user can take a tour of fold space while sitting and clicking (Fig. 3). A tour of fold space can start from a region of fold space seen in 2D projection, from a structure selected automatically at random, from a node in the fold dendrograms, or from a string (text) search in structure or sequence databases (13–15). Hyperlinks connect structures to structural neighbours, allowing ‘walking’ through neighbourhoods of structural motifs.

Strong structural similarity despite low overall sequence similarity hints at a possible distant evolutionary relationship. The web server provides powerful tools for analysing superfamilies because the structural alignments are linked with protein families and functional annotation in sequence databases. Particularly informative (and rarely available) are the explicit multiple alignments of distantly related representatives with their sequence neighbours, which often reveal a signature of invariantly conserved residues. Although such invariant residues may be widely dispersed along the 1D sequence, mapping these residues onto a structural template typically shows that they cluster together in 3D to form an active site (16). Such sets of residues are an excellent starting point for the crafting of far-reaching search profiles.
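As a rough illustration of how a signature of invariant residues can be read off a multiple alignment, the sketch below reports the columns in which every sequence carries the same residue and no gap. The function name and the three-row toy alignment are hypothetical; real alignments from the database would be much larger and would need their own parser.

```python
from typing import List, Sequence, Tuple


def invariant_columns(alignment: Sequence[str], gap: str = "-") -> List[Tuple[int, str]]:
    """Return (column index, residue) for columns that are identical in
    every row and contain no gap character. Columns are 0-based."""
    if not alignment:
        return []
    length = len(alignment[0])
    if any(len(row) != length for row in alignment):
        raise ValueError("all aligned sequences must have the same length")
    conserved = []
    for col in range(length):
        residues = {row[col] for row in alignment}
        if len(residues) == 1 and gap not in residues:
            conserved.append((col, alignment[0][col]))
    return conserved


# Toy alignment (made up for illustration, not taken from the database).
toy_alignment = [
    "MKH-LHVHLA",
    "ARHGLHAHVS",
    "-THPIHEHLT",
]
print(invariant_columns(toy_alignment))
```

The reported positions could then be mapped onto the coordinates of a structural template to check whether they lie close together in 3D.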

In the context of fold recognition, the structural classification thus leads to sequence models (profiles) that more accurately model the evolutionary variation within a superfamily, provides core templates with information about structurally conserved or variable parts, and reduces the size of the target structure database. See http://www2.embl-ebi.ac.uk/dali/testset for proposed test sets.

Distribution

The FSSP database and Dali Domain Dictionary are accessible at http://www.embl-ebi.ac.uk/dali and by anonymous ftp (file transfer protocol) from ftp.embl-ebi.ac.uk in the directory /pub/databases/fssp. The complete set of database files requires ∼140 Mb of disk storage. The web browser script is available for sites wishing to mirror the server [local installation of the HSSP (15) and PDB databases is also required].

No inclusion in other databases or database services, academic or other, without explicit permission of the authors. All rights reserved. Not to be used for classified research. Academic redistribution of single files or of the entire database is permitted, provided no changes are made in content or terms of use.

The Dali server (3) is the ‘BLAST server’ of protein 3D structures. Dali performs a database similarity search of a new structure solved by crystallography or NMR against the 3D co-ordinates of structures in the Protein Data Bank. Requests must contain at least the Cα co-ordinates of the new structure and may be sent by e-mail to dali@embl-ebi.ac.uk or submitted interactively through http://www.embl-ebi.ac.uk/dali. Please report any problems to the authors by electronic mail.
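Since a request needs at least the Cα co-ordinates, a user may want to check what a co-ordinate file contains before submitting it. The sketch below pulls Cα atoms out of PDB-format text using the fixed column positions of the ATOM record. The function names and the toy co-ordinates are hypothetical, the choice to keep only the first model and the first alternate location is an assumption, and nothing here reflects how the server itself reads its input.

```python
from typing import List, Optional, Tuple

CaRecord = Tuple[str, int, float, float, float]  # residue name, number, x, y, z


def extract_ca_coordinates(pdb_text: str, chain_id: Optional[str] = None) -> List[CaRecord]:
    """Collect C-alpha atoms from ATOM records of the first model."""
    records: List[CaRecord] = []
    for line in pdb_text.splitlines():
        if line.startswith("ENDMDL"):        # stop after the first NMR model
            break
        if not line.startswith("ATOM  ") or len(line) < 54:
            continue
        if line[12:16] != " CA ":            # atom name, columns 13-16
            continue
        if line[16] not in (" ", "A"):       # keep the first alternate location
            continue
        if chain_id is not None and line[21] != chain_id:
            continue
        records.append((line[17:20].strip(), int(line[22:26]),
                        float(line[30:38]), float(line[38:46]), float(line[46:54])))
    return records


def toy_atom_line(serial: int, atom: str, res: str, chain: str, resseq: int,
                  x: float, y: float, z: float, record: str = "ATOM") -> str:
    """Helper for the toy data only: lay fields out in PDB fixed columns."""
    return (f"{record:<6s}{serial:5d} {atom:<4s} {res:>3s} {chain}{resseq:4d}"
            f"    {x:8.3f}{y:8.3f}{z:8.3f}")


# Made-up co-ordinates; the last line is a calcium ion, which must be skipped.
toy_pdb = "\n".join([
    toy_atom_line(1, " N  ", "MET", "A", 1, 27.340, 24.430, 2.614),
    toy_atom_line(2, " CA ", "MET", "A", 1, 26.266, 25.413, 2.842),
    toy_atom_line(3, " CA ", "GLN", "A", 2, 26.850, 29.021, 3.898),
    toy_atom_line(4, "CA  ", " CA", "A", 101, 10.0, 10.0, 10.0, record="HETATM"),
])
print(extract_ca_coordinates(toy_pdb))
```

Matching the atom name as the exact four-character field " CA " and restricting to ATOM records keeps calcium ions out of the result.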

References

1. J. Mol. Biol., 1977, vol. 112 (pg. 535-542)

2. Protein Sci., 1992, vol. 1 (pg. 1691-1698)

3. Science, 1996, vol. 273 (pg. 595-602)

4. Curr. Opin. Struct. Biol., 1996, vol. 6 (pg. 377-385)

5. J. Mol. Biol., 1995, vol. 247 (pg. 536-540)

6. Structure, 1997, vol. 5 (pg. 1093-1108)

7. Protein Engng, 1995, vol. 8 (pg. 513-525)

8. Protein Sci., 1995, vol. 4 (pg. 872-884)

9. Fold Des., 1996, vol. 1 (pg. 209-220)

10. ISMB, 1997, vol. 5 (pg. 140-146)

11. Methods Enzymol., 1996, vol. 266 (pg. 114-128)

12. Trends Biochem. Sci., 1997, vol. 22 (pg. 116-117)

13. Nucleic Acids Res., 1992, vol. 20 (pg. 2013-2018)

14. Proteins, 1991, vol. 9 (pg. 56-68)

15. Nucleic Acids Res., 1997, vol. 25 (pg. 226-230) [see also this issue, Nucleic Acids Res. (1998) 26, 313–315]

16. Proteins, 1997, vol. 28 (pg. 72-82)

© 1998 Oxford University Press
