Touring protein fold space with Dali/FSSP

Journal Article

European Molecular Biology Laboratory, European Bioinformatics Institute

Genome Campus, Cambridge CB10 1SD, UK


European Molecular Biology Laboratory, European Bioinformatics Institute

Genome Campus, Cambridge CB10 1SD, UK


Received: 10 October 1997

Accepted: 15 October 1997

Published: 01 January 1998


Abstract

The FSSP database and its new supplement, the Dali Domain Dictionary, present a continuously updated classification of all known 3D protein structures. The classification is derived using an automatic structure alignment program (Dali) for the all-against-all comparison of structures in the Protein Data Bank. From the resulting enumeration of structural neighbours (which form a surprisingly continuous distribution in fold space) we derive a discrete fold classification in three steps: (i) sequence-related families are covered by a representative set of protein chains; (ii) protein chains are decomposed into structural domains based on the recurrence of structural motifs; (iii) folds are defined as tight clusters of domains in fold space. The fold classification, domain definitions and test sets for sequence-structure alignment (threading) are accessible on the web at www.embl-ebi.ac.uk/dali. The web interface provides a rich network of links between neighbours in fold space, between domains and proteins, and between structures and sequences, leading, for example, to a database of explicit multiple alignments of protein families in the twilight zone of sequence similarity. The Dali/FSSP organization of protein structures provides a map of the currently known regions of the protein universe that is useful for the analysis of folding principles, for the evolutionary unification of protein families and for maximizing the information return from experimental structure determination.

Introduction

The number of three-dimensional protein structures in the Protein Data Bank (PDB; 1) has been doubling approximately every 18 months. This acceleration means that automatic methods are increasingly important for efforts to organize the data. The FSSP database ( 2 ), established in 1992, and its new supplement, the Dali Domain Dictionary, are produced using the Dali program for structural alignment ( 3 ) to automatically and continuously process the new structures released by the Protein Data Bank ( Fig. 1 ). The information derived as a result includes the description of protein domain architecture, the definition of structural neighbours around each known structure, the definition of structurally conserved cores and explicit multiple alignments of distantly related protein families; these are made available on the web.

Figure 1

Flowchart of the processing of protein structures in PDB. The high redundancy of biological databases presents a number of problems in practical use. To overcome these problems, it is essential to derive representative subsets and/or classify the data. Our structural classification starts from extracting all structures (chains) from the PDB (left). Based on all-on-all structure comparison, we define a representative set of structures which is free of sequence redundancy (middle bottom). Each structure is decomposed into domains (upper right). Folds are defined by clustering domains based on structural similarities. As a result, all known protein structures can be completely described in terms of 526 fold types (bottom right; the numbers refer to April 1997). The arrows in the middle column put the fold classification in context with the world of sequence analysis via the HSSP database of structure-sequence alignments ( 15 ). About one quarter of all sequences in the SWISS-PROT database ( 13 ) are clearly homologous to proteins of known structure.

There are a number of other classification schemes for protein structures available on the web. Although they are based on the same data, the presentations differ in their basic philosophy regarding automation and organization ( 4–9 ). For example, MMDB from NCBI (US National Center for Biotechnology Information) provides a fish-eye view of structural neighbours around any PDB structure based on precalculated all-on-all structure comparisons using the VAST algorithm ( 4 ). Scop ( 5 ) and CATH ( 6 ) are strictly hierarchical classifications based on the abstractions of class (4–10 categories at the top of the hierarchy), architecture/topology or fold, and superfamily (519 in scop). Both classifications are curated by experts, with emphasis in scop on the definition of functionally related superfamilies and in CATH on the definition of architectural types. Dali/FSSP is a fully automatic classification based on the concept of neighbourhoods in fold space, of which it aims to provide useful views at both coarse-grained and fine-grained resolution. In the near neighbour range, the quantitative structural relationships between domains are described in terms of hierarchical clustering (dendrograms, similar to scop and CATH) and in terms of neighbour lists (similar to VAST). In recognition of the continuous rather than discrete distribution of domains in fold space, the global overview of structural relationships between domains is presented in terms of 2D ‘roadmaps’ of fold space. At all levels, representative sets are used for clarity, removing obvious redundancy of information. Many of the finer branches of the fold dendrograms correspond to evolutionarily related, functionally conserved superfamilies. We are currently developing tools for automatically annotating functional evidence of plausible evolutionary relationships ( 10 ).

Figure 2

Hierarchical clustering of folds. Hierarchical clustering yields a convenient view (dendrogram) of fold neighbours at different levels of structural similarity (Z-scores). In this example, five domains (columns a–e) belong to the same fold class. Based on the topology of the dendrogram, domains d and e are siblings (same parent node), domains a and b are cousins (same grandparent node), domains c and d are second cousins (same great-grandparent node), and so on. To ease navigation, the user is presented with a uniform summary for each node in the dendrogram. The idea is to choose a central member of the cluster as a representative (3D template) onto which structural or sequence variability can be mapped based on the multiple alignment of cluster members. For example, domain a represents the whole class {a,b,c,d,e}, and the link d→c means that domain c is used to represent the set {c,d,e}. The fold...domain levels are based on structure similarity. Sequence families around proteins of known structure (bottom row) are defined by sequence similarity ( 14 ). Exploiting links involving structure alignments leads to accurate multiple alignments of distantly related protein families. Currently, the naming of structural similarity levels is not a statement about evolutionary relationships. However, we regularly observe that remote relatives are more similar to each other than to other proteins in the database, so in favourable cases examination of the fold dendrogram can lead to biological discoveries. For example, {a,b} and {d,e} including their associated sequence families are likely candidates for unification into a functionally conserved superfamily.

The structural classification is explicitly linked ( 11 ) to sequence families with associated functional annotation, resulting in a rich network of biologically interesting relationships that can be browsed online. In particular, structure-based alignments increase our understanding of the more distant evolutionary relationships. For example, the discovery of remarkable structural similarity between histidine triad (HIT) proteins and galactose-1-phosphate uridylyltransferase (GalT) pointed to a conserved biochemical function in an emerging superfamily ( 12 ). The interconnection of structural classification with sequence families also opens the door to studies of structure-sequence-function relationships from a global perspective, for example: ‘which folds support function X?’, ‘which functions have evolved on the framework of fold Y?’, ‘do protein families in region Z of fold space diverge faster/more slowly than average?’.

Form and Content of the Database

The Protein Data Bank (PDB) is highly redundant in terms of sequence and structure similarities. Our aim is a complete and economical description of the structural data ( Fig. 1 ). The first reduction step is the generation of a sequence-unique set. No pair of proteins within this set is more than 25% identical in sequence, and every removed structure is more than 25% identical to a representative. To avoid the removal of unique domains next to more common domains, the percentage used here is calculated as the number of residue identities in the structurally aligned region, divided by the average length of the two proteins (not by the length of the aligned region). The second step is to describe the structural neighbourhood around each sequence-unique representative chain, in the form of structural alignments. The FSSP database (DaliFSSP) has one entry per representative, reporting the structural alignments with the representative's sequence homologs (same family, membership detectable by sequence methods) and with other members of the representative set (related families, relationship difficult or impossible to detect by sequence methods). The Dali Domain Dictionary (DaliDD) is a new complement to the FSSP database that has the same format but one entry per structural domain. In other words, DaliFSSP is about proteins, or protein chains, while DaliDD is about structural domains.
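The identity measure and the 25% criterion described above can be illustrated with a small sketch. The greedy selection order, the function names and the toy numbers are assumptions made for illustration only; the text does not specify how the representative set is actually chosen.

```python
def percent_identity(n_identical, len_a, len_b):
    """Identities in the structurally aligned region over the mean chain length."""
    mean_len = (len_a + len_b) / 2.0
    return 100.0 * n_identical / mean_len


def pick_representatives(chains, identity, cutoff=25.0):
    """Greedy selection (an assumed strategy): keep a chain unless it is
    more than `cutoff` percent identical to an already kept representative.

    chains   -- chain identifiers in some priority order (hypothetical)
    identity -- function (a, b) -> percent identity as defined above
    """
    reps = []
    for chain in chains:
        if all(identity(chain, rep) <= cutoff for rep in reps):
            reps.append(chain)
    return reps


# Toy data, invented for illustration: chain lengths and the number of
# identical residues in each pairwise structural alignment.
lengths = {"A": 100, "B": 120, "C": 300}
identities = {
    frozenset(("A", "B")): 88,
    frozenset(("A", "C")): 40,
    frozenset(("B", "C")): 45,
}


def toy_identity(a, b):
    return percent_identity(identities[frozenset((a, b))], lengths[a], lengths[b])


representatives = pick_representatives(["A", "B", "C"], toy_identity)
print(representatives)
```

In this toy case chain C shares 40 identical residues with chain A. Dividing by the mean length (200) gives 20%, so C stays in the set and any additional domains it carries are not lost; dividing by a 100-residue aligned region instead would give 40% and remove it.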

Figure 3

Touring fold space. The dictionary is based on the quantification of structural similarities by all-on-all comparison of known structures. Using the pairwise similarities, each structure can be positioned in an abstract high-dimensional fold space. The overall distribution of domains into general architectural types is visualized using 2D projections of fold space (‘roadmaps’) generated by multivariate scaling methods ( 3 ). Within fold space, there are tight clusters of domains that have the same fold, i.e., similar overall arrangement of secondary structure elements. The structural relationships between instances (member domains) of a fold are visualized using dendrograms (explained in Fig. 2 ). The WWW interface allows the database of structural neighbours to be queried in a variety of ways with dynamic views generated on the fly. In this example, clicking in the lower right corner of the 2D map (top left) leads to a table view (middle) of folds occupying this region of fold space. Click on ‘details’ for a representative domain to identify structural neighbours that form bridges between the fold clusters and can be used for 3D superimposition. In this case, superimposition reveals a shared motif consisting of two crossed β-hairpins (upper right, the numbers above the ribbon diagrams refer to fold class). To analyse a fold cluster in more detail, the user can expand or contract the fold tree (click on a node, e.g., 21.1.1.) and invoke different graphical views of selected subsets that highlight conserved sequence features and structural elements (bottom).

For many types of analysis, it is useful to work within a discrete classification framework, although the data does not easily lend itself to disjoint clustering. To produce a discrete classification of domains, the all-on-all structure comparison is used to derive a fold tree (dendrogram) by a simple hierarchical clustering procedure using average linkage. Folds are then defined by cutting the fold tree at an empirically chosen cutoff such that most secondary structure elements are structurally equivalent between members of a cluster, i.e., they have the same fold. To ease navigation, subclusters that group together domains with similarities of architectural detail are obtained by cutting the tree at higher levels of structural similarity ( Fig. 2 ).
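As a rough illustration of average-linkage clustering followed by a cut, the sketch below merges the most similar pair of clusters until the best average similarity falls below a cutoff. The similarity values, the cutoff and all names are invented for the example; the actual Dali scores and the empirically chosen cutoff are not given here.

```python
def average_linkage_clusters(names, sim, cutoff):
    """Agglomerate by average pairwise similarity; stop merging below cutoff.

    names  -- list of domain identifiers
    sim    -- dict mapping frozenset({a, b}) -> similarity (e.g. a Z-score)
    cutoff -- merges whose average similarity is below this are not made
    """
    clusters = [[n] for n in names]

    def avg_sim(c1, c2):
        total = sum(sim[frozenset((a, b))] for a in c1 for b in c2)
        return total / (len(c1) * len(c2))

    while len(clusters) > 1:
        # Find the most similar pair of clusters.
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                s = avg_sim(clusters[i], clusters[j])
                if best is None or s > best[0]:
                    best = (s, i, j)
        s, i, j = best
        if s < cutoff:
            break  # equivalent to cutting the tree at this level
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return [sorted(c) for c in clusters]


# Toy similarity values for five domains (invented for illustration).
pairs = {
    ("a", "b"): 12.0, ("d", "e"): 15.0, ("c", "d"): 9.0, ("c", "e"): 8.0,
    ("a", "c"): 3.0, ("a", "d"): 2.5, ("a", "e"): 2.0,
    ("b", "c"): 3.0, ("b", "d"): 2.0, ("b", "e"): 2.5,
}
toy_sim = {frozenset(k): v for k, v in pairs.items()}
domains = ["a", "b", "c", "d", "e"]

folds = average_linkage_clusters(domains, toy_sim, cutoff=5.0)
one_class = average_linkage_clusters(domains, toy_sim, cutoff=2.0)
print(sorted(folds), one_class)
```

Raising the cutoff yields the finer subclusters mentioned in the text, and lowering it merges everything into a single class. Stopping early gives the same partition as cutting a complete tree because merge levels under average linkage do not increase from one merge to the next.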

The distribution of representative structures in folds is highly uneven. The largest fold has >100 member domains, and the four dominant folds [αβ domains, immunoglobulin-like domains, (αβ)₈ barrels, helical bundles] comprise one quarter of the number of secondary structure elements in the representative set. For book-keeping purposes, we have chosen to index folds in order of decreasing population; these indices have no intrinsic meaning and may change as more structures are solved.
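The following sketch shows why population-ordered indices are unstable: adding a few domains to a small fold can renumber every fold. The fold labels, domain identifiers and the alphabetical tie-break are hypothetical choices for the example.

```python
from collections import Counter


def index_folds_by_population(domain_to_fold):
    """Map each fold label to a 1-based index, most populated fold first."""
    counts = Counter(domain_to_fold.values())
    # Ties are broken alphabetically here, purely for determinism (an assumption).
    ordered = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return {label: rank for rank, (label, _count) in enumerate(ordered, start=1)}


# Toy assignment of domains to folds (invented for illustration).
toy_assignment = {
    "d1": "barrel", "d2": "barrel", "d3": "barrel",
    "d4": "bundle", "d5": "bundle",
    "d6": "ig-like",
}
before = index_folds_by_population(toy_assignment)

# Three newly solved ig-like domains change the ordering of all folds.
toy_assignment.update({"d7": "ig-like", "d8": "ig-like", "d9": "ig-like"})
after = index_folds_by_population(toy_assignment)
print(before, after)
```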

Uses of the Database

The web service provides graphical and tabular views of the data so that the user can take a tour of fold space while sitting and clicking ( Fig. 3 ). A tour of fold space can start from a region of fold space seen in 2D projection, from a structure selected automatically at random, from a node in the fold dendrograms, or from a string (text) search in structure or sequence databases ( 13–15 ). Hyperlinks connect structures to structural neighbours, allowing ‘walking’ through neighbourhoods of structural motifs.

Strong structural similarity despite low overall sequence similarity hints at a possible distant evolutionary relationship. The web server provides powerful tools for analysing superfamilies because the structural alignments are linked with protein families and functional annotation in sequence databases. Particularly informative (and rarely available) are the explicit multiple alignments of distantly related representatives with their sequence neighbours which often reveal a signature of invariantly conserved residues. Although such invariant residues may be widely dispersed along the 1D sequence, mapping these residues onto a structural template typically shows that they cluster together in 3D to form an active site ( 16 ). Such sets of residues are an excellent starting point for the crafting of far-reaching search profiles.
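A minimal sketch of finding such a signature is to scan a multiple alignment for columns in which every sequence carries the same residue. The alignment below is invented for illustration, and real analyses would usually tolerate conservative substitutions rather than require strict invariance.

```python
def invariant_columns(alignment, gap="-"):
    """Return (column index, residue) for columns identical in every sequence.

    alignment -- list of equal-length aligned sequences (strings)
    gap       -- gap character; columns containing a gap are never reported
    """
    if not alignment:
        return []
    length = len(alignment[0])
    if any(len(row) != length for row in alignment):
        raise ValueError("all aligned sequences must have the same length")
    conserved = []
    for col in range(length):
        residues = {row[col] for row in alignment}
        if len(residues) == 1 and gap not in residues:
            conserved.append((col, alignment[0][col]))
    return conserved


# Toy alignment of three hypothetical family representatives.
toy_alignment = [
    "MKH-LDGSAH",
    "TRHPVDG-QH",
    "AEHQIDGKLH",
]
signature = invariant_columns(toy_alignment)
print(signature)
```

The reported column indices are 0-based alignment positions. Mapping them onto a structural template would require the residue numbering of the chosen representative, which is outside the scope of this sketch.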

In the context of fold recognition, the structural classification thus leads to sequence models (profiles) that more accurately model the evolutionary variation within a superfamily, provides core templates with information about structurally conserved or variable parts, and reduces the size of the target structure database. See http://www2.embl-ebi.ac.uk/dali/testset for proposed test sets.

Distribution

The FSSP database and Dali Domain Dictionary are accessible at http://www.embl-ebi.ac.uk/dali and by anonymous ftp (file transfer protocol) from ftp.embl-ebi.ac.uk in the directory /pub/databases/fssp. The complete set of database files requires ∼140 Mb of disk storage. The web browser script is available for sites wishing to mirror the server [local installation of the HSSP ( 15 ) and PDB databases is also required].

No inclusion in other databases or database services, academic or other, without explicit permission of the authors. All rights reserved. Not to be used for classified research. Academic redistribution of single files or of the entire database is permitted, provided no changes are made in content or terms of use.

The Dali server ( 3 ) is the ‘BLAST server’ of protein 3D structures. Dali performs a database similarity search of a new structure solved by crystallography or NMR against the 3D co-ordinates of structures in the Protein Data Bank. Requests must contain at least the Cα co-ordinates of the new structure and may be sent by e-mail to dali@embl-ebi.ac.uk or submitted interactively through http://www.embl-ebi.ac.uk/dali. Please report any problems to the authors by electronic mail.
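For readers preparing a submission, the sketch below pulls the Cα records out of PDB-format text using the fixed column layout of ATOM lines. It is only an illustration: the exact input the Dali server accepted is not described here, and the helper that builds toy lines, the function names and the co-ordinate values are all hypothetical.

```python
def make_atom_line(serial, name, res, chain, resseq, x, y, z):
    """Build a toy ATOM line with PDB fixed-column layout (illustrative helper)."""
    return "ATOM  %5d %-4s %3s %s%4d    %8.3f%8.3f%8.3f" % (
        serial, name, res, chain, resseq, x, y, z)


def extract_ca(pdb_text, chain_id=None):
    """Return (chain, residue number, x, y, z) for each C-alpha ATOM record."""
    ca_atoms = []
    for line in pdb_text.splitlines():
        # HETATM records are skipped, so calcium ions named "CA" are not picked up.
        if not line.startswith("ATOM") or len(line) < 54:
            continue
        if line[12:16].strip() != "CA":      # atom name, columns 13-16
            continue
        if line[16] not in (" ", "A"):       # keep first alternate location only
            continue
        chain = line[21]                     # chain identifier, column 22
        if chain_id is not None and chain != chain_id:
            continue
        resseq = int(line[22:26])            # residue number, columns 23-26
        x = float(line[30:38])               # co-ordinates, columns 31-54
        y = float(line[38:46])
        z = float(line[46:54])
        ca_atoms.append((chain, resseq, x, y, z))
    return ca_atoms


# Toy structure fragment; the co-ordinates are invented for illustration.
toy_pdb = "\n".join([
    make_atom_line(1, " N  ", "MET", "A", 1, 27.340, 24.430, 2.614),
    make_atom_line(2, " CA ", "MET", "A", 1, 26.266, 25.413, 2.842),
    make_atom_line(3, " C  ", "MET", "A", 1, 26.913, 26.639, 3.531),
    make_atom_line(4, " CA ", "LYS", "A", 2, 24.000, 26.500, 5.500),
    make_atom_line(5, " CA ", "GLY", "B", 1, 1.000, 2.000, 3.000),
])
ca_chain_a = extract_ca(toy_pdb, chain_id="A")
print(ca_chain_a)
```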

References

1. J. Mol. Biol., 1977, vol. 112 (pg. 535–542)

2. Protein Sci., 1992 (pg. 1691–1698)

3. Science, 1996, vol. 273 (pg. 595–602)

4. Curr. Opin. Struct. Biol., 1996 (pg. 377–385)

5. J. Mol. Biol., 1995, vol. 247 (pg. 536–540)

6. Structure, 1997 (pg. 1093–1108)

7. Protein Engng, 1995 (pg. 513–525)

8. Protein Sci., 1995 (pg. 872–884)

9. Fold Des., 1996 (pg. 209–220)

10. ISMB, 1997 (pg. 140–146)

11. Methods Enzymol., 1996, vol. 266 (pg. 114–128)

12. Trends Biochem. Sci., 1997 (pg. 116–117)

13. Nucleic Acids Res., 1992 (pg. 2013–2018)

14. Proteins, 1991

15. Nucleic Acids Res., 1997 (pg. 226–230) [see also this issue, Nucleic Acids Res. (1998) 26, 313–315]

16. Proteins, 1997
