The FSSP Database: Fold Classification Based on Structure-Structure Alignment of Proteins

Liisa Holm, Chris Sander

European Molecular Biology Laboratory, D-69012 Heidelberg, Germany
Received: 02 October 1995

Accepted: 04 October 1995

Published: 01 January 1996

Liisa Holm, Chris Sander, The FSSP Database: Fold Classification Based on Structure-Structure Alignment of Proteins, Nucleic Acids Research, Volume 24, Issue 1, 1 January 1996, Pages 206–209, https://doi.org/10.1093/nar/24.1.206

Abstract

The FSSP database presents a continuously updated classification of 3-D protein folds based on an all-against-all comparison of structures currently in the Protein Data Bank (PDB) [Bernstein et al. (1977) J. Mol. Biol., 112, 535–542]. The database currently contains an extended structural family for each of 600 representative protein chains which have <25% mutual sequence identity. The results of the exhaustive pairwise structure comparisons are reported in the form of a fold tree generated by hierarchical clustering and as a series of structurally representative sets of folds at varying levels of uniqueness. For each query structure from the representative set, there is a database entry containing structure-structure alignments with its structural neighbours in the representative set and its sequence homologs in the PDB. All alignments are based purely on the 3-D co-ordinates of the proteins and are derived by an automatic structure comparison program (Dali). The FSSP database is accessible electronically on the World Wide Web and by anonymous ftp.

Introduction

Most newly determined protein sequences can be classified into families by sequence homology. However, protein families are known to retain the shape of the fold even when sequences have diverged below the limit of detection of significant similarities at the sequence level. These similarities can be detected by structural comparisons that merge protein families of known 3-D structure into structural classes, the members of which may or may not be evolutionarily related ( 1–4 ). The FSSP database contains a fold classification based on exhaustive structural alignments of known structures. The database provides a rich source of information for the study of both divergent and convergent aspects of the evolution of protein folds and defines useful test sets and a standard of truth for assessing the correctness of sequence-sequence or sequence-structure alignments.

The major new developments since last year ( 5 ) are continuous updates of the database and easy access to the data using browsers on the World Wide Web (WWW).

Form and Content of the Database

Fold classification

The basic structural entities currently used in the FSSP database are protein chains, which are identified by the Protein Data Bank (PDB) entry code plus chain identifier. All protein chains in the PDB entries that are >30 residues are listed alphabetically in PROTEIN INDEX, which gives the pointer to the representative structure of the protein family and short summary information about the strength of similarity to the representative. The sequence-representative set is derived using algorithm #1 of ref. 6 so that all pairwise sequence identities within this set are <25%. For example, PROTEIN INDEX ( Fig. 1 ) tells you that the protease inhibitor domain of Alzheimer's amyloid beta-protein precursor is deposited in the PDB as entry 1AAP, which has two chains, A and B. Both the A and B chains are 45% identical in sequence to the representative structure of the family, which is bovine pancreatic trypsin inhibitor (PDB entry 9PTI). As expected from the high sequence identity, the folds of both 1AAP chains and that of 9PTI are virtually identical (1.0–1.1 Å root-mean-square deviation of CA positions).

Classifying proteins into sequence families yields a reduction from nearly 5000 protein chains in the PDB to ∼600 representatives. This set includes many pairs of remote homologs that have completely superimposable 3-D structures despite low sequence similarity and pairs with recurrent common folding motifs. The sequence-representative set is clustered further based on all-against-all structure comparison within the sequence-representative set.
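The reduction to a sequence-representative set can be illustrated with a minimal sketch. It assumes a simple greedy pass in which a chain becomes a new representative only if it falls below the identity cutoff against every representative chosen so far; whether this matches algorithm #1 of ref. 6 in detail is not asserted here. The function name, the `toy_identity` lookup and all identity values other than the 45% quoted above for 1AAP versus 9PTI are hypothetical.

```python
from typing import Callable, Dict, List, Tuple

def select_representatives(
    chains: List[str],
    identity: Callable[[str, str], float],
    cutoff: float = 25.0,
) -> Tuple[List[str], Dict[str, str]]:
    """Greedy pass: keep a chain as a representative only if its identity
    to every representative kept so far is below the cutoff."""
    representatives: List[str] = []
    family: Dict[str, str] = {}
    for chain in chains:
        for rep in representatives:
            if identity(chain, rep) >= cutoff:
                family[chain] = rep  # joins the family of the first match
                break
        else:
            representatives.append(chain)
            family[chain] = chain
    return representatives, family

# Toy identities in percent. Only the 45% values come from the text;
# the default of 10% for all other pairs is a made-up placeholder.
PAIRWISE = {
    frozenset(("9PTI", "1AAPA")): 45.0,
    frozenset(("9PTI", "1AAPB")): 45.0,
}

def toy_identity(a: str, b: str) -> float:
    return PAIRWISE.get(frozenset((a, b)), 10.0)

reps, family = select_representatives(
    ["9PTI", "1AAPA", "1AAPB", "1RSY"], toy_identity
)
print(reps)    # representatives of the toy set
print(family)  # chain -> representative, as in PROTEIN INDEX
```

In a greedy pass of this kind the outcome depends on the order in which chains are visited, so a real selection procedure would need a rule for choosing which member of a family becomes the representative.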

FOLDTREE is a tree representation of the sequence-representative set produced by hierarchical clustering. The tree gives a simple overview of protein families, grouping together remote homologs and joining topologically similar but not necessarily evolutionarily related proteins in the lower branches. Cutting the tree at a level of Z = 2 (i.e. structural similarity scores two standard deviations above database average, taking domain size into account) yields 200 fold classes. For example, Figure 2 shows how the first C2 domain of synaptotagmin I (PDB entry 1RSY), which presented a new calcium-binding fold ( 7 ), is firmly anchored in a large structural class that contains beta-sandwich proteins with topological similarity to immunoglobulin-like domains and blue copper proteins.
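As a rough illustration of how cutting a clustering tree at successive Z levels can yield dotted fold indices, the sketch below runs a naive average-linkage clustering and re-clusters each group at the next, more stringent cutoff. It is not the FSSP implementation; the function names, the nested re-clustering shortcut and the toy Z-scores are assumptions made for the example.

```python
from itertools import combinations
from typing import Callable, Dict, List, Sequence

def average_linkage_cut(
    items: Sequence[str], z: Callable[[str, str], float], cutoff: float
) -> List[List[str]]:
    """Merge the two clusters with the highest average pairwise Z-score
    until the best remaining average falls below the cutoff."""
    clusters = [[item] for item in items]

    def avg(a: List[str], b: List[str]) -> float:
        return sum(z(x, y) for x in a for y in b) / (len(a) * len(b))

    while len(clusters) > 1:
        i, j = max(
            combinations(range(len(clusters)), 2),
            key=lambda p: avg(clusters[p[0]], clusters[p[1]]),
        )
        if avg(clusters[i], clusters[j]) < cutoff:
            break
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]  # j > i, so index i is unaffected
    return clusters

def fold_indices(
    items: Sequence[str],
    z: Callable[[str, str], float],
    cutoffs: Sequence[float] = (2, 3, 4, 5, 6, 10),
) -> Dict[str, str]:
    """Assign one index component per cutoff by re-clustering each group
    at the next, more stringent level."""
    index: Dict[str, List[int]] = {item: [] for item in items}
    groups = [list(items)]
    for cutoff in cutoffs:
        next_groups = []
        for group in groups:
            subs = average_linkage_cut(group, z, cutoff)
            for number, sub in enumerate(subs, start=1):
                for item in sub:
                    index[item].append(number)
                next_groups.append(sub)
        groups = next_groups
    return {item: ".".join(map(str, nums)) for item, nums in index.items()}

# Hypothetical Z-scores for four made-up structures A-D.
SCORES = {
    frozenset("AB"): 8.0, frozenset("AC"): 3.0, frozenset("BC"): 3.0,
    frozenset("AD"): 0.5, frozenset("BD"): 0.5, frozenset("CD"): 0.5,
}
toy_index = fold_indices(
    "ABCD", lambda a, b: SCORES[frozenset((a, b))], cutoffs=(2, 4)
)
print(toy_index)
```

With the toy scores, A, B and C share the first index component because they cluster at Z = 2, while only A and B stay together at Z = 4.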

An alternative way of defining clusters in protein fold space is used to derive the PDBfolds series of structurally representative sets using algorithm #2 of ref. 6 . The sets of representative folds contain a maximal number of protein folds where no pair is allowed to have a larger fraction of structurally equivalent residues than a given threshold percentage. This reduces the number of unique folds to consider for structural analysis, depending on the threshold chosen. For example, the common structural core covers >90% of the chain in all globin-globin pairs and >70% in any phycocyanin-globin pair. Accordingly, there is only one globin structure in the 90% list and only one representative for the phycocyanin-globin fold in the 70% list of PDBfolds.
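The effect of the threshold can be sketched with a simple greedy heuristic that repeatedly discards the fold with the most neighbours above the threshold until no such pair remains. This is an illustrative assumption, not a statement of how algorithm #2 of ref. 6 works in detail; the fold names and overlap percentages are hypothetical values chosen to be consistent with the globin and phycocyanin example.

```python
from typing import Callable, List, Sequence

def representative_folds(
    folds: Sequence[str],
    overlap: Callable[[str, str], float],
    threshold: float,
) -> List[str]:
    """Drop the most-connected fold until no remaining pair shares a larger
    percentage of structurally equivalent residues than the threshold."""
    remaining = list(folds)
    while remaining:
        neighbours = {
            f: [g for g in remaining if g != f and overlap(f, g) > threshold]
            for f in remaining
        }
        worst = max(remaining, key=lambda f: len(neighbours[f]))
        if not neighbours[worst]:
            break  # no pair exceeds the threshold any more
        remaining.remove(worst)
    return remaining

# Hypothetical percentages of structurally equivalent residues.
OVERLAP = {
    frozenset(("globin_a", "globin_b")): 95.0,
    frozenset(("globin_a", "phycocyanin")): 75.0,
    frozenset(("globin_b", "phycocyanin")): 72.0,
}

def toy_overlap(a: str, b: str) -> float:
    return OVERLAP[frozenset((a, b))]

names = ["globin_a", "globin_b", "phycocyanin"]
set_90 = representative_folds(names, toy_overlap, 90.0)
set_70 = representative_folds(names, toy_overlap, 70.0)
print(set_90)  # one globin plus the phycocyanin
print(set_70)  # a single representative for the shared fold
```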

Figure 1

Finding proteins in FSSP. All protein structures in the PDB are listed alphabetically in the PROTEIN INDEX table. The index can be used for searching by protein name or PDB code. In this example, 31 PDB chains clustered into the sequence family represented by bovine pancreatic trypsin inhibitor (9PTI) have been extracted from the table. These include multiple determinations of the same protein in different crystallographic conditions (chains with 100% sequence identity to the representative) and homologs from other species with sequence identity down to 33% relative to the representative. Notation: PDBid, PDB entry name, chain identifier appended; Repre, representative structure of the family; Rmsd, root-mean-square deviation of CA atoms in 3-D superimposition; Lali, number of structurally equivalent residues; Lseq, number of residues in PDBid; %ide, percentage of identical residues between PDBid and Repre in structural alignment; Compound, protein name echoed from the PDB entry.

Structural alignments

For each protein chain in the representative set, with PDB identifier Nxxx (e.g. 1PPT, 5PCY) and chain identifier Y (omitted if blank), there is an ASCII (text) file Nxxx.FSSP or NxxxY.FSSP which lists from a few to several tens of proteins structurally similar to the search structure, alongside the secondary structure and solvent accessibility extracted from the 3-D coordinates of the search structure ( 8 ). The structural neighbours that are reported include any sequence homologs of the query structure that have a structure in the PDB and all structurally similar chains from the representative set (Z ≥ 2). Details about the Dali method used to derive the database are given in refs 9 and 10.
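The file naming convention just described can be expressed as a small helper. This is only a sketch: the function name is hypothetical, and the identifier 1XYZ with chain A is a made-up placeholder rather than an actual FSSP entry.

```python
def fssp_filename(pdb_id: str, chain: str = "") -> str:
    """Build Nxxx.FSSP or NxxxY.FSSP; a blank chain identifier is omitted."""
    return f"{pdb_id}{chain.strip()}.FSSP"

print(fssp_filename("1PPT"))        # entry with a blank chain identifier
print(fssp_filename("1XYZ", "A"))   # hypothetical entry with chain A
```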

An FSSP file is divided into five formatted blocks and a free text footer which explains the format. (i) The header block identifies the query structure, database and structural alignment method used and gives the number of structural neighbours. (ii) The summary block gives a one-line summary for each neighbour, including the statistical significance of the similarity (Z-score), positional root-mean-square deviation of superimposed CA coordinates, total number of equivalent residues and the percentage of sequence identity over structurally equivalent positions. (iii) The alignments block is a multiple structural alignment, printed vertically and showing the sequence and secondary structure of matched residues. (iv) The equivalences block is a machine readable listing that gives the residue numbers of the structurally equivalent segments. (v) The matrices block gives the rotation-translation matrices that, when applied to the 3-D coordinates in the respective PDB entries, yield the least-squares superimposition of the matched protein onto the query structure. See below for automatic parsing of FSSP entries.
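To make the role of the matrices block concrete, the sketch below applies a rotation-translation to a set of CA coordinates and then computes the positional root-mean-square deviation against the query. It assumes the convention x' = Rx + t and does not parse the FSSP file layout; the function names and the toy coordinates are hypothetical.

```python
import math
from typing import List, Sequence, Tuple

Point = Tuple[float, float, float]

def transform(
    coords: Sequence[Point],
    rotation: Sequence[Sequence[float]],
    translation: Sequence[float],
) -> List[Point]:
    """Apply x' = R x + t to every coordinate (assumed convention)."""
    return [
        tuple(
            sum(rotation[r][c] * xyz[c] for c in range(3)) + translation[r]
            for r in range(3)
        )
        for xyz in coords
    ]

def rmsd(a: Sequence[Point], b: Sequence[Point]) -> float:
    """Positional root-mean-square deviation of paired coordinates."""
    if len(a) != len(b) or not a:
        raise ValueError("coordinate lists must be non-empty and equal in length")
    total = sum((p - q) ** 2 for u, v in zip(a, b) for p, q in zip(u, v))
    return math.sqrt(total / len(a))

# Toy example: a 90 degree rotation about z followed by a shift along x.
rotation = [[0, -1, 0], [1, 0, 0], [0, 0, 1]]
translation = [1, 0, 0]
matched = [(1, 0, 0), (0, 1, 0)]   # made-up CA positions of the neighbour
query = [(1, 1, 0), (0, 0, 0)]     # made-up CA positions of the query

moved = transform(matched, rotation, translation)
print(moved)
print(rmsd(moved, query))
```

In practice only the structurally equivalent residues listed in the equivalences block would be paired when computing the deviation.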

Distribution

World Wide Web

The FSSP database is accessible over the WWW at the URL http://www.embl-heidelberg.de/dali/fssp/ .

The most convenient starting point for a walk in fold space is the ‘alignment’ link in the FOLDTREE table. FSSP entries are parsed on the fly to display structural neighbours of individual proteins in the form of structure alignments laid out horizontally, multiple structure alignments (known structures) combined with multiple sequence alignments [sequences homologous to a known structure: HSSP database ( 11 )] or superimposed coordinates [retrieved from PDB ( 12 )] for viewing with molecular graphics programs such as Rasmol ( 13 ). There are further hypertext links to functional annotations and literature references via SRS ( 14 ). For example, a study of the p21 ras family could start from the FOLDTREE table, which immediately shows transducin alpha, the ADP-ribosylation factor 1 and elongation factor G as the closest structural neighbours. From the structural alignment of these remote homologs one can identify the conserved sequence motifs GxxxxGKS and NKxD ( 15 ). These patterns are conserved in all members of the protein families, as seen by extending the structure alignment with the results from a sequence database search ( 11 ). The number of sequence relatives displayed can be reduced from several hundred to a few tens using a cutoff of 50% identity between any pair that is displayed ( Fig. 3 ). Clicking on the sequence identifier (e.g. rashrat) pops up the Swissprot ( 16 ) annotation for this sequence.
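The two motifs named above translate directly into regular expressions if x is read as any single residue. The sketch below scans a sequence for both patterns; the function name is hypothetical and the test sequence is a made-up toy string, not a database entry.

```python
import re
from typing import Dict, List, Tuple

# "x" in the motif notation is taken to mean any single residue.
MOTIFS = {
    "GxxxxGKS": re.compile(r"G.{4}GKS"),
    "NKxD": re.compile(r"NK.D"),
}

def find_motifs(sequence: str) -> Dict[str, List[Tuple[int, str]]]:
    """Return 1-based start positions and matched text for each motif."""
    sequence = sequence.upper()
    return {
        name: [(m.start() + 1, m.group()) for m in pattern.finditer(sequence)]
        for name, pattern in MOTIFS.items()
    }

toy_sequence = "AAGAGGVGKSAAAANKCDAA"  # hypothetical toy sequence
hits = find_motifs(toy_sequence)
print(hits)
```

Note that `finditer` reports non-overlapping matches only, which is sufficient for locating a single occurrence of each signature.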

Figure 2

Overview of protein fold space. ( a ) Part of fold tree obtained by hierarchical clustering based on structural similarities between proteins in the representative set (<25% pairwise sequence identity). ( b ) The same part of the fold tree as it appears in the FOLDTREE table. A fold index is constructed by cutting an average linkage clustering tree at a similarity level of two standard deviations above expected (Z = 2), for example 31 in 31.2.5.3.1.1. for synaptotagmin. Subfamilies are defined and indexed according to cuts at similarity levels of Z = 3, 4, 5, 6 and 10, that is increasing levels of stringency. For example, the cut at Z = 4 (31.2.*) separates between blue copper proteins, hemocyanin, coagulation factor, cadherin, bacterial and eukaryotic immunoglobulin-like domains and superoxide dismutases. Indentation in the ‘PDB code’ column corresponds to the fold indices and means that a protein belongs to the same structural family/subfamily as the protein above. ( c ) Stereo view of superimposition between synaptotagmin I (PDB entry 1RSY, thick line) and a fibronectin type III domain (PDB entry 1FNA, thin line) reveals the common topological arrangement of strands in the beta sandwich (cf. ref. 7 ). Plotted with WhatIf ( 17 ).

Anonymous ftp

The FSSP data sets can be obtained by anonymous ftp from ftp.embl-heidelberg.de in the directory: /pub/databases/protein_extras/fssp.

Conditions

Academic redistribution of single files or of the entire database is permitted. No inclusion in other databases or database services, academic or other, without explicit permission of the authors. All rights reserved. Not to be used for classified research. Users are asked to refer to ref. 9 and this paper in reporting results obtained using the database.

Size of the Current Release

The size of the FSSP database is tightly coupled to that of the PDB from which it is derived. The FSSP database is updated with each release of new structures by the PDB. The size of the sequence-representative set of chains was 600 in August 1995, an 80% increase from June 1994. The complete set of result files requires ∼60 Mb of disk storage.

Limitations

The current database contains at most one alignment per pair of full length proteins. The alignments are constrained to be sequential as this is biologically meaningful though not imposed by the Dali method. Different chains in one PDB entry are compared separately; chains with <30 residues or unknown sequence are excluded.

The structure comparison program Dali ( 9 ) defines the extent of the common structural core by maximizing the agreement of intramolecular CA-CA distances. The scoring function was deliberately designed to allow inter-domain conformational flexibility; hence, positional root-mean-square deviations for the corresponding rigid-body superimpositions are often higher than for comparison methods that put an absolute upper limit on intermolecular positional deviations. This, however, is only an apparent disadvantage.
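The idea of comparing intramolecular distances rather than superimposed positions can be illustrated with a toy calculation. The sketch below is not the Dali scoring function; it simply tabulates absolute differences between corresponding CA-CA distances in two made-up four-residue "structures" whose second domain has swung about a hinge. All names and coordinates are hypothetical.

```python
import math
from typing import List, Sequence, Tuple

Point = Tuple[float, float, float]

def distance_matrix(coords: Sequence[Point]) -> List[List[float]]:
    """All pairwise intramolecular distances."""
    return [[math.dist(p, q) for q in coords] for p in coords]

def distance_differences(
    a: Sequence[Point], b: Sequence[Point]
) -> List[List[float]]:
    """Absolute difference between corresponding distances in a and b."""
    if len(a) != len(b):
        raise ValueError("structures must have the same number of residues")
    da, db = distance_matrix(a), distance_matrix(b)
    return [[abs(x - y) for x, y in zip(ra, rb)] for ra, rb in zip(da, db)]

# Residues 0-1 form "domain 1", residues 2-3 form "domain 2".
extended = [(0, 0, 0), (3, 0, 0), (10, 0, 0), (13, 0, 0)]
hinged = [(0, 0, 0), (3, 0, 0), (3, 7, 0), (3, 10, 0)]

diff = distance_differences(extended, hinged)
print(diff[0][1], diff[2][3])  # within-domain distances
print(diff[0][2])              # a distance spanning the hinge
```

Within each domain the distances are unchanged, so a distance-based measure still recognizes both domains, whereas a single rigid-body superimposition of all four residues would report a large positional deviation.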

Figure 3

Combining multiple structure-structure alignments with multiple sequence-sequence alignments. A multiple sequence alignment of four protein families: p21 ras, transducin alpha, ADP-ribosylation factor 1 and elongation factor G. Only structurally equivalent blocks are shown; the middle part of the alignment has been omitted in order to highlight the conserved sequence signatures near the N- and C-termini. Structural alignment defines the register of each of the families (indicated in the FSSP column) relative to p21 ras. In addition to the guide structures, the alignment includes representative sequence homologs (Swissprot column; first sequence corresponds to the known structure) taken from the HSSP database of sequence-sequence alignments ( 11 ). The combined multiple alignment is filtered so that any sequence pair displayed has <50% sequence identity. For example, the original HSSP entry for 5p21 lists 189 sequences; here, only 29 representative ras sequences are shown. Notation: ∼, nonequivalent segments and trailing ends from structure alignment; blanks and dots, gaps and trailing ends from sequence alignment; lowercase, insertions in sequence alignment.

Requests for alignments of newly solved crystallographic or solution NMR structures (Cα co-ordinates required) may be sent to the Dali e-mail server with Internet address: dali@embl-heidelberg.de .

More information on the Dali server ( 10 ) is available on the WWW at: URL http://www.embl-heidelberg.de/dali/dali.html . Kindly report any problems to the authors by e-mail.

References

1. J. Mol. Biol. (1994) 247, 536–540.

2. Proc. Royal Soc. Lond. (1990) B241, 132–145.

3. Protein Eng. (1993) 6, 485–500.

4. Proteins (1994) 19, 165–173.

5. Nucleic Acids Res. (1994) 22, 3600–3609.

6. Protein Sci. (1992) 1, 409–417.

7. Cell (1995) 80, 929–935.

8. Biopolymers (1983) 22, 2577–2637.

9. J. Mol. Biol. (1993) 233, 123–138.

10. Trends Biochem. Sci. (1995) 20, 478–480.

11. Proteins (1991) 9, 56–68.

12. J. Mol. Biol. (1977) 112, 535–542.

13. Trends Biochem. Sci. (1995) 20, 374–376.

14. CABIOS (1993) 9, 49–57.

15. Proc. Natl. Acad. Sci. (1991) 88, 5443–5447.

16. Nucleic Acids Res. (1992) 20, 2013–2018.

17. J. Mol. Graphics (1990) 8, 52–56.

© 1996 Oxford University Press
