The FSSP Database: Fold Classification Based on Structure-Structure Alignment of Proteins
Journal Article
Liisa Holm, Chris Sander
European Molecular Biology Laboratory, D-69012 Heidelberg, Germany
Received: 02 October 1995
Accepted: 04 October 1995
Published: 01 January 1996
Liisa Holm, Chris Sander, The FSSP Database: Fold Classification Based on Structure-Structure Alignment of Proteins, Nucleic Acids Research, Volume 24, Issue 1, 1 January 1996, Pages 206–209, https://doi.org/10.1093/nar/24.1.206
Abstract
The FSSP database presents a continuously updated classification of 3-D protein folds based on an all-against-all comparison of structures currently in the Protein Data Bank (PDB) [Bernstein et al. (1977) J. Mol. Biol., 112, 535–542]. The database currently contains an extended structural family for each of 600 representative protein chains which have <25% mutual sequence identity. The results of the exhaustive pairwise structure comparisons are reported in the form of a fold tree generated by hierarchical clustering and as a series of structurally representative sets of folds at varying levels of uniqueness. For each query structure from the representative set, there is a database entry containing structure-structure alignments with its structural neighbours in the representative set and its sequence homologs in the PDB. All alignments are based purely on the 3-D co-ordinates of the proteins and are derived by an automatic structure comparison program (Dali). The FSSP database is accessible electronically on the World Wide Web and by anonymous ftp.
Introduction
Most newly determined protein sequences can be classified into families by sequence homology. However, protein families are known to retain the shape of the fold even when sequences have diverged below the limit of detection of significant similarities at the sequence level. These similarities can be detected by structural comparisons that merge protein families of known 3-D structure into structural classes, the members of which may or may not be evolutionarily related ( 1–4 ). The FSSP database contains a fold classification based on exhaustive structural alignments of known structures. The database provides a rich source of information for the study of both divergent and convergent aspects of the evolution of protein folds and defines useful test sets and a standard of truth for assessing the correctness of sequence-sequence or sequence-structure alignments.
The major new developments since last year ( 5 ) are continuous updates of the database and easy access to the data using browsers on the World Wide Web (WWW).
Form and Content of the Database
Fold classification
The basic structural entities currently used in the FSSP database are protein chains, which are identified by the Protein Data Bank (PDB) entry code plus chain identifier. All protein chains in the PDB entries that are >30 residues long are listed alphabetically in PROTEIN INDEX, which gives a pointer to the representative structure of the protein family and short summary information about the strength of similarity to the representative. The sequence-representative set is derived using algorithm #1 of ref. 6 so that all pairwise sequence identities within this set are <25%. For example, PROTEIN INDEX (Fig. 1) shows that the protease inhibitor domain of Alzheimer's amyloid beta-protein precursor is deposited in the PDB as entry 1AAP, which has two chains, A and B. Both the A and B chains are 45% sequence identical to the representative structure of the family, which is bovine pancreatic trypsin inhibitor (PDB entry 9PTI). As expected from the high sequence identity, the folds of both 1AAP chains and that of 9PTI are virtually identical (1.0–1.1 Å root-mean-square deviation of CA positions).
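The idea of a sequence-representative set can be illustrated with a generic greedy selection. This is a minimal sketch and not a reproduction of algorithm #1 of ref. 6: the function name `select_representatives`, the lookup table and the identity values other than the 45% quoted above are assumptions made for illustration.

```python
from typing import Callable, List

def select_representatives(
    chains: List[str],
    identity: Callable[[str, str], float],
    cutoff: float = 25.0,
) -> List[str]:
    """Greedily keep a chain only if it is below the identity cutoff
    to every representative selected so far."""
    representatives: List[str] = []
    for chain in chains:
        if all(identity(chain, rep) < cutoff for rep in representatives):
            representatives.append(chain)
    return representatives

# Illustrative pairwise identities in percent. Only the 45% values
# come from the text; the 100% value is an invented placeholder.
PAIR_IDENTITY = {
    frozenset(("9PTI", "1AAPA")): 45.0,
    frozenset(("9PTI", "1AAPB")): 45.0,
    frozenset(("1AAPA", "1AAPB")): 100.0,
}

def lookup_identity(a: str, b: str) -> float:
    # Pairs missing from the table are treated as unrelated (0%).
    return PAIR_IDENTITY.get(frozenset((a, b)), 0.0)

reps = select_representatives(
    ["9PTI", "1AAPA", "1AAPB", "1RSY"], lookup_identity
)
print(reps)
```

With these inputs both 1AAP chains are absorbed into the family represented by 9PTI, while 1RSY remains as a separate representative. The result of a greedy pass depends on the input order, so in practice the list would be sorted by some quality criterion first.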
Classifying proteins into sequence families yields a reduction from nearly 5000 protein chains in the PDB to ∼600 representatives. This set includes many pairs of remote homologs that have completely superimposable 3-D structures despite low sequence similarity and pairs with recurrent common folding motifs. The sequence-representative set is clustered further based on all-against-all structure comparison within the sequence-representative set.
FOLDTREE is a tree representation of the sequence-representative set produced by hierarchical clustering. The tree gives a simple overview of protein families, grouping together remote homologs and joining topologically similar but not necessarily evolutionarily related proteins in the lower branches. Cutting the tree at a level of Z = 2 (i.e. structural similarity scores two standard deviations above database average, taking domain size into account) yields 200 fold classes. For example, Figure 2 shows how the first C2 domain of synaptotagmin I (PDB entry 1RSY), which presented a new calcium-binding fold (7), is firmly anchored in a large structural class that contains beta-sandwich proteins with topological similarity to immunoglobulin-like domains and blue copper proteins.
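Cutting a clustering tree at a given Z level can be sketched as follows. This is a simplified illustration, assuming average-linkage clustering on pairwise Z-scores; the function name and all Z-score values are invented for the example and do not come from the database.

```python
from itertools import combinations
from typing import Callable, List

def average_linkage_clusters(
    names: List[str],
    z: Callable[[str, str], float],
    z_cut: float,
) -> List[List[str]]:
    """Merge the most similar pair of clusters until the best average
    Z-score drops below z_cut, which is equivalent to cutting the
    average-linkage tree at that level."""
    clusters = [[n] for n in names]

    def avg(a: List[str], b: List[str]) -> float:
        return sum(z(x, y) for x in a for y in b) / (len(a) * len(b))

    while len(clusters) > 1:
        i, j = max(
            combinations(range(len(clusters)), 2),
            key=lambda p: avg(clusters[p[0]], clusters[p[1]]),
        )
        if avg(clusters[i], clusters[j]) < z_cut:
            break
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return sorted(sorted(c) for c in clusters)

# Invented Z-scores for illustration only.
Z_SCORES = {
    frozenset(("1RSY", "1FNA")): 5.0,
    frozenset(("1RSY", "5PCY")): 3.0,
    frozenset(("1FNA", "5PCY")): 3.5,
}

def z_lookup(a: str, b: str) -> float:
    # Pairs missing from the table are treated as dissimilar (Z = 0).
    return Z_SCORES.get(frozenset((a, b)), 0.0)

chains = ["1RSY", "1FNA", "5PCY", "9PTI"]
fold_classes = average_linkage_clusters(chains, z_lookup, z_cut=2.0)
subfamilies = average_linkage_clusters(chains, z_lookup, z_cut=4.0)
print(fold_classes)
print(subfamilies)
```

Raising the cut level from Z = 2 to Z = 4 splits the broad class into smaller groups, which is how repeated cuts at increasing stringency yield a hierarchical index.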
An alternative way of defining clusters in protein fold space is used to derive the PDBfolds series of structurally representative sets using algorithm #2 of ref. 6 . The sets of representative folds contain a maximal number of protein folds where no pair is allowed to have a larger fraction of structurally equivalent residues than a given threshold percentage. This reduces the number of unique folds to consider for structural analysis, depending on the threshold chosen. For example, the common structural core covers >90% of the chain in all globin-globin pairs and >70% in any phycocyanin-globin pair. Accordingly, there is only one globin structure in the 90% list and only one representative for the phycocyanin-globin fold in the 70% list of PDBfolds.
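The effect of the threshold can be illustrated with a simple thinning procedure. This is a sketch under stated assumptions, not algorithm #2 of ref. 6: it repeatedly removes the fold with the most neighbours above the threshold until no pair exceeds it. The fold labels and overlap fractions are hypothetical, chosen only to be consistent with the ">90%" and ">70%" figures quoted above.

```python
from typing import Callable, List

def thin_to_threshold(
    folds: List[str],
    overlap: Callable[[str, str], float],
    threshold: float,
) -> List[str]:
    """Remove the fold with the most over-threshold neighbours until
    no remaining pair exceeds the threshold."""
    remaining = list(folds)
    while remaining:
        neighbours = {
            f: [g for g in remaining if g != f and overlap(f, g) > threshold]
            for f in remaining
        }
        worst = max(remaining, key=lambda f: len(neighbours[f]))
        if not neighbours[worst]:
            break
        remaining.remove(worst)
    return remaining

# Hypothetical fractions of structurally equivalent residues.
OVERLAP = {
    frozenset(("globin_a", "globin_b")): 0.92,
    frozenset(("globin_a", "phycocyanin")): 0.75,
    frozenset(("globin_b", "phycocyanin")): 0.75,
}

def overlap_lookup(a: str, b: str) -> float:
    return OVERLAP.get(frozenset((a, b)), 0.0)

all_folds = ["globin_a", "globin_b", "phycocyanin"]
set_90 = thin_to_threshold(all_folds, overlap_lookup, 0.90)
set_70 = thin_to_threshold(all_folds, overlap_lookup, 0.70)
print(set_90, set_70)
```

At the 90% level one globin and the phycocyanin survive; at the 70% level a single representative remains for the whole group.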
Figure 1
Finding proteins in FSSP. All protein structures in the PDB are listed alphabetically in the PROTEIN INDEX table. The index can be used for searching by protein name or PDB code. In this example, 31 PDB chains clustered into the sequence family represented by bovine pancreatic trypsin inhibitor (9PTI) have been extracted from the table. These include multiple determinations of the same protein in different crystallographic conditions (chains with 100% sequence identity to the representative) and homologs from other species with sequence identity down to 33% relative to the representative. Notation: PDBid, PDB entry name, chain identifier appended; Repre, representative structure of the family; Rmsd, root-mean-square deviation of CA atoms in 3-D superimposition; Lali, number of structurally equivalent residues; Lseq, number of residues in PDBid; %ide, percentage of identical residues between PDBid and Repre in structural alignment; Compound, protein name echoed from the PDB entry.
Structural alignments
For each protein chain in the representative set, with PDB identifier Nxxx (e.g. 1PPT, 5PCY) and chain identifier Y (omitted if blank), there is an ASCII (text) file Nxxx.FSSP or NxxxY.FSSP which lists from a few to tens of proteins structurally similar to the search structure, alongside the secondary structure and solvent accessibility extracted from the 3-D coordinates of the search structure (8). The structural neighbours that are reported include any sequence homologs of the query structure that have a structure in the PDB and all structurally similar chains from the representative set (Z ≥ 2). Details about the Dali method used to derive the database are given in refs 9 and 10.
An FSSP file is divided into five formatted blocks and a free text footer which explains the format. (i) The header block identifies the query structure, database and structural alignment method used and gives the number of structural neighbours. (ii) The summary block gives a one-line summary for each neighbour, including the statistical significance of the similarity (Z-score), positional root-mean-square deviation of superimposed CA coordinates, total number of equivalent residues and the percentage of sequence identity over structurally equivalent positions. (iii) The alignments block is a multiple structural alignment, printed vertically and showing the sequence and secondary structure of matched residues. (iv) The equivalences block is a machine readable listing that gives the residue numbers of the structurally equivalent segments. (v) The matrices block gives the rotation-translation matrices that, when applied to the 3-D coordinates in the respective PDB entries, yield the least-squares superimposition of the matched protein onto the query structure. See below for automatic parsing of FSSP entries.
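How a rotation-translation matrix from the matrices block would be used can be sketched as follows. The example assumes the matrix has already been parsed into a 3×3 rotation and a translation vector; the on-disk layout of the block is not reproduced here, and the function names and coordinates are illustrative only.

```python
import math
from typing import List, Sequence, Tuple

Point = Tuple[float, float, float]

def apply_transform(
    rotation: Sequence[Sequence[float]],
    translation: Sequence[float],
    coords: List[Point],
) -> List[Point]:
    """Return rotation * p + translation for every point p."""
    return [
        tuple(
            sum(rotation[i][k] * p[k] for k in range(3)) + translation[i]
            for i in range(3)
        )
        for p in coords
    ]

def rmsd(a: List[Point], b: List[Point]) -> float:
    """Positional root-mean-square deviation over equivalent atoms."""
    if len(a) != len(b) or not a:
        raise ValueError("coordinate lists must be non-empty and equal in length")
    squared = sum((x - y) ** 2 for p, q in zip(a, b) for x, y in zip(p, q))
    return math.sqrt(squared / len(a))

# Toy CA coordinates: the matched chain is the query rotated and shifted.
query = [(1.0, 0.0, 0.0), (0.0, 1.0, 0.0), (0.0, 0.0, 1.0)]
matched = [(0.0, 9.0, 0.0), (1.0, 10.0, 0.0), (0.0, 10.0, 1.0)]

# 90 degree rotation about the z axis followed by a shift along x.
rotation = [[0.0, -1.0, 0.0], [1.0, 0.0, 0.0], [0.0, 0.0, 1.0]]
translation = [10.0, 0.0, 0.0]

superimposed = apply_transform(rotation, translation, matched)
print(rmsd(matched, query), rmsd(superimposed, query))
```

In a real entry the RMSD would be computed only over the structurally equivalent residues listed in the equivalences block, not over whole chains.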
Distribution
World Wide Web
The FSSP database is accessible over the WWW at the URL http://www.embl-heidelberg.de/dali/fssp/ .
The most convenient starting point for a walk in fold space is to click the ‘alignment’ link in the FOLDTREE table. FSSP entries are parsed on the fly to display structural neighbours of individual proteins in the form of structure alignments laid out horizontally, multiple structure alignments (known structures) combined with multiple sequence alignments [sequences homologous to a known structure: HSSP database (11)] or superimposed coordinates [retrieved from PDB (12)] for viewing with molecular graphics programs such as Rasmol (13). There are further hypertext links to functional annotations and literature references via SRS (14). For example, a study of the p21 ras family could start from the FOLDTREE table, which immediately shows transducin alpha, the ADP-ribosylation factor 1 and elongation factor G as the closest structural neighbours. From the structural alignment of these remote homologs one can identify the conserved sequence motifs GxxxxGKS and NKxD (15). These patterns are conserved in all members of the protein families, as seen by extending the structure alignment with the results from a sequence database search (11). The number of sequence relatives displayed can be reduced from several hundred to a few tens using a cutoff of 50% identity between any pair that is displayed (Fig. 3). Clicking on the sequence identifier (e.g. rashrat) pops up the Swissprot (16) annotation for this sequence.
Figure 2
Overview of protein fold space. ( a ) Part of fold tree obtained by hierarchical clustering based on structural similarities between proteins in the representative set (<25% pairwise sequence identity). ( b ) The same part of the fold tree as it appears in the FOLDTREE table. A fold index is constructed by cutting an average linkage clustering tree at a similarity level of two standard deviations above expected (Z = 2), for example 31 in 31.2.5.3.1.1. for synaptotagmin. Subfamilies are defined and indexed according to cuts at similarity levels of Z = 3, 4, 5, 6 and 10, that is increasing levels of stringency. For example, the cut at Z = 4 (31.2.*) separates between blue copper proteins, hemocyanin, coagulation factor, cadherin, bacterial and eukaryotic immunoglobulin-like domains and superoxide dismutases. Indentation in the ‘PDB code’ column corresponds to the fold indices and means that a protein belongs to the same structural family/subfamily as the protein above. ( c ) Stereo view of superimposition between synaptotagmin I (PDB entry 1RSY, thick line) and a fibronectin type III domain (PDB entry 1FNA, thin line) reveals the common topological arrangement of strands in the beta sandwich (cf. ref. 7 ). Plotted with WhatIf ( 17 ).
Anonymous ftp
The FSSP data sets can be obtained by anonymous ftp from ftp.embl-heidelberg.de in the directory: /pub/databases/protein_extras/fssp.
Conditions
Academic redistribution of single files or of the entire database is permitted. No inclusion in other databases or database services, academic or other, without explicit permission of the authors. All rights reserved. Not to be used for classified research. Users are asked to refer to ref. 9 and this paper in reporting results obtained using the database.
Size of the Current Release
The size of the FSSP database is tightly coupled to that of the PDB from which it is derived. The FSSP database is updated with each release of new structures by the PDB. The size of the sequence-representative set of chains was 600 in August 1995, an 80% increase from June 1994. The complete set of result files requires ∼60 Mb of disk storage.
Limitations
The current database contains at most one alignment per pair of full length proteins. The alignments are constrained to be sequential as this is biologically meaningful though not imposed by the Dali method. Different chains in one PDB entry are compared separately; chains with <30 residues or unknown sequence are excluded.
The structure comparison program Dali (9) defines the extent of the common structural core by maximizing the agreement of intramolecular CA-CA distances. The scoring function was deliberately designed to allow inter-domain conformational flexibility; hence, positional root-mean-square deviations for the corresponding rigid-body superimpositions are often higher than for comparison methods that put an absolute upper limit on intermolecular positional deviations. This, however, is only an apparent disadvantage.
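The principle of comparing intramolecular distances rather than superimposed positions can be illustrated with a toy score. This is a simplified sketch and not the Dali implementation: the functional form (relative distance deviation against a tolerance, damped for long distances) follows the general idea described in ref. 9, but the constants `theta` and `alpha`, the function name and the coordinates are assumptions, and only pairs with i < j are summed.

```python
import math
from itertools import combinations
from typing import List, Tuple

Point = Tuple[float, float, float]

def elastic_score(
    a: List[Point], b: List[Point], theta: float = 0.2, alpha: float = 20.0
) -> float:
    """Sum, over residue pairs, of a tolerance minus the relative
    deviation of intramolecular distances, down-weighted at long range."""
    if len(a) != len(b):
        raise ValueError("both chains must list the same equivalent residues")
    total = 0.0
    for i, j in combinations(range(len(a)), 2):
        da = math.dist(a[i], a[j])
        db = math.dist(b[i], b[j])
        mean = (da + db) / 2.0
        if mean == 0.0:
            continue  # coincident points carry no distance information
        total += (theta - abs(da - db) / mean) * math.exp(-((mean / alpha) ** 2))
    return total

# Toy coordinates in angstroms.
chain_a = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (3.8, 3.8, 0.0), (0.0, 3.8, 3.8)]

# The same shape, rotated 90 degrees about z and translated.
chain_b = [(-y + 5.0, x - 3.0, z + 2.0) for x, y, z in chain_a]

# A distorted copy with the last residue moved far away.
chain_c = chain_a[:3] + [(0.0, 10.0, 10.0)]

same = elastic_score(chain_a, chain_a)
moved = elastic_score(chain_a, chain_b)
distorted = elastic_score(chain_a, chain_c)
print(same, moved, distorted)
```

Because only internal distances enter the score, a rigid rotation and translation leaves it unchanged without any superimposition step, whereas a genuine change of shape lowers it.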
Figure 3
Combining multiple structure-structure alignments with multiple sequence-sequence alignments. A multiple sequence alignment of four protein families: p21 ras, transducin alpha, ADP-ribosylation factor 1 and elongation factor G. Only structurally equivalent blocks are shown; the middle part of the alignment has been omitted in order to highlight the conserved sequence signatures near the N- and C-termini. Structural alignment defines the register of each of the families (indicated in the FSSP column) relative to p21 ras. In addition to the guide structures, the alignment includes representative sequence homologs (Swissprot column; first sequence corresponds to the known structure) taken from the HSSP database of sequence-sequence alignments (11). The combined multiple alignment is filtered so that any sequence pair displayed has <50% sequence identity. For example, the original HSSP entry for 5p21 lists 189 sequences; here, only 29 representative ras sequences are shown. Notation: ∼, nonequivalent segments and trailing ends from structure alignment; blanks and dots, gaps and trailing ends from sequence alignment; lowercase, insertions in sequence alignment.
Related Service
Requests for alignments of newly solved crystallographic or solution NMR structures (Cα co-ordinates required) may be sent to the Dali e-mail server with Internet address: dali@embl-heidelberg.de .
More information on the Dali server ( 10 ) is available on the WWW at: URL http://www.embl-heidelberg.de/dali/dali.html . Kindly report any problems to the authors by e-mail.
References
1. J. Mol. Biol., 1994, vol. 247 (pg. 536-540)
2. Proc. Royal Soc. Lond., 1990, vol. B241 (pg. 132-145)
3. Protein Eng., 1993, vol. 6 (pg. 485-500)
4. Proteins, 1994, vol. 19 (pg. 165-173)
5. Nucleic Acids Res., 1994, vol. 22 (pg. 3600-3609)
6. Protein Sci., 1992, vol. 1 (pg. 409-417)
7. Cell, 1995, vol. 80 (pg. 929-935)
8. Biopolymers, 1983, vol. 22 (pg. 2577-2637)
9. J. Mol. Biol., 1993, vol. 233 (pg. 123-138)
10. Trends Biol Sci, 1995, vol. 20 (pg. 478-480)
11. Proteins, 1991, vol. 9 (pg. 56-68)
12. J. Mol. Biol., 1977, vol. 112 (pg. 535-542)
13. Trends Biol Sci, 1995, vol. 20 (pg. 374-376)
14. CABIOS, 1993, vol. 9 (pg. 49-57)
15. Proc. Natl. Acad. Sci., 1991, vol. 88 (pg. 5443-5447)
16. Nucleic Acids Res., 1992, vol. 20 (pg. 2013-2018)
17. J. Mol. Graphics, 1990, vol. 8 (pg. 52-56)
© 1996 Oxford University Press