The FSSP Database: Fold Classification Based on Structure-Structure Alignment of Proteins

Liisa Holm and Chris Sander

European Molecular Biology Laboratory, D-69012 Heidelberg, Germany

Received: 02 October 1995

Accepted: 04 October 1995

Published: 01 January 1996

Liisa Holm, Chris Sander, The FSSP Database: Fold Classification Based on Structure-Structure Alignment of Proteins, Nucleic Acids Research, Volume 24, Issue 1, 1 January 1996, Pages 206–209, https://doi.org/10.1093/nar/24.1.206

Abstract

The FSSP database presents a continuously updated classification of 3-D protein folds based on an all-against-all comparison of structures currently in the Protein Data Bank (PDB) [Bernstein et al. (1977) J. Mol. Biol., 112, 535–542]. The database currently contains an extended structural family for each of 600 representative protein chains which have <25% mutual sequence identity. The results of the exhaustive pairwise structure comparisons are reported in the form of a fold tree generated by hierarchical clustering and as a series of structurally representative sets of folds at varying levels of uniqueness. For each query structure from the representative set, there is a database entry containing structure-structure alignments with its structural neighbours in the representative set and its sequence homologs in the PDB. All alignments are based purely on the 3-D co-ordinates of the proteins and are derived by an automatic structure comparison program (Dali). The FSSP database is accessible electronically on the World Wide Web and by anonymous ftp.

Introduction

Most newly determined protein sequences can be classified into families by sequence homology. However, protein families are known to retain the shape of the fold even when sequences have diverged below the limit of detection of significant similarities at the sequence level. These similarities can be detected by structural comparisons that merge protein families of known 3-D structure into structural classes, the members of which may or may not be evolutionarily related ( 1–4 ). The FSSP database contains a fold classification based on exhaustive structural alignments of known structures. The database provides a rich source of information for the study of both divergent and convergent aspects of the evolution of protein folds and defines useful test sets and a standard of truth for assessing the correctness of sequence-sequence or sequence-structure alignments.

The major new developments since last year ( 5 ) are continuous updates of the database and easy access to the data using browsers on the World Wide Web (WWW).

Form and Content of the Database

Fold classification

The basic structural entities currently used in the FSSP database are protein chains, which are identified by the Protein Data Bank (PDB) entry code plus chain identifier. All protein chains in the PDB entries that are longer than 30 residues are listed alphabetically in PROTEIN INDEX, which gives the pointer to the representative structure of the protein family and short summary information about the strength of similarity to the representative. The sequence-representative set is derived using algorithm #1 of ref. 6 so that all pairwise sequence identities within this set are <25%. For example, PROTEIN INDEX (Fig. 1) shows that the protease inhibitor domain of Alzheimer's amyloid beta-protein precursor is deposited in the PDB as entry 1AAP, which has two chains, A and B. Both the A and B chains are 45% sequence identical to the representative structure of the family, which is bovine pancreatic trypsin inhibitor (PDB entry 9PTI). As expected from the high sequence identity, the folds of both 1AAP chains and that of 9PTI are virtually identical (1.0–1.1 Å root-mean-square deviation of CA positions).

Classifying proteins into sequence families yields a reduction from nearly 5000 protein chains in the PDB to ∼600 representatives. This set includes many pairs of remote homologs that have completely superimposable 3-D structures despite low sequence similarity and pairs with recurrent common folding motifs. The sequence-representative set is clustered further based on all-against-all structure comparison within the sequence-representative set.

FOLDTREE is a tree representation of the sequence-representative set produced by hierarchical clustering. The tree gives a simple overview of protein families, grouping together remote homologs and joining topologically similar but not necessarily evolutionarily related proteins in the lower branches. Cutting the tree at a level of Z = 2 (i.e. structural similarity scores two standard deviations above database average, taking domain size into account) yields 200 fold classes. For example, Figure 2 shows how the first C2 domain of synaptotagmin I (PDB entry 1RSY), which presented a new calcium-binding fold (7), is firmly anchored in a large structural class that contains beta-sandwich proteins with topological similarity to immunoglobulin-like domains and blue copper proteins.
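The idea of cutting a hierarchical clustering at a similarity level can be sketched in a few lines of Python. This is only an illustration under stated assumptions: it uses a plain average-linkage agglomeration over a table of pairwise Z-scores, the function name `average_linkage_clusters` is hypothetical, and the chain names and scores in `toy_z` are invented for the example. It is not the FSSP production procedure.

```python
from itertools import combinations


def average_linkage_clusters(names, z, cutoff=2.0):
    """Merge clusters while the best average pairwise Z-score is >= cutoff.

    `z` maps frozenset({a, b}) to the similarity Z-score of chains a and b.
    """
    clusters = [[n] for n in names]

    def avg(a, b):
        # Average linkage: mean similarity over all cross-cluster pairs.
        return sum(z[frozenset((x, y))] for x in a for y in b) / (len(a) * len(b))

    while len(clusters) > 1:
        (i, j), best = max(
            (((i, j), avg(clusters[i], clusters[j]))
             for i, j in combinations(range(len(clusters)), 2)),
            key=lambda t: t[1],
        )
        if best < cutoff:
            break  # no remaining pair of clusters is similar enough
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return sorted(sorted(c) for c in clusters)


# Invented toy scores for four hypothetical chains A-D.
toy_z = {
    frozenset(("A", "B")): 8.0,
    frozenset(("A", "C")): 3.0,
    frozenset(("B", "C")): 2.5,
    frozenset(("A", "D")): 0.5,
    frozenset(("B", "D")): 0.3,
    frozenset(("C", "D")): 1.0,
}

print(average_linkage_clusters(list("ABCD"), toy_z, cutoff=2.0))
print(average_linkage_clusters(list("ABCD"), toy_z, cutoff=3.0))
```

Raising the cutoff splits the toy data into more, tighter clusters, which mirrors how more stringent cuts of the fold tree define subfamilies.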

An alternative way of defining clusters in protein fold space is used to derive the PDBfolds series of structurally representative sets using algorithm #2 of ref. 6 . The sets of representative folds contain a maximal number of protein folds where no pair is allowed to have a larger fraction of structurally equivalent residues than a given threshold percentage. This reduces the number of unique folds to consider for structural analysis, depending on the threshold chosen. For example, the common structural core covers >90% of the chain in all globin-globin pairs and >70% in any phycocyanin-globin pair. Accordingly, there is only one globin structure in the 90% list and only one representative for the phycocyanin-globin fold in the 70% list of PDBfolds.
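A threshold-based representative set of this kind can be illustrated with a simple greedy filter. The sketch below is an assumption-laden stand-in, not algorithm #2 of ref. 6: a greedy pass does not guarantee a maximal set, the function name `select_representatives` is hypothetical, and the overlap fractions (0.95 and 0.75) are invented values chosen only to be consistent with the ">90%" and ">70%" figures quoted above.

```python
def select_representatives(chains, overlap, threshold):
    """Greedily keep chains whose overlap with every kept chain is <= threshold.

    `overlap` maps frozenset({a, b}) to the fraction of structurally
    equivalent residues; missing pairs are treated as having no overlap.
    """
    kept = []
    for chain in chains:
        if all(overlap.get(frozenset((chain, k)), 0.0) <= threshold for k in kept):
            kept.append(chain)
    return kept


# Hypothetical chain labels and illustrative overlap fractions.
chains = ["globin_a", "globin_b", "phycocyanin", "unrelated"]
overlap = {
    frozenset(("globin_a", "globin_b")): 0.95,
    frozenset(("globin_a", "phycocyanin")): 0.75,
    frozenset(("globin_b", "phycocyanin")): 0.75,
}

print(select_representatives(chains, overlap, 0.90))
print(select_representatives(chains, overlap, 0.70))
```

With the 0.90 threshold only one globin survives but phycocyanin is kept; with the 0.70 threshold a single chain represents the whole phycocyanin-globin fold.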

Figure 1

Finding proteins in FSSP. All protein structures in the PDB are listed alphabetically in the PROTEIN INDEX table. The index can be used for searching by protein name or PDB code. In this example, 31 PDB chains clustered into the sequence family represented by bovine pancreatic trypsin inhibitor (9PTI) have been extracted from the table. These include multiple determinations of the same protein in different crystallographic conditions (chains with 100% sequence identity to the representative) and homologs from other species with sequence identity down to 33% relative to the representative. Notation: PDBid, PDB entry name, chain identifier appended; Repre, representative structure of the family; Rmsd, root-mean-square deviation of CA atoms in 3-D superimposition; Lali, number of structurally equivalent residues; Lseq, number of residues in PDBid; %ide, percentage of identical residues between PDBid and Repre in structural alignment; Compound, protein name echoed from the PDB entry.

Structural alignments

For each protein chain in the representative set, with PDB identifier Nxxx (e.g. 1PPT, 5PCY) and chain identifier Y (omitted if blank), there is an ASCII (text) file Nxxx.FSSP or NxxxY.FSSP which lists from a few to several tens of proteins structurally similar to the search structure, alongside the secondary structure and solvent accessibility extracted from the 3-D coordinates of the search structure (8). The structural neighbours that are reported include any sequence homologs to the query structure that have a structure in the PDB and all structurally similar chains from the representative set (Z ≥ 2). Details about the Dali method used to derive the database are given in refs 9 and 10.
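The file naming rule can be written as a small helper. This is a sketch assuming the upper-case form shown in the text (Nxxx.FSSP, NxxxY.FSSP); the actual case used on the server is not stated here, and the function name `fssp_filename` and the entry `1abc` with chain `B` are hypothetical.

```python
def fssp_filename(pdb_code, chain_id=""):
    """Return 'Nxxx.FSSP' or 'NxxxY.FSSP' for a representative chain.

    A blank chain identifier is omitted, as described in the text.
    """
    if len(pdb_code) != 4:
        raise ValueError("PDB entry codes have four characters")
    return f"{pdb_code.upper()}{chain_id.strip().upper()}.FSSP"


print(fssp_filename("1PPT"))       # entry without a chain identifier
print(fssp_filename("1abc", "B"))  # hypothetical entry with chain B
```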

An FSSP file is divided into five formatted blocks and a free-text footer which explains the format. (i) The header block identifies the query structure, database and structural alignment method used and gives the number of structural neighbours. (ii) The summary block gives a one-line summary for each neighbour, including the statistical significance of the similarity (Z-score), positional root-mean-square deviation of superimposed CA coordinates, total number of equivalent residues and the percentage of sequence identity over structurally equivalent positions. (iii) The alignments block is a multiple structural alignment, printed vertically and showing the sequence and secondary structure of matched residues. (iv) The equivalences block is a machine-readable listing that gives the residue numbers of the structurally equivalent segments. (v) The matrices block gives the rotation-translation matrices that, when applied to the 3-D coordinates in the respective PDB entries, yield the least-squares superimposition of the matched protein onto the query structure. See below for automatic parsing of FSSP entries.
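How a rotation-translation matrix and the root-mean-square deviation are used can be shown with a minimal Python sketch. It assumes the convention x' = R·x + t with a 3×3 rotation and a 3-vector translation; the exact layout of the matrices block is not reproduced here, and the function names and coordinates are hypothetical.

```python
import math


def apply_transform(coords, rotation, translation):
    """Return R @ x + t for each 3-D point x (assumed convention)."""
    return [
        tuple(sum(rotation[i][j] * p[j] for j in range(3)) + translation[i]
              for i in range(3))
        for p in coords
    ]


def rmsd(a, b):
    """Root-mean-square deviation between two equally long coordinate lists."""
    if len(a) != len(b) or not a:
        raise ValueError("need two non-empty lists of equal length")
    total = sum((x - y) ** 2 for p, q in zip(a, b) for x, y in zip(p, q))
    return math.sqrt(total / len(a))


# Toy example: rotate 90 degrees about the z axis, then shift by 1 along x.
rotation = [[0.0, -1.0, 0.0], [1.0, 0.0, 0.0], [0.0, 0.0, 1.0]]
translation = (1.0, 0.0, 0.0)
ca_atoms = [(1.0, 0.0, 0.0), (0.0, 2.0, 0.0)]

moved = apply_transform(ca_atoms, rotation, translation)
print(moved)
print(rmsd(moved, ca_atoms))
```

In practice the transform would be applied to the CA coordinates of the matched protein, and the RMSD evaluated only over the residue pairs listed in the equivalences block.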

Distribution

World Wide Web

The FSSP database is accessible over the WWW at the URL http://www.embl-heidelberg.de/dali/fssp/.

The most convenient starting point for a walk in fold space is to click the ‘alignment’ link in the FOLDTREE table. FSSP entries are parsed on the fly to display structural neighbours of individual proteins in the form of structure alignments laid out horizontally, multiple structure alignments (known structures) combined with multiple sequence alignments [sequences homologous to a known structure: HSSP database (11)] or superimposed coordinates [retrieved from PDB (12)] for viewing with molecular graphics programs such as Rasmol (13). There are further hypertext links to functional annotations and literature references via SRS (14). For example, a study of the p21 ras family could start from the FOLDTREE table, which immediately shows transducin alpha, the ADP-ribosylation factor 1 and elongation factor G as the closest structural neighbours. From the structural alignment of these remote homologs one can identify the conserved sequence motifs GxxxxGKS and NKxD (15). These patterns are conserved in all members of the protein families, as seen by extending the structure alignment with the results from a sequence database search (11). The number of sequence relatives displayed can be reduced from several hundred to a few tens using a cutoff of 50% identity between any pair that is displayed (Fig. 3). Clicking on the sequence identifier (e.g. rashrat) pops up the Swissprot (16) annotation for this sequence.
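Motifs written in this notation, where x stands for any residue, are easy to scan for with a regular expression. The sketch below uses only Python's standard `re` module; the helper names are hypothetical and the test sequence is a made-up toy string, not a database entry.

```python
import re


def motif_to_regex(motif):
    """Translate a motif such as 'GxxxxGKS' into a regex; 'x' matches any residue."""
    return re.compile("".join("." if c == "x" else re.escape(c) for c in motif))


def find_motif(motif, sequence):
    """Return 0-based start positions of all (possibly overlapping) matches."""
    # A lookahead makes each match zero-width, so overlapping hits are reported.
    pattern = re.compile(f"(?=({motif_to_regex(motif).pattern}))")
    return [m.start() for m in pattern.finditer(sequence.upper())]


toy_sequence = "AAGAGGVGKSTTNKCDLL"  # invented sequence for illustration only

print(find_motif("GxxxxGKS", toy_sequence))
print(find_motif("NKxD", toy_sequence))
```

The sequence is upper-cased before matching so that lowercase residues, which mark insertions in some alignment displays, are still scanned.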

Figure 2

Overview of protein fold space. (a) Part of the fold tree obtained by hierarchical clustering based on structural similarities between proteins in the representative set (<25% pairwise sequence identity). (b) The same part of the fold tree as it appears in the FOLDTREE table. A fold index is constructed by cutting an average linkage clustering tree at a similarity level of two standard deviations above expected (Z = 2), for example 31 in 31.2.5.3.1.1 for synaptotagmin. Subfamilies are defined and indexed according to cuts at similarity levels of Z = 3, 4, 5, 6 and 10, that is, at increasing levels of stringency. For example, the cut at Z = 4 (31.2.*) separates blue copper proteins, hemocyanin, coagulation factor, cadherin, bacterial and eukaryotic immunoglobulin-like domains and superoxide dismutases. Indentation in the ‘PDB code’ column corresponds to the fold indices and means that a protein belongs to the same structural family/subfamily as the protein above. (c) Stereo view of the superimposition between synaptotagmin I (PDB entry 1RSY, thick line) and a fibronectin type III domain (PDB entry 1FNA, thin line) reveals the common topological arrangement of strands in the beta sandwich (cf. ref. 7). Plotted with WhatIf (17).
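The fold index lends itself to simple programmatic comparison. The sketch below assumes, following the caption, that the six dot-separated components correspond in order to cuts at Z = 2, 3, 4, 5, 6 and 10; the function names are hypothetical, and every index other than 31.2.5.3.1.1 is invented for illustration.

```python
Z_LEVELS = (2, 3, 4, 5, 6, 10)  # cut levels, in the order given in the caption


def parse_fold_index(index):
    """Turn '31.2.5.3.1.1' (a trailing dot is tolerated) into a tuple of ints."""
    return tuple(int(part) for part in index.strip(".").split("."))


def highest_shared_cut(index_a, index_b):
    """Return the most stringent Z cut at which two chains still share a cluster.

    Returns None if the chains are already separated at the Z = 2 cut.
    """
    a, b = parse_fold_index(index_a), parse_fold_index(index_b)
    shared = 0
    for x, y in zip(a, b):
        if x != y:
            break
        shared += 1
    if shared == 0:
        return None
    return Z_LEVELS[min(shared, len(Z_LEVELS)) - 1]


print(parse_fold_index("31.2.5.3.1.1."))
print(highest_shared_cut("31.2.5.3.1.1", "31.2.7.1.1.1"))  # second index is made up
```

Two chains whose indices agree in the first two components but differ in the third are grouped at the Z = 3 cut and separated at Z = 4.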

Anonymous ftp

The FSSP data sets can be obtained by anonymous ftp from ftp.embl-heidelberg.de in the directory: /pub/databases/protein_extras/fssp.

Conditions

Academic redistribution of single files or of the entire database is permitted. No inclusion in other databases or database services, academic or other, without explicit permission of the authors. All rights reserved. Not to be used for classified research. Users are asked to refer to ref. 9 and this paper in reporting results obtained using the database.

Size of the Current Release

The size of the FSSP database is tightly coupled to that of the PDB from which it is derived. The FSSP database is updated with each release of new structures by the PDB. The size of the sequence-representative set of chains was 600 in August 1995, an 80% increase from June 1994. The complete set of result files requires ∼60 Mb of disk storage.

Limitations

The current database contains at most one alignment per pair of full length proteins. The alignments are constrained to be sequential as this is biologically meaningful though not imposed by the Dali method. Different chains in one PDB entry are compared separately; chains with <30 residues or unknown sequence are excluded.

The structure comparison program Dali (9) defines the extent of the common structural core by maximizing the agreement of intramolecular CA-CA distances. The scoring function was deliberately designed to allow inter-domain conformational flexibility; hence, positional root-mean-square deviations for the corresponding rigid-body superimpositions are often higher than for comparison methods that put an absolute upper limit on intermolecular positional deviations. This, however, is only an apparent disadvantage.
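Why a comparison based on intramolecular distances needs no prior superimposition can be seen in a small Python sketch. This is not the Dali scoring function (see ref. 9 for that); it is a deliberately simplified measure, with hypothetical function names and invented coordinates, that compares the CA-CA distance matrices of two already equivalenced sets of residues.

```python
import math


def distance_matrix(coords):
    """All pairwise distances within one structure (intramolecular)."""
    return [[math.dist(p, q) for q in coords] for p in coords]


def mean_distance_difference(coords_a, coords_b):
    """Mean absolute difference of intramolecular distances over residue pairs."""
    n = len(coords_a)
    if n != len(coords_b) or n < 2:
        raise ValueError("need two equally long lists with at least two points")
    da, db = distance_matrix(coords_a), distance_matrix(coords_b)
    pairs = [(i, j) for i in range(n) for j in range(i + 1, n)]
    return sum(abs(da[i][j] - db[i][j]) for i, j in pairs) / len(pairs)


# Toy CA traces: b is a rotated and shifted copy of a, c is genuinely different.
a = [(0.0, 0.0, 0.0), (3.0, 0.0, 0.0), (3.0, 4.0, 0.0)]
b = [(10.0, 10.0, 10.0), (10.0, 13.0, 10.0), (6.0, 13.0, 10.0)]
c = [(0.0, 0.0, 0.0), (3.0, 0.0, 0.0), (3.0, 8.0, 0.0)]

print(mean_distance_difference(a, b))  # rigid-body motion leaves distances unchanged
print(mean_distance_difference(a, c))  # a real shape change is detected
```

Because the measure depends only on distances within each molecule, it is unaffected by how the two structures happen to be placed in space, whereas a positional deviation is meaningful only after a rigid-body superimposition.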

Figure 3

Combining multiple structure-structure alignments with multiple sequence-sequence alignments. A multiple sequence alignment of four protein families: p21 ras, transducin alpha, ADP-ribosylation factor 1 and elongation factor G. Only structurally equivalent blocks are shown; the middle part of the alignment has been omitted in order to highlight the conserved sequence signatures near the N- and C-termini. Structural alignment defines the register of each of the families (indicated in the FSSP column) relative to p21 ras. In addition to the guide structures, the alignment includes representative sequence homologs (Swissprot column; first sequence corresponds to the known structure) taken from the HSSP database of sequence-sequence alignments (11). The combined multiple alignment is filtered so that any sequence pair displayed has <50% sequence identity. For example, the original HSSP entry for 5p21 lists 189 sequences; here, only 29 representative ras sequences are shown. Notation: ∼, nonequivalent segments and trailing ends from structure alignment; blanks and dots, gaps and trailing ends from sequence alignment; lowercase, insertions in sequence alignment.

Requests for alignments of newly solved crystallographic or solution NMR structures (Cα co-ordinates required) may be sent to the Dali e-mail server at the Internet address dali@embl-heidelberg.de.

More information on the Dali server ( 10 ) is available on the WWW at: URL http://www.embl-heidelberg.de/dali/dali.html . Kindly report any problems to the authors by e-mail.

References

1. J. Mol. Biol. (1994), vol. 247, pp. 536–540.

2. Proc. Royal Soc. Lond. (1990), vol. B241, pp. 132–145.

3. Protein Eng. (1993), pp. 485–500.

4. Proteins (1994), pp. 165–173.

5. Nucleic Acids Res. (1994), pp. 3600–3609.

6. Protein Sci. (1992), pp. 409–417.

7. Cell (1995), pp. 929–935.

8. Biopolymers (1983), pp. 2577–2637.

9. J. Mol. Biol. (1993), vol. 233, pp. 123–138.

10. Trends Biol Sci (1995), pp. 478–480.

11. Proteins (1991).

12. J. Mol. Biol. (1977), vol. 112, pp. 535–542.

13. Trends Biol Sci (1995), pp. 374–376.

14. CABIOS (1993).

15. Proc. Natl. Acad. Sci. (1991), pp. 5443–5447.

16. Nucleic Acids Res. (1992), pp. 2013–2018.

17. J. Mol. Graphics (1990).
