Non-redundant PDB data set for VAST
All the chains available from PDB are compared with each other using the BLAST algorithm as implemented in the NCBI toolkit library. They are then clustered into groups of sequence-similar chains using the single-linkage clustering procedure. Chains within a sequence-similar group thus derived are automatically ranked according to the precision and completeness of their structural data. The following measures of the structural quality are used in this order of priority:
- Lower percentage of residues with unknown amino acid type,
- Lower percentage of residues with incomplete coordinate data,
- Lower percentage of residues whose coordinate data are missing,
- Lower percentage of residues with incomplete side-chain coordinate data,
- Higher resolution,
- Larger number of chains (subunits) contained in the PDB entry,
- Larger number of heterogens contained in the PDB entry,
- Larger number of different types of heterogens,
- Larger number of residues, and
- Alphanumerical order of their PDB codes.
The top-ranked chain is generally chosen as the representative of the group. In some cases, however, a lower-ranked chain may be chosen by the authors manually. For example, if the top-ranked chain was a mutant protein and there was a native protein with reasonably comparable structural quality, then that lower-ranked native protein might replace the mutant. Representatives from all the groups together form a non-redundant set.
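The ranking criteria listed above amount to a lexicographic sort. The following is a minimal Python sketch of that idea, not the actual NCBI implementation. The `Chain` record and all of its field names are hypothetical, and it assumes resolution is given in angstroms, so that a smaller value means higher resolution.

```python
from dataclasses import dataclass


@dataclass
class Chain:
    # All field names are hypothetical, chosen to mirror the listed criteria.
    pdb_code: str
    chain_id: str
    pct_unknown: float               # % residues with unknown amino acid type
    pct_incomplete: float            # % residues with incomplete coordinate data
    pct_missing: float               # % residues whose coordinate data are missing
    pct_incomplete_sidechain: float  # % residues with incomplete side-chain data
    resolution: float                # assumed angstroms; smaller = higher resolution
    n_chains: int                    # chains (subunits) in the PDB entry
    n_heterogens: int                # heterogens in the PDB entry
    n_heterogen_types: int           # distinct heterogen types
    n_residues: int


def rank_key(c: Chain):
    """Build a tuple that sorts ascending in the stated order of priority.

    "Lower is better" criteria are used as-is; "larger is better" criteria
    are negated so that larger values sort first.
    """
    return (
        c.pct_unknown,
        c.pct_incomplete,
        c.pct_missing,
        c.pct_incomplete_sidechain,
        c.resolution,
        -c.n_chains,
        -c.n_heterogens,
        -c.n_heterogen_types,
        -c.n_residues,
        c.pdb_code,  # final tie-breaker: alphanumerical order of PDB codes
    )


def rank_group(chains):
    """Return the chains of one sequence-similar group, best-ranked first."""
    return sorted(chains, key=rank_key)
```

Under these assumptions, the first element of `rank_group(group)` would be the default representative. The manual override described above (for example, preferring a native protein over a mutant) is not modeled here.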
In comparing sequences, the database-size parameter of the BLAST algorithm is fixed at 500,000. This allows the use of constant p-value cutoffs when clustering chains. Four similarity cutoffs are used: BLAST p-values of 10e-7, 10e-40, and 10e-80, and 100% sequence identity. This results in a hierarchical clustering of PDB chains and four sets of representatives with different levels of non-redundancy.
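As an illustration of single-linkage clustering at a fixed cutoff, the sketch below groups chains with a union-find structure: any two chains whose pairwise p-value is at or below the cutoff end up in the same group, directly or through a chain of intermediates. The function name, the input format (a dictionary of pairwise p-values), and the "at or below" comparison are assumptions made for this example. The 100% identity level would need a different pairwise test and is not covered.

```python
def single_linkage(chains, pair_pvalues, cutoff):
    """Group chains by single linkage (hypothetical helper, not NCBI code).

    chains       -- iterable of chain identifiers
    pair_pvalues -- dict mapping (chain_a, chain_b) to a BLAST p-value
    cutoff       -- pairs with p-value <= cutoff are linked (assumed rule)
    """
    chains = list(chains)
    parent = {c: c for c in chains}

    def find(x):
        # Follow parent links to the root, halving the path as we go.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for (a, b), p in pair_pvalues.items():
        if p <= cutoff:
            root_a, root_b = find(a), find(b)
            if root_a != root_b:
                parent[root_b] = root_a  # merge the two groups

    groups = {}
    for c in chains:
        groups.setdefault(find(c), []).append(c)
    # Sort members and groups so the result is deterministic.
    return sorted(sorted(g) for g in groups.values())
```

Running this once per cutoff, from the loosest to the strictest, yields nested groupings: every group at a stricter cutoff lies inside a group at a looser one, which is what makes the overall clustering hierarchical.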
The non-redundant set does not include chains with fewer than 20 residues or chains whose coordinates come from a theoretical model. A chain with more than 5% "UNKNOWN" residues is included in the clustering but will not be selected as a representative.
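These two rules can be read as separate predicates, one for entering the clustering and one for being eligible as a representative. The sketch below is one possible reading; the function and parameter names are hypothetical.

```python
MIN_RESIDUES = 20        # chains shorter than this are excluded
MAX_PCT_UNKNOWN = 5.0    # above this, a chain cannot be a representative


def include_in_clustering(n_residues, is_theoretical_model):
    """True if the chain takes part in clustering at all."""
    return n_residues >= MIN_RESIDUES and not is_theoretical_model


def eligible_as_representative(n_residues, is_theoretical_model, pct_unknown):
    """True if the chain may also be picked to represent its group."""
    return (
        include_in_clustering(n_residues, is_theoretical_model)
        and pct_unknown <= MAX_PCT_UNKNOWN
    )
```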
The non-redundant set is updated on a regular basis (about once a month), in synchronization with updates of MMDB and the VAST database of structure neighbors.