Assessing sequence comparison methods with reliable structurally identified distant evolutionary relationships
S E Brenner et al. Proc Natl Acad Sci U S A. 1998.
Abstract
Pairwise sequence comparison methods have been assessed using proteins whose relationships are known reliably from their structures and functions, as described in the SCOP database [Murzin, A. G., Brenner, S. E., Hubbard, T. & Chothia C. (1995) J. Mol. Biol. 247, 536-540]. The evaluation tested the programs BLAST [Altschul, S. F., Gish, W., Miller, W., Myers, E. W. & Lipman, D. J. (1990). J. Mol. Biol. 215, 403-410], WU-BLAST2 [Altschul, S. F. & Gish, W. (1996) Methods Enzymol. 266, 460-480], FASTA [Pearson, W. R. & Lipman, D. J. (1988) Proc. Natl. Acad. Sci. USA 85, 2444-2448], and SSEARCH [Smith, T. F. & Waterman, M. S. (1981) J. Mol. Biol. 147, 195-197] and their scoring schemes. The error rate of all algorithms is greatly reduced by using statistical scores to evaluate matches rather than percentage identity or raw scores. The E-value statistical scores of SSEARCH and FASTA are reliable: the number of false positives found in our tests agrees well with the scores reported. However, the P-values reported by BLAST and WU-BLAST2 exaggerate significance by orders of magnitude. SSEARCH, FASTA ktup = 1, and WU-BLAST2 perform best, and they are capable of detecting almost all relationships between proteins whose sequence identities are >30%. For more distantly related proteins, they do much less well; only one-half of the relationships between proteins with 20-30% identity are found. Because many homologs have low sequence similarity, most distant relationships cannot be detected by any pairwise comparison method; however, those which are identified may be used with confidence.
Figures
Figure 1
Coverage vs. error plots of different scoring schemes for SSEARCH Smith–Waterman. (A) Analysis of the PDB40D-B database. (B) Analysis of the PDB90D-B database. All of the proteins in the database were compared with each other using the SSEARCH program. The results of this single set of comparisons were considered using five different scoring schemes and assessed. The graphs show the coverage and errors per query (EPQ) for statistical scores, raw scores, and three measures using percentage identity. In the coverage vs. error plot, the x axis indicates the fraction of all homologs in the database (known from structure) which have been detected. Precisely, it is the number of detected pairs of proteins with the same fold divided by the total number of pairs from a common superfamily. PDB40D-B contains a total of 9,044 homologs, so a score of 10% indicates identification of 904 relationships. The y axis reports the number of EPQ. Because there are 1,323 queries made in the PDB40D-B all-vs.-all comparison, 13 errors correspond to 0.01, or 1% EPQ. The y axis is presented on a log scale to show results over the widely varying degrees of accuracy which may be desired. The scores that correspond to the levels of EPQ and coverage are shown in Fig. 4 and Table 1. The graph demonstrates the trade-off between sensitivity and selectivity. As more homologs are found (moving to the right), more errors are made (moving up). The ideal method would be in the lower right corner of the graph, which corresponds to identifying many evolutionary relationships without selecting unrelated proteins. Three measures of percentage identity are plotted. Percentage identity within alignment is the degree of identity within the aligned region of the proteins, without consideration of the alignment length. Percentage identity within both is the number of identical residues in the aligned region as a percentage of the average length of the query and target proteins. The HSSP equation (17) is $H = 290.15\,l^{-0.562}$, where $l$ is length, for $10 < l < 80$; $H > 100$ for $l < 10$; $H = 24.7$ for $l > 80$. The percentage identity HSSP-adjusted score is the percent identity within the alignment minus $H$. Smith–Waterman raw scores and E-values were taken directly from the sequence comparison program.
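The two axes described in this caption can be sketched in a few lines. The following is a minimal illustration, not the authors' evaluation code: the function names `coverage_vs_epq` and `coverage_at_epq` and the toy hit list are hypothetical, and the sketch assumes that every reported hit carries an E-value-like score (lower is more significant) and a flag saying whether the pair is a true homolog by structure.

```python
from typing import Iterable, List, Tuple


def coverage_vs_epq(
    hits: Iterable[Tuple[float, bool]],
    total_homolog_pairs: int,
    num_queries: int,
) -> List[Tuple[float, float]]:
    """Return one (coverage, errors-per-query) point per hit, best score first.

    hits: (score, is_homolog) pairs; a lower score is assumed more significant.
    For raw scores, where higher is better, pass the negated score instead.
    """
    points = []
    true_pos = 0
    false_pos = 0
    for _score, is_homolog in sorted(hits, key=lambda h: h[0]):
        if is_homolog:
            true_pos += 1
        else:
            false_pos += 1
        # x axis: fraction of known homologous pairs found so far.
        # y axis: number of unrelated pairs accepted, divided by query count.
        points.append((true_pos / total_homolog_pairs, false_pos / num_queries))
    return points


def coverage_at_epq(points: List[Tuple[float, float]], max_epq: float) -> float:
    """Largest coverage reached while EPQ stays at or below max_epq."""
    best = 0.0
    for coverage, epq in points:
        if epq <= max_epq:
            best = max(best, coverage)
    return best


# Hypothetical toy data: 4 homologous pairs exist, 10 queries were made.
toy_hits = [(1e-10, True), (1e-5, True), (0.01, False), (0.5, True), (2.0, False)]
toy_points = coverage_vs_epq(toy_hits, total_homolog_pairs=4, num_queries=10)
```

With the figures quoted in the caption, 904 detected pairs out of 9,044 give a coverage of about 0.10, and 13 errors over 1,323 queries give an EPQ of about 0.01.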
Figure 2
Unrelated proteins with high percentage identity. Hemoglobin β-chain (PDB code 1hds chain b, ref. , Left) and cellulase E2 (PDB code 1tml, ref. , Right) have 39% identity over 64 residues, a level which is often believed to be indicative of homology. Despite this high degree of identity, their structures strongly suggest that these proteins are not related. Appropriately, neither the raw alignment score of 85 nor the E-value of 1.3 is significant. Proteins rendered by RASMOL (40).
Figure 3
Length and percentage identity of alignments of unrelated proteins in PDB90D-B: Each pair of nonhomologous proteins found with SSEARCH is plotted as a point whose position indicates the length and the percentage identity within the alignment. Because alignment length and percentage identity are quantized, many pairs of proteins may have exactly the same alignment length and percentage identity. The line shows the HSSP threshold (though it is intended to be applied with a different matrix and parameters).
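The HSSP threshold drawn in this figure follows the piecewise equation given in the Figure 1 caption. A minimal sketch of that equation is shown here; the function names are hypothetical, and two choices are assumptions rather than statements from the source: the boundary lengths 10 and 80 are handled by the power law, and the "H > 100" case for very short alignments is represented as infinity, since no percentage identity can exceed it.

```python
import math


def hssp_threshold(length: int) -> float:
    """Percentage-identity threshold H for an alignment of the given length."""
    if length < 10:
        # Caption states H > 100 here, so no alignment can pass (assumed: inf).
        return math.inf
    if length > 80:
        return 24.7
    # Power-law region; boundaries l = 10 and l = 80 are assumed to fall here.
    return 290.15 * length ** -0.562


def hssp_adjusted_identity(percent_identity: float, length: int) -> float:
    """Percent identity within the alignment minus H (positive means above the line)."""
    return percent_identity - hssp_threshold(length)


# The unrelated pair from Figure 2: 39% identity over 64 residues.
fig2_adjusted = hssp_adjusted_identity(39.0, 64)
```

Under this sketch the Figure 2 pair sits roughly 11 percentage points above the threshold even though the proteins are unrelated, which is consistent with the caption's caution that the HSSP line was intended for a different matrix and parameters.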
Figure 4
Reliability of statistical scores in PDB90D-B: Each line shows the relationship between reported statistical score and actual error rate for a different program. E-values are reported for SSEARCH and FASTA, whereas P-values are shown for BLAST and WU-BLAST2. If the scoring were perfect, then the number of errors per query and the E-values would be the same, as indicated by the upper bold line. (P-values should be the same as EPQ for small numbers and diverge at higher values, as indicated by the lower bold line.) E-values from SSEARCH and FASTA are shown to have good agreement with EPQ but underestimate the significance slightly. BLAST and WU-BLAST2 are overconfident, with the degree of exaggeration dependent upon the score. The results for PDB40D-B were similar to those for PDB90D-B despite the difference in number of homologs detected. This graph could be used to roughly calibrate the reliability of a given statistical score.
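The parenthetical remark about P-values tracking EPQ only for small numbers can be made concrete with the usual conversion between the two scores. The sketch below assumes that chance hits follow a Poisson distribution, so that P = 1 − exp(−E); this is the standard relation, not something stated in the caption, and the function names are hypothetical.

```python
import math


def p_from_e(e_value: float) -> float:
    """Probability of at least one chance hit, given E expected chance hits."""
    # -expm1(-E) equals 1 - exp(-E) but stays accurate for very small E.
    return -math.expm1(-e_value)


def e_from_p(p_value: float) -> float:
    """Inverse conversion: expected number of chance hits for a given P-value."""
    return -math.log1p(-p_value)


# Small scores are nearly identical; large E-values saturate toward P = 1.
for e in (0.001, 0.01, 0.1, 1.0, 10.0):
    print(f"E = {e:<6} P = {p_from_e(e):.6f}")
```

Because P can never exceed 1 while the expected number of errors per query can, a perfectly calibrated P-value curve bends away from the E-value diagonal at high values, which is the divergence the lower bold line depicts.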
Figure 5
Coverage vs. error plots of different sequence comparison methods: Five different sequence comparison methods are evaluated, each using statistical scores (E- or P-values). (A) PDB40D-B database. In this analysis, the best method is the slow SSEARCH, which finds 18% of relationships at 1% EPQ. FASTA ktup = 1 and WU-BLAST2 are almost as good. (B) PDB90D-B database. The quick WU-BLAST2 program provides the best coverage at 1% EPQ on this database, although at higher levels of error it becomes slightly worse than FASTA ktup = 1 and SSEARCH.
Figure 6
Distribution and detection of homologs in PDB40D-B. Bars show the distribution of homologous pairs in PDB40D-B according to their identity (using the measure of identity in both). Filled regions indicate the number of these pairs found by the best database searching method (SSEARCH with E-values) at 1% EPQ. The PDB40D-B database contains proteins with <40% identity, and as shown on this graph, most structurally identified homologs in the database have diverged extremely far in sequence and have <20% identity. Note that the alignments may be inaccurate, especially at low levels of identity. Filled regions show that SSEARCH can identify most relationships that have 25% or more identity, but its detection wanes sharply below 25%. Consequently, the great sequence divergence of most structurally identified evolutionary relationships effectively defeats the ability of pairwise sequence comparison to detect them.
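The bar chart described in this caption amounts to binning homologous pairs by percentage identity and counting how many in each bin were detected. A minimal sketch of that tabulation follows; the function name `detection_by_identity`, the 5-point bin width, and the toy pairs are all hypothetical and do not reproduce the paper's data.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def detection_by_identity(
    pairs: Iterable[Tuple[float, bool]], bin_width: int = 5
) -> Dict[int, Tuple[int, int]]:
    """Map each identity bin's lower edge to (pairs detected, pairs total).

    pairs: (percent_identity, detected) for each known homologous pair.
    """
    totals = defaultdict(int)
    found = defaultdict(int)
    for identity, detected in pairs:
        lower_edge = int(identity // bin_width) * bin_width
        totals[lower_edge] += 1  # bar height
        if detected:
            found[lower_edge] += 1  # filled region of the bar
    return {edge: (found[edge], totals[edge]) for edge in sorted(totals)}


# Hypothetical toy pairs, for illustration only.
toy_pairs = [(12, False), (14, False), (18, True), (22, False), (27, True), (33, True)]
toy_bins = detection_by_identity(toy_pairs)
```

Each value pairs the filled count with the bar height, so the detected fraction per bin is simply the first number divided by the second.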
Similar articles
- Effective protein sequence comparison. Pearson WR. Methods Enzymol. 1996;266:227-58. doi: 10.1016/s0076-6879(96)66017-0. PMID: 8743688
- Comparative accuracy of methods for protein sequence similarity search. Agarwal P, States DJ. Bioinformatics. 1998;14(1):40-7. doi: 10.1093/bioinformatics/14.1.40. PMID: 9520500
- Sequence comparisons using multiple sequences detect three times as many remote homologues as pairwise methods. Park J, Karplus K, Barrett C, Hughey R, Haussler D, Hubbard T, Chothia C. J Mol Biol. 1998 Dec 11;284(4):1201-10. doi: 10.1006/jmbi.1998.2221. PMID: 9837738
- Practical and predictive bioinformatics methods for the identification of potentially cross-reactive protein matches. Goodman RE. Mol Nutr Food Res. 2006 Jul;50(7):655-60. doi: 10.1002/mnfr.200500277. PMID: 16810734 Review.
Cited by
- Statistical limits to the identification of ion channel domains by sequence similarity. Fodor AA, Aldrich RW. J Gen Physiol. 2006 Jun;127(6):755-66. doi: 10.1085/jgp.200509419. PMID: 16735758
- Analysis of sequencing strategies and tools for taxonomic annotation: Defining standards for progressive metagenomics. Escobar-Zepeda A, Godoy-Lozano EE, Raggi L, Segovia L, Merino E, Gutiérrez-Rios RM, Juarez K, Licea-Navarro AF, Pardo-Lopez L, Sanchez-Flores A. Sci Rep. 2018 Aug 13;8(1):12034. doi: 10.1038/s41598-018-30515-5. PMID: 30104688
- SCOP: a structural classification of proteins database. Lo Conte L, Ailey B, Hubbard TJ, Brenner SE, Murzin AG, Chothia C. Nucleic Acids Res. 2000 Jan 1;28(1):257-9. doi: 10.1093/nar/28.1.257. PMID: 10592240
- Evolutionary relationships among G protein-coupled receptors using a clustered database approach. Graul RC, Sadée W. AAPS PharmSci. 2001;3(2):E12. doi: 10.1208/ps030212. PMID: 11741263
- Genomix: a method for combining gene-finders' predictions, which uses evolutionary conservation of sequence and intron-exon structure. Coghlan A, Durbin R. Bioinformatics. 2007 Jun 15;23(12):1468-75. doi: 10.1093/bioinformatics/btm133. Epub 2007 May 5. PMID: 17483502
References
- Altschul S F, Gish W, Miller W, Myers E W, Lipman D J. J Mol Biol. 1990;215:403–410.
- Altschul S F, Gish W. Methods Enzymol. 1996;266:460–480.
- Murzin A G, Brenner S E, Hubbard T, Chothia C. J Mol Biol. 1995;247:536–540.
- Brenner S E, Chothia C, Hubbard T J P, Murzin A G. Methods Enzymol. 1996;266:635–643.