Detection of homologous proteins by an intermediate sequence search - PubMed

Comparative Study

Detection of homologous proteins by an intermediate sequence search

Bino John et al. Protein Sci. 2004 Jan.

Abstract

We developed a variant of the intermediate sequence search method (ISS(new)) for detection and alignment of weakly similar pairs of protein sequences. ISS(new) relates two query sequences by an intermediate sequence that is potentially homologous to both queries. The improvement was achieved by a more robust overlap score for a match between the queries through an intermediate. The approach was benchmarked on a data set of 2369 sequences of known structure with insignificant sequence similarity to each other (BLAST E-value larger than 0.001); 2050 of these sequences had a related structure in the set. ISS(new) performed significantly better than both PSI-BLAST and a previously described intermediate sequence search method. PSI-BLAST could not detect correct homologs for 1619 of the 2369 sequences. In contrast, ISS(new) assigned a correct homolog as the top hit for 121 of these 1619 sequences, while incorrectly assigning homologs for only nine targets; it did not assign homologs for the remainder of the sequences. By estimate, ISS(new) may be able to assign the folds of domains in approximately 29,000 of the approximately 500,000 sequences unassigned by PSI-BLAST, with 90% specificity (1 - false positives fraction). In addition, we show that the 15 alignments with the most significant BLAST E-values include the nearly best alignments constructed by ISS(new).
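The benchmark counts quoted in the abstract can be turned into proportions with a few lines of arithmetic. The sketch below is only illustrative: the variable names are made up here, and treating the false positives fraction as incorrect assignments divided by all assignments is an assumption, not a definition taken from the paper. It covers the benchmark counts only and does not reproduce the procedure behind the 29,000-sequence estimate or its 90% specificity.

```python
# Counts quoted in the abstract.
missed_by_psiblast = 1619      # sequences with no correct PSI-BLAST homolog
iss_correct_top_hits = 121     # ISS(new) top hit is a correct homolog
iss_incorrect = 9              # ISS(new) assigned an incorrect homolog

# Sequences for which ISS(new) made any assignment at all.
assigned = iss_correct_top_hits + iss_incorrect

# Fraction of the PSI-BLAST misses that ISS(new) recovers correctly.
recovered_fraction = iss_correct_top_hits / missed_by_psiblast

# ASSUMPTION: false positives fraction = incorrect / all assignments.
false_positive_fraction = iss_incorrect / assigned
specificity = 1 - false_positive_fraction

# Sequences left without any assignment by ISS(new).
unassigned = missed_by_psiblast - assigned

print(f"recovered: {recovered_fraction:.1%}")
print(f"specificity on assigned targets: {specificity:.1%}")
print(f"left unassigned: {unassigned}")
```

Under that assumption, ISS(new) recovers roughly 7.5% of the sequences that PSI-BLAST misses and leaves 1489 of them unassigned.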

Figures

Figure 1.

An alignment of the target sequence with an intermediate and a putative homolog. The dotted and the dashed lines represent the intermediate aligned to the putative homolog and the target, respectively. The positions of the starting residues in the alignment of the intermediate with the putative homolog and the target are denoted by its and iqs, respectively. The positions of the ending residues are denoted by ite and iqe. The number of residues in the common overlap region of the intermediate is indicated by C.
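The common overlap region described in the Figure 1 legend can be sketched as an interval intersection on the intermediate sequence. This is a minimal illustration, assuming that the start and end positions are inclusive residue indices on the intermediate; the function name `common_overlap` is hypothetical, and the sketch computes only the overlap length C, not the ISS(new) overlap score itself.

```python
def common_overlap(its: int, ite: int, iqs: int, iqe: int) -> int:
    """Number of intermediate residues aligned to both sequences.

    its, ite: start and end of the intermediate's alignment with the
              putative homolog.
    iqs, iqe: start and end of the intermediate's alignment with the target.
    ASSUMPTION: all positions are inclusive indices on the intermediate.
    """
    start = max(its, iqs)   # overlap begins at the later of the two starts
    end = min(ite, iqe)     # overlap ends at the earlier of the two ends
    return max(0, end - start + 1)  # zero when the regions do not intersect


# Homolog alignment covers residues 10-120, target alignment covers 40-150.
C = common_overlap(10, 120, 40, 150)
print(C)
```

When the two aligned regions of the intermediate do not intersect, the function returns 0, which corresponds to an intermediate that cannot link the two sequences through a shared region.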

Figure 2.

Accuracy of ISSnew, ISSold, and PSI-BLAST on SEQS-EASY. The accuracy is described by the ROC curves (Materials and Methods) for PSI-BLAST (dashed line), ISSold (dotted line), and ISSnew (solid line).

Figure 3.

Accuracies of ISSnew, ISSold, and PSI-BLAST at different target-sequence lengths. (See the Figure 2 legend for a description of the different symbols used.) The sequence lengths are less than 100 residues (A), between 100 and 200 residues (B), and greater than 200 residues (C).

Figure 4.

Average alignment accuracy as a function of the thresholds on the overlap length (_x_-axis) and the ISSnew score (M). Error bars indicate the standard error of the mean; they are so small that they are almost invisible. Alignment accuracy is measured by the Cα RMSD between the compared structures (A) and coverage (B).

Figure 5.

Average alignment accuracy as a function of the _E_-value. The _E_-values were calculated using the BLOSUM62 amino acid substitution matrix. The alignments of the pairs in SEQS-HARD are obtained by ISSnew. (See Figure 4 for details.)

Figure 6.

Average alignment accuracy of the top five alignments selected by _E_-value as a function of the thresholds on the overlap length (_x_-axis) and the ISSnew score (M). The _E_-values were calculated using the BLOSUM62 amino acid substitution matrix. (See Figure 4 for details.)

Figure 7.

Average alignment accuracy of the best alignments in the selected set of alignments. Accuracy of the alignments selected by _E_-value using BC0030, BLOSUM62, and OPTIMA residue-type substitution matrices. Alignment accuracy is measured by the Cα RMSD of the structures (A) and coverage (B).

Cited by

Detecting remotely related proteins by their interactions and sequence similarity.
Espadaler J, Aragüés R, Eswar N, Marti-Renom MA, Querol E, Avilés FX, Sali A, Oliva B. Proc Natl Acad Sci U S A. 2005 May 17;102(20):7151-6. doi: 10.1073/pnas.0500831102. Epub 2005 May 9. PMID: 15883372. Free PMC article.
ESG: extended similarity group method for automated protein function prediction.
Chitale M, Hawkins T, Park C, Kihara D. Bioinformatics. 2009 Jul 15;25(14):1739-45. doi: 10.1093/bioinformatics/btp309. Epub 2009 May 12. PMID: 19435743. Free PMC article.
The limits of protein sequence comparison?
Pearson WR, Sierk ML. Curr Opin Struct Biol. 2005 Jun;15(3):254-60. doi: 10.1016/j.sbi.2005.05.005. PMID: 15919194. Free PMC article. Review.
Graph pyramids for protein function prediction.
Sandhan T, Yoo Y, Choi J, Kim S. BMC Med Genomics. 2015;8 Suppl 2(Suppl 2):S12. doi: 10.1186/1755-8794-8-S2-S12. Epub 2015 May 29. PMID: 26044522. Free PMC article.
Profiles of Natural and Designed Protein-Like Sequences Effectively Bridge Protein Sequence Gaps: Implications in Distant Homology Detection.
Kumar G, Srinivasan N, Sandhya S. Methods Mol Biol. 2022;2449:149-167. doi: 10.1007/978-1-0716-2095-3_5. PMID: 35507261.

References

1. Altschul, S.F., Gish, W., Miller, W., Myers, E.W., and Lipman, D.J. 1990. Basic local alignment search tool. J. Mol. Biol. 215: 403–410.
2. Altschul, S.F., Madden, T.L., Schaffer, A.A., Zhang, J., Zhang, Z., Miller, W., and Lipman, D.J. 1997. Gapped BLAST and PSI-BLAST: A new generation of protein database search programs. Nucleic Acids Res. 25: 3389–3402.
3. Apostolico, A. and Giancarlo, R. 1998. Sequence alignment in molecular biology. J. Comput. Biol. 5: 173–196.
4. Barton, G.J. 1994. Scop: Structural classification of proteins. Trends Biochem. Sci. 19: 554–555.
5. Blake, J.D. and Cohen, F.E. 2001. Pairwise sequence alignment below the twilight zone. J. Mol. Biol. 307: 721–735.
