Rapid similarity searches of nucleic acid and protein data banks

Abstract

With the development of large data banks of protein and nucleic acid sequences, the need for efficient methods of searching such banks for sequences similar to a given sequence has become evident. We present an algorithm for the global comparison of sequences based on matching k-tuples of sequence elements for a fixed k. The method results in substantial reduction in the time required to search a data bank when compared with prior techniques of similarity analysis, with minimal loss in sensitivity. The algorithm has also been adapted, in a separate implementation, to produce rigorous sequence alignments. Currently, using the DEC KL-10 system, we can compare all sequences in the entire Protein Data Bank of the National Biomedical Research Foundation with a 350-residue query sequence in less than 3 min and carry out a similar analysis with a 500-base query sequence against all eukaryotic sequences in the Los Alamos Nucleic Acid Data Base in less than 2 min.
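The abstract does not spell out the matching procedure, so the following is only a minimal sketch of the general k-tuple idea, not the authors' implementation. It assumes that shared k-tuples are tallied by diagonal (target position minus query position) and that the diagonal with the most shared k-tuples marks the most promising region of similarity. The function names `ktuple_index`, `diagonal_counts`, and `best_diagonal`, and the tie-breaking rule, are hypothetical choices made for illustration.

```python
from collections import Counter, defaultdict


def ktuple_index(seq, k):
    """Map each k-tuple in seq to the list of positions where it starts."""
    index = defaultdict(list)
    for i in range(len(seq) - k + 1):
        index[seq[i:i + k]].append(i)
    return index


def diagonal_counts(query, target, k):
    """Count shared k-tuples on each diagonal (target pos minus query pos)."""
    index = ktuple_index(query, k)
    counts = Counter()
    for j in range(len(target) - k + 1):
        # Look up every query position holding the same k-tuple.
        for i in index.get(target[j:j + k], ()):
            counts[j - i] += 1
    return counts


def best_diagonal(query, target, k=2):
    """Return (diagonal, count) for the diagonal with the most shared k-tuples.

    Ties are broken in favour of the diagonal closest to zero (an assumption
    made for this sketch). Returns (None, 0) when no k-tuple is shared.
    """
    counts = diagonal_counts(query, target, k)
    if not counts:
        return None, 0
    return max(counts.items(), key=lambda item: (item[1], -abs(item[0])))
```

Because the query is indexed once and each target is scanned in a single pass, the cost per data bank entry grows roughly with the target length plus the number of k-tuple hits, rather than with the product of the two sequence lengths as in full dynamic-programming comparison. A larger k gives fewer chance hits and a faster scan, at the price of missing weaker similarities.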

