Issues in searching molecular sequence databases