G Z - Academia.edu

Papers by G Z

MATRIX SEARCH 1.0: a computer program that scans DNA sequences for transcriptional elements using a database of weight matrices

Bioinformatics/Computer Applications in the Biosciences, 1995

The information matrix database (IMD), a database of weight matrices of transcription factor binding sites, is developed. MATRIX SEARCH, a program which can find potential transcription factor binding sites in DNA sequences using the IMD database, is also developed and accompanies the IMD database. MATRIX SEARCH adopts a user interface very similar to that of the SIGNAL SCAN program. MATRIX SEARCH allows the user to search an input sequence with the IMD automatically, to visualize the matrix representations of sites for particular factors, and to retrieve journal citations. The source code for MATRIX SEARCH is in the C language, and the program is available for Unix platforms.
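As a rough illustration of what scanning a sequence with a weight matrix involves, the sketch below slides a matrix along a DNA string and reports windows whose summed weights reach a threshold. It is not the MATRIX SEARCH implementation (which the abstract says is written in C): the function name `scan_with_weight_matrix`, the toy matrix values, and the threshold are all hypothetical, and treating unknown bases as weight 0 is an assumption made here for simplicity.

```python
def scan_with_weight_matrix(sequence, matrix, threshold):
    """Return (start, score) for each window scoring at or above threshold.

    matrix is a list with one dict per motif position, mapping base -> weight.
    Bases missing from a column (e.g. "N") contribute 0.0 (an assumption).
    """
    width = len(matrix)
    hits = []
    for start in range(len(sequence) - width + 1):
        window = sequence[start:start + width]
        # Sum the weight of the observed base at each motif position.
        score = sum(column.get(base, 0.0) for column, base in zip(matrix, window))
        if score >= threshold:
            hits.append((start, score))
    return hits


# Hypothetical 3-position matrix that favors the word "TGA".
toy_matrix = [
    {"A": -1.0, "C": -1.0, "G": -1.0, "T": 2.0},
    {"A": -1.0, "C": -1.0, "G": 2.0, "T": -1.0},
    {"A": 2.0, "C": -1.0, "G": -1.0, "T": -1.0},
]
hits = scan_with_weight_matrix("CCTGACC", toy_matrix, threshold=5.0)
print(hits)
```

With this toy input, only the window starting at index 2 ("TGA") reaches the threshold. A database-driven search would presumably repeat the same scan once per stored matrix.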

Identifying constraints on the higher-order structure of RNA: continued development and application of comparative sequence analysis methods

Nucleic Acids Research, 1992

Comparative sequence analysis addresses the problem of RNA folding and RNA structural diversity, and is responsible for determining the folding of many RNA molecules, including 5S, 16S, and 23S rRNAs, tRNA, RNase P RNA, and Group I and II introns. Initially this method was utilized to fold these sequences into their secondary structures. More recently, this method has revealed numerous tertiary correlations, elucidating novel RNA structural motifs, several of which have been experimentally tested and verified, substantiating the general application of this approach. As successful as the comparative methods have been in elucidating higher-order structure, it is clear that additional structure constraints remain to be found. Deciphering such constraints requires more sensitive and rigorous protocols, in addition to RNA sequence datasets that contain additional phylogenetic diversity and an overall increase in the number of sequences. Various RNA databases, including the tRNA and rRNA sequence datasets, continue to grow in number as well as diversity. Described herein is the development of more rigorous comparative analysis protocols. Our initial development and applications on different RNA datasets have been very encouraging. Such analyses on tRNA, 16S and 23S rRNA are substantiating previously proposed associations and are now beginning to reveal additional constraints on these molecules. A subset of these involves several positions that correlate simultaneously with one another, implying that units larger than a base pair can be under a phylogenetic constraint.

Identifying DNA and protein patterns with statistically significant alignments of multiple sequences

Bioinformatics/Computer Applications in the Biosciences, 1999

Motivation: Molecular biologists frequently can obtain interesting insight by aligning a set of related DNA, RNA or protein sequences. Such alignments can be used to determine either evolutionary or functional relationships. Our interest is in identifying functional relationships. Unless the sequences are very similar, it is necessary to have a specific strategy for measuring, or scoring, the relatedness of the aligned sequences. If the alignment is not known, one can be determined by finding an alignment that optimizes the scoring scheme.

Results: We describe four components to our approach for determining alignments of multiple sequences. First, we review a log-likelihood scoring scheme we call information content. Second, we describe two methods for estimating the P value of an individual information content score: (i) a method that combines a technique from large-deviation statistics with numerical calculations; (ii) a method that is exclusively numerical. Third, we describe how we count the number of possible alignments given the overall amount of sequence data. This count is multiplied by the P value to determine the expected frequency of an information content score and, thus, the statistical significance of the corresponding alignment. Statistical significance can be used to compare alignments having differing widths and containing differing numbers of sequences. Fourth, we describe a greedy algorithm for determining alignments of functionally related sequences. Finally, we test the accuracy of our P value calculations, and give an example of using our algorithm to identify binding sites for the Escherichia coli CRP protein.
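The scoring and significance steps in this abstract can be sketched in a few lines, under stated assumptions. The abstract does not give formulas, so the code below uses a common relative-entropy form of information content (per column, the sum over letters of f ln(f/p), with f the observed frequency and p the background frequency) as a stand-in. The alignment-counting rule (exactly one site per sequence) is a simplification chosen here, not necessarily the paper's counting method, and the P value is taken as a given input rather than estimated. All function names are hypothetical.

```python
import math
from collections import Counter


def information_content(alignment, background):
    """Sum over columns of sum_b f_b * ln(f_b / p_b), in nats.

    alignment: equal-length strings (the aligned sites).
    background: dict mapping each letter to its a priori probability.
    """
    n = len(alignment)
    total = 0.0
    for column in zip(*alignment):
        for base, count in Counter(column).items():
            freq = count / n
            total += freq * math.log(freq / background[base])
    return total


def count_alignments_one_site_each(lengths, width):
    """Possible alignments if every sequence holds exactly one site
    (a simplifying assumption made for this sketch)."""
    count = 1
    for length in lengths:
        count *= max(length - width + 1, 0)
    return count


def expected_frequency(num_alignments, p_value):
    """Expected number of alignments scoring at least this well by chance."""
    return num_alignments * p_value


uniform = {base: 0.25 for base in "ACGT"}
sites = ["TGTGA", "TGTGA", "TGTGA", "TGTGA"]  # toy, fully conserved sites
ic = information_content(sites, uniform)
n_alignments = count_alignments_one_site_each([10, 10, 10, 10], width=5)
e_value = expected_frequency(n_alignments, p_value=1e-6)  # made-up P value
```

For the fully conserved toy sites, each of the five columns contributes ln 4, so `ic` equals 5 ln 4. Multiplying the P value by the number of candidate alignments is what lets scores from alignments of different widths and sizes be compared on one scale.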

Identification of consensus patterns in unaligned DNA sequences known to be functionally related

Bioinformatics/Computer Applications in the Biosciences, 1990

Using log-likelihood statistics to compare sequence alignments, we have been able to determine alignments from multiple, unaligned, functionally related, DNA (Stormo and Hartzell. 1989. Proc. Natl. Acad. Sci. USA 86, 1183-1187; Hertz et al. 1990. Comput. Appl. Biosci. 6, 81-92) and protein sequences. In this paper, we reanalyze DNA sequences that bind the E. coli repressor LexA to demonstrate the ability of our scoring scheme to identify patterns when each sequence can contain zero or more binding sites.

Identification of consensus patterns in unaligned DNA and protein sequences: a large-deviation statistical basis for penalizing gaps

Using log-likelihood statistics to compare sequence alignments, we have been able to determine alignments from multiple, unaligned, functionally related, DNA (Stormo and Hartzell. 1989. Proc. Natl. Acad. Sci. USA 86, 1183-1187; Hertz et al. 1990. Comput. Appl. Biosci. 6, 81-92) and protein sequences. In this paper, we reanalyze DNA sequences that bind the E. coli repressor LexA to
