Walter L. Ruzzo - Academia.edu
Papers by Walter L. Ruzzo
Accurate estimates of the ordering and positioning of DNA markers (probes) on a chromosome are valuable tools used, for example, to help researchers isolate genetic factors in diseases. One such mapping technique, called fluorescent in situ hybridization (FISH), obtains approximate pairwise distance measurements between probes on a chromosome. We have developed two algorithms for computing least squares estimates of the ordering and positions of the probes: a branch and bound algorithm and a local search algorithm motivated by gradient descent. Simulations demonstrate the effectiveness of the branch and bound pruning heuristic and show that the local search algorithm is usually fast and accurate. The branch and bound algorithm is able to solve to optimality problems of 18 probes in about an hour, visiting about 10^6 nodes out of a search space of 10^16 nodes. The local search algorithm usually was able to find the global minimum of problems of 18 probes in about a minute. We also investigate (via simulation) the accuracy with which maps can be constructed from FISH data.
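As a rough illustration of the least squares criterion mentioned in this abstract, the sketch below scores a candidate probe map against pairwise distance measurements. The function name `squared_error`, the dictionary layout of `measured`, and the unweighted sum are assumptions made for illustration; the abstract does not specify the paper's exact objective, and neither the branch and bound nor the local search algorithm is reproduced here.

```python
def squared_error(positions, measured):
    """Sum of squared gaps between map distances and measured distances.

    positions: list of candidate probe coordinates (hypothetical layout).
    measured:  dict mapping a probe index pair (i, j) to its measured distance.
    """
    return sum(
        (abs(positions[i] - positions[j]) - d) ** 2
        for (i, j), d in measured.items()
    )


# Hypothetical measurements for three probes.
measured = {(0, 1): 1.0, (1, 2): 2.0, (0, 2): 3.0}

exact_fit = squared_error([0.0, 1.0, 3.0], measured)  # map agrees with every measurement
worse_fit = squared_error([0.0, 1.0, 2.5], measured)  # third probe placed too close
```

A search procedure of either kind would look for the ordering and positions that make this score as small as possible.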
DNA arrays yield a global view of the cell by enabling the measurement of expression levels of thousands of genes simultaneously. When used to compare normal tissues and tissues at various stages of disease, or diseased tissues with different responses to treatment, arrays present opportunities for improved disease diagnosis and a deeper understanding of the molecular basis of observed phenotypes. Several machine learning methods have been applied to array data to classify genes on the basis of their expression levels in particular samples, and to classify tissue samples on the basis of their global patterns of gene expression. These tasks are made more difficult by the noisy nature of array data, and when classifying tissues, by the overwhelming number of gene attributes relative to the number of training samples. In this paper, we present a naive Bayes method for classifying tissues on the basis of DNA array data, and use a likelihood-based metric to select the most useful subset of genes for inclusion in the classifier. We applied this method to data sets with tissues of two different classes, and found its accuracy to exceed that of a recently described method in two of the three cases. Furthermore, our method is easily extendible to multiclass classification, and performed well when applied to a data set with three different classes of tissues.
A new algorithm for recognizing and parsing arbitrary context-free languages is presented, and several new results are given on the computational complexity of these problems. The new algorithm is of both practical and theoretical interest. It is conceptually simple and allows a variety of ...
This paper serves two purposes. Firstly, it is an elementary introduction to the theory of P-completeness --- the branch of complexity theory that focuses on identifying the problems in the class P that are "hardest," in the sense that they appear to lack highly parallel solutions. That is, they do not have parallel solutions using time polynomial in the logarithm of the problem size
Restriction mapping is the process of determining the approximate positions of restriction sites along a target DNA molecule. Multiple complete digest (MCD) mapping is a protocol for constructing such maps in support of DNA sequencing. In the MCD mapping protocol, a clone library is generated which covers the region of interest with high redundancy. Each clone in the library is then completely digested multiple times, each time with a different restriction enzyme, and the lengths of the resulting fragments are measured via gel electrophoresis. This thesis describes an algorithmic method for constructing restriction maps from MCD data. The process is broken into two major steps. The first, clone ordering, is responsible for determining the order of the clone endpoints while simultaneously identifying bad data. The second, fragment identi ...
Journal of Medical Primatology
Background: The genome annotations of rhesus (Macaca mulatta) and cynomolgus (Macaca fascicularis) macaques, two of the most common non-human primate animal models, are limited. Methods: We analyzed large-scale macaque RNA-based next-generation sequencing (RNAseq) data to identify un-annotated macaque transcripts. Results: For both macaque species, we uncovered thousands of novel isoforms for annotated genes and thousands of un-annotated intergenic transcripts enriched with non-coding RNAs. We also identified thousands of transcript sequences which are partially or completely ‘missing’ from current macaque genome assemblies. We showed that many newly identified transcripts were differentially expressed during SIV infection of rhesus macaques or during Ebola virus infection of cynomolgus macaques. Conclusions: For two important macaque species, we uncovered thousands of novel isoforms and un-annotated intergenic transcripts including coding and non-coding RNAs, polyadenylated and non-polyade...
[1988] Proceedings. Structure in Complexity Theory Third Annual Conference
A recent proof that nondeterministic space-bounded complexity classes are closed under complementation is used to develop two further applications of the inductive counting technique. An errorless probabilistic algorithm is given for the undirected graph s-t connectivity problem that runs in O(log n) space and polynomial expected time, and it is shown that the class LOGCFL is closed under complementation. The
SIGOPS Oper. Syst. Rev., 1975
One of the key aspects of modern computing systems is the ability to allow many users to share the same facilities. These facilities may be memory, processors, data bases or software, such as compilers or subroutines. When diverse users share common items, one is naturally ...
Proceedings of the eighth annual ACM symposium on Theory of computing - STOC '76, 1976
A new on-line context free language recognition algorithm is presented which is derived from Earley's algorithm and has several advantages over the original. First, the new algorithm not only is conceptually simpler than Earley's, but also allows significant speed ...
Proceedings of the thirteenth annual ACM symposium on Theory of computing - STOC '81, 1981
Information is not transferred instantaneously; there is always a propagation delay before an output is available as an input to the next computational step. Propagation delay is a function of wire length, so we study the length of edges in planar graphs. We prove matching (to within a constant factor) upper and lower bounds on minimax edge length for four
VLSI Systems and Computations, 1981
6S RNA is an abundant noncoding RNA in Escherichia coli that binds to σ70 RNA polymerase holoenzyme to globally regulate gene expression in response to the shift from exponential growth to stationary phase. We have computationally identified >100 new 6S RNA homologs in diverse eubacterial lineages. Two abundant Bacillus subtilis RNAs of unknown function (BsrA and BsrB) and cyanobacterial 6Sa RNAs are now recognized as 6S homologs. Structural probing of E. coli 6S RNA and a B. subtilis homolog supports a common secondary structure derived from comparative sequence analysis. The conserved features of 6S RNA suggest that it binds RNA polymerase by mimicking the structure of DNA template in an open promoter complex. Interestingly, the two B. subtilis 6S RNAs are discoordinately expressed during growth, and many proteobacterial 6S RNAs could be cotranscribed with downstream homologs of the E. coli ygfA gene encoding a putative methenyltetrahydrofolate synthetase. The prevalence and robust expression of 6S RNAs emphasize their critical role in bacterial adaptation.
A novel family of riboswitches, called SAM-IV, is the fourth distinct set of mRNA elements to be reported that regulate gene expression via direct sensing of S-adenosylmethionine (SAM or AdoMet). SAM-IV riboswitches share conserved nucleotide positions with the previously described SAM-I riboswitches, despite rearranged structures and nucleotide positions with family-specific nucleotide identities. Sequence analysis and molecular recognition experiments suggest that SAM-I and SAM-IV riboswitches share similar ligand binding sites, but have different scaffolds. Our findings support the view that RNA has considerable structural versatility and reveal that riboswitches exploit this potential to expand the scope of RNA in genetic regulation.
J. Bioinform. Comput. Biol., 2009
Non-coding RNAs (ncRNAs) are transcripts that do not code for proteins. Recent findings have shown that RNA-mediated regulatory mechanisms influence a substantial portion of typical microbial genomes. We present an efficient method for finding potential ncRNAs in bacteria by clustering genomic sequences based on homology inferred from both primary sequence and secondary structure. We evaluate our approach using a set of predominantly Firmicutes sequences. Our results showed that, though primary sequence-based homology search was inaccurate for diverged ncRNA sequences, through our clustering method, we were able to infer motifs that recovered nearly all members of most known ncRNA families. Hence, our method shows promise for discovering new families of ncRNA.
Proceedings of the ACM/SIGDA international symposium on Field programmable gate arrays - FPGA '13, 2013
Over the last decade, the number of known biologically important non-coding RNAs (ncRNAs) has increased by orders of magnitude. The function performed by a specific ncRNA is partially determined by its structure, defined by which nucleotides of the molecule form pairs. These correlations may span large and variable distances in the linear RNA molecule. Because of these characteristics, algorithms that search for ncRNAs belonging to known families are computationally expensive, often taking many CPU weeks to run. To improve the speed of this search, multiple search algorithms arranged into a series of progressively more stringent filters can be used. In this paper, we present an FPGA-based implementation of some of these algorithms. This is the first FPGA-based approach to attempt to accelerate multiple filters used in ncRNA search. The FPGA is reconfigured for each filter, resulting in a total system speedup of 25x when compared with a single CPU.
Background: Transcription factor overexpression is common in biological experiments and transcription factor amplification is associated with many cancers, yet few studies have directly compared the DNA-binding profiles of endogenous versus overexpressed transcription factors. Methods: We analyzed MyoD ChIP-seq data from C2C12 mouse myotubes, primary mouse myotubes, and mouse fibroblasts differentiated into muscle cells by overexpression of MyoD and compared the genome-wide binding profiles and binding site characteristics of endogenous and overexpressed MyoD.
RNA bioinformatics and computational RNA biology have emerged from implementing methods for predicting the secondary structure of single sequences. The field has evolved to exploit multiple sequences to take evolutionary information into account, such as compensating (and structure preserving) base changes. These methods have been developed further and applied for computational screens of genomic sequence. Furthermore, a number of additional directions have emerged. These include methods to search for RNA 3D structure, RNA-RNA interactions, and design of interfering RNAs (RNAi) as well as methods for interactions between RNA and proteins. Here, we introduce the basic concepts of predicting RNA secondary structure relevant to the further analyses of RNA sequences. We also provide pointers to methods addressing various aspects of RNA bioinformatics and computational RNA biology.