Mark Robinson | Benaroya Research Institute
Papers by Mark Robinson
Lecture Notes in Computer Science, 2011
Identifying transcription factor binding sites computationally is a hard problem as it produces many false predictions. Combining the predictions from existing predictors can improve the overall predictions by using classification methods like Support Vector Machines (SVM). But conventional negative examples (that is, examples of non-binding sites) in this type of problem are highly unreliable. In this study, we have used different types of negative examples. One class of the negative examples has been taken from far away from the promoter regions, where the occurrence of binding sites is very low, and another one has been produced by randomization. Thus we observed the effect of using different negative examples in predicting transcription factor binding sites in mouse. We have also devised a novel cross-validation technique for this type of biological problem.
Lecture Notes in Computer Science, 2007
Currently the best algorithms for transcription factor binding site predictions are severely limited in accuracy. However, a non-linear combination of these algorithms could improve the quality of predictions. A support-vector machine was applied to combine the predictions of 12 key real-valued algorithms. The data was divided into a training set and a test set, of which two were constructed: filtered and unfiltered. In addition, a different "window" of consecutive results was used in the input vector in order to contextualize the neighbouring results. Finally, classification results were improved with the aid of under- and over-sampling techniques. Our major finding is that we can reduce the False-Positive rate significantly. We also found that the bigger the window, the higher the F-score, but the more likely it is to make a false positive prediction, with the best trade-off being a window size of about 7.
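The "window" idea in the abstract above can be illustrated with a minimal sketch. It assumes each sequence position carries one score per prediction algorithm, and that positions beyond the sequence ends are zero-padded; the function name `window_features`, the padding choice and the toy scores are illustrative assumptions, not details taken from the paper.

```python
from typing import List, Sequence


def window_features(scores: Sequence[Sequence[float]], window: int = 7) -> List[List[float]]:
    """Concatenate each position's scores with those of its neighbours.

    scores[i] holds the per-algorithm predictions for position i.
    The window is centred on i, so it must be a positive odd number.
    """
    if window < 1 or window % 2 == 0:
        raise ValueError("window must be a positive odd integer")
    half = window // 2
    n_alg = len(scores[0]) if scores else 0
    pad = [0.0] * n_alg  # assumption: zero-pad beyond the sequence ends
    vectors = []
    for i in range(len(scores)):
        row: List[float] = []
        for j in range(i - half, i + half + 1):
            row.extend(scores[j] if 0 <= j < len(scores) else pad)
        vectors.append(row)
    return vectors


# Toy data: 4 positions, 2 hypothetical algorithms.
toy = [[0.1, 0.9], [0.2, 0.8], [0.3, 0.7], [0.4, 0.6]]
vectors = window_features(toy, window=3)
```

Each resulting vector has `window * number_of_algorithms` entries, so a window of 7 over 12 algorithms would give an 84-dimensional input per position.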
Journal of Bioinformatics and Computational Biology, 2006
One of the main goals of analysing DNA sequences is to understand the temporal and positional information that specifies gene expression. An important step in this process is the recognition of gene expression regulatory elements. Experimental procedures for this are slow and costly. In this paper we present a computational non-supervised algorithm that facilitates the process by statistically identifying the most likely regions within a putative regulatory sequence. A probabilistic technique is presented, based on the approximation of regulatory DNA with a Markov chain, for the location of putative transcription factor binding sites in a single stretch of DNA. To this end, we developed a procedure to approximate the order of the Markov model for a given DNA sequence that circumvents some of the prohibitive assumptions underlying Markov modeling. Application of the algorithm to data from 55 genes in five species shows the high sensitivity of this Markov search algorithm. Our algorithm does no...
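As a generic illustration of modelling a DNA sequence with a Markov chain (not the algorithm of the paper above, whose details are not given in the abstract), the sketch below fits a first-order chain to a single sequence and scores every fixed-width window by its log-probability under that chain. The fixed order, the pseudocount, the window width and the toy sequence are all assumptions made for the example.

```python
import math
from typing import Dict, List

Transitions = Dict[str, Dict[str, float]]


def fit_first_order(seq: str, alphabet: str = "ACGT", pseudocount: float = 1.0) -> Transitions:
    """Estimate first-order transition probabilities P(next | current)."""
    counts = {a: {b: pseudocount for b in alphabet} for a in alphabet}
    for a, b in zip(seq, seq[1:]):
        counts[a][b] += 1
    return {
        a: {b: counts[a][b] / sum(counts[a].values()) for b in alphabet}
        for a in alphabet
    }


def window_log_prob(seq: str, trans: Transitions, start: int, width: int) -> float:
    """Log-probability of the transitions inside seq[start:start + width]."""
    window = seq[start:start + width]
    return sum(math.log(trans[a][b]) for a, b in zip(window, window[1:]))


seq = "ACGTACGTTTGACA"  # toy sequence, not real data
width = 4               # hypothetical window width
trans = fit_first_order(seq)
scores: List[float] = [
    window_log_prob(seq, trans, i, width) for i in range(len(seq) - width + 1)
]
```

Windows whose score departs strongly from what the background chain typically produces are the kind of candidate region a statistical search could flag; how such scores are turned into predictions is left open here.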
Lecture Notes in Computer Science
Computational prediction of cis-regulatory binding sites is widely acknowledged as a difficult task. There are many different algorithms for searching for binding sites in current use. However, most of them produce a high rate of false positive predictions. Moreover, many algorithmic approaches are inherently constrained with respect to the range of binding sites that they can be expected to reliably predict. We propose to use SVMs to predict binding sites from multiple sources of evidence. We combine random selection under-sampling and the synthetic minority over-sampling technique to deal with the imbalanced nature of the data. In addition, we remove some of the final predicted binding sites on the basis of their biological plausibility. The results show that we can generate a new prediction that significantly improves on the performance of any one of the individual prediction algorithms.
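The combination of random under-sampling and synthetic minority over-sampling mentioned above can be sketched in plain Python. This is a simplified illustration under stated assumptions: the function names, the sampling sizes, the neighbour count `k` and the toy data are hypothetical, and the over-sampling step is a bare-bones version of the SMOTE idea (interpolating between a minority example and one of its nearest minority neighbours), not the authors' implementation.

```python
import random
from typing import List

Vector = List[float]


def random_undersample(majority: List[Vector], keep: int, rng: random.Random) -> List[Vector]:
    """Keep a random subset of the majority class, without replacement."""
    return rng.sample(majority, keep)


def smote(minority: List[Vector], n_new: int, k: int, rng: random.Random) -> List[Vector]:
    """Create n_new synthetic minority examples by interpolation.

    Requires at least two minority examples.
    """
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        others = [m for m in minority if m is not base]
        # Sort the remaining minority examples by squared distance to base.
        others.sort(key=lambda m: sum((a - b) ** 2 for a, b in zip(base, m)))
        neighbour = rng.choice(others[:k])
        gap = rng.random()  # position along the segment from base to neighbour
        synthetic.append([a + gap * (b - a) for a, b in zip(base, neighbour)])
    return synthetic


rng = random.Random(0)
minority = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]   # toy binding sites
majority = [[5.0 + 0.1 * i, 5.0] for i in range(20)]           # toy non-binding sites
kept = random_undersample(majority, keep=8, rng=rng)
new_points = smote(minority, n_new=4, k=2, rng=rng)
balanced_minority = minority + new_points
```

After both steps the toy classes each hold 8 examples; in practice the sampling ratios are tuning choices.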
Lecture Notes in Computer Science, 2005
Currently the best algorithms for transcription factor binding site prediction are severely limited in accuracy. There is good reason to believe that predictions from these different classes of algorithms could be used in conjunction to improve the quality of predictions. In this paper, we apply single layer networks, rule sets and support vector machines to predictions from 12 key algorithms. Furthermore, we use a 'window' of consecutive results in the input vector in order to contextualise the neighbouring results. Moreover, we improve the classification result with the aid of under- and over-sampling techniques. We find that support vector machines outperform each of the original individual algorithms and the other classifiers employed in this work with both types of input, in that they maintain a better tradeoff between recall and precision.
Lecture Notes in Computer Science, 2007
The identification of cis-regulatory binding sites in DNA is a difficult problem in computational biology. To obtain a full understanding of the complex machinery embodied in genetic regulatory networks it is necessary to know both the identity of the regulatory transcription factors together with the location of their binding sites in the genome. We show that using an SVM together with data sampling, to integrate the results of individual algorithms specialised for the prediction of binding site locations, can produce significant improvements upon the original algorithms. These results make more tractable the expensive experimental procedure of actually verifying the predictions.
2010 Ninth International Conference on Machine Learning and Applications, 2010
Finding the location of binding sites in DNA is a difficult problem. Although the locations of some binding sites have been experimentally identified, other parts of the genome may or may not contain binding sites. This poses problems with negative data in a trainable classifier. Here we show that using randomized negative data gives a large boost in classifier performance when compared to the original labeled data.
2006 5th IEEE International Conference on Cognitive Informatics, 2006
Currently the best algorithms for transcription factor binding site prediction are severely limited in accuracy. In previous work we applied classification techniques on predictions from 12 key prediction algorithms. In this paper, we investigate the classification results when 4 feature selection filtering methods are used. They are Bi-Normal Separation, correlation coefficients, F-Score and a cross entropy based algorithm. It is found that all 4 filtering methods perform equally well. Moreover, we show that the worst performing algorithms are not detrimental to the overall performance.
Sixth International Conference on Machine Learning and Applications (ICMLA 2007), 2007
The identification of cis-regulatory binding sites in DNA in multicellular eukaryotes is a particularly difficult problem in computational biology. To obtain a full understanding of the complex machinery embodied in genetic regulatory networks it is necessary to know both the identity of the regulatory transcription factors together with the location of their binding sites in the genome. We show that using an SVM together with data sampling, to integrate the results of individual algorithms specialised for the prediction of binding site locations, can produce significant improvements upon the original algorithms applied to the mouse genome. These results make more tractable the expensive experimental procedure of actually verifying the predictions.
Neural Networks, 2008
The identification of cis-regulatory binding sites in DNA is a difficult problem in computational biology. To obtain a full understanding of the complex machinery embodied in genetic regulatory networks it is necessary to know both the identity of the regulatory transcription factors together with the location of their binding sites in the genome. We show that using an SVM together with data sampling to classify the combination of the results of individual algorithms specialised for the prediction of binding site locations can produce significant improvements upon the original algorithms. The resulting classifier produces fewer false positive predictions and so reduces the cost of the expensive experimental procedure of verifying the predictions.
Cold Spring Harbor Perspectives in Biology, 2010
Myxobacteria are renowned for the ability to sporulate within fruiting bodies whose shapes are species-specific. The capacity to build those multicellular structures arises from the ability of M. xanthus to organize high cell-density swarms, in which the cells tend to be aligned with each other while constantly in motion. The intrinsic polarity of rod-shaped cells lays the foundation, and each cell uses two polar engines for gliding on surfaces. It sprouts retractile type IV pili from the leading cell pole and secretes capsular polysaccharide through nozzles from the trailing pole. Regularly periodic reversal of the gliding direction was found to be required for swarming. Those reversals are generated by a G-protein switch which is driven by a sharply tuned oscillator. Starvation induces fruiting body development, and systematic reductions in the reversal frequency are necessary for the cells to aggregate rather than continue to swarm. Developmental gene expression is regulated by a network that is connected to the suppression of reversals.
The identification of cis-regulatory binding sites in DNA is a difficult problem in computational biology. To obtain a full understanding of the complex machinery embodied in genetic regulatory networks it is necessary to know both the identity of the regulatory transcription factors together with the location of their binding sites in the genome. We show that using an SVM together with data sampling, to integrate the results of individual algorithms specialised for the prediction of binding site locations, can produce significant improvements upon the original algorithms. These results make more tractable the expensive experimental procedure of actually verifying the predictions.
Neural Computing and Applications, 2008
Currently the best algorithms for predicting transcription factor binding sites in DNA sequences are severely limited in accuracy. There is good reason to believe that predictions from different classes of algorithms could be used in conjunction to improve the quality of predictions. In this paper, we apply Single Layer Networks, Rule Sets, Support Vector Machines and the Adaboost algorithm to predictions from 12 key real-valued algorithms. Furthermore, we use a 'window' of consecutive results as the input vector in order to contextualise the neighbouring results. We improve the classification result with the aid of under- and over-sampling techniques. We find that Support Vector Machines and the Adaboost algorithm outperform the original individual algorithms and the other classifiers employed in this work. In particular they give a better tradeoff between Recall and Precision.
BMC Systems Biology, 2007
His research interests are centered around ageing. This includes the study of the molecular pathways that extend lifespan. His research uses both computational and wetlab techniques for modelling gene regulatory networks and predicting transcription factor binding sites.
homepages.feis.herts.ac.uk
The identification of cis-regulatory binding sites in DNA is a difficult problem in computational biology. To obtain a full understanding of the complex machinery embodied in genetic regulatory networks it is necessary to know both the identity of the regulatory transcription factors together with the location of their binding sites in the genome. We show that using an SVM together with data sampling, to integrate the results of individual algorithms specialised for the prediction of binding site locations, can produce significant improvements upon the original algorithms. These results make more tractable the expensive experimental procedure of actually verifying the predictions.
In: Procs of the 14th European Symposium on Artificial Neural Networks, ESANN 2006, 2006
Currently the best algorithms for transcription factor binding site prediction are severely limited in accuracy. In previous work we combined random selection under-sampling with the SMOTE over-sampling technique, working with several classification algorithms from the machine learning field to integrate binding site predictions. In this paper, we improve the classification result with the aid of Tomek links, used as either an under-sampling or a cleaning technique.
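A Tomek link is a pair of examples from opposite classes that are each other's nearest neighbour. The sketch below finds such pairs by brute force and shows the two uses named in the abstract above: as under-sampling (drop only the majority-class member of each link) or as cleaning (drop both members). The function names and the one-dimensional toy data are assumptions made for illustration.

```python
from typing import List, Optional, Tuple

Vector = List[float]


def dist2(a: Vector, b: Vector) -> float:
    """Squared Euclidean distance."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest(i: int, points: List[Vector]) -> int:
    """Index of the nearest neighbour of points[i], excluding i itself."""
    return min((j for j in range(len(points)) if j != i),
               key=lambda j: dist2(points[i], points[j]))


def tomek_links(points: List[Vector], labels: List[int]) -> List[Tuple[int, int]]:
    """Pairs (i, j), i < j, of mutual nearest neighbours with different labels."""
    links = []
    for i in range(len(points)):
        j = nearest(i, points)
        if i < j and labels[i] != labels[j] and nearest(j, points) == i:
            links.append((i, j))
    return links


def indices_to_drop(links: List[Tuple[int, int]], labels: List[int],
                    majority_label: Optional[int] = None) -> List[int]:
    """Cleaning drops both ends of each link; under-sampling only the majority end."""
    drop = set()
    for i, j in links:
        for idx in (i, j):
            if majority_label is None or labels[idx] == majority_label:
                drop.add(idx)
    return sorted(drop)


points = [[0.0], [1.0], [1.2], [5.0], [5.1]]   # toy data
labels = [0, 0, 1, 1, 1]                        # class 1 is the majority here
links = tomek_links(points, labels)
cleaned = indices_to_drop(links, labels)                       # cleaning
undersampled = indices_to_drop(links, labels, majority_label=1)  # under-sampling
```

On this toy data the only link is between the points at 1.0 and 1.2, which sit on the class boundary.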
The discovery of a living coelacanth specimen in 1938 was remarkable, as this lineage of lobe-finned fish was thought to have become extinct 70 million years ago. The modern coelacanth looks remarkably similar to many of its ancient relatives, and its evolutionary proximity to our own fish ancestors provides a glimpse of the fish that first walked on land. Here we report the genome sequence of the African coelacanth, Latimeria chalumnae. Through a phylogenomic analysis, we conclude that the lungfish, and not the coelacanth, is the closest living relative of tetrapods. Coelacanth protein-coding genes are significantly more slowly evolving than those of tetrapods, unlike other genomic features. Analyses of changes in genes and regulatory elements during the vertebrate adaptation to land highlight genes involved in immunity, nitrogen excretion and the development of fins, tail, ear, eye, brain and olfaction. Functional assays of enhancers involved in the fin-to-limb transition and in the emergence of extra-embryonic tissues show the importance of the coelacanth genome as a blueprint for understanding tetrapod evolution.
The identification of cis-regulatory binding sites in DNA in multicellular eukaryotes is a particularly difficult problem in computational biology. To obtain a full understanding of the complex machinery embodied in genetic regulatory networks it is necessary to know both the identity of the regulatory transcription factors together with the location of their binding sites in the genome.
The identification of cis-regulatory binding sites in DNA is a difficult problem in computational biology. To obtain a full understanding of the complex machinery embodied in genetic regulatory networks it is necessary to know both the identity of the regulatory transcription factors together with the location of their binding sites in the genome.