Paul Gardner | University of Canterbury/Te Whare Wānanga o Waitaha

Papers by Paul Gardner

A meta-analysis of bioinformatics software benchmarks reveals that publication-bias unduly influences software accuracy

Predicting RNA Structure Using Mutual Information

Applied Bioinformatics, Feb 1, 2005

With the ever-increasing number of sequenced RNAs and the establishment of new RNA databases, such as the Comparative RNA Web Site and Rfam, there is a growing need for accurately and automatically predicting RNA structures from multiple alignments. Since RNA secondary structure is often conserved in evolution, the well known, but underused, mutual information measure for identifying covarying sites in an alignment can be useful for identifying structural elements. This article presents MIfold, a MATLAB toolbox that employs mutual information, or a related covariation measure, to display and predict conserved RNA secondary structure (including pseudoknots) from an alignment. Results: We show that MIfold can be used to predict simple pseudoknots, and that the performance can be adjusted to make it either more sensitive or more selective. We also demonstrate that the overall performance of MIfold improves with the number of aligned sequences for certain types of RNA sequences. In addition, we show that, for these sequences, MIfold is more sensitive but less selective than the related RNAalifold structure prediction program and is comparable with the COVE structure prediction package. Conclusion: MIfold provides a useful supplementary tool to programs such as RNA Structure Logo, RNAalifold and COVE and should be useful for automatically generating structural predictions for databases such as Rfam. Availability: MIfold is freely available from
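As a rough illustration of the mutual information measure mentioned in the abstract above, the following Python sketch scores how strongly two alignment columns covary. It is not MIfold's code (MIfold is a MATLAB toolbox); the function name `mutual_information`, the gap-handling rule, and the toy alignment are all assumptions made for this example.

```python
from collections import Counter
from math import log2


def mutual_information(alignment, i, j, gap="-"):
    """Mutual information (in bits) between alignment columns i and j.

    `alignment` is a list of equal-length strings. Sequences with a gap in
    either column are skipped; that rule is an assumption of this sketch.
    """
    pairs = [(s[i], s[j]) for s in alignment if s[i] != gap and s[j] != gap]
    n = len(pairs)
    if n == 0:
        return 0.0
    joint = Counter(pairs)                 # counts of (x, y) pairs
    left = Counter(x for x, _ in pairs)    # counts of x in column i
    right = Counter(y for _, y in pairs)   # counts of y in column j
    mi = 0.0
    for (x, y), count in joint.items():
        p_xy = count / n
        p_x = left[x] / n
        p_y = right[y] / n
        mi += p_xy * log2(p_xy / (p_x * p_y))
    return mi


# Hypothetical toy alignment: columns 0 and 4 covary as Watson-Crick pairs,
# while columns 1 and 2 are fully conserved.
alignment = [
    "GACUC",
    "CACUG",
    "AACUU",
    "UACUA",
]

covarying = mutual_information(alignment, 0, 4)
conserved = mutual_information(alignment, 1, 2)
print(f"columns 0,4: {covarying:.2f} bits")
print(f"columns 1,2: {conserved:.2f} bits")
```

In this toy case the covarying pair reaches the four-letter maximum of 2 bits, while the fully conserved pair scores 0 bits: conserved columns carry no covariation signal, which is why mutual information picks out compensatory changes rather than plain sequence conservation.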

Simulating the RNA-world and computational ribonomics : a thesis presented for the degree of Doctor of Philosophy in Biomathematics at Massey University, Palmerston North, New Zealand

Building non-coding RNA families

Emerging high-throughput technologies have led to a deluge of putative non-coding RNA (ncRNA) sequences identified in a wide variety of organisms. Systematic characterization of these transcripts will be a tremendous challenge. Homology detection is critical to making maximal use of functional information gathered about ncRNAs: identifying homologous sequence allows us to transfer information gathered in one organism to another quickly and with a high degree of confidence. ncRNA presents a challenge for homology detection, as the primary sequence is often poorly conserved and de novo secondary structure prediction and search remains difficult. This protocol introduces methods developed by the Rfam database for identifying "families" of homologous ncRNAs starting from single "seed" sequences using manually curated sequence alignments to build powerful statistical models of sequence and structure conservation known as covariance models (CMs), implemented in the Infernal software package. We provide a step-by-step iterative protocol for identifying ncRNA homologs, then constructing an alignment and corresponding CM. We also work through an example for the bacterial small RNA MicA, discovering a previously unreported family of divergent MicA homologs in genus Xenorhabdus in the process.

Crowdsourcing RNA structural alignments with an online computer game

Pacific Symposium on Biocomputing, 2015

The annotation and classification of ncRNAs is essential to decipher molecular mechanisms of gene regulation in normal and disease states. A database such as Rfam maintains alignments, consensus secondary structures, and corresponding annotations for RNA families. Its primary purpose is the automated, accurate annotation of non-coding RNAs in genomic sequences. However, the alignment of RNAs is computationally challenging, and the data stored in this database are often subject to improvements. Here, we design and evaluate Ribo, a human-computing game that aims to improve the accuracy of RNA alignments already stored in Rfam. We demonstrate the potential of our techniques and discuss the feasibility of large scale collaborative annotation and classification of RNA families.

An evaluation of the accuracy and speed of metagenome analysis tools

Metagenome studies are becoming increasingly widespread, yielding important insights into microbial communities covering diverse environments from terrestrial and aquatic ecosystems to human skin and gut. With the advent of high-throughput sequencing platforms, the use of large scale shotgun sequencing approaches is now commonplace. However, a thorough independent benchmark comparing state-of-the-art metagenome analysis tools is lacking. Here, we present a benchmark where the most widely used tools are tested on complex, realistic data sets. Our results clearly show that the most widely used tools are not necessarily the most accurate, that the most accurate tool is not necessarily the most time consuming, and that there is a high degree of variability between available tools. These findings are important as the conclusions of any metagenomics study are affected by errors in the predicted community composition and functional capacity. Data sets and results are freely available from http://www.ucbioinformatics.org/metabenchmark.html.

Conservation and Losses of Non-Coding RNAs in Avian Genomes

PLOS ONE, 2015

Here we present the results of a large-scale bioinformatics annotation of non-coding RNA loci in 48 avian genomes. Our approach uses probabilistic models of hand-curated families from the Rfam database to infer conserved RNA families within each avian genome. We supplement these annotations with predictions from the tRNA annotation tool, tRNAscan-SE and microRNAs from miRBase. We identify 34 lncRNA-associated loci that are conserved between birds and mammals and validate 12 of these in chicken. We report several intriguing cases where a reported mammalian lncRNA, but not its function, is conserved. We also demonstrate extensive conservation of classical ncRNAs (e.g., tRNAs) and more recently discovered ncRNAs (e.g., snoRNAs and miRNAs) in birds. Furthermore, we describe numerous "losses" of several RNA families, and attribute these to either genuine loss, divergence or missing data. In particular, we show that many of these losses are due to the challenges associated with assembling avian microchromosomes. These combined results illustrate the utility of applying homology-based methods for annotating novel vertebrate genomes.

RNA folding argues against a hot-start origin of life

Journal of Molecular Evolution, 2000

Opinion is strongly divided on whether life arose on earth under hot or cold conditions, the hot-start and cold-start scenarios, respectively. The origin of life close to deep thermal vents appears as the majority opinion among biologists, but there is considerable biochemical evidence that high temperatures are incompatible with an RNA world. To be functional, RNA has to fold into a three-dimensional structure. We report both theoretical and experimental results on RNA folding and show that (as expected) hot conditions strongly reduce RNA folding. The theoretical results come from energy-minimization calculations of the average extent of folding of RNA, mainly from 0-90 degrees C, for both random sequences and tRNA sequences. The experimental results are from circular-dichroism measurements of tRNA over a similar range of temperatures. The quantitative agreement between calculations and experiment is remarkable, even to the shape of the curves indicating the cooperative nature of R...

Robust Identification of Noncoding RNA from Transcriptomes Requires Phylogenetically-Informed Sampling

PLoS Computational Biology, 2014

Noncoding RNAs are integral to a wide range of biological processes, including translation, gene regulation, host-pathogen interactions and environmental sensing. While genomics is now a mature field, our capacity to identify noncoding RNA elements in bacterial and archaeal genomes is hampered by the difficulty of de novo identification. The emergence of new technologies for characterizing transcriptome outputs, notably RNA-seq, is improving noncoding RNA identification and expression quantification. However, a major challenge is to robustly distinguish functional outputs from transcriptional noise. To establish whether annotation of existing transcriptome data has effectively captured all functional outputs, we analysed over 400 publicly available RNA-seq datasets spanning 37 different Archaea and Bacteria. Using comparative tools, we identify close to a thousand highly-expressed candidate noncoding RNAs. However, our analyses reveal that capacity to identify noncoding RNA outputs is strongly dependent on phylogenetic sampling. Surprisingly, and in stark contrast to protein-coding genes, the phylogenetic window for effective use of comparative methods is perversely narrow: aggregating public datasets only produced one phylogenetic cluster where these tools could be used to robustly separate unannotated noncoding RNAs from a null hypothesis of transcriptional noise. Our results show that for the full potential of transcriptomics data to be realized, a change in experimental design is paramount: effective transcriptomics requires phylogeny-aware sampling.

A comprehensive comparison of comparative RNA structure prediction approaches

BMC Bioinformatics, 2004

Background: An increasing number of researchers have released novel RNA structure analysis and prediction algorithms for comparative approaches to structure prediction. Yet, independent benchmarking of these algorithms is rarely performed as is now common practice for protein-folding, gene-finding and multiple-sequence-alignment algorithms.

A comparison of RNA folding measures

BMC Bioinformatics, 2005

In the last few decades there has been a great deal of discussion concerning whether or not noncoding RNA sequences (ncRNAs) fold in a more well-defined manner than random sequences. In this paper, we investigate several existing measures for how well an RNA sequence folds, and compare the behaviour of these measures over a large range of Rfam ncRNA families. Such measures can be useful in, for example, identifying novel ncRNAs, and indicating the presence of alternate RNA foldings.

Mutation of miRNA target sequences during human evolution

An introduction to RNA databases

We present an introduction to RNA databases. The history and technology behind RNA databases are briefly discussed. We examine differing methods of data collection and curation, and discuss their impact on both the scope and accuracy of the resulting databases. Finally, we demonstrate these principles through detailed examination of four leading RNA databases: Noncode, miRBase, Rfam, and SILVA.

SnoPatrol: How many snoRNA genes are there?

A hidden Markov model approach for determining expression from genomic tiling micro arrays

BMC Bioinformatics, 2006

Background: Genomic tiling micro arrays have great potential for identifying previously undiscovered coding as well as non-coding transcription. To date, however, analyses of these data have been performed in an ad hoc fashion.

Rfam 12.0: updates to the RNA families database

Nucleic Acids Research, Jan 28, 2015

The Rfam database (available at http://rfam.xfam.org) is a collection of non-coding RNA families represented by manually curated sequence alignments, consensus secondary structures and annotation gathered from corresponding Wikipedia, taxonomy and ontology resources. In this article, we detail updates and improvements to the Rfam data and website for the Rfam 12.0 release. We describe the upgrade of our search pipeline to use Infernal 1.1 and demonstrate its improved homology detection ability by comparison with the previous version. The new pipeline is easier for users to apply to their own data sets, and we illustrate its ability to annotate RNAs in genomic and metagenomic data sets of various sizes. Rfam has been expanded to include 260 new families, including the well-studied large subunit ribosomal RNA family, and for the first time includes information on short sequence- and structure-based RNA motifs present within families.

The RNA WikiProject: Community annotation of RNA families

RNA, 2008

The online encyclopedia Wikipedia has become one of the most important online references in the world and has a substantial and growing scientific content. A search of Google with many RNA-related keywords identifies a Wikipedia article as the top hit. We believe that the RNA community has an important and timely opportunity to maximize the content and quality of RNA information in Wikipedia. To this end, we have formed the RNA WikiProject (http://en.wikipedia.org/wiki/Wikipedia:WikiProject_RNA) as part of the larger Molecular and Cellular Biology WikiProject. We have created over 600 new Wikipedia articles describing families of noncoding RNAs based on the Rfam database, and invite the community to update, edit, and correct these articles. The Rfam database now redistributes this Wikipedia content as the primary textual annotation of its RNA families. Users can, therefore, for the first time, directly edit the content of one of the major RNA databases. We believe that this Wikipedia/Rfam link acts as a functioning model for incorporating community annotation into molecular biology databases.

RNASTAR: An RNA STructural Alignment Repository that provides insight into the evolution of natural and artificial RNAs

RNA, 2012

Automated RNA alignment algorithms often fail to recapture the essential conserved sites that are critical for function. To assist in the refinement of these algorithms, we manually curated a set of 148 alignments with a total of 9600 unique sequences, in which each alignment was backed by at least one crystal or NMR structure. These alignments included both naturally and artificially selected molecules. We used principles of isostericity to improve the alignments from an average of 83% to 94% isosteric base pairs. We expect that this alignment collection will assist in a wide range of benchmarking efforts and provide new insight into evolutionary principles governing change in RNA structural motifs. The improved alignments have been contributed to the Rfam database.

Optimal alphabets for an RNA world

Proceedings of the Royal Society B: Biological Sciences, 2003

Experiments have shown that the canonical AUCG genetic alphabet is not the only possible nucleotide alphabet. In this work we address the question 'is the canonical alphabet optimal?' We make the assumption that the genetic alphabet was determined in the RNA world. Computational tools are used to infer the RNA secondary structure (shape) from a given RNA sequence, and statistics from RNA shapes are gathered with respect to alphabet size. Then, simulations based upon the replication and selection of fixed-sized RNA populations are used to investigate the effect of alternative alphabets upon RNA's ability to step through a fitness landscape. These results show that for a low copy fidelity the canonical alphabet is fitter than two-, six- and eight-letter alphabets. In higher copy-fidelity experiments, six-letter alphabets outperform the four-letter alphabets, suggesting that the canonical alphabet is indeed a relic of the RNA world.

Long- and Short-Term Selective Forces on Malaria Parasite Genomes

PLoS Genetics, 2010

Plasmodium parasites, the causal agents of malaria, result in more than 1 million deaths annually. Plasmodium are unicellular eukaryotes with small ~23 Mb genomes encoding ~5200 protein-coding genes. The protein-coding genes comprise about half of these genomes. Although evolutionary processes have a significant impact on malaria control, the selective pressures within Plasmodium genomes are poorly understood, particularly in the non-protein-coding portion of the genome. We use evolutionary methods to describe selective processes in both the coding and non-coding regions of these genomes. Based on genome alignments of seven Plasmodium species, we show that protein-coding, intergenic and intronic regions are all subject to purifying selection and we identify 670 conserved non-genic elements. We then use genome-wide polymorphism data from P. falciparum to describe short-term selective processes in this species and identify some candidate genes for balancing (diversifying) selection. Our analyses suggest that there are many functional elements in the non-genic regions of these genomes and that adaptive evolution has occurred more frequently in the protein-coding regions of the genome.

Research paper thumbnail of A meta-analysis of bioinformatics software benchmarks reveals that publication-bias unduly influences software accuracy

Research paper thumbnail of Predicting RNA Structure Using Mutual Information

Applied Bioinformatics, Feb 1, 2005

With the ever-increasing number of sequenced RNAs and the establishment of new RNA Abstract datab... more With the ever-increasing number of sequenced RNAs and the establishment of new RNA Abstract databases, such as the Comparative RNA Web Site and Rfam, there is a growing need for accurately and automatically predicting RNA structures from multiple alignments. Since RNA secondary structure is often conserved in evolution, the well known, but underused, mutual information measure for identifying covarying sites in an alignment can be useful for identifying structural elements. This article presents MIfold, a MATLAB  toolbox that employs mutual information, or a related covariation measure, to display and predict conserved RNA secondary structure (including pseudoknots) from an alignment. Results: We show that MIfold can be used to predict simple pseudoknots, and that the performance can be adjusted to make it either more sensitive or more selective. We also demonstrate that the overall performance of MIfold improves with the number of aligned sequences for certain types of RNA sequences. In addition, we show that, for these sequences, MIfold is more sensitive but less selective than the related RNAalifold structure prediction program and is comparable with the COVE structure prediction package. Conclusion: MIfold provides a useful supplementary tool to programs such as RNA Structure Logo, RNAalifold and COVE and should be useful for automatically generating structural predictions for databases such as Rfam. Availability: MIfold is freely available from

Research paper thumbnail of Simulating the RNA-world and computational ribonomics : a thesis presented for the degree of Doctor of Philosophy in Biomathematics at Massey University, Palmerston North, New Zealand

Research paper thumbnail of Building non-coding RNA families

Emerging high-throughput technologies have led to a deluge of putative non-coding RNA (ncRNA) seq... more Emerging high-throughput technologies have led to a deluge of putative non-coding RNA (ncRNA) sequences identified in a wide variety of organisms. Systematic characterization of these transcripts will be a tremendous challenge. Homology detection is critical to making maximal use of functional information gathered about ncRNAs: identifying homologous sequence allows us to transfer information gathered in one organism to another quickly and with a high degree of confidence. ncRNA presents a challenge for homology detection, as the primary sequence is often poorly conserved and de novo secondary structure prediction and search remains difficult. This protocol introduces methods developed by the Rfam database for identifying "families" of homologous ncRNAs starting from single "seed" sequences using manually curated sequence alignments to build powerful statistical models of sequence and structure conservation known as covariance models (CMs), implemented in the Infernal software package. We provide a step-by-step iterative protocol for identifying ncRNA homologs, then constructing an alignment and corresponding CM. We also work through an example for the bacterial small RNA MicA, discovering a previously unreported family of divergent MicA homologs in genus Xenorhabdus in the process.

Research paper thumbnail of Crowdsourcing RNA structural alignments with an online computer game

Pacific Symposium on Biocomputing Pacific Symposium on Biocomputing, 2015

The annotation and classification of ncRNAs is essential to decipher molecular mechanisms of gene... more The annotation and classification of ncRNAs is essential to decipher molecular mechanisms of gene regulation in normal and disease states. A database such as Rfam maintains alignments, consensus secondary structures, and corresponding annotations for RNA families. Its primary purpose is the automated, accurate annotation of non-coding RNAs in genomic sequences. However, the alignment of RNAs is computationally challenging, and the data stored in this database are often subject to improvements. Here, we design and evaluate Ribo, a human-computing game that aims to improve the accuracy of RNA alignments already stored in Rfam. We demonstrate the potential of our techniques and discuss the feasibility of large scale collaborative annotation and classification of RNA families.

Research paper thumbnail of An evaluation of the accuracy and speed of metagenome analysis tools

Metagenome studies are becoming increasingly widespread, yielding important insights into microbi... more Metagenome studies are becoming increasingly widespread, yielding important insights into microbial communities covering diverse environments from terrestrial and aquatic ecosystems to human skin and gut. With the advent of high-throughput sequencing platforms, the use of large scale shotgun sequencing approaches is now commonplace. However, a thorough independent benchmark comparing state-of-the-art metagenome analysis tools is lacking. Here, we present a benchmark where the most widely used tools are tested on complex, realistic data sets. Our results clearly show that the most widely used tools are not necessarily the most accurate, that the most accurate tool is not necessarily the most time consuming, and that there is a high degree of variability between available tools. These findings are important as the conclusions of any metagenomics study are affected by errors in the predicted community composition and functional capacity. Data sets and results are freely available from http://www.ucbioinformatics.org/metabenchmark.html.

Research paper thumbnail of Conservation and Losses of Non-Coding RNAs in Avian Genomes

PLOS ONE, 2015

Here we present the results of a large-scale bioinformatics annotation of non-coding RNA loci in ... more Here we present the results of a large-scale bioinformatics annotation of non-coding RNA loci in 48 avian genomes. Our approach uses probabilistic models of hand-curated families from the Rfam database to infer conserved RNA families within each avian genome. We supplement these annotations with predictions from the tRNA annotation tool, tRNAscan-SE and microRNAs from miRBase. We identify 34 lncRNA-associated loci that are conserved between birds and mammals and validate 12 of these in chicken. We report several intriguing cases where a reported mammalian lncRNA, but not its function, is conserved. We also demonstrate extensive conservation of classical ncRNAs (e.g., tRNAs) and more recently discovered ncRNAs (e.g., snoRNAs and miRNAs) in birds. Furthermore, we describe numerous "losses" of several RNA families, and attribute these to either genuine loss, divergence or missing data. In particular, we show that many of these losses are due to the challenges associated with assembling avian microchromosomes. These combined results illustrate the utility of applying homology-based methods for annotating novel vertebrate genomes.

Research paper thumbnail of RNA folding argues against a hot-start origin of life

Journal of molecular evolution, 2000

Opinion is strongly divided on whether life arose on earth under hot or cold conditions, the hot-... more Opinion is strongly divided on whether life arose on earth under hot or cold conditions, the hot-start and cold-start scenarios, respectively. The origin of life close to deep thermal vents appears as the majority opinion among biologists, but there is considerable biochemical evidence that high temperatures are incompatible with an RNA world. To be functional, RNA has to fold into a three-dimensional structure. We report both theoretical and experimental results on RNA folding and show that (as expected) hot conditions strongly reduce RNA folding. The theoretical results come from energy-minimization calculations of the average extent of folding of RNA, mainly from 0-90 degrees C, for both random sequences and tRNA sequences. The experimental results are from circular-dichroism measurements of tRNA over a similar range of temperatures. The quantitative agreement between calculations and experiment is remarkable, even to the shape of the curves indicating the cooperative nature of R...

Research paper thumbnail of Robust Identification of Noncoding RNA from Transcriptomes Requires Phylogenetically-Informed Sampling

PLoS Computational Biology, 2014

Noncoding RNAs are integral to a wide range of biological processes, including translation, gene ... more Noncoding RNAs are integral to a wide range of biological processes, including translation, gene regulation, host-pathogen interactions and environmental sensing. While genomics is now a mature field, our capacity to identify noncoding RNA elements in bacterial and archaeal genomes is hampered by the difficulty of de novo identification. The emergence of new technologies for characterizing transcriptome outputs, notably RNA-seq, are improving noncoding RNA identification and expression quantification. However, a major challenge is to robustly distinguish functional outputs from transcriptional noise. To establish whether annotation of existing transcriptome data has effectively captured all functional outputs, we analysed over 400 publicly available RNA-seq datasets spanning 37 different Archaea and Bacteria. Using comparative tools, we identify close to a thousand highly-expressed candidate noncoding RNAs. However, our analyses reveal that capacity to identify noncoding RNA outputs is strongly dependent on phylogenetic sampling. Surprisingly, and in stark contrast to protein-coding genes, the phylogenetic window for effective use of comparative methods is perversely narrow: aggregating public datasets only produced one phylogenetic cluster where these tools could be used to robustly separate unannotated noncoding RNAs from a null hypothesis of transcriptional noise. Our results show that for the full potential of transcriptomics data to be realized, a change in experimental design is paramount: effective transcriptomics requires phylogeny-aware sampling.

Research paper thumbnail of A comprehensive comparison of comparative RNA structure prediction approaches

BMC Bioinformatics, 2004

Background: An increasing number of researchers have released novel RNA structure analysis and pr... more Background: An increasing number of researchers have released novel RNA structure analysis and prediction algorithms for comparative approaches to structure prediction. Yet, independent benchmarking of these algorithms is rarely performed as is now common practice for proteinfolding, gene-finding and multiple-sequence-alignment algorithms.

Research paper thumbnail of A comparison of RNA folding measures

BMC Bioinformatics, 2005

In the last few decades there has been a great deal of discussion concerning whether or not nonco... more In the last few decades there has been a great deal of discussion concerning whether or not noncoding RNA sequences (ncRNAs) fold in a more well-defined manner than random sequences. In this paper, we investigate several existing measures for how well an RNA sequence folds, and compare the behaviour of these measures over a large range of Rfam ncRNA families. Such measures can be useful in, for example, identifying novel ncRNAs, and indicating the presence of alternate RNA foldings.

Research paper thumbnail of Mutation of miRNA target sequences during human evolution

Research paper thumbnail of An introduction to RNA databases

We present an introduction to RNA databases. The history and technology behind RNA databases is b... more We present an introduction to RNA databases. The history and technology behind RNA databases is briefly discussed. We examine differing methods of data collection and curation, and discuss their impact on both the scope and accuracy of the resulting databases. Finally, we demonstrate these principals through detailed examination of four leading RNA databases: Noncode, miRBase, Rfam, and SILVA.

Research paper thumbnail of SnoPatrol: How many snoRNA genes are there?

Research paper thumbnail of A hidden Markov model approach for determining expression from genomic tiling micro arrays

BMC BIOINFORMATICS, 2006

Background: Genomic tiling micro arrays have great potential for identifying previously undiscove... more Background: Genomic tiling micro arrays have great potential for identifying previously undiscovered coding as well as non-coding transcription. To-date, however, analyses of these data have been performed in an ad hoc fashion.

Research paper thumbnail of Rfam 12.0: updates to the RNA families database

Nucleic acids research, Jan 28, 2015

The Rfam database (available at http://rfam.xfam.org) is a collection of non-coding RNA families ... more The Rfam database (available at http://rfam.xfam.org) is a collection of non-coding RNA families represented by manually curated sequence alignments, consensus secondary structures and annotation gathered from corresponding Wikipedia, taxonomy and ontology resources. In this article, we detail updates and improvements to the Rfam data and website for the Rfam 12.0 release. We describe the upgrade of our search pipeline to use Infernal 1.1 and demonstrate its improved homology detection ability by comparison with the previous version. The new pipeline is easier for users to apply to their own data sets, and we illustrate its ability to annotate RNAs in genomic and metagenomic data sets of various sizes. Rfam has been expanded to include 260 new families, including the well-studied large subunit ribosomal RNA family, and for the first time includes information on short sequence- and structure-based RNA motifs present within families.

The RNA WikiProject: Community annotation of RNA families

RNA, 2008

The online encyclopedia Wikipedia has become one of the most important online references in the world and has a substantial and growing scientific content. A search of Google with many RNA-related keywords identifies a Wikipedia article as the top hit. We believe that the RNA community has an important and timely opportunity to maximize the content and quality of RNA information in Wikipedia. To this end, we have formed the RNA WikiProject (http://en.wikipedia.org/wiki/Wikipedia:WikiProject_RNA) as part of the larger Molecular and Cellular Biology WikiProject. We have created over 600 new Wikipedia articles describing families of noncoding RNAs based on the Rfam database, and invite the community to update, edit, and correct these articles. The Rfam database now redistributes this Wikipedia content as the primary textual annotation of its RNA families. Users can, therefore, for the first time, directly edit the content of one of the major RNA databases. We believe that this Wikipedia/Rfam link acts as a functioning model for incorporating community annotation into molecular biology databases.

RNASTAR: An RNA STructural Alignment Repository that provides insight into the evolution of natural and artificial RNAs

RNA, 2012

Automated RNA alignment algorithms often fail to recapture the essential conserved sites that are critical for function. To assist in the refinement of these algorithms, we manually curated a set of 148 alignments with a total of 9600 unique sequences, in which each alignment was backed by at least one crystal or NMR structure. These alignments included both naturally and artificially selected molecules. We used principles of isostericity to improve the alignments from an average of 83% to 94% isosteric base pairs. We expect that this alignment collection will assist in a wide range of benchmarking efforts and provide new insight into evolutionary principles governing change in RNA structural motifs. The improved alignments have been contributed to the Rfam database.

Optimal alphabets for an RNA world

Proceedings of the Royal Society B: Biological Sciences, 2003

Experiments have shown that the canonical AUCG genetic alphabet is not the only possible nucleotide alphabet. In this work we address the question 'is the canonical alphabet optimal?' We make the assumption that the genetic alphabet was determined in the RNA world. Computational tools are used to infer the RNA secondary structure (shape) from a given RNA sequence, and statistics from RNA shapes are gathered with respect to alphabet size. Then, simulations based upon the replication and selection of fixed-sized RNA populations are used to investigate the effect of alternative alphabets upon RNA's ability to step through a fitness landscape. These results show that for a low copy fidelity the canonical alphabet is fitter than two-, six- and eight-letter alphabets. In higher copy-fidelity experiments, six-letter alphabets outperform the four-letter alphabets, suggesting that the canonical alphabet is indeed a relic of the RNA world.

Long- and Short-Term Selective Forces on Malaria Parasite Genomes

PLoS Genetics, 2010

Plasmodium parasites, the causal agents of malaria, result in more than 1 million deaths annually. Plasmodium are unicellular eukaryotes with small ~23 Mb genomes encoding ~5200 protein-coding genes. The protein-coding genes comprise about half of these genomes. Although evolutionary processes have a significant impact on malaria control, the selective pressures within Plasmodium genomes are poorly understood, particularly in the non-protein-coding portion of the genome. We use evolutionary methods to describe selective processes in both the coding and non-coding regions of these genomes. Based on genome alignments of seven Plasmodium species, we show that protein-coding, intergenic and intronic regions are all subject to purifying selection and we identify 670 conserved non-genic elements. We then use genome-wide polymorphism data from P. falciparum to describe short-term selective processes in this species and identify some candidate genes for balancing (diversifying) selection. Our analyses suggest that there are many functional elements in the non-genic regions of these genomes and that adaptive evolution has occurred more frequently in the protein-coding regions of the genome.