TarBase: A comprehensive database of experimentally supported animal microRNA targets

RNA. 2006 Feb; 12(2): 192–197.

PRAVEEN SETHUPATHY

1 Center for Bioinformatics, 2 Department of Genetics, School of Medicine, 3 Genomics and Computational Biology Graduate Group, School of Medicine, and 4 Department of Computer and Information Science, School of Engineering, University of Pennsylvania, Philadelphia, Pennsylvania 19104, USA

BENOIT CORDA

ARTEMIS G. HATZIGEORGIOU

Reprint requests to: Praveen Sethupathy, Center for Bioinformatics, University of Pennsylvania, Philadelphia, Pennsylvania 19104, USA; e-mail: praveens@mail.med.upenn.edu; fax: (215) 573-3111.

Received 2005 Sep 21; Accepted 2005 Nov 3.

Abstract

MicroRNAs (miRNAs) are ~22-nt RNA segments that are involved in the regulation of protein expression primarily by binding to one or more target sites on an mRNA transcript and inhibiting translation. MicroRNAs are likely to factor into multiple developmental pathways and multiple mechanisms of gene regulation, and to underlie an array of inherited disease processes and phenotypic determinants. Several computational programs exist to predict miRNA targets in mammals, fruit flies, worms, and plants. However, to date, there is no systematic collection and description of miRNA targets with experimental support. We describe a database, TarBase, which houses a manually curated collection of experimentally tested miRNA targets in human/mouse, fruit fly, worm, and zebrafish, distinguishing between those that tested positive and those that tested negative. Each positive target site is described by the miRNA that binds it, the gene in which it occurs, the nature of the experiments that were conducted to test it, the sufficiency of the site to induce translational repression and/or cleavage, and the paper from which all these data were extracted. Additionally, the database is functionally linked to several other useful databases such as Gene Ontology (GO) and UCSC Genome Browser. TarBase reveals significantly more experimentally supported targets than even recent reviews claim, thereby providing a comprehensive data set from which to assess features of miRNA targeting that will be useful for the next generation of target prediction programs. TarBase can be accessed at http://www.diana.pcbi.upenn.edu/tarbase.

Keywords: microRNA, microRNA target, database, experimentally supported, computational

INTRODUCTION

MicroRNAs (miRNAs) are ~22-nt RNA segments that are involved in the regulation of protein expression primarily by binding to one or more target sites on an mRNA transcript and inhibiting translation. The first miRNA gene, lin-4, was identified more than a decade ago in Caenorhabditis elegans (Lee et al. 1993). It is now reported that miRNAs constitute anywhere from ~1% (Lim et al. 2003; Bartel 2004) to ~3% (Bentwich et al. 2005; Berezikov et al. 2005) of the known genes in eukaryotes. The Sanger Institute’s miRBase reports 326 human miRNAs. Although computational studies have suggested that thousands of human genes are targets of these 326 miRNAs, the lack of a high-throughput experimental technique has impeded a large-scale confirmation of these targets (Lewis et al. 2005). Notwithstanding this limitation, the last 2 yr have witnessed a significant increase in the number of experimentally supported miRNA targets, primarily due to an increase in the number of laboratories interested in miRNA targeting.

The first target prediction programs for mammals (Lewis et al. 2003; Kiriakidou et al. 2004) and for fruit fly (Enright et al. 2003; Stark et al. 2003; Rajewsky and Socci 2004) were published in late 2003/early 2004. Since then, at least four others for mammals (John et al. 2004; Lewis et al. 2005; Krek et al. 2005; Rusinov et al. 2005) and five others for fruit fly (Rehmsmeier et al. 2004; Burgler and Macdonald 2005; Grun et al. 2005; Robins et al. 2005; Saetrom et al. 2005) have been published. A majority of these 14 programs are available for public use (Table 1). Several of these programs search for potential miRNA targets by scanning the genome for short segments that have features very similar to a set of experimentally supported miRNA targets (Enright et al. 2003; Stark et al. 2003; John et al. 2004; Kiriakidou et al. 2004; Rajewsky and Socci 2004). Examples of these sets are: Rajewsky and Socci: target sites of C. elegans lin-4 and let-7; Enright et al.: target sites of Drosophila lin-4 and let-7; Stark et al.: target sites of C. elegans lin-4, let-7, and Drosophila bantam; and Kiriakidou et al.: target sites of C. elegans lin-4, let-7, and variants of a let-7 target site on C. elegans lin-41 obtained from extensive mutational analyses. A common limitation of these programs is the very small set of data from which they derive rules/steps for genome-wide predictions. Since the first generation of programs in 2003/2004, many more miRNA targets have gained experimental support in human cells, fruit flies, worms, and even zebrafish. However, due to the lack of any up-to-date collection of experimentally supported targets, subsequent target prediction programs that relied on training sets still harbored the same limitation mentioned above (Robins et al. 2005; Saetrom et al. 2005). For example: Robins et al.: target sites for C. elegans lin-4, let-7, and Drosophila bantam, and Saetrom et al.: target sites for C. elegans lin-4, let-7, and Drosophila mir-13a, bantam. The target sites in these sets account for a small percentage of experimentally verified miRNA targets in mammals, fruit flies, worms, and zebrafish. A comprehensive collection of experimentally supported targets could improve the performance of the programs mentioned above and even introduce a new class of more sophisticated machine-learning approaches. However, it must be noted that there exists a distinct class of target prediction programs that do not rely on training sets, but rather, rely heavily on conservation (Lewis et al. 2003, 2005; Krek et al. 2005; Xie et al. 2005). The utility of these programs reveals that training sets are optional and merely provide a means for future machine-learning approaches that may supplement existing approaches.

TABLE 1. Current target prediction programs that are available for public use

(a) Organism(s) for which the program is best suited.

Here we present a database, TarBase, that provides a collection of all experimentally tested miRNA targets. TarBase describes at least 45 human/mouse genes, 28 fruit fly genes, 7 worm genes, and 1 zebrafish gene that have gained experimental support as translationally repressed miRNA targets. TarBase also describes ~350 human/mouse genes and 3 worm genes that have gained experimental support as cleaved miRNA targets. The total number of target sites recorded in TarBase for human/mouse, fruit fly, worm, and zebrafish exceeds 550, a much larger data set for the next generation of target prediction programs. Furthermore, TarBase describes each supported target site by the miRNA that binds it, the gene in which it occurs, the location within the 3′ UTR where it occurs, the nature of the experiments that were conducted to validate it, and the sufficiency of the site to induce translational repression and/or cleavage. Such a comprehensive description of each target site will be useful for focused bioinformatic and experimental studies to further understand the features of miRNA targeting, the mechanisms of miRNA-based translational repression and/or cleavage, and the roles of miRNAs in various biological networks. TarBase can be accessed at http://www.diana.pcbi.upenn.edu/tarbase.

RESULTS

A snapshot of the graphical user interface to the database is provided in Figure 1. Users can query the database for either true or false miRNA targets as determined by experimental testing. Furthermore, users can query for target sites according to the organism to which they belong, the gene in which they occur, the miRNA that binds them, the technique used for their experimental support, their single-site sufficiency status, or any combination thereof. The schema of the database is depicted in Figure 2. Users are provided the option of choosing the fields from the schema that they would like to view for each of their queries (Fig. 1).
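The combinable query described above can be illustrated with a minimal Python sketch. It is not the TarBase implementation (the Web site uses Perl DBI and CGI); the `TargetSite` class, the `query` function, the field names, and every record value below are hypothetical placeholders chosen only to show how optional criteria can be combined.

```python
from dataclasses import dataclass
from typing import Iterable, List, Optional


@dataclass(frozen=True)
class TargetSite:
    """One tested target site; field names are assumptions, not the real schema."""
    organism: str
    gene: str
    mirna: str
    technique: str
    site_status: str   # "sts" (single target site) or "mts" (multiple target sites)
    supported: bool    # True = tested positive, False = tested negative


def query(
    records: Iterable[TargetSite],
    *,
    supported: bool = True,
    organism: Optional[str] = None,
    gene: Optional[str] = None,
    mirna: Optional[str] = None,
    technique: Optional[str] = None,
    site_status: Optional[str] = None,
) -> List[TargetSite]:
    """Return records matching every criterion that was actually supplied."""
    criteria = {
        "organism": organism,
        "gene": gene,
        "mirna": mirna,
        "technique": technique,
        "site_status": site_status,
    }
    # Criteria left as None are ignored, which gives "any combination thereof".
    active = {name: value for name, value in criteria.items() if value is not None}
    return [
        r for r in records
        if r.supported == supported
        and all(getattr(r, name) == value for name, value in active.items())
    ]


# Placeholder records: gene and miRNA names are invented for illustration.
RECORDS = [
    TargetSite("worm", "geneA", "miR-x", "reporter assay", "sts", True),
    TargetSite("worm", "geneB", "miR-x", "reporter assay", "mts", True),
    TargetSite("fruit fly", "geneC", "miR-y", "reporter assay", "sts", True),
    TargetSite("worm", "geneD", "miR-x", "reporter assay", "sts", False),
]

worm_sts = query(RECORDS, organism="worm", site_status="sts")
```

Treating an omitted criterion as "no restriction" keeps a single function sufficient for every combination of fields, and the `supported` flag mirrors the choice between the true and false target sets.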

FIGURE 1. A snapshot of the graphical user interface to the database.

FIGURE 2. A snapshot of the results page showing the various functionalities of the database. Target genes are linked to Gene Ontology (GO); miRNA Recognition Elements (MREs) are linked to UCSC Genome Browser via custom tracks; and binding pictures are viewable in a new window.

It is important to note that not all target sites are tested using the same experimental technique. For example, Krek et al. (2005) use a traditional in vitro reporter gene assay to test the ability of a target 3′ UTR to induce translational repression, whereas Zhao et al. (2005) use transgenic mice overexpressing an miRNA of interest to test changes in target protein levels. Also, not all target sites are individually tested. Kiriakidou et al. (2004) use a reporter gene assay to individually test a single target site (STS), whereas Lewis et al. (2003) use a reporter gene assay to test multiple target sites (MTS) simultaneously. The former is useful for constructing the class of target sites that are independently sufficient for translational repression, whereas the latter is useful for constructing the class of target sites that may require the presence of other target sites on the same 3′ UTR to induce translational repression. As mentioned previously, TarBase includes the experimental techniques used for support and the sts/mts status of each target site. Users of TarBase can search and compare target sites supported by different experimental techniques. TarBase does not provide a direct comparison of the techniques or the full details of the advantages and limitations of each, but we encourage any user of TarBase to use and understand the data within the context of the experimental techniques that are employed.

TarBase also includes functional links to other databases such as Gene Ontology (GO) and UCSC Genome Browser. A target gene with experimental support is linked to GO in order to provide a clearer picture of what kinds of biological pathways the targeting miRNA may regulate. The specific target sites within the target gene are linked to UCSC Genome Browser via custom tracks for facile viewing of their genic location and sequence composition. Finally, each target site is also linked to an in-house database of all known miRNA:target site base-pairing diagrams. All of these centralized data are valuable either for inquiries regarding a specific gene/miRNA/target site/experimental technique or for extracting general features of miRNA targeting or miRNA-based mechanisms for translational repression and/or cleavage. We used TarBase to study the locations of target sites within their 3′ UTRs.

Location of experimentally verified target sites within their 3′ UTRs

A recent paper suggests that miRNPs impede 5′ cap recognition, thereby interfering with translation (Pillai et al. 2005). However, details of the mechanism of miRNA-based translational repression are not yet well understood. It has been proposed that miRNA/RISC complexes may prefer to bind at certain locations on the 3′ UTR, perhaps to facilitate miRNP binding at the 5′ cap. We took 60 human/mouse experimentally supported target sites from TarBase and mapped their locations within their 3′ UTRs. We then determined the distance of each target site from the start of the host 3′ UTR and normalized this distance by the length of the host 3′ UTR. It appears that there is a marked tendency for experimentally supported target sites to occur closer to the start of the 3′ UTR than the end of the 3′ UTR. However, we repeated this test for 15 of the human/mouse target sites that were shown to be false positives and we noticed a similar distribution. Therefore, it would seem that the location within the 3′ UTR is not a feature that would increase the specificity (i.e., discriminating between true and false positives) of target prediction programs.

Since both the true and the false set of target sites have a similar distribution of locations within the 3′ UTR, it is possible that target prediction programs are implicitly biased toward locations nearer to the start of the 3′ UTR. As a final point, it must be noted that it is also possible that the current sets of true and false target sites are still too small for such an analysis as described above.
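The normalization used in this analysis (distance of a site from the start of its 3′ UTR divided by the UTR length) can be sketched as follows. This is a minimal illustration, not the code used for the paper; the function names are hypothetical and the coordinates are invented numbers, not TarBase data.

```python
from typing import List, Sequence, Tuple


def normalized_position(site_start: int, utr_length: int) -> float:
    """Distance from the 3' UTR start divided by the 3' UTR length (0 = start)."""
    if utr_length <= 0:
        raise ValueError("utr_length must be positive")
    if not 0 <= site_start < utr_length:
        raise ValueError("site_start must lie inside the 3' UTR")
    return site_start / utr_length


def fraction_in_first_half(positions: Sequence[float]) -> float:
    """Share of sites whose normalized position falls in the first half of the UTR."""
    if not positions:
        raise ValueError("no positions given")
    return sum(p < 0.5 for p in positions) / len(positions)


# Invented (site_start, utr_length) pairs, for illustration only.
EXAMPLE_SITES: List[Tuple[int, int]] = [(150, 600), (40, 1000), (900, 1200)]
example_positions = [normalized_position(s, n) for s, n in EXAMPLE_SITES]
```

Running the same two functions on the supported set and on the false-positive set gives directly comparable summaries, which is the comparison the paragraph above relies on.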

DISCUSSION

We describe a database, TarBase, which houses a manually curated collection of experimentally tested miRNA targets in human/mouse, fruit fly, worm, and zebrafish. Simple analyses of the data have raised interesting considerations about the features used for target prediction and the experimental techniques used for support (see Results, above). We believe that TarBase will be useful not only for biologists interested in miRNA function but also for bioinformaticians interested in using the most comprehensive set of supported targets currently available to train and test a new cohort of machine-learning methods for target prediction. Machine-learning methods are inherently biased toward predicting target sites that bear a significant resemblance to sites in the training set. Small training sets are likely to have a highly homogeneous composition, restricting the types of sites that can be predicted. However, as the number of sites with experimental support increases, we expect the training set to provide a more accurate representation of target site diversity. We must emphasize, however, that machine-learning methods for target prediction are not necessarily superior to approaches that do not employ training sets (Lewis et al. 2003, 2005; Krek et al. 2005; Xie et al. 2005). But as machine-learning methods become more popular with an increasing set of experimentally supported targets, it will be interesting to observe how they compare with and how they can supplement these other approaches.

To ensure that TarBase is up to date, we have provided two semiautomated means of submission. We kindly request that future experimentally supported targets be submitted in a timely manner so that we can maintain TarBase for the growing miRNA research community. Authors are directed to http://www.diana.pcbi.upenn.edu/cgi-bin/AddTarBase.cgi for submission instructions.

MATERIALS AND METHODS

The first step in the construction of the database was to perform an exhaustive literature search for all experimentally tested miRNA targets. We found 16 relevant papers for human/mouse (Lewis et al. 2003, 2005; Esau et al. 2004; Kiriakidou et al. 2004; Poy et al. 2004; Yekta et al. 2004; Cimmino et al. 2005; Davis et al. 2005; Kawasaki and Taira 2005; Krek et al. 2005; Lecellier et al. 2005; Lim et al. 2005; Nakamoto et al. 2005; O’Donnell et al. 2005; Wu and Belasco 2005; Zhao et al. 2005), 8 for fruit fly (Brennecke et al. 2003, 2005; Stark et al. 2003; Xu et al. 2003; Burgler and Macdonald 2005; Lai et al. 2005; Leaman et al. 2005; Robins et al. 2005), 11 for worm (Lee et al. 1993; Wightman et al. 1993; Moss et al. 1997; Reinhart et al. 2000; Slack et al. 2000; Abrahante et al. 2003; Johnston and Hobert 2003; Lin et al. 2003; Chang et al. 2004; Grosshans et al. 2005; Johnson et al. 2005), and 1 for zebrafish (Kloosterman et al. 2004). For each target site from each paper we recorded the miRNA that binds it, the gene in which it occurs, the nature of the experiments that were conducted to support it, the sufficiency of the site to induce translational repression and/or cleavage, the miRNA:target site binding/base-pairing diagram, and the paper from which all these data were extracted. It must be noted that several papers did not provide the binding diagrams. In these cases, if the authors used a target prediction program to predict these target sites prior to experimental testing, we used the same target prediction program to obtain the binding diagrams from the output of the software. If the authors did not provide binding diagrams and also did not use a target prediction program to predict the target sites, then we indicate that this binding diagram is unavailable at the current time.
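The rule for obtaining binding diagrams described above has three cases, which the following Python sketch makes explicit. The function name, its arguments, and the returned labels are hypothetical; curation was performed manually, so this only restates the decision order.

```python
from typing import Optional


def binding_diagram_source(
    paper_diagram: Optional[str],
    prediction_program: Optional[str],
) -> str:
    """Decide where a miRNA:target site binding diagram comes from.

    paper_diagram: the diagram as published, or None if the paper gave none.
    prediction_program: name of the program the authors used to predict the
        site before testing, or None if no program was used.
    """
    if paper_diagram:
        # Case 1: the paper itself provides the diagram.
        return "paper"
    if prediction_program:
        # Case 2: take the diagram from the output of the same program.
        return f"program:{prediction_program}"
    # Case 3: neither source exists, so the diagram is marked unavailable.
    return "unavailable"
```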

The recorded data were then manually uploaded into an SQL database with tables for each organism: human/mouse, fruit fly, worm, and zebrafish. We performed quality control on the data several times to ensure that the data were uploaded accurately. We plan to include plant and virus miRNA targets very soon.
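A layout with one table per organism, holding the fields recorded for each target site, might look like the sketch below. It uses Python's built-in `sqlite3` module purely for illustration; the database engine, table names, and column names of the actual TarBase schema are not given in the text, so all of them are assumptions here.

```python
import sqlite3
from typing import Optional

# Assumed table names, one per organism group named in the text.
ORGANISM_TABLES = ("human_mouse", "fruit_fly", "worm", "zebrafish")

# Assumed columns covering the fields recorded for each target site.
COLUMNS = """
    id INTEGER PRIMARY KEY,
    mirna TEXT NOT NULL,
    gene TEXT NOT NULL,
    technique TEXT,
    site_status TEXT CHECK (site_status IN ('sts', 'mts')),
    supported INTEGER NOT NULL CHECK (supported IN (0, 1)),
    binding_diagram TEXT,
    reference TEXT NOT NULL
"""


def create_schema(conn: sqlite3.Connection) -> None:
    """Create one identically structured table per organism."""
    for table in ORGANISM_TABLES:
        conn.execute(f"CREATE TABLE IF NOT EXISTS {table} ({COLUMNS})")
    conn.commit()


def add_site(
    conn: sqlite3.Connection,
    table: str,
    *,
    mirna: str,
    gene: str,
    technique: str,
    site_status: str,
    supported: bool,
    binding_diagram: Optional[str],
    reference: str,
) -> None:
    """Insert one curated record; table names are whitelisted, values are bound."""
    if table not in ORGANISM_TABLES:
        raise ValueError(f"unknown organism table: {table}")
    conn.execute(
        f"INSERT INTO {table} "
        "(mirna, gene, technique, site_status, supported, binding_diagram, reference) "
        "VALUES (?, ?, ?, ?, ?, ?, ?)",
        (mirna, gene, technique, site_status, int(supported), binding_diagram, reference),
    )
    conn.commit()
```

The `CHECK` constraints serve as a small automated complement to the manual quality control mentioned above: a record with a misspelled `site_status` is rejected at insert time.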

To make the database available for public use we designed a Web site that uses Perl DBI and CGI to communicate with the database and retrieve/display data that are relevant to the user’s query. The Web site is designed so as to allow users to query the database for records involving a particular organism, miRNA(s), gene(s), sts/mts status, experimental technique(s), or any combination thereof (Fig. 1). Upon retrieval of the data relevant to the user’s query, the Web site displays the results in a table format. Target genes are provided as a link to GO, where the user can directly ascertain the molecular and cellular functions of the genes (Fig. 2). The specific target sites are provided as a link to the UCSC genome browser that displays the location of the target sites as a custom track (Fig. 2). The custom track information for each experimentally supported target site within each organism was determined by using Blat (http://genome.ucsc.edu/cgi-bin/hgBlat) to locate the target site sequence within the genome.
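Once a target site has been located in the genome, a UCSC custom track can be written as a `track` line followed by BED records (chromosome, 0-based start, end, name). The sketch below formats such a track in Python; the function name is hypothetical and the coordinates are invented, standing in for positions that would come from a Blat search.

```python
from typing import Iterable, Tuple

# (chromosome, 0-based start, exclusive end, label)
Site = Tuple[str, int, int, str]


def bed_custom_track(track_name: str, description: str, sites: Iterable[Site]) -> str:
    """Format target sites as a UCSC custom track in four-column BED."""
    lines = [f'track name="{track_name}" description="{description}"']
    for chrom, start, end, label in sites:
        if start < 0 or end <= start:
            raise ValueError(f"invalid interval for {label}: {start}-{end}")
        # BED fields are tab-separated.
        lines.append(f"{chrom}\t{start}\t{end}\t{label}")
    return "\n".join(lines) + "\n"


# Invented coordinates for a placeholder site, for illustration only.
example_track = bed_custom_track(
    "example_MRE",
    "Example miRNA recognition element",
    [("chr1", 1000, 1022, "miR-x_site1")],
)
```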

Acknowledgments

We thank M. Megraw for her helpful comments and discussions. We also thank P. Rajasethupathy, P. Nelson, and S.A. Liebhaber for their critical reviews of the manuscript. P.S. is supported by a predoctoral NIH training grant (5T32GM008216). B.C. and A.G.H. are supported by an NSF Career Award Grant (DBI-0238295).
