miRBase: tools for microRNA genomics

Journal Article

Sam Griffiths-Jones, Harpreet Kaur Saini, Stijn van Dongen, Anton J. Enright

1 Faculty of Life Sciences, University of Manchester, Michael Smith Building, Oxford Road, Manchester, M13 9PT and 2 The Wellcome Trust Sanger Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge, CB10 1SA, UK
Received: 14 September 2007

Revision received: 10 October 2007

Accepted: 16 October 2007

Published: 08 November 2007

Sam Griffiths-Jones, Harpreet Kaur Saini, Stijn van Dongen, Anton J. Enright, miRBase: tools for microRNA genomics, Nucleic Acids Research, Volume 36, Issue suppl_1, 1 January 2008, Pages D154–D158, https://doi.org/10.1093/nar/gkm952

Abstract

miRBase is the central online repository for microRNA (miRNA) nomenclature, sequence data, annotation and target prediction. The current release (10.0) contains 5071 miRNA loci from 58 species, expressing 5922 distinct mature miRNA sequences: a growth of over 2000 sequences in the past 2 years. miRBase provides a range of data to facilitate studies of miRNA genomics: all miRNAs are mapped to their genomic coordinates. Clusters of miRNA sequences in the genome are highlighted, and can be defined and retrieved with any inter-miRNA distance. The overlap of miRNA sequences with annotated transcripts, both protein- and non-coding, is described. Finally, graphical views of the locations of a wide range of genomic features in model organisms allow, for the first time, the prediction of the likely boundaries of many miRNA primary transcripts. miRBase is available at http://microrna.sanger.ac.uk/ .

INTRODUCTION

MicroRNAs (miRNAs) are short RNA sequences expressed from longer transcripts encoded in animal, plant and virus genomes, and recently discovered in a single-celled eukaryote ( 1 , 2 ). miRNAs regulate the expression of target genes by binding to complementary sites in their transcripts to cause translational repression or transcript degradation ( 3 ). Translational repression is thought to be the primary mechanism for imperfect target duplexes in animals, with transcript degradation the dominant mechanism for largely perfect matches found throughout plant target transcripts. miRNAs have been implicated in processes and pathways such as development, cell proliferation, apoptosis, metabolism and morphogenesis, and in diseases including cancer ( 4 , 5 ).

miRBase is the primary repository and database resource for miRNA data. The database has three main functions:

The miRNA nomenclature scheme has been presented and discussed previously ( 6 , 8 , 9 ). Novel miRNAs require cloning or expression evidence, and should be submitted only after a manuscript describing their identification is accepted for publication. Assigned names should then be incorporated into the final version of the manuscript prior to publication. Obvious homologues of miRNAs validated in closely related species need not be experimentally verified and may be submitted at any time. Primary features of the nomenclature scheme are:

However, it is important to note that a short name cannot always encode complex information such as orthology and paralogy relationships. In some cases, the short name is a pragmatic choice that is the most consistent of conflicting representations of these sequence relationships. While the names provide a guide of family and function, they should not therefore be relied upon to confer any complex meaning. Instead, dedicated fields in the database provide information about gene and mature miRNA sequence families.

The published miRNA literature is huge. Readers are referred to a number of comprehensive reviews of miRNA structure, biogenesis and function ( 4 , 10–12 ). Here, we focus on specific issues and points of interest with respect to the provision of miRNA data in the miRBase database.

miRBase DATA AND UPDATES

How many miRNA genes?

The number of miRNA hairpin loci in the miRBase database continues to grow rapidly, from 2909 in 36 genomes (June 2005, release 7.0) to 5071 in 58 genomes (August 2007, release 10.0) in the past 2 years. The number of miRNAs in a genome has been the subject of much discussion in the literature. Early estimates of the number of miRNAs in the worm and human genomes were put at 123 and 255, respectively ( 13 , 14 ). However, these estimates were based largely on conservation studies. It is now clear that many miRNAs may be clade- or even organism-specific. A number of recent large-scale studies have lifted the number of miRNA loci known in human to 533 ( Table 1 ) ( 15–17 ), around 60% of which are obviously conserved in mouse (miRBase release 10.0).

Table 1. The number of published hairpin precursor and mature miRNA sequences in selected model organisms

Species | Hairpin precursor loci: total number | Hairpin precursor loci: clustered ≤10 kb from another miRNA | Hairpin precursor loci: overlap annotated transcripts | Mature miR sequences a: distinct forms | Mature miR sequences a: experimentally verified
Homo sapiens | 533 | 190 (36%) | 267 (50%) | 555 | 546 (98%)
Mus musculus | 442 | 199 (45%) | 174 (39%) | 461 | 455 (99%)
Danio rerio | 337 | 151 (34%) | 41 (12%) | 193 | 183 (95%)
Caenorhabditis elegans | 135 | 34 (25%) | 23 (17%) | 135 | 135 (100%)
Drosophila melanogaster | 93 | 34 (36%) | 36 (39%) | 88 | 85 (97%)
Arabidopsis thaliana | 184 | 19 (10%) | 16 (9%) | 199 | 199 (100%)
Populus trichocarpa | 215 | 42 (20%) | 9 (4%) | 215 | 55 (26%)

a miR* sequences are excluded from the mature miRNA count.


miR and miR * sequences

The 5071 miRNA hairpin loci in the database express 4922 dominant mature miRNA (miR) products ( Table 1 ). In many cases, deep sequencing technologies have detected large numbers of miR* sequences: biogenesis byproducts that are often detected at very low levels and are likely non-functional. Starting in miRBase release 10.0, mature miR and miR* sequences are better distinguished in the database, and are distributed in separate release files. In other cases, mature miRNAs from both the 5′ and 3′ arms of the hairpin precursor are frequently identified, suggesting either that both may be functional or that there are insufficient data to determine the predominant product. Such miRNAs are given names of the form hsa-miR-140-5p and hsa-miR-140-3p, and both are retained in the miR set. Often, subsequent improved data allow one product to be chosen and annotated as the dominant miR. Recent data updates have occasionally caused the annotation of a miR and miR* pair to be reversed.
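As a minimal sketch of how names of this form might be handled programmatically, the following Python function splits a miRNA name into its parts. The regular expression, the `MirName` type and the function name are illustrative assumptions rather than anything provided by miRBase, and the pattern covers only the simple name shapes mentioned in this article (for example hsa-miR-140-5p, a starred form such as hsa-miR-140*, or the hairpin mmu-mir-135b).

```python
import re
from typing import NamedTuple, Optional


class MirName(NamedTuple):
    species: str         # three-letter species prefix, e.g. "hsa"
    number: str          # number plus optional letter suffix, e.g. "140" or "135b"
    arm: Optional[str]   # "5p", "3p", or None when no arm is named
    star: bool           # True when the name ends in "*" (miR* form)


# Hypothetical pattern: deliberately narrow, it does not attempt to cover
# every name in the database (e.g. let-7 family or numbered paralogue suffixes).
_NAME = re.compile(r"^([a-z]{3})-(?:miR|mir)-(\d+[a-z]?)(?:-(5p|3p))?(\*)?$")


def parse_mir_name(name: str) -> MirName:
    """Split a miRNA name into species, number, arm and star flag."""
    match = _NAME.match(name)
    if match is None:
        raise ValueError(f"unrecognised miRNA name: {name!r}")
    species, number, arm, star = match.groups()
    return MirName(species, number, arm, star == "*")
```

Grouping parsed names by `(species, number)` would bring the 5p and 3p products of one hairpin together, which is one way to track a miR and miR* pair whose annotation changes between releases.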

Variable ends

Increasingly deep and comprehensive cloning and sequencing studies identify many mature miRNAs with variable 3′ (and, to a lesser extent, 5′) ends [see, for example, ( 17 )]. Each mature miRNA in the database currently represents the most dominantly expressed sequence. As more data become available, the ends of mature miRNAs in the database will be adjusted to reflect the most up-to-date consensus information. We also aim to provide specific data on the distribution of ends in future releases. All changes in name and sequence between releases are specifically described in the diff file on the FTP site, along with all data from previous releases.

Experimental support

Usually the only available experimental data support the mature miRNAs; hairpin precursors are very rarely experimentally validated. Rather, the precursors are the result of computational prediction of hairpin structures that include the mature miRNA. When a number of loci include the same mature miRNA, we cannot usually say with confidence which loci are actually expressed. In addition, the extents of the hairpins depicted in the database are somewhat arbitrary: the approximate extent of the predicted hairpin structure is shown. Formally, this includes the true precursor (the product of DROSHA cleavage) and a small amount of flanking sequence. Future developments will include the provision to retrieve the precursor with user-defined lengths of flanking sequence. About 3685 of 5922 mature miRNA products in the database are validated experimentally in the originating organism; the remainder are obvious homologues of validated miRNAs from a related species ( Table 1 ). The ‘evidence’ field describes the origin of each sequence in the database.
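Retrieving a precursor with a user-defined amount of flanking sequence amounts to a coordinate slice on the chromosome sequence. The sketch below shows one way this could be done; it is not the miRBase implementation, and the function name, the 1-based inclusive coordinate convention and the in-memory chromosome string are all assumptions made for illustration.

```python
# Complement table for DNA; other characters (e.g. N) pass through unchanged.
_DNA_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def precursor_with_flank(chrom_seq: str, start: int, end: int,
                         strand: str, flank: int = 0) -> str:
    """Return the hairpin sequence plus `flank` bases on each side.

    start/end are 1-based inclusive coordinates on the forward strand
    (an assumption of this sketch). For a minus-strand hairpin the
    reverse complement is returned.
    """
    if strand not in ("+", "-"):
        raise ValueError("strand must be '+' or '-'")
    if flank < 0 or start < 1 or end < start:
        raise ValueError("invalid coordinates or flank")
    lo = max(1, start - flank)               # clip at the chromosome start
    hi = min(len(chrom_seq), end + flank)    # clip at the chromosome end
    seq = chrom_seq[lo - 1:hi]
    if strand == "-":
        seq = seq.translate(_DNA_COMPLEMENT)[::-1]
    return seq
```

Clipping at the chromosome ends means the returned sequence can be shorter than requested for hairpins close to a boundary; a caller that needs fixed-length flanks would have to check for that case.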

miRBase::Targets

The miRBase::Targets database uses the miRanda algorithm ( 7 ) to predict targets in untranslated regions (UTRs) of 37 animal genomes from Ensembl ( 18 ). The quality of the predictions has recently benefited from significantly improved 3′UTR information, based on DITAG and 5′CAGE data, available from Ensembl. The number of human and mouse transcripts without an experimentally supported 3′UTR (for which we search a region 2 kb downstream) has therefore dropped significantly in the latest release (v5). A number of validated miR/target pairs have been shown to have mismatches in the so-called ‘seed’ region ( 19 ). The miRBase/miRanda pipeline is therefore not constrained by the requirement for exact ‘seed’ matches. Recent papers have also highlighted the importance of secondary features for miRNA/target recognition, such as sequence accessibility, AU bias and UTR position ( 20 , 21 ). We intend to incorporate these features into the miRBase::Targets prediction pipeline over the coming 12 months. In addition, links are provided to other target prediction sites and algorithms, and to the TarBase database of experimentally supported targets ( 22 ).

miRBase GENOMICS

Recently, we have focused on the provision of tools to distribute miRNA genomic information.

Genomic coordinates

Where an assembled genome sequence is available, coordinates of all miRNAs are provided: in summary tables for each organism and miRNA family, on each miRNA entry page, and for bulk download in GFF format. Links are provided from each coordinate to the appropriate genome browsers.
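A bulk GFF download can be read with a few lines of code. The sketch below relies only on the general GFF layout of nine tab-separated columns (sequence name, source, feature type, start, end, score, strand, phase, attributes); the `Locus` type and function name are hypothetical, and the exact content of the attributes column in miRBase files is not assumed, so it is kept as an unparsed string.

```python
from typing import Iterable, Iterator, NamedTuple


class Locus(NamedTuple):
    chrom: str
    start: int        # 1-based inclusive, as in GFF
    end: int
    strand: str       # "+" or "-"
    attributes: str   # raw attributes column, left unparsed


def parse_gff(lines: Iterable[str]) -> Iterator[Locus]:
    """Yield one Locus per feature line, skipping comments and blank lines."""
    for line in lines:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue
        fields = line.split("\t")
        if len(fields) != 9:
            raise ValueError(f"expected 9 tab-separated columns, got {len(fields)}")
        chrom, _source, _type, start, end, _score, strand, _phase, attrs = fields
        yield Locus(chrom, int(start), int(end), strand, attrs)
```

Because the function accepts any iterable of lines, it can be given an open file handle directly, for example `list(parse_gff(open(path)))` for a locally downloaded file.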

miRNA gene context

40–70% of vertebrate miRNAs appear to be expressed from introns of protein- and non-coding transcripts ( Table 1 ) ( 23 ). In worms and flies, intronic miRNAs are less common (15% and 39%, respectively, in protein-coding genes), and only 5–10% of Arabidopsis miRNAs overlap annotated transcripts. For all animals with Ensembl-annotated genome assemblies, we provide a list of transcripts overlapping each miRNA, with overlap type (intron, exon and UTR), and sense (forward and reverse strands).
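The overlap type and sense of a miRNA relative to a transcript can be derived from interval comparisons. The following is a simplified sketch under stated assumptions: transcripts are described only by their exon intervals, coordinates are 1-based and inclusive, UTR overlap is not distinguished from other exonic overlap (that would need coding-region coordinates), and the function names and return labels are hypothetical.

```python
from typing import List, Optional, Tuple

Interval = Tuple[int, int]  # (start, end), 1-based inclusive


def overlaps(a: Interval, b: Interval) -> bool:
    """True when two closed intervals share at least one base."""
    return a[0] <= b[1] and b[0] <= a[1]


def classify_overlap(mirna: Interval, mirna_strand: str,
                     exons: List[Interval],
                     transcript_strand: str) -> Optional[Tuple[str, str]]:
    """Return (overlap type, sense) or None when there is no overlap.

    Overlap type is "exon" if the miRNA touches any exon, otherwise
    "intron" if it lies within the transcript span. Sense is "sense"
    for the same strand and "antisense" for the opposite strand.
    """
    if not exons:
        raise ValueError("a transcript needs at least one exon")
    span = (min(s for s, _ in exons), max(e for _, e in exons))
    if not overlaps(mirna, span):
        return None
    kind = "exon" if any(overlaps(mirna, ex) for ex in exons) else "intron"
    sense = "sense" if mirna_strand == transcript_strand else "antisense"
    return kind, sense
```

Running this over every miRNA and transcript pair on the same chromosome would give a per-miRNA list of overlapping transcripts of the kind described above.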

Clustered miRNAs

miRNAs are often clustered close together in the genome. This clustering has been suggested as evidence that more than one miRNA may be expressed from the same primary miRNA transcript (pri-miRNA). Furthermore, known ‘polycistronic’ miRNA transcripts have been shown to be long: up to tens of kilobases in mammals. Over 40% of human miRNAs, over 30% of worm and fly miRNAs and only around 10% of Arabidopsis miRNAs are within 10 kb of another miRNA ( Table 1 ). miRBase provides a list of clustered miRNAs on each applicable entry page. In addition, a new search facility allows the user to retrieve clusters of miRNAs in any organism separated by any choice of distance.
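Clustering by a user-chosen inter-miRNA distance can be sketched as a single pass over loci sorted by position. The code below is an illustration, not the miRBase search facility: it assumes that distance is measured from the end of the cluster so far to the start of the next locus, that strand is ignored, and that chaining is allowed (A near B and B near C puts all three together). The function name and tuple layout are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

Mirna = Tuple[str, str, int, int]  # (name, chromosome, start, end)


def cluster_mirnas(mirnas: List[Mirna], max_distance: int = 10_000) -> List[List[str]]:
    """Group miRNAs on the same chromosome lying within max_distance bases.

    Returns only clusters with at least two members.
    """
    by_chrom: Dict[str, List[Tuple[int, int, str]]] = defaultdict(list)
    for name, chrom, start, end in mirnas:
        by_chrom[chrom].append((start, end, name))

    clusters: List[List[str]] = []
    for chrom in sorted(by_chrom):
        current: List[str] = []
        current_end = 0
        for start, end, name in sorted(by_chrom[chrom]):
            if current and start - current_end <= max_distance:
                current.append(name)                 # extend the open cluster
                current_end = max(current_end, end)
            else:
                if len(current) > 1:
                    clusters.append(current)         # close the previous cluster
                current, current_end = [name], end
        if len(current) > 1:
            clusters.append(current)
    return clusters
```

Changing `max_distance` corresponds to the choice of separation offered by the search facility; the 10 kb default simply mirrors the threshold used in Table 1.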

Genomic features

While the mapping of mature and hairpin miRNA sequences to assembled genomes is readily available in miRBase, the extents of only very few primary miRNA transcripts (pri-miRNAs) have been determined and annotated. For intronic miRNAs, the pri-miRNA is assumed to be the protein- (or non-)coding host transcript. Information about the extents of intergenic pri-miRNAs can be inferred from collective analysis of genomic features such as transcription start sites (TSS), CpG islands, EST and cDNA overlap, DITAG and 5′CAGE data, transcription factor binding sites (TFBS) and polyadenylation site predictions (polyA). A detailed analysis of these data suggests that pri-miRNA transcripts vary in length from a few hundred bases up to tens of kilobases ( 24 ).

We have recently developed a tool to visualize the relative positions of these predictions and mappings with respect to annotated miRNA genes and clusters. Careful inspection of these data allows the prediction of the 5′ and 3′ boundaries of a significant number of putative pri-miRNAs. For example, Figure 1 shows TSSs, CpG island, ESTs, cDNAs, DITAG (172B22 and 172B221) and polyA site predictions surrounding mmu-mir-135b on mouse chromosome 1, which support a primary transcript of length around 15 kb with 5′ and 3′ ends ∼7–8 kb upstream and downstream of the miRNA. Links from each miRNA entry page provide a tabulated list of features overlapping flanking regions of the miRNA with their corresponding coordinates and scores, and a graphical view of the features present in the miRNA gene neighbourhood (as in Figure 1 ). These views are currently available for human, mouse, rat, worm and fly miRNAs, and will be extended to other organisms in the future.

For human, mouse and rat genomes, TSSs are predicted using the Eponine-TSS software ( 25 ) at a threshold of 0.990. Drosophila TSS predictions, together with CpG islands, ESTs, cDNAs, repeats and DITAGs for all species, are obtained from Ensembl. TFBSs in the flanking regions of human miRNAs are obtained from the conserved TFBS track of the UCSC genome browser ( 26 ). Other TFBS data are imported from the regulatory features track of Ensembl. PolyA signals are predicted in-house using the DNAFSMiner method ( 27 ) with a cutoff score of 0.6. The ‘Genomics’ section of the miRBase site allows the user to specify flanking and clustering distances, and the range of features desired.
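The selection of features around a miRNA can be sketched as a score filter followed by a window overlap test. The code below is a rough illustration only: the `Feature` type and function names are hypothetical, scores are assumed to pass when they are greater than or equal to the quoted cutoffs (0.990 for TSS, 0.6 for polyA), and the boundary heuristic (nearest upstream TSS and nearest downstream polyA for a plus-strand miRNA) is a naive stand-in for the careful inspection described in the text, not the authors' procedure.

```python
from typing import List, NamedTuple, Optional, Tuple


class Feature(NamedTuple):
    kind: str               # e.g. "TSS", "polyA", "CpG", "EST"
    start: int              # 1-based inclusive
    end: int
    score: Optional[float]  # None when the feature type carries no score


# Cutoffs quoted in the text; feature types not listed here are not filtered.
CUTOFFS = {"TSS": 0.990, "polyA": 0.6}


def features_near(features: List[Feature], mir_start: int, mir_end: int,
                  flank: int) -> List[Feature]:
    """Keep features passing their cutoff that overlap the flanked window."""
    lo, hi = mir_start - flank, mir_end + flank
    kept = []
    for f in features:
        cutoff = CUTOFFS.get(f.kind)
        if cutoff is not None and (f.score is None or f.score < cutoff):
            continue
        if f.start <= hi and f.end >= lo:
            kept.append(f)
    return kept


def putative_bounds(features: List[Feature], mir_start: int,
                    mir_end: int) -> Tuple[Optional[int], Optional[int]]:
    """Plus-strand heuristic: nearest upstream TSS, nearest downstream polyA."""
    tss = [f.start for f in features if f.kind == "TSS" and f.end < mir_start]
    polya = [f.end for f in features if f.kind == "polyA" and f.start > mir_end]
    return (max(tss) if tss else None, min(polya) if polya else None)
```

In practice the supporting evidence (CpG islands, ESTs, cDNAs, DITAGs) would be weighed alongside the TSS and polyA predictions before a boundary is accepted, so the output of `putative_bounds` is best treated as a starting point for inspection.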

Figure 1. miRBase view of the distribution of genomic features around mmu-mir-135b on mouse chromosome 1, showing TSS, CpG island, EST, cDNA, DITAG (172B221 and 172B22) and polyA site support for a 15 kb primary transcript.

AVAILABILITY

miRBase is available on the web at http://microrna.sanger.ac.uk/ . All data are available for download from the FTP site ( ftp://ftp.sanger.ac.uk/pub/mirbase/ ) in a variety of formats, including FASTA sequences and MySQL relational database dumps.

ACKNOWLEDGEMENTS

S.G.-J. is funded by the University of Manchester. H.K.S. holds a GlaxoSmithKline postdoctoral fellowship, and work at the Sanger Institute is funded by the Wellcome Trust. Funding to pay the Open Access publication charges for this article was provided by the University of Manchester.

Conflict of interest statement . None declared.

REFERENCES

1. miRNAs control gene expression in the single-cell alga Chlamydomonas reinhardtii, Nature, 2007, vol. 447 (pg. 1126-1129)
2. A complex system of small RNAs in the unicellular green alga Chlamydomonas reinhardtii, Genes Dev., 2007, vol. 21 (pg. 1190-1203)
3. Repression of protein synthesis by miRNAs: how many mechanisms?, Trends Cell Biol., 2007, vol. 17 (pg. 118-126)
4. The diverse functions of microRNAs in animal development and disease, Dev. Cell, 2006, vol. 11 (pg. 441-450)
5. MicroRNA expression and function in cancer, Trends Mol. Med., 2006, vol. 12 (pg. 580-587)
6. miRBase: microRNA sequences, targets and gene nomenclature, Nucleic Acids Res., 2006, vol. 34 (pg. D140-D144)
7. Human microRNA targets, PLoS Biol., 2004, vol. 2 pg. e363
8. et al. A uniform system for microRNA annotation, RNA, 2003, vol. 9 (pg. 277-279)
9. The microRNA registry, Nucleic Acids Res., 2004, vol. 32 (pg. D109-D111)
10. Genomics of microRNA, Trends Genet., 2006, vol. 22 (pg. 165-173)
11. MicroRNA biogenesis: coordinated cropping and dicing, Nat. Rev. Mol. Cell. Biol., 2005, vol. 6 (pg. 376-385)
12. MicroRNAs: genomics, biogenesis, mechanism, and function, Cell, 2004, vol. 116 (pg. 281-297)
13. The microRNAs of Caenorhabditis elegans, Genes Dev., 2003, vol. 17 (pg. 991-1008)
14. Vertebrate microRNA genes, Science, 2003, vol. 299 pg. 1540
15. et al. The colorectal microRNAome, Proc. Natl Acad. Sci. USA, 2006, vol. 103 (pg. 3687-3692)
16. et al. Many novel mammalian microRNA candidates identified by extensive cloning and RAKE analysis, Genome Res., 2006, vol. 16 (pg. 1289-1298)
17. et al. A mammalian microRNA expression atlas based on small RNA library sequencing, Cell, 2007, vol. 129 (pg. 1401-1414)
18. et al. Ensembl 2007, Nucleic Acids Res., 2007, vol. 35 (pg. D610-D617)
19. Perfect seed pairing is not a generally reliable predictor for miRNA-target interactions, Nat. Struct. Mol. Biol., 2006, vol. 13 (pg. 849-851)
20. Potent effect of target structure on microRNA function, Nat. Struct. Mol. Biol., 2007, vol. 14 (pg. 287-294)
21. MicroRNA targeting specificity in mammals: determinants beyond seed pairing, Mol. Cell, 2007, vol. 27 (pg. 91-105)
22. TarBase: a comprehensive database of experimentally supported animal microRNA targets, RNA, 2006, vol. 12 (pg. 192-197)
23. Identification of mammalian microRNA host genes and transcription units, Genome Res., 2004, vol. 14 (pg. 1902-1910)
24. Genomic analysis of human microRNA transcripts, Proc. Natl Acad. Sci. USA, 2007, vol. 104 (pg. 17719-17724)
25. Computational detection and location of transcription start sites in mammalian genomic DNA, Genome Res., 2002, vol. 12 (pg. 458-461)
26. et al. The UCSC genome browser database: update 2007, Nucleic Acids Res., 2007, vol. 35 (pg. D668-D673)
27. DNAFSMiner: a web-based software toolbox to recognize two types of functional sites in DNA sequences, Bioinformatics, 2005, vol. 21 (pg. 671-673)

© 2007 The Author(s)

This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/by-nc/2.0/uk/) which permits unrestricted non-commercial use, distribution, and reproduction in any medium, provided the original work is properly cited.
