lncRNAdb: a reference database for long noncoding RNAs

Abstract

Large numbers of long RNAs with little or no protein-coding potential [long noncoding RNAs (lncRNAs)] are being identified in eukaryotes. In parallel, increasing data describing the expression profiles, molecular features and functions of individual lncRNAs in a variety of systems are accumulating. To enable the systematic compilation and updating of this information, we have developed a database (lncRNAdb) containing a comprehensive list of lncRNAs that have been shown to have, or to be associated with, biological functions in eukaryotes, as well as messenger RNAs that have regulatory roles. Each entry contains referenced information about the RNA, including sequences, structural information, genomic context, expression, subcellular localization, conservation, functional evidence and other relevant information. lncRNAdb can be searched by querying published RNA names and aliases, sequences, species and associated protein-coding genes, as well as terms contained in the annotations, such as the tissues in which the transcripts are expressed and associated diseases. In addition, lncRNAdb is linked to the UCSC Genome Browser for visualization and Noncoding RNA Expression Database (NRED) for expression information from a variety of sources. lncRNAdb provides a platform for the ongoing collation of the literature pertaining to lncRNAs and their association with other genomic elements. lncRNAdb can be accessed at: http://www.lncrnadb.org/.

INTRODUCTION

The eukaryotic transcriptome is enormous, and comprises not only a large set of protein-coding messenger RNAs, but also large numbers of non-protein coding transcripts that have structural, regulatory or unknown functions (1). Noncoding RNAs include intergenic transcripts as well as a complex array of RNAs that overlap protein-coding loci on both strands (2–4), including promoter-associated RNAs (4), intronic RNAs (5), convergent and bi-directional transcripts (3), noncoding alternatively spliced isoforms of protein-coding genes (6) and mRNAs that also have regulatory roles as untranslated RNAs (7,8). While some of these RNAs are processed to form small RNA species such as microRNAs (miRNAs) and small nucleolar RNAs (snoRNAs), many can have diverse roles as primary or spliced long noncoding RNAs (lncRNAs) (9). In structural terms, lncRNAs range in size from approximately 100 to 100 000 bases, can be spliced or unspliced, polyadenylated or non-polyadenylated, nuclear or cytoplasmic, and are usually transcribed by RNA polymerase II and/or III (10).

The combination of different experimental methodologies to study the transcriptome of several species has resulted in a continuous discovery of novel transcripts (11), with the FANTOM project alone cataloguing more than 30 000 putative lncRNA transcripts in mouse tissues by full-length cDNA cloning (2). This provides a challenge for their molecular and functional characterization, as well as for their cataloguing. Although the subset of lncRNAs characterized to date corresponds to a small fraction of the long noncoding transcripts in multicellular organisms, growing interest in the field has resulted in a rapid increase in the characterization of individual lncRNAs in a variety of systems.

A decade ago, fewer than a dozen lncRNAs were recognized and catalogued in all eukaryotes (12), but following the dramatic increase in their numbers, a recent catalogue listed approximately 40 functionally characterized lncRNAs in mammals alone (13). This list contained lncRNAs whose functions had been assessed either by loss- or gain-of-function experiments, and did not include many that have other types of evidence for functionality. For example, several studies have documented the cell type-specific and/or dynamic expression of hundreds of long ncRNAs in various developmental systems, such as embryonic stem cell differentiation (14) and oligodendrogenesis (15), finding in each case different subsets of differentially expressed lncRNAs.

These studies have indicated that lncRNAs comprise a class of bona fide gene products that have been largely unaccounted for in public databases such as RefSeq (16), UniGene (17) and the Mammalian Gene Collection (18). While there are databases cataloguing ncRNAs, these have been predominantly focused on well-validated classes of small RNAs and contain limited data pertaining to lncRNAs. Recently, a database with diverse information on imprinted RNAs was made available; it is composed largely of small RNAs, such as piRNAs (piwi-interacting RNAs), but also contains imprinted lncRNAs, and is restricted to mammals (19). Our previous database of noncoding RNAs, RNAdb (20), was also limited to mammals, and included all classes of regulatory RNAs, such as snoRNAs, piRNAs and miRNAs. Due to the rapid expansion in each of these classes, they are now independently curated in dedicated databases, including miRBase (21), piRNAbank (22) and Sno/scaRNAbase (23). Consequently, there is a need for a dedicated database of lncRNAs that includes not only mammalian lncRNAs, but also detailed annotations of lncRNAs from all eukaryotic species.

AIMS OF THE DATABASE

lncRNAdb provides a central repository of known lncRNAs in eukaryotic cells (including those derived from viruses), their aliases and published characteristics. A well-collated library will greatly facilitate research on these poorly annotated genes. It also aims to reduce instances of duplication and uncertain identity by including aliases. Examples are p15AS, which was reported as a novel antisense RNA (24) but appears to be an unspliced isoform of a previously described ncRNA called ANRIL (25), and Gomafu (26), which is also known as MIAT (27) and RNCR2 (28).

The centralization of lncRNAdb enables integration with other resources including the UCSC Genome Browser (29) and the Noncoding RNA Expression Database (NRED) (30), which provide insight into genomic context and expression data. This ensures that researchers interested in lncRNAs can conveniently find a wide range of information regarding genes of interest from a single location.

Currently, the characteristics and functions of most lncRNAs are still unexplored, but it is expected that the number of studied RNAs will rapidly increase through in silico, in vitro and in vivo characterization. By providing a simple, publicly accessible interface, we aim to give the scientific community a tool that allows users to update existing entries, modify them to improve accuracy and add new entries. Following verification of new published data by the curators, the information will become available in lncRNAdb.

DATABASE STRUCTURE

lncRNAdb is available online at: http://www.lncrnadb.org. Users can search the database by lncRNA name, nucleotide sequence string, species, annotation status or through a full-text search with results being displayed for online perusal and available as a tab delimited file download.

Annotated entries include one or more literature references, annotations across a series of categories and a list of species in which the lncRNA is observed. Literature references and genomic coordinates are hyperlinked to PubMed and the UCSC Genome Browser, respectively. Other biological components related to the annotated lncRNAs, such as genomically-associated genes or interacting proteins, are also listed and briefly described in a separate table that provides links to the PubMed reference.

The lncRNAdb database also links to NRED, our online expression analysis application for ncRNAs in mouse and human (30). This extends the database through access to relative expression levels of both the lncRNAs and their contextually related transcripts in various public microarray experiments, such as NCode data and the GNF atlas (31), as well as the Allen Brain Atlas (32), which includes mouse brain in situ expression data for over 800 expressed lncRNAs (33).

The application architecture consists of a Microsoft ASP.NET 3.5 presentation layer (C#), a C# 4.0 data model and application layer, and MySQL persistent storage.

QUERYING THE DATABASE

Querying the database is a matter of entering one or more search criteria (Figure 1A): a full or partial lncRNA name (or alias), a search string to interrogate each lncRNA’s annotations, a species in which the lncRNA is known, or any combination of these. lncRNAdb will return a list of matching entries whose full details can be viewed by clicking on the lncRNA name.
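
The way these criteria combine can be pictured as a conjunction of optional filters. The Python sketch below is illustrative only. The `Entry` record, its field names and the `search` function are assumptions made for this example and do not reflect the actual lncRNAdb schema or its ASP.NET/MySQL implementation. The sample names and aliases are taken from this article, while the species and annotation strings are simplified placeholders.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Entry:
    # Hypothetical record layout; not the real lncRNAdb schema.
    name: str
    aliases: List[str] = field(default_factory=list)
    species: List[str] = field(default_factory=list)
    annotation: str = ""


def search(entries: List[Entry],
           name: Optional[str] = None,
           text: Optional[str] = None,
           species: Optional[str] = None) -> List[Entry]:
    """Return the entries that satisfy every criterion that was supplied.

    name    -- full or partial lncRNA name or alias (case-insensitive)
    text    -- substring searched for in the free-text annotation
    species -- species name, matched exactly but case-insensitively
    """
    results = []
    for entry in entries:
        if name is not None:
            labels = [entry.name] + entry.aliases
            if not any(name.lower() in label.lower() for label in labels):
                continue
        if text is not None and text.lower() not in entry.annotation.lower():
            continue
        if species is not None:
            if species.lower() not in [s.lower() for s in entry.species]:
                continue
        results.append(entry)
    return results


# Placeholder data for illustration (not an export of the database).
ENTRIES = [
    Entry("Gomafu", ["MIAT", "RNCR2"], ["Mus musculus"], "placeholder annotation"),
    Entry("ANRIL", ["p15AS"], ["Homo sapiens"], "coronary artery disease locus"),
    Entry("Neat1", ["VINC"], ["Mus musculus"], "induced during viral infection"),
]
```

For example, `search(ENTRIES, name="p15")` finds the ANRIL entry through its alias, and `search(ENTRIES, species="Mus musculus", text="viral")` narrows the mouse entries to Neat1. Omitted criteria are simply ignored, which mirrors the "any combination" behaviour described above.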

Figure 1.

Representative screenshots of lncRNAdb showing (A) the search bar and (B) a lncRNA catalogued in lncRNAdb. Figure 1A shows the search fields available for querying the database and some of the pre-made descriptors available. Figure 1B depicts part of the annotation and references for the Neat1 transcript.

To download the database values for the entire list as a file of tab delimited text, users can click on the ‘Export Results’ button under the search results.
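
Because the export is plain tab-delimited text, it can be loaded with standard tools. The following Python sketch shows one way to do so. The column headers (`Name`, `Species`) and the sample rows are assumptions for illustration, since the actual layout of the exported file is not described here.

```python
import csv
import io
from collections import defaultdict


def load_export(handle):
    """Read a tab-delimited export with a header row into a list of dicts."""
    return list(csv.DictReader(handle, delimiter="\t"))


def group_by_species(rows, name_column="Name", species_column="Species"):
    """Map each species to the names of the entries listed for it."""
    grouped = defaultdict(list)
    for row in rows:
        grouped[row[species_column]].append(row[name_column])
    return dict(grouped)


# Hypothetical export content; real column names may differ.
SAMPLE = (
    "Name\tSpecies\n"
    "Neat1\tMus musculus\n"
    "ANRIL\tHomo sapiens\n"
    "HOTAIR\tHomo sapiens\n"
)

rows = load_export(io.StringIO(SAMPLE))
by_species = group_by_species(rows)
```

With a downloaded file, the in-memory `io.StringIO` object would be replaced by a file handle opened with `open(path, newline="", encoding="utf-8")`, where `path` is wherever the export was saved.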

lncRNAdb CONTENT

Although lncRNAs have been defined previously as transcripts >200 nt (9), this was an arbitrary definition largely based on a convenient biochemical cutoff in RNA isolation protocols and the fact that it excluded most known small RNAs. In an effort to make a clear distinction from small RNA species and to create a more biologically meaningful definition of lncRNAs, we consider lncRNAs as noncoding RNAs that may have a function as either primary or spliced transcripts, which is independent of processing into known classes of small RNAs, such as miRNAs, piRNAs and snoRNAs, while also excluding structural RNAs from classical housekeeping families (tRNAs, snoRNAs, spliceosomal RNAs, etc.). Existing databases and archives, including Sno/scaRNAbase (23), miRBase (21), tRNAdb-CE (34) and piRNAbank (22), already represent such ncRNAs. However, some lncRNAs that are host genes for small RNA species (35,36), but may also have roles as regulatory lncRNAs, have been included in the annotations. An example is the GAS5 lncRNA, which is a repressor of the glucocorticoid receptor but also encodes several intronic snoRNAs (37).

The database currently contains over 150 lncRNAs identified from the literature in around 60 different species. Each entry contains a comprehensive range of available information about the RNA, including sequences, structural information, genomic context, expression, subcellular localization, conservation, functional evidence and relevant ‘miscellaneous’ information (see, e.g. Figure 1B). As expected, most (∼75%) of the catalogued lncRNAs are from mammals, for which more transcriptomic data are available and which have been more intensively studied, but lncRNAs from vertebrates to single-celled eukaryotes have been included.

Among the entries in lncRNAdb, approximately 100 have functions directly tested by in vitro and/or in vivo experiments. For quick reference, these have been listed in Supplementary Table S1, highlighting the wide array of functional mechanisms and processes affected by lncRNAs. This list is not exhaustive, but it catalogues functional lncRNAs found not only in mammals but also lncRNAs tested in diverse eukaryotic species, such as meiRNA (38) in yeast, frq antisense RNAs in Neurospora crassa (39), rncs-1 in Caenorhabditis elegans (40), IPS1 (41) in plants, Xlsirts in frogs (42) and bereft in Drosophila (43). In addition to RNAs expressed in normal physiological states, the database contains information about RNAs expressed in disease, and even lncRNAs derived from viruses and expressed in eukaryotic cells during infection (see below). Examples of lncRNA categories that can be used to limit queries are listed below.

Imprinted lncRNAs

LncRNAs are prevalent in imprinted regions, where some, such as Air (44) and Kcnq1ot1 (45), function to control imprinting and the expression of other genes from the locus. Some imprinted lncRNAs are host genes for small RNAs; in some cases the lncRNA host has no known function, as for Bsr (46), whereas in others it has an independent function, e.g. the putative tumour suppressor Meg3/Gtl2 (47).

Disease associated lncRNAs

Underscoring their importance in cellular functions, a growing number of lncRNAs have been implicated in a variety of diseases, including cancer. These are described in the database and include putative or confirmed cancer-associated lncRNAs, such as NDM29 (48), HOTAIR, which regulates metastatic progression (49), and H19, which has been described as both an oncogene (50) and a tumour suppressor (51). We also include lncRNAs that have been implicated in neurological functions and diseases, including BACE1AS (52) that shows increased expression in Alzheimer’s disease, and the Drosophila hsr-ω gene, which is induced by a variety of stresses and which has been shown to greatly increase protein polyglutamine-induced toxicity and neurodegeneration (53).

In addition, the observation that a substantial fraction of the genotypic variation underlying complex phenotypic traits occurs in noncoding regions, many of which are transcribed into discrete lncRNAs, has led to the appreciation that lncRNAs may play a central role in the molecular etiology of complex diseases (13). We have catalogued RNAs associated with these loci in lncRNAdb, such as ANRIL, a well-characterized lncRNA located in the complex genetic susceptibility locus INK4b/ARF/INK4a implicated in coronary artery disease, type 2 diabetes, periodontitis and cancer (24,54,55). We opted to also include uncharacterized lncRNAs linked to disease-susceptibility loci, as cataloguing these transcripts may facilitate their recognition as candidates for functional studies in normal and pathological conditions. Examples include AK023948, which is located in a susceptibility locus to papillary thyroid tumour (56), and LOC285194, which is located in a copy number alteration and loss-of-heterozygosity region in osteosarcoma (57).

Pathogen-induced or derived lncRNAs

Some lncRNAs are produced and modulated by pathogens or host cells during infection. These have been mostly omitted from other noncoding RNA catalogues, but incorporated in lncRNAdb. These include eukaryotic parasite transcripts such as the Pinci1 ncRNA family in the fungal plant-pathogen Phytophthora infestans, which are specifically upregulated during infection (58), and mammalian lncRNAs that are induced during viral infection (59), such as Neat1/VINC in the mouse brain (60), or produced in infected cells by oncogenic viruses, such as human herpesviruses (61). Indeed, the accumulating examples of lncRNAs encoded by viruses and expressed in eukaryotic cells have been annotated in lncRNAdb, because they can regulate cell function and are relevant to disease etiology. For instance, β2.7 is a ∼2.7 kb ncRNA encoded in the herpesvirus HCMV genome that is rapidly accumulated upon infection and has a fundamental role in preventing metabolic dysfunction and apoptosis of the host cell (62).

Bifunctional RNAs

An emerging class of genes is those encoding bifunctional RNAs, which can have multiple independent roles, such as acting as a regulatory lncRNA and being translated into a protein. Examples include well-known lncRNAs such as the co-activator SRA transcript, isoforms of which also encode a protein (8,63), and known protein-coding genes, such as p53, whose transcripts also act as regulatory RNAs (7). In some cases, certain splicing isoforms are known to encode a protein, whereas other splicing isoforms of the same gene encode regulatory lncRNAs, such as the LXRB/LXRBSV isoform pair (6).

lncRNAs of unknown function

lncRNAdb includes RNAs that are well described in the literature but whose functions have yet to be identified. These lncRNAs have been characterized to some extent at the structural and/or expression level, and include dozens that show tissue specificity and dynamic expression during development (64), as well as cellular localization, suggestive of undiscovered functionality. Likewise, the database also includes transcripts such as the PHO5 antisense lncRNA in yeast (65), for which it is not yet established whether the functional role is conveyed by the lncRNA or by its transcription.

FUTURE DIRECTIONS

The data in lncRNAdb will be extended through manual curation by the authors and submissions to the site by other researchers on an ongoing basis. The falling cost and improved depth of RNA-sequencing methodologies are already enabling transcriptomic studies of alternative model and non-model species, and it is expected that the identification and characterization of lncRNAs will follow, which may substantially increase the representation of non-mammalian species in the database. New functionality will be included with integration of the publicly available successor to NRED (NRED2) early in 2011, giving access to not only additional microarray expression data but also transcriptomic RNA-seq data for contextually related long noncoding and coding genes, tools for visualization of the lncRNA and cross-referencing of experimental expression profiles. Finally, the absence of a centralized database of lncRNAs is reflected in the ad hoc names given to newly identified members, which are often not self-consistent and are potentially confusing. With the number of functional lncRNAs likely to be very large, it will be important to establish a standardized nomenclature to minimize confusion and allow this emerging field to be as accessible as possible to all biologists.

SUPPLEMENTARY DATA

Supplementary Data are available at NAR Online.

FUNDING

Australian Research Council/University of Queensland co-sponsored Federation Fellowship (FF0561986; to J.S.M.); National Health and Medical Research Council of Australia Career Development Award (CDA631542; to M.E.D.); Queensland Government Department of Employment, Economic Development and Innovation Smart Futures Fellowship (to M.E.D.); Australian Research Council Postgraduate Awards (to P.P.A., M.B.C. and D.G.A.). Funding for open access charge: The University of Queensland.

Conflict of interest statement. None declared.

REFERENCES