LNCipedia: a database for annotated human lncRNA transcript sequences and structures
Journal Article
Accepted:
10 September 2012
Published:
05 October 2012
Pieter-Jan Volders, Kenny Helsens, Xiaowei Wang, Björn Menten, Lennart Martens, Kris Gevaert, Jo Vandesompele, Pieter Mestdagh, LNCipedia: a database for annotated human lncRNA transcript sequences and structures, Nucleic Acids Research, Volume 41, Issue D1, 1 January 2013, Pages D246–D251, https://doi.org/10.1093/nar/gks915
Abstract
Here, we present LNCipedia (http://www.lncipedia.org), a novel database for human long non-coding RNA (lncRNA) transcripts and genes. LncRNAs constitute a large and diverse class of non-coding RNA genes. Although several lncRNAs have been functionally annotated, the majority remain to be characterized. Different high-throughput methods to identify new lncRNAs (including RNA sequencing and annotation of chromatin-state maps) have been applied in various studies, resulting in multiple unrelated lncRNA data sets. LNCipedia offers 21 488 annotated human lncRNA transcripts obtained from different sources. In addition to basic transcript information and gene structure, several statistics are determined for each entry in the database, such as secondary structure information, protein coding potential and microRNA binding sites. Our analyses suggest that, much like microRNAs, many lncRNAs have a significant secondary structure, in line with their presumed association with proteins or protein complexes. Available literature on specific lncRNAs is linked, and users or authors can submit articles through a web interface. Protein coding potential is assessed by two different prediction algorithms: Coding Potential Calculator and HMMER. In addition, a novel strategy has been integrated for detecting potentially coding lncRNAs by automatically re-analysing the large body of publicly available mass spectrometry data in the PRIDE database. LNCipedia is publicly available and allows users to query and download lncRNA sequences and structures based on different search criteria. The database may serve as a resource to initiate small- and large-scale lncRNA studies. As an example, the LNCipedia content was used to develop a custom microarray for expression profiling of all available lncRNAs.
INTRODUCTION
Long non-coding RNAs (lncRNAs) constitute a recently discovered class of non-coding RNAs that has grown drastically in size during the past few years. LncRNA genes give rise to long (>200 bp) and often multiexonic transcripts that are presumed not to be translated into protein, as commonly assessed by means of in silico prediction algorithms (1). In comparison with their protein-coding counterparts, lncRNA genes are poorly conserved (2) and are more numerous in biologically complex species (3). Although only a fraction of the lncRNA genes has been characterized experimentally, lncRNAs seem to function as transcriptional regulators through direct interaction with chromatin-modifying proteins and transcription factors (1,4,5).
LncRNAs with experimentally validated functions or expression patterns have been named accordingly. Notable examples are XIST (X inactive-specific transcript) (6), HOTAIR (HOX transcript antisense RNA) (7) and HULC (highly up-regulated in liver cancer) (8). The HUGO Gene Nomenclature Committee currently uses several schemes to name lncRNAs with an unknown function. LncRNAs that reside on the opposite strand to (antisense) or in an intron of (intronic) a protein-coding gene are named after the protein-coding gene with suffixes ‘-AS’ and ‘-IT’, respectively. Intergenic lncRNAs are numbered and get the prefix ‘LINC’ (9).
Recent advances in non-coding RNA research have led to the creation of several lncRNA resources. LncRNAdb focuses on lncRNA transcripts with well-described functions in literature (10), whereas the ncRNA database (ncRNAdb) provides RNA sequences and annotation from different sources (11). The NONCODE database (12) contains a larger collection of human long non-coding RNAs (33 829) obtained from different sources and by different experimental procedures (13). Rfam provides structures and annotation of well-known RNA families along with predictions of new members of these families (14). However, it does not provide information for an individual lncRNA. Although each of these resources provides valuable information, database unification and the integration of lncRNA transcript sequence details with a broad set of bioinformatics tools and a universal lncRNA gene building and naming scheme are currently lacking. Here, we present LNCipedia, a catalogue of 21 488 lncRNA transcripts that were clustered into genes, named accordingly and analysed using multiple bioinformatics tools, revealing insights into lncRNA structure, experimentally verified (lack of) protein coding potential, function and regulation. We believe such a database facilitates human lncRNA research and communication among scientists.
DATABASE DEVELOPMENT
The sources used in the data collection step are listed in Table 1. The most recent version of each source at the time of development has been included. The sequences and annotations are extracted and stored in a MongoDB database using custom Perl scripts. For this purpose, import scripts for different file formats, such as FASTA, BED and GFF, have been developed. Redundant transcripts are grouped in a single record, while maintaining all annotation from the original sources. The web interface for LNCipedia is built using the Mojolicious Perl web framework and offers different ways of querying the data (Figure 1). LNCipedia will be updated when newer versions of the lncRNA sources are released or if new sources become available. In addition, researchers are encouraged to submit new transcript sequences or annotations through lncipedia.org.
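The grouping of redundant transcripts can be illustrated with a minimal Python sketch. It is not the authors' Perl import code. It assumes that two transcripts count as redundant when they share chromosome, strand and exon coordinates, a criterion the text does not spell out, and the record fields (`id`, `source`, `chrom`, `strand`, `exons`) are hypothetical names chosen for the example.

```python
from collections import defaultdict


def group_redundant(transcripts):
    """Merge transcripts with identical chromosome, strand and exon structure.

    Assumption: 'redundant' means identical coordinates. Every source
    identifier is kept so that no original annotation is lost.
    """
    records = defaultdict(lambda: {"sources": []})
    for t in transcripts:
        # Sort exons so that input order does not affect the comparison.
        exons = tuple(sorted(tuple(e) for e in t["exons"]))
        key = (t["chrom"], t["strand"], exons)
        rec = records[key]
        rec["chrom"], rec["strand"], rec["exons"] = key
        rec["sources"].append({"source": t["source"], "id": t["id"]})
    return list(records.values())
```

Each returned record holds one exon structure plus the list of source identifiers that reported it, which mirrors the idea of a single record that retains the annotation of all original sources.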
Figure 1.
LNCipedia is generated in a multistep process that comprises importing, naming, analysis and visualization of lncRNA genes. Import scripts for the FASTA, BED and GFF file formats process lncRNA transcripts and detect redundancy. LncRNA naming is preceded by the creation of lncRNA transcript clusters and requires information on the nearest protein-coding gene on the same DNA strand. Every lncRNA transcript is subsequently analysed using multiple algorithms, and the results are appended to the database. A web interface built using Perl enables lncRNA visualization and database querying.
Table 1.
The different sources of lncRNA transcripts used for LNCipedia at the time of development (a)
Source | Version | Number of transcripts |
---|---|---|
Ensembl (biotype = lincRNA) | Version 64 | 9069 |
Human bodymap lincRNAs (2) | | 14 279 |
LncRNAdb (10) | September 2011 | 134 |
Total number of unique transcripts | | 21 488 |
(a) The database will be updated with new transcripts when new versions of the sources are released.
Of note, each of the input sources uses a different naming scheme. LncRNA researchers have previously used the gene symbol of the nearest protein coding gene to refer to a given lncRNA (15). Based on this strategy, we have implemented a universal lncRNA nomenclature to ease communication among researchers. Different lncRNA transcripts are considered to belong to the same gene if they share at least one (partially) overlapping exon and reside on the same DNA strand. In this way, transcripts are clustered into genes. These lncRNA genes are then named after the HUGO symbol of the nearest protein-coding gene on the same strand using the following scheme: ‘lnc-HUGO-#’. The lncRNA genes are numbered, starting with the lncRNA gene closest to the protein-coding gene. A second number is added to denote the different transcript variants starting with the most upstream transcript, for example, lnc-MYCN-1:1 denotes transcript 1 from gene lnc-MYCN-1 (Figure 2).
Figure 2.
The SOX1 protein-coding gene locus contains three lncRNAs on the same DNA strand, numbered according to their distance in relation to SOX1. LncRNA transcripts are numbered according to their order in the gene, starting with the most upstream transcript.
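The clustering and naming rules described above can be sketched in Python as follows. This is an illustrative reading of the scheme, not the LNCipedia implementation. It assumes half-open exon coordinates, takes the distance of each cluster to the nearest protein-coding gene as a precomputed input, and interprets 'most upstream' as the smallest start on the plus strand and the largest end on the minus strand. All function and field names are hypothetical.

```python
def exons_overlap(a, b):
    """True if any exon in a (partially) overlaps any exon in b."""
    return any(s1 < e2 and s2 < e1 for s1, e1 in a for s2, e2 in b)


def cluster_transcripts(transcripts):
    """Cluster transcripts that share an overlapping exon on the same strand."""
    parent = list(range(len(transcripts)))

    def find(i):
        # Union-find lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i, a in enumerate(transcripts):
        for j in range(i + 1, len(transcripts)):
            b = transcripts[j]
            same_strand = (a["chrom"], a["strand"]) == (b["chrom"], b["strand"])
            if same_strand and exons_overlap(a["exons"], b["exons"]):
                parent[find(i)] = find(j)

    clusters = {}
    for i, t in enumerate(transcripts):
        clusters.setdefault(find(i), []).append(t)
    return list(clusters.values())


def name_transcripts(clusters_with_distance, symbol):
    """Assign 'lnc-HUGO-#:#' names.

    clusters_with_distance: list of (distance_to_gene, [transcripts]) pairs;
    the distance is assumed to be computed elsewhere.
    """
    names = {}
    ordered = sorted(clusters_with_distance, key=lambda c: c[0])
    for gene_no, (_, members) in enumerate(ordered, start=1):
        strand = members[0]["strand"]

        def upstream_key(t):
            start = min(s for s, _ in t["exons"])
            end = max(e for _, e in t["exons"])
            return start if strand == "+" else -end

        for tx_no, t in enumerate(sorted(members, key=upstream_key), start=1):
            names[t["id"]] = f"lnc-{symbol}-{gene_no}:{tx_no}"
    return names
```

The pairwise comparison is quadratic in the number of transcripts, which keeps the sketch short; a production pipeline would more likely sort by coordinate or use an interval index.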
INTEGRATED ANALYSIS TOOLS
LncRNA-protein interactions are, in part, mediated by the secondary structure of the lncRNA. The Vienna RNA package (16,17) consists of a set of algorithms for predicting and analysing RNA secondary structures. We applied the RNAfold algorithm to generate a secondary structure plot and dot plot with pair probabilities. Both of these images are processed with the provided relplot.pl script to obtain a structure plot with colour annotated base pair probabilities. The output postscript (.ps) images are converted to the graphics interchange format (.gif) for display in web browsers.
Structural RNAs, such as miRNAs, have a significantly lower minimum free energy of folding compared with randomly shuffled sequences (18). The Randfold algorithm implements the randomization test and returns the minimum free energy of folding and _P_-value for every RNA sequence. Hence, a significant _P_-value denotes a high propensity in the sequence towards a stable secondary structure.
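The randomization test can be sketched as below. This is a simplified illustration, not the Randfold source code. It assumes simple mononucleotide shuffling and a _P_-value of R/(N + 1), where R is the number of shuffled sequences that fold at least as stably as the original and N is the number of shuffles. The `energy` argument is a placeholder for a real folding routine such as RNAfold, which this sketch does not call.

```python
import random


def randomization_p_value(seq, energy, n_shuffles=99, seed=0):
    """Estimate how often a shuffled sequence folds as stably as the original.

    energy: callable mapping a sequence to its folding free energy
            (lower means more stable). A placeholder for a real folding tool.
    """
    rng = random.Random(seed)
    observed = energy(seq)
    letters = list(seq)
    hits = 0
    for _ in range(n_shuffles):
        rng.shuffle(letters)  # mononucleotide shuffle, preserves base composition
        if energy("".join(letters)) <= observed:
            hits += 1
    return hits / (n_shuffles + 1)
```

With 99 shuffles the smallest attainable value is 0 and the resolution is 0.01, so the number of shuffles bounds how significant a result can appear.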
Recently, it has been shown that lncRNAs can act as a miRNA sponge by binding specific microRNAs and, thus, interfering with their role as negative regulators of gene expression (5,19,20). We include miRNA seed predictions for every lncRNA to allow researchers to evaluate possible miRNA–lncRNA interactions. miRNA seed predictions were performed using the MirTarget2 algorithm (21).
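To make the notion of a seed match concrete, the sketch below scans a transcript for perfect complements of a miRNA seed. It is a deliberately naive illustration and not the MirTarget2 algorithm, which scores candidate sites with additional features. The seed is assumed to be nucleotides 2 to 8 of the miRNA, and the function names are hypothetical.

```python
COMPLEMENT = str.maketrans("ACGU", "UGCA")


def seed_site(mirna):
    """Return the target-site sequence complementary to miRNA positions 2-8."""
    seed = mirna.upper().replace("T", "U")[1:8]
    return seed.translate(COMPLEMENT)[::-1]  # reverse complement


def find_seed_matches(lncrna, mirna):
    """Return 0-based start positions of perfect seed matches in the lncRNA."""
    target = lncrna.upper().replace("T", "U")
    site = seed_site(mirna)
    positions = []
    start = target.find(site)
    while start != -1:
        positions.append(start)
        start = target.find(site, start + 1)
    return positions
```

A transcript with several hits for the same miRNA would be a more plausible sponge candidate than one with a single hit, but a perfect seed match alone is weak evidence of a functional interaction.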
PROTEIN CODING POTENTIAL
Assessment of protein coding potential is an important aspect in the study of non-coding RNAs. LNCipedia reports the outcome of two different protein coding potential prediction algorithms. The Coding Potential Calculator (CPC) applies a support vector machine classifier to the output of open reading frame analysis and Basic Local Alignment Search Tool search (22). CPC returns the predicted status of the transcript (coding/non-coding) and a coding potential score. We applied version 0.9 of the CPC software and report the predicted status and the coding potential score for every transcript. Another popular strategy for detection of coding sequences is based on known protein domains. The HMMER3 suite provides software based on hidden Markov models for sequence-based homology searches (23). It is often used in combination with the Pfam protein families database (24). Using the hmmscan algorithm, we searched for Pfam protein domains in the RNA sequence. All six reading frames were translated in silico, and the numbers of hits in the 5′ to 3′ and 3′ to 5′ directions are reported.
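The in silico six-frame translation that precedes the domain search can be sketched as follows. This is a minimal illustration using the standard genetic code, not the code used for LNCipedia. Frame labels such as `+1` and `-3` are a naming choice made for this example, and codons containing ambiguous bases are translated to `X`.

```python
from itertools import product

# Standard genetic code, codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): a for c, a in zip(product(BASES, repeat=3), AMINO)}


def reverse_complement(dna):
    return dna.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def translate(dna):
    """Translate complete codons only; unknown codons become 'X'."""
    return "".join(
        CODON_TABLE.get(dna[i:i + 3], "X") for i in range(0, len(dna) - 2, 3)
    )


def six_frame_translations(sequence):
    """Return translations of three forward and three reverse frames."""
    dna = sequence.upper().replace("U", "T")
    frames = {}
    for strand, seq in (("+", dna), ("-", reverse_complement(dna))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(seq[offset:])
    return frames
```

The resulting protein strings, including stop codons marked `*`, are the kind of input that a profile search such as hmmscan would then compare against Pfam domains.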
A unique feature of LNCipedia is the incorporation of an automated reprocessing pipeline that relies on publicly available fragmentation spectra from the PRIDE database at EMBL-EBI (25) to detect potentially coding lncRNAs. The concept behind this feature is that mass spectrometry-based proteomics data may contain serendipitously recorded mass spectra derived from translated lncRNAs. As standard identification strategies in proteomics are based on searching these spectra against protein sequence databases, such as UniProtKB/Swiss-Prot (26), they are implicitly unable to detect coding forms of lncRNAs, as they are not present in these databases. To uncover such potential traces of coding lncRNAs, the spectra, thus, need to be re-searched against a purpose-built database that comprises a combination of the possible translations of known lncRNAs, the known proteins for that organism as obtained from a traditional sequence database and corresponding decoy sequences for both these constituent databases for quality control and FDR estimation purposes (27). A spectrum can, thus, be matched against a lncRNA, a known protein, or a decoy sequence. The known proteins must be included to prevent relatively low-scoring matches of spectra against lncRNAs from being picked up where a much better match for that spectrum can be found for a known protein.
We have implemented such a pipeline by using the SearchGUI tool (28) to run the X!Tandem (29) search algorithm. All results are then collated and filtered at 1% FDR by the PeptideShaker algorithm (http://code.google.com/p/peptide-shaker). The pipeline infers the original search parameters, such as mass errors and post-translational modifications both directly from the PRIDE database and by using the PRIDE automatic spectrum annotation pipeline (http://code.google.com/p/pride-asa-pipeline). All the tools and algorithms used are freely available as open source.
The pipeline has so far been run on 149 PRIDE experiments from at least 15 different tissues, yielding 81 579 peptide-to-spectrum matches (PSMs) against the custom-built protein sequence database that includes UniProtKB/Swiss-Prot and LNCipedia translations (Supplementary Figure S1). Within these PSMs, there were just 14 matches that could provide evidence for translation of LNCipedia entries. However, after close inspection of the FDR of the PSMs that passed our quality criteria, we noticed that although the PSMs from UniProtKB/Swiss-Prot have an expected FDR of 0.9%, the subset of PSMs from translated LNCipedia entries comes with an overwhelming FDR of 166% (Supplementary Figure S2). As such, there are only vague suggestions so far that any of these entries can effectively be translated.
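The difference between a global FDR and the FDR of a small subset can be illustrated with a target-decoy sketch. The estimator used here, the number of decoy hits divided by the number of target hits, is a common convention and an assumption of this example; the text does not state the exact formula applied by the pipeline. The PSM fields `database` and `is_decoy` are hypothetical.

```python
def estimate_fdr(psms, subset=None):
    """Estimate FDR as decoy hits / target hits among accepted PSMs.

    psms:   iterable of dicts with 'database' (str) and 'is_decoy' (bool).
    subset: restrict the estimate to PSMs from one constituent database.
    """
    selected = [p for p in psms if subset is None or p["database"] == subset]
    decoys = sum(1 for p in selected if p["is_decoy"])
    targets = len(selected) - decoys
    return decoys / targets if targets else float("nan")
```

Under this estimator a subset can show a value above 100% whenever its decoy hits outnumber its target hits, even if the combined result set passes a 1% threshold, which is why the subset has to be inspected separately.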
As the PRIDE database is growing exponentially, and additional lncRNA transcript discovery is ongoing, searches for potentially coding lncRNAs need to be carried out anew at regular intervals to stay up-to-date with the growing amount of public data. We, therefore, envision running the full pipeline on all applicable PRIDE data at a set interval of 3 months, thus periodically updating the knowledge on which lncRNAs might have coding potential. The output of each reprocessing effort will be used to annotate LNCipedia, and past results will be kept available as well.
Besides this recurrent re-analysis of the relevant publicly available proteomics data, we also plan to extend the statistical approach used to evaluate the identification of a lncRNA by including information about the consistency with which such an identification is found across (unrelated) PRIDE experiments. Indeed, a relatively poor match in any individual experimental data set that nevertheless keeps returning across many such data sets may well be a real indication that translation is taking place for that lncRNA.
LNCIPEDIA ACCESS
LNCipedia is publicly available through a web interface at http://www.lncipedia.org. The interface allows users to query lncRNAs by name, chromosomal region or (partial) sequence. Several statistics are calculated that allow the user to evaluate different parameters regarding lncRNA secondary structure and regulation (Figure 3). The entire LNCipedia collection is available for download in the FASTA, GFF or BED format.
Figure 3.
The transcript page in the web interface provides a clear overview of information available on a specific lncRNA transcript.
LncRNA researchers can contribute to LNCipedia by contacting the authors. In addition, registered users can modify existing records (updating aliases and adding PubMed literature records) directly using a web interface.
LNCRNA EXPRESSION ARRAY
The LNCipedia content can prove useful when designing large-scale screening experiments, such as lncRNA gene expression profiling. As a proof of concept, we have developed a custom lncRNA gene expression array using the Agilent Sureprint 60 k platform. In addition to roughly 33 000 probes for protein coding genes, we selected 23 042 probes for lncRNA transcripts in LNCipedia covering 97% of all LNCipedia transcripts with at least one probe (Agilent MicroArray Design ID: 039714). The performance of the expression array was evaluated using RNA sample titrations according to the MicroArray Quality Control standards (30). Adequate titration response of the lncRNA probes is shown in Supplementary Figure S3.
CONCLUSION AND FUTURE DIRECTION
Three important features are unique to LNCipedia: gene definitions and usage of a universal nomenclature for lncRNA transcripts, PRIDE analysis for detection of lncRNAs that may code for small peptides and miRNA seed predictions for lncRNA transcripts. These, along with the other tools available, are expected to make LNCipedia a powerful resource for human lncRNA research.
With the advances in RNA sequencing technology, more lncRNA genes are expected to be discovered. The authors will update LNCipedia when new sequences are reported in the literature or in other sources. In addition, new features will be developed to increase the interactive capabilities of LNCipedia. In this way, the lncRNA community will be able to upload and maintain records in the database. LNCipedia has the potential to become a community resource for lncRNA transcript information and annotation.
FUNDING
Ghent University Multidisciplinary Research Partnership ‘Bioinformatics: from nucleotides to networks’ (to P.J.V., L.M., K.G., J.V.); National Institutes of Health [R01GM089784 to X.W.]; Flemish Fund for Scientific Research Flanders (FWO) (to P.M.); Ghent University Special Research Fund (BOF) (to J.V.). Funding for open access charge: Ghent University.
Conflict of interest statement. None declared.
ACKNOWLEDGEMENTS
The authors would like to acknowledge the equal contribution of J.V. and P.M.
REFERENCES
1. Long non-coding RNAs: insights into functions, Nat. Rev. Genet., 2009, vol. 10 (pg. 155-159)
2. Integrative annotation of human large intergenic noncoding RNAs reveals global properties and specific subclasses, Genes Dev., 2011, vol. 25 (pg. 1915-1927)
3. Increasing biological complexity is positively correlated with the relative genome-wide expansion of non-protein-coding DNA sequences, Genome Biol., 2003, vol. 5 (pg. P1-P24)
4. Molecular mechanisms of long noncoding RNAs, Mol. Cell, 2011, vol. 43 (pg. 904-914)
5. Modular regulatory principles of large non-coding RNAs, Nature, 2012, vol. 482 (pg. 339-346)
6. A gene from the region of the human X inactivation centre is expressed exclusively from the inactive X chromosome, Nature, 1991, vol. 349 (pg. 38-44)
7. et al. Long non-coding RNA HOTAIR reprograms chromatin state to promote cancer metastasis, Nature, 2010, vol. 464 (pg. 1071-1076)
8. et al. Characterization of HULC, a novel gene with striking up-regulation in hepatocellular carcinoma, as noncoding RNA, Gastroenterology, 2007, vol. 132 (pg. 330-342)
9. Naming ‘junk’: human non-protein coding RNA (ncRNA) gene nomenclature, Hum. Genomics, 2011, vol. 5 (pg. 90-98)
10. lncRNAdb: a reference database for long noncoding RNAs, Nucleic Acids Res., 2010, vol. 39 (pg. D146-D151)
11. Noncoding RNAs database (ncRNAdb), Nucleic Acids Res., 2007, vol. 35 (pg. D162-D164)
12. NONCODE: an integrated knowledge database of non-coding RNAs, Nucleic Acids Res., 2005, vol. 33 (pg. D112-D115)
13. et al. NONCODE v3.0: integrative annotation of long noncoding RNAs, Nucleic Acids Res., 2011, vol. 40 (pg. D210-D215)
14. et al. Rfam: wikipedia, clans and the ‘decimal’ release, Nucleic Acids Res., 2010, vol. 39 (pg. D141-D145)
15. et al. Chromatin signature reveals over a thousand highly conserved large non-coding RNAs in mammals, Nature, 2009, vol. 458 (pg. 223-227)
16. Vienna RNA secondary structure server, Nucleic Acids Res., 2003, vol. 31 (pg. 3429-3431)
17. Fast folding and comparison of RNA secondary structures, Monatsh. Chem. Chem. Mon., 1994, vol. 125 (pg. 167-188)
18. Evidence that microRNA precursors, unlike other non-coding RNAs, have lower folding free energies than random sequences, Bioinformatics, 2004, vol. 20 (pg. 2911-2917)
19. A long noncoding RNA controls muscle differentiation by functioning as a competing endogenous RNA, Cell, 2011, vol. 147 (pg. 358-369)
20. et al. Suppression of progenitor differentiation requires the long noncoding RNA ANCR, Genes Dev., 2012, vol. 26 (pg. 338-343)
21. Prediction of both conserved and nonconserved microRNA targets in animals, Bioinformatics, 2008, vol. 24 (pg. 325-332)
22. CPC: assess the protein-coding potential of transcripts using sequence features and support vector machine, Nucleic Acids Res., 2007, vol. 35 (pg. W345-W349)
23. Accelerated profile HMM searches, PLoS Comput. Biol., 2011, vol. 7 pg. e1002195
24. et al. The Pfam protein families database, Nucleic Acids Res., 2011, vol. 40 (pg. D290-D301)
25. PRIDE: the proteomics identifications database, Proteomics, 2005, vol. 5 (pg. 3537-3545)
26. Large-scale database searching using tandem mass spectra: looking up the answer in the back of the book, Nat. Methods, 2004, vol. 1 (pg. 195-202)
27. Peptide identification quality control, Proteomics, 2011, vol. 11 (pg. 2105-2114)
28. SearchGUI: an open-source graphical user interface for simultaneous OMSSA and X!Tandem searches, Proteomics, 2011, vol. 11 (pg. 996-999)
29. TANDEM: matching proteins with tandem mass spectra, Bioinformatics, 2004, vol. 20 (pg. 1466-1467)
30. et al. Evaluation of DNA microarray results with quantitative gene expression platforms, Nat. Biotechnol., 2006, vol. 24 (pg. 1115-1122)
© The Author(s) 2012. Published by Oxford University Press.
This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by-nc/3.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited. For commercial re-use, please contact journals.permissions@oup.com.