Prediction of novel long non-coding RNAs based on RNA-Seq data of mouse Klf1 knockout study

Lei Sun et al. BMC Bioinformatics. 2012.

Abstract

Background: Research on long non-coding RNAs (lncRNAs) has been accelerated by high-throughput RNA sequencing (RNA-Seq). However, identifying lncRNAs from RNA-Seq data is still not trivial, and uncovering their functions remains a challenge.

Results: We present a computational pipeline for detecting novel lncRNAs from RNA-Seq data. First, genome-guided transcriptome reconstruction is used to generate initially assembled transcripts. Possible partial transcripts and artefacts are then filtered out according to their quantified expression level. After that, novel lncRNAs are detected by further filtering out known transcripts and those with high protein-coding potential, using a newly developed program called lncRScan. We applied our pipeline to a mouse Klf1 knockout dataset and discussed plausible functions of the detected novel lncRNAs on the basis of differential expression analysis. We identified 308 novel lncRNA candidates, which have shorter transcripts, fewer exons, and shorter putative open reading frames than known protein-coding transcripts. Of these lncRNAs, 52 large intergenic ncRNAs (lincRNAs) show lower expression levels than the protein-coding transcripts, and 13 lncRNAs show significant differential expression between the wild-type and Klf1 knockout conditions.

Conclusions: Our method can predict a set of novel lncRNAs from RNA-Seq data. Some of these lncRNAs are differentially expressed between the wild-type and Klf1 knockout strains, suggesting that they can be given high priority in further functional studies.

Figures

Figure 1

Pipeline for predicting novel lncRNAs. (a) Initial assembly. Raw reads are first mapped onto the reference mouse genome. The unmapped reads are trimmed and then re-mapped. The read alignments of all 6 replicates are merged to increase read coverage. At the assembly stage, RABT generates synthetic reads from the RefSeq gene annotation to compensate for read coverage gaps over transcripts; (b) Novel lncRNA detection. The initial assemblies are categorized by cuffcompare through comparison with the combined gene annotations. Low-quality transcripts are then filtered out according to the optimum FPKM threshold (2.12). The lncRScan program is then run to detect novel lncRNAs among the remaining high-quality assemblies according to multiple criteria.

Figure 2

Steps of lncRScan. (1) ‘extract_category’ extracts five candidate categories of assemblies (Transcripts-1), namely ‘i’, ‘j’, ‘o’, ‘u’ and ‘x’; (2) ‘extract_length’ extracts the transcripts with length > 200 nt (Transcripts-2); (3) ‘extract_ORF’ selects the transcripts whose maximum putative ORF is < 300 nt (Transcripts-3); (4) ‘extract_PhyloCSF’ extracts the transcripts with a PhyloCSF score < 0 or for which the test failed because the ORF is < 25 aa (Transcripts-4); (5) ‘extract_Pfam’ searches the remaining transcripts against the Pfam database and excludes those with significant protein domain hits. At the end of lncRScan, the remaining 308 transcripts (Transcripts-5) are defined as the novel lncRNAs.
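The five filtering steps in the caption above can be read as a cascade of predicates applied to each assembled transcript. The following is a minimal Python sketch of that cascade, not the lncRScan implementation itself: the `Transcript` record, its field names, and the function names are hypothetical, and the sketch assumes the per-transcript values (class code, length, longest ORF, PhyloCSF score, Pfam hit) have already been computed by the upstream tools. Only the thresholds are taken from the caption.

```python
from dataclasses import dataclass
from typing import Iterable, List, Optional

# cuffcompare class codes kept as candidates in step (1) of the caption.
CANDIDATE_CLASS_CODES = {"i", "j", "o", "u", "x"}


@dataclass
class Transcript:
    """Hypothetical per-transcript record; field names are illustrative only."""
    transcript_id: str
    class_code: str                   # cuffcompare class code
    length_nt: int                    # transcript length in nucleotides
    max_orf_nt: int                   # longest putative ORF in nucleotides
    phylocsf_score: Optional[float]   # None when the test failed (ORF < 25 aa)
    has_pfam_hit: bool                # True if a significant Pfam domain hit exists


def is_candidate_lncrna(t: Transcript) -> bool:
    """Apply the five criteria in the order given in the caption."""
    if t.class_code not in CANDIDATE_CLASS_CODES:   # (1) extract_category
        return False
    if t.length_nt <= 200:                          # (2) extract_length: > 200 nt
        return False
    if t.max_orf_nt >= 300:                         # (3) extract_ORF: < 300 nt
        return False
    # (4) extract_PhyloCSF: keep a score < 0, or a failed test (None)
    if t.phylocsf_score is not None and t.phylocsf_score >= 0:
        return False
    return not t.has_pfam_hit                       # (5) extract_Pfam


def filter_lncrnas(transcripts: Iterable[Transcript]) -> List[Transcript]:
    """Return the transcripts that pass every step of the cascade."""
    return [t for t in transcripts if is_candidate_lncrna(t)]
```

Because each step only removes transcripts, the order of the checks does not change the final set; it only affects how many transcripts remain after each intermediate stage (Transcripts-1 to Transcripts-5).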

Figure 3

Differential expression tests. The cuffdiff program performs differential expression tests between the WT and Klf1 KO samples based on the read alignments (BAM) of the six replicates and high-quality assemblies (GTF).

Figure 4

FPKM distributions of complete and partial transcripts. The ‘=’ classcode is assigned to transcripts whose intron chain completely matches that of a reference transcript, so they can be treated as complete transcripts, while the ‘c’ classcode is assigned to transcripts contained within a reference transcript, which are defined as partial assemblies. The complete (‘=’, red curve) and partial (‘c’, blue curve) transcripts assembled from the read alignments show FPKM distributions that are distinguishable from each other (∼29.67 vs ∼4.86).

Figure 5

Performance of FPKM in distinguishing between complete and partial transcripts. An assembled transcript is classified as a complete assembly (‘=’ classcode) if its FPKM is larger than a given threshold; otherwise it is classified as a partial assembly (‘c’ classcode). The blue ROC curve [39] represents the performance of FPKM in classifying the complete and partial transcripts. The corresponding area under the curve (AUC) is 0.7825.
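To make the threshold classification in the caption above concrete, here is a small Python sketch that evaluates FPKM as a classifier of complete versus partial assemblies. It is an illustration under stated assumptions rather than the authors' procedure: the function names are hypothetical, and the AUC is computed through its standard rank interpretation (the probability that a randomly chosen complete transcript has a higher FPKM than a randomly chosen partial one, with ties counted as one half), which may differ from how the published value was obtained.

```python
from typing import Sequence, Tuple


def rates_at_threshold(
    complete_fpkm: Sequence[float],
    partial_fpkm: Sequence[float],
    threshold: float,
) -> Tuple[float, float]:
    """Return (TPR, FPR) when 'FPKM > threshold' predicts a complete assembly."""
    tpr = sum(v > threshold for v in complete_fpkm) / len(complete_fpkm)
    fpr = sum(v > threshold for v in partial_fpkm) / len(partial_fpkm)
    return tpr, fpr


def roc_auc(complete_fpkm: Sequence[float], partial_fpkm: Sequence[float]) -> float:
    """Area under the ROC curve via pairwise comparison of the two groups."""
    wins = 0.0
    for c in complete_fpkm:
        for p in partial_fpkm:
            if c > p:
                wins += 1.0      # complete transcript ranked above partial one
            elif c == p:
                wins += 0.5      # ties contribute half a win
    return wins / (len(complete_fpkm) * len(partial_fpkm))
```

Sweeping `threshold` over the observed FPKM values and plotting the resulting (FPR, TPR) pairs traces the ROC curve; the pairwise loop is quadratic in the number of transcripts, so a rank-based implementation would be preferable for large assemblies.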

Figure 6

Comparison between novel lncRNAs and NONCODE lncRNAs. There are 36991 lncRNAs annotated by NONCODE 3.0 and 308 lncRNAs predicted by our method. Of the 80 overlapping lncRNAs (25.97% of our predictions), 5 are exactly annotated in NONCODE 3.0.

Figure 7

Comparisons of transcript length, exon number and ORF length. (a) Comparison of transcript length. The novel lncRNAs are shorter on average (∼1.2 kb) than either RefSeq protein-coding (∼3.1 kb) or non-coding transcripts (∼1.9 kb); (b) Comparison of exon number. The lncRNAs have fewer exons on average (∼2.8) than the other two categories of transcripts (∼10.0 and ∼3.3, respectively); (c) Comparison of ORF length. The novel lncRNAs have a shorter putative ORF on average (∼0.17 kb) than either of the two RefSeq gene categories (∼1.6 kb and ∼0.3 kb, respectively). All means are marked by red points.

Figure 8

Comparison of expression level between protein-coding transcripts and novel lncRNAs. (a) In the WT condition, the protein-coding transcripts (∼50.92) show a slightly higher expression level than the novel lncRNAs (∼44.54), but a significantly higher expression level than the lincRNAs (∼11.29) extracted from the lncRNAs; (b) In the Klf1 KO condition, the protein-coding transcripts (∼37.63) also show a slightly higher expression level than the lncRNAs (∼34.06), but a significantly higher expression level than the lincRNAs (∼9.6). In addition, the protein-coding transcripts and the novel lncRNAs have similar median expression in both the WT (10.29 vs 9.509) and Klf1 KO (9.421 vs 7.722) conditions. All means are marked by red points.

Figure 9

Differential expression of transcripts between WT and Klf1 KO. The three volcano plots illustrate the differential expression (DE) between the WT and Klf1 KO samples at the gene or transcript level: (a) DE of all genes. At the gene level, Klf1 appears to act globally as an activator, since more genes are significantly repressed (334, red points over the positive x-axis) than activated (250, red points over the negative x-axis) after Klf1 is knocked out; (b) DE of all transcripts. At the transcript/isoform level, Klf1 also behaves like an activator, since more transcripts are significantly repressed (262) than activated (147) after Klf1 is knocked out; (c) DE of the novel lncRNAs. Among the 13 significantly DE lncRNA transcripts, Klf1 again acts like an activator, since 10 lncRNAs are repressed and 3 are activated after Klf1 is knocked out. The significantly DE transcripts are all represented by red points.
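The repressed and activated counts quoted in the caption above amount to a tally of the significant points on each side of the volcano plot's x-axis. A minimal Python sketch of that tally follows. The input layout (pairs of x-value and significance flag) and the function name are assumptions made for illustration and do not reflect the cuffdiff output format; the sign convention (positive x means repressed after knockout) is taken from the caption.

```python
from typing import Iterable, Tuple


def count_de_directions(records: Iterable[Tuple[float, bool]]) -> Tuple[int, int]:
    """Count significant features by side of the volcano plot.

    Each record is (x_value, is_significant). Following the caption,
    significant points with x > 0 are counted as repressed after knockout
    and those with x < 0 as activated. Returns (repressed, activated).
    """
    repressed = 0
    activated = 0
    for x_value, is_significant in records:
        if not is_significant:
            continue            # non-significant points are not counted
        if x_value > 0:
            repressed += 1
        elif x_value < 0:
            activated += 1
    return repressed, activated
```

Applied to the novel lncRNAs, a tally of this kind would yield the (10, 3) split reported in panel (c).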

References

    1. Mercer TR, Dinger ME, Mattick JS. Long non-coding RNAs: insights into functions. Nat Rev Genet. 2009;10(3):155–159. doi: 10.1038/nrg2521.
    2. Amaral PP, Dinger ME, Mercer TR, Mattick JS. The Eukaryotic Genome as an RNA Machine. Science. 2008;319(5871):1787–1789. doi: 10.1126/science.1155472.
    3. Baker M. Long noncoding RNAs: the search for function. Nat Meth. 2011;8(5):379–383. doi: 10.1038/nmeth0511-379.
    4. Kapranov P, Cheng J, Dike S, Nix DA, Duttagupta R, Willingham AT, Stadler PF, Hertel J, Hackermüller J, Hofacker IL, Bell I, Cheung E, Drenkow J, Dumais E, Patel S, Helt G, Ganesh M, Ghosh S, Piccolboni A, Sementchenko V, Tammana H, Gingeras TR. RNA Maps Reveal New RNA Classes and a Possible Function for Pervasive Transcription. Science. 2007;316(5830):1484–1488. doi: 10.1126/science.1138341.
    5. Bertone P, Stolc V, Royce TE, Rozowsky JS, Urban AE, Zhu X, Rinn JL, Tongprasit W, Samanta M, Weissman S, Gerstein M, Snyder M. Global Identification of Human Transcribed Sequences with Genome Tiling Arrays. Science. 2004;306(5705):2242–2246. doi: 10.1126/science.1103388.
