A Beginner's Guide to Analysis of RNA Sequencing Data

Clarissa M Koch et al. Am J Respir Cell Mol Biol. 2018 Aug.

Abstract

Since the first publications coining the term RNA-seq (RNA sequencing) appeared in 2008, the number of publications containing RNA-seq data has grown exponentially, hitting an all-time high of 2,808 publications in 2016 (PubMed). With this wealth of RNA-seq data being generated, it is a challenge to extract maximal meaning from these datasets, and without the appropriate skills and background, there is risk of misinterpretation of these data. However, a general understanding of the principles underlying each step of RNA-seq data analysis allows investigators without a background in programming and bioinformatics to critically analyze their own datasets as well as published data. Our goals in the present review are to break down the steps of a typical RNA-seq analysis and to highlight the pitfalls and checkpoints along the way that are vital for bench scientists and biomedical researchers performing experiments that use RNA-seq.

Keywords: RNA sequencing; bioinformatics; data analysis; transcriptomics.

Figures

Figure 1.

Assessing inter- and intragroup variability. (A) Principal component (PC) analysis plot displaying all 12 samples along PC1 and PC2, which describe 68.1% and 20.3% of the variability, respectively, within the expression data set. PC analysis was applied to normalized (reads per kilobase of transcript per million mapped reads [RPKM]) and log-transformed count data. (B) Pearson’s correlation plot visualizing the correlation (r) values between samples. Scale bar represents the range of the correlation coefficients (r) displayed.
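The sample-to-sample comparison described in panel B can be sketched in a few lines of plain Python. This is only an illustration, not the authors' pipeline: the sample names and RPKM values are hypothetical, and the pseudocount of 1 added before the log2 transform is an assumption made here to avoid taking the logarithm of zero.

```python
import math

def log2_transform(values, pseudocount=1.0):
    """Return log2(x + pseudocount); the pseudocount is an assumption."""
    return [math.log2(v + pseudocount) for v in values]

def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)

def correlation_matrix(samples):
    """Pairwise Pearson r for a dict mapping sample name -> expression list."""
    names = list(samples)
    return {a: {b: pearson_r(samples[a], samples[b]) for b in names}
            for a in names}

# Hypothetical RPKM values for four genes in three samples.
rpkm_values = {
    "naive_1": [10.0, 0.5, 200.0, 35.0],
    "naive_2": [12.0, 0.4, 180.0, 30.0],
    "tx2h_1": [3.0, 8.0, 210.0, 90.0],
}
logged = {name: log2_transform(vals) for name, vals in rpkm_values.items()}
corr = correlation_matrix(logged)
```

In this toy input the two hypothetical Naive replicates correlate more strongly with each other than either does with the transplant sample, which is the kind of pattern a correlation plot is meant to make visible.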

Figure 2.

Determining a low count threshold. (A) The number of genes at a given reads per kilobase of transcript per million mapped reads (RPKM) value for each sample (bins = 120; bin size = 0.1). Inset box enlarged at right highlights a subsection of the figure that was used to define an RPKM cutoff of 1 (bin size = 0.1). (B–D) Scatterplots comparing the expression of individual genes between two samples for (B) most correlated samples within a group (r = 0.9989), (C) least correlated samples within a group (r = 0.978), and (D) least correlated samples within the data set (r = 0.9089). Data are plotted on a log2 scale.
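As a rough illustration of how an RPKM value and a cutoff of 1 might be applied, the sketch below uses the standard RPKM definition. The filtering rule (keep a gene if at least `min_samples` samples reach the cutoff) and the function names are assumptions for this example; the caption does not state the exact rule the authors used.

```python
def rpkm(read_count, gene_length_bp, total_mapped_reads):
    """Reads per kilobase of transcript per million mapped reads."""
    return read_count * 1e9 / (gene_length_bp * total_mapped_reads)

def passes_low_count_filter(rpkm_values, cutoff=1.0, min_samples=1):
    """Keep a gene if at least `min_samples` samples reach the cutoff.

    The `min_samples` rule is a hypothetical choice for illustration.
    """
    return sum(v >= cutoff for v in rpkm_values) >= min_samples

# Hypothetical gene: 1,000 reads, 2 kb transcript, 20 million mapped reads.
example_rpkm = rpkm(1000, 2000, 20_000_000)

# Hypothetical per-sample RPKM values for two genes.
low_gene_kept = passes_low_count_filter([0.2, 0.4, 0.1])
expressed_gene_kept = passes_low_count_filter([0.2, 1.5, 0.1])
```

A stricter variant would raise `min_samples`, for example to the number of replicates in the smallest group, so that a gene must be detectable across a whole group before it is kept.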

Figure 3.

The effect of group size and intragroup variance on ability to identify differentially expressed genes. MA plots showing average logarithmically transformed counts per million (CPM) versus the log2 fold change for pairwise comparisons between the Transplant 2H versus Naive (top row), Transplant 24H versus Naive (middle row), and Transplant 24H versus Transplant 2H (bottom row) groups. Pairwise comparisons were run using (A) all four replicates per group, (B) the two most correlated replicates, (C) the two least correlated replicates, or (D) randomized data in which two replicates from the Naive group and two replicates from the Transplant 2H group were combined into each group. Up- and downregulated differentially expressed genes with a false discovery rate less than 0.05 are shown in blue and red, respectively.
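The two axes of an MA plot can be computed by hand for a single gene, as in the minimal sketch below. It is a simplification: dedicated differential expression packages estimate these quantities with statistical models, whereas this example only averages counts per million (CPM) within each group. The pseudocount of 1 and all input numbers are assumptions.

```python
import math

def cpm(counts, library_size):
    """Convert raw read counts to counts per million for one sample."""
    return [c * 1e6 / library_size for c in counts]

def ma_point(group_a_cpm, group_b_cpm, pseudocount=1.0):
    """Return (A, M) for one gene: average log2 CPM and log2 fold change.

    M is log2(B / A), so positive values mean higher expression in group B.
    The pseudocount is an assumption that keeps the logarithm finite at zero.
    """
    mean_a = sum(group_a_cpm) / len(group_a_cpm) + pseudocount
    mean_b = sum(group_b_cpm) / len(group_b_cpm) + pseudocount
    m_value = math.log2(mean_b / mean_a)
    a_value = 0.5 * (math.log2(mean_a) + math.log2(mean_b))
    return a_value, m_value

# Hypothetical gene with CPM 7 in both Naive replicates and 31 in both
# Transplant 2H replicates.
a_value, m_value = ma_point([7.0, 7.0], [31.0, 31.0])
```

For the hypothetical gene above, the pseudocount turns the group means into 8 and 32, so the log2 fold change is 2 and the average log2 expression is 4.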

Figure 4.

Distribution of ANOVA P values for (A) all (n = 4), (B) most correlated (n = 2), and (C) least correlated (n = 2) replicates. P values were distributed into 100 bins between 0 and 1, with each bar representing a 0.01 increase.
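The binning described in this caption, and the false discovery rate threshold used in the neighboring figures, can be sketched as follows. The caption does not say which FDR procedure was used; the Benjamini-Hochberg adjustment shown here is a common choice and is an assumption, as are the example P values.

```python
def p_value_histogram(p_values, n_bins=100):
    """Count P values in equal-width bins on [0, 1] (bin width 1 / n_bins)."""
    counts = [0] * n_bins
    for p in p_values:
        # A P value of exactly 1.0 is placed in the last bin.
        index = min(int(p * n_bins), n_bins - 1)
        counts[index] += 1
    return counts

def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted P values in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest P value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        index = order[rank - 1]
        running_min = min(running_min, p_values[index] * m / rank)
        adjusted[index] = running_min
    return adjusted

# Hypothetical P values from four genes.
histogram = p_value_histogram([0.001, 0.009, 0.5, 1.0])
fdr = benjamini_hochberg([0.01, 0.04, 0.03, 0.005])
significant = [q < 0.05 for q in fdr]
```

A tall first bar in such a histogram, on top of an otherwise flat distribution, suggests that many genes carry a real signal; a flat histogram suggests that few or none do.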

Figure 5.

Effect of group size and intragroup variance on ability to identify gene clusters. Hierarchical clustering performed on differentially expressed genes defined by ANOVA with a false discovery rate less than 0.05. (A) Using all replicates per group, 7,166 genes were clustered. (B) Most and (C) least correlated samples resulted in input lists of 2,150 and 862 genes, respectively. The _z_-score scale bar represents relative expression ±2 SD from the mean.
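The z-score scale used in such heat maps can be illustrated for a single gene. This is a minimal sketch: it standardizes one gene's expression across samples using the sample standard deviation (an assumption, since the caption does not specify it) and clips the result to ±2 for display. The input values are hypothetical.

```python
import statistics

def gene_z_scores(expression):
    """Standardize one gene's expression values across samples."""
    mean = statistics.mean(expression)
    sd = statistics.stdev(expression)  # sample SD; an assumption
    if sd == 0:
        # A gene with no variation carries no relative-expression signal.
        return [0.0 for _ in expression]
    return [(value - mean) / sd for value in expression]

def clip(values, limit=2.0):
    """Limit values to the range [-limit, +limit] for a color scale."""
    return [max(-limit, min(limit, v)) for v in values]

# Hypothetical log-scale expression of one gene in three samples.
z = gene_z_scores([1.0, 2.0, 3.0])
clipped = clip([-3.5, 0.5, 2.7])
```

Because each gene is scaled separately, the heat map shows where a gene is high or low relative to its own mean rather than its absolute expression level.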

Figure 6.

_k_-Means clustering and Gene Ontology (GO) enrichment analysis using the top differentially expressed genes. _k_-Means clustering was performed on the data set containing all samples (n = 4/group), and the top GO process from each cluster is shown.
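To make the idea of k-means clustering concrete, the sketch below implements the basic algorithm (assign each gene to the nearest centroid, then recompute centroids) in plain Python. It is not the software used for the figure. The expression profiles, the choice of k = 2, and the fixed random seed are all assumptions for illustration, and the Gene Ontology enrichment step is not covered.

```python
import random

def squared_distance(a, b):
    """Squared Euclidean distance between two profiles."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def k_means(points, k, n_iter=50, seed=0):
    """Basic k-means; returns (labels, centroids)."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # start from k distinct input profiles
    labels = [0] * len(points)
    for _ in range(n_iter):
        # Assignment step: nearest centroid for every profile.
        labels = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        # Update step: each centroid becomes the mean of its members.
        new_centroids = []
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                new_centroids.append(
                    [sum(dim) / len(members) for dim in zip(*members)]
                )
            else:
                new_centroids.append(centroids[c])  # keep an empty cluster
        if new_centroids == centroids:
            break  # converged
        centroids = new_centroids
    return labels, centroids

# Hypothetical z-scored profiles (Naive, Transplant 2H, Transplant 24H):
# two genes peaking at 2H and two genes peaking at 24H.
profiles = [
    [-1.0, 1.0, 0.1],
    [-0.9, 1.1, 0.0],
    [-1.0, -0.8, 1.2],
    [-0.9, -1.0, 1.1],
]
labels, centroids = k_means(profiles, k=2)
```

With these toy profiles the early-peaking and late-peaking genes end up in separate clusters; each cluster's gene list could then be passed to an enrichment tool.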

Figure 7.

Individual gene analysis. RPKM expression values for the Cdk2, Il1b, and Ccl2 genes are shown for the datasets containing (A) all samples (n = 4/group), (B) most correlated replicates (n = 2/group), and (C) least correlated replicates (n = 2/group). Although all three genes were identified as differentially expressed genes (DEGs) from the full (n = 4) dataset in Figure 6, Ccl2 was not among the DEGs in the “most correlated” comparison, owing to an ANOVA false discovery rate greater than 0.05, and neither Il1b nor Ccl2 was a DEG in the “least correlated” comparison. Genes that were not DEGs in the designated dataset are displayed in gray.
