A step-by-step workflow for low-level analysis of single-cell RNA-seq data with Bioconductor - PubMed (original) (raw)

A step-by-step workflow for low-level analysis of single-cell RNA-seq data with Bioconductor

Aaron T L Lun et al. F1000Res. 2016.

Abstract

Single-cell RNA sequencing (scRNA-seq) is widely used to profile the transcriptome of individual cells. This provides biological resolution that cannot be matched by bulk RNA sequencing, at the cost of increased technical noise and data complexity. The differences between scRNA-seq and bulk RNA-seq data mean that the analysis of the former cannot be performed by recycling bioinformatics pipelines for the latter. Rather, dedicated single-cell methods are required at various steps to exploit the cellular resolution while accounting for technical noise. This article describes a computational workflow for low-level analyses of scRNA-seq data, based primarily on software packages from the open-source Bioconductor project. It covers basic steps including quality control, data exploration and normalization, as well as more complex procedures such as cell cycle phase assignment, identification of highly variable and correlated genes, clustering into subpopulations and marker gene detection. Analyses were demonstrated on gene-level count data from several publicly available datasets involving haematopoietic stem cells, brain-derived cells, T-helper cells and mouse embryonic stem cells. This will provide a range of usage scenarios from which readers can construct their own analysis pipelines.

Keywords: Bioconductor; RNA-seq; Single cell; bioinformatics; workflow.

PubMed Disclaimer

Conflict of interest statement

Competing interests: No competing interests were disclosed.

Figures

Figure 1.

Figure 1.. Histograms of library sizes (left) and number of expressed genes (right) for all cells in the HSC dataset.

Figure 2.

Figure 2.. Histogram of the proportion of reads mapped to mitochondrial genes (left) or spike-in transcripts (right) across all cells in the HSC dataset.

Figure 3.

Figure 3.. PCA plot for cells in the HSC dataset, constructed using quality metrics.

The first and second components are shown on each axis, along with the percentage of total variance explained by each component. Bars represent the coordinates of the cells on each axis.

Figure 4.

Figure 4.. Cell cycle phase scores from applying the pair-based classifier on the HSC dataset, where each point represents a cell.

Figure 5.

Figure 5.. Histogram of log-average counts for all genes in the HSC dataset.

The filter threshold is represented by the blue line.

Figure 6.

Figure 6.. Percentage of total counts assigned to the top 50 most highly-abundant features in the HSC dataset.

For each feature, each bar represents the percentage assigned to that feature for a single cell, while the circle represents the average across all cells. Bars are coloured by the total number of expressed features in each cell, while circles are coloured according to whether the feature is labelled as a control feature.

Figure 7.

Figure 7.. Number of expressing cells against the log-mean expression for each gene in the HSC dataset.

Spike-in transcripts are highlighted in red.

Figure 8.

Figure 8.. Size factors from deconvolution, plotted against library sizes for all cells in the HSC dataset.

Axes are shown on a log-scale.

Figure 9.

Figure 9.. Density plot of the percentage of variance explained by the (log-transformed) total spike-in counts across all genes in the HSC dataset.

For each gene, the percentage of the variance of the normalized log-expression values across cells that is explained by each factor is calculated. Each curve corresponds to one factor and represents the distribution of percentages across all genes.

Figure 10.

Figure 10.. Variance of normalized log-expression values for each gene in the HSC dataset, plotted against the mean log-expression.

The blue line represents the mean-dependent trend fitted to the variances of the endogenous genes. Variance estimates for spike-in transcripts are highlighted in red.

Figure 11.

Figure 11.. Violin plots of normalized log-expression values for the top 10 HVGs in the HSC dataset.

Each point represents the log-expression value in a single cell.

Figure 12.

Figure 12.. Heatmap of mean-centred normalized log-expression values for correlated HVGs in the HSC dataset.

Dendrograms are formed by hierarchical clustering on the Euclidean distances between genes (row) or cells (column).

Figure 13.

Figure 13.. PCA plot constructed from normalized log-expression values of correlated HVGs, where each point represents a cell in the HSC dataset.

First and second components are shown, along with the percentage of variance explained. Bars represent the coordinates of the cells on each axis. Each cell is coloured according to its total number of expressed features.

Figure 14.

Figure 14.. _t_-SNE plots constructed from normalized log-expression values of correlated HVGs, using a range of perplexity values.

In each plot, each point represents a cell in the HSC dataset. Bars represent the coordinates of the cells on each axis. Each cell is coloured according to its total number of expressed features.

Figure 15.

Figure 15.. Histograms of library sizes (left) and number of expressed genes (right) for all cells in the brain dataset.

Figure 16.

Figure 16.. Histogram of the proportion of UMIs assigned to mitochondrial genes (left) or spike-in transcripts (right) across all cells in the brain dataset.

Figure 17.

Figure 17.. Cell cycle phase scores from applying the pair-based classifier on the brain dataset, where each point represents a cell.

Figure 18.

Figure 18.. Histogram of log-average counts for all genes in the brain dataset.

The filter threshold is represented by the blue line.

Figure 19.

Figure 19.. Size factors from deconvolution, plotted against library sizes for all cells in the brain dataset.

Axes are shown on a log-scale.

Figure 20.

Figure 20.. Density plot of the percentage of variance explained by each factor across all genes in the brain dataset.

For each gene, the percentage of the variance of the normalized log-expression values that is explained by the (log-transformed) total spike-in counts, the sex or age of the mouse, or the tissue of origin is calculated. Each curve corresponds to one factor and represents the distribution of percentages across all genes.

Figure 21.

Figure 21.. Variance of normalized log-expression values for each gene in the brain dataset, plotted against the mean log-expression.

The red line represents the mean-dependent trend in the technical variance of the spike-in transcripts (also highlighted as red points).

Figure 22.

Figure 22.. Violin plots of normalized log-expression values for the top 10 HVGs in the brain dataset.

For each gene, each point represents the log-expression value for an individual cell.

Figure 23.

Figure 23.. _t_-SNE plots constructed from the normalized and corrected log-expression values of correlated HVGs for cells in the brain dataset.

Each point represents a cell and is coloured according to its expression of the top HVG (left) or_Mog_ (right).

Figure 24.

Figure 24.. PCA plots constructed from the normalized and corrected log-expression values of correlated HVGs for cells in the brain dataset.

Each point represents a cell and is coloured according to its expression of the top HVG (left) or_Mog_ (right).

Figure 25.

Figure 25.. Heatmap of mean-centred normalized and corrected log-expression values for correlated HVGs in the brain dataset.

Dendrograms are formed by hierarchical clustering on the Euclidean distances between genes (row) or cells (column). Column colours represent the cluster to which each cell is assigned after a dynamic tree cut.

Figure 26.

Figure 26.. Heatmap of mean-centred normalized and corrected log-expression values for the top set of markers for cluster 1 in the brain dataset.

Column colours represent the cluster to which each cell is assigned, as indicated by the legend.

Figure 27.

Figure 27.. Size factors from spike-in normalization, plotted against the size factors from deconvolution for all cells in the mESC/MEF dataset.

Axes are shown on a log-scale, and cells are coloured according to their identity. Deconvolution size factors were computed with small pool sizes owing to the low number of cells of each type.

Figure 28.

Figure 28.. Cell cycle phase scores from applying the pair-based classifier on the T H2 dataset, where each point represents a cell.

Figure 29.

Figure 29.. PCA plots before (left) and after (right) removal of the cell cycle effect in the T H2 dataset.

Each cell is represented by a point with colour and size determined by the G1 and G2/M scores, respectively. Only HVGs were used to construct each plot.

Figure 30.

Figure 30.. A diffusion map for the T H2 dataset, where each cell is coloured by its expression of Gata3.

Similar articles

Cited by

References

    1. Anders S, Huber W: Differential expression analysis for sequence count data. Genome Biol. 2010;11(10):R106. 10.1186/gb-2010-11-10-r106 - DOI - PMC - PubMed
    1. Angerer P, Haghverdi L, Büttner M, et al. : destiny: diffusion maps for large-scale single-cell data in R. Bioinformatics. 2016;32(8):1241–1243. 10.1093/bioinformatics/btv715 - DOI - PubMed
    1. Bertoli C, Skotheim JM, de Bruin RA: Control of cell cycle transcription during G1 and S phases. Nat Rev Mol Cell Biol. 2013;14(8):518–528. 10.1038/nrm3629 - DOI - PMC - PubMed
    1. Bourgon R, Gentleman R, Huber W: Independent filtering increases detection power for high-throughput experiments. Proc Natl Acad Sci U S A. 2010;107(21):9546–9551. 10.1073/pnas.0914005107 - DOI - PMC - PubMed
    1. Bray NL, Pimentel H, Melsted P, et al. : Near-optimal probabilistic RNA-seq quantification. Nat Biotechnol. 2016;34(5):525–527. 10.1038/nbt.3519 - DOI - PubMed

LinkOut - more resources