SoupX removes ambient RNA contamination from droplet-based single-cell RNA sequencing data

Matthew D Young et al. Gigascience. 2020.

Abstract

Background: Droplet-based single-cell RNA sequence analyses assume that all acquired RNAs are endogenous to cells. However, any cell-free RNAs contained within the input solution are also captured by these assays. This sequencing of cell-free RNA constitutes a background contamination that confounds the biological interpretation of single-cell transcriptomic data.

Results: We demonstrate that contamination from this "soup" of cell-free RNAs is ubiquitous, with experiment-specific variations in composition and magnitude. We present a method, SoupX, for quantifying the extent of the contamination and estimating "background-corrected" cell expression profiles that seamlessly integrate with existing downstream analysis tools. Applying this method to several datasets using multiple droplet sequencing technologies, we demonstrate that its application improves biological interpretation of otherwise misleading data, as well as improving quality control metrics.

Conclusions: We present SoupX, a tool for removing ambient RNA contamination from droplet-based single-cell RNA sequencing experiments. This tool has broad applicability, and its application can improve the biological utility of existing and future datasets.

Keywords: decontamination; pre-processing; scRNA-seq.

© The Author(s) 2020. Published by Oxford University Press GigaScience.

Conflict of interest statement

The authors declare that they have no competing interests.

Figures

Figure 1:

A visual summary of the SoupX method, using data from the PBMC dataset.

Figure 2:

The properties of the cell-free mRNA soup as determined using species-mixing datasets. A, The log10 ratio of the number of UMIs mapping to human and mouse mRNAs for each droplet in the species-mixing dataset (10X). Droplets determined to contain cells by cellranger are marked in black. B, The correlation of the counts in the background compared to counts averaged across cells for each gene. Counts have been subsampled so that the total number of counts in the background and averaged cell population are the same. C, The estimated contamination fraction as a function of the number of UMIs in each droplet for individual cells in the species-mixing dataset. Red dots represent cells from the 10X experiment and blue dots cells from the DropSeq experiment. The distribution on the left shows the marginal distribution across all cells. D, The fractional change in contaminating and genuine expression levels after applying SoupX for the 2 technologies. The distribution across cells is summarized by box plots, where the central line is the median, box boundaries are the first and third quartiles, and the whiskers extend to 1.5 times the interquartile range.
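The subsampling described for panel B can be sketched as follows. This is a minimal illustration, not the authors' code: the function name `downsample_counts`, the toy count vectors, and the choice of sampling without replacement through NumPy's multivariate hypergeometric sampler are all assumptions made for the example.

```python
import numpy as np

def downsample_counts(counts, target_total, seed=0):
    """Randomly remove counts until the vector sums to target_total.

    Hypothetical helper: draws target_total counts without replacement
    from the per-gene integer counts.
    """
    counts = np.asarray(counts, dtype=np.int64)
    if target_total > counts.sum():
        raise ValueError("target_total exceeds the available counts")
    rng = np.random.default_rng(seed)
    return rng.multivariate_hypergeometric(counts, target_total)

# Toy per-gene counts (assumed values, three genes).
background = np.array([50, 30, 20])
cells = np.array([500, 300, 200])

# Bring the deeper profile down to the depth of the shallower one,
# so the two can be compared at equal total counts.
cells_sub = downsample_counts(cells, int(background.sum()))
```

After this step both vectors have the same total, so differences between them reflect composition rather than sequencing depth.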

Figure 3:

The PBMC dataset and how it changes when background correction is applied. A, A tSNE representation of the data, with cluster boundaries shown by density contours and shaded according to the cell type they represent. MNP: mononuclear phagocytes; NK: natural killer cells. B, The same tSNE representation, but cells are now coloured by their rate of expression of immunoglobulin (IG) genes compared to the rate at which IG is expressed in the background on a log10 scale. Positive values correspond to higher IG expression in a cell than in the background, with values significantly >0 only possible if the cell endogenously expresses IG. The density contours of the clusters with no cell that endogenously expresses IG (as determined by a Poisson test) are marked in boldface and used to estimate the global contamination ratio. C, The fraction of cells shared between clusters determined with the same parameters before and after application of SoupX. D, The improvement in marker specificity following application of SoupX. All genes that are markers of a cluster either before or after correction are identified and their expression log fold change (FC) relative to the clusters they do not mark is calculated before and after correction. The y-axis of this plot shows the fractional change in log FC after applying SoupX for all genes. Genes are grouped into bins for ease of representation, with the number of genes in each bin given by the colour scale. The marginal distribution across all genes is shown on the right and the dotted line corresponds to no change in marker specificity after correction. E, The improvement in marker sensitivity for the gene LYZ, which is a marker for mononuclear phagocytes (MNPs). The corrected and uncorrected expression levels are shown split by cells labelled as MNPs and all others. F, This same change in expression shown on the tSNE map, where the colour scale represents the fraction of LYZ expression that has been removed by SoupX.
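The idea behind panel B, estimating a global contamination level from clusters that do not endogenously express IG genes, can be sketched roughly as follows. The sketch assumes that in such cells every IG count comes from the soup, so the contamination fraction is the observed IG count divided by the count expected if the droplets held nothing but soup. The function name, argument names, and toy numbers are hypothetical, and the published method involves more than this (for example, the Poisson test used to choose the cells).

```python
import numpy as np

def estimate_contamination(cell_counts, soup_fraction, gene_idx, cell_idx):
    """Simplified contamination estimate (illustrative assumption).

    cell_counts   : genes x cells array of UMI counts
    soup_fraction : per-gene fraction of background expression (sums to 1)
    gene_idx      : indices of genes assumed absent from the chosen cells
    cell_idx      : indices of cells assumed not to express those genes
    """
    counts = np.asarray(cell_counts, dtype=float)
    soup = np.asarray(soup_fraction, dtype=float)
    # Counts of the marker genes actually seen in the chosen cells.
    observed = counts[np.ix_(gene_idx, cell_idx)].sum()
    # Counts expected if those droplets contained only soup.
    total_umis = counts[:, cell_idx].sum()
    expected_if_all_soup = total_umis * soup[gene_idx].sum()
    return observed / expected_if_all_soup

# Toy data: gene 0 stands in for the IG genes; cells 0 and 1 are
# taken to be non-expressing, cell 2 is a genuine expresser.
counts = np.array([
    [5, 10, 400, 2],
    [60, 50, 300, 50],
    [35, 40, 300, 48],
])
soup_fraction = np.array([0.5, 0.3, 0.2])

rho = estimate_contamination(counts, soup_fraction, gene_idx=[0], cell_idx=[0, 1])
```

With these toy numbers, 15 IG counts are observed where 100 would be expected from pure soup, giving an estimate of 0.15.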

Figure 4:

The application of SoupX to complex, multi-channel data. A, A tSNE representation of the data, with cluster boundaries shown by density contours and shaded according to the cell type they represent. ccRCC: clear-cell renal cell carcinoma cells; pRCC: papillary cell renal cell carcinoma cells; RBC: red blood cells; MNP: mononuclear phagocytes. B, The fraction of cells shared between clusters determined with the same parameters before and after application of SoupX. C, The improvement in marker sensitivity for the gene HBB, which is a marker for red blood cells. The colour scale represents the fraction of HBB expression that has been removed by SoupX. D, Same as C but for COL1A1. E, The cross-batch entropy before and after SoupX has been applied. The entropy measures the level of local mixing (100 nearest neighbours) for 100 cells selected from each cluster [20]. F, The distribution of HBB expression (y-axis, log scale) in the fetal liver data by cell type (x-axis), with the erythroid lineage marked in boldface. For each cell type, the expression distribution is shown before (right) and after (left) application of SoupX. Dots represent individual cells and box plots show the distribution of expression values where the central line is the median, box boundaries are the first and third quartiles, and the whiskers extend to 1.5 times the interquartile range.
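The cross-batch entropy in panel E can be illustrated with a small sketch. It assumes the measure is the Shannon entropy of batch labels among each query cell's k nearest neighbours in some embedding; the exact definition used in the paper is the one given in reference [20] and may differ. The function name, the brute-force neighbour search, and the toy data are assumptions.

```python
import numpy as np

def batch_entropy(embedding, batches, query_idx, k=100):
    """Shannon entropy of batch labels among the k nearest neighbours
    of each query cell (illustrative definition, natural log)."""
    X = np.asarray(embedding, dtype=float)
    labels, codes = np.unique(np.asarray(batches), return_inverse=True)
    entropies = []
    for i in query_idx:
        dist = np.linalg.norm(X - X[i], axis=1)
        dist[i] = np.inf  # a cell is not its own neighbour
        neighbours = np.argsort(dist)[:k]
        p = np.bincount(codes[neighbours], minlength=len(labels)) / len(neighbours)
        p = p[p > 0]  # treat 0 * log(0) as 0
        entropies.append(-(p * np.log(p)).sum())
    return np.array(entropies)

# Toy 1-D embeddings: interleaved batches versus fully separated batches.
mixed_xy = np.array([[0.0], [1.0], [2.0], [3.0], [4.0], [5.0]])
mixed_batches = ["A", "B", "A", "B", "A", "B"]
separated_xy = np.array([[0.0], [1.0], [2.0], [10.0], [11.0], [12.0]])
separated_batches = ["A", "A", "A", "B", "B", "B"]

mixed = batch_entropy(mixed_xy, mixed_batches, query_idx=[2], k=4)
separated = batch_entropy(separated_xy, separated_batches, query_idx=[0], k=2)
```

Higher values mean a cell's neighbourhood draws evenly from several batches; a value of zero means all neighbours come from a single batch.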

References

    1. Zilionis R, Nainys J, Veres A, et al. Single-cell barcoding and sequencing using droplet microfluidics. Nat Protoc. 2016;12(1):44–73.
    2. Zheng GXY, Terry JM, Belgrader P, et al. Massively parallel digital transcriptional profiling of single cells. Nat Commun. 2017;8:14049.
    3. Hashimoto S, Tabuchi Y, Yurino H, et al. Comprehensive single-cell transcriptome analysis reveals heterogeneity in endometrioid adenocarcinoma tissues. Sci Rep. 2017;7(1):14225.
    4. Bach K, Pensa S, Grzelak M, et al. Differentiation dynamics of mammary epithelial cells revealed by single-cell RNA sequencing. Nat Commun. 2017;8(1):2128.
    5. Daniszewski M, Senabouth A, Nguyen QH, et al. Single cell RNA sequencing of stem cell-derived retinal ganglion cells. Sci Data. 2018;5:180013.
