Bioconductor Code: tidySingleCellExperiment (original) (raw)
tidySingleCellExperiment - part of tidytranscriptomics ================ [\](https://www.tidyverse.org/lifecycle/#maturing) [\](https://github.com/stemangiola/tidySingleCellExperiment/actions) **Brings SingleCellExperiment to the tidyverse!** Website: [tidySingleCellExperiment](https://stemangiola.github.io/tidySingleCellExperiment/articles/introduction.html) Please also have a look at - [tidySummarizedExperiment](https://stemangiola.github.io/tidySummarizedExperiment/) for tidy manipulation of SummarizedExperiment objects) - [tidyseurat](https://stemangiola.github.io/tidyseurat/) for tidy manipulation of Seurat objects - [tidybulk](https://stemangiola.github.io/tidybulk/) for tidy bulk RNA-seq data analysis - [tidygate](https://github.com/stemangiola/tidygate) for adding custom gate information to your tibble - [tidyHeatmap](https://stemangiola.github.io/tidyHeatmap/) for heatmaps produced with tidy principles # Introduction tidySingleCellExperiment provides a bridge between Bioconductor single-cell packages \[@amezquita2019orchestrating\] and the tidyverse \[@wickham2019welcome\]. It enables viewing the Bioconductor *SingleCellExperiment* object as a tidyverse tibble, and provides SingleCellExperiment-compatible *dplyr*, *tidyr*, *ggplot* and *plotly* functions. This allows users to get the best of both Bioconductor and tidyverse worlds. ## Functions/utilities available | SingleCellExperiment-compatible Functions | Description | |-------------------------------------------|------------------------------------------------------------------------------------| | `all` | After all `tidySingleCellExperiment` is a SingleCellExperiment object, just better | | tidyverse Packages | Description | |--------------------|----------------------------------------------------| | `dplyr` | All `dplyr` tibble functions (e.g. `select`) | | `tidyr` | All `tidyr` tibble functions (e.g. `pivot_longer`) | | `ggplot2` | `ggplot` (`ggplot`) | | `plotly` | `plot_ly` (`plot_ly`) | | Utilities | Description | |-------------------|------------------------------------------------------------------| | `as_tibble` | Convert cell-wise information to a `tbl_df` | | `join_features` | Add feature-wise information, returns a `tbl_df` | | `aggregate_cells` | Aggregate cell gene-transcription abundance as pseudobulk tissue | ## Installation ``` r if (!requireNamespace("BiocManager", quietly=TRUE)) install.packages("BiocManager") BiocManager::install("tidySingleCellExperiment") ``` Load libraries used in this vignette. ``` r # Bioconductor single-cell packages library(scater) library(scran) library(SingleR) library(SingleCellSignalR) # Tidyverse-compatible packages library(purrr) library(magrittr) library(tidyHeatmap) # Both library(tidySingleCellExperiment) ``` # Data representation of `tidySingleCellExperiment` This is a *SingleCellExperiment* object but it is evaluated as a tibble. So it is compatible both with SingleCellExperiment and tidyverse. ``` r data(pbmc_small, package="tidySingleCellExperiment") ``` **It looks like a tibble** ``` r pbmc_small ``` ## # A SingleCellExperiment-tibble abstraction: 80 × 17 ## # [90mFeatures=230 | Cells=80 | Assays=counts, logcounts[0m ## .cell orig.ident nCount_RNA nFeature_RNA RNA_snn_res.0.8 letter.idents groups ## ## 1 ATGC… SeuratPro… 70 47 0 A g2 ## 2 CATG… SeuratPro… 85 52 0 A g1 ## 3 GAAC… SeuratPro… 87 50 1 B g2 ## 4 TGAC… SeuratPro… 127 56 0 A g2 ## 5 AGTC… SeuratPro… 173 53 0 A g2 ## 6 TCTG… SeuratPro… 70 48 0 A g1 ## 7 TGGT… SeuratPro… 64 36 0 A g1 ## 8 GCAG… SeuratPro… 72 45 0 A g1 ## 9 GATA… SeuratPro… 52 36 0 A g1 ## 10 AATG… SeuratPro… 100 41 0 A g1 ## # ℹ 70 more rows ## # ℹ 10 more variables: RNA_snn_res.1 , file , ident , ## # PC_1 , PC_2 , PC_3 , PC_4 , PC_5 , tSNE_1 , ## # tSNE_2 **But it is a SingleCellExperiment object after all** ``` r assay(pbmc_small, "counts")[1:5, 1:5] ``` ## 5 x 5 sparse Matrix of class "dgCMatrix" ## ATGCCAGAACGACT CATGGCCTGTGCAT GAACCTGATGAACC TGACTGGATTCTCA ## MS4A1 . . . . ## CD79B 1 . . . ## CD79A . . . . ## HLA-DRA . 1 . . ## TCL1A . . . . ## AGTCAGACTGCACA ## MS4A1 . ## CD79B . ## CD79A . ## HLA-DRA 1 ## TCL1A . The `SingleCellExperiment` object’s tibble visualisation can be turned off, or back on at any time. ``` r # Turn off the tibble visualisation options("restore_SingleCellExperiment_show" = TRUE) pbmc_small ``` ## class: SingleCellExperiment ## dim: 230 80 ## metadata(0): ## assays(2): counts logcounts ## rownames(230): MS4A1 CD79B ... SPON2 S100B ## rowData names(5): vst.mean vst.variance vst.variance.expected ## vst.variance.standardized vst.variable ## colnames(80): ATGCCAGAACGACT CATGGCCTGTGCAT ... GGAACACTTCAGAC ## CTTGATTGATCTTC ## colData names(9): orig.ident nCount_RNA ... file ident ``` r # Turn on the tibble visualisation options("restore_SingleCellExperiment_show" = FALSE) ``` # Annotation polishing We may have a column that contains the directory each run was taken from, such as the “file” column in `pbmc_small`. ``` r pbmc_small$file[1:5] ``` ## [1] "../data/sample2/outs/filtered_feature_bc_matrix/" ## [2] "../data/sample1/outs/filtered_feature_bc_matrix/" ## [3] "../data/sample2/outs/filtered_feature_bc_matrix/" ## [4] "../data/sample2/outs/filtered_feature_bc_matrix/" ## [5] "../data/sample2/outs/filtered_feature_bc_matrix/" We may want to extract the run/sample name out of it into a separate column. Tidyverse `extract` can be used to convert a character column into multiple columns using regular expression groups. ``` r # Create sample column pbmc_small_polished <- pbmc_small |> extract(file, "sample", "../data/([a-z0-9]+)/outs.+", remove=FALSE) # Reorder to have sample column up front pbmc_small_polished |> select(sample, everything()) ``` ## # A SingleCellExperiment-tibble abstraction: 80 × 18 ## # [90mFeatures=230 | Cells=80 | Assays=counts, logcounts[0m ## .cell sample orig.ident nCount_RNA nFeature_RNA RNA_snn_res.0.8 letter.idents ## ## 1 ATGC… sampl… SeuratPro… 70 47 0 A ## 2 CATG… sampl… SeuratPro… 85 52 0 A ## 3 GAAC… sampl… SeuratPro… 87 50 1 B ## 4 TGAC… sampl… SeuratPro… 127 56 0 A ## 5 AGTC… sampl… SeuratPro… 173 53 0 A ## 6 TCTG… sampl… SeuratPro… 70 48 0 A ## 7 TGGT… sampl… SeuratPro… 64 36 0 A ## 8 GCAG… sampl… SeuratPro… 72 45 0 A ## 9 GATA… sampl… SeuratPro… 52 36 0 A ## 10 AATG… sampl… SeuratPro… 100 41 0 A ## # ℹ 70 more rows ## # ℹ 11 more variables: groups , RNA_snn_res.1 , file , ## # ident , PC_1 , PC_2 , PC_3 , PC_4 , PC_5 , ## # tSNE_1 , tSNE_2 # Preliminary plots Set colours and theme for plots. ``` r # Use colourblind-friendly colours friendly_cols <- dittoSeq::dittoColors() # Set theme custom_theme <- list( scale_fill_manual(values=friendly_cols), scale_color_manual(values=friendly_cols), theme_bw() + theme( panel.border=element_blank(), axis.line=element_line(), panel.grid.major=element_line(size=0.2), panel.grid.minor=element_line(size=0.1), text=element_text(size=12), legend.position="bottom", aspect.ratio=1, strip.background=element_blank(), axis.title.x=element_text(margin=margin(t=10, r=10, b=10, l=10)), axis.title.y=element_text(margin=margin(t=10, r=10, b=10, l=10)) ) ) ``` ## Warning: The `size` argument of `element_line()` is deprecated as of ggplot2 3.4.0. ## ℹ Please use the `linewidth` argument instead. ## This warning is displayed once every 8 hours. ## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was ## generated. We can treat `pbmc_small_polished` as a tibble for plotting. Here we plot number of features per cell. ``` r pbmc_small_polished |> ggplot(aes(nFeature_RNA, fill=groups)) + geom_histogram() + custom_theme ``` ## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.  Here we plot total features per cell. ``` r pbmc_small_polished |> ggplot(aes(groups, nCount_RNA, fill=groups)) + geom_boxplot(outlier.shape=NA) + geom_jitter(width=0.1) + custom_theme ```  Here we plot abundance of two features for each group. ``` r pbmc_small_polished |> join_features(features=c("HLA-DRA", "LYZ")) |> ggplot(aes(groups, .abundance_counts + 1, fill=groups)) + geom_boxplot(outlier.shape=NA) + geom_jitter(aes(size=nCount_RNA), alpha=0.5, width=0.2) + scale_y_log10() + custom_theme ``` ## tidySingleCellExperiment says: join_features produces duplicate cell names to accomadate the long data format. For this reason, a data frame is returned for independent data analysis. Assay feature abundance is appended as .abundance_counts and .abundance_logcounts.  # Preprocess the dataset We can also treat `pbmc_small_polished` as a *SingleCellExperiment* object and proceed with data processing with Bioconductor packages, such as *scran* \[@lun2016pooling\] and *scater* \[@mccarthy2017scater\]. ``` r # Identify variable genes with scran variable_genes <- pbmc_small_polished |> modelGeneVar() |> getTopHVGs(prop=0.1) # Perform PCA with scater pbmc_small_pca <- pbmc_small_polished |> runPCA(subset_row=variable_genes) ``` ## Warning in check_numbers(k = k, nu = nu, nv = nv, limit = min(dim(x)) - : more ## singular values/vectors requested than available ## Warning in (function (A, nv = 5, nu = nv, maxit = 1000, work = nv + 7, reorth = ## TRUE, : You're computing too large a percentage of total singular values, use a ## standard svd instead. ## Warning in (function (A, nv = 5, nu = nv, maxit = 1000, work = nv + 7, reorth = ## TRUE, : did not converge--results might be invalid!; try increasing work or ## maxit ``` r pbmc_small_pca ``` ## # A SingleCellExperiment-tibble abstraction: 80 × 18 ## # [90mFeatures=230 | Cells=80 | Assays=counts, logcounts[0m ## .cell orig.ident nCount_RNA nFeature_RNA RNA_snn_res.0.8 letter.idents groups ## ## 1 ATGC… SeuratPro… 70 47 0 A g2 ## 2 CATG… SeuratPro… 85 52 0 A g1 ## 3 GAAC… SeuratPro… 87 50 1 B g2 ## 4 TGAC… SeuratPro… 127 56 0 A g2 ## 5 AGTC… SeuratPro… 173 53 0 A g2 ## 6 TCTG… SeuratPro… 70 48 0 A g1 ## 7 TGGT… SeuratPro… 64 36 0 A g1 ## 8 GCAG… SeuratPro… 72 45 0 A g1 ## 9 GATA… SeuratPro… 52 36 0 A g1 ## 10 AATG… SeuratPro… 100 41 0 A g1 ## # ℹ 70 more rows ## # ℹ 11 more variables: RNA_snn_res.1 , file , sample , ## # ident , PC1 , PC2 , PC3 , PC4 , PC5 , ## # tSNE_1 , tSNE_2 If a tidyverse-compatible package is not included in the tidySingleCellExperiment collection, we can use `as_tibble` to permanently convert `tidySingleCellExperiment` into a tibble. ``` r # Create pairs plot with GGally pbmc_small_pca |> as_tibble() |> select(contains("PC"), everything()) |> GGally::ggpairs(columns=1:5, ggplot2::aes(colour=groups)) + custom_theme ``` ## Registered S3 method overwritten by 'GGally': ## method from ## +.gg ggplot2  # Identify clusters We can proceed with cluster identification with *scran*. ``` r pbmc_small_cluster <- pbmc_small_pca # Assign clusters to the 'colLabels' of the SingleCellExperiment object colLabels(pbmc_small_cluster) <- pbmc_small_pca |> buildSNNGraph(use.dimred="PCA") |> igraph::cluster_walktrap() %$% membership |> as.factor() ``` ## Warning in (function (to_check, X, clust_centers, clust_info, dtype, nn, : ## detected tied distances to neighbors, see ?'BiocNeighbors-ties' ``` r # Reorder columns pbmc_small_cluster |> select(label, everything()) ``` ## # A SingleCellExperiment-tibble abstraction: 80 × 19 ## # [90mFeatures=230 | Cells=80 | Assays=counts, logcounts[0m ## .cell label orig.ident nCount_RNA nFeature_RNA RNA_snn_res.0.8 letter.idents ## ## 1 ATGCC… 2 SeuratPro… 70 47 0 A ## 2 CATGG… 2 SeuratPro… 85 52 0 A ## 3 GAACC… 2 SeuratPro… 87 50 1 B ## 4 TGACT… 1 SeuratPro… 127 56 0 A ## 5 AGTCA… 2 SeuratPro… 173 53 0 A ## 6 TCTGA… 2 SeuratPro… 70 48 0 A ## 7 TGGTA… 1 SeuratPro… 64 36 0 A ## 8 GCAGC… 2 SeuratPro… 72 45 0 A ## 9 GATAT… 2 SeuratPro… 52 36 0 A ## 10 AATGT… 2 SeuratPro… 100 41 0 A ## # ℹ 70 more rows ## # ℹ 12 more variables: groups , RNA_snn_res.1 , file , ## # sample , ident , PC1 , PC2 , PC3 , PC4 , ## # PC5 , tSNE_1 , tSNE_2 And interrogate the output as if it was a regular tibble. ``` r # Count number of cells for each cluster per group pbmc_small_cluster |> count(groups, label) ``` ## tidySingleCellExperiment says: A data frame is returned for independent data analysis. ## # A tibble: 8 × 3 ## groups label n ## ## 1 g1 1 12 ## 2 g1 2 14 ## 3 g1 3 14 ## 4 g1 4 4 ## 5 g2 1 10 ## 6 g2 2 11 ## 7 g2 3 10 ## 8 g2 4 5 We can identify and visualise cluster markers combining SingleCellExperiment, tidyverse functions and tidyHeatmap \[@mangiola2020tidyheatmap\] ``` r # Identify top 10 markers per cluster marker_genes <- pbmc_small_cluster |> findMarkers(groups=pbmc_small_cluster$label) |> as.list() |> map(~ .x |> head(10) |> rownames()) |> unlist() # Plot heatmap pbmc_small_cluster |> join_features(features=marker_genes) |> group_by(label) |> heatmap(.feature, .cell, .abundance_counts, .scale="column") ``` ## tidySingleCellExperiment says: join_features produces duplicate cell names to accomadate the long data format. For this reason, a data frame is returned for independent data analysis. Assay feature abundance is appended as .abundance_counts and .abundance_logcounts. ## tidyHeatmap says: (once per session) from release 1.7.0 the scaling is set to "none" by default. Please use scale = "row", "column" or "both" to apply scaling ## Warning: The `.scale` argument of `heatmap()` is deprecated as of tidyHeatmap 1.7.0. ## ℹ Please use scale (without dot prefix) instead: heatmap(scale = ...) ## This warning is displayed once every 8 hours. ## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was ## generated.  # Reduce dimensions We can calculate the first 3 UMAP dimensions using the SingleCellExperiment framework and *scater*. ``` r pbmc_small_UMAP <- pbmc_small_cluster |> runUMAP(ncomponents=3) ``` And we can plot the result in 3D using plotly. ``` r pbmc_small_UMAP |> plot_ly( x=~`UMAP1`, y=~`UMAP2`, z=~`UMAP3`, color=~label, colors=friendly_cols[1:4] ) ```