tidyomics

Get Started

tidyomics is an open project to develop and integrate software and documentation to enable a tidy data analysis framework for omics data objects. tidyomics enables the use of familiar tidyverse verbs (select, filter, mutate, etc.) to manipulate rich data objects in the Bioconductor ecosystem. Importantly, the data objects are not modified; instead, tidyomics provides a tidy interface to work on the native objects, leveraging existing Bioconductor classes and algorithms.

tidyomics is a set of R packages by an international group of developers.

tidyomics allows for code such as the following:

single_cell_data |>
  filter(Phase == "G1") |>
  ggplot(aes(UMAP_1, UMAP_2, color=score)) + 
  geom_point()

(filter single cells in G1 phase and plot UMAP coordinates)

chip_seq_peaks |>
  filter(FDR < 0.01) |>
  join_overlap_inner(promoters) |>
  group_by(promoter_type) |>
  summarize(ave_score = mean(score))

(compute average score by the type of promoter overlap for significant peaks)

Installer

Core tidyomics packages can be installed and loaded with the tidyomics package. See the following URL for details and instructions:

https://github.com/tidyomics/tidyomics

Below find links to:

Key Tidyomics Packages
Comparison to base R
Comparison to Bioconductor
Tutorials
News
Talks
Tidyomics paper
Getting Help
Get Involved

Key Tidyomics Packages

Here we list the packages that provide a tidy data interface to manipulate native Bioconductor objects. The tidyomics project also involves other convenience packages listed below.

Package	Intro	GitHub	Description
tidySummarizedExperiment	Vignette	GitHub	Tidy manipulation of SummarizedExperiment objects
tidySingleCellExperiment	Vignette	GitHub	Tidy manipulation of SingleCellExperiment objects
tidySeurat	Vignette	GitHub	Tidy manipulation of Seurat objects
tidySpatialExperiment	Vignette	GitHub	Tidy manipulation of SpatialExperiment objects
tidytof	Vignette	GitHub	Tidy manipulation of high-dimensional cytometry data
plyranges	Vignette	GitHub	Tidy manipulation of genomics ranges
plyinteractions	Vignette	GitHub	Tidy manipulation of genomic interactions
plyxp	Vignette	GitHub	Data-masking-based interface to experiment data

Consult each package homepage for a description of recent changes.

Note that many of these packages have more than one vignette, which you can find by navigating the package main page.

Convenience packages

Package	Intro	GitHub	Description
tidybulk	Vignette	GitHub	Tidy bulk RNA-seq data analysis
nullranges	Vignette	GitHub	Generation of null genomic range sets
easylift	Vignette	GitHub	Perform genomic liftover

Comparison to base R

As the tidyomics packages offer an interface to underlying R/Bioconductor function evaluations, operations carried out in tidyomics can also be performed with base R/Bioconductor. The benefit of the tidyomics approach is often in readability, interpretability, and extensibility of code, gained through elimination of temporary variables, square bracket indexing ([...,...]) and control code (e.g. for, if/else, apply/sapply, etc.).

For example, a filtering and grouping operation on a SummarizedExperiment object in tidyomics would look like:

data |>
  filter(score > 0) |>
  group_by(gene_class) |>
  summarize(mean_count = mean(counts))

In comparison, we can obtain the same with base R/Bioconductor, but with more variables and some control code:

subdata <- data[rowData(data)$score > 0,]
gene_classes <- levels(rowData(subdata)$gene_class)
mean_count <- numeric(length(gene_classes))
for (i in seq_along(gene_classes)) {
  tmp_idx <- rowData(subdata)$gene_class == gene_classes[i]
  mean_count[i] <- mean(assay(subdata, "counts")[tmp_idx,])
}

This can be improved a bit if you know some more base R functions. Here is a base R alternative making use of subset and aggregate, h/t Martin Morgan:

subdata <- subset(data, score > 0)
aggregate(
  as.vector(assay(subdata)),
  list(rep(rowData(subdata)$gene_class, ncol(subdata))),
  mean
)

Even so, the tidyomics version above (filter, group_by, summarize) is likely the easiest to read and to extend if the analyst wants to add further operations, and it is the easiest to pipe directly into a plot or a printed table.

For exploring this example, you can define data as follows:

library(SummarizedExperiment)

set.seed(5)
data <- SummarizedExperiment(
  assays = list(counts =
    matrix(rnorm(100), 10, 10,
    dimnames = list(letters[1:10], letters[1:10]))
  ),
  rowData = DataFrame(
    score = rnorm(10),
    gene_class = factor(rep(1:3, c(3, 3, 4)))
  )
)
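
As a rough check that the approaches above agree, the sketch below rebuilds the example object, runs the base R loop, and then applies the same dplyr verbs to a hand-built long table with one row per feature-sample pair. The long table (`long`) and its column names `feature` and `sample` are assumptions made for this illustration only; the tidyomics packages provide the tidy interface on the native object directly, without this manual step.

```r
library(SummarizedExperiment)

# Same example object as defined above
set.seed(5)
data <- SummarizedExperiment(
  assays = list(counts = matrix(rnorm(100), 10, 10,
    dimnames = list(letters[1:10], letters[1:10]))),
  rowData = DataFrame(
    score = rnorm(10),
    gene_class = factor(rep(1:3, c(3, 3, 4)))
  )
)

# Base R/Bioconductor: filter features, then loop over gene classes
subdata <- data[rowData(data)$score > 0, ]
gene_classes <- levels(rowData(subdata)$gene_class)
mean_count <- numeric(length(gene_classes))
for (i in seq_along(gene_classes)) {
  tmp_idx <- rowData(subdata)$gene_class == gene_classes[i]
  mean_count[i] <- mean(assay(subdata, "counts")[tmp_idx, ])
}
names(mean_count) <- gene_classes

# Illustration only: a long table with one row per feature-sample pair.
# as.vector() unrolls the matrix column by column, so features cycle
# fastest and each sample name is repeated nrow(data) times.
long <- data.frame(
  feature = rep(rownames(data), times = ncol(data)),
  sample = rep(colnames(data), each = nrow(data)),
  counts = as.vector(assay(data, "counts")),
  score = rep(rowData(data)$score, times = ncol(data)),
  gene_class = rep(rowData(data)$gene_class, times = ncol(data))
)

# The same verbs as in the tidyomics example, applied to the long table
tidy_result <- long |>
  dplyr::filter(score > 0) |>
  dplyr::group_by(gene_class) |>
  dplyr::summarize(mean_count = mean(counts))

tidy_result
```

Note that the loop reports NaN for a gene class with no features left after filtering, whereas group_by and summarize drop such a class by default, so the two results are compared only on the classes that remain.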

Comparison to Bioconductor

A key innovation in Bioconductor is the use of object-oriented programming and specific data structures. As described in Gentleman et al. 2004,

An exprSet is a data structure that binds together array-based expression measurements with covariate and administrative data for a collection of [experiments]... [its] design facilitates a three-tier architecture for providing analysis tools for new microarray platforms: low-level data are bridged to high-level analysis manipulations via the exprSet structure.

In Bioconductor, rich, structured data about experiments is maintained throughout analyses by passing data objects from one method to another. For example, estimateDispersions adds dispersion information to the rowData slot of a DESeqDataSet, which is a subclass of SummarizedExperiment and therefore inherits the structure and methods of that class. The structure of the data is preserved after running the function (like many Bioconductor methods, it is an endomorphic function).
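
To make the idea of an endomorphic function concrete, here is a minimal sketch using only the SummarizedExperiment package. The helper `add_feature_means()` and the toy object `se` are hypothetical and exist only for this illustration; they are not part of DESeq2 or of any tidyomics package. The function takes a SummarizedExperiment, stores a new per-feature column in rowData, and returns an object of the same class and dimensions.

```r
library(SummarizedExperiment)

# Hypothetical helper (illustration only): add the per-feature mean of the
# "counts" assay as a new rowData column and return the updated object.
add_feature_means <- function(se) {
  rowData(se)$mean_count <- rowMeans(assay(se, "counts"))
  se
}

# Toy object: 5 features by 4 samples of simulated counts
set.seed(1)
se <- SummarizedExperiment(
  assays = list(counts = matrix(
    rpois(20, lambda = 10), nrow = 5,
    dimnames = list(paste0("gene", 1:5), paste0("sample", 1:4))
  ))
)

se2 <- add_feature_means(se)

class(se2)               # same class as the input
dim(se2)                 # same dimensions as the input
colnames(rowData(se2))   # now includes "mean_count"
```

Because the output has the same class as the input, it can be passed straight on to any other method that accepts a SummarizedExperiment.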

The goal of tidyomics is to preserve the object-oriented programming style and structure of Bioconductor data objects, while allowing users to manipulate these data objects with expressive commands familiar to tidyverse users.

Tidyomics aims to allow users to flexibly explore and plot biological datasets, by combining simple functions with human-readable names in a modular fashion to perform complex operations, including grouping and summarization tasks. Operations should still be performed with comparable efficiency to the underlying base R/Bioconductor code.

Tutorials

BioC workshop covering single cell transcriptomics and genomics: Tidy single-cell analyses
BioC workshop covering genomic ranges and interactions: Investigating chromatin composition and architecture
Online book covering tidy manipulation of GRanges and more: Tidy ranges tutorial
Quarto lecture notes introducing the concepts of tidyomics for expression and ranges: Tidy intro talk
Short tutorial showing overlaps of GWAS SNPs with scATAC-seq peaks: T1D GWAS SNPs and CD4+ peaks
Workflow showing RNA-seq and ATAC-seq integration with plyranges: Fluent genomics workflow
Bulk RNA-seq tutorial contributed by Maria Doyle: RNAseq-R-tidyverse 2022
More to come...

News

Tidyomics blog

Talks

Tidyomics paper

The tidyomics ecosystem: Enhancing omic data analyses

Getting Help

We value community feedback and collaboration, and are happy to help you get started. Join the ongoing discussion, or you can ask specific questions about code on the support site.

Join our Slack Channel, #tidiness_in_bioc, for general discussion or pointers
For specific coding help you can reach out on the Bioconductor support site

Get Involved

The tidyomics organization is open to new members and contributions; it is an effort of many developers in the Bioconductor community and beyond.

See our tidyomics open challenges project to see what we are currently working on
Issues tagged with good first issue are those that developers think would be good for a new developer to start working on
Read over our Guidelines for contributing
Read over our Code of Conduct
As with new users, for new developers please consider joining our Slack Channel, #tidiness_in_bioc. Most of the tidyomics developers are active there and we are happy to talk through updates, PRs, or give guidance on your development of a new package in this space.