tidyomics
Get Started
tidyomics is an open project to develop and integrate software and documentation to enable a tidy data analysis framework for omics data objects. tidyomics enables the use of familiar tidyverse verbs (`select`, `filter`, `mutate`, etc.) to manipulate rich data objects in the Bioconductor ecosystem. Importantly, the data objects themselves are not modified: tidyomics provides a tidy interface to work on the native objects, leveraging existing Bioconductor classes and algorithms.
tidyomics is a set of R packages by an international group of developers.
tidyomics allows for code such as the following:
single_cell_data |>
  filter(Phase == "G1") |>
  ggplot(aes(UMAP_1, UMAP_2, color = score)) +
  geom_point()
(filter single cells in G1 phase and plot UMAP coordinates)
or
chip_seq_peaks |>
  filter(FDR < 0.01) |>
  join_overlap_inner(promoters) |>
  group_by(promoter_type) |>
  summarize(ave_score = mean(score))
(compute average score by the type of promoter overlap for significant peaks)
Installer
Core tidyomics packages can be installed and loaded with the _tidyomics_ package. See the following URL for details and instructions:
https://github.com/tidyomics/tidyomics
Below find links to:
- Key Tidyomics Packages
- Comparison to base R
- Comparison to Bioconductor
- Tutorials
- News
- Talks
- Tidyomics paper
- Getting Help
- Get Involved
Key Tidyomics Packages
Here we list the packages that provide a tidy data interface to manipulate native Bioconductor objects. The tidyomics project also involves other convenience packages listed below.
Package | Intro | GitHub | Description |
---|---|---|---|
tidySummarizedExperiment | Vignette | GitHub | Tidy manipulation of SummarizedExperiment objects |
tidySingleCellExperiment | Vignette | GitHub | Tidy manipulation of SingleCellExperiment objects |
tidySeurat | Vignette | GitHub | Tidy manipulation of Seurat objects |
tidySpatialExperiment | Vignette | GitHub | Tidy manipulation of SpatialExperiment objects |
tidytof | Vignette | GitHub | Tidy manipulation of high-dimensional cytometry data |
plyranges | Vignette | GitHub | Tidy manipulation of genomics ranges |
plyinteractions | Vignette | GitHub | Tidy manipulation of genomic interactions |
plyxp | Vignette | GitHub | Data-masking-based interface to experiment data |
Consult each package homepage for a description of recent changes.
Note that many of these packages have more than one vignette, which you can find by navigating the package main page.
Convenience packages
Package | Intro | GitHub | Description |
---|---|---|---|
tidybulk | Vignette | GitHub | Tidy bulk RNA-seq data analysis |
nullranges | Vignette | GitHub | Generation of null genomic range sets |
easylift | Vignette | GitHub | Perform genomic liftover |
Comparison to base R
As the tidyomics packages offer an interface to underlying R/Bioconductor function evaluations, operations carried out in tidyomics can also be performed with base R/Bioconductor. The benefit of the tidyomics approach is often in the readability, interpretability, and extensibility of code, gained by eliminating temporary variables, square bracket indexing (`[...,...]`), and control code (e.g. `for`, `if`/`else`, `apply`/`sapply`, etc.).
For example, a filtering and grouping operation on a SummarizedExperiment object `data` in tidyomics would look like:
data |>
  filter(score > 0) |>
  group_by(gene_class) |>
  summarize(mean_count = mean(counts))
In comparison, we can obtain the same with base R/Bioconductor, but with more variables and some control code:
subdata <- data[rowData(data)$score > 0, ]
gene_classes <- levels(rowData(subdata)$gene_class)
mean_count <- numeric(length(gene_classes))
for (i in seq_along(gene_classes)) {
  tmp_idx <- rowData(subdata)$gene_class == gene_classes[i]
  mean_count[i] <- mean(assay(subdata, "counts")[tmp_idx, ])
}
This can be improved a bit if you know some more base R functions. Here is a base R alternative making use of `subset` and `aggregate`, h/t Martin Morgan:
subdata <- subset(data, score > 0)
aggregate(
  as.vector(assay(subdata)),
  list(rep(rowData(subdata)$gene_class, ncol(subdata))),
  mean
)
Even so, the tidyomics version above (`filter`, `group_by`, `summarize`) is likely the easiest to read and to extend if the analyst wants to do additional operations, and is the easiest to pipe directly into a plot or a printed table.
To explore this example, you can define `data` as follows:
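library(SummarizedExperiment)

set.seed(5)
data <- SummarizedExperiment(
  # one assay named "counts": 10 features x 10 samples
  assays = list(counts =
    matrix(rnorm(100), 10, 10,
           dimnames = list(letters[1:10], letters[1:10]))
  ),
  # per-feature annotation used for filtering and grouping
  rowData = DataFrame(
    score = rnorm(10),
    gene_class = factor(rep(1:3, c(3, 3, 4)))
  )
)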
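As a sanity check on the two base R/Bioconductor variants shown in this section, the following sketch rebuilds the toy object and confirms that the explicit loop and the `aggregate` call produce the same per-class means. It assumes only that the SummarizedExperiment package is installed. The names `loop_means`, `agg`, and `same` are illustrative and are not part of tidyomics.

```r
library(SummarizedExperiment)

# Toy data, built the same way as the definition of `data` in this section.
set.seed(5)
data <- SummarizedExperiment(
  assays = list(counts = matrix(rnorm(100), 10, 10,
                                dimnames = list(letters[1:10], letters[1:10]))),
  rowData = DataFrame(score = rnorm(10),
                      gene_class = factor(rep(1:3, c(3, 3, 4))))
)

# Keep features with a positive score.
subdata <- data[rowData(data)$score > 0, ]
gene_class <- rowData(subdata)$gene_class
counts <- assay(subdata, "counts")

# Variant 1: explicit loop over gene classes.
loop_means <- setNames(numeric(nlevels(gene_class)), levels(gene_class))
for (cls in levels(gene_class)) {
  loop_means[cls] <- mean(counts[gene_class == cls, ])
}

# Variant 2: aggregate() over the flattened count matrix.
# as.vector() unrolls the matrix column by column, so the class labels
# are repeated once per sample to line up with the counts.
agg <- aggregate(
  as.vector(counts),
  list(gene_class = rep(gene_class, ncol(subdata))),
  mean
)

# aggregate() only reports classes that still have features after
# filtering, so the comparison is restricted to those classes.
same <- all.equal(unname(loop_means[as.character(agg$gene_class)]), agg$x)
same
```

One difference between the variants is worth keeping in mind: the loop yields `NaN` for a class left with no features after filtering, whereas `aggregate` simply omits that class.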
Comparison to Bioconductor
A key innovation in Bioconductor is the use of object-oriented programming and specific data structures. As described in Gentleman et al. 2004:

> An `exprSet` is a data structure that binds together array-based expression measurements with covariate and administrative data for a collection of [experiments]... [its] design facilitates a three-tier architecture for providing analysis tools for new microarray platforms: low-level data are bridged to high-level analysis manipulations via the `exprSet` structure.
In Bioconductor, rich, structured data about experiments is maintained throughout analyses by passing data objects from one method to another. For example, `estimateDispersions` adds dispersion information to the `rowData` slot of a `DESeqDataSet`, which is a subclass of `SummarizedExperiment` and therefore inherits the structure and methods of that class. The structure of the data is preserved after running the function (like many Bioconductor methods, it is an _endomorphic_ function).
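To make the idea of an endomorphic function concrete, here is a minimal sketch, assuming only that the SummarizedExperiment package is installed. The helper `add_feature_mean` is hypothetical (it is not part of Bioconductor or tidyomics). It stores a per-feature summary in `rowData` and returns an object of the same class and dimensions as its input.

```r
library(SummarizedExperiment)

# Hypothetical helper, for illustration only: adds a per-feature mean
# to rowData and returns the modified object, keeping class and shape.
add_feature_mean <- function(se) {
  rowData(se)$feature_mean <- rowMeans(assay(se, "counts"))
  se
}

# Small toy object: 3 features x 2 samples.
se <- SummarizedExperiment(
  assays = list(counts = matrix(
    1:6, nrow = 3,
    dimnames = list(paste0("g", 1:3), paste0("s", 1:2))
  ))
)

se2 <- add_feature_mean(se)

identical(class(se2), class(se))  # same class going in and coming out
identical(dim(se2), dim(se))      # same features and samples
colnames(rowData(se2))            # the new annotation travels with the object
```

Because the output has the same class as the input, such functions can be chained, which is the property that tidyomics relies on when it offers a tidy interface to native objects.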
The goal of tidyomics is to preserve the object-oriented programming style and structure of Bioconductor data objects, while allowing users to manipulate these data objects with expressive commands familiar to tidyverse users.
Tidyomics aims to allow users to flexibly explore and plot biological datasets, by combining simple functions with human-readable names in a modular fashion to perform complex operations, including grouping and summarization tasks. Operations should still be performed with comparable efficiency to the underlying base R/Bioconductor code.
Tutorials
- BioC workshop covering single cell transcriptomics and genomics: Tidy single-cell analyses
- BioC workshop covering genomic ranges and interactions: Investigating chromatin composition and architecture
- Online book covering tidy manipulation of GRanges and more: Tidy ranges tutorial
- Quarto lecture notes introducing the concepts of tidyomics for expression and ranges: Tidy intro talk
- Short tutorial showing overlaps of GWAS SNPs with scATAC-seq peaks: T1D GWAS SNPs and CD4+ peaks
- Workflow showing RNA-seq and ATAC-seq integration with plyranges: Fluent genomics workflow
- Bulk RNA-seq tutorial contributed by Maria Doyle: RNAseq-R-tidyverse 2022
- More to come...
News
Talks
Tidyomics paper
Getting Help
We value community feedback and collaboration, and are happy to help you get started. Join the ongoing discussion, or you can ask specific questions about code on the support site.
- Join our Slack channel, #tidiness_in_bioc, for general discussion or pointers
- For specific coding help, you can reach out on the Bioconductor support site
Get Involved
The tidyomics organization is open to new members and contributions; it is an effort of many developers in the Bioconductor community and beyond.
- See our tidyomics open challenges project to find out what we are currently working on
- Issues tagged with good first issue are those that developers think would be good for a new developer to start working on
- Read over our Guidelines for contributing
- Read over our Code of Conduct
- As with new users, we encourage new developers to join our Slack channel, #tidiness_in_bioc. Most of the tidyomics developers are active there, and we are happy to talk through updates and PRs, or to give guidance on your development of a new package in this space.