MultiAssayExperiment API

Executive summary

The MultiAssayExperiment class can be used to manage results of diverse assays on a collection of samples. Currently the class can handle assays that are organized as instances of RangedSummarizedExperiment, ExpressionSet (legacy), matrix, RaggedExperiment (inherits from GRangesList), and RangedVcfStack (defined in the GenomicFiles package). Create new MultiAssayExperiment instances with the eponymous constructor, minimally with the argument ExperimentList, potentially also with the arguments colData and sampleMap.
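As a minimal sketch of the constructor call described above, the following builds an object from two toy matrices. The data, the names rna and prot, and the patient IDs are made up for illustration. The list of experiments is passed positionally, because the name of that first argument may differ between package versions.

```r
library(MultiAssayExperiment)

# Two toy assays measured on partially overlapping patients (made-up data).
rna <- matrix(1:6, nrow = 2,
              dimnames = list(c("geneA", "geneB"), c("p1", "p2", "p3")))
prot <- matrix(1:4, nrow = 2,
               dimnames = list(c("protX", "protY"), c("p2", "p3")))

# Minimal call: only the list of experiments is supplied.
mae_min <- MultiAssayExperiment(list(rna = rna, prot = prot))

# Fuller call: add per-patient metadata. The row names of colData are
# chosen here to match the assay column names so the map can be inferred.
pheno <- S4Vectors::DataFrame(age = c(61, 54, 70),
                              row.names = c("p1", "p2", "p3"))
mae <- MultiAssayExperiment(list(rna = rna, prot = prot), colData = pheno)

mae
```

When the assay column names do not match the row names of colData, an explicit sampleMap is needed to link them.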

Other data classes can be used in the MultiAssayExperiment, as long as they provide three methods: dimnames(), [i, j], and dim(). See the ExperimentList section for details on requirements for incorporating new data classes.

Note: For a brief visual summary of classes and methods involved in the package, please see the MultiAssayShiny package.

Note: For an essential overview of the methods from a user perspective, please see the MultiAssayExperiment cheat sheet.

Overview

The most important class exported by this package is the MultiAssayExperiment, for coordinated representation of multiple experiments on partially overlapping samples, with associated metadata at the level of the entire study and the level of the "biological unit". The biological unit may be a patient, plant, yeast strain, etc. This package is designed around the following hierarchy of information:

study (highest level). The study can encompass several different types of experiments performed on one set of biological units, for example cancer patients. A MultiAssayExperiment represents a whole study, containing:

metadata about the study as a whole
metadata about each biological unit: for example, age, grade, stage for cancer patients
results from a set of experiments performed on the biological units
a map for matching data from the experiments back to the corresponding biological units.

experiment. A set of assays of a single type performed on some or all of the biological units. An experiment may be performed on only a subset of the biological units, and may be performed in duplicate on some of them. For example, an experiment could be somatic mutation calls for some or all of the biological units.

Data from multiple experiments are stored in a list object called the ExperimentList, which provides flexibility for partially overlapping samples (column names) and features (row names), while keeping samples correctly matched to study-level metadata and to other experiments on the same samples.

Experiments may be ID-based, where measurements are indexed by identifiers of genes, microRNA, proteins, microbes, etc. Alternatively, experiments may be range-based, where measurements correspond to genomic ranges that can be represented as GRanges objects, such as gene expression or copy number. Note that for ID-based experiments, there is no requirement that the same IDs be present for different experiments. For range-based experiments, there is also no requirement that the same ranges be present for different experiments; furthermore, it is possible for different samples within an experiment to be represented by different ranges. Note however that even range-based features must be named, so that genomic features can be referred to by character IDs. The following data classes have so far been tested to work as elements of ExperimentList:

matrix: the most basic class for ID-based datasets, could be used for example for gene expression summarized per-gene, microRNA, metabolomics, or microbiome data.
SummarizedExperiment: A richer representation for ID-based datasets, could be used for the same types of data as matrix, but storing additional assay-level metadata.
RangedSummarizedExperiment: For rectangular range-based datasets, meaning that one set of genomic ranges is assayed for multiple samples. Could be used for gene expression, methylation, or other data types referring to genomic positions.
ExpressionSet: Another rich representation for ID-based datasets, supported only for legacy reasons as the SummarizedExperiment class already provides numerous improvements over the ExpressionSet structure.
RaggedExperiment: For non-rectangular (ragged) range-based datasets, meaning that a potentially different set of genomic ranges is assayed for each sample. A typical example would be segmented copy number, where segmentation of copy number alterations occurs at different genomic locations in each sample.
RangedVcfStack: For VCF archives broken up by chromosome (see the VcfStack class defined in the GenomicFiles package).
DelayedMatrix: An on-disk representation of matrix-like objects for large datasets. It reduces memory usage and optimizes performance with delayed operations. This class is part of the DelayedArray package.

samples (lowest level). An individual set of measurements performed on a single biological unit. These measurements must be indexed by character IDs; however, datasets may be ID-based (such as matrix or SummarizedExperiment) or range-based (such as RangedSummarizedExperiment). In the experimental datasets, columns refer to samples, and rows refer to genomic features that are represented by IDs or ranges.

`MultiAssayExperiment` class

Overview

The MultiAssayExperiment class is the main representation of multiple experiment data. It contains all information required to subset and match sample identifiers with clinical records.

Structure

ExperimentList - slot of class ExperimentList containing data for each experiment/assay
- contains "SimpleList" class from S4Vectors
- access using "experiments"
colData - slot of class DataFrame describing the clinical data available across all experiments
sampleMap - slot of class DataFrame of translatable identifiers of samples and participants
metadata - slot of any class providing additional information about the MultiAssayExperiment object
drops - slot of class list to keep a log of all residuals from subset operations
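The slots listed above are normally reached through accessor functions rather than directly. The sketch below uses a single toy matrix and made-up patient metadata, with the experiment list passed positionally to the constructor.

```r
library(MultiAssayExperiment)

# One toy assay plus made-up patient and study metadata.
rna <- matrix(1:6, nrow = 2,
              dimnames = list(c("geneA", "geneB"), c("p1", "p2", "p3")))
mae <- MultiAssayExperiment(
    list(rna = rna),
    colData = S4Vectors::DataFrame(age = c(61, 54, 70),
                                   row.names = c("p1", "p2", "p3")),
    metadata = list(note = "toy study")
)

experiments(mae)  # the ExperimentList holding each experiment
colData(mae)      # DataFrame with one row per biological unit
sampleMap(mae)    # DataFrame linking assay columns to biological units
metadata(mae)     # study-level metadata
```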

Validity

ExperimentList
1. ExperimentList length should be the same as the number of unique values in the sampleMap "assay" column.
2. Element names of the ExperimentList should be found in the sampleMap "assay" column.
3. For each ExperimentList element (say for an element named "assay X"), the colnames of that element must match the values found in the "colname" column of the sampleMap within the rows where "assay" equals the name of that ExperimentList element (in this example, "assay X"). The two sets are compared after sorting, so the order does not need to be the same.
colData
1. Ensure that this slot is of class DataFrame
sampleMap - validity checks include checks for consistency between the sampleMap and the colData primary (or phenotype) data slot
1. all names in the sampleMap "primary" column must be found in the rownames of the colData DataFrame.
2. Within rows of sampleMap corresponding to a single value in the "assay" column, there can be no duplicated values in the "colname" column.

Note. These validity checks only apply when at least an ExperimentList slot is provided at MultiAssayExperiment object creation.
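To make the rules above concrete, here is a small sketch with an explicit sampleMap in which one patient was assayed twice. All identifiers are hypothetical. The last lines restate the validity rules as plain R checks on the resulting object; they are illustrations, not the package's own validity code.

```r
library(MultiAssayExperiment)

# Toy assay whose column names differ from the patient IDs;
# patient "p1" has two replicates (made-up identifiers).
rna <- matrix(1:6, nrow = 2,
              dimnames = list(c("geneA", "geneB"),
                              c("p1.rep1", "p1.rep2", "p2.rep1")))
pheno <- S4Vectors::DataFrame(age = c(61, 54), row.names = c("p1", "p2"))

# One row per assay column: which assay, which patient, which column.
smap <- S4Vectors::DataFrame(
    assay   = factor(c("rna", "rna", "rna")),
    primary = c("p1", "p1", "p2"),
    colname = c("p1.rep1", "p1.rep2", "p2.rep1")
)

mae <- MultiAssayExperiment(list(rna = rna), colData = pheno,
                            sampleMap = smap)

sm <- sampleMap(mae)
# Every primary ID appears in the row names of colData.
all(sm$primary %in% rownames(colData(mae)))
# No duplicated colname within the single assay.
!anyDuplicated(sm$colname[sm$assay == "rna"])
# Assay column names and mapped colnames agree, ignoring order.
setequal(colnames(experiments(mae)[["rna"]]), sm$colname[sm$assay == "rna"])
```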

`updateObject` method and `API` changes

The updateObject method attempts to repair previously serialized instances of the MultiAssayExperiment so that it conforms with the updated API. It is advised to run updateObject on old instances of the MultiAssayExperiment and reserialize the object. This should be a one-time operation.

Recent changes to the API include changing the name of the workhorse container class from Elist to ExperimentList with an accessor function named experiments. Other changes include renaming and reordering the sampleMap columns from primary, assay, and assayname to assay (previously "assayname"), primary, and colname (previously "assay"), respectively.

`ExperimentList` class

Overview

The ExperimentList slot and class is the driver for the MultiAssayExperiment class as it contains necessary data from experiments and sample identifiers. The purpose of the ExperimentList is to store results from a set of experiments, as a SimpleList. Each element in the ExperimentList represents an experiment performed. All ExperimentList elements should be named.

Structure

ExperimentList - inherits from SimpleList with no additions. Contains separate validity checks and a show method.

Validity

ExperimentList elements
1. For data classes stored in each ExperimentList element, ensure that the methods [ (bracket), dimnames, and dim are available.
2. For each ExperimentList element, ensure that dimensions of non-zero length in each ExperimentList element have non-null colnames.
3. Ensure ExperimentList elements are appropriate for the API; warn when a DataFrame or data.frame is present.

Rationale

ExperimentList element requirements
1. The requirement of the methods [ (bracket), dimnames, and dim allows for predictable subsetting operations and metadata acquisition.
2. Standard subsetting by columns matches character vectors to the colnames, so any ExperimentList element with more than zero columns must have non-NULL colnames.
3. Rectangular objects that allow multiple data types and nested lists within their columns are discouraged and may interfere with data manipulation operations; therefore, matrix-based assays are preferred.

Any data class that provides the following methods can be used as an element of ExperimentList. RangedSummarizedExperiment provides the template behavior for ExperimentList elements, as follows. This is "template" behavior rather than an explicit requirement:

dimnames(), by returning a list of character vectors for sample and feature identifiers (genes, proteins, etc.)
[i, j], by returning the restriction of the instance to rows i and columns j
dim(), by returning integer vector of length two for the number of rows and columns
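As an illustration of the three methods, the sketch below defines a hypothetical S4 class, LabelledAssay, that wraps a matrix and implements dim(), dimnames(), and [i, j]. The class name and its slots are invented for this example and are not part of the package.

```r
library(methods)

# Hypothetical container: a matrix plus a free-text label.
setClass("LabelledAssay",
         representation(values = "matrix", label = "character"))

# dim(): integer vector of length two (rows, columns).
setMethod("dim", "LabelledAssay", function(x) dim(x@values))

# dimnames(): list of feature and sample identifiers.
setMethod("dimnames", "LabelledAssay", function(x) dimnames(x@values))

# [i, j]: restrict the instance to rows i and columns j,
# always returning an object of the same class.
setMethod("[", "LabelledAssay", function(x, i, j, ..., drop = FALSE) {
    if (missing(i)) i <- seq_len(nrow(x@values))
    if (missing(j)) j <- seq_len(ncol(x@values))
    x@values <- x@values[i, j, drop = FALSE]
    x
})

toy <- new("LabelledAssay",
           values = matrix(1:6, nrow = 2,
                           dimnames = list(c("f1", "f2"),
                                           c("s1", "s2", "s3"))),
           label = "toy assay")

dim(toy)
colnames(toy[, c("s1", "s3")])
```

An object of such a class has the methods that the ExperimentList validity checks look for, so in principle it could be supplied as an element of the list passed to the constructor.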

A Note on `RaggedExperiment`

The RaggedExperiment class is an extension of the GRangesList Bioconductor class. It is intended to handle segmented copy number data. The package aims to represent this type of data as a table where columns are samples and rows are ranges. Please see the RaggedExperiment package for more information.

Optional class checks for developers

`hasAssay` function

The standard assay functionality allows the user to obtain a numeric matrix of data. The current hasAssay function includes a "soft" check that ensures all classes in an existing MultiAssayExperiment class object have listed assay methods via the hasMethods function. For convenience, the argument passed to the hasAssay function can either be a MultiAssayExperiment or a list class object.

`hasRowRanges` function

This helper function checks to see whether any elements in the ExperimentList support the rowRanges method. This is important for future expansion of methods where operations involve genomic ranges. The requirement for this check is that all qualifying objects should return a GRanges class from a rowRanges method.
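A brief sketch of both developer checks follows, using one ID-based and one range-based toy experiment. The data are made up, and the exact form of the return values may vary between package versions, so none is shown here.

```r
library(MultiAssayExperiment)
library(GenomicRanges)

# ID-based toy assay (made-up values).
rna <- matrix(1:6, nrow = 2,
              dimnames = list(c("geneA", "geneB"), c("p1", "p2", "p3")))

# Range-based toy assay: two named ranges measured on the same patients.
gr <- GRanges("chr1", IRanges(start = c(100, 500), width = 50))
names(gr) <- c("segA", "segB")
cn <- SummarizedExperiment(
    assays = list(cn = matrix(0, nrow = 2, ncol = 3,
                              dimnames = list(names(gr),
                                              c("p1", "p2", "p3")))),
    rowRanges = gr
)

mae <- MultiAssayExperiment(list(rna = rna, cn = cn))

has_assay  <- hasAssay(mae)      # "soft" check for assay methods
has_ranges <- hasRowRanges(mae)  # check for rowRanges support
```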

Subset methods

Overview

A couple of methods for subsetting were created for the MultiAssayExperiment with a user-friendly interface in mind. Both the bracket notation [ and subset methods are available. Each allows for subsetting via numeric, character, and logical vectors. Additional support for list and List objects is available.

Bracket `[` subsetting

Users are able to subset by:

rows
columns
assays

respectively, within the bracket notation and separated by commas. When subsetting a MultiAssayExperiment via a numeric vector, all rows and columns of each element in the ExperimentList will be subset by that vector. When subsetting by a character vector, the vector will be matched against either the rownames, colnames or assays of the MultiAssayExperiment. Logical vectors can be passed to all dimensions of the MultiAssayExperiment (i.e., rows, columns, assays) and recycled if necessary, following standard R language practice.
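The three positions of the bracket can be sketched with character vectors as follows. The toy data are made up, and in this example the assay column names coincide with the patient IDs.

```r
library(MultiAssayExperiment)

# Two toy assays on partially overlapping patients (made-up data).
rna <- matrix(1:6, nrow = 2,
              dimnames = list(c("geneA", "geneB"), c("p1", "p2", "p3")))
prot <- matrix(1:4, nrow = 2,
               dimnames = list(c("protX", "protY"), c("p2", "p3")))
mae <- MultiAssayExperiment(list(rna = rna, prot = prot))

by_row   <- mae[c("geneA", "protX"), , ]  # first position: rows
by_col   <- mae[, c("p2", "p3"), ]        # second position: columns
by_assay <- mae[, , "rna"]                # third position: assays
```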

Subsetting assays follows the list-like methods closely.

with `list`/`List` objects

Subsetting with list-like and List-like objects is allowed as long as element names in said lists match the experiment names in the MultiAssayExperiment/ExperimentList. Subsetting with list and List is only available for rows and columns.

Examples of List classes include but are not limited to:

CharacterList
LogicalList
IntegerList
SimpleList (in S4Vectors)

Except where noted, these classes can be found in the IRanges package.
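A short sketch of per-experiment subsetting follows, once with a base list for rows and once with a CharacterList for columns. The toy data are made up; the list names match the experiment names, as required.

```r
library(MultiAssayExperiment)

# Two toy assays on partially overlapping patients (made-up data).
rna <- matrix(1:6, nrow = 2,
              dimnames = list(c("geneA", "geneB"), c("p1", "p2", "p3")))
prot <- matrix(1:4, nrow = 2,
               dimnames = list(c("protX", "protY"), c("p2", "p3")))
mae <- MultiAssayExperiment(list(rna = rna, prot = prot))

# Rows: a different set of features for each experiment.
rows_sub <- mae[list(rna = "geneA", prot = c("protX", "protY")), , ]

# Columns: a different set of samples for each experiment.
cols <- IRanges::CharacterList(rna = c("p1", "p2"), prot = "p2")
cols_sub <- mae[, cols, ]
```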

Use of `GRanges` object to subset

The subsetting of experiments with genomic ranges is possible when a GRanges object is introduced to the subsetting operation. Example classes that contain genomic ranges and support the current API are: 1) RangedSummarizedExperiment 2) RaggedExperiment. Additional arguments may be passed on to either the subsetByRow function or to the bracket [ function notation.
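The following sketch subsets a single range-based experiment with a GRanges query. The ranges and values are made up, and the object deliberately contains only a RangedSummarizedExperiment, so the behavior for non-ranged elements is not illustrated here.

```r
library(MultiAssayExperiment)
library(GenomicRanges)

# Three named toy ranges measured on two patients (made-up data).
gr <- GRanges(c("chr1", "chr1", "chr2"),
              IRanges(start = c(100, 500, 100), width = 50))
names(gr) <- c("segA", "segB", "segC")
cn <- SummarizedExperiment(
    assays = list(cn = matrix(0, nrow = 3, ncol = 2,
                              dimnames = list(names(gr), c("p1", "p2")))),
    rowRanges = gr
)
mae <- MultiAssayExperiment(list(cn = cn))

# Keep only features overlapping chr1:90-160.
query <- GRanges("chr1", IRanges(start = 90, end = 160))
hits <- mae[query, , ]

rownames(experiments(hits)[["cn"]])
```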

`drop` argument

The drop argument indicates whether to keep assays with zero dimensions (after subsetting) in the ExperimentList class object.
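The effect of drop can be sketched by selecting a row that exists in only one of two toy experiments. The data are made up, and drop is set explicitly in both calls because its default may differ between package versions.

```r
library(MultiAssayExperiment)

# Two toy assays with disjoint feature names (made-up data).
rna <- matrix(1:6, nrow = 2,
              dimnames = list(c("geneA", "geneB"), c("p1", "p2", "p3")))
prot <- matrix(1:4, nrow = 2,
               dimnames = list(c("protX", "protY"), c("p2", "p3")))
mae <- MultiAssayExperiment(list(rna = rna, prot = prot))

# "geneA" is absent from prot, so prot is left with zero rows.
kept    <- mae["geneA", , , drop = FALSE]  # empty prot assay is retained
dropped <- mae["geneA", , , drop = TRUE]   # empty prot assay is removed

names(experiments(kept))
names(experiments(dropped))
```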
