anndata.AnnData

class anndata.AnnData(X=None, obs=None, var=None, uns=None, obsm=None, varm=None, layers=None, raw=None, dtype=None, shape=None, filename=None, filemode=None, asview=False, *, obsp=None, varp=None, oidx=None, vidx=None)

An annotated data matrix.

AnnData stores a data matrix X together with annotations of observations obs (obsm, obsp), variables var (varm, varp), and unstructured annotations uns.

An AnnData object adata can be sliced like a DataFrame, for instance adata_subset = adata[:, list_of_variable_names]. AnnData’s basic structure is similar to R’s ExpressionSet [Huber15]. If an .h5ad-formatted HDF5 backing file is set via .filename, data remains on disk and is loaded into memory automatically when needed.

Parameters:

X (default: None)

A #observations × #variables data matrix. A view of the data is used if the data type matches, otherwise, a copy is made.

obs DataFrame | Mapping[str, Iterable[Any]] | None (default: None)

Key-indexed one-dimensional observations annotation of length #observations.

var DataFrame | Mapping[str, Iterable[Any]] | None (default: None)

Key-indexed one-dimensional variables annotation of length #variables.

uns Mapping[str, Any] | None (default: None)

Key-indexed unstructured annotation.

obsm ndarray | Mapping[str, Sequence[Any]] | None (default: None)

Key-indexed multi-dimensional observations annotation of length #observations. If passing an ndarray, it needs to have a structured datatype.

varm ndarray | Mapping[str, Sequence[Any]] | None (default: None)

Key-indexed multi-dimensional variables annotation of length #variables. If passing an ndarray, it needs to have a structured datatype.

layers (default: None)

Key-indexed multi-dimensional arrays aligned to dimensions of X.

shape tuple[int, int] | None (default: None)

Shape tuple (#observations, #variables). Can only be provided if X is None.

filename PathLike[str] | str | None (default: None)

Name of backing file. See h5py.File.

filemode Optional[Literal['r', 'r+']] (default: None)

Open mode of backing file. See h5py.File.

Notes

AnnData stores observations (samples) of variables/features in the rows of a matrix. This is the convention of the modern classics of statistics [Hastie09] and machine learning [Murphy12], the convention of dataframes both in R and Python, and the convention of the established statistics and machine learning packages in Python (statsmodels, scikit-learn).

One-dimensional annotations of the observations and variables are stored in the obs and var attributes as DataFrames. This is intended for metrics calculated over their axes. Multi-dimensional annotations are stored in obsm and varm, which are aligned to the object’s observation and variable dimensions respectively. Square matrices representing graphs are stored in obsp and varp, with both of their own dimensions aligned to their associated axis. Additional measurements across both observations and variables are stored in layers.

Indexing into an AnnData object can be performed by relative position with numeric indices (like pandas’ iloc()), or by labels (like loc()). To avoid ambiguity with numeric indexing into observations or variables, indexes of the AnnData object are converted to strings by the constructor.

Subsetting an AnnData object by indexing into it will also subset its elements according to the dimensions they were aligned to. This means an operation like adata[list_of_obs, :] will also subset obs, obsm, and layers.

Subsetting an AnnData object returns a view into the original object, meaning very little additional memory is used upon subsetting. This is achieved lazily, meaning that the constituent arrays are subset on access. Copying a view causes an equivalent “real” AnnData object to be generated. Attempting to modify a view (at any attribute except X) is handled in a copy-on-modify manner, meaning the object is initialized in place. Here’s an example:

    batch1 = adata[adata.obs["batch"] == "batch1", :]
    batch1.obs["value"] = 0  # This makes batch1 a “real” AnnData object

At the end of this snippet: adata was not modified, and batch1 is its own AnnData object with its own data.

Similar to Bioconductor’s ExpressionSet and scipy.sparse matrices, subsetting an AnnData object retains the dimensionality of its constituent arrays. Therefore, unlike with the classes exposed by pandas, numpy, and xarray, there is no concept of a one-dimensional AnnData object. AnnData objects always have two inherent dimensions, obs and var. Additionally, maintaining the dimensionality of the AnnData object allows for consistent handling of scipy.sparse matrices and numpy arrays.

Attributes

Methods