GitHub - tatami-inc/mattress: Python interface to tatami representations
Python bindings for tatami
Overview
The mattress package implements Python bindings to the tatami C++ library for matrix representations. Downstream packages can use mattress to develop C++ extensions that are interoperable with many different matrix classes, e.g., dense, sparse, delayed or file-backed. mattress is inspired by the beachmat Bioconductor package, which does the same thing for R packages.
Instructions
mattress is published to PyPI, so installation is simple:
    pip install mattress
mattress is intended for Python package developers writing C++ extensions that operate on matrices. The aim is to allow a package's C++ code to accept all types of matrix representations without requiring re-compilation of the associated code. To achieve this:
- Add mattress.includes() and assorthead.includes() to the compiler's include path. This can be done through include_dirs= of the Extension() definition in setup.py or by adding a target_include_directories() in CMake, depending on the build system.
- Call mattress.initialize() on a Python matrix object to wrap it in a tatami-compatible C++ representation. This returns an InitializedMatrix with a ptr property that contains a pointer to the C++ matrix.
- Pass ptr to C++ code as a uintptr_t referencing a tatami::Matrix, which can be interrogated as described in the tatami documentation.
So, for example, the C++ code in our downstream package might look like the code below:
    #include <cstdint>
    #include "pybind11/pybind11.h"
    #include "mattress.h"

    int do_something(uintptr_t ptr) {
        // Recover the tatami matrix pointer from the integer passed in from Python.
        const auto& mat_ptr = mattress::cast(ptr)->ptr;
        // Do something with the tatami interface.
        return 1;
    }

    // Assuming we're using pybind11, but any framework that can accept a uintptr_t is fine.
    PYBIND11_MODULE(lib_downstream, m) {
        m.def("do_something", &do_something);
    }
Which can then be called from Python:
    from . import lib_downstream as lib
    from mattress import initialize

    def do_something(x):
        tmat = initialize(x)
        return lib.do_something(tmat.ptr)
Check out the included header for more definitions.
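For the first step in the list above, a setup.py might assemble its Extension() roughly as follows. This is only a sketch: the module name, source path and compiler flag are hypothetical, pybind11 is assumed as the binding framework, and the imports are done lazily inside the function so that the file can be read without the build dependencies installed.

```python
def make_extension():
    """Build the Extension for a hypothetical lib_downstream module."""
    # Lazy imports: these packages are only needed when the extension is built.
    from setuptools import Extension
    import assorthead
    import mattress
    import pybind11

    return Extension(
        "downstream.lib_downstream",         # hypothetical package and module name
        sources=["src/lib_downstream.cpp"],  # hypothetical source path
        include_dirs=[
            mattress.includes(),      # headers for mattress.h
            assorthead.includes(),    # headers for tatami and friends
            pybind11.get_include(),   # assumes pybind11 is the binding framework
        ],
        language="c++",
        extra_compile_args=["-std=c++17"],  # assumed flag for a GCC/Clang toolchain
    )

# In setup.py, one would then call something like:
#     setup(ext_modules=[make_extension()])
```

CMake users would instead pass the same two directories to target_include_directories().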
Supported matrices
Dense numpy matrices of varying numeric type:
    import numpy as np
    from mattress import initialize

    x = np.random.rand(1000, 100)
    init = initialize(x)

    ix = (x * 100).astype(np.uint16)
    init2 = initialize(ix)
Compressed sparse matrices from scipy with varying index/data types:
    import numpy as np
    from scipy import sparse as sp
    from mattress import initialize

    # Compressed sparse column.
    xc = sp.random(100, 20, format="csc")
    init = initialize(xc)

    # Compressed sparse row, with an integer data type.
    xr = sp.random(100, 20, format="csr", dtype=np.uint8)
    init2 = initialize(xr)
Delayed arrays from the delayedarray package:
    from delayedarray import DelayedArray
    from scipy import sparse as sp
    from mattress import initialize
    import numpy

    xd = DelayedArray(sp.random(100, 20, format="csc"))
    xd = numpy.log1p(xd * 5)

    init = initialize(xd)
Sparse arrays from delayedarray are also supported:
    import delayedarray
    from numpy import float64, int32
    from mattress import initialize

    sa = delayedarray.SparseNdarray((50, 20), None, dtype=float64, index_dtype=int32)
    init = initialize(sa)
See below to extend initialize() to custom matrix representations.
Utility methods
The InitializedMatrix instance returned by initialize() provides a few Python-visible methods for querying the C++ matrix.
    init.nrow()     # number of rows
    init.column(1)  # contents of column 1
    init.sparse()   # whether the matrix is sparse
It also has a few methods for computing common statistics:
    init.row_sums()
    init.column_variances(num_threads=2)

    grouping = [i % 3 for i in range(init.ncol())]
    init.row_medians_by_group(grouping)

    init.row_nan_counts()
    init.column_ranges()
These are mostly intended for non-intensive work or testing/debugging. It is expected that any serious computation should be performed by iterating over the matrix in C++.
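To make the grouping argument of row_medians_by_group() more concrete, here is a pure-Python sketch of what a grouped row median computes: each column is assigned to a group, and the median of every row is taken within each group. The function name and return format below are hypothetical; the source does not show what the mattress method actually returns.

```python
from statistics import median


def row_medians_by_group_sketch(rows, grouping):
    """Median of each row within each group of columns (illustration only)."""
    levels = sorted(set(grouping))  # distinct group labels, in sorted order
    medians = []
    for row in rows:
        # One median per group, using only the columns assigned to that group.
        medians.append([
            median(value for value, group in zip(row, grouping) if group == level)
            for level in levels
        ])
    return medians, levels


rows = [
    [1, 2, 3, 4, 5, 6],
    [6, 5, 4, 3, 2, 1],
]
grouping = [i % 3 for i in range(6)]  # columns 0 and 3 form group 0, and so on
medians, levels = row_medians_by_group_sketch(rows, grouping)
```

The mattress method does this work in C++ on the tatami matrix, so it avoids copying the matrix contents into Python lists.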
Operating on an existing pointer
If we already have an InitializedMatrix, we can easily apply additional operations by wrapping it in the relevant delayedarray layers and calling initialize() afterwards. For example, if we want to add a scalar, we might do:
    from delayedarray import DelayedArray
    from mattress import initialize
    import numpy

    x = numpy.random.rand(1000, 10)
    init = initialize(x)

    wrapped = DelayedArray(init) + 1
    init2 = initialize(wrapped)
This is more efficient as it re-uses the InitializedMatrix already generated from x. It is also more convenient as we don't have to carry around x to generate init2.
Extending to custom matrices
Developers can extend mattress to custom matrix classes by registering new methods with the initialize() generic. Each method should return an InitializedMatrix object containing a uintptr_t cast from a pointer to a tatami::Matrix (see the included header). Once this is done, all calls to initialize() will be able to handle matrices of the newly registered types.
    from . import lib_downstream as lib
    import mattress

    @mattress.initialize.register
    def _initialize_my_custom_matrix(x: MyCustomMatrix):
        data = x.some_internal_data
        return mattress.InitializedMatrix(lib.initialize_custom(data))
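The registration pattern can be tried without any compiled code. The sketch below assumes that initialize() behaves like a functools.singledispatch generic, which the .register decorator suggests but which is not confirmed here. All class and function names are hypothetical stand-ins, and id() is used as a placeholder for the uintptr_t that real C++ code would return.

```python
from functools import singledispatch


class InitializedMatrixSketch:
    """Hypothetical stand-in for mattress.InitializedMatrix."""

    def __init__(self, ptr):
        self.ptr = ptr


@singledispatch
def initialize_sketch(x):
    # Fallback for types that have no registered method.
    raise NotImplementedError(f"no initialize method for {type(x).__name__}")


class MyCustomMatrix:
    def __init__(self, data):
        self.some_internal_data = data


@initialize_sketch.register
def _initialize_my_custom_matrix(x: MyCustomMatrix):
    # A real method would call into C++ here; id() is only a placeholder integer.
    return InitializedMatrixSketch(id(x.some_internal_data))


class MySubclassedMatrix(MyCustomMatrix):
    """Subclasses are dispatched to the method registered for their parent."""


result = initialize_sketch(MySubclassedMatrix([1, 2, 3]))

try:
    initialize_sketch("not a matrix")
    unsupported_rejected = False
except NotImplementedError:
    unsupported_rejected = True
```

Dispatch is on the type of the first argument only, so one method per matrix class is enough.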
If the initialized tatami::Matrix contains references to Python-managed data, e.g., in NumPy arrays, we must ensure that the data is not garbage-collected during the lifetime of the tatami::Matrix. This is achieved by storing a reference to the data in the original member of the mattress::BoundMatrix.
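The same idea can be illustrated in pure Python. In this sketch, which relies on CPython's reference counting, BoundMatrixSketch is a hypothetical analogue of mattress::BoundMatrix and Buffer stands in for Python-managed data such as a NumPy array. The buffer survives only as long as something holds a strong reference to it, which is the role of the original member.

```python
import gc
import weakref


class Buffer:
    """Stand-in for Python-managed data such as a NumPy array (hypothetical)."""

    def __init__(self, values):
        self.values = list(values)


class BoundMatrixSketch:
    """Hypothetical Python analogue of mattress::BoundMatrix."""

    def __init__(self, data):
        self.ptr = id(data)   # placeholder for the pointer that C++ would dereference
        self.original = data  # strong reference that keeps the data alive


def wrap(values):
    buffer = Buffer(values)
    # Return a weak reference as well, so that the buffer's lifetime can be observed.
    return BoundMatrixSketch(buffer), weakref.ref(buffer)


bound, probe = wrap([1.0, 2.0, 3.0])
gc.collect()
alive_while_bound = probe() is not None    # held alive through bound.original

del bound
gc.collect()
alive_after_release = probe() is not None  # no strong references remain
```

Without the original attribute, the buffer would be freed as soon as wrap() returned, leaving ptr dangling.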