Alternative chunked array types

Xarray can wrap chunked dask arrays (see Parallel Computing with Dask), but can also wrap any other chunked array type that exposes the correct interface. This allows us to support using other frameworks for distributed and out-of-core processing, with user code still written as xarray commands. In particular xarray also supports wrapping cubed.Array objects (see Cubed’s documentation and the cubed-xarray package).

The basic idea is that by wrapping an array that has an explicit notion of .chunks, xarray can expose control over the choice of chunking scheme to users via methods like DataArray.chunk(), whilst the wrapped array type handles the actual processing of the chunks.
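For example, with the dask chunk manager, the chunking of the wrapped array can be inspected and changed entirely through xarray methods. A minimal illustration (assuming dask is installed):

    import dask.array as da
    import xarray as xr

    # Wrap a chunked dask array in a DataArray
    arr = xr.DataArray(da.ones((4, 6), chunks=(2, 3)), dims=["x", "y"])

    print(arr.chunksizes)  # mapping of dimension name to chunk sizes: x -> (2, 2), y -> (3, 3)

    # Rechunking is expressed as an xarray operation ...
    rechunked = arr.chunk({"x": 1})

    # ... but the wrapped dask array does the actual work
    print(type(rechunked.data))  # <class 'dask.array.core.Array'>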

Chunked array methods and “core operations”#

A chunked array needs to meet all the requirements for normal duck arrays, but must also implement additional features.

Chunked arrays have additional attributes and methods, such as .chunks and .rechunk. Furthermore, Xarray dispatches chunk-aware computations across one or more chunked arrays using special functions known as “core operations”. Examples include map_blocks, blockwise, and apply_gufunc.

The core operations are generalizations of functions first implemented in dask.array. The implementation of these functions is specific to the type of arrays passed to them. For example, when applying the map_blocks core operation, dask.array.Array objects must be processed by dask.array.map_blocks(), whereas cubed.Array objects must be processed by cubed.map_blocks().

In order to use the correct implementation of a core operation for the array type encountered, xarray dispatches to the corresponding subclass of ChunkManagerEntrypoint, also known as a “Chunk Manager”. Therefore a full list of the operations that need to be defined is set by the API of the ChunkManagerEntrypoint abstract base class. Note that chunked array methods are also currently dispatched using this class.

Chunked array creation is also handled by this class. As chunked array objects have a one-to-one correspondence with in-memory numpy arrays, it should be possible to create a chunked array from a numpy array by passing the desired chunking pattern to an implementation of from_array.
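For illustration, dask's implementation of from_array accepts an in-memory numpy array together with the desired chunking pattern (cubed exposes an analogous function):

    import numpy as np
    import dask.array as da

    x = np.arange(12).reshape(3, 4)

    # Split the in-memory array into blocks of at most 2 elements per axis
    chunked = da.from_array(x, chunks=(2, 2))
    print(chunked.chunks)  # ((2, 1), (2, 2))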

Note

The ChunkManagerEntrypoint abstract base class is mostly just acting as a namespace for containing the chunk-aware function primitives. Ideally in the future we would have an API standard for chunked array types which codified this structure, making the entrypoint system unnecessary.

class xarray.namedarray.parallelcompat.ChunkManagerEntrypoint[source]#

Interface between a particular parallel computing framework and xarray.

This abstract base class must be subclassed by libraries implementing chunked array types, and registered via the chunkmanagers entrypoint.

Abstract methods on this class must be implemented, whereas non-abstract methods are only required in order to enable a subset of xarray functionality, and by default will raise a NotImplementedError if called.

array_cls#

Type of the array class this parallel computing framework provides.

Parallel frameworks need to provide an array class that supports the array API standard. This attribute is used for array instance type checking at runtime.

Type:

type[xarray.namedarray.parallelcompat.T_ChunkedArray]

abstract apply_gufunc(func, signature, *args, axes=None, keepdims=False, output_dtypes=None, vectorize=None, **kwargs)[source]#

Apply a generalized ufunc or similar python function to arrays.

signature determines if the function consumes or produces core dimensions. The remaining dimensions in given input arrays (*args) are considered loop dimensions and are required to broadcast naturally against each other.

In other terms, this function is like np.vectorize, but for the blocks of chunked arrays. If the function itself should also be vectorized, use vectorize=True for convenience.

Called inside xarray.apply_ufunc, which is called internally for most xarray operations. Therefore this method must be implemented for the vast majority of xarray computations to be supported.

Parameters:

Returns:

Single chunked array or tuple of chunked arrays

References
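As a concrete illustration, calling dask's implementation of this core operation directly with a function that consumes one core dimension:

    import numpy as np
    import dask.array as da

    def mean_over_last_axis(x):
        # consumes the core dimension "i", producing one value per loop element
        return np.mean(x, axis=-1)

    a = da.random.random((8, 16), chunks=(4, 16))
    result = da.apply_gufunc(mean_over_last_axis, "(i)->()", a, output_dtypes=float)
    print(result.compute().shape)  # (8,)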

property array_api#

Return the array_api namespace following the python array API standard.

See https://data-apis.org/array-api/latest/ . Currently used to access the array API function full_like, which is called within the xarray constructors xarray.full_like, xarray.ones_like, xarray.zeros_like, etc.

See also

dask.array, cubed.array_api

blockwise(func, out_ind, *args, adjust_chunks=None, new_axes=None, align_arrays=True, **kwargs)[source]#

Tensor operation: Generalized inner and outer products.

A broad class of blocked algorithms and patterns can be specified with a concise multi-index notation. The blockwise function applies an in-memory function across multiple blocks of multiple inputs in a variety of ways. Many chunked array operations are special cases of blockwise including elementwise, broadcasting, reductions, tensordot, and transpose.

Currently only called explicitly in xarray when performing multidimensional interpolation.

Parameters:
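For illustration, dask's blockwise can express an elementwise addition by giving both inputs and the output the same index pattern:

    import operator
    import dask.array as da

    x = da.ones((4, 4), chunks=2)
    y = da.full((4, 4), 2.0, chunks=2)

    # "ij" on the output and both inputs: apply operator.add block by block
    z = da.blockwise(operator.add, "ij", x, "ij", y, "ij", dtype=x.dtype)
    print(z.compute()[0, 0])  # 3.0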

abstract chunks(data)[source]#

Return the current chunks of the given array.

Returns chunks explicitly as a tuple of tuples of ints.

Used internally by xarray objects’ .chunks and .chunksizes properties.

Parameters:

data (chunked array)

Returns:

chunks (tuple[tuple[int, ...], ...])
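For example, dask reports chunks in exactly this form:

    import dask.array as da

    x = da.ones((6, 4), chunks=(2, 4))
    print(x.chunks)  # ((2, 2, 2), (4,))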

abstract compute(*data, **kwargs)[source]#

Computes one or more chunked arrays, returning them as eager numpy arrays.

Called any time something needs to be computed, including multiple arrays at once. Used by .compute, .persist, .values.

Parameters:

*data (object) – Any number of objects. If an object is an instance of the chunked array type, it is computed and the in-memory result returned as a numpy array. All other types should be passed through unchanged.

Returns:

objs – The input, but with all chunked arrays now computed.
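For illustration, dask's top-level compute function shows the same behaviour: chunked arrays are turned into numpy arrays, while other objects pass through unchanged:

    import dask
    import dask.array as da

    x = da.arange(6, chunks=3)

    computed, passthrough = dask.compute(x + 1, "not a chunked array")
    print(computed)     # [1 2 3 4 5 6]
    print(passthrough)  # 'not a chunked array'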

abstract from_array(data, chunks, **kwargs)[source]#

Create a chunked array from a non-chunked numpy-like array.

Generally input should have a .shape, .ndim, .dtype and support numpy-style slicing.

Called when the .chunk method is called on an xarray object that is not already chunked. Also called within open_dataset (when chunks is not None) to create a chunked array from an xarray lazily indexed array.

Parameters:

is_chunked_array(data)[source]#

Check if the given object is an instance of this type of chunked array.

Compares against the type stored in the array_cls attribute by default.

Parameters:

data (Any)

Returns:

is_chunked (bool)

map_blocks(func, *args, dtype=None, chunks=None, drop_axis=None, new_axis=None, **kwargs)[source]#

Map a function across all blocks of a chunked array.

Called in elementwise operations, but notably not (currently) called within xarray.map_blocks.

Parameters:
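For illustration, with dask's implementation:

    import dask.array as da

    x = da.arange(8, chunks=4)

    # The function is applied to each block independently
    doubled = da.map_blocks(lambda block: block * 2, x, dtype=x.dtype)
    print(doubled.compute())  # [ 0  2  4  6  8 10 12 14]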

abstract normalize_chunks(chunks, shape=None, limit=None, dtype=None, previous_chunks=None)[source]#

Normalize given chunking pattern into an explicit tuple of tuples representation.

Exposed primarily because different chunking backends may want to make different decisions about how to automatically chunk along dimensions not given explicitly in the input chunks.

Called internally by xarray.open_dataset.

Parameters:
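For example, dask's implementation expands shorthand chunk specifications into the explicit tuple-of-tuples form (a sketch assuming dask is installed):

    from dask.array.core import normalize_chunks

    # A single block size is expanded across the whole shape,
    # with a smaller final chunk where the shape is not evenly divisible
    print(normalize_chunks(chunks=2, shape=(5,)))           # ((2, 2, 1),)

    # -1 means "one chunk spanning the entire axis"
    print(normalize_chunks(chunks=(2, -1), shape=(4, 6)))   # ((2, 2), (6,))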

persist(*data, **kwargs)[source]#

Persist one or more chunked arrays in memory.

Parameters:

*data (object) – Any number of objects. If an object is an instance of the chunked array type, it is persisted as a chunked array in memory. All other types should be passed through unchanged.

Returns:

objs – The input, but with all chunked arrays now persisted in memory.

rechunk(data, chunks, **kwargs)[source]#

Changes the chunking pattern of the given array.

Called when the .chunk method is called on an xarray object that is already chunked.

Parameters:

Returns:

chunked array
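For illustration, with dask:

    import dask.array as da

    x = da.ones((6, 6), chunks=(2, 6))
    y = x.rechunk((3, 3))  # equivalently da.rechunk(x, (3, 3))
    print(y.chunks)        # ((3, 3), (3, 3))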

reduction(arr, func, combine_func=None, aggregate_func=None, axis=None, dtype=None, keepdims=False)[source]#

A general version of array reductions along one or more axes.

Used inside some reductions like nanfirst, which is used by groupby.first.

Parameters:

Returns:

chunked array

scan(func, binop, ident, arr, axis=None, dtype=None, **kwargs)[source]#

General version of a 1D scan, also known as a cumulative array reduction.

Used in ffill and bfill in xarray.

Parameters:

Returns:

Chunked array

See also

dask.array.cumreduction

store(sources, targets, **kwargs)[source]#

Store chunked arrays in array-like objects, overwriting data in target.

This stores chunked arrays into an object that supports numpy-style setitem indexing (e.g. a Zarr Store). Allows storing values chunk by chunk so that it does not have to fill up memory. For best performance you likely want to align the block size of the storage target with the block size of your array.

Used when writing to any registered xarray I/O backend.

Parameters:
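For illustration, dask's store writes chunk by chunk into any target that supports numpy-style assignment; here a plain numpy array stands in for a Zarr array or similar store:

    import numpy as np
    import dask.array as da

    source = da.ones((4, 4), chunks=2)
    target = np.zeros((4, 4))  # stand-in for a Zarr array or similar

    # Each block of `source` is computed and assigned into `target` in turn
    da.store([source], [target])
    print(target.sum())  # 16.0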

unify_chunks(*args, **kwargs)[source]#

Unify chunks across a sequence of arrays.

Called by xarray.unify_chunks.

Parameters:

*args (sequence of Array, index pairs) – Sequence like (x, 'ij', y, 'jk', z, 'i')

Registering a new ChunkManagerEntrypoint subclass#

Rather than hard-coding various chunk managers to deal with specific chunked array implementations, xarray uses an entrypoint system to allow developers of new chunked array implementations to register their corresponding subclass of ChunkManagerEntrypoint.

To register a new entrypoint you need to add an entry to the setup.cfg like this:

    [options.entry_points]
    xarray.chunkmanagers =
        dask = xarray.namedarray.daskmanager:DaskManager
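If your project declares its entry points in pyproject.toml rather than setup.cfg, the equivalent declaration would look something like this (shown for the built-in dask manager):

    [project.entry-points."xarray.chunkmanagers"]
    dask = "xarray.namedarray.daskmanager:DaskManager"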

See also cubed-xarray for another example.
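To give a feel for what such a subclass involves, here is a heavily abbreviated sketch that implements only the abstract methods, written for a hypothetical chunked array library mychunkedlib providing a MyChunkedArray type; it is illustrative only, not a working chunk manager:

    from xarray.namedarray.parallelcompat import ChunkManagerEntrypoint

    # mychunkedlib and MyChunkedArray are hypothetical placeholders
    import mychunkedlib
    from mychunkedlib import MyChunkedArray


    class MyChunkManager(ChunkManagerEntrypoint):
        """Dispatches xarray's chunk-aware operations to mychunkedlib."""

        array_cls = MyChunkedArray

        def chunks(self, data):
            # Return chunks explicitly as a tuple of tuples of ints
            return data.chunks

        def normalize_chunks(self, chunks, shape=None, limit=None, dtype=None, previous_chunks=None):
            return mychunkedlib.normalize_chunks(
                chunks, shape=shape, limit=limit, dtype=dtype, previous_chunks=previous_chunks
            )

        def from_array(self, data, chunks, **kwargs):
            return mychunkedlib.from_array(data, chunks, **kwargs)

        def compute(self, *data, **kwargs):
            return mychunkedlib.compute(*data, **kwargs)

        def apply_gufunc(self, func, signature, *args, **kwargs):
            return mychunkedlib.apply_gufunc(func, signature, *args, **kwargs)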

To check that the entrypoint has worked correctly, you may find it useful to display the available chunkmanagers using the internal function list_chunkmanagers().

xarray.namedarray.parallelcompat.list_chunkmanagers()[source]#

Return a dictionary of available chunk managers and their ChunkManagerEntrypoint subclass objects.

Returns:

chunkmanagers (dict) – Dictionary whose keys are the strings under which each chunkmanager is registered, and whose values are the corresponding ChunkManagerEntrypoint subclass instances.
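For example, from a Python session:

    from xarray.namedarray.parallelcompat import list_chunkmanagers

    # With dask installed (plus any other registered managers, e.g. cubed via cubed-xarray)
    print(list_chunkmanagers())
    # e.g. {'dask': <xarray.namedarray.daskmanager.DaskManager object at 0x...>}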

User interface#

Once the chunkmanager subclass has been registered, xarray objects wrapping the desired array type can be created in 3 ways:

  1. By manually passing an array of the desired chunked type to the DataArray constructor (see the examples for numpy-like arrays),
  2. Calling chunk(), passing the keyword arguments chunked_array_type and from_array_kwargs,
  3. Calling open_dataset(), passing the keyword arguments chunked_array_type and from_array_kwargs.

The latter two methods ultimately call the chunkmanager’s implementation of .from_array, to which they pass the from_array_kwargs dict. The chunked_array_type kwarg selects which registered chunkmanager subclass to dispatch to. It defaults to 'dask' if Dask is installed; otherwise it defaults to whichever chunkmanager is registered if only one is registered. If multiple chunkmanagers are registered, the chunk_manager configuration option (which can be set using set_options()) will be used to determine which chunkmanager to use, defaulting to 'dask'.
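A sketch of the second and third approaches, assuming a chunk manager registered under the name "cubed" (as provided by cubed-xarray) and leaving from_array_kwargs empty as a placeholder for backend-specific options:

    import numpy as np
    import xarray as xr

    ds = xr.Dataset({"a": (("x",), np.arange(1000))})

    # Way 2: chunk an existing in-memory xarray object
    ds_chunked = ds.chunk({"x": 100}, chunked_array_type="cubed", from_array_kwargs={})

    # Way 3: chunk while opening from disk ("data.nc" is a placeholder path)
    ds_lazy = xr.open_dataset(
        "data.nc",
        chunks={},
        chunked_array_type="cubed",
        from_array_kwargs={},
    )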

Parallel processing without chunks#

To use a parallel array type that does not expose a concept of chunks explicitly, none of the information on this page is theoretically required. Such an array type (e.g. Ramba or Arkouda) could be wrapped using xarray’s existing support for numpy-like “duck” arrays.