NEP 18 — A dispatch mechanism for NumPy’s high level array functions — NumPy Enhancement Proposals (original) (raw)
Author:
Stephan Hoyer <shoyer@google.com>
Author:
Matthew Rocklin <mrocklin@gmail.com>
Author:
Marten van Kerkwijk <mhvk@astro.utoronto.ca>
Author:
Hameer Abbasi <hameerabbasi@yahoo.com>
Author:
Eric Wieser <wieser.eric@gmail.com>
Status:
Final
Type:
Standards Track
Created:
2018-05-29
Updated:
2019-05-25
Resolution:
https://mail.python.org/pipermail/numpy-discussion/2018-August/078493.html
Abstract#
We propose the __array_function__
protocol, to allow arguments of NumPy functions to define how that function operates on them. This will allow using NumPy as a high level API for efficient multi-dimensional array operations, even with array implementations that differ greatly fromnumpy.ndarray
.
Detailed description#
NumPy’s high level ndarray API has been implemented several times outside of NumPy itself for different architectures, such as for GPU arrays (CuPy), Sparse arrays (scipy.sparse, pydata/sparse) and parallel arrays (Dask array) as well as various NumPy-like implementations in the deep learning frameworks, like TensorFlow and PyTorch.
Similarly there are many projects that build on top of the NumPy API for labeled and indexed arrays (XArray), automatic differentiation (Autograd, Tangent), masked arrays (numpy.ma), physical units (astropy.units, pint, unyt), etc. that add additional functionality on top of the NumPy API. Most of these project also implement a close variation of NumPy’s level high API.
We would like to be able to use these libraries together, for example we would like to be able to place a CuPy array within XArray, or perform automatic differentiation on Dask array code. This would be easier to accomplish if code written for NumPy ndarrays could also be used by other NumPy-like projects.
For example, we would like for the following code example to work equally well with any NumPy-like array object:
def f(x): y = np.tensordot(x, x.T) return np.mean(np.exp(y))
Some of this is possible today with various protocol mechanisms within NumPy.
- The
np.exp
function checks the__array_ufunc__
protocol - The
.T
method works using Python’s method dispatch - The
np.mean
function explicitly checks for a.mean
method on the argument
However other functions, like np.tensordot
do not dispatch, and instead are likely to coerce to a NumPy array (using the __array__
) protocol, or err outright. To achieve enough coverage of the NumPy API to support downstream projects like XArray and autograd we want to support almost all functions within NumPy, which calls for a more reaching protocol than just __array_ufunc__
. We would like a protocol that allows arguments of a NumPy function to take control and divert execution to another function (for example a GPU or parallel implementation) in a way that is safe and consistent across projects.
Implementation#
We propose adding support for a new protocol in NumPy,__array_function__
.
This protocol is intended to be a catch-all for NumPy functionality that is not covered by the __array_ufunc__
protocol for universal functions (like np.exp
). The semantics are very similar to __array_ufunc__
, except the operation is specified by an arbitrary callable object rather than a ufunc instance and method.
A prototype implementation can be found inthis notebook.
Warning
The __array_function__
protocol, and its use on particular functions, is experimental. We plan to retain an interface that makes it possible to override NumPy functions, but the way to do so for particular functionscan and will change with little warning. If such reduced backwards compatibility guarantees are not accepted to you, do not rely upon overrides of NumPy functions for non-NumPy arrays. See “Non-goals” below for more details.
Note
Dispatch with the __array_function__
protocol has been implemented but is not yet enabled by default:
- In NumPy 1.16, you need to set the environment variable
NUMPY_EXPERIMENTAL_ARRAY_FUNCTION=1
before importing NumPy to test NumPy function overrides. - In NumPy 1.17, the protocol will be enabled by default, but can be disabled with
NUMPY_EXPERIMENTAL_ARRAY_FUNCTION=0
. - Eventually, expect to
__array_function__
to always be enabled.
The interface#
We propose the following signature for implementations of__array_function__
:
def array_function(self, func, types, args, kwargs)
func
is an arbitrary callable exposed by NumPy’s public API, which was called in the formfunc(*args, **kwargs)
.types
is a collectionof unique argument types from the original NumPy function call that implement__array_function__
.- The tuple
args
and dictkwargs
are directly passed on from the original call.
Unlike __array_ufunc__
, there are no high-level guarantees about the type of func
, or about which of args
and kwargs
may contain objects implementing the array API.
As a convenience for __array_function__
implementers, types
provides all argument types with an '__array_function__'
attribute. This allows implementers to quickly identify cases where they should defer to__array_function__
implementations on other arguments. The type of types
is intentionally vague:frozenset
would most closely match intended use, but we may use tuple
instead for performance reasons. In any case, __array_function__
implementations should not rely on the iteration order of types
, which would violate a well-defined “Type casting hierarchy” (as described inNEP-13).
Example for a project implementing the NumPy API#
Most implementations of __array_function__
will start with two checks:
- Is the given function something that we know how to overload?
- Are all arguments of a type that we know how to handle?
If these conditions hold, __array_function__
should return the result from calling its implementation for func(*args, **kwargs)
. Otherwise, it should return the sentinel value NotImplemented
, indicating that the function is not implemented by these types. This is preferable to raising TypeError
directly, because it gives other arguments the opportunity to define the operations.
There are no general requirements on the return value from__array_function__
, although most sensible implementations should probably return array(s) with the same type as one of the function’s arguments. If/when Python gainstyping support for protocolsand NumPy adds static type annotations, the @overload
implementation for SupportsArrayFunction
will indicate a return type of Any
.
It may also be convenient to define a custom decorators (implements
below) for registering __array_function__
implementations.
HANDLED_FUNCTIONS = {}
class MyArray: def array_function(self, func, types, args, kwargs): if func not in HANDLED_FUNCTIONS: return NotImplemented # Note: this allows subclasses that don't override # array_function to handle MyArray objects if not all(issubclass(t, MyArray) for t in types): return NotImplemented return HANDLED_FUNCTIONS[func](*args, **kwargs)
def implements(numpy_function): """Register an array_function implementation for MyArray objects.""" def decorator(func): HANDLED_FUNCTIONS[numpy_function] = func return func return decorator
@implements(np.concatenate) def concatenate(arrays, axis=0, out=None): ... # implementation of concatenate for MyArray objects
@implements(np.broadcast_to) def broadcast_to(array, shape): ... # implementation of broadcast_to for MyArray objects
Note that it is not required for __array_function__
implementations to include all of the corresponding NumPy function’s optional arguments (e.g., broadcast_to
above omits the irrelevant subok
argument). Optional arguments are only passed in to __array_function__
if they were explicitly used in the NumPy function call.
Note
Just like the case for builtin special methods like __add__
, properly written __array_function__
methods should always returnNotImplemented
when an unknown type is encountered. Otherwise, it will be impossible to correctly override NumPy functions from another object if the operation also includes one of your objects.
Necessary changes within the NumPy codebase itself#
This will require two changes within the NumPy codebase:
- A function to inspect available inputs, look for the
__array_function__
attribute on those inputs, and call those methods appropriately until one succeeds. This needs to be fast in the common all-NumPy case, and have acceptable performance (no worse than linear time) even if the number of overloaded inputs is large (e.g., as might be the case for np.concatenate).
This is one additional function of moderate complexity. - Calling this function within all relevant NumPy functions.
This affects many parts of the NumPy codebase, although with very low complexity.
Finding and calling the right __array_function__
#
Given a NumPy function, *args
and **kwargs
inputs, we need to search through *args
and **kwargs
for all appropriate inputs that might have the __array_function__
attribute. Then we need to select among those possible methods and execute the right one. Negotiating between several possible implementations can be complex.
Finding arguments#
Valid arguments may be directly in the *args
and **kwargs
, such as in the case for np.tensordot(left, right, out=out)
, or they may be nested within lists or dictionaries, such as in the case ofnp.concatenate([x, y, z])
. This can be problematic for two reasons:
- Some functions are given long lists of values, and traversing them might be prohibitively expensive.
- Some functions may have arguments that we don’t want to inspect, even if they have the
__array_function__
method.
To resolve these issues, NumPy functions should explicitly indicate which of their arguments may be overloaded, and how these arguments should be checked. As a rule, this should include all arguments documented as eitherarray_like
or ndarray
.
We propose to do so by writing “dispatcher” functions for each overloaded NumPy function:
- These functions will be called with the exact same arguments that were passed into the NumPy function (i.e.,
dispatcher(*args, **kwargs)
), and should return an iterable of arguments to check for overrides. - Dispatcher functions are required to share the exact same positional, optional and keyword-only arguments as their corresponding NumPy functions. Otherwise, valid invocations of a NumPy function could result in an error when calling its dispatcher.
- Because default values for keyword arguments do not have
__array_function__
attributes, by convention we set all default argument values toNone
. This reduces the likelihood of signatures falling out of sync, and minimizes extraneous information in the dispatcher. The only exception should be cases where the argument value in some way effects dispatching, which should be rare.
An example of the dispatcher for np.concatenate
may be instructive:
def _concatenate_dispatcher(arrays, axis=None, out=None): for array in arrays: yield array if out is not None: yield out
The concatenate dispatcher is written as generator function, which allows it to potentially include the value of the optional out
argument without needing to create a new sequence with the (potentially long) list of objects to be concatenated.
Trying __array_function__
methods until the right one works#
Many arguments may implement the __array_function__
protocol. Some of these may decide that, given the available inputs, they are unable to determine the correct result. How do we call the right one? If several are valid then which has precedence?
For the most part, the rules for dispatch with __array_function__
match those for __array_ufunc__
(seeNEP-13). In particular:
- NumPy will gather implementations of
__array_function__
from all specified inputs and call them in order: subclasses before superclasses, and otherwise left to right. Note that in some edge cases involving subclasses, this differs slightly from thecurrent behavior of Python. - Implementations of
__array_function__
indicate that they can handle the operation by returning any value other thanNotImplemented
. - If all
__array_function__
methods returnNotImplemented
, NumPy will raiseTypeError
.
If no __array_function__
methods exist, NumPy will default to calling its own implementation, intended for use on NumPy arrays. This case arises, for example, when all array-like arguments are Python numbers or lists. (NumPy arrays do have a __array_function__
method, given below, but it always returns NotImplemented
if any argument other than a NumPy array subclass implements __array_function__
.)
One deviation from the current behavior of __array_ufunc__
is that NumPy will only call __array_function__
on the first argument of each unique type. This matches Python’srule for calling reflected methods, and this ensures that checking overloads has acceptable performance even when there are a large number of overloaded arguments. To avoid long-term divergence between these two dispatch protocols, we shouldalso update __array_ufunc__
to match this behavior.
The __array_function__
method on numpy.ndarray
#
The use cases for subclasses with __array_function__
are the same as those with __array_ufunc__
, so numpy.ndarray
also defines a__array_function__
method:
def array_function(self, func, types, args, kwargs): if not all(issubclass(t, ndarray) for t in types): # Defer to any non-subclasses that implement array_function return NotImplemented
# Use NumPy's private implementation without __array_function__
# dispatching
return func._implementation(*args, **kwargs)
This method matches NumPy’s dispatching rules, so for most part it is possible to pretend that ndarray.__array_function__
does not exist. The private _implementation
attribute, defined below in thearray_function_dispatch
decorator, allows us to avoid the special cases for NumPy arrays that were needed in the __array_ufunc__
protocol.
The __array_function__
protocol always calls subclasses before superclasses, so if any ndarray
subclasses are involved in an operation, they will get the chance to override it, just as if any other argument overrides __array_function__
. But the default behavior in an operation that combines a base NumPy array and a subclass is different: if the subclass returns NotImplemented
, NumPy’s implementation of the function will be called instead of raising an exception. This is appropriate since subclasses are expected to be substitutable.
We still caution authors of subclasses to exercise caution when relying upon details of NumPy’s internal implementations. It is not always possible to write a perfectly substitutable ndarray subclass, e.g., in cases involving the creation of new arrays, not least because NumPy makes use of internal optimizations specialized to base NumPy arrays, e.g., code written in C. Even if NumPy’s implementation happens to work today, it may not work in the future. In these cases, your recourse is to re-implement top-level NumPy functions via__array_function__
on your subclass.
Changes within NumPy functions#
Given a function defining the above behavior, for now call itimplement_array_function
, we now need to call that function from within every relevant NumPy function. This is a pervasive change, but of fairly simple and innocuous code that should complete quickly and without effect if no arguments implement the __array_function__
protocol.
To achieve this, we define a array_function_dispatch
decorator to rewrite NumPy functions. The basic implementation is as follows:
def array_function_dispatch(dispatcher, module=None): """Wrap a function for dispatch with the array_function protocol.""" def decorator(implementation): @functools.wraps(implementation) def public_api(*args, **kwargs): relevant_args = dispatcher(*args, **kwargs) return implement_array_function( implementation, public_api, relevant_args, args, kwargs) if module is not None: public_api.module = module # for ndarray.array_function public_api._implementation = implementation return public_api return decorator
example usage
def _broadcast_to_dispatcher(array, shape, subok=None): return (array,)
@array_function_dispatch(_broadcast_to_dispatcher, module='numpy') def broadcast_to(array, shape, subok=False): ... # existing definition of np.broadcast_to
Using a decorator is great! We don’t need to change the definitions of existing NumPy functions, and only need to write a few additional lines for the dispatcher function. We could even reuse a single dispatcher for families of functions with the same signature (e.g., sum
and prod
). For such functions, the largest change could be adding a few lines to the docstring to note which arguments are checked for overloads.
It’s particularly worth calling out the decorator’s use offunctools.wraps
:
- This ensures that the wrapped function has the same name and docstring as the wrapped NumPy function.
- On Python 3, it also ensures that the decorator function copies the original function signature, which is important for introspection based tools such as auto-complete.
- Finally, it ensures that the wrapped functioncan be pickled.
The example usage illustrates several best practices for writing dispatchers relevant to NumPy contributors:
- We passed the
module
argument, which in turn sets the__module__
attribute on the generated function. This is for the benefit of better error messages, here for errors raised internally by NumPy when no implementation is found, e.g.,TypeError: no implementation found for 'numpy.broadcast_to'
. Setting__module__
to the canonical location in NumPy’s public API encourages users to use NumPy’s public API for identifying functions in__array_function__
. - The dispatcher is a function that returns a tuple, rather than an equivalent (and equally valid) generator using
yield
:
example usage
def broadcast_to(array, shape, subok=None):
yield array
This is no accident: NumPy’s implementation of dispatch for__array_function__
is fastest when dispatcher functions return a builtin sequence type (tuple
or list
).
On a related note, it’s perfectly fine for dispatchers to return arguments even if in some cases you know that they cannot have an__array_function__
method. This can arise for functions with default arguments (e.g., None
) or complex signatures. NumPy’s dispatching logic sorts out these cases very quickly, so it generally is not worth the trouble of parsing them on your own.
Note
The code for array_function_dispatch
above has been updated from the original version of this NEP to match the actualimplementation in NumPy.
Extensibility#
An important virtue of this approach is that it allows for adding new optional arguments to NumPy functions without breaking code that already relies on __array_function__
.
This is not a theoretical concern. NumPy’s older, haphazard implementation of overrides within functions like np.sum()
necessitated some awkward gymnastics when we decided to add new optional arguments, e.g., the newkeepdims
argument is only passed in cases where it is used:
def sum(array, ..., keepdims=np._NoValue): kwargs = {} if keepdims is not np._NoValue: kwargs['keepdims'] = keepdims return array.sum(..., **kwargs)
For __array_function__
implementers, this also means that it is possible to implement even existing optional arguments incrementally, and only in cases where it makes sense. For example, a library implementing immutable arrays would not be required to explicitly include an unsupported out
argument in the function signature. This can be somewhat onerous to implement properly, e.g.,
def my_sum(array, ..., out=None): if out is not None: raise TypeError('out argument is not supported') ...
We thus avoid encouraging the tempting shortcut of adding catch-all**ignored_kwargs
to the signatures of functions called by NumPy, which fails silently for misspelled or ignored arguments.
Performance#
Performance is always a concern with NumPy, even though NumPy users have already prioritized usability over pure speed with their choice of the Python language itself. It’s important that this new __array_function__
protocol not impose a significant cost in the typical case of NumPy functions acting on NumPy arrays.
Our microbenchmark resultsshow that a pure Python implementation of the override machinery described above adds roughly 2-3 microseconds of overhead to each NumPy function call without any overloaded arguments. For context, typical NumPy functions on small arrays have a runtime of 1-10 microseconds, mostly determined by what fraction of the function’s logic is written in C. For example, one microsecond is about the difference in speed between the ndarray.sum()
method (1.6 us) andnumpy.sum()
function (2.6 us).
Fortunately, we expect significantly less overhead with a C implementation ofimplement_array_function
, which is where the bulk of the runtime is. This would leave the array_function_dispatch
decorator and dispatcher function on their own adding about 0.5 microseconds of overhead, for perhaps ~1 microsecond of overhead in the typical case.
In our view, this level of overhead is reasonable to accept for code written in Python. We’re pretty sure that the vast majority of NumPy users aren’t concerned about performance differences measured in microsecond(s) on NumPy functions, because it’s difficult to do anything in Python in less than a microsecond.
Use outside of NumPy#
Nothing about this protocol that is particular to NumPy itself. Should we encourage use of the same __array_function__
protocol third-party libraries for overloading non-NumPy functions, e.g., for making array-implementation generic functionality in SciPy?
This would offer significant advantages (SciPy wouldn’t need to invent its own dispatch system) and no downsides that we can think of, because every function that dispatches with __array_function__
already needs to be explicitly recognized. Libraries like Dask, CuPy, and Autograd already wrap a limited subset of SciPy functionality (e.g.,scipy.linalg
) similarly to how they wrap NumPy.
If we want to do this, we should expose at least the decoratorarray_function_dispatch()
and possibly also the lower levelimplement_array_function()
as part of NumPy’s public API.
Non-goals#
We are aiming for basic strategy that can be relatively mechanistically applied to almost all functions in NumPy’s API in a relatively short period of time, the development cycle of a single NumPy release.
We hope to get both the __array_function__
protocol and all specific overloads right on the first try, but our explicit aim here is to get something that mostly works (and can be iterated upon), rather than to wait for an optimal implementation. The price of moving fast is that for now this protocol should be considered strictly experimental. We reserve the right to change the details of this protocol and how specific NumPy functions use it at any time in the future – even in otherwise bug-fix only releases of NumPy. In practice, once initial issues with __array_function__
are worked out, we will use abbreviated deprecation cycles as short as a single major NumPy release (e.g., as little as four months).
In particular, we don’t plan to write additional NEPs that list all specific functions to overload, with exactly how they should be overloaded. We will leave this up to the discretion of committers on individual pull requests, trusting that they will surface any controversies for discussion by interested parties.
However, we already know several families of functions that should be explicitly exclude from __array_function__
. These will need their own protocols:
- universal functions, which already have their own protocol.
array
andasarray
, because they are explicitly intended for coercion to actualnumpy.ndarray
object.- dispatch for methods of any kind, e.g., methods on
np.random.RandomState
objects.
We also expect that the mechanism for overriding specific functions that will initially use the __array_function__
protocol can and will change in the future. As a concrete example of how we expect to break behavior in the future, some functions such as np.where
are currently not NumPy universal functions, but conceivably could become universal functions in the future. When/if this happens, we will change such overloads from using __array_function__
to the more specialized __array_ufunc__
.
Backward compatibility#
This proposal does not change existing semantics, except for those arguments that currently have __array_function__
attributes, which should be rare.
Alternatives#
Specialized protocols#
We could (and should) continue to develop protocols like__array_ufunc__
for cohesive subsets of NumPy functionality.
As mentioned above, if this means that some functions that we overload with __array_function__
should switch to a new protocol instead, that is explicitly OK for as long as __array_function__
retains its experimental status.
Switching to a new protocol should use an abbreviated version of NumPy’s normal deprecation cycle:
- For a single major release, after checking for any new protocols, NumPy should still check for
__array_function__
methods that implement the given function. If any argument returns a value other thanNotImplemented
from__array_function__
, a descriptiveFutureWarning
should be issued. - In the next major release, the checks for
__array_function__
will be removed.
Separate namespace#
A separate namespace for overloaded functions is another possibility, either inside or outside of NumPy.
This has the advantage of alleviating any possible concerns about backwards compatibility and would provide the maximum freedom for quick experimentation. In the long term, it would provide a clean abstraction layer, separating NumPy’s high level API from default implementations onnumpy.ndarray
objects.
The downsides are that this would require an explicit opt-in from all existing code, e.g., import numpy.api as np
, and in the long term would result in the maintenance of two separate NumPy APIs. Also, many functions from numpy
itself are already overloaded (but inadequately), so confusion about high vs. low level APIs in NumPy would still persist.
Alternatively, a separate namespace, e.g., numpy.array_only
, could be created for a non-overloaded version of NumPy’s high level API, for cases where performance with NumPy arrays is a critical concern. This has most of the same downsides as the separate namespace.
Multiple dispatch#
An alternative to our suggestion of the __array_function__
protocol would be implementing NumPy’s core functions asmulti-methods. Although one of us wrote a multiple dispatch library for Python, we don’t think this approach makes sense for NumPy in the near term.
The main reason is that NumPy already has a well-proven dispatching mechanism with __array_ufunc__
, based on Python’s own dispatching system for arithmetic, and it would be confusing to add another mechanism that works in a very different way. This would also be more invasive change to NumPy itself, which would need to gain a multiple dispatch implementation.
It is possible that multiple dispatch implementation for NumPy’s high level API could make sense in the future. Fortunately,__array_function__
does not preclude this possibility, because it would be straightforward to write a shim for a default__array_function__
implementation in terms of multiple dispatch.
Implementations in terms of a limited core API#
The internal implementation of some NumPy functions is extremely simple. For example:
np.stack()
is implemented in only a few lines of code by combining indexing withnp.newaxis
,np.concatenate
and theshape
attribute.np.mean()
is implemented internally in terms ofnp.sum()
,np.divide()
,.astype()
and.shape
.
This suggests the possibility of defining a minimal “core” ndarray interface, and relying upon it internally in NumPy to implement the full API. This is an attractive option, because it could significantly reduce the work required for new array implementations.
However, this also comes with several downsides:
- The details of how NumPy implements a high-level function in terms of overloaded functions now becomes an implicit part of NumPy’s public API. For example, refactoring
stack
to usenp.block()
instead ofnp.concatenate()
internally would now become a breaking change. - Array libraries may prefer to implement high level functions differently than NumPy. For example, a library might prefer to implement a fundamental operations like
mean()
directly rather than relying onsum()
followed by division. More generally, it’s not clear yet what exactly qualifies as core functionality, and figuring this out could be a large project. - We don’t yet have an overloading system for attributes and methods on array objects, e.g., for accessing
.dtype
and.shape
. This should be the subject of a future NEP, but until then we should be reluctant to rely on these properties.
Given these concerns, we think it’s valuable to support explicit overloading of nearly every public function in NumPy’s API. This does not preclude the future possibility of rewriting NumPy functions in terms of simplified core functionality with __array_function__
and a protocol and/or base class for ensuring that arrays expose methods and properties like numpy.ndarray
. However, to work well this would require the possibility of implementing_some_ but not all functions with __array_function__
, e.g., as described in the next section.
Partial implementation of NumPy’s API#
With the current design, classes that implement __array_function__
to overload at least one function implicitly declare an intent to implement the entire NumPy API. It’s not possible to implement only np.concatenate()
on a type, but fall back to NumPy’s default behavior of casting with np.asarray()
for all other functions.
This could present a backwards compatibility concern that would discourage libraries from adopting __array_function__
in an incremental fashion. For example, currently most numpy functions will implicitly convert pandas.Series
objects into NumPy arrays, behavior that assuredly many pandas users rely on. If pandas implemented__array_function__
only for np.concatenate
, unrelated NumPy functions like np.nanmean
would suddenly break on pandas objects by raising TypeError.
Even libraries that reimplement most of NumPy’s public API sometimes rely upon using utility functions from NumPy without a wrapper. For example, both CuPy and JAX simply use an alias tonp.result_type
, which already supports duck-types with a dtype
attribute.
With __array_ufunc__
, it’s possible to alleviate this concern by casting all arguments to numpy arrays and re-calling the ufunc, but the heterogeneous function signatures supported by __array_function__
make it impossible to implement this generic fallback behavior for__array_function__
.
We considered three possible ways to resolve this issue, but none were entirely satisfactory:
- Change the meaning of all arguments returning
NotImplemented
from__array_function__
to indicate that all arguments should be coerced to NumPy arrays and the operation should be retried. However, many array libraries (e.g., scipy.sparse) really don’t want implicit conversions to NumPy arrays, and often avoid implementing__array__
for exactly this reason. Implicit conversions can result in silent bugs and performance degradation.
Potentially, we could enable this behavior only for types that implement__array__
, which would resolve the most problematic cases like scipy.sparse. But in practice, a large fraction of classes that present a high level API like NumPy arrays already implement__array__
. This would preclude reliable use of NumPy’s high level API on these objects. - Use another sentinel value of some sort, e.g.,
np.NotImplementedButCoercible
, to indicate that a class implementing part of NumPy’s higher level array API is coercible as a fallback. If all arguments returnNotImplementedButCoercible
, arguments would be coerced and the operation would be retried.
Unfortunately, correct behavior after encounteringNotImplementedButCoercible
is not always obvious. Particularly challenging is the “mixed” case where some arguments returnNotImplementedButCoercible
and others returnNotImplemented
. Would dispatching be retried after only coercing the “coercible” arguments? If so, then conceivably we could end up looping through the dispatching logic an arbitrary number of times. Either way, the dispatching rules would definitely get more complex and harder to reason about. - Allow access to NumPy’s implementation of functions, e.g., in the form of a publicly exposed
__skip_array_function__
attribute on the NumPy functions. This would allow for falling back to NumPy’s implementation by usingfunc.__skip_array_function__
inside__array_function__
methods, and could also potentially be used to be used to avoid the overhead of dispatching. However, it runs the risk of potentially exposing details of NumPy’s implementations for NumPy functions that do not callnp.asarray()
internally. Seethis notefor a summary of the full discussion.
These solutions would solve real use cases, but at the cost of additional complexity. We would like to gain experience with how __array_function__
is actually used before making decisions that would be difficult to roll back.
A magic decorator that inspects type annotations#
In principle, Python 3 type annotations contain sufficient information to automatically create most dispatcher
functions. It would be convenient to use these annotations to dispense with the need for manually writing dispatchers, e.g.,
@array_function_dispatch def broadcast_to(array: ArrayLike shape: Tuple[int, ...], subok: bool = False): ... # existing definition of np.broadcast_to
This would require some form of automatic code generation, either at compile or import time.
We think this is an interesting possible extension to consider in the future. We don’t think it makes sense to do so now, because code generation involves tradeoffs and NumPy’s experience with type annotations is stillquite limited. Even if NumPy was Python 3 only (which will happensometime in 2019), we aren’t ready to annotate NumPy’s codebase directly yet.
Support for implementation-specific arguments#
We could allow __array_function__
implementations to add their own optional keyword arguments by including **ignored_kwargs
in dispatcher functions, e.g.,
def _concatenate_dispatcher(arrays, axis=None, out=None, **ignored_kwargs): ... # same implementation of _concatenate_dispatcher as above
Implementation-specific arguments are somewhat common in libraries that otherwise emulate NumPy’s higher level API (e.g., dask.array.sum()
addssplit_every
and tensorflow.reduce_sum()
adds name
). Supporting them in NumPy would be particularly useful for libraries that implement new high-level array functions on top of NumPy functions, e.g.,
def mean_squared_error(x, y, **kwargs): return np.mean((x - y) ** 2, **kwargs)
Otherwise, we would need separate versions of mean_squared_error
for each array implementation in order to pass implementation-specific arguments tomean()
.
We wouldn’t allow adding optional positional arguments, because these are reserved for future use by NumPy itself, but conflicts between keyword arguments should be relatively rare.
However, this flexibility would come with a cost. In particular, it implicitly adds **kwargs
to the signature for all wrapped NumPy functions without actually including it (because we use functools.wraps
). This means it is unlikely to work well with static analysis tools, which could report invalid arguments. Likewise, there is a price in readability: these optional arguments won’t be included in the docstrings for NumPy functions.
It’s not clear that this tradeoff is worth it, so we propose to leave this out for now. Adding implementation-specific arguments will require using those libraries directly.
Other possible choices for the protocol#
The array function __array_function__
includes only two arguments, func
and types
, that provide information about the context of the function call.
func
is part of the protocol because there is no way to avoid it: implementations need to be able to dispatch by matching a function to NumPy’s public API.
types
is included because we can compute it almost for free as part of collecting __array_function__
implementations to call inimplement_array_function
. We also think it will be used by many __array_function__
methods, which otherwise would need to extract this information themselves. It would be equivalently easy to provide single instances of each type, but providing only types seemed cleaner.
Taking this even further, it was suggested that __array_function__
should be a classmethod
. We agree that it would be a little cleaner to remove the redundant self
argument, but feel that this minor clean-up would not be worth breaking from the precedence of __array_ufunc__
.
There are two other arguments that we think might be important to pass to__array_ufunc__
implementations:
- Access to the non-dispatched implementation (i.e., before wrapping with
array_function_dispatch
) inndarray.__array_function__
would allow us to drop special case logic for that method fromimplement_array_function
. - Access to the
dispatcher
function passed intoarray_function_dispatch()
would allow__array_function__
implementations to determine the list of “array-like” arguments in a generic way by callingdispatcher(*args, **kwargs)
. This could be useful for__array_function__
implementations that dispatch based on the value of an array attribute (e.g.,dtype
orunits
) rather than directly on the array type.
We have left these out for now, because we don’t know that they are necessary. If we want to include them in the future, the easiest way to do so would be to update the array_function_dispatch
decorator to add them as function attributes.
Callable objects generated at runtime#
NumPy has some APIs that define callable objects dynamically, such asvectorize
and methods on random.RandomState
object. Examples can also be found in other core libraries in the scientific Python stack, e.g., distribution objects in scipy.stats and model objects in scikit-learn. It would be nice to be able to write overloads for such callables, too. This presents a challenge for the __array_function__
protocol, because unlike the case for functions there is no public object in the numpy
namespace to pass into the func
argument.
We could potentially handle this by establishing an alternative convention for how the func
argument could be inspected, e.g., by usingfunc.__self__
to obtain the class object and func.__func__
to return the unbound function object. However, some caution is in order, because this would immesh what are currently implementation details as a permanent features of the interface, such as the fact that vectorize
is implemented as a class rather than closure, or whether a method is implemented directly or using a descriptor.
Given the complexity and the limited use cases, we are also deferring on this issue for now, but we are confident that __array_function__
could be expanded to accommodate these use cases in the future if need be.
Discussion#
Various alternatives to this proposal were discussed in a few GitHub issues:
Additionally it was the subject of a blogpost. Following this it was discussed at a NumPy developer sprint at the UC Berkeley Institute for Data Science (BIDS).
Detailed discussion of this proposal itself can be found on thethe mailing list and relevant pull requests (1,2,3)
Copyright#
This document has been placed in the public domain.