[Numpy-discussion] Proposal to accept NEP-18, __array_function__ protocol
Stephan Hoyer shoyer at gmail.com
Wed Aug 15 12:45:10 EDT 2018
- Previous message (by thread): [Numpy-discussion] Proposal to accept NEP-18, __array_function__ protocol
- Next message (by thread): [Numpy-discussion] Proposal to accept NEP-18, __array_function__ protocol
- Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]
Nathaniel,
Thanks for raising these thoughtful concerns. Your independent review of this proposal is greatly appreciated!
See my responses inline below:
On Mon, Aug 13, 2018 at 2:44 AM Nathaniel Smith <njs at pobox.com> wrote:
The other approach would be to incrementally add clean, well-defined dunder methods like __array_ufunc__, __array_concatenate__, etc. This way we end up putting some thought into each interface, making sure that it's something we can support, protecting downstream libraries from unnecessary complexity (e.g. they can implement __array_concatenate__ instead of hstack, vstack, row_stack, column_stack, ...), or avoiding adding new APIs entirely (e.g., by converting existing functions into ufuncs so __array_ufunc__ starts automagically working). And in the end we get a clean list of dunder methods that new array container implementations have to define. It's plausible to imagine a generic test suite for array containers. (I suspect that every library that tries to implement __array_function__ will end up with accidental behavioral differences, just because the numpy API is so vast and contains so many corner cases.) So the clean-well-defined-dunders approach has lots of upsides. The big downside is that this is a much longer road to go down.
RE: accidental differences in behavior:
I actually think that the __array_function__ approach is less prone to accidental differences in behavior, because we require implementing every function directly (or it raises an error).
This avoids a classic subclassing problem that has plagued NumPy for years, where overriding the behavior of method A causes apparently unrelated method B to break, because it relied on method A internally. In NumPy, this constrained our implementation of np.median(), because it needed to call np.mean() in order for subclasses implementing units to work properly.
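To make the contrast concrete, here is a minimal sketch of the NEP-18 protocol (the class name and `implements` registration decorator are hypothetical, loosely following the NEP's example): every overloaded function must be registered explicitly, so an unimplemented function fails loudly instead of silently inheriting behavior through an internal call like np.median's use of np.mean.

```python
import numpy as np

# Hypothetical container illustrating NEP-18 dispatch: every overloaded
# NumPy function must be registered explicitly, and anything missing
# surfaces as a TypeError rather than silently reusing another override.
class MyArray:
    _implementations = {}  # maps NumPy function -> our implementation

    def __init__(self, data):
        self.data = np.asarray(data)

    def __array_function__(self, func, types, args, kwargs):
        if func not in self._implementations:
            return NotImplemented  # NumPy turns this into a TypeError
        return self._implementations[func](*args, **kwargs)

def implements(numpy_function):
    """Register an implementation of a NumPy function for MyArray."""
    def decorator(func):
        MyArray._implementations[numpy_function] = func
        return func
    return decorator

@implements(np.mean)
def _mean(arr, *args, **kwargs):
    return np.mean(arr.data, *args, **kwargs)

# np.mean(MyArray([1, 2, 3])) uses _mean; np.median raises TypeError
# because it was never registered -- no hidden coupling through np.mean.
```

With subclassing, overriding mean silently changes median; here, each override is an explicit entry in the table, so there is no action at a distance.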
There will certainly be accidental differences in behavior for third-party code that uses NumPy, but this is basically inevitable for any proposal that allows NumPy's public API to be overloaded. It's also avoided by default by third-party libraries that follow the current best practice of casting all input arrays with np.asarray().
RE: a hypothetical simplified interface:
The need to implement everything you want to use in NumPy's public API could certainly be onerous, but on the other hand there is a long list of projects that have already done this today -- and these are the projects that most need __array_function__.
I'm sure there are cases where simplification would be warranted, but in particular I don't think __array_concatenate__ has significant advantages over simply implementing __array_function__ for np.concatenate. It's a slightly different way of spelling it, but it basically does the same thing. The level of complexity needed to implement hstack, vstack, row_stack and column_stack in terms of np.concatenate is pretty minimal. __array_function__ implementors could easily copy and paste code from NumPy or use a third-party helper library (like NDArrayOperatorsMixin) that provides such implementations.
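As a rough illustration of how little glue is involved, here are simplified (not feature-complete) stand-ins for vstack and hstack written purely in terms of np.concatenate, in the spirit of NumPy's own definitions:

```python
import numpy as np

# Simplified versions of the stacking helpers, defined entirely via
# np.concatenate -- the kind of glue code an __array_function__
# implementor could copy from NumPy or a helper library.
def vstack(tup):
    arrays = [np.atleast_2d(a) for a in tup]
    return np.concatenate(arrays, axis=0)

def hstack(tup):
    arrays = [np.atleast_1d(a) for a in tup]
    # Match np.hstack: 1-D inputs join along axis 0, otherwise axis 1.
    axis = 0 if arrays and arrays[0].ndim == 1 else 1
    return np.concatenate(arrays, axis=axis)
```

A library that already overloads np.concatenate via __array_function__ gets these functions nearly for free.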
I also have other concerns about the "simplified API" approach beyond the difficulty of figuring it out, but those are already mentioned in the NEP: http://www.numpy.org/neps/nep-0018-array-function-protocol.html#implementations-in-terms-of-a-limited-core-api
But... this is wishful thinking. No matter what the NEP says, I simply don't believe that we'll actually go break dask, sparse arrays, xarray, and sklearn in a numpy point release. Or any numpy release. Nor should we. If we're serious about keeping this experimental – and I think that's an excellent idea for now! – then IMO we need to do something more to avoid getting trapped by backwards compatibility.
I agree, but to be clear, development for dask, sparse and xarray (and even broadly supported machine learning libraries like TensorFlow) still happens at a much faster pace than is currently the case for "core" projects in the SciPy stack like NumPy. It would not be a big deal to encounter breaking changes in a "major" NumPy release (i.e., 1.X -> 1.(X+1)).
(Side note: sklearn doesn't directly implement any array types, so I don't think it would make use of __array_function__ in any way, except possibly to implement overloadable functions.)
My suggestion: at numpy import time, check for an envvar, like say NUMPY_EXPERIMENTAL_ARRAY_FUNCTION=1. If it's not set, then all the __array_function__ dispatches turn into no-ops. This lets interested downstream libraries and users try this out, but makes sure that we won't have a hundred thousand end users depending on it without realizing.
- makes it easy for end-users to check how much overhead this adds (by running their code with it enabled vs disabled)
- if/when we decide to commit to supporting it for real, we just remove the envvar.
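A sketch of what this import-time gate could look like (the helper name and exact semantics here are illustrative, not NumPy's actual internals):

```python
import os

# Illustrative sketch of the proposed opt-in: dispatch logic only runs
# when the user has set the environment variable, so the default path
# is a plain function call with near-zero overhead.
ENABLED = os.environ.get("NUMPY_EXPERIMENTAL_ARRAY_FUNCTION", "0") == "1"

def maybe_dispatch(func, relevant_args, args, kwargs):
    """Try __array_function__ overloads only when opted in."""
    if ENABLED:
        types = tuple({type(a) for a in relevant_args})
        for arg in relevant_args:
            overload = getattr(type(arg), "__array_function__", None)
            if overload is not None:
                result = overload(arg, func, types, args, kwargs)
                if result is not NotImplemented:
                    return result
    # Opt-out (default) path: behave exactly like plain NumPy.
    return func(*args, **kwargs)
```

Because the flag is checked once at import, removing the envvar later is the only change needed to commit to the feature for real.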
My bigger concern is when/how we decide to graduate __array_function__ from requiring an explicit opt-in. We don't need to make a final decision now, but it would be good to be clear about what specifically we are waiting for.
I see three types of likely scenarios for changing __array_function__:
1. We decide that overloading the NumPy namespace in general is a bad idea, based on either performance or predictability consequences for third-party libraries. In this case, I imagine we would probably keep __array_function__, but revert to a separate namespace for explicitly overloaded functions, e.g., numpy.api.
2. We want to keep __array_function__, but need a breaking change to the interface (and we're really attached to keeping the name __array_function__).
3. We decide that specific functions should use a different interface (e.g., switch from __array_function__ to __array_ufunc__).
(1) and (2) are the sort of major concerns that in my mind would warrant hiding a feature behind an experimental flag. For the most part, I expect (1) could be resolved relatively quickly by running benchmark suites after we have a working version of __array_function__. To be honest, I don't see either of these rollback scenarios as terribly likely, but the downside risk is large enough that we may want to protect ourselves for a major release or two (6-12 months).
(3) will be a much longer process, likely to stretch out over years at the current pace of NumPy development. I don't think we'll want to keep an opt-in flag for this long of a period. Rather, we may want to accept a shorter deprecation cycle than usual. In most cases, I suspect we could incrementally switch to new overloads while preserving the __array_function__ overload for a release or two.
I don't really understand the 'types' frozenset. The NEP says "it will be used by most __array_function__ methods, which otherwise would need to extract this information themselves"... but they still need to extract the information themselves, because they still have to examine each object and figure out what type it is. And, simply creating a frozenset costs ~0.2 µs on my laptop, which is overhead that we can't possibly optimize later...
The most flexible alternative would be to just say that we provide a fixed-length iterable, and return a tuple object. (In my microbenchmarks, it's faster to make a tuple than a list or set.) In an early draft of the NEP, I proposed exactly this, but the speed difference seemed really marginal to me.
I included 'types' in the interface because I really do think it's something that almost all __array_function__ implementations should use. It preserves a nice separation of concerns between dispatching logic and implementations for a new type. At least as long as __array_function__ is experimental, I don't think we should be encouraging people to write functions that could return NotImplemented directly and to rely entirely on the NumPy interface.
Many but not all implementations will need to look at argument types. This is only really essential for cases where mixed operations between NumPy arrays and another type are allowed. If you only implement the NumPy interface for MyArray objects, then in the usual Python style you wouldn't need isinstance checks.
It's also important from an ecosystem perspective. If we don't make it easy to get type information, my guess is that many __array_function__ authors wouldn't bother to return NotImplemented for unexpected types, which means that __array_function__ will break in weird ways when used with objects from unrelated libraries.
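For illustration, here is a hedged sketch of that recommended pattern (the class name is hypothetical, and nested containers are not handled): an __array_function__ that uses the precomputed types argument to reject unknown types up front, then unwraps its own arrays and defers to NumPy.

```python
import numpy as np

# Hypothetical wrapper showing the recommended use of 'types': check the
# precomputed set of argument types first, return NotImplemented on any
# unrecognized type (so unrelated libraries fail loudly, not weirdly),
# then unwrap and defer to NumPy's implementation.
class Wrapped:
    def __init__(self, data):
        self.data = np.asarray(data)

    def __array_function__(self, func, types, args, kwargs):
        # 'types' already holds the distinct argument types, so no second
        # pass over args is needed just to decide whether to handle the call.
        if not all(issubclass(t, (Wrapped, np.ndarray)) for t in types):
            return NotImplemented
        unwrapped = tuple(a.data if isinstance(a, Wrapped) else a
                          for a in args)
        return func(*unwrapped, **kwargs)
```

The type check and the per-function implementation stay cleanly separated, which is the separation of concerns the paragraph above argues for.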
Cheers,
Stephan