[Numpy-discussion] Proposal to accept NEP-18, __array_function__ protocol
Charles R Harris charlesr.harris at gmail.com
Mon Aug 20 11:26:21 EDT 2018
- Previous message (by thread): [Numpy-discussion] Proposal to accept NEP-18, __array_function__ protocol
- Next message (by thread): [Numpy-discussion] Proposal to accept NEP-18, __array_function__ protocol
- Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]
Ping to finish up this discussion so we can come to a conclusion. I'm in favor of the NEP, as I don't see it as orthogonal to Nathaniel's concerns. However, we might want to be selective as to which functions we expose via the __array_function__ method.
On Wed, Aug 15, 2018 at 10:45 AM, Stephan Hoyer <shoyer at gmail.com> wrote:
Nathaniel,
Thanks for raising these thoughtful concerns. Your independent review of this proposal is greatly appreciated! See my responses inline below.

On Mon, Aug 13, 2018 at 2:44 AM Nathaniel Smith <njs at pobox.com> wrote:
The other approach would be to incrementally add clean, well-defined dunder methods like __array_ufunc__, __array_concatenate__, etc. This way we end up putting some thought into each interface, making sure that it's something we can support, protecting downstream libraries from unnecessary complexity (e.g. they can implement __array_concatenate__ instead of hstack, vstack, row_stack, column_stack, ...), or avoiding adding new APIs entirely (e.g., by converting existing functions into ufuncs so __array_ufunc__ starts automagically working). And in the end we get a clean list of dunder methods that new array container implementations have to define.

It's plausible to imagine a generic test suite for array containers. (I suspect that every library that tries to implement __array_function__ will end up with accidental behavioral differences, just because the numpy API is so vast and contains so many corner cases.) So the clean-well-defined-dunders approach has lots of upsides. The big downside is that this is a much longer road to go down.
RE: accidental differences in behavior: I actually think that the __array_function__ approach is less prone to accidental differences in behavior, because we require implementing every function directly (or it raises an error). This avoids a classic subclassing problem that has plagued NumPy for years, where overriding the behavior of method A causes an apparently unrelated method B to break, because it relied on method A internally. In NumPy, this constrained our implementation of np.median(), because it needed to call np.mean() in order for subclasses implementing units to work properly.

There will certainly be accidental differences in behavior for third-party code that uses NumPy, but this is basically inevitable for any proposal to allow NumPy's public API to be overloaded. It's also avoided by default by third-party libraries that follow the current best practice of casting all input arrays with np.asarray().

--------------

RE: a hypothetical simplified interface: The need to implement everything you want to use in NumPy's public API could certainly be onerous, but on the other hand there is a long list of projects that have already done this today -- and these are the projects that most need __array_function__. I'm sure there are cases where simplification would be warranted, but in particular I don't think __array_concatenate__ has significant advantages over simply implementing __array_function__ for np.concatenate. It's a slightly different way of spelling it, but it basically does the same thing. The level of complexity needed to implement hstack, vstack, row_stack and column_stack in terms of np.concatenate is pretty minimal. __array_function__ implementors could easily copy and paste code from NumPy or use a third-party helper library (like NDArrayOperatorsMixin) that provides such implementations.
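As a rough sketch of the point above (the class name MyArray and the implements registry helper are illustrative, loosely following the example style in NEP-18, not a definitive implementation), hstack really is only a few lines once a concatenate handler exists:

```python
import numpy as np

HANDLED_FUNCTIONS = {}

def implements(numpy_function):
    """Register a MyArray implementation of a NumPy function."""
    def decorator(func):
        HANDLED_FUNCTIONS[numpy_function] = func
        return func
    return decorator

class MyArray:
    def __init__(self, value):
        self.value = np.asarray(value)

    def __array_function__(self, func, types, args, kwargs):
        if func not in HANDLED_FUNCTIONS:
            return NotImplemented
        # Defer to other implementations unless we recognize every type.
        if not all(issubclass(t, MyArray) for t in types):
            return NotImplemented
        return HANDLED_FUNCTIONS[func](*args, **kwargs)

@implements(np.concatenate)
def concatenate(arrays, axis=0):
    values = [np.asarray(a.value) for a in arrays]
    return MyArray(np.concatenate(values, axis=axis))

@implements(np.hstack)
def hstack(tup):
    # hstack in terms of concatenate, mirroring NumPy's own
    # implementation: 1-D inputs stack along axis 0, others along axis 1.
    arrays = [np.atleast_1d(a.value) for a in tup]
    axis = 0 if arrays[0].ndim == 1 else 1
    return MyArray(np.concatenate(arrays, axis=axis))
```

With NumPy 1.17+, where the protocol is enabled by default, calling np.hstack([MyArray([[1, 2]]), MyArray([[3, 4]])]) dispatches straight to the handler above.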
I also have other concerns about the "simplified API" approach beyond the difficulty of figuring it out, but those are already mentioned in the NEP: http://www.numpy.org/neps/nep-0018-array-function-protocol.html#implementations-in-terms-of-a-limited-core-api

But... this is wishful thinking. No matter what the NEP says, I simply don't believe that we'll actually go break dask, sparse arrays, xarray, and sklearn in a numpy point release. Or any numpy release. Nor should we. If we're serious about keeping this experimental – and I think that's an excellent idea for now! – then IMO we need to do something more to avoid getting trapped by backwards compatibility.

I agree, but to be clear, development for dask, sparse and xarray (and even broadly supported machine learning libraries like TensorFlow) still happens at a much faster pace than is currently the case for "core" projects in the SciPy stack like NumPy. It would not be a big deal to encounter breaking changes in a "major" NumPy release (i.e., 1.X -> 1.(X+1)).

(Side note: sklearn doesn't directly implement any array types, so I don't think it would make use of __array_function__ in any way, except possibly to implement overloadable functions.)
Here is Travis Oliphant's talk at PyBay <https://speakerdeck.com/teoliphant/ml-in-python>, where he talks about the proliferation of arrays and interfaces in the ML/AI ecosystem among other things. I think that we should definitely try to get NumPy out there as an option in the near future.
My suggestion: at numpy import time, check for an envvar, like say NUMPY_EXPERIMENTAL_ARRAY_FUNCTION=1. If it's not set, then all the __array_function__ dispatches turn into no-ops. This lets interested downstream libraries and users try this out, but makes sure that we won't have a hundred thousand end users depending on it without realizing.
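A minimal sketch of that opt-in (the decorator and variable names here are invented for illustration, not NumPy's actual internals): read the flag once at import time, and hand back the undecorated implementation when it is unset, so disabled dispatch is a true no-op with zero per-call cost:

```python
import os

# Read the flag once at import time; checking os.environ on every call
# would itself cost on the order of a microsecond per dispatch.
_ENABLED = os.environ.get('NUMPY_EXPERIMENTAL_ARRAY_FUNCTION', '0') == '1'

def array_function_dispatch(dispatcher):
    """Hypothetical decorator wiring a function up to __array_function__."""
    def decorator(implementation):
        if not _ENABLED:
            # Opt-out path: no wrapper at all, so zero per-call overhead.
            return implementation

        def public_api(*args, **kwargs):
            relevant_args = dispatcher(*args, **kwargs)
            types = frozenset(type(arg) for arg in relevant_args)
            for arg in relevant_args:
                method = getattr(type(arg), '__array_function__', None)
                if method is not None:
                    result = method(arg, public_api, types, args, kwargs)
                    if result is not NotImplemented:
                        return result
            return implementation(*args, **kwargs)

        return public_api
    return decorator

# Toy overloadable function for demonstration purposes only.
@array_function_dispatch(lambda arrays: arrays)
def concatenate(arrays):
    return [x for arr in arrays for x in arr]
```

Either way the envvar is set, plain Python sequences fall through to the default implementation; the difference is only whether any dispatch machinery runs at all.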
- makes it easy for end-users to check how much overhead this adds (by running their code with it enabled vs disabled)
- if/when we decide to commit to supporting it for real, we just remove the envvar.

I'm slightly concerned that the cost of reading an environment variable with os.environ could exaggerate the performance cost of __array_function__. It takes about 1 microsecond to read an environment variable on my laptop, which is comparable to the full overhead of __array_function__. So we may want to switch to an explicit Python API instead, e.g., np.enable_experimental_array_function().

My bigger concern is when/how we decide to graduate __array_function__ from requiring an explicit opt-in. We don't need to make a final decision now, but it would be good to be clear about what specifically we are waiting for. I see three likely scenarios for changing __array_function__:

1. We decide that overloading the NumPy namespace in general is a bad idea, based on either performance or predictability consequences for third-party libraries. In this case, I imagine we would probably keep __array_function__, but revert to a separate namespace for explicitly overloaded functions, e.g., numpy.api.
2. We want to keep __array_function__, but need a breaking change to the interface (and we're really attached to keeping the name __array_function__).
3. We decide that specific functions should use a different interface (e.g., switch from __array_function__ to __array_ufunc__).

(1) and (2) are the sort of major concerns that in my mind would warrant hiding a feature behind an experimental flag. For the most part, I expect (1) could be resolved relatively quickly by running benchmark suites after we have a working version of __array_function__. To be honest, I don't see either of these rollback scenarios as terribly likely, but the downside risk is large enough that we may want to protect ourselves for a major release or two (6-12 months).
(3) will be a much longer process, likely to stretch out over years at the current pace of NumPy development. I don't think we'll want to keep an opt-in flag for that long a period. Rather, we may want to accept a shorter deprecation cycle than usual. In most cases, I suspect we could incrementally switch to new overloads while preserving the __array_function__ overload for a release or two.

I don't really understand the 'types' frozenset. The NEP says "it will be used by most __array_function__ methods, which otherwise would need to extract this information themselves"... but they still need to extract the information themselves, because they still have to examine each object and figure out what type it is. And, simply creating a frozenset costs ~0.2 µs on my laptop, which is overhead that we can't possibly optimize later... The most flexible alternative would be to just say that we provide a fixed-length iterable, and return a tuple object. (In my microbenchmarks, it's faster to make a tuple than a list or set.)

In an early draft of the NEP, I proposed exactly this, but the speed difference seemed really marginal to me. I included 'types' in the interface because I really do think it's something that almost all __array_function__ implementations should use. It preserves a nice separation of concerns between dispatching logic and implementations for a new type. At least as long as __array_function__ is experimental, I don't think we should be encouraging people to write functions that could return NotImplemented directly and to rely entirely on the NumPy interface.

Many but not all implementations will need to look at argument types. This is only really essential for cases where mixed operations between NumPy arrays and another type are allowed. If you only implement the NumPy interface for MyArray objects, then in the usual Python style you wouldn't need isinstance checks. It's also important from an ecosystem perspective.
If we don't make it easy to get type information, my guess is that many __array_function__ authors wouldn't bother to return NotImplemented for unexpected types, which means that __array_function__ will break in weird ways when used with objects from unrelated libraries.
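To illustrate (the class names here are invented for the example), 'types' lets a defensive __array_function__ defer in a single check, without walking args itself, so mixing with an unrelated array type raises TypeError instead of silently misbehaving:

```python
import numpy as np

class Checked:
    """Wrapper array type that defers on any type it doesn't recognize."""
    def __init__(self, value):
        self.value = np.asarray(value)

    def __array_function__(self, func, types, args, kwargs):
        # 'types' already holds the set of argument types implementing
        # __array_function__, so one line covers every argument.
        if not all(issubclass(t, (Checked, np.ndarray)) for t in types):
            return NotImplemented
        unwrapped = tuple(a.value if isinstance(a, Checked) else a
                          for a in args)
        return func(*unwrapped, **kwargs)

class Unrelated:
    """Stand-in for an array type from an unrelated library."""
    def __array_function__(self, func, types, args, kwargs):
        return NotImplemented
```

With NumPy 1.17+, np.mean(Checked([1, 2, 3])) falls through to NumPy's own implementation, while np.concatenate([Checked([1]), Unrelated()]) makes every handler defer, so NumPy raises TypeError.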
Chuck