[Python-Dev] Expose the array interface in Python 2.5?

Travis E. Oliphant oliphant.travis at ieee.org
Fri Mar 17 11:40:38 CET 2006


Nick Coghlan wrote:

Travis E. Oliphant wrote:

Would it be possible to add at least the C-struct array interface to the Python arrayobject in time for Python 2.5?

Do you mean simply adding an arrayshape attribute that consists of a tuple with the array length, and an arraytype attribute set to 'O'? Or trying to expose the array object's data?

I was thinking more of __array_struct__ (in particular, the C-structure that defines it).

The former seems fairly pointless, and the latter difficult (since it has implications for moving the data store when the array gets resized).

Sure, it's the same problem as exposing the data through the buffer protocol. Since we already have that problem, why pretend we don't?

I've spent a fair bit of time looking at this interface, and while I'm a big fan of the basic idea, I'm not convinced that it makes sense to include the interface in the core without also adopting a common convention for multi-dimensional fixed shape indexing (e.g. by introducing a simple dimensioned array type as something like array.dimarray).

True, such a thing would be great, but it could also be written in Python fairly quickly building on top of the array and serve as a simple example.
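A minimal sketch of what such a pure-Python type could look like, assuming row-major storage in a single array.array. The class name DimArray and everything about its interface are hypothetical; this is not an existing or proposed standard-library API:

```python
from array import array


class DimArray:
    """Hypothetical fixed-shape N-dimensional array built on array.array."""

    def __init__(self, typecode, shape, fill=0):
        self.shape = tuple(shape)
        # Row-major strides, counted in elements rather than bytes.
        strides = []
        step = 1
        for extent in reversed(self.shape):
            strides.append(step)
            step *= extent
        self._strides = tuple(reversed(strides))
        # One flat allocation holds every element; step is now the total size.
        self._data = array(typecode, [fill]) * step

    def _offset(self, index):
        if not isinstance(index, tuple):
            index = (index,)
        if len(index) != len(self.shape):
            raise IndexError("wrong number of indices")
        offset = 0
        for i, extent, stride in zip(index, self.shape, self._strides):
            if not 0 <= i < extent:
                raise IndexError("index out of range")
            offset += i * stride
        return offset

    def __getitem__(self, index):
        return self._data[self._offset(index)]

    def __setitem__(self, index, value):
        self._data[self._offset(index)] = value


grid = DimArray("d", (2, 3))
grid[1, 2] = 7.5   # stored at flat position 1 * 3 + 2
```

Only integer indexing is covered here; slicing, views, and broadcasting are deliberately left out.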

My big quest is to get PIL, PyVox, WxPython, PyOpenGL, and so forth to use the same interface. Blessing the interface by including it in the Python core would help. I also want people in py-dev to get the concept of an array interface on their radar as discussions of new bytes types emerge.

Sometimes, there is not enough cross-talk between numpy-discussions and pydev. This is our fault, of course, but we're often swamped (I know I am...), and it can take some effort for us "array" people to figure out what's going on in the depths of Python sufficiently to comprehend some of the discussions here.

The fact that array.array is a mutable sequence rather than a fixed shape array means that it doesn't mesh particularly well with the ideas behind the array interface. numpy arrays can have their shape changed via reshape, but they impose the rule that the total number of elements can't change so that the allocated memory doesn't need to be moved - the standard library's array type has no such limitation.

This is not really a limitation of numpy arrays either. Check the resize method... But I understand your point that array.array objects are more like lists. Of course, when they behave that way, their buffer interface is presently broken. So maybe array.array is sufficiently broken not to be worth "fixing", but what else should be done?
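For readers on a current Python 3, the conflict between resizing and an exported buffer can be seen directly. This sketch uses memoryview, which postdates this thread, and shows today's behavior: array.array refuses to resize while a buffer is exported. It does not show the behavior being discussed for 2.5:

```python
from array import array

values = array("d", [1.0, 2.0, 3.0])
view = memoryview(values)   # export the array's memory via the buffer protocol

try:
    values.append(4.0)      # growing could move the data store
    resized = True
except BufferError:
    resized = False         # the resize is refused while the export is alive

view.release()              # drop the export
values.append(4.0)          # now the array may grow again
```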

I'm kind of tired of this problem dragging on and on. The Numeric header (essentially what __array_struct__ exposes) has now been basically unchanged for over 10 years, and yet its direct support by Python is still not there. The Python community has been very helpful over the years, but we need more direct discussion with Python developers to help things along. I'm grateful Nick has responded. If anyone else has any interest in these ideas, please sound off.

Aside from the obvious (the use of Ellipsis and permitting multiple dimensions), there are a number of ways in which the semantics of numpy array subscripts differ from normal sequence subscripts, and which of these should be part of the common multi-dimensional indexing conventions needs to be thrashed out in a PEP:

These are interesting academic issues, but the problem with most of these comments is that you will get loud voices of disapproval if any of these conventions changes significantly from what has become standard through Numeric's use over 10 years.

I think no one is up to the task of trying to reconcile Numeric behavior with Python-dev opinions of what 'ought' to be, unless the basic usage does not change too much.

- numpy array slices are views that permit mutation of the original object (slicing a sequence creates a copy of the sliced section)

Not really open for discussion among Numeric Python users, as it's been debated for years, always coming to the same conclusion (keep the current behavior).

- assignment to slices is not allowed to change the shape of a numpy array (assigning to a slice of a normal sequence may change the total length)

People might be open to this idea, as it adds a new feature and doesn't significantly change other usages.

- deletion of slices is not permitted by numpy arrays (deleting a slice of a sequence changes the total length)

Also something people might accept.

- NewAxis is a novel use of subscript notation

True, but not something we can really change.

- there are sophisticated rules to try to align numpy array shapes

You are speaking of broadcasting. These rules could of course be discussed, but the current behavior is "entrenched".

- assignment of a sequence to a numpy array section is rather disconcerting, as the checks to determine what should and should not be repeated to fit into the available space are type based

I'm not sure what this means... Please elaborate.
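The first three items in the list above (views versus copies, and the ban on slice operations that change size) can be sketched with the standard library alone. Here a memoryview stands in for a numpy array slice, which is an assumption made for illustration; numpy itself is not used, and memoryview postdates this thread:

```python
from array import array

# Ordinary sequence semantics: a slice is an independent copy.
seq = array("i", [0, 1, 2, 3, 4, 5])
copied = seq[1:4]
copied[0] = 99              # seq is unaffected

# View semantics: a memoryview slice shares memory with seq.
view = memoryview(seq)[1:4]
view[0] = 42                # writes through to seq[1]

# A view cannot change the size of the underlying memory.
try:
    view[0:2] = array("i", [7, 8, 9])   # three items into a two-item slice
    size_changed = True
except ValueError:
    size_changed = False

try:
    del view[0:1]           # deleting part of a view is not permitted
    deleted = True
except TypeError:
    deleted = False

# A list, by contrast, lets slice assignment and deletion change its length.
items = [0, 1, 2, 3, 4, 5]
items[1:3] = [9]            # two items replaced by one
del items[0:2]
```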

For something in the standard library, much of the complexity should be stripped out, with the clever bits of programmer convenience left for numpy to provide. However, deciding which bits to remove and which to keep is a non-trivial task.

I agree. I suppose your itemization above was really meant to lead to this conclusion as well. But I think a stripped-down array that doesn't try to guess what to do with these interfaces is a good start. In other words, I disagree that you need to implement multidimensional indexing in order for Python to support the array interface. All you need is a simple object that supports the buffer protocol, has the __array_struct__ attribute, and has a C-structure very similar to the current NumPy array (which is very similar to the old Numeric C-structure).

If such a thing were in Python, then NumPy could inherit from it (as could other array-like objects), with the big advantage that there is at least one common memory model for arrays. Others could still exist, of course, but at least there would be a very useful common one.
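For reference, the C-structure in question can be mirrored with ctypes. The field list below follows the layout NumPy documents for __array_struct__; the message above does not spell it out, so treat it as an assumption, along with the use of c_ssize_t for Py_intptr_t. The describe helper is hypothetical and handles only a one-dimensional array.array:

```python
import ctypes
from array import array


class ArrayInterface(ctypes.Structure):
    """ctypes mirror of the struct NumPy documents for __array_struct__."""
    _fields_ = [
        ("two", ctypes.c_int),        # always 2, a sanity check
        ("nd", ctypes.c_int),         # number of dimensions
        ("typekind", ctypes.c_char),  # element kind: b"i", b"u", b"f", ...
        ("itemsize", ctypes.c_int),   # bytes per element
        ("flags", ctypes.c_int),      # how the data should be interpreted
        ("shape", ctypes.POINTER(ctypes.c_ssize_t)),    # nd extents
        ("strides", ctypes.POINTER(ctypes.c_ssize_t)),  # nd byte strides
        ("data", ctypes.c_void_p),    # pointer to the first element
        ("descr", ctypes.c_void_p),   # optional description object, or NULL
    ]


def describe(arr):
    """Hypothetical helper: fill the struct for a 1-D array.array."""
    if arr.typecode in "fd":
        kind = b"f"
    elif arr.typecode in "bhilq":
        kind = b"i"
    elif arr.typecode in "BHILQ":
        kind = b"u"
    else:
        raise ValueError("unsupported typecode: " + arr.typecode)
    address, length = arr.buffer_info()
    info = ArrayInterface()
    info.two = 2
    info.nd = 1
    info.typekind = kind
    info.itemsize = arr.itemsize
    info.flags = 0                    # flag bits left unset in this sketch
    info.shape = (ctypes.c_ssize_t * 1)(length)
    info.strides = (ctypes.c_ssize_t * 1)(arr.itemsize)
    info.data = address
    info.descr = None
    return info


samples = array("d", [1.0, 2.0, 3.0])
info = describe(samples)
```

The struct only describes memory that samples still owns, so it is valid only while samples is alive and is not resized.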

Given that even the bytes type has been deferred to 2.6 to allow further consideration of the appropriate API, my vote is to do the same for an array.dimarray type and allow more time to figure out the appropriate Python interface.

I was afraid of that. But unless people in pydev actually care to discuss these matters, I fear that yet again nothing will be done. The problem is that for most of us array users, it's only community outreach and a desire to get people using Python talking the same array language that makes us really care about these things. The NumPy library works fine for what we really need it to do, and it's hard to get motivated to explain the reasons for NumPy's behavior to people who haven't used an array language like IDL or MATLAB in the past.

The big difference from the bytes type is that Numeric has 10 years of history behind it. There is a lot of experience with an appropriate array type. It's not like we just came up with this a few days ago :-)

As the bytes type is developed, please keep in mind its use as the memory for an N-dimensional array. Perhaps the bytes object could be a default way (or built on a default way) to allocate memory. A simple reference-counted memory object would certainly alleviate the buffer-interface problems that the array object currently has.

In other words, the array object should not malloc its own memory but create a memory object which is nothing more than a reference-counted pointer to memory. Surely this has been talked about. Is there a reason it has not been implemented? It would not be that hard.

Even something like that would be a first step.
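A rough Python-level sketch of that idea, assuming a bytearray as the raw allocation and using memoryview (which did not exist when this was written) for typed access. MemoryBlock and FixedArray are hypothetical names; a real version would live in C:

```python
import struct


class MemoryBlock:
    """Hypothetical owner of a raw allocation, freed with its last reference."""

    def __init__(self, nbytes):
        self.buf = bytearray(nbytes)


class FixedArray:
    """Hypothetical fixed-shape array that borrows storage from a MemoryBlock."""

    def __init__(self, block, fmt, shape):
        self.block = block            # this reference keeps the memory alive
        self.shape = tuple(shape)
        self.data = memoryview(block.buf).cast(fmt, self.shape)


block = MemoryBlock(6 * struct.calcsize("d"))
grid = FixedArray(block, "d", (2, 3))   # 2 x 3 interpretation
flat = FixedArray(block, "d", (6,))     # flat interpretation of the same bytes
del block                               # the arrays still hold the block

flat.data[5] = 5.0                      # visible through grid as element [1, 2]
```

Because neither array owns the allocation, none of them can move it out from under the other, which is the property the buffer interface needs.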

Thanks for the comments. I'm glad there is another voice here that cares about the issues involved.

-Travis

Regards, Nick.


