[Python-Dev] PEP 3118: Extended buffer protocol (new version)
Travis Oliphant oliphant.travis at ieee.org
Thu Apr 19 06:40:28 CEST 2007
Carl Banks wrote:
> Ok, I've thought quite a bit about this, and I have an idea that I think will be ok with you, and I'll be able to drop my main objection. It's not a big change, either. The key is to explicitly say whether the flag allows or requires. But I made a few other changes as well.

I'm good with using an identifier to differentiate between an "allow" flag and a "require" flag. I'm not a big fan of VERY_LONG_IDENTIFIER_NAMES, though. A name should be just long enough to convey what it means, but not so long that it takes forever to type and uses up horizontal real estate.
We use flags in NumPy quite a bit, and I'm obviously trying to adapt some of this to the general case here, but I'm biased by my 10 years of experience with the way I think about NumPy arrays.
Thanks for helping out and offering your fresh approach. I like a lot of what you've come up with. There are a few modifications I would make, though.
> First of all, let me define how I'm using the word "contiguous": it's a single buffer with no gaps. So, if you were to do this: "memset(bufinfo->buf,0,bufinfo->len)", you would not touch any data that isn't being exported.
Sure, though we call this NPY_ONESEGMENT in NumPy-speak, because contiguous could mean NPY_C_CONTIGUOUS or NPY_F_CONTIGUOUS. We also don't use the terms ROW_MAJOR and COLUMN_MAJOR, so I'm not a big fan of bringing them into the Python space: the NumPy community has already learned the C_ and F_ terminology, which also generalizes to multiple dimensions more clearly because it doesn't rely on 2-d concepts.
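The difference between the C_ and F_ layouts for any number of dimensions can be sketched as below. This is only an illustration: the helper names are hypothetical (they are not part of NumPy or of the PEP), strides are assumed to be in bytes, and every dimension is assumed to have length at least 1.

```c
#include <stddef.h>

/* Hypothetical helpers for illustration only.
   shape[] and strides[] each have ndim entries; strides are in bytes.
   Assumes every shape[i] >= 1. */

/* C-contiguous: the LAST index varies fastest in memory. */
int sketch_is_c_contiguous(int ndim, const ptrdiff_t *shape,
                           const ptrdiff_t *strides, ptrdiff_t itemsize)
{
    ptrdiff_t expected = itemsize;
    for (int i = ndim - 1; i >= 0; i--) {
        /* A dimension of length 1 is never stepped over, so its stride is irrelevant. */
        if (shape[i] != 1 && strides[i] != expected)
            return 0;
        expected *= shape[i];
    }
    return 1;
}

/* F-contiguous: the FIRST index varies fastest in memory. */
int sketch_is_f_contiguous(int ndim, const ptrdiff_t *shape,
                           const ptrdiff_t *strides, ptrdiff_t itemsize)
{
    ptrdiff_t expected = itemsize;
    for (int i = 0; i < ndim; i++) {
        if (shape[i] != 1 && strides[i] != expected)
            return 0;
        expected *= shape[i];
    }
    return 1;
}
```

For a 2x3 array of 8-byte items, strides of {24, 8} satisfy only the C check and strides of {8, 16} satisfy only the F check, while a gap-free one-dimensional buffer satisfies both.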
> Without further ado, here is my proposal:
>
> ------
>
> With no flags, PyObjectGetBuffer will raise an exception if the buffer is not direct, contiguous, and one-dimensional. Here are the flags and how they affect that:
I'm not sure what you mean by "direct" here. But, this looks like the
Py_BUF_SIMPLE case (which was a named-constant for 0) in my proposal.
The exporter receiving no flags would need to return a simple buffer
(and it wouldn't need to fill in the format character either ---
valuable information for the exporter to know).
> PyBUFREQUIREWRITABLE - Raise exception if the buffer isn't writable.

WRITEABLE is an alternative spelling and the one that NumPy uses. So, either include both of these as alternatives or just use WRITEABLE.

> PyBUFREQUIREREADONLY - Raise exception if the buffer is writable.

Or if the object's memory can't be made read-only when it is writeable.

> PyBUFALLOWNONCONTIGUOUS - Allow noncontiguous buffers. (This turns on "shape" and "strides".)

Fine.

> PyBUFALLOWMULTIDIMENSIONAL - Allow multidimensional buffers. (Also turns on "shape" and "strides".)

Just use ND instead of MULTIDIMENSIONAL and only turn on shape if it is present.

> (Neither of the above two flags implies the other.)
>
> PyBUFALLOWINDIRECT - Allow indirect buffers. Implies PyBUFALLOWNONCONTIGUOUS and PyBUFALLOWMULTIDIMENSIONAL. (Turns on "shape", "strides", and "suboffsets".)

If we go with this consumer-oriented naming scheme, I like indirect also.

> PyBUFREQUIRECONTIGUOUSCARRAY or PyBUFREQUIREROWMAJOR - Raise an exception if the array isn't a contiguous array in C (row-major) format.
>
> PyBUFREQUIRECONTIGUOUSFORTRANARRAY or PyBUFREQUIRECOLUMNMAJOR - Raise an exception if the array isn't a contiguous array in Fortran (column-major) format.

Just name them C_CONTIGUOUS and F_CONTIGUOUS like in NumPy.

> PyBUFALLOWNONCONTIGUOUS, PyBUFREQUIRECONTIGUOUSCARRAY, and PyBUFREQUIRECONTIGUOUSFORTRANARRAY all conflict with each other, and an exception should be raised if more than one is set.
>
> (I would go with ROWMAJOR and COLUMNMAJOR: even though the terms only make sense for 2D arrays, I believe the terms are commonly generalized to other dimensions.)

As I mentioned, there is already a well-established history with NumPy.
We've dealt with this issue already.

> Possible pseudo-flags:
>
>     PyBUFSIMPLE = 0;
>     PyBUFALLOWSTRIDED = PyBUFALLOWNONCONTIGUOUS | PyBUFALLOWMULTIDIMENSIONAL;
>
> ------
>
> Now, for each flag, there should be an associated function to test the condition, given a bufferinfo struct. (Though I suppose they don't necessarily have to map one-to-one, I'll do that here.)
>
>     int PyBufferInfoIsReadonly(struct bufferinfo*);
>     int PyBufferInfoIsWritable(struct bufferinfo*);
>     int PyBufferInfoIsContiguous(struct bufferinfo*);
>     int PyBufferInfoIsMultidimensional(struct bufferinfo*);
>     int PyBufferInfoIsIndirect(struct bufferinfo*);
>     int PyBufferInfoIsRowMajor(struct bufferinfo*);
>     int PyBufferInfoIsColumnMajor(struct bufferinfo*);
>
> The function PyObjectGetBuffer then has a pretty obvious implementation. Here is an excerpt:
>
>     if ((flags & PyBUFREQUIREREADONLY) && !PyBufferInfoIsReadonly(&bufinfo)) {
>         PyExcSetString(PyErrBufferError, "buffer not read-only");
>         return 0;
>     }
>
> Pretty straightforward, no?
>
> Now, here is a key point: for these functions to work (indeed, for PyObjectGetBuffer to work at all), you need enough information in bufinfo to figure it out. The bufferinfo struct should be self-contained; you should not need to know what flags were passed to PyObjectGetBuffer in order to know exactly what data you're looking at.

Naturally.
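The flag rules discussed so far can be sketched as plain bit tests. This is a rough illustration under stated assumptions: the thread assigns no numeric values, so the bit positions below are invented, the names are hypothetical shortened spellings, and the -1/0 return convention is an arbitrary choice.

```c
#include <stddef.h>

/* Hypothetical flag values; only SIMPLE == 0 and the definition of
   ALLOW_STRIDED as a combination come from the discussion. */
enum {
    SK_SIMPLE               = 0,
    SK_REQUIRE_WRITABLE     = 1 << 0,
    SK_REQUIRE_READONLY     = 1 << 1,
    SK_ALLOW_NONCONTIGUOUS  = 1 << 2,
    SK_ALLOW_ND             = 1 << 3,
    SK_REQUIRE_C_CONTIGUOUS = 1 << 4,
    SK_REQUIRE_F_CONTIGUOUS = 1 << 5,
    SK_ALLOW_STRIDED        = SK_ALLOW_NONCONTIGUOUS | SK_ALLOW_ND
};

/* Returns 1 if at most one of the three mutually exclusive
   layout flags is set, 0 if the request is contradictory. */
int sk_flags_are_consistent(int flags)
{
    int layout = flags & (SK_ALLOW_NONCONTIGUOUS |
                          SK_REQUIRE_C_CONTIGUOUS |
                          SK_REQUIRE_F_CONTIGUOUS);
    /* x & (x - 1) clears the lowest set bit; nonzero means 2+ bits set. */
    return (layout & (layout - 1)) == 0;
}

/* Returns 0 if the buffer's writability satisfies the request,
   -1 otherwise (where a real implementation would raise an exception). */
int sk_check_access(int flags, int buffer_is_readonly)
{
    if ((flags & SK_REQUIRE_READONLY) && !buffer_is_readonly)
        return -1;
    if ((flags & SK_REQUIRE_WRITABLE) && buffer_is_readonly)
        return -1;
    return 0;
}
```

Note that both checks depend only on the flags and on facts recorded about the buffer itself, which is the self-containment property being argued for.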
> Therefore, format must always be supplied by getbuffer. You cannot tell if an array is contiguous without the format string. (But see below.)
No, I don't think this is quite true. You don't need to know what "kind" of data you are looking at if you don't get strides. If you use the SIMPLE interface, then both consumer and exporter know the object is looking at "bytes" which always has an itemsize of 1.
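A minimal sketch of that SIMPLE case, assuming a hypothetical exporter that owns an array of doubles. The struct layout, flag values, and function name are invented for illustration and are not the PEP's definitions; the format code "d" for a C double follows the struct-module convention.

```c
#include <stddef.h>

/* Hypothetical flag values and struct, for illustration only. */
enum { SKB_SIMPLE = 0, SKB_FORMAT = 1 << 0 };

struct skb_bufferinfo {
    void *buf;
    ptrdiff_t len;             /* total length in bytes */
    ptrdiff_t itemsize;
    const char *format;        /* NULL means plain bytes */
    const ptrdiff_t *shape;    /* NULL when not requested */
    const ptrdiff_t *strides;  /* NULL when not requested */
};

/* Hypothetical exporter: fills in only what the consumer asked for. */
void skb_export_doubles(double *data, ptrdiff_t count, int flags,
                        struct skb_bufferinfo *out)
{
    out->buf = data;
    out->len = count * (ptrdiff_t)sizeof(double);
    out->shape = NULL;
    out->strides = NULL;
    if (flags & SKB_FORMAT) {
        /* Consumer wants to know what kind of data this is. */
        out->format = "d";
        out->itemsize = (ptrdiff_t)sizeof(double);
    } else {
        /* SIMPLE request: both sides agree the memory is just bytes. */
        out->format = NULL;
        out->itemsize = 1;
    }
}
```

With no strides in play, buf and len fully describe the exported memory, so a SIMPLE consumer never needs the format to stay inside the buffer.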
> And even if the consumer isn't asking for a contiguous buffer, it has to know the item size so it knows what data not to step on. (This is true even in your own proposal, BTW. If a consumer asks for a non-strided array in your proposal, PyObjectGetBuffer would have to know the item size to determine if the array is contiguous.)

Yes, it is true that getting strides requires that the format be specified as well. That was an oversight of the original proposal.
But, if strides are not needed, then format is also not needed.

> ------
>
> FAQ:
>
> Q. Why ALLOWNONCONTIGUOUS and ALLOWMULTIDIMENSIONAL instead of ALLOWSTRIDED and ALLOWSHAPED?
>
> A. It's more useful to the consumer that way. With ALLOWSTRIDED and ALLOWSHAPED, there's no way for a consumer to request a general one-dimensional array (it can only request a non-strided one-dimensional array), and requesting a SHAPED array but not a STRIDED one can only return a C-like (row-major) array, although a consumer might reasonably want a Fortran-like (column-major) array. This approach maps more directly to the consumer's needs, is more flexible, and still maintains the same functionality of ALLOWSHAPED and ALLOWSTRIDED.
>
> Q. Why call it ALLOWINDIRECT instead of ALLOWOFFSETS?
>
> A. It's just a name, and not too important to me, but I wanted to emphasize the consumer's usage, rather than the benefit to the exporter. The consumers, after all, are the ones setting the flags.
>
> Q. Why ALLOWNONCONTIGUOUS instead of REQUIRECONTIGUOUS?
>
> A. Two reasons: 1. Contiguous arrays are "simpler", so it's better to make the people who want more complex arrays work harder, and 2. ALLOWNONCONTIGUOUS is closely tied to ALLOWMULTIDIMENSIONAL. If the negative is a problem, perhaps a name like ALLOWDISCONTINUOUS or ALLOWGAPS would be better?
>
> Q. What about PyBUFFORMAT?
>
> A. Ok, fine, if it's that important to you. I think it's totally superfluous, but it's not evil. But consider these things:
>
> 1. Require that it does not throw an exception. It's not the exporter's business to tell the consumer how to use its data.

Look, consumers that want to be "in charge" can just ask for format data and ignore it. If an exporter wants to be persnickety about how its data is viewed, then it should be allowed to be. Perhaps it has good reason. It's just a matter of how much "work" it is to get the "wrong" view of the data.

> 2. Even if you don't supply the format string, you need to supply an itemsize in struct bufferinfo, otherwise there is no way for a consumer to determine if the array is contiguous, or to know (in general) what data is being exported. The itemsize must ALWAYS be available.

Itemsize is actually needed only if strides is provided and format isn't. But, we've added the itemsize field anyway.

> 3. Invert PyBUFFORMAT. Use PyBUFDONTNEEDFORMAT instead. Make the consumer that cares about performance ask for the optimization. (You admit yourself that PyBUFFORMAT is part of the least common denominator, so invert it.)

Either way. I think Py_BUF_FORMAT is easier because then Py_BUF_SIMPLE is just a numerical value of 0.
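To illustrate why the item size matters once strides are involved, here is a small sketch that computes how many bytes a strided view spans. The helper is hypothetical (it is not part of the proposal) and assumes byte strides that are non-negative and dimensions of length at least 1.

```c
#include <stddef.h>

/* Hypothetical helper: number of bytes from the start of the buffer
   to one past the end of the last exported element.
   Assumes strides[i] >= 0 and shape[i] >= 1 for every dimension. */
ptrdiff_t sk_strided_extent(int ndim, const ptrdiff_t *shape,
                            const ptrdiff_t *strides, ptrdiff_t itemsize)
{
    ptrdiff_t last_offset = 0;
    for (int i = 0; i < ndim; i++) {
        /* Offset of the final index along dimension i. */
        last_offset += (shape[i] - 1) * strides[i];
    }
    /* The last element itself occupies itemsize bytes. */
    return last_offset + itemsize;
}
```

The same shape and strides give different extents for different item sizes, so strides alone do not tell a consumer which bytes belong to the exporter.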
I'll update the PEP with my adaptation of your suggestions in a little while.
-Travis