Issue 15573: Support unknown formats in memoryview comparisons (original) (raw)
Created on 2012-08-07 11:55 by skrah, last changed 2022-04-11 14:57 by admin. This issue is now closed.
Messages (63)
Author: Stefan Krah (skrah) *
Date: 2012-08-07 11:55
Continuing the discussion from #13072. I hit a snag here:
Determining in full generality whether two format strings describe identical items is pretty complicated, see also #3132.
I'm attaching a best effort fmtcmp() function that should do the following:
recognize byte order specifiers at the start of the string.
recognize if an explicitly specified byte order happens to match the native byte order.
It won't catch:
byte order specifiers anywhere in the string.
C types that happen to be identical ('I', 'L' on a 32-bit platform). I'm also not sure if that is desirable in the first place.
???
So fmtcmp() will return false negatives (not equal), but should be correct for most format strings that are actually in use.
Mark, Meador: You did a lot of work on the struct module and of course on issue #3132. Does this look like a reasonable compromise? Did I miss obvious cases (see attachment)?
Author: Alyssa Coghlan (ncoghlan) *
Date: 2012-08-07 13:06
I confess I was thinking of an even simpler "format strings must be identical" fallback, but agree your way is better, since it reproduces the 3.2 behaviour in many more cases where ignoring the format string actually did the right thing.
The struct docs for the byte order specifier specifically say "the first character of the format string can be used to indicate the byte order, size and alignment of the packed data", so treating format strings that include byte order markers elsewhere in the string as differing from each other if those markers are in different locations sounds fine to me.
Author: Meador Inge (meador.inge) *
Date: 2012-08-08 02:52
I agree that the general case is complicated. It will get even more complicated if the full of PEP 3118 gets implemented since it turns into a tree comparison. In general, I think you will probably have to compute some canonical form and then compare the canonical forms.
Here are a few more cases that don't work out in the attached algorithm:
- Repeat characters - '2c' == 'cc'
- Whitespace - 'h h' == 'hh'
Also, currently the byte order specifiers are always at the beginning of the string. We discussed in scoping them per the nested structures, but decided to drop that unless somebody barks about it since it is fairly complicated without a clear benefit. So, I wouldn't worry about them being scattered through the string.
This seems like sort of a slippery slope. I need to think about it more, but my first impression is that coming up with some way to compare format strings is going to be nasty.
Author: Stefan Krah (skrah) *
Date: 2012-08-08 11:31
Right, byte order specifiers are always at the beginning of the string. That is at least something. I wonder if we should tighten PEP-3118 to demand a canonical form of format strings, such as (probably incomplete):
Whitespace is disallowed.
Except for 's', no zero count may be given.
A count of 1 (redundant) is disallowed.
Repeats must be specified in terms of count + single char.
That still leaves the '=I' != '=L' problem. Why are there two specifiers describing uint32_t?
Anyway, as Meador says, this can get tricky and I don't think this can be resolved before beta-2. I'm attaching a patch that should behave well for the restricted canonical form at least.
Author: Martin v. Löwis (loewis) *
Date: 2012-08-09 21:47
Can someone please explain why this is a release blocker? What is the specific regression from 3.2 that this deals with?
Author: STINNER Victor (vstinner) *
Date: 2012-08-09 22:05
What is the specific regression from 3.2 that this deals with?
I don't know if it must be called a regression, but at least the behaviour is different in Python 3.2 and 3.3. For example, an Unicode array is no more equal to its memoryview:
Python 3.3.0b1 (default:aaa68dce117e, Aug 9 2012, 22:45:00)
[GCC 4.6.3 20120306 (Red Hat 4.6.3-2)] on linux
Type "help", "copyright", "credits" or "license" for more information.
import array
a=array.array('u', 'abc') v=memoryview(a)
a == v
False
ned$ python3
Python 3.2.3 (default, Jun 8 2012, 05:40:07)
[GCC 4.6.3 20120306 (Red Hat 4.6.3-2)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
import array
a=array.array('u', 'abc') v=memoryview(a)
a == v
True
Author: Martin v. Löwis (loewis) *
Date: 2012-08-10 06:29
haypo: thanks for stating the issue.
ISTM that this classifies as an "obscure" bug: you have to use memoryviews, and you need to compare them for equality, and the comparison needs to be "non-trivial", where "trivial" is defined by "both are 1D byte arrays".
While this is a bug, I think it still can be fixed in a bug fix release of 3.3, so un-blocking.
I also think that as a first step, a specification needs to be drafted defining when exactly a memory view should compare equal with some other object. I can easily provide a specification that makes the current implementation "correct".
Author: Stefan Krah (skrah) *
Date: 2012-08-10 07:48
I can easily provide a specification that makes the current implementation "correct"
Yes, the current specification is: memoryview only attempts to compare arrays with known (single character native) formats and returns "not equal" otherwise.
The problem is that for backwards compatibility memoryview accepts arrays with arbitrary format strings. In operations like tolist() it's possible to raise NotImplemented, but for equality comparisons that's not possible.
Note that in 3.2 memoryview would return "equal" for arrays that simply aren't equal, if those arrays happen to have the same bit pattern.
One way to deal with this is to demand a strict canonical form of format strings for PEP-3118, see .
Author: Martin v. Löwis (loewis) *
Date: 2012-08-10 08:53
Note that in 3.2 memoryview would return "equal" for arrays that simply aren't equal, if those arrays happen to have the same bit pattern.
This is exactly my point. If a "memoryview" has the same "memory" as another, why are they then not rightfully considered equal? IOW, the 3.2 behavior looks fine to me.
You apparently have a vision that equality should mean something different for memoryviews - please explicitly state what that view is. A mathematical definition ("two memoryviews A and B are equal iff ...") would be appreciated.
One way to deal with this is to demand a strict canonical form of format strings for PEP-3118, see .
You are talking about the solution already - I still don't know what the problem is exactly (not that I need to understand the problem, but at a minimum, the documentation should state what the intended behavior is - better if people would also agree that the proposed behavior is "reasonable").
For 3.3, I see two approaches: either move backwards to the 3.2 behavior and defer this change to 3.4 - this would make it release-critical indeed. Or move forward to a yet-to-be specified equality operation which as of now may not be implemented correctly, treating any improvement to it as a bug fix.
Author: Stefan Krah (skrah) *
Date: 2012-08-10 09:53
PEP-3118 specifies strongly typed multi-dimensional arrays. The existing code in the 3.2 memoryview as well as numerous comments by Travis Oliphant make it clear that these capabilities were intended from the start for memoryview as well.
Perhaps the name "memoryview" is a misnomer, since the boundaries between memoryview and NumPy's ndarray become blurry. In fact, the small implementation of an ndarray in Modules/_testbuffer.c is also a memoryview in some sense, since it can grab a buffer from an exporter and expose it in the same manner as memoryview.
So what I implemented is certainly not only my vision. The PEP essentially describes NumPy arrays with an interchange format to convert between NumPy and PIL arrays.
It is perhaps unfortunate that the protocol was named "buffer" protocol, since it is actually an advanced "array" protocol.
NumPy arrays don't care about the raw memory. It is always the logical array that matters. For example, Fortran and C arrays have different bit patterns in memory but compare equal, a fact that the 3.2 implementation completely misses.
Arrays v and w are equal iff format and shape are equal and for all valid indices allowed by shape
memcmp((char *)PyBuffer_GetPointer(v, indices), (char *)PyBuffer_GetPointer(w, indices), itemsize) == 0.
Equal formats of course imply equal itemsize.
Author: Martin v. Löwis (loewis) *
Date: 2012-08-10 16:04
Can you please elaborate on the specification:
what does it mean that the formats of v and w are equal?
Victor's clarification about this issue isn't about comparing two arrays, but an array with a string object. So: when is an array equal to some other (non-array) object?
Author: Stefan Krah (skrah) *
Date: 2012-08-10 16:46
- what does it mean that the formats of v and w are equal?
I'm using array and Py_buffer interchangeably since a Py_buffer struct actually describes a multi-dimensional array. v and w are Py_buffer structs.
So v.format must equal w.format, where format is a format string in struct module syntax. The topic of this issue is to determine under what circumstances two strings in struct module syntax are considered equal.
- Victor's clarification about this issue isn't about comparing two arrays, but an array with a string object. So: when is an array equal to some other (non-array) object?
a=array.array('u', 'abc') v=memoryview(a) a == v False
memoryview can compare against any object with a getbufferproc, in this case array.array. memoryview_richcompare() calls PyObject_GetBuffer(other) and proceeds to compare its own internal Py_buffer v against the obtained Py_buffer w.
In the case of v.format == w.format the fix for unknown formats is trivial: Just allow the comparison using v.itemsize == w.itemsize.
However, the struct module format string syntax has multiple representations for the exact same formats, which makes a general fmtcmp() function tricky to write.
Hence my proposal to demand a strict canonical form for PEP-3118 format strings, which would be a proper subset of struct module format strings.
Example: "1Q 1h 1h 0c" must be written as "Q2h"
The attached patch should largely implement this proposal. A canonical form is perhaps not such a severe restriction, since format strings should usually come from the exporting object. E.g. NumPy must translate its own internal format to struct module syntax anyway.
Another option is to commit the patch that misses "1Q 1h 1h 0c" == "Q2h" now and aim for a completely general fmtcmp() function later.
IMO any general fmtcmp() function should also be reasonably fast.
Author: Martin v. Löwis (loewis) *
Date: 2012-08-10 17:47
So v.format must equal w.format, where format is a format string in struct module syntax. The topic of this issue is to determine under what circumstances two strings in struct module syntax are considered equal.
And that is exactly my question: We don't need a patch implementing it (yet), but a specification of what is to be implemented first.
I know when two strings are equal (regardless of their syntax): if they have the same length, and contain the same characters in the same order. Apparently, you have a different notion of "equal" for strings in mind, please be explicitly what that notion is.
memoryview can compare against any object with a getbufferproc, in this case array.array. memoryview_richcompare() calls PyObject_GetBuffer(other) and proceeds to compare its own internal Py_buffer v against the obtained Py_buffer w.
Can this be expressed on Python level as well? I.e. is it correct to say: an array/buffer/memoryview A is equal to an object O iff A is equal to memoryview(O)? Or could it be that these two equivalences might reasonably differ?
Hence my proposal to demand a strict canonical form for PEP-3118 format strings, which would be a proper subset of struct module format strings.
Can you kindly phrase this as a specification? Who is demanding what from whom?
Proposal: two format strings are equal if their canonical forms are equal strings. The canonical form C of a string S is created by ???
However, it appears that you may have something different in mind where things are rejected/fail to work if the canonical form isn't originally provided by somebody (whom?)
So another Proposal: two format strings are equal iff they are in both in canonical form and are equal strings.
This would imply that a format string which is not in canonical form is not equal to any other strings, not even to itself, so this may still not be what you want. But I can't guess what it is that you want.
Author: STINNER Victor (vstinner) *
Date: 2012-08-10 18:01
Can't we start with something simple (for ptyhon 3.3?), and elaborate later? In my specific example, both object have the same format string and the same content. So i expect that they are equal. Le 10 août 2012 19:47, "Martin v. Löwis" <report@bugs.python.org> a écrit :
Martin v. Löwis added the comment:
So v.format must equal w.format, where format is a format string in struct module syntax. The topic of this issue is to determine under what circumstances two strings in struct module syntax are considered equal.
And that is exactly my question: We don't need a patch implementing it (yet), but a specification of what is to be implemented first.
I know when two strings are equal (regardless of their syntax): if they have the same length, and contain the same characters in the same order. Apparently, you have a different notion of "equal" for strings in mind, please be explicitly what that notion is.
memoryview can compare against any object with a getbufferproc, in this case array.array. memoryview_richcompare() calls PyObject_GetBuffer(other) and proceeds to compare its own internal Py_buffer v against the obtained Py_buffer w.
Can this be expressed on Python level as well? I.e. is it correct to say: an array/buffer/memoryview A is equal to an object O iff A is equal to memoryview(O)? Or could it be that these two equivalences might reasonably differ?
Hence my proposal to demand a strict canonical form for PEP-3118 format strings, which would be a proper subset of struct module format strings.
Can you kindly phrase this as a specification? Who is demanding what from whom?
Proposal: two format strings are equal if their canonical forms are equal strings. The canonical form C of a string S is created by ???
However, it appears that you may have something different in mind where things are rejected/fail to work if the canonical form isn't originally provided by somebody (whom?)
So another Proposal: two format strings are equal iff they are in both in canonical form and are equal strings.
This would imply that a format string which is not in canonical form is not equal to any other strings, not even to itself, so this may still not be what you want. But I can't guess what it is that you want.
Python tracker <report@bugs.python.org> <http://bugs.python.org/issue15573>
Author: Martin v. Löwis (loewis) *
Date: 2012-08-10 19:18
Can't we start with something simple (for ptyhon 3.3?), and elaborate later?
Sure: someone would have to make a proposal what exactly that is. IMO, the 3.2 definition was simple, but apparently it was considered too simple. So either Stefan gets to define his view of equality, or somebody else needs to make a (specific) counter-proposal.
Author: Stefan Krah (skrah) *
Date: 2012-08-10 20:05
The ideal specification is:
Arrays v and w are equal iff format and shape are equal and for all valid indices allowed by shape
memcmp((char *)PyBuffer_GetPointer(v, indices), (char *)PyBuffer_GetPointer(w, indices), itemsize) == 0.
Two format strings s and t are equal if canonical(s) == canonical(t).
End ideal specification.
Purely to facilitate the implementation of a format comparison function, I suggested:
- An exporter must initialize the format field of a Py_buffer structure with canonical(s).
If all exporters obey 3), a format comparison function can simply call strcmp(s, t) (after sorting out the byte order specifier).
Specifically, if x and y are equal, then:
a) x == memoryview(x) == memoryview(y) == y
If x and y are equal and exporter x does not obey 3), but exporter y does, then:
b) x == memoryview(x) != memoryview(y) == y
Under rule 3) this would be the fault of exporter x.
For Python 3.3 it is also possible to state only 1) and 2), with a caveat in the documentation that case b) might occur until the format comparison function in memoryview implements the reductions to canonical forms.
The problem is that reductions to canonical forms might get very complicated if #3132 is implemented.
Author: Stefan Krah (skrah) *
Date: 2012-08-10 20:53
Martin v. Loewis <report@bugs.python.org> wrote:
Sure: someone would have to make a proposal what exactly that is. IMO, the 3.2 definition was simple, but apparently it was considered too simple.
It was simply broken in multiple ways. Example:
from numpy import * x = array([1,2,3,4,5], dtype='B') y = array([5,4,3,2,1], dtype='B') z = y[::-1]
x == z array([ True, True, True, True, True], dtype=bool) memoryview(x) == memoryview(z) False Segmentation fault
I'm not even talking about the segfault here. Note that x == z, but memoryview(x) != memoryview(z), because the logical structure is not taken into account.
Likewise, one could construct cases where one array contains a float NaN and the other an integer that happens to have the same bit pattern.
The arrays would not be equal, but their memoryviews would be equal.
So either Stefan gets to define his view of equality, or somebody else needs to make a (specific) counter-proposal.
The view is defined by the PEP that clearly models NumPy. I'm curious what counter-proposal will work with NumPy and PIL.
Author: Martin v. Löwis (loewis) *
Date: 2012-08-11 04:15
I find Stefan's proposed equality confusing. Why is it based on memcmp? Either it compares memory (i.e. internal representations), or it compares abstract values. If it compares abstract values, then it shouldn't be a requirement that the format strings are equal in any sense. Instead, the resulting values should be equal. So I propose this definition:
v == w iff v.shape() == w.shape() and v.tolist() == w.tolist() if either operation fails with an exception, the objects are not equal
Of course, the implementation doesn't need to literally call tolist; instead, behaving as-if it had been called is fine. However, as time is running out, I would actually propose this to be the implementation in 3.3.
In addition, I would propose to support the 'u' and 'w' codes in tolist, to resolve what Victor says the actual issue is.
I'm -1 on a definition that involves equivalence of format strings.
Author: Alyssa Coghlan (ncoghlan) *
Date: 2012-08-11 09:42
Stefan's proposed definition is correct. Shapes define array types, format strings define entry types, then the actual memory contents define the value.
It's not "Stefan's definition of equivalence", it's what a statically typed array means.
The 3.2 way is completely broken, as it considers arrays containing completely different data as equal if the memory layout happens to be the same by coincidence.
3.3 is currently also broken, as it considers arrays that do contain the same values to be different.
Stefan's patch fixes that by checking the shape and format first, and then checking the value. It's exactly the same as doing an instance check in an eq method.
The requirement for a canonical format is for sanity's sake: the general equivalence classes are too hard to figure out.
Author: Martin v. Löwis (loewis) *
Date: 2012-08-11 17:31
Nick: I still disagree. Would you agree that array.array constitutes a "statically typed array"? Yet
py> array.array('b',b'foo') == array.array('B',b'foo') True py> array.array('i',[1,2,3]) == array.array('L', [1,2,3]) True
So the array object (rightfully) performs comparison on abstract values, not on memory representation. In Python, a statically typed array still conceptually contains abstract values, not memory blocks (this is also what Stefan asserts for NumPy in ). The static typing only restricts the values you can store in the container, and defines the internal representation on the C level (plus it typically implies a value storage, instead of a reference storage).
With your and Stefan's proposed semantics, we would get the weird case that for two array.arrays a and b, it might happen that
a == b and memoryview(a) != memoryview(b)
Author: Martin v. Löwis (loewis) *
Date: 2012-08-11 17:41
"it might still happen", I should say, since this problem is exactly what Victor says this issue intends to solve (for comparison of two 'u' arrays).
Author: Stefan Krah (skrah) *
Date: 2012-08-11 18:13
So we have two competing proposals:
- Py_buffers are strongly typed arrays in the ML sense (e.g. array of float, array of int).
This is probably what I'd prefer on the C level, imagine a function like function like PyBuffer_Compare(v, w).
Backwards compatibility problem for users who were thinking in terms of value comparisons:
x = array.array('b', [127]) y = array.array('B', [127]) x == y True memoryview(x) == memoryview(y) False
- Compare by value, like NumPy arrays do:
x = numpy.array([1, 2, 3], dtype='i') y = numpy.array([1, 2, 3], dtype='f') x == y array([True, True, True], dtype=bool)
I concede that this is probably what users want to see on the Python level.
Backwards compatibility problem for users who were using complicated structs:
from _testbuffer import * x = ndarray([(1,1), (2,2), (3,3)], shape=[3], format='hQ') x == memoryview(x) False
Reason: While _testbuffer.ndarray already implements tolist() in full generality, memoryview does not:
x.tolist() [(1, 1), (2, 2), (3, 3)]
memoryview(x).tolist() Traceback (most recent call last): File "", line 1, in NotImplementedError: memoryview: unsupported format hQ
So while I'm beginning to like Martin's proposal, the implementation is certainly trickier and will always be quite slow for complicated format strings.
It would be possible to keep a fast path for the primitive C types and use the code from _testbuffer.tolist() for the slow cases.
Author: Martin v. Löwis (loewis) *
Date: 2012-08-11 18:22
Here is a patch doing the comparison on abstract values if the formats are different.
Author: Alyssa Coghlan (ncoghlan) *
Date: 2012-08-11 18:58
OK, I think I finally understand what Martin is getting at from a semantic point of view, and I think I can better explain the background of the issue and why Stefan's proposed solution is both necessary and correct.
The ideal definition of equivalence for memory view objects would actually be:
memoryview(x) == memoryview(y)
if (and only if)
memoryview(x).tolist() == memoryview(y).tolist()
Now, in practice, this approach cannot be implemented, because there are too many format definitions (whether valid or invalid) that memoryview doesn't understand (and perhaps will never understand) and because it would be completely infeasible on large arrays with complex format definitions.
Thus, we are forced to accept a constraint on memoryview's definition of equality: individual values are always compared via raw memory comparison, thus values stored using different sizes or layouts in memory will always compare as unequal, even if they would compare as equal in Python
This is an acceptable constraint as, in practice, you don't perform mixed format arithmetic and it's not a problem if there's no automatic coercion between sizes and layouts.
The Python 3.2 memoryview effectively uses memcmp() directly treating everything as a 1D array of bytes data, completely ignoring both shape and format data. Thus:
ab = array('b', [1, 2, 3]) ai = array('i', [1, 2, 3]) aL = array('L', [1, 2, 3]) ab == ai True ab == ai == aL True memoryview(ab) == memoryview(ai) False memoryview(ab) == memoryview(aL) False memoryview(ai) == memoryview(aL) False
This approach leads to some major false positives, such as a floating point value comparing equal to an integer that happens to share the same binary representation:
af = array('f', [1.1]) ai = array('i', [1066192077]) af == ai False memoryview(af) == memoryview(ai) True
The changes in 3.3 are aimed primarily at eliminating those false positives by taking into account the shape of the array and the format of the contained values. It is not about changing the fundamental constraint that memoryview operates at the level of raw memory, rather than Python objects, and thus cares about memory layout details that are irrelevant after passing through the Python abstraction layer.
This contrasts with the more limited scope of the array.array module, which does take into account the Python level abstractions. Thus, there will always be a discrepancy between the two definitions of equality, as memoryview cares about memory layout details, where array.array does not.
The problem at the moment is that Python 3.3 currently has spurious false negatives that aren't caused by that fundamental constraint that comparisons must occur based directly on memory contents. Instead, they're being caused by memoryview returning False for any equality comparison for a format it doesn't understand. That's unacceptable, and is what Stefan's patch is intended to fix.
Author: Alyssa Coghlan (ncoghlan) *
Date: 2012-08-11 19:03
Short version: 3.3 implemented new constraints on memoryview equality comparisons to eliminate cases that were incorrectly returning True when they should have returned False.
However, the format constraints are currently too restrictive, so some memoryview comparisons that correctly returned True in 3.2 are now also returning False. This patch fixes those comparisons to once again return True.
Moving back to deferred blocked status, since the spurious false negatives are a regression from 3.2. It's fine for beta2 to ship with this problem, but it needs to be fixed for rc1.
Author: Alyssa Coghlan (ncoghlan) *
Date: 2012-08-11 19:09
I can see value in adopting the abstraction approach, but it's not part of this issue. That problem already existed when using memoryview in 3.2, we can wait until 3.4 to fix it.
There's an actual, concrete regression in 3.3 that may break code, and comparing with memcmp() when the format strings match is enough to fix it.
Author: Alyssa Coghlan (ncoghlan) *
Date: 2012-08-11 19:12
And, to be honest, I'd be quite happy for "congratulations, you have reached the threshold where you need NumPy rather than memoryview" to be the long term answer to getting "by value" comparison semantics.
Author: Martin v. Löwis (loewis) *
Date: 2012-08-11 19:34
However, the format constraints are currently too restrictive, so some memoryview comparisons that correctly returned True in 3.2 are now also returning False. This patch fixes those comparisons to once again return True.
No, it doesn't - at least not in all cases. If you compare 'b' and 'B' arrays with values in range(128), then 3.2 rightfully returned True, but will still return False under the patch.
In any case, whatever the solution is (or even "whatever the problem is"), it needs to be clearly spelled out in the documentation, since it will be far from obvious.
Instead, they're being caused by memoryview returning False for any equality comparison for a format it doesn't understand. That's unacceptable, and is what Stefan's patch is intended to fix.
I fail to see why this is unacceptable. If you don't understand the format code, you cannot know how to compare the values - you cannot even know how to compare the memory representation of the values. If the memory compares unequal, it may still be plausible that the values are actually equal (e.g. if this is a numerator-denominator representation of rationals). There are even cases where the memory comparison can compare as equal, yet the values should compare non-equal (e.g. IEEE NaNs, or relative pointers).
I also think this is a really obscure problem: not many people use memoryview objects, very few desire to compare them for equality (most rather use them for efficient IO, and "poking" into structures), and of those, nearly nobody uses them with unsupported format codes - in particular not if "u" and "w" get implemented (since then standard Python will not have ways to create memoryview objects reasonably with unsupported codes). What specific example did you think of that regresses from 3.2?
So I still fail to see why this problem can manage to block the release of 3.3 - but that's for the release manager to decide.
Author: Martin v. Löwis (loewis) *
Date: 2012-08-11 19:45
And, to be honest, I'd be quite happy for "congratulations, you have reached the threshold where you need NumPy rather than memoryview" to be the long term answer to getting "by value" comparison semantics.
IMO, this threshold is already reached when you start comparing memoryview objects. PEP 3118 apparently introduced it to replace the buffer obejct, and specified that it should have getitem and setitem as magic methods; the PEP doesn't talk about comparison at all. So I wonder where the desire to support comparison for equality comes from (but I can accept that we have to deal with it as it is there now). IMO, it would be reasonable to declare to memory buffers as unequal if they denote different memory blocks in main memory (i.e. at different addresses).
Author: Alyssa Coghlan (ncoghlan) *
Date: 2012-08-11 19:52
Hmm, you're right. OK, here's a simpler proposal for 3.3:
Always look at shape first. If they're different, then they're not equal
If both formats are known, compare by unpacked value
If either format is unknown, compare by memory contents (just like 3.2)
The fallback case should then behave exactly like 3.2 (since 3.2 really couldn't handle anything other than 1D data and always ignored the format info).
I'd be happier if the compare-by-value didn't make complete copies of the entire array though.
Author: Antoine Pitrou (pitrou) *
Date: 2012-08-11 19:59
Le samedi 11 août 2012 à 19:52 +0000, Nick Coghlan a écrit :
I'd be happier if the compare-by-value didn't make complete copies of the entire array though.
Ditto. If a and b are bytes objects, comparing memoryview(a) and memoryview(b) should be as cheap as comparing a and b.
Author: Martin v. Löwis (loewis) *
Date: 2012-08-11 20:28
Am 11.08.12 21:59, schrieb Antoine Pitrou:
Le samedi 11 août 2012 à 19:52 +0000, Nick Coghlan a écrit :
I'd be happier if the compare-by-value didn't make complete copies of the entire array though.
Ditto. If a and b are bytes objects, comparing memoryview(a) and memoryview(b) should be as cheap as comparing a and b.
I agree with Antoine's requirement, generalizing it to "the simple cases should be efficient". I wonder why the procedure couldn't instead be
- compare shapes
- if the format strings are string-equal, compare the memory representation
- else unpack values
Then, comparing two 1D 'B' memoryviews would be a memcmp (i.e. it wouldn't do compare-by-value in this case).
For unpacking, I don't see any way to have it efficient and still maintainable, since mixed-type comparisons are quite tedious to write in C, and really best done with type normalization and multiple dispatch in the way that unpacking will do it.
Author: Martin v. Löwis (loewis) *
Date: 2012-08-11 21:46
Here is a more formal definition of my last proposal,
v and w are equal iff v.shape() == w.shape() and ((v.format == w.format and v.tobytes() == w.tobytes()) or v.tolist() == w.tolist()) if tolist raises an exception for unsupported codes, they are not equal
As usual, the implementation can deviate from the spec as long as it behaves "as-if" it would use this exact algorithm (assuming unlimited memory).
Author: Stefan Krah (skrah) *
Date: 2012-08-12 09:37
Martin v. L??wis <report@bugs.python.org> wrote:
Here is a more formal definition of my last proposal,
v and w are equal iff v.shape() == w.shape() and ((v.format == w.format and v.tobytes() == w.tobytes()) or v.tolist() == w.tolist()) if tolist raises an exception for unsupported codes, they are not equal
This wouldn't solve the NaN problem though:
nd = ndarray([(1,2)], shape=[1], format='dd', flags=ND_WRITABLE) nd[0] = (float('nan'), float('nan')) m = memoryview(nd) m == nd True
What's the policy on importing modules in the Object/* hierarchy? Can we import _struct?
The current unpack_cmp() function for identical single native format codes is very fast. The only thing that would be needed to make it general is to replace
default: return -1;
with
default: return unpack_cmp_struct();
The code is already there in _testbuffer.c. _testbuffer.c itself has 100% code coverage (including all error conditions with a special patch).
This means that all tests are already present in test_buffer.py.
Author: Stefan Krah (skrah) *
Date: 2012-08-12 09:47
To be specific, after a quick look the only function from _testbuffer.c that would be needed is unpack_single(). This really does not look very complex.
Author: Alyssa Coghlan (ncoghlan) *
Date: 2012-08-12 10:43
We do import modules in the Object/* code if we really need to (Grep for "PyImport_Import" in the Objects dir)
It seems to me like this case is rapidly reaching (or has even gone well past) the point where that's the sensible thing to do.
A couple of key issues of concern are:
- the number of modules implicitly imported at startup
- the risk of attempting to import a module before the import system (and the interpreter in general) is fully initialised
A simple trick for alleviating both concerns is to call PyImport_Import directly in the code path that needs it. It will be pretty quick if it gets a hit in the sys.modules cache, and most of the builtin objects aren't used until well after the interpreter is fully initialised (and, in the case of memoryview equality comparisons, often won't be used at all).
The main alternative is to invert the dependency: move the desired functionality into a private API function and have the standard library extension module access it that way. This can end up being harder to maintain, though, and certainly isn't appropriate in this case (since "_struct.unpack_from" needs the full struct machinery to do the job).
Author: Stefan Krah (skrah) *
Date: 2012-08-12 10:59
Thanks, Nick. I'll work on a struct module patch then. At this point for me this is the only satisfying solution.
Author: Stefan Krah (skrah) *
Date: 2012-08-14 10:07
Here is a patch implementing by-value comparisons for all format strings understood by the struct module. It is slightly longer than promised, since for larger arrays it is necessary to cache an unpacking object for acceptable performance. The fast path for identical single element native format strings is unchanged.
The new comparison rules are stated in the memoryview docs.
For Georg's benefit, here are the memoryobject.c changes and the reasons why I think the patch can go into 3.3:
o cmp_structure() is split into cmp_format() and cmp_shape(), with unchanged semantics.
o The new section "unpack using the struct module" is largely identical to existing parts of _testbuffer.c:
- struct_get_unpacker() ==> see _testbuffer.c:ndarray_as_list()
- struct_unpack_single() ==> see base case in _testbuffer.c:unpack_rec()
o The new code is only called in the previous default case of unpack_cmp().
o The new code has 100% coverage.
Performance:
Identical format, bytes:
$ ./python -m timeit -n 1000 -s "import array; x = array.array('B', [1]*10000); y = array.array('B', [1]*10000);" "x == y" 1000 loops, best of 3: 116 usec per loop
$ ./python -m timeit -n 1000 -s "import array; x = array.array('B', [1]*10000); y = array.array('B', [1]*10000); a = memoryview(x); b = memoryview(y)" "a == b" 1000 loops, best of 3: 49.1 usec per loop
Identical format, double:
$ ./python -m timeit -n 1000 -s "import array; x = array.array('d', [1.0]*10000); y = array.array('d', [1.0]*10000);" "x == y" 1000 loops, best of 3: 319 usec per loop
$ ./python -m timeit -n 1000 -s "import array; x = array.array('d', [1.0]*10000); y = array.array('d', [1.0]*10000); a = memoryview(x); b = memoryview(y)" "a == b" 1000 loops, best of 3: 65.7 usec per loop
Different format ('B', 'b'):
$ ./python -m timeit -n 100 -s "import array; x = array.array('B', [1]*10000); y = array.array('b', [1]*10000);" "x == y" 100 loops, best of 3: 131 usec per loop
$ ./python -m timeit -n 1000 -s "import array; x = array.array('B', [1]*10000); y = array.array('b', [1]*10000); a = memoryview(x); b = memoryview(y)" "a == b" 1000 loops, best of 3: 3.42 msec per loop
Different format ('d', 'f'):
$ ./python -m timeit -n 1000 -s "import array; x = array.array('d', [1.0]*10000); y = array.array('f', [1.0]*10000);" "x == y" 1000 loops, best of 3: 315 usec per loop
$ ./python -m timeit -n 1000 -s "import array; x = array.array('d', [1.0]*10000); y = array.array('f', [1.0]*10000); a = memoryview(x); b = memoryview(y)" "a == b" 1000 loops, best of 3: 3.59 msec per loop
Author: Stefan Krah (skrah) *
Date: 2012-08-14 14:38
I didn't get my own comments as mail, so this is just a notification that I've answered Nick's comments.
Author: Alyssa Coghlan (ncoghlan) *
Date: 2012-08-15 03:41
Yeah, as soon as I got your email I realised I'd only been searching for usage in the patch rather than the full file. Oops :(
More comments in the review.
Author: Stefan Krah (skrah) *
Date: 2012-08-15 14:37
Here is a new patch that hopefully addresses all comments.
Author: Stefan Krah (skrah) *
Date: 2012-08-17 11:48
There is still one corner case involving NaNs: Released memoryviews always compare equal. I took that over from the 3.2 implementation.
import array a = array.array('d', [float('nan')]) m = memoryview(a) m == m False m.release() m == m True
I guess we have to live with that, since it is of course impossible to access the values of a released view.
Author: Stefan Krah (skrah) *
Date: 2012-08-19 11:30
I think -struct-2.diff is ready to go and I'd rather commit sooner than later. Nick, can I interpret your last review comment as "go ahead"? :)
Author: Georg Brandl (georg.brandl) *
Date: 2012-08-19 11:36
Small nit: when you put "versionchanged:: 3.3" there should be a hint of what changed exactly :)
Author: Stefan Krah (skrah) *
Date: 2012-08-19 11:56
Right. I'm tracking all "versionchanged" issues in #15724.
Author: Martin v. Löwis (loewis) *
Date: 2012-08-24 14:22
Is this still blocking the release? If so, it should be resolved within the next twelve hours, or else it may block the release until September.
Author: Georg Brandl (georg.brandl) *
Date: 2012-08-24 21:05
I'm not very keen to hold up the release for long times again, especially for a patch of this size and lots of potential breakage.
Author: STINNER Victor (vstinner) *
Date: 2012-08-25 07:09
Even if I would prefer to see a fully working new implementation of the buffer interface, I agree with Georg: it's too late for such huge change in Python 3.3. Can' we wait Python 3.4 to change the equality operator?
It also took me three major releases to fix all Unicode issues (and it's probably not finished :-)). Le 24 août 2012 23:05, "Georg Brandl" <report@bugs.python.org> a écrit :
Georg Brandl added the comment:
I'm not very keen to hold up the release for long times again, especially for a patch of this size and lots of potential breakage.
Python tracker <report@bugs.python.org> <http://bugs.python.org/issue15573>
Author: Stefan Krah (skrah) *
Date: 2012-08-25 07:28
The effect of the change is pretty minimal though: Previously "not equal" was returned for unknown formats, now the struct module figures it out.
memoryobject.c has 100% coverage except for a couple of lines that are either impossible to reach or very hard to reach.
The new code is among the most heavily exercised lines, since in test_buffer.py the verify() function always asserts equality.
You can look at the attached memoryobject.c.gcov: All failure paths are taken since Python API functions are wrapped in macros that set PyExc_FailAPIError and return failure at predefined points the test suite.
Author: Alyssa Coghlan (ncoghlan) *
Date: 2012-08-25 07:39
With Stefan's major improvements to the test suite for 3.3, the risk is low and I really don't want to spend the life of the 3.3 series excusing the current comparison behaviour.
By contrast, explaining the shift in definition as moving from raw memory comparisons to by value comparisons is relatively easy.
Author: Georg Brandl (georg.brandl) *
Date: 2012-08-25 07:41
Well, I'm not against you committing it as long as you commit it right now. I'm about to cut the tag.
Author: Stefan Krah (skrah) *
Date: 2012-08-25 07:44
You can look at the attached memoryobject.c.gcov: All failure paths are taken since Python API functions are wrapped in macros that set PyExc_FailAPIError and return failure at predefined points the test suite.
That should read: With a patch that wraps Python API functions ...
Obviously the macros aren't in by default.
Author: Stefan Krah (skrah) *
Date: 2012-08-25 07:45
Great, give me 20 minutes or so.
Author: Alyssa Coghlan (ncoghlan) *
Date: 2012-08-25 07:47
I'm currently on IRC and double checking the patch locally.
Author: Alyssa Coghlan (ncoghlan) *
Date: 2012-08-25 07:49
Stefan, was there something specific you wanted to do before committing? Otherwise I can commit it as soon as this full test run finishes.
Author: Stefan Krah (skrah) *
Date: 2012-08-25 07:56
Except for Misc/NEWS the patch can be committed unchanged. Please go ahead!
[The remaining doc changes are tracked elsewhere.]
Author: Roundup Robot (python-dev)
Date: 2012-08-25 08:00
New changeset afa3dedfee18 by Nick Coghlan in branch 'default': Close #15573: use value-based memoryview comparisons (patch by Stefan Krah) http://hg.python.org/cpython/rev/afa3dedfee18
Author: Stefan Krah (skrah) *
Date: 2012-08-29 07:36
We overlooked one thing. Since hashing is defined in terms of tobytes(), the equality-hash invariant is now broken:
from _testbuffer import ndarray x = ndarray([1,2,3], shape=[3], format='f') y = ndarray([1,2,3], shape=[3], format='B') a = memoryview(x) b = memoryview(y) a == b True hash(a) == hash(b) False
This is one problem with the new equality definition. It puts "memoryview" firmly into array territory. I'm not saying that's wrong, I even think it was the intention of the PEP authors to have a zero copy "arrayview".
Both array.array and numpy.array sidestep the issue by not being hashable.
I don't really see a way around all this except doing slow element-wise hashing.
Author: Alyssa Coghlan (ncoghlan) *
Date: 2012-08-29 09:51
Perhaps the hash could just be based on a subset of the items? For example, hash the format, shape and the first and last items?
Yes, such an approach means you're more likely to get collisions if you do start hashing these, but that seems better than making hashing itself excessively slow.
Author: Stefan Krah (skrah) *
Date: 2012-08-29 15:23
I'm trying to think of an optimization for the native types. It should be possible to cast any native type element to unsigned char and use the truncated value for the bytes hash.
Well, for double probably it's required to go from double -> int64_t -> unsigned char, otherwise the first cast is technically undefined for negative values, though it works with gcc.
For non-native types and compound types, Nick's suggestion of hashing the shape and a couple of values seems to be the best solution.
Should we do anything about this before 3.3.0?
Author: Martin v. Löwis (loewis) *
Date: 2012-08-29 16:39
I STRONGLY request to take hashing out of this issue. The only further action that this issue can have is complete reversal of the patch, starting over. Now that the issue has been resolved, a NEW issue arises, namely that hashing is inconsistent with equality. Please re-read the original issue: the objective of supporting unknown formats in memoryview comparisons is achieved. Remember that by reopening the issue, you are blocking the release (since it's still a release blocker).
Any significant change to the 3.3.0 code base should, IMO, require an additional release candidate (since the release candidate really ought to be what is going to be released, hence the name, so NO changes between candidate and release). Do you really want to delay the release of Python 3.3 until October because of hashing of memoryview objects (which, AFAICT, only affects bytes objects in the standard library)?
I don't think memoryview objects should be hashable at all. No matter what definition of hashing you chose, you likely won't manage to get it consistent with ==. Since a memoryview compares equal with its underlying object, it should also hash equal.
(unrelated to hashing, related to the original issue) I think it will be possible to create non-transitive equalities, where A==memoryview(A)==memoryview(B)==B, yet not (A==B)
I'm not sure whether the hash of a memoryview will surely be constant (as it should): isn't it possible to create a read-only buffer for a mutable memory block (not PyObject), so that any hashing considering the contents will result in changing hashes? Or, since the hash is cached, couldn't that cause the hash to deviate from ==?
Author: Stefan Krah (skrah) *
Date: 2012-08-29 17:09
I agree that this isn't a blocker due to the limited use of memoryview hashing. Tracking this in #15814.
Author: Roundup Robot (python-dev)
Date: 2012-11-02 17:07
New changeset 969069d464bc by Stefan Krah in branch '3.3': Issue #15814: Use hash function that is compatible with the equality http://hg.python.org/cpython/rev/969069d464bc