[Python-Dev] cmp(x,x)

Armin Rigo arigo at tunes.org
Tue May 18 04:57:33 EDT 2004


Hello,

A minor semantic change that crept in some time ago was an implicit assumption that any object x should "reasonably" be expected to compare equal to itself. The arguments are summarized below (should this be documented, inserted in NEWS, turned into a mini-PEP a posteriori ... ?).

The point is that comparisons now behave differently depending on whether they are issued by C or Python code. The expression 'x == y' will always call x.__eq__(y), because in some settings (e.g. Numeric) the result is not just True or False but an arbitrary object (e.g. a Numeric array of zeroes and ones). Because of that, you cannot just say that 'x == x' should return True. So the behavior of PyObject_RichCompare() didn't change.
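To make the Numeric case concrete, here is a minimal sketch in present-day Python 3 (an assumption; it is not the 2004 code base). The class Vec is hypothetical and only stands in for a Numeric array: its __eq__ returns an element-wise list rather than a bool, so 'x == x' cannot be short-circuited to True without throwing the result away.

```python
class Vec:
    """Hypothetical stand-in for a Numeric array (not the real Numeric API)."""

    def __init__(self, *items):
        self.items = list(items)

    def __eq__(self, other):
        # Element-wise comparison: the result is a list of 0/1 flags,
        # not a single True/False.
        return [int(a == b) for a, b in zip(self.items, other.items)]


v = Vec(1, 2, 3)
w = Vec(1, 0, 3)
same = v == v      # [1, 1, 1], computed by Vec.__eq__
mixed = v == w     # [1, 0, 1]
```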

On the other hand, when C code does a comparison it uses PyObject_RichCompareBool(), meaning it is only interested in a 1 or 0 answer. So PyObject_RichCompareBool() is where the shortcut about comparing an object with itself has been inserted.

The result is the following: say x has a method __cmp__() that always returns -1.

    >>> x == x       # x.__cmp__() is called
    False
    >>> x < x        # x.__cmp__() is called
    True
    >>> cmp(x,x)     # x.__cmp__() is NOT called
    0
    >>> [x] == [x]   # x.__cmp__() is NOT called
    True

The only way to explain the semantics is that the expression 'x == x' always calls the special methods, but built-in functions are allowed to assume that any object compares equal to itself. In other words, C code usually checks that objects are "identical or equal", which is equivalent to the Python expression 'x is y or x == y'. For example, the equality of lists now works as follows:

    def eq(lst1, lst2):
        if len(lst1) != len(lst2):
            return False
        for x, y in zip(lst1, lst2):
            if not (x is y or x == y):
                return False
        else:
            return True
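The rule above can be checked against a present-day interpreter. This is a sketch assuming CPython 3, where __cmp__ no longer exists, so the hypothetical class NeverEqual overrides __eq__ instead and counts its calls. list_eq restates the function above with the for/else collapsed into a plain return.

```python
class NeverEqual:
    """Hypothetical object that claims to be unequal to everything."""

    def __init__(self):
        self.calls = 0

    def __eq__(self, other):
        self.calls += 1
        return False


def list_eq(lst1, lst2):
    # "Identical or equal" element test, as described in the text.
    if len(lst1) != len(lst2):
        return False
    for a, b in zip(lst1, lst2):
        if not (a is b or a == b):
            return False
    return True


x = NeverEqual()
direct = x == x              # the expression always calls x.__eq__
calls_after_direct = x.calls
builtin = [x] == [x]         # C code takes the identity shortcut
model = list_eq([x], [x])    # the Python model agrees with it
calls_after_lists = x.calls  # unchanged: neither list comparison called __eq__
```

The call counter is what shows the asymmetry: only the direct expression reaches the special method.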

Should any of this be documented?

An alternative behavior would have been to leave PyObject_RichCompareBool() alone and only insert the short-cut in specific object types' comparison methods on a case-by-case basis. For example, identical lists would just compare equal, without going through the loop comparing each element. This would remove the surprise of x.__cmp__(x) not always being called. The semantics would be easier to explain too: two lists are equal if they are the same list or if their elements compare pairwise equal. We would have (with x as above):

    >>> cmp(x,x)     # x.__cmp__(x) is called
    -1
    >>> [x] == [x]
    False            # because the lists contain a non-equal element
    >>> lst = [x]; lst == lst
    True             # because it is the same list
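A sketch of this alternative, written for present-day Python 3 with __eq__ standing in for __cmp__ (both are assumptions made for illustration). The helper alt_list_eq and the class NeverEqual are hypothetical names; the identity test is applied to the lists themselves, never to their elements.

```python
class NeverEqual:
    """Hypothetical object whose __eq__ always answers False."""

    def __eq__(self, other):
        return False


def alt_list_eq(lst1, lst2):
    # Shortcut only at the container level: the same list equals itself.
    if lst1 is lst2:
        return True
    if len(lst1) != len(lst2):
        return False
    # No identity shortcut here: every pair goes through ==.
    return all(a == b for a, b in zip(lst1, lst2))


x = NeverEqual()
lst = [x]
different_lists = alt_list_eq([x], [x])   # False: x == x is consulted
same_list = alt_list_eq(lst, lst)         # True: it is the same list
```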

Finally, whatever the final semantics are, we should make sure that existing built-in objects behave in a consistent way. Of course I'm thinking about floats: on my Linux box,

    >>> f = float('nan')
    >>> cmp(f,f)
    0        # because f is f
    >>> f == f
    False    # because float.__eq__() is called

Note that, as discussed below, the following behavior is expected and in accordance with standards:

    >>> float('nan') is float('nan')
    False
    >>> float('nan') == float('nan')
    False    # not the same object
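The same float behavior can be observed in present-day CPython 3, with the caveat that cmp() is gone there; the sketch below uses list equality and 'in' to stand in for comparisons issued from C code. It illustrates the point against a modern interpreter and is not a reproduction of the 2004 session.

```python
f = float('nan')
g = float('nan')                 # a second, distinct nan object

self_equal = f == f              # False: float.__eq__ is called
in_list = f in [f]               # True: containment checks identity first
lists_same_obj = [f] == [f]      # True: identity shortcut on the element
lists_other_obj = [f] == [g]     # False: different objects, and nan != nan
```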

Unless there are serious objections I suggest removing (i.e. I plan to remove) the short-cut in PyObject_RichCompareBool() -- performance is probably not an issue here -- and then reviewing all built-in comparison methods to make sure that they return "equal" for identical objects.

-+- summary of arguments that led to the original change (in cvs head).

The argument in favor of the change is to remove the complex and not useful code trying to compare self-recursive structures: for example, in some settings comparing __builtin__.__dict__ with itself would trigger recursive comparison of __builtin__.__dict__ with itself in an endless loop. The complex algorithm was able to spot that. The new semantics immediately assume __builtin__.__dict__ to be equal to itself. Removing the complex algorithm means that you will now get an endless loop when comparing two non-identical self-recursive structures with the same shape, which is most probably not a problem in practice (on the contrary, this implicit algorithm did hide a bug in one of my programs).
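Both halves of this argument can be illustrated in present-day CPython 3 (an assumption; the 2004 behavior may have differed in detail). A list that contains itself compares equal to itself through the identity shortcut, while two distinct lists of the same self-recursive shape recurse until the interpreter's recursion limit stops them with a RecursionError instead of looping forever.

```python
a = []
a.append(a)          # a contains itself
b = []
b.append(b)          # same shape, different object

same_object = a == a     # identity shortcut on the element, no recursion

try:
    a == b               # a[0] == b[0] is again a == b, and so on
    hit_limit = False
except RecursionError:
    hit_limit = True
```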

The argument against used to be that e.g. it makes sense that the float value 'nan' should be different from itself, as various standards require.
This argument does not apply in Python: these standards are about comparing values, not objects, so it makes perfect sense to say that even if x is the result of a computation that yielded an unknown answer 'nan', this answer is still equal to itself; what it is probably not equal to is another 'nan' which was obtained differently. In other words two float objects both containing 'nan' should be different, but one 'nan' object is still equal to itself. This is sane as long as no code considers 'nan' as a singleton, or tries to reuse 'nan' objects for different 'nan' values.

-+-

Armin


