ENH: Unhelpful output from assert_frame_equal when indexes differ and check_like=True

Problem:

Calling testing.assert_frame_equal with mismatched indexes and check_like=True generates unhelpful output.

If you run:

import pandas as pd

df1 = pd.DataFrame({"A": [1.0, 2.0, 3.0], "B": [4.0, 5.0, 6.0]}, index=["a", "b", "c"])
df2 = pd.DataFrame({"A": [1.0, 2.0, 3.0], "B": [4.0, 5.0, 6.0]}, index=["a", "b", "d"])
pd.testing.assert_frame_equal(df1, df2, check_like=True)

The output will be:

AssertionError: DataFrame.iloc[:, 0] (column name="A") are different

DataFrame.iloc[:, 0] (column name="A") values are different (33.33333 %)
[index]: [a, b, d]
[left]:  [1.0, 2.0, nan]
[right]: [1.0, 2.0, 3.0]

The data in the input DataFrames is not actually different (neither frame contains a NaN). However, when check_like=True the code calls left.reindex_like(right) before comparing indexes (and columns), to ensure that both frames are ordered the same way.
If the indexes contain different values (rather than the same values in a different order),
reindex_like fills the data values (row or column) for the mismatched index entries with NaN.
As a result, the subsequent index checks pass, but assert_frame_equal fails
with a "values are different" error (as above).
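The effect of the reindexing step can be reproduced in isolation. The following is a minimal sketch using the same frames as above; the variable names (left, right, reindexed, index_matches, row_d_is_nan) are chosen here for illustration and are not taken from the pandas source.

```python
import pandas as pd

left = pd.DataFrame({"A": [1.0, 2.0, 3.0], "B": [4.0, 5.0, 6.0]}, index=["a", "b", "c"])
right = pd.DataFrame({"A": [1.0, 2.0, 3.0], "B": [4.0, 5.0, 6.0]}, index=["a", "b", "d"])

# Conform left to right's labels, as the check_like=True path is described as doing.
reindexed = left.reindex_like(right)

# The reindexed frame now carries right's index, so an index comparison passes...
index_matches = reindexed.index.equals(right.index)

# ...but label "d" never existed in left, so that row is filled with NaN,
# and row "c" from left has been silently dropped.
row_d_is_nan = reindexed.loc["d"].isna().all()

print(reindexed)
print("index matches:", index_matches)
print("row 'd' is all NaN:", row_d_is_nan)
```

This is why the failure is reported against the column values rather than the index: by the time the comparison runs, the index difference has already been converted into missing data.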

Even more confusingly, if the values being compared are integers rather than floats, the NaN fill upcasts the reindexed column to float64, so you get a dtype mismatch error instead:

AssertionError: Attributes of DataFrame.iloc[:, 0] (column name="A") are different

Attribute "dtype" are different
[left]:  float64
[right]: int64

These messages are quite unhelpful, as the mismatch is in the index, and the error should logically be the same as you would get if you ran with check_like=False.
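The dtype variant of the error can be traced to the same reindexing step. As a rough sketch (the frame and variable names below are illustrative assumptions, not part of the original report), an integer column cannot hold NaN, so reindexing against a label that is missing from the left frame converts the column to float64:

```python
import pandas as pd

left_int = pd.DataFrame({"A": [1, 2, 3]}, index=["a", "b", "c"])
right_int = pd.DataFrame({"A": [1, 2, 3]}, index=["a", "b", "d"])

# Label "d" is missing from left_int, so its value becomes NaN
# and the integer column is upcast to a float dtype.
reindexed_int = left_int.reindex_like(right_int)

left_dtype = str(reindexed_int["A"].dtype)
right_dtype = str(right_int["A"].dtype)

print("left after reindex:", left_dtype)
print("right:", right_dtype)
```

The dtype check then fails before the values are ever compared, which hides the index mismatch even further.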

Applies to:

The code above was run against the latest code from master.

print(pd.__version__)
1.2.0.dev0+950.gd321be6

Solution:

The message for the above assertion failure should be something like:

AssertionError: DataFrame.index are different

DataFrame.index values are different (33.33333 %)
[left]:  Index(['a', 'b', 'c'], dtype='object')
[right]: Index(['a', 'b', 'd'], dtype='object')

This is what you get if you run with check_like=False.
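Until the message is improved, one possible user-side workaround is to compare the labels explicitly before the order-insensitive frame comparison. The sketch below is only an illustration of that idea: assert_frame_equal_like is a hypothetical helper, not a pandas function and not the proposed fix itself, and it assumes the index and column labels are sortable.

```python
import pandas as pd


def assert_frame_equal_like(left, right, **kwargs):
    """Hypothetical wrapper: check labels first, then compare ignoring order."""
    # Sorting both sides makes the label comparison order-insensitive,
    # so only genuinely different labels trigger the index error.
    pd.testing.assert_index_equal(
        left.index.sort_values(), right.index.sort_values(), obj="DataFrame.index"
    )
    pd.testing.assert_index_equal(
        left.columns.sort_values(), right.columns.sort_values(), obj="DataFrame.columns"
    )
    pd.testing.assert_frame_equal(left, right, check_like=True, **kwargs)


df1 = pd.DataFrame({"A": [1.0, 2.0, 3.0], "B": [4.0, 5.0, 6.0]}, index=["a", "b", "c"])
df2 = pd.DataFrame({"A": [1.0, 2.0, 3.0], "B": [4.0, 5.0, 6.0]}, index=["a", "b", "d"])
df3 = df1.loc[["c", "a", "b"]]  # same labels as df1, different order

# Same labels in a different order: no error is raised.
assert_frame_equal_like(df1, df3)

# Different labels: the failure is now reported against the index.
try:
    assert_frame_equal_like(df1, df2)
    message = None
except AssertionError as err:
    message = str(err)

print(message)
```

Because the sorted indexes are compared, the [left] and [right] lines in the message show the labels in sorted order rather than in their original order.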