BUG: BlockManager.references_same_values not working with object-dtype block that has reference to an Index · Issue #57276 · pandas-dev/pandas

On main, the following snippet fails:

df = pd.DataFrame({'A': ['a', 'b', 'c', 'd'], 'B': [1, 2, 3, 4], 'C': ['a', 'a', 'b', 'b']})
df = df.set_index("A")
key = df["C"]
df.groupby(key, observed=True)

with

File ~/scipy/pandas/pandas/core/internals/managers.py:342, in BaseBlockManager.references_same_values(self, mgr, blkno)
    337 """
    338 Checks if two blocks from two different block managers reference the
    339 same underlying values.
    340 """
    341 ref = weakref.ref(self.blocks[blkno])
--> 342 return ref in mgr.blocks[blkno].refs.referenced_blocks

ValueError: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all()

I don't yet fully understand why the list containment check results in that kind of error. I would expect it to just do == checks on the individual weakref objects and then check whether any of them is True.
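One plausible mechanism, sketched here with stand-in classes rather than pandas internals: when both referents are alive, `weakref.ref.__eq__` delegates to the referents' own `==`, and a list containment check then asks for the truth value of whatever that comparison returns. If one referent compares elementwise, the result is array-like and its truth value is ambiguous. `IndexLike`, `BlockLike` and `ElementwiseResult` below are hypothetical names invented for illustration, and whether this is the actual cause in pandas is an assumption.

```python
import weakref


class ElementwiseResult:
    """Hypothetical stand-in for a multi-element NumPy boolean array."""

    def __init__(self, flags):
        self.flags = list(flags)

    def __bool__(self):
        # Mirrors NumPy refusing to reduce several elements to a single bool.
        raise ValueError(
            "The truth value of an array with more than one element is ambiguous."
        )


class IndexLike:
    """Hypothetical stand-in for an object whose == is elementwise."""

    def __init__(self, values):
        self.values = list(values)

    def __eq__(self, other):
        # Returns one flag per element instead of a single bool.
        return ElementwiseResult(v == other for v in self.values)


class BlockLike:
    """Hypothetical stand-in for a block; default identity-based equality."""


block = BlockLike()
index = IndexLike(["a", "b", "c", "d"])

ref = weakref.ref(block)
# A list of weakrefs that mixes a block with an elementwise-comparing object.
referenced = [weakref.ref(index), weakref.ref(block)]

error = None
try:
    # `in` compares each entry with ==; weakref equality falls through to
    # `index == block`, and the elementwise result cannot be truth-tested.
    ref in referenced
except ValueError as exc:
    error = str(exc)

# Comparing the referents by identity never calls __eq__ at all.
found_by_identity = any(r() is block for r in referenced)
```

The identity-based check at the end is only one way to sidestep the problem in this toy setup; it is not presented as the fix adopted in pandas.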

But it seems to be triggered by at least two conditions: 1) the index being constructed from a column (if the df is created with an index from scratch, passing index=pd.Index(['a', 'b', 'c', 'd']), it doesn't trigger this), and 2) the index being object dtype (with an integer column set as the index, it also doesn't happen).

So the following two examples pass:

df = pd.DataFrame({'B': [1, 2, 3, 4], 'C': ['a', 'a', 'b', 'b']}, index=pd.Index(['a', 'b', 'c', 'd'], name="A"))
key = df["C"]
df.groupby(key, observed=True)

df = pd.DataFrame({'A': [1, 2, 3, 4], 'B': [1, 2, 3, 4], 'C': ['a', 'a', 'b', 'b']})
df = df.set_index("A")
key = df["C"]
df.groupby(key, observed=True)

(This is all without enabling CoW; we go through that code path regardless.)

cc @phofl I suppose this is originally caused by #51442, but we never noticed it. The same snippet also raises another error on 2.2.0 (ValueError: buffer source array is read-only in comp_method_OBJECT_ARRAY)