API / CoW: always return new objects for column access (don't use item_cache) by jorisvandenbossche · Pull Request #49450 · pandas-dev/pandas
I did some timings to explore the cost of the equality check (done when passing a Series with a name that matches a column name), taking the extreme case of aggregating a single column (so the cost of the check is relatively large compared to the actual grouped reduction operation).
For an integer group key, the cost seems to be relatively small (2-3% of the time that the groupby takes):
import numpy as np
import pandas as pd

df = pd.DataFrame({"key": list(range(10)) * 10000, "col": np.random.randn(100000)})
s1 = df["key"].copy()
s2 = df["key"].copy()
In [38]: %timeit df.groupby("key").mean()
1.38 ms ± 29.5 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
In [41]: %timeit s1.equals(s2)
42.4 µs ± 405 ns per loop (mean ± std. dev. of 7 runs, 10,000 loops each)
But for string group values, the equality check is much more expensive (taking almost as much time as the groupby itself):
df = pd.DataFrame({"key": list("abcdefghij")*10000, "col": np.random.randn(100000)})
s1 = df["key"].copy()
s2 = df["key"].copy()
In [46]: %timeit df.groupby("key").mean()
3.29 ms ± 216 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
In [47]: %timeit s1.equals(s2)
2.94 ms ± 16.7 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
The time taken by equals here is almost entirely due to lib.array_equivalent_object. This could be optimized a little for strings (if you would use StringDtype), but only by ~30% based on a quick test.
I can't directly think of a better way to check, or to keep track of the fact, that two Series objects are identical.
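One option (a sketch, not what this PR implements) would be to try cheap identity-based fast paths before falling back to the full element-wise comparison; the helper name and the buffer-identity criterion here are my own illustration:

```python
import numpy as np
import pandas as pd


def cheap_equals(s1: pd.Series, s2: pd.Series) -> bool:
    # Hypothetical helper: try O(1) identity checks before the O(n) equals.
    if s1 is s2:
        return True
    a1, a2 = np.asarray(s1), np.asarray(s2)
    # Two views onto the exact same buffer with identical layout must hold
    # equal values, so the element-wise comparison can be skipped.
    if (
        a1.__array_interface__["data"][0] == a2.__array_interface__["data"][0]
        and a1.shape == a2.shape
        and a1.strides == a2.strides
        and a1.dtype == a2.dtype
    ):
        return True
    # Fall back to the full (potentially slow) comparison.
    return s1.equals(s2)
```

This only helps when the two Series actually share data, which is exactly the situation CoW tends to break up, so it would not cover the `df["key"].copy()` case timed above.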
We could actually use the current machinery that keeps track of the parent dataframe (Series._cacher, which is currently used to update the DataFrame's item_cache when we update the Series), and then check whether two Series objects with the same name have the same parent dataframe in their cache. But it would be nice if we could actually get rid of this code with CoW, and I am not sure it is worth keeping it just for this performance issue in a corner case of groupby.
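To illustrate the idea (without relying on the internal Series._cacher attribute itself), here is a sketch using an explicit registry; the helper names and the registry are hypothetical, not pandas API:

```python
import weakref

import pandas as pd

# Per column access, remember which DataFrame a Series came from (via a
# weakref, so the parent can still be garbage collected). Keying on id()
# is fragile in general (ids can be reused once a Series is collected);
# this is illustrative only.
_parents: dict = {}


def get_column(df: pd.DataFrame, name: str) -> pd.Series:
    s = df[name]
    _parents[id(s)] = (name, weakref.ref(df))
    return s


def same_parent_column(s1: pd.Series, s2: pd.Series) -> bool:
    # Two Series recorded with the same column name and the same live
    # parent DataFrame must hold identical values (assuming neither was
    # modified since the lookup), so no element-wise comparison is needed.
    e1, e2 = _parents.get(id(s1)), _parents.get(id(s2))
    if e1 is None or e2 is None:
        return False
    (n1, r1), (n2, r2) = e1, e2
    return n1 == n2 and r1() is not None and r1() is r2()
```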
In the short term, since this additional equality check is 1) only done when CoW is enabled, 2) only done when you actually pass a Series with a name that is present in the dataframe, and 3) only costly if all values are actually the same (i.e. only when you are actually passing a column that is present in the dataframe, and not a derived column), I think it is fine to use this equality check.
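To make the corner case concrete, a small example of the code path the check guards (values and column names chosen purely for illustration):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"key": list("ab") * 3, "col": np.arange(6.0)})

# Grouping by label: pandas looks the column up directly, no check needed.
by_label = df.groupby("key")["col"].mean()

# Grouping by a Series whose name matches a column: this is the case where
# the CoW code path compares the passed Series against df["key"] to decide
# whether it really is that column -- and the comparison is only expensive
# when the values do match all the way through.
s = df["key"].copy()
by_series = df.groupby(s)["col"].mean()
```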