ENH: Reduce type requirements for the subset parameter in drop_duplicates/duplicated · Issue #59237 · pandas-dev/pandas

Feature Type

Problem Description

I would like to pass a set to drop_duplicates like so:

df = get_some_df()
subset = {"column1", "column3"}
df_dropped = df.drop_duplicates(subset=subset)

According to type hints, subset is a Hashable | Sequence[Hashable] | None. The documentation says "column label or sequence of labels, optional".

The problem is that I would like to pass a set of columns. The name subset suggests that this should be fine, and it does in fact work at runtime. However, a set is not a Sequence.
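The mismatch can be checked against the abstract base classes in collections.abc. This is a minimal sketch using only the standard library; the column names are the ones from the example above.

```python
from collections.abc import Collection, Iterable, Sequence

subset = {"column1", "column3"}

# A set has no defined order and no indexing, so it is not a Sequence.
is_sequence = isinstance(subset, Sequence)

# It does support len(), iteration and `in`, which makes it a Collection
# (and therefore also an Iterable).
is_collection = isinstance(subset, Collection)
is_iterable = isinstance(subset, Iterable)

# Lists and tuples are Collections too, so a wider hint would still cover them.
list_is_collection = isinstance(["column1", "column3"], Collection)
tuple_is_collection = isinstance(("column1", "column3"), Collection)

print(is_sequence, is_collection, is_iterable)
```

So a static type checker flags the set even though the call succeeds at runtime.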

Can the requirement be relaxed, perhaps to Collection (or even Iterable, though that might come with problems)?

Looking at the code: https://github.com/pandas-dev/pandas/blob/v2.2.2/pandas/core/frame.py#L6935

It checks whether subset is iterable via np.iterable, then builds a set from it and calls len(subset). Both operations work on any Collection as well.
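To illustrate, here is a rough standard-library sketch of the steps described above. It is not the pandas implementation: normalize_subset is a hypothetical helper, isinstance(..., Iterable) stands in for np.iterable, and treating a string as a single label is a simplifying assumption.

```python
from __future__ import annotations

from collections.abc import Collection, Hashable, Iterable


def normalize_subset(subset, columns: Collection[Hashable]) -> Collection[Hashable]:
    """Hypothetical helper mirroring the described checks; not pandas code."""
    if subset is None:
        return columns
    # Stand-in for the np.iterable check: wrap a single label in a tuple.
    # A string is iterable, but is treated here as one column label.
    if isinstance(subset, str) or not isinstance(subset, Iterable):
        subset = (subset,)
    # Building a set only needs iteration, so any Collection works here.
    missing = set(subset) - set(columns)
    if missing:
        raise KeyError(sorted(missing, key=str))
    return subset


columns = ["column1", "column2", "column3"]

from_set = normalize_subset({"column1", "column3"}, columns)
from_list = normalize_subset(["column1", "column3"], columns)
from_label = normalize_subset("column2", columns)

# len() is the other operation mentioned above; it is defined for any Collection.
print(len(from_set), len(from_list), len(from_label))
```

A bare Iterable such as a generator would be riskier in a flow like this: set(subset) would consume it, and a later len(subset) would raise TypeError. That is presumably one of the problems a plain Iterable hint could bring.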

Am I missing anything, or could this Sequence be made a Collection?

Feature Description

If I am not mistaken, Sequence could simply be replaced with Collection in both duplicated and drop_duplicates, as well as in the documentation.
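As a sketch of what the proposed annotation might look like: the alias name SubsetArg and the function duplicated_stub below are hypothetical stand-ins, and only the type hint is meant to reflect the proposal.

```python
from collections.abc import Collection, Hashable
from typing import Optional, Union

# Current hint, as quoted in the problem description:
#   Hashable | Sequence[Hashable] | None
# Proposed hint, with Sequence replaced by Collection.
# SubsetArg is a hypothetical alias name used only for this sketch.
SubsetArg = Optional[Union[Hashable, Collection[Hashable]]]


def duplicated_stub(subset: SubsetArg = None) -> None:
    """Stand-in that carries only the proposed annotation, not pandas behaviour."""


# With the wider hint, a type checker should accept each of these calls.
duplicated_stub("column1")
duplicated_stub(["column1", "column3"])
duplicated_stub({"column1", "column3"})
```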

Alternative Solutions

No alternative solution required.

Additional Context

No response