ENH: Reduce type requirements for the subset parameter in drop_duplicates/duplicated · Issue #59237 · pandas-dev/pandas
Feature Type
- Adding new functionality to pandas
- Changing existing functionality in pandas
- Removing existing functionality in pandas
Problem Description
I would like to pass a set to drop_duplicates like so:
df = get_some_df()
subset = {"column1", "column3"}
df_dropped = df.drop_duplicates(subset=subset)
According to the type hints, subset is a Hashable | Sequence[Hashable] | None. The documentation says "column label or sequence of labels, optional".
The problem is, I would like to pass a set of columns. The name subset suggests that should be ok. And it does work indeed. However, a set is not a Sequence.
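To make the mismatch concrete, here is a small check using only the standard-library abstract base classes. It needs no pandas; the variable names are illustrative only.

```python
from collections.abc import Collection, Iterable, Sequence

subset = {"column1", "column3"}

# A set has no positional indexing, so it is not a Sequence ...
set_is_sequence = isinstance(subset, Sequence)
# ... but it is sized, iterable, and supports `in`, so it is a Collection.
set_is_collection = isinstance(subset, Collection)
set_is_iterable = isinstance(subset, Iterable)

# Lists and tuples satisfy both, so Collection is strictly wider than Sequence.
list_is_sequence = isinstance(["column1", "column3"], Sequence)
list_is_collection = isinstance(["column1", "column3"], Collection)

# A generator is Iterable but has no len(), so it is not a Collection.
gen = (name for name in ["column1", "column3"])
gen_is_collection = isinstance(gen, Collection)

print(set_is_sequence, set_is_collection, gen_is_collection)
```

This is also why Iterable may be a step too far: anything that calls len() on the argument would fail for a generator.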
Can the requirements be lowered? Maybe to Collection? (Or even Iterable, but that might come with problems.)
Looking at the code: https://github.com/pandas-dev/pandas/blob/v2.2.2/pandas/core/frame.py#L6935
It checks if the subset is np.iterable, then a set is created and len(subset) is called. This could be done with collections as well.
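A rough sketch of the three operations described above, to show that none of them needs indexing. This is not the pandas source: check_subset is a hypothetical helper, and isinstance(..., Iterable) is used as an assumed stand-in for the np.iterable check so the example has no dependencies.

```python
from collections.abc import Iterable


def check_subset(subset, columns):
    """Hypothetical helper mirroring the steps described in the issue."""
    # Step 1: iterable check (stand-in for np.iterable); a plain string
    # is treated as a single column label, not as a sequence of characters.
    if isinstance(subset, str) or not isinstance(subset, Iterable):
        subset = (subset,)

    # Step 2: build a set to find labels that are not actual columns.
    missing = set(subset) - set(columns)
    if missing:
        raise KeyError(sorted(missing, key=str))

    # Step 3: len() is the only other thing asked of the argument.
    return len(subset)


columns = ["column1", "column2", "column3"]
n_from_set = check_subset({"column1", "column3"}, columns)
n_from_list = check_subset(["column1", "column3"], columns)
n_from_label = check_subset("column1", columns)
```

Iteration, set() and len() are all available on any Collection, which is the point of the request.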
Am I missing anything, or could this Sequence be made a Collection?
Feature Description
If I am not mistaken, Sequence could simply be replaced with Collection in duplicated as well as in drop_duplicates, plus in the documentation.
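For illustration, this is roughly what the widened annotation could look like on a toy, pure-Python stand-in. drop_duplicate_rows and SubsetArg are hypothetical names, not pandas API, and the body is only there to show that a set type-checks and works.

```python
from collections.abc import Collection, Hashable
from typing import Union

# Proposed shape of the hint: Collection instead of Sequence.
SubsetArg = Union[Hashable, Collection[Hashable], None]


def drop_duplicate_rows(rows: list, subset: SubsetArg = None) -> list:
    """Toy stand-in: keep the first row for each distinct key."""
    if subset is None:
        keys = None
    elif isinstance(subset, str) or not isinstance(subset, Collection):
        keys = [subset]  # a single column label
    else:
        keys = sorted(subset, key=str)  # sets are unordered; fix an order

    seen = set()
    kept = []
    for row in rows:
        cols = keys if keys is not None else sorted(row, key=str)
        key = tuple(row[c] for c in cols)
        if key not in seen:
            seen.add(key)
            kept.append(row)
    return kept


rows = [
    {"column1": 1, "column2": "a", "column3": 2},
    {"column1": 1, "column2": "b", "column3": 2},
    {"column1": 3, "column2": "a", "column3": 2},
]
deduped = drop_duplicate_rows(rows, subset={"column1", "column3"})
```

With Sequence[Hashable] in the hint, a type checker would reject the call on the last line even though it runs fine; with Collection[Hashable] it is accepted.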
Alternative Solutions
no alt solution required
Additional Context
No response