PERF: DataFrame.duplicated with subset= for 1 column is slower than Series.duplicated · Issue #45236 · pandas-dev/pandas
Pandas version checks
- I have checked that this issue has not already been reported.
- I have confirmed this issue exists on the latest version of pandas.
- I have confirmed this issue exists on the master branch of pandas.
Reproducible Example
Set up:
import pandas as pd

df = pd.DataFrame({
    'brand': ['Yum Yum', 'Yum Yum', 'Indomie', 'Indomie', 'Indomie'],
    'style': ['cup', 'cup', 'cup', 'pack', 'pack'],
    'rating': [4, 4, 3.5, 15, 5]
})
The following are three ways to check for duplicates in the 'brand' column.
- Using subset=['brand'] averages 176 µs:
%timeit df.duplicated(subset=['brand'])
176 µs ± 2.61 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
- Using [] indexing averages 31.2 µs:
%timeit df['brand'].duplicated()
31.2 µs ± 213 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
- Using .loc[:,'brand'] indexing averages 40.6 µs:
%timeit df.loc[:,'brand'].duplicated()
40.6 µs ± 141 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
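The first approach, df.duplicated(subset=['brand']), is the exact example used in the documentation for pandas.DataFrame.duplicated, yet it is the slowest of the three: roughly 5.6 times slower than df['brand'].duplicated().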
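For anyone who wants to reproduce the comparison outside IPython, the sketch below times the same three calls with the standard-library timeit module and checks that they flag the same rows. The helper name time_us and the loop counts are my own choices, not part of the report, and absolute numbers will vary by machine and pandas version.

```python
import timeit

import pandas as pd

df = pd.DataFrame({
    "brand": ["Yum Yum", "Yum Yum", "Indomie", "Indomie", "Indomie"],
    "style": ["cup", "cup", "cup", "pack", "pack"],
    "rating": [4, 4, 3.5, 15, 5],
})

# The three spellings from the report, keyed by a short label.
candidates = {
    "subset": lambda: df.duplicated(subset=["brand"]),
    "getitem": lambda: df["brand"].duplicated(),
    "loc": lambda: df.loc[:, "brand"].duplicated(),
}

# Sanity check: every spelling should mark the same rows as duplicates.
results = {name: fn() for name, fn in candidates.items()}
expected = results["subset"].tolist()
assert all(res.tolist() == expected for res in results.values())


def time_us(fn, number=1000, repeat=5):
    """Hypothetical helper: best-of-`repeat` mean time per call, in microseconds."""
    return min(timeit.repeat(fn, number=number, repeat=repeat)) / number * 1e6


timings = {name: time_us(fn) for name, fn in candidates.items()}
for name, micros in timings.items():
    print(f"{name:8s} {micros:8.1f} us per call")
```

Taking the minimum over several repeats is a common way to reduce noise from other processes; it is not the same statistic as the mean ± std. dev. that %timeit prints, so the figures are comparable only in relative terms.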
Installed Versions
---------------------------------------------------------------------------
AssertionError Traceback (most recent call last)
<ipython-input-2-3d232a07e144> in <module>
----> 1 pd.show_versions()
/opt/homebrew/Caskroom/miniconda/base/envs/pandas/lib/python3.10/site-packages/pandas/util/_print_versions.py in show_versions(as_json)
107 """
108 sys_info = _get_sys_info()
--> 109 deps = _get_dependency_info()
110
111 if as_json:
/opt/homebrew/Caskroom/miniconda/base/envs/pandas/lib/python3.10/site-packages/pandas/util/_print_versions.py in _get_dependency_info()
86 result: dict[str, JSONSerializable] = {}
87 for modname in deps:
---> 88 mod = import_optional_dependency(modname, errors="ignore")
89 result[modname] = get_version(mod) if mod else None
90 return result
/opt/homebrew/Caskroom/miniconda/base/envs/pandas/lib/python3.10/site-packages/pandas/compat/_optional.py in import_optional_dependency(name, extra, errors, min_version)
113 )
114 try:
--> 115 module = importlib.import_module(name)
116 except ImportError:
117 if errors == "raise":
/opt/homebrew/Caskroom/miniconda/base/envs/pandas/lib/python3.10/importlib/__init__.py in import_module(name, package)
124 break
125 level += 1
--> 126 return _bootstrap._gcd_import(name[level:], package, level)
127
128
/opt/homebrew/Caskroom/miniconda/base/envs/pandas/lib/python3.10/importlib/_bootstrap.py in _gcd_import(name, package, level)
/opt/homebrew/Caskroom/miniconda/base/envs/pandas/lib/python3.10/importlib/_bootstrap.py in _find_and_load(name, import_)
/opt/homebrew/Caskroom/miniconda/base/envs/pandas/lib/python3.10/importlib/_bootstrap.py in _find_and_load_unlocked(name, import_)
/opt/homebrew/Caskroom/miniconda/base/envs/pandas/lib/python3.10/importlib/_bootstrap.py in _load_unlocked(spec)
/opt/homebrew/Caskroom/miniconda/base/envs/pandas/lib/python3.10/importlib/_bootstrap_external.py in exec_module(self, module)
/opt/homebrew/Caskroom/miniconda/base/envs/pandas/lib/python3.10/importlib/_bootstrap.py in _call_with_frames_removed(f, *args, **kwds)
/opt/homebrew/Caskroom/miniconda/base/envs/pandas/lib/python3.10/site-packages/setuptools/__init__.py in <module>
6 import re
7
----> 8 import _distutils_hack.override # noqa: F401
9
10 import distutils.core
/opt/homebrew/Caskroom/miniconda/base/envs/pandas/lib/python3.10/site-packages/_distutils_hack/override.py in <module>
----> 1 __import__('_distutils_hack').do_override()
/opt/homebrew/Caskroom/miniconda/base/envs/pandas/lib/python3.10/site-packages/_distutils_hack/__init__.py in do_override()
71 if enabled():
72 warn_distutils_present()
---> 73 ensure_local_distutils()
74
75
/opt/homebrew/Caskroom/miniconda/base/envs/pandas/lib/python3.10/site-packages/_distutils_hack/__init__.py in ensure_local_distutils()
59 # check that submodules load as expected
60 core = importlib.import_module('distutils.core')
---> 61 assert '_distutils' in core.__file__, core.__file__
62
63
AssertionError: /opt/homebrew/Caskroom/miniconda/base/envs/pandas/lib/python3.10/distutils/core.py
I don't think this is expected 😳. This is how the environment was set up:
mamba create -n pandas pandas ipython -y
If it helps, here is the output of conda list:
# packages in environment at /opt/homebrew/Caskroom/miniconda/base/envs/pandas:
#
# Name Version Build Channel
appnope 0.1.2 py310h2ec42d9_2 conda-forge
backcall 0.2.0 pyh9f0ad1d_0 conda-forge
backports 1.0 py_2 conda-forge
backports.functools_lru_cache 1.6.4 pyhd8ed1ab_0 conda-forge
bzip2 1.0.8 h0d85af4_4 conda-forge
ca-certificates 2021.10.8 h033912b_0 conda-forge
decorator 5.1.0 pyhd8ed1ab_0 conda-forge
ipython 7.31.0 py310h2ec42d9_0 conda-forge
jedi 0.18.1 py310h2ec42d9_0 conda-forge
libblas 3.9.0 12_osx64_openblas conda-forge
libcblas 3.9.0 12_osx64_openblas conda-forge
libcxx 12.0.1 habf9029_1 conda-forge
libffi 3.4.2 h0d85af4_5 conda-forge
libgfortran 5.0.0 9_3_0_h6c81a4c_23 conda-forge
libgfortran5 9.3.0 h6c81a4c_23 conda-forge
liblapack 3.9.0 12_osx64_openblas conda-forge
libopenblas 0.3.18 openmp_h3351f45_0 conda-forge
libzlib 1.2.11 h9173be1_1013 conda-forge
llvm-openmp 12.0.1 hda6cdc1_1 conda-forge
matplotlib-inline 0.1.3 pyhd8ed1ab_0 conda-forge
ncurses 6.2 h2e338ed_4 conda-forge
numpy 1.22.0 py310hfbbbacf_0 conda-forge
openssl 3.0.0 h0d85af4_2 conda-forge
pandas 1.3.5 py310hdd25497_0 conda-forge
parso 0.8.3 pyhd8ed1ab_0 conda-forge
pexpect 4.8.0 pyh9f0ad1d_2 conda-forge
pickleshare 0.7.5 py_1003 conda-forge
pip 21.3.1 pyhd8ed1ab_0 conda-forge
prompt-toolkit 3.0.24 pyha770c72_0 conda-forge
ptyprocess 0.7.0 pyhd3deb0d_0 conda-forge
pygments 2.11.1 pyhd8ed1ab_0 conda-forge
python 3.10.1 h38b4d05_2_cpython conda-forge
python-dateutil 2.8.2 pyhd8ed1ab_0 conda-forge
python_abi 3.10 2_cp310 conda-forge
pytz 2021.3 pyhd8ed1ab_0 conda-forge
readline 8.1 h05e3726_0 conda-forge
setuptools 60.2.0 py310h2ec42d9_0 conda-forge
six 1.16.0 pyh6c4a22f_0 conda-forge
sqlite 3.37.0 h23a322b_0 conda-forge
tk 8.6.11 h5dbffcc_1 conda-forge
traitlets 5.1.1 pyhd8ed1ab_0 conda-forge
tzdata 2021e he74cb21_0 conda-forge
wcwidth 0.2.5 pyh9f0ad1d_2 conda-forge
wheel 0.37.1 pyhd8ed1ab_0 conda-forge
xz 5.2.5 haf1e3a3_1 conda-forge
zlib 1.2.11 h9173be1_1013 conda-forge
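Since pd.show_versions() raises in this environment, a minimal sketch like the one below can report the core versions by hand. It reads version attributes directly and does not go through the optional-dependency import where the traceback above fails. Which fields to include is my assumption about what is useful for triage; the dictionary name versions is only illustrative.

```python
import platform
import sys

import numpy as np
import pandas as pd

# Collect only the basics; nothing here imports setuptools,
# which is the import that fails inside pd.show_versions() above.
versions = {
    "python": sys.version.split()[0],
    "platform": platform.platform(),
    "pandas": pd.__version__,
    "numpy": np.__version__,
}

for name, version in versions.items():
    print(f"{name:10s}: {version}")
```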
Prior Performance
No response