PERF: DataFrame.duplicated with subset= for 1 column is slower than Series.duplicated · Issue #45236 · pandas-dev/pandas

Pandas version checks

Reproducible Example

Set up:

import pandas as pd

df = pd.DataFrame({
    'brand': ['Yum Yum', 'Yum Yum', 'Indomie', 'Indomie', 'Indomie'],
    'style': ['cup', 'cup', 'cup', 'pack', 'pack'],
    'rating': [4, 4, 3.5, 15, 5]
})

Here are three ways to check for duplicates in the 'brand' column.

  1. Using subset=['brand'] averages 176 µs:
    %timeit df.duplicated(subset=['brand'])

176 µs ± 2.61 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)

  2. Using [] indexing averages 31.2 µs:
    %timeit df['brand'].duplicated()

31.2 µs ± 213 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)

  3. Using .loc[:,'brand'] indexing averages 40.6 µs:
    %timeit df.loc[:,'brand'].duplicated()

40.6 µs ± 141 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
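For anyone without IPython, the same comparison can be sketched with the standard-library timeit module. The loop counts below are arbitrary choices, not the ones %timeit picked, and absolute numbers will differ by machine and pandas version.

```python
import timeit

import pandas as pd

df = pd.DataFrame({
    'brand': ['Yum Yum', 'Yum Yum', 'Indomie', 'Indomie', 'Indomie'],
    'style': ['cup', 'cup', 'cup', 'pack', 'pack'],
    'rating': [4, 4, 3.5, 15, 5],
})

# The three spellings from the report, keyed by a readable label.
candidates = {
    "df.duplicated(subset=['brand'])": lambda: df.duplicated(subset=['brand']),
    "df['brand'].duplicated()": lambda: df['brand'].duplicated(),
    "df.loc[:, 'brand'].duplicated()": lambda: df.loc[:, 'brand'].duplicated(),
}

# Sanity check: all three must produce the same boolean mask.
masks = {label: fn().tolist() for label, fn in candidates.items()}

# Loop counts are an assumption chosen to keep the run short.
NUMBER = 1000
REPEAT = 5

timings = {}
for label, fn in candidates.items():
    runs = timeit.repeat(fn, number=NUMBER, repeat=REPEAT)
    # Report the best run, converted to microseconds per call.
    timings[label] = min(runs) / NUMBER * 1e6
    print(f"{label}: {timings[label]:.1f} µs per call")
```

Taking the minimum over the repeats reduces noise from other processes; %timeit reports a mean and standard deviation instead, so the figures are not directly comparable.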

Option (1) is the exact example used in the documentation for pandas.DataFrame.duplicated, yet it is the slowest: roughly 5.6x slower than df['brand'].duplicated().
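One way a caller might sidestep the gap in the meantime is a small wrapper that takes the Series path when subset names exactly one column. The helper name duplicated_by is hypothetical and not part of pandas; the sketch assumes column labels are unique and that subset is passed as a list or tuple.

```python
import pandas as pd


def duplicated_by(df, subset=None, keep="first"):
    """Hypothetical wrapper: use Series.duplicated for a one-column subset.

    Assumes unique column labels, so df[label] returns a Series.
    """
    if isinstance(subset, (list, tuple)) and len(subset) == 1:
        result = df[subset[0]].duplicated(keep=keep)
        # DataFrame.duplicated returns an unnamed Series; match that here.
        result.name = None
        return result
    # Any other subset falls back to the regular DataFrame method.
    return df.duplicated(subset=subset, keep=keep)


df = pd.DataFrame({
    'brand': ['Yum Yum', 'Yum Yum', 'Indomie', 'Indomie', 'Indomie'],
    'style': ['cup', 'cup', 'cup', 'pack', 'pack'],
    'rating': [4, 4, 3.5, 15, 5],
})

single = duplicated_by(df, subset=['brand'])
multi = duplicated_by(df, subset=['brand', 'style'])
print(single.tolist())  # [False, True, False, True, True]
print(multi.tolist())   # [False, True, False, False, True]
```

This only moves the dispatch into user code; it says nothing about where the extra time inside DataFrame.duplicated is actually spent.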

Installed Versions

---------------------------------------------------------------------------
AssertionError                            Traceback (most recent call last)
<ipython-input-2-3d232a07e144> in <module>
----> 1 pd.show_versions()

/opt/homebrew/Caskroom/miniconda/base/envs/pandas/lib/python3.10/site-packages/pandas/util/_print_versions.py in show_versions(as_json)
    107     """
    108     sys_info = _get_sys_info()
--> 109     deps = _get_dependency_info()
    110 
    111     if as_json:

/opt/homebrew/Caskroom/miniconda/base/envs/pandas/lib/python3.10/site-packages/pandas/util/_print_versions.py in _get_dependency_info()
     86     result: dict[str, JSONSerializable] = {}
     87     for modname in deps:
---> 88         mod = import_optional_dependency(modname, errors="ignore")
     89         result[modname] = get_version(mod) if mod else None
     90     return result

/opt/homebrew/Caskroom/miniconda/base/envs/pandas/lib/python3.10/site-packages/pandas/compat/_optional.py in import_optional_dependency(name, extra, errors, min_version)
    113     )
    114     try:
--> 115         module = importlib.import_module(name)
    116     except ImportError:
    117         if errors == "raise":

/opt/homebrew/Caskroom/miniconda/base/envs/pandas/lib/python3.10/importlib/__init__.py in import_module(name, package)
    124                 break
    125             level += 1
--> 126     return _bootstrap._gcd_import(name[level:], package, level)
    127 
    128 

/opt/homebrew/Caskroom/miniconda/base/envs/pandas/lib/python3.10/importlib/_bootstrap.py in _gcd_import(name, package, level)

/opt/homebrew/Caskroom/miniconda/base/envs/pandas/lib/python3.10/importlib/_bootstrap.py in _find_and_load(name, import_)

/opt/homebrew/Caskroom/miniconda/base/envs/pandas/lib/python3.10/importlib/_bootstrap.py in _find_and_load_unlocked(name, import_)

/opt/homebrew/Caskroom/miniconda/base/envs/pandas/lib/python3.10/importlib/_bootstrap.py in _load_unlocked(spec)

/opt/homebrew/Caskroom/miniconda/base/envs/pandas/lib/python3.10/importlib/_bootstrap_external.py in exec_module(self, module)

/opt/homebrew/Caskroom/miniconda/base/envs/pandas/lib/python3.10/importlib/_bootstrap.py in _call_with_frames_removed(f, *args, **kwds)

/opt/homebrew/Caskroom/miniconda/base/envs/pandas/lib/python3.10/site-packages/setuptools/__init__.py in <module>
      6 import re
      7 
----> 8 import _distutils_hack.override  # noqa: F401
      9 
     10 import distutils.core

/opt/homebrew/Caskroom/miniconda/base/envs/pandas/lib/python3.10/site-packages/_distutils_hack/override.py in <module>
----> 1 __import__('_distutils_hack').do_override()

/opt/homebrew/Caskroom/miniconda/base/envs/pandas/lib/python3.10/site-packages/_distutils_hack/__init__.py in do_override()
     71     if enabled():
     72         warn_distutils_present()
---> 73         ensure_local_distutils()
     74 
     75 

/opt/homebrew/Caskroom/miniconda/base/envs/pandas/lib/python3.10/site-packages/_distutils_hack/__init__.py in ensure_local_distutils()
     59     # check that submodules load as expected
     60     core = importlib.import_module('distutils.core')
---> 61     assert '_distutils' in core.__file__, core.__file__
     62 
     63 

AssertionError: /opt/homebrew/Caskroom/miniconda/base/envs/pandas/lib/python3.10/distutils/core.py

I don't think this is expected 😳. This is how the environment was set up:

mamba create -n pandas pandas ipython -y

In case it helps, here is the output of conda list:

# packages in environment at /opt/homebrew/Caskroom/miniconda/base/envs/pandas:
#
# Name                    Version                   Build  Channel
appnope                   0.1.2           py310h2ec42d9_2    conda-forge
backcall                  0.2.0              pyh9f0ad1d_0    conda-forge
backports                 1.0                        py_2    conda-forge
backports.functools_lru_cache 1.6.4              pyhd8ed1ab_0    conda-forge
bzip2                     1.0.8                h0d85af4_4    conda-forge
ca-certificates           2021.10.8            h033912b_0    conda-forge
decorator                 5.1.0              pyhd8ed1ab_0    conda-forge
ipython                   7.31.0          py310h2ec42d9_0    conda-forge
jedi                      0.18.1          py310h2ec42d9_0    conda-forge
libblas                   3.9.0           12_osx64_openblas    conda-forge
libcblas                  3.9.0           12_osx64_openblas    conda-forge
libcxx                    12.0.1               habf9029_1    conda-forge
libffi                    3.4.2                h0d85af4_5    conda-forge
libgfortran               5.0.0           9_3_0_h6c81a4c_23    conda-forge
libgfortran5              9.3.0               h6c81a4c_23    conda-forge
liblapack                 3.9.0           12_osx64_openblas    conda-forge
libopenblas               0.3.18          openmp_h3351f45_0    conda-forge
libzlib                   1.2.11            h9173be1_1013    conda-forge
llvm-openmp               12.0.1               hda6cdc1_1    conda-forge
matplotlib-inline         0.1.3              pyhd8ed1ab_0    conda-forge
ncurses                   6.2                  h2e338ed_4    conda-forge
numpy                     1.22.0          py310hfbbbacf_0    conda-forge
openssl                   3.0.0                h0d85af4_2    conda-forge
pandas                    1.3.5           py310hdd25497_0    conda-forge
parso                     0.8.3              pyhd8ed1ab_0    conda-forge
pexpect                   4.8.0              pyh9f0ad1d_2    conda-forge
pickleshare               0.7.5                   py_1003    conda-forge
pip                       21.3.1             pyhd8ed1ab_0    conda-forge
prompt-toolkit            3.0.24             pyha770c72_0    conda-forge
ptyprocess                0.7.0              pyhd3deb0d_0    conda-forge
pygments                  2.11.1             pyhd8ed1ab_0    conda-forge
python                    3.10.1          h38b4d05_2_cpython    conda-forge
python-dateutil           2.8.2              pyhd8ed1ab_0    conda-forge
python_abi                3.10                    2_cp310    conda-forge
pytz                      2021.3             pyhd8ed1ab_0    conda-forge
readline                  8.1                  h05e3726_0    conda-forge
setuptools                60.2.0          py310h2ec42d9_0    conda-forge
six                       1.16.0             pyh6c4a22f_0    conda-forge
sqlite                    3.37.0               h23a322b_0    conda-forge
tk                        8.6.11               h5dbffcc_1    conda-forge
traitlets                 5.1.1              pyhd8ed1ab_0    conda-forge
tzdata                    2021e                he74cb21_0    conda-forge
wcwidth                   0.2.5              pyh9f0ad1d_2    conda-forge
wheel                     0.37.1             pyhd8ed1ab_0    conda-forge
xz                        5.2.5                haf1e3a3_1    conda-forge
zlib                      1.2.11            h9173be1_1013    conda-forge
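When pd.show_versions() raises as it did above, a shorter version summary can be read from installed package metadata instead. This is a minimal sketch, not a replacement for show_versions: the helper name collect_versions and the package list are illustrative choices, and importlib.metadata requires Python 3.8 or later.

```python
import platform
from importlib.metadata import PackageNotFoundError, version


def collect_versions(packages):
    """Hypothetical helper: map each package name to its installed version."""
    found = {"python": platform.python_version()}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            # Package is not installed in this environment.
            found[name] = None
    return found


# The package list is an assumption; extend it as needed.
versions = collect_versions(["pandas", "numpy", "setuptools"])
for name, ver in versions.items():
    print(f"{name}: {ver}")
```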

Prior Performance

No response