"maximum recursion depth exceeded" when calculating duplicates in big DataFrame (regression comparing to the old version) · Issue #21524 · pandas-dev/pandas
Code Sample, a copy-pastable example if possible
I'm currently in the middle of upgrading an old system from pandas 0.12 to the new version (0.23.0). One part of the system detects duplicate columns in medium-sized DataFrames (~100 columns, a few thousand rows). We detect them like this: `dupes = df.T.duplicated()`. It worked previously, but after the upgrade it started failing. The simplest snippet to reproduce this locally:
```python
import numpy as np
import pandas as pd

data = {}
for i in range(70):
    data['col_{0:02d}'.format(i)] = np.random.randint(0, 1000, 20000)
df = pd.DataFrame(data)
dupes = df.T.duplicated()
print(dupes)
```
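As a stopgap we can avoid the transpose entirely by fingerprinting each column's raw bytes and calling `duplicated()` on the resulting Series. This is my own sketch, not a pandas API; note that it compares raw buffers, so equal values stored under different dtypes will not be flagged as duplicates:

```python
import numpy as np
import pandas as pd

rng = np.random.RandomState(0)
df = pd.DataFrame({'col_a': rng.randint(0, 1000, 20000),
                   'col_b': rng.randint(0, 1000, 20000)})
df['col_c'] = df['col_a']  # deliberate duplicate column

# Fingerprint every column by its raw byte buffer instead of transposing
# the frame; Series.duplicated on scalar fingerprints never touches the
# recursive group-index code path that blows up above.
fingerprints = df.apply(lambda col: col.values.tobytes())
dupes = fingerprints.duplicated()
```

Here `dupes` is a boolean Series indexed by column name, with `col_c` marked as a duplicate of `col_a`.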
Problem description
Contrary to the issue-template note about upgrading, this issue isn't resolved by upgrading to the newest pandas. On the contrary, it is caused by the upgrade :) The old implementation, copied below from 0.12, works on the snippet above:
```python
def old_duplicated(self, cols=None, take_last=False):
    """
    Return boolean Series denoting duplicate rows, optionally only
    considering certain columns

    Parameters
    ----------
    cols : column label or sequence of labels, optional
        Only consider certain columns for identifying duplicates, by
        default use all of the columns
    take_last : boolean, default False
        Take the last observed row in a row. Defaults to the first row

    Returns
    -------
    duplicated : Series
    """
    # kludge for #1833
    def _m8_to_i8(x):
        if issubclass(x.dtype.type, np.datetime64):
            return x.view(np.int64)
        return x

    if cols is None:
        values = list(_m8_to_i8(self.values.T))
    else:
        if np.iterable(cols) and not isinstance(cols, basestring):
            if isinstance(cols, tuple):
                if cols in self.columns:
                    values = [self[cols]]
                else:
                    values = [_m8_to_i8(self[x].values) for x in cols]
            else:
                values = [_m8_to_i8(self[x].values) for x in cols]
        else:
            values = [self[cols]]

    keys = lib.fast_zip_fillna(values)
    duplicated = lib.duplicated(keys, take_last=take_last)
    return pd.Series(duplicated, index=self.index)
```
but the new one now fails with:

```
Traceback (most recent call last):
  File "/home/modintsov/workspace/DataRobot/playground.py", line 56, in <module>
    dupes = df.T.duplicated()
  File "/home/modintsov/.virtualenvs/dev/local/lib/python2.7/site-packages/pandas/core/frame.py", line 4384, in duplicated
    ids = get_group_index(labels, shape, sort=False, xnull=False)
  File "/home/modintsov/.virtualenvs/dev/local/lib/python2.7/site-packages/pandas/core/sorting.py", line 95, in get_group_index
    return loop(list(labels), list(shape))
  File "/home/modintsov/.virtualenvs/dev/local/lib/python2.7/site-packages/pandas/core/sorting.py", line 86, in loop
    return loop(labels, shape)
  ... many-many lines of the same...
  File "/home/modintsov/.virtualenvs/dev/local/lib/python2.7/site-packages/pandas/core/sorting.py", line 60, in loop
    stride = np.prod(shape[1:nlev], dtype='i8')
  File "/home/modintsov/.virtualenvs/dev/local/lib/python2.7/site-packages/numpy/core/fromnumeric.py", line 2566, in prod
    out=out, **kwargs)
RuntimeError: maximum recursion depth exceeded
```
Which is obviously a regression.
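For completeness: raising the interpreter recursion limit papers over the symptom, since the depth of the recursive `loop` in `sorting.py` grows with the number of columns of the transposed frame. I don't consider it a fix, only a fragile workaround; the limit value below is a guess, and setting it too high risks a hard interpreter crash:

```python
import sys

import numpy as np
import pandas as pd

df = pd.DataFrame({'col_{0:02d}'.format(i): np.random.randint(0, 1000, 20000)
                   for i in range(70)})

old_limit = sys.getrecursionlimit()
sys.setrecursionlimit(10000)  # guessed value; too large a limit can crash the process
try:
    dupes = df.T.duplicated()
finally:
    sys.setrecursionlimit(old_limit)  # always restore the default
```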
Expected Output
I expect no exception and a boolean Series to be returned. For the example above, old pandas outputs this:
```
col_00    False
col_01    False
col_02    False
col_03    False
col_04    False
col_05    False
col_06    False
col_07    False
col_08    False
col_09    False
col_10    False
col_11    False
col_12    False
col_13    False
col_14    False
...
col_55    False
col_56    False
col_57    False
col_58    False
col_59    False
col_60    False
col_61    False
col_62    False
col_63    False
col_64    False
col_65    False
col_66    False
col_67    False
col_68    False
col_69    False
Length: 70, dtype: bool
```
Output of pd.show_versions()
```
INSTALLED VERSIONS
------------------
commit: None
python: 2.7.12.final.0
python-bits: 64
OS: Linux
OS-release: 4.4.0-128-generic
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: None.None

pandas: 0.23.0
pytest: 3.5.1
pip: 9.0.1
setuptools: 39.2.0
Cython: 0.21
numpy: 1.14.3
scipy: 1.1.0
pyarrow: None
xarray: None
IPython: 5.5.0
sphinx: 1.5.5
patsy: 0.2.1
dateutil: 2.7.3
pytz: 2015.7
blosc: None
bottleneck: None
tables: None
numexpr: 2.6.5
feather: None
matplotlib: None
openpyxl: None
xlrd: 0.9.2
xlwt: 0.7.5
xlsxwriter: None
lxml: None
bs4: 4.6.0
html5lib: None
sqlalchemy: 1.2.7
pymysql: None
psycopg2: 2.7.3.2.dr2 (dt dec pq3 ext lo64)
jinja2: 2.10
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None
```