"maximum recursion depth exceeded" when calculating duplicates in big DataFrame (regression comparing to the old version) · Issue #21524 · pandas-dev/pandas (original) (raw)

Code Sample, a copy-pastable example if possible

I'm in the middle of upgrading an old system from pandas 0.12 to 0.23.0. One part of the system detects duplicate columns in medium-sized DataFrames (~100 columns, a few thousand rows). We detect them with dupes = df.T.duplicated(), which worked before the upgrade but started failing after it. The simplest snippet to reproduce this locally:

import numpy as np
import pandas as pd

data = {}
for i in range(70):
    data['col_{0:02d}'.format(i)] = np.random.randint(0, 1000, 20000)
df = pd.DataFrame(data)
dupes = df.T.duplicated()
print(dupes)

Problem description

Contrary to the note below, this issue isn't resolved by upgrading to the newest pandas. On the contrary, it is caused by that upgrade :) The old implementation, copied below from 0.12, works on the snippet above

def old_duplicated(self, cols=None, take_last=False):
    """
    Return boolean Series denoting duplicate rows, optionally only
    considering certain columns

    Parameters
    ----------
    cols : column label or sequence of labels, optional
        Only consider certain columns for identifying duplicates, by
        default use all of the columns
    take_last : boolean, default False
        Take the last observed row in a row. Defaults to the first row

    Returns
    -------
    duplicated : Series
    """

    # kludge for #1833
    def _m8_to_i8(x):
        if issubclass(x.dtype.type, np.datetime64):
            return x.view(np.int64)
        return x

    if cols is None:
        values = list(_m8_to_i8(self.values.T))
    else:
        if np.iterable(cols) and not isinstance(cols, basestring):
            if isinstance(cols, tuple):
                if cols in self.columns:
                    values = [self[cols]]
                else:
                    values = [_m8_to_i8(self[x].values) for x in cols]
            else:
                values = [_m8_to_i8(self[x].values) for x in cols]
        else:
            values = [self[cols]]

    keys = lib.fast_zip_fillna(values)
    duplicated = lib.duplicated(keys, take_last=take_last)
    return pd.Series(duplicated, index=self.index)
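For readers on current pandas: the row-key idea in this 0.12 code (zip each row's values into one hashable key, then mark repeats) can be approximated with public primitives. This is my own sketch, not the pandas internals; it does not reproduce the NaN-filling that fast_zip_fillna performed:

```python
import pandas as pd

def duplicated_rows(df, take_last=False):
    # Build one hashable key per row, mirroring what
    # lib.fast_zip_fillna + lib.duplicated did in 0.12.
    # NaN values are left as-is here (0.12 filled them first).
    keys = pd.Series([tuple(row) for row in df.values], index=df.index)
    return keys.duplicated(keep='last' if take_last else 'first')

df = pd.DataFrame({'a': [1, 1, 2], 'b': [3, 3, 4]})
print(duplicated_rows(df).tolist())  # [False, True, False]
```

With take_last=False the first occurrence of each repeated row is kept (marked False) and later repeats are marked True, matching the old signature.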

but the new implementation in pandas 0.23.0 fails with:

Traceback (most recent call last):
  File "/home/modintsov/workspace/DataRobot/playground.py", line 56, in <module>
    dupes = df.T.duplicated()
  File "/home/modintsov/.virtualenvs/dev/local/lib/python2.7/site-packages/pandas/core/frame.py", line 4384, in duplicated
    ids = get_group_index(labels, shape, sort=False, xnull=False)
  File "/home/modintsov/.virtualenvs/dev/local/lib/python2.7/site-packages/pandas/core/sorting.py", line 95, in get_group_index
    return loop(list(labels), list(shape))
  File "/home/modintsov/.virtualenvs/dev/local/lib/python2.7/site-packages/pandas/core/sorting.py", line 86, in loop
    return loop(labels, shape)

... many more lines of the same frame ...

  File "/home/modintsov/.virtualenvs/dev/local/lib/python2.7/site-packages/pandas/core/sorting.py", line 60, in loop
    stride = np.prod(shape[1:nlev], dtype='i8')
  File "/home/modintsov/.virtualenvs/dev/local/lib/python2.7/site-packages/numpy/core/fromnumeric.py", line 2566, in prod
    out=out, **kwargs)
RuntimeError: maximum recursion depth exceeded

Which is obviously a regression.
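Until this is fixed upstream, one possible workaround (my own sketch, not from pandas; duplicated_columns is a hypothetical helper) is to fingerprint each column instead of transposing, so pandas never has to build group indices over 20000 transposed columns:

```python
import numpy as np
import pandas as pd

def duplicated_columns(df):
    # Hypothetical helper: hash each column's raw bytes so columns with
    # identical data get identical keys, then mark repeated keys.
    # A production version should verify actual equality on hash matches,
    # since hash collisions are theoretically possible.
    keys = pd.Series({col: hash(df[col].values.tobytes())
                      for col in df.columns})
    return keys.duplicated()

data = {'col_{0:02d}'.format(i): np.random.randint(0, 1000, 20000)
        for i in range(70)}
df = pd.DataFrame(data)
df['dup_of_00'] = df['col_00']  # plant one duplicate column
dupes = duplicated_columns(df)
print(dupes['dup_of_00'])  # True
print(dupes['col_00'])     # False
```

This sidesteps df.T entirely, so it avoids the recursive get_group_index path shown in the traceback.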


Expected Output

I expect no exception and a bool Series to be returned. The example above outputs this in old pandas:

col_00    False
col_01    False
col_02    False
col_03    False
col_04    False
col_05    False
col_06    False
col_07    False
col_08    False
col_09    False
col_10    False
col_11    False
col_12    False
col_13    False
col_14    False
...
col_55    False
col_56    False
col_57    False
col_58    False
col_59    False
col_60    False
col_61    False
col_62    False
col_63    False
col_64    False
col_65    False
col_66    False
col_67    False
col_68    False
col_69    False
Length: 70, dtype: bool

Output of pd.show_versions()


pd.show_versions()
INSTALLED VERSIONS

commit: None
python: 2.7.12.final.0
python-bits: 64
OS: Linux
OS-release: 4.4.0-128-generic
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: None.None
pandas: 0.23.0
pytest: 3.5.1
pip: 9.0.1
setuptools: 39.2.0
Cython: 0.21
numpy: 1.14.3
scipy: 1.1.0
pyarrow: None
xarray: None
IPython: 5.5.0
sphinx: 1.5.5
patsy: 0.2.1
dateutil: 2.7.3
pytz: 2015.7
blosc: None
bottleneck: None
tables: None
numexpr: 2.6.5
feather: None
matplotlib: None
openpyxl: None
xlrd: 0.9.2
xlwt: 0.7.5
xlsxwriter: None
lxml: None
bs4: 4.6.0
html5lib: None
sqlalchemy: 1.2.7
pymysql: None
psycopg2: 2.7.3.2.dr2 (dt dec pq3 ext lo64)
jinja2: 2.10
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None