duplicated() performance and bug on long rows regression from 0.15.2->0.16.0 · Issue #10161 · pandas-dev/pandas

The following runs quickly in 0.15.2, but the last operation, df.T.duplicated(), has a performance issue in 0.16.0 and 0.16.1.
Also, on a private data set that works in 0.15.2, I get an error in 0.16.0 and 0.16.1 on the same operation.

code:

import numpy
import pandas

# Small frame: 1000 rows x 2 columns, all values identical
df = pandas.DataFrame({'A': [1 for x in range(1000)], 'B': [1 for x in range(1000)]})

print(numpy.count_nonzero(df.duplicated()))
print(numpy.count_nonzero(df.T.duplicated()))

# Large frame: 1000000 rows x 2 columns, so df.T has two very long rows
df = pandas.DataFrame({'A': [1 for x in range(1000000)], 'B': [1 for x in range(1000000)]})

print(numpy.count_nonzero(df.duplicated()))
# This last call is the slow one in 0.16.0 and 0.16.1
print(numpy.count_nonzero(df.T.duplicated()))
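For reference, the counts these calls should print can be cross-checked without pandas. The sketch below is a plain-Python stand-in for what duplicated() reports with its default behaviour (a row is flagged when an identical row appeared earlier). The helper name `count_duplicated` is hypothetical and not part of pandas, and this is not how pandas implements the method; it only gives a version-independent baseline for the small case.

```python
def count_duplicated(rows):
    """Count rows that are identical to some earlier row (first occurrence is kept)."""
    seen = set()
    count = 0
    for row in rows:
        key = tuple(row)  # tuples are hashable, so they can go into a set
        if key in seen:
            count += 1
        else:
            seen.add(key)
    return count


n = 1000
columns = {'A': [1] * n, 'B': [1] * n}

# Row-wise view: n rows of 2 values each, like df.duplicated()
print(count_duplicated(zip(columns['A'], columns['B'])))  # 999

# Transposed view: 2 rows of n values each, like df.T.duplicated()
print(count_duplicated(columns.values()))  # 1
```

If the pandas output differs from these counts on a given version, the problem is in correctness rather than only in speed.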

This is the error I get on the private data set (not yet reproduced with synthetic data):

  File "C:\Anaconda3\lib\site-packages\pandas\util\decorators.py", line 88, in wrapper
    return func(*args, **kwargs)
  File "C:\Anaconda3\lib\site-packages\pandas\core\frame.py", line 2867, in duplicated
    labels, shape = map(list, zip( * map(f, vals)))
  File "C:\Anaconda3\lib\site-packages\pandas\core\frame.py", line 2856, in f
    labels, shape = factorize(vals, size_hint=min(len(self), _SIZE_HINT_LIMIT))
  File "C:\Anaconda3\lib\site-packages\pandas\core\algorithms.py", line 135, in factorize
    labels = table.get_labels(vals, uniques, 0, na_sentinel)
  File "pandas\hashtable.pyx", line 813, in pandas.hashtable.PyObjectHashTable.get_labels (pandas\hashtable.c:14025)
ValueError: Buffer has wrong number of dimensions (expected 1, got 2)