BUG: DataFrame.describe() breaks with a column index of object type and numeric entries · Issue #13288 · pandas-dev/pandas

While preparing a commit for another issue in .describe(), I ran into this puzzling bug, which is surprisingly easy to trigger.

Symptoms

import pandas as pd
df = pd.DataFrame({'A': list("BCDE"), 0: [1,2,3,4]})
df.describe()

Long traceback listing formatting and internal functions...

ValueError: Buffer dtype mismatch, expected 'Python object' but got 'long'

However:

df.describe(include='all')
               0    A
count   4.000000    4
unique       NaN    4
top          NaN    D
freq         NaN    1
mean    2.500000  NaN
std     1.290994  NaN
min     1.000000  NaN
25%     1.750000  NaN
50%     2.500000  NaN
75%     3.250000  NaN
max     4.000000  NaN

The call itself succeeds as long as the result is not printed on screen:

x = df.describe()
x.columns
Out[8]: Index([0], dtype='int64')

Casting this suspicious index to object fixes the display (casting to int works too):

x.columns = x.columns.astype(object)
x
Out[10]:
              0
count  4.000000
mean   2.500000
std    1.290994
min    1.000000
25%    1.750000
50%    2.500000
75%    3.250000
max    4.000000

The same issue occurs with a simpler data frame:

df0 = pd.DataFrame([1,2,3,4])

With the default column index, describe() works:

df0.describe()
Out[28]:
              0
count  4.000000
mean   2.500000
std    1.290994
min    1.000000
25%    1.750000
50%    2.500000
75%    3.250000
max    4.000000

Now replace the column index with one of object dtype:

df0.columns = pd.Index([0], dtype=object)
df0.describe()

...

ValueError: Buffer dtype mismatch, expected 'Python object' but got 'long'
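The reproduction above could be turned into a small regression check. This is only a sketch: the function name `check_describe_object_columns` is hypothetical and not part of pandas, and the final comparison assumes a version in which the summary keeps the column index of the input frame. On an affected version the `repr` call is where the ValueError would surface.

```python
import pandas as pd


def check_describe_object_columns():
    """Hypothetical regression check; the name is not from pandas."""
    df0 = pd.DataFrame([1, 2, 3, 4])
    # Same data as before, but the column index is stored with object dtype.
    df0.columns = pd.Index([0], dtype=object)

    result = df0.describe()
    # Rendering the frame is the step that raised in the report above.
    repr(result)
    # Assumption: a fixed version hands the input's column dtype through.
    return result.columns.dtype == object


print(check_describe_object_columns())
```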

Current version (but the bug is also present in pandas release 0.18.1):

pd.show_versions()

INSTALLED VERSIONS
------------------
commit: None
python: 3.5.1.final.0
python-bits: 64
OS: Linux
OS-release: 4.1.20-1
machine: x86_64
processor: Intel(R)_Core(TM)_i5-2520M_CPU_@_2.50GHz
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8

pandas: 0.18.1+64.g7ed22fe.dirty
nose: 1.3.7
pip: 8.1.2
setuptools: 21.0.0
Cython: 0.24
numpy: 1.11.0
scipy: 0.17.0.dev0+3f3c371
IPython: 4.0.1
...

Reason

My guess is that some internal function gets confused by the dtype of the column index. The faulty index itself, however, is created in .describe().

Output from %debug df.describe()

NDFrame.describe() in pandas/core/generic.py:

   4943             data = self
   4944         else:
   4945             data = self.select_dtypes(include=include, exclude=exclude)
   4946
   4947         ldesc = [describe_1d(s, percentiles) for _, s in data.iteritems()]
   4948         # set a convenient order for rows
   4949         names = []
   4950         ldesc_indexes = sorted([x.index for x in ldesc], key=len)
   4951         for idxnames in ldesc_indexes:
   4952             for name in idxnames:
   4953                 if name not in names:
   4954                     names.append(name)
   4955
   4956         d = pd.concat(ldesc, join_axes=pd.Index([names]), axis=1)
1> 4957         d.columns = self.columns._shallow_copy(values=d.columns.values)
   4958         d.columns.names = data.columns.names
   4959         return d

_shallow_copy() in the marked line changes d.columns:

ipdb> p d.columns
Int64Index([0], dtype='int64')
ipdb> n

/home/users/piotr/workspace/pandas-pijucha/pandas/core/generic.py(4958)describe()
1    4957         d.columns = self.columns._shallow_copy(values=d.columns.values)
-> 4958         d.columns.names = data.columns.names
   4959         return d
ipdb> p d.columns
Index([0], dtype='int64')
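For comparison, here is a minimal sketch of what a consistent index looks like when it is built through the public constructor rather than through `_shallow_copy()`. The variable names are illustrative only, and this does not exercise the code path in describe(). It just shows that the reported dtype and the dtype of the stored array agree in both the integer and the object case, whereas the session above ends with a generic Index reporting int64.

```python
import numpy as np
import pandas as pd

# An integer array, standing in for d.columns.values in the session above.
values = np.array([0])

int_idx = pd.Index(values)                # dtype inferred from the values
obj_idx = pd.Index(values, dtype=object)  # object dtype requested explicitly

# In a consistent index the reported dtype matches the stored array's dtype.
print(int_idx.dtype, int_idx.values.dtype)
print(obj_idx.dtype, obj_idx.values.dtype)
```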

Possible solutions

Lines 4957-4958 are actually used to fix issues that pd.concat brings about. They try to pass the column structure from self to d.
I think a simpler solution is replacing these lines with:

d = pd.concat(ldesc, join_axes=pd.Index([names]), axis=1)
d.columns = data.columns
return d

or

d = pd.DataFrame(pd.concat(ldesc, axis=1), index=pd.Index(names), columns=data.columns)
return d

data is a subframe of self and retains the same column structure.
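The idea behind the first variant can be tried outside of pandas internals. The following is a rough sketch under stated assumptions, not the actual patch: `describe_numeric_sketch` is a hypothetical helper, it handles only numeric columns, and it uses `Series.describe()` in place of the internal `describe_1d`. It also leaves out the row-ordering step, since all numeric summaries share the same rows.

```python
import pandas as pd


def describe_numeric_sketch(df):
    """Hypothetical stand-in for the tail of describe(); not pandas code."""
    data = df.select_dtypes(include="number")
    # Summarise each column by position, so column labels are never looked up.
    ldesc = [data.iloc[:, i].describe() for i in range(data.shape[1])]
    d = pd.concat(ldesc, axis=1)
    # Take the column index from the subframe instead of rebuilding it.
    d.columns = data.columns
    return d


df = pd.DataFrame({'A': list("BCDE"), 0: [1, 2, 3, 4]})
x = describe_numeric_sketch(df)
print(x.columns.dtype)
```

Because the column index is copied from `data` rather than reconstructed from the concatenated values, its dtype follows the input frame.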

pd.concat has some parameters that help pass a hierarchical index but can't do anything on its own with a categorical one.

I'm going to submit a pull request with this fix, together with some others related to describe(). I hope I haven't overlooked anything obvious, but if I have, any comments are very welcome.