BUG: DataFrame.describe() breaks with a column index of object type and numeric entries · Issue #13288 · pandas-dev/pandas
While preparing a commit for another issue in .describe(), I encountered this puzzling bug, which is surprisingly easy to trigger.
Symptoms
import pandas as pd
df = pd.DataFrame({'A': list("BCDE"), 0: [1, 2, 3, 4]})
df.describe()
Long traceback listing formatting and internal functions...
ValueError: Buffer dtype mismatch, expected 'Python object' but got 'long'
However:
df.describe(include='all')
               0    A
count   4.000000    4
unique       NaN    4
top          NaN    D
freq         NaN    1
mean    2.500000  NaN
std     1.290994  NaN
min     1.000000  NaN
25%     1.750000  NaN
50%     2.500000  NaN
75%     3.250000  NaN
max     4.000000  NaN
The call itself succeeds as long as the result is not printed on screen:
x = df.describe()
x.columns
Out[8]: Index([0], dtype='int64')
Fixing this suspicious index (int works too):
x.columns = x.columns.astype(object)
x
Out[10]:
              0
count  4.000000
mean   2.500000
std    1.290994
min    1.000000
25%    1.750000
50%    2.500000
75%    3.250000
max    4.000000
The same issue occurs with a simpler data frame:
df0 = pd.DataFrame([1,2,3,4])
It's OK now
df0.describe()
Out[28]:
              0
count  4.000000
mean   2.500000
std    1.290994
min    1.000000
25%    1.750000
50%    2.500000
75%    3.250000
max    4.000000
Modify column index:
df0.columns = pd.Index([0], dtype=object)
df0.describe()
...
ValueError: Buffer dtype mismatch, expected 'Python object' but got 'long'
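The reproduction above can be wrapped in a small self-contained check. This is only a sketch: the helper name `describe_repr_error` is hypothetical, and it assumes that the failure surfaces as a `ValueError` when the result is formatted, as the traceback in this report suggests. On a pandas version that is not affected, the helper simply returns `None`.

```python
import pandas as pd


def describe_repr_error(frame):
    """Return the ValueError message raised while printing describe(), or None.

    Hypothetical helper for this report; not part of pandas.
    """
    try:
        # repr() triggers the same formatting path as printing on screen.
        repr(frame.describe())
    except ValueError as exc:
        return str(exc)
    return None


# Simplest frame from the report: a numeric column whose label 0
# lives in a column index of object dtype.
df0 = pd.DataFrame([1, 2, 3, 4])
df0.columns = pd.Index([0], dtype=object)

error = describe_repr_error(df0)
print("column index dtype:", df0.columns.dtype)
print("error while printing describe():", error)
```

A non-`None` result means the installed pandas still shows the behaviour described here.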
Current version (but the bug is also present in pandas release 0.18.1):
pd.show_versions()
INSTALLED VERSIONS
------------------
commit: None
python: 3.5.1.final.0
python-bits: 64
OS: Linux
OS-release: 4.1.20-1
machine: x86_64
processor: Intel(R)_Core(TM)_i5-2520M_CPU_@_2.50GHz
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
pandas: 0.18.1+64.g7ed22fe.dirty
nose: 1.3.7
pip: 8.1.2
setuptools: 21.0.0
Cython: 0.24
numpy: 1.11.0
scipy: 0.17.0.dev0+3f3c371
IPython: 4.0.1
...
Reason
I guess some internal function gets confused by the dtype of the column index. The faulty index itself, however, is created in .describe().
Output from %debug df.describe()
NDFrame.describe() in pandas/core/generic.py:
   4943             data = self
   4944         else:
   4945             data = self.select_dtypes(include=include, exclude=exclude)
   4946
   4947         ldesc = [describe_1d(s, percentiles) for _, s in data.iteritems()]
   4948         # set a convenient order for rows
   4949         names = []
   4950         ldesc_indexes = sorted([x.index for x in ldesc], key=len)
   4951         for idxnames in ldesc_indexes:
   4952             for name in idxnames:
   4953                 if name not in names:
   4954                     names.append(name)
   4955
   4956         d = pd.concat(ldesc, join_axes=pd.Index([names]), axis=1)
1> 4957         d.columns = self.columns._shallow_copy(values=d.columns.values)
   4958         d.columns.names = data.columns.names
   4959         return d
_shallow_copy() in the marked line changes d.columns: the Int64Index becomes a plain Index that still reports dtype='int64':
ipdb> p d.columns
Int64Index([0], dtype='int64')
ipdb> n
/home/users/piotr/workspace/pandas-pijucha/pandas/core/generic.py(4958)describe()
1  4957         d.columns = self.columns._shallow_copy(values=d.columns.values)
-> 4958         d.columns.names = data.columns.names
   4959         return d
ipdb> p d.columns
Index([0], dtype='int64')
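The mismatch seen in the debugger can also be inspected from outside, without stepping into .describe(). The sketch below compares the class and dtype of the column index that goes in (the numeric subframe) with the one that comes out. The helper name `index_signature` is hypothetical, and using `select_dtypes(include='number')` to mimic the default column selection is an assumption made for illustration.

```python
import pandas as pd

# Frame from the report: one string label and one integer label,
# so the column index has object dtype.
df = pd.DataFrame({'A': list("BCDE"), 0: [1, 2, 3, 4]})

# Assumption: this approximates the columns describe() summarizes by default.
numeric = df.select_dtypes(include='number')
result = df.describe()


def index_signature(index):
    """Return (class name, dtype string) for an index (hypothetical helper)."""
    return type(index).__name__, str(index.dtype)


expected = index_signature(numeric.columns)
actual = index_signature(result.columns)

print("selected columns:  ", expected)
print("describe() columns:", actual)
print("consistent:", expected == actual)
```

If the two signatures differ, the column index was rebuilt somewhere inside .describe(), which is what the debugger session points to.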
Possible solutions
Lines 4957-4958 are actually used to fix issues that pd.concat brings about. They try to pass the column structure from self to d.
I think a simpler solution is replacing these lines with:
d = pd.concat(ldesc, join_axes=pd.Index([names]), axis=1)
d.columns = data.columns
return d
or
d = pd.DataFrame(pd.concat(ldesc, axis=1), index=pd.Index(names), columns=data.columns)
return d
data is a subframe of self and retains the same column structure.
pd.concat has some parameters that help pass a hierarchical index but can't do anything on its own with a categorical one.
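The first proposal can be sketched outside pandas as a standalone function. This is not the code from the pull request: the name `describe_numeric` is hypothetical, `Series.describe()` stands in for the internal `describe_1d`, and `reindex` replaces the `join_axes` argument so the sketch does not depend on it. It also assumes the frame has at least one numeric column.

```python
import pandas as pd


def describe_numeric(frame):
    """Sketch of the proposed fix: take the columns straight from `data`.

    Hypothetical helper; assumes at least one numeric column.
    """
    data = frame.select_dtypes(include='number')

    # One description per column (stand-in for the internal describe_1d).
    ldesc = [s.describe() for _, s in data.items()]

    # Set a convenient order for rows, as in the original loop.
    names = []
    for idxnames in sorted((x.index for x in ldesc), key=len):
        for name in idxnames:
            if name not in names:
                names.append(name)

    # Align every description on the same rows, then concatenate.
    d = pd.concat([x.reindex(names) for x in ldesc], axis=1)

    # The proposed change: reuse the column index of the subframe as is.
    d.columns = data.columns
    return d


df0 = pd.DataFrame([1, 2, 3, 4])
df0.columns = pd.Index([0], dtype=object)
out = describe_numeric(df0)
print(out.columns)
```

Because `data.columns` is assigned unchanged, the result keeps whatever index class and dtype the subframe had, so nothing has to be rebuilt from `d.columns.values`.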
I'm going to submit a pull request with this fix, together with some others related to describe(). I hope I haven't overlooked anything obvious, but if I have, any comments are very welcome.