BUG: Binned groupby median function calculates median on empty bins and outputs random numbers · Issue #13629 · pandas-dev/pandas

Code Sample, a copy-pastable example if possible

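import pandas as pd

# Sample values and bin edges 0, 5, ..., 55
d = pd.DataFrame([1, 2, 5, 6, 9, 3, 6, 5, 9, 7, 11, 36, 4, 7, 8, 25, 8, 24])
b = [0, 5, 10, 15, 20, 25, 30, 35, 40, 45, 50, 55]
g = d.groupby(pd.cut(d[0], b))
print(g.mean())    # empty bins give NaN
print(g.median())  # empty bins give arbitrary numbers (the bug)
print(g.get_group('(0, 5]').median())
print(g.get_group('(40, 45]').median())  # raises KeyError: the bin is empty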

Actual Output (expected: NaN for the median of every empty bin, as in the mean output)

                  0
0                  
(0, 5]     3.333333
(5, 10]    7.500000
(10, 15]  11.000000
(15, 20]        NaN
(20, 25]  24.500000
(25, 30]        NaN
(30, 35]        NaN
(35, 40]  36.000000
(40, 45]        NaN
(45, 50]        NaN
(50, 55]        NaN
             0
0             
(0, 5]     3.5
(5, 10]    7.5
(10, 15]  11.0
(15, 20]  18.0
(20, 25]  24.5
(25, 30]  30.5
(30, 35]  30.5
(35, 40]  36.0
(40, 45]  18.0
(45, 50]  18.0
(50, 55]  18.0
0    3.5
dtype: float64
Traceback (most recent call last):

  File "<ipython-input-9-0663486889da>", line 1, in <module>
    runfile('C:/PythonDir/test04.py', wdir='C:/PythonDir')

  File "C:\Anaconda2\lib\site-packages\spyderlib\widgets\externalshell\sitecustomize.py", line 714, in runfile
    execfile(filename, namespace)

  File "C:\Anaconda2\lib\site-packages\spyderlib\widgets\externalshell\sitecustomize.py", line 74, in execfile
    exec(compile(scripttext, filename, 'exec'), glob, loc)

  File "C:/PythonDir/test04.py", line 20, in <module>
    print g.get_group('(40, 45]').median()

  File "C:\Anaconda2\lib\site-packages\pandas\core\groupby.py", line 587, in get_group
    raise KeyError(name)

KeyError: '(40, 45]'

This example shows that the median function of the groupby object returns a random number for an empty bin instead of NaN, which is what the mean function returns. Calling get_group with the key of an empty bin raises a KeyError because the group does not exist, yet the full median output suggests that it does exist and that its value might even be meaningful (as in the (15, 20] bin or the (30, 35] bin). The wrong numbers change between runs; another output of the same code looks like this:

(0, 5]     3.500000e+00
(5, 10]    7.500000e+00
(10, 15]   1.100000e+01
(15, 20]   1.800000e+01
(20, 25]   2.450000e+01
(25, 30]   3.050000e+01
(30, 35]   3.050000e+01
(35, 40]   3.600000e+01
(40, 45]  4.927210e+165
(45, 50]  4.927210e+165
(50, 55]  4.927210e+165
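
A possible workaround, as a minimal sketch rather than a fix for the underlying bug: compute the size of each bin and blank out the median wherever the bin is empty. The names `bins`, `counts`, `median` and `safe_median` are only illustrative. The `observed=False` argument is an assumption about the pandas version in use: it keeps empty bins in the result on newer releases and does not exist in 0.18.1, where it would have to be dropped.

```python
import pandas as pd

d = pd.DataFrame([1, 2, 5, 6, 9, 3, 6, 5, 9, 7, 11, 36, 4, 7, 8, 25, 8, 24])
b = [0, 5, 10, 15, 20, 25, 30, 35, 40, 45, 50, 55]
bins = pd.cut(d[0], b)

# observed=False keeps empty bins in the result on newer pandas versions
# (assumption: remove this argument on 0.18.1, where it is not available).
g = d.groupby(bins, observed=False)[0]

counts = g.size()    # number of values in each bin, 0 for empty bins
median = g.median()  # may hold arbitrary numbers for empty bins on 0.18.1

# Keep the median only where the bin has at least one value; NaN elsewhere.
safe_median = median.where(counts > 0)
print(safe_median)
```

With this mask, the empty bins report NaN in the same way the mean output does, regardless of what `median()` returned for them.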

output of pd.show_versions()

pandas: 0.18.1