Variance is not calculated correctly in some cases + inconsistent definition · Issue #10242 · pandas-dev/pandas

The `var` method acts inconsistently when called on a Series versus on the NumPy array returned by `.values`:

```python
dat = pd.DataFrame({'x': [1, 2, 3, 4, 0, 0, 0]})
print(dat['x'].values.var())
print(dat['x'].var())
```

Prints:

```
2.24489795918
2.61904761905
```

This is because NumPy uses `ddof=0` by default (the biased, population variance), whereas pandas uses `ddof=1` by default (the unbiased, sample variance). The two should be consistent, so either pandas should adapt or NumPy should (arguably NumPy, since the unbiased estimator is usually the better default).
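To make the two APIs agree today, callers can pass `ddof` explicitly to either side; a minimal sketch using the data from the example above:

```python
import numpy as np
import pandas as pd

x = pd.Series([1, 2, 3, 4, 0, 0, 0])

# NumPy's ndarray.var defaults to ddof=0 (divide by n);
# pandas' Series.var defaults to ddof=1 (divide by n - 1).
print(x.values.var())        # ~2.2449  (biased / population)
print(x.var())               # ~2.6190  (unbiased / sample)

# Passing ddof explicitly makes the two agree in either direction.
print(x.values.var(ddof=1))  # ~2.6190
print(x.var(ddof=0))         # ~2.2449
```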

The other problem is that pandas does not compute the variance of the following DataFrame correctly. The definitional inconsistency aside, the variance should clearly not be zero:

```python
dat = pd.DataFrame({'x': [
    9.0692124699, 9.0692124692, 9.0692124702, 9.0692124686,
    9.0692124687, 9.0692124707, 9.0692124679, 9.0692124685,
    9.0692124698, 9.0692124719, 9.0692124698, 9.0692124692,
    9.0692124689, 9.0692124673, 9.0692124707, 9.0692124714,
    9.0692124714, 9.0692124734, 9.0692124719, 9.0692124710,
    9.0692124694, 9.0692124705, 9.0692124713, 9.0692124717,
]})
print(dat['x'].values.var())
print(dat['x'].var())
```

Prints:

```
2.06817742558e-18
0.0
```
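The zero result is the signature of catastrophic cancellation: the data cluster around ~9.069 with a spread of ~1e-9, so a one-pass "sum of squares" formula, E[x²] − (E[x])², subtracts two numbers near 82.25 whose difference is far below the rounding error at that magnitude. I have not verified that pandas 0.16.1 uses exactly that formula internally, but the sketch below shows why such a formula collapses on this data while a two-pass formula (subtract the mean first, then square) does not:

```python
import numpy as np

# The 24 values from the report: clustered around ~9.069, spread ~1e-9.
x = np.array([
    9.0692124699, 9.0692124692, 9.0692124702, 9.0692124686,
    9.0692124687, 9.0692124707, 9.0692124679, 9.0692124685,
    9.0692124698, 9.0692124719, 9.0692124698, 9.0692124692,
    9.0692124689, 9.0692124673, 9.0692124707, 9.0692124714,
    9.0692124714, 9.0692124734, 9.0692124719, 9.0692124710,
    9.0692124694, 9.0692124705, 9.0692124713, 9.0692124717,
])

# Two-pass formula: subtract the mean, then average the squared deviations.
# Numerically stable; matches the NumPy result quoted in the report.
two_pass = np.mean((x - x.mean()) ** 2)

# One-pass formula: E[x^2] - E[x]^2. Both terms are ~82.25, so the
# subtraction cancels nearly all significant digits; the true variance
# (~2e-18) is far below the ~1e-14 rounding granularity at that scale.
one_pass = np.mean(x ** 2) - np.mean(x) ** 2

print(two_pass)  # ~2e-18, close to the true population variance
print(one_pass)  # dominated by rounding error: 0.0 or wrong by orders of magnitude
```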

Here is the system information:

```
INSTALLED VERSIONS:
commit: None
python: 3.4.0.final.0
python-bits: 64
OS: Linux
OS-release: 3.13.0-52-generic
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_CA.UTF-8

pandas: 0.16.1
numpy: 1.9.2
```