BUG: np.mean(pd.Series) != np.mean(pd.Series.values)


Code Sample, a copy-pastable example

import pandas as pd
import numpy as np

a = pd.Series(np.random.normal(scale=0.1, size=(1_000_000,)).astype(np.float32)).pow(2)

assert isinstance(np.mean(a), float)
assert isinstance(np.mean(a.values), np.float32)
assert abs(1 - np.mean(a)/np.mean(a.values)) > 4e-4

Problem description

  1. pd.DataFrame.mean, pd.Series.mean and np.mean(pd.Series) return a Python float instead of a numpy float. Since np.mean(pd.Series.values) does return a numpy float, I'm assuming for now that this should be fixed in pandas.
  2. If dtype == np.float32, calling mean on a pandas object gives a significantly different result from calling mean on the underlying numpy ndarray.
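The two points above can be checked side by side with a small helper. This is a minimal sketch, not part of pandas or numpy: the function name `compare_means` and the float64 reference value are assumptions introduced here for illustration. It reports the return type of each call and its relative deviation from a float64 mean of the same data.

```python
import numpy as np
import pandas as pd


def compare_means(series):
    """Hypothetical helper: compare Series.mean() with ndarray.mean()."""
    values = series.to_numpy()
    # Reference value: accumulate in float64 regardless of the input dtype.
    reference = float(values.mean(dtype=np.float64))
    pandas_mean = series.mean()
    numpy_mean = values.mean()
    return {
        "pandas_type": type(pandas_mean).__name__,
        "numpy_type": type(numpy_mean).__name__,
        "pandas_rel_err": abs(1 - float(pandas_mean) / reference),
        "numpy_rel_err": abs(1 - float(numpy_mean) / reference),
    }


rng = np.random.default_rng(0)
a = pd.Series(rng.normal(scale=0.1, size=1_000_000).astype(np.float32)).pow(2)
report = compare_means(a)
for key, value in report.items():
    print(f"{key}: {value}")
```

The reported types and deviations depend on the installed pandas and numpy versions, so no expected output is shown here.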

Expected Output

The output of np.mean(a) should be the same as np.mean(a.values).
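Until the two calls agree, one way to get the numpy behaviour is to reduce over the underlying array explicitly. The following is a minimal sketch: the helper name `ndarray_mean` is hypothetical, while `Series.to_numpy()` and the `dtype` argument of `ndarray.mean` are existing APIs.

```python
import numpy as np
import pandas as pd


def ndarray_mean(series, dtype=None):
    """Hypothetical helper: mean computed by numpy on the underlying array.

    With dtype=None the result keeps the array's dtype (np.float32 for a
    float32 Series); pass dtype=np.float64 to accumulate in double precision.
    """
    return series.to_numpy().mean(dtype=dtype)


s = pd.Series(np.array([0.25, 0.5, 0.75], dtype=np.float32))
m32 = ndarray_mean(s)
m64 = ndarray_mean(s, dtype=np.float64)
print(type(m32).__name__, type(m64).__name__)
```

Unlike Series.mean, this does not skip NaN values, so it is only a drop-in replacement for data without missing entries.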

additional tests

# both b and c ~1e-2
b = a.mean()  # the pandas impl of mean
assert isinstance(b, float)  # PYTHON float, not numpy float? Ergo implicit f64

h = np.mean(a)
assert isinstance(h, float)
assert h == b

c = a.values.mean()  # the numpy impl of mean
assert isinstance(c, np.float32)  # as expected

print('\nerrors between pandas mean and numpy mean')
print(f'relative error: {abs(1-b/c):.3e}')  # ~ 5e-4
print(f'absolute error: {abs(b -c):.3e}')  # ~ 5e-6

print(f'relative error after casting: {abs(1-np.float32(b)/c):.3e}')  # ~ 5e-4
print(f'absolute error after casting: {abs(np.float32(b) -c):.3e}')  # ~ 5e-6

d = a.sum() / len(a)
assert isinstance(d, np.float64)  # expected, because division. Note sum returns an np.float32

e = a.values.sum() / len(a)
assert isinstance(e, np.float64)  # expected, because division

# these methods are equivalent
assert d == e

# and up to f32 precision equal to the numpy impl
assert d.astype(np.float32) == c

# the cherry on the cake
f = a.astype(np.float64).mean()
assert isinstance(f, float)  # still not ideal, should be np.float64

g = a.astype(np.float64).values.mean()
print('\nrelative error between pandas f64 mean and numpy f64 mean')
print(f'relative error numpy f64/pandas f64: {abs(1-g/f):.3e}')  # ~ 1e-14 -- 1e-16, not bad but I would have expected equality

print('\nerrors between pandas f64 mean and numpy/pandas f32 mean')
print(f'relative error pandas f32/pandas f64: {abs(1-b/f):.3e}')  # ~ 5e-4
# note: the label below says "absolute", but the expression is a relative error
print(f'absolute error numpy f32/pandas f64: {abs(1-c/f):.3e}')  # ~ 1e-7 -- 1e-9

# finally...
h = np.mean(a)
assert isinstance(h, float)
assert h == b

output

errors between pandas mean and numpy mean
relative error: 5.210e-04
absolute error: 5.204e-06
relative error after casting: 5.210e-04
absolute error after casting: 5.204e-06

relative error between pandas f64 mean and numpy f64 mean
relative error numpy f64/pandas f64: 1.066e-14

errors between pandas f64 mean and numpy/pandas f32 mean
relative error pandas f32/pandas f64: 5.214e-04
absolute error numpy f32/pandas f64: 2.399e-07
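This report does not identify the cause of the ~5e-4 discrepancy. One possible explanation, offered here only as an assumption, is that the two code paths accumulate the float32 sum differently. The sketch below uses numpy alone to show how much the accumulation strategy can matter for this kind of data: `np.cumsum` adds the elements one by one in float32, while `ndarray.sum` uses pairwise summation where it can. Whether pandas 1.3.1 actually takes a sequential float32 path is not established by this issue.

```python
import numpy as np

rng = np.random.default_rng(0)
x = (rng.normal(scale=0.1, size=1_000_000).astype(np.float32)) ** 2

# Reference: accumulate in float64.
reference = x.sum(dtype=np.float64) / x.size

# Sequential float32 accumulation: the last running total of cumsum.
sequential = np.cumsum(x)[-1] / np.float32(x.size)

# numpy's default float32 reduction (pairwise summation where applicable).
pairwise = x.sum() / np.float32(x.size)

sequential_err = abs(1 - float(sequential) / float(reference))
pairwise_err = abs(1 - float(pairwise) / float(reference))
print(f"sequential float32 relative error: {sequential_err:.3e}")
print(f"pairwise float32 relative error:   {pairwise_err:.3e}")
```

If the sequential error lands in the same range as the pandas error above, that would be consistent with this assumption, but it would not prove it.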

Output of pd.show_versions()


INSTALLED VERSIONS

commit : c7f7443
python : 3.8.3.final.0
python-bits : 64
OS : Linux
OS-release : 5.4.0-80-generic
Version : #90-Ubuntu SMP Fri Jul 9 22:49:44 UTC 2021
machine : x86_64
processor : x86_64
byteorder : little
LC_ALL : None
LANG : en_US.UTF-8
LOCALE : en_US.UTF-8

pandas : 1.3.1
numpy : 1.21.1
pytz : 2021.1
dateutil : 2.8.1
pip : 21.1.1
setuptools : 52.0.0.post20210125
Cython : 0.29.23
pytest : 6.2.3
hypothesis : None
sphinx : 4.0.1
blosc : None
feather : None
xlsxwriter : 1.3.8
lxml.etree : 4.6.3
html5lib : 1.1
pymysql : None
psycopg2 : 2.8.6 (dt dec pq3 ext lo64)
jinja2 : 3.0.0
IPython : 7.22.0
pandas_datareader: None
bs4 : 4.9.3
bottleneck : 1.3.2
fsspec : 0.9.0
fastparquet : None
gcsfs : None
matplotlib : 3.3.4
numexpr : 2.7.3
odfpy : None
openpyxl : 3.0.7
pandas_gbq : None
pyarrow : None
pyxlsb : None
s3fs : None
scipy : 1.6.2
sqlalchemy : 1.4.15
tables : 3.6.1
tabulate : None
xarray : None
xlrd : 2.0.1
xlwt : 1.3.0
numba : 0.51.2