Pandas crosstab margins double counting if values specifies a different field than rows/cols (original) (raw)

See http://stackoverflow.com/questions/17236852/pandas-crosstab-double-counting-when-using-two-aggregate-functions for discussion.

To reproduce:
Create a test dataframe:

df = DataFrame({'A': ['one', 'one', 'two', 'three'] * 6, 'B': ['A', 'B', 'C'] * 8, 'C' : ['foo', 'foo', 'foo', 'bar', 'bar', 'bar'] * 4, 'D': np.random.randn(24), 'E': np.random.randn(24)})

Crosstab gives the expected results:

crosstab(rows=[df['A'],df['B']], cols=[df['C']], margins=True) C bar foo All A B
one A 2 2 4 B 2 2 4 C 2 2 4 three A 2 0 2 B 0 2 2 C 2 0 2 two A 0 2 2 B 2 0 2 C 0 2 2 All 12 12 24

However, if you try to get mean and count in the same crosstab, specifying values for the mean, the crosstab will double count the elements when calculating the margin total:

crosstab(rows=[df['A'],df['B']], cols=[df['C']], margins=True, aggfunc=[np.size, np.mean], values=df['D'])

     size                mean                    

C bar foo All bar foo All A B
one A 2 2 4 0.245998 0.076366 0.161182 B 2 2 4 -0.739757 0.137780 -0.300988 C 2 2 4 -1.555759 -1.446554 -1.501157 three A 2 NaN 2 1.216109 NaN 1.216109 B NaN 2 2 NaN 0.255482 0.255482 C 2 NaN 2 0.732448 NaN 0.732448 two A NaN 2 2 NaN -0.273747 -0.273747 B 2 NaN 2 -0.001649 NaN -0.001649 C NaN 2 2 NaN 0.685422 0.685422 All 24 24 24 -0.017102 -0.094208 -0.055655