Issue 24068: statistics module - incorrect results with boolean input

The mean function in the statistics module gives nonsensical results when the input contains boolean values, e.g.:

>>> mean([True, True, False, False])
0.25

>>> mean([True, 1027])
0.5

This is an issue with the module's internal _sum function that mean relies on. Other functions relying on _sum are affected more subtly, e.g.:

>>> variance([1, 1027, 0])
351234.3333333333

>>> variance([True, 1027, 0])
351234.3333333334

The problem with _sum is that it tries to coerce its result to any non-int type found in the input (bool in these examples), but bool(1028) is just True, so information gets lost.
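
To make the information loss concrete, here is a minimal sketch of the coercion described above. It is not the module's actual _sum; coerced_sum and naive_mean are hypothetical names, and the rule "cast to the first non-int type seen" is a simplifying assumption that is enough to reproduce the two mean results reported in this issue.

```python
def coerced_sum(data):
    """Hypothetical stand-in for the described coercion (not the module's code)."""
    total = 0
    result_type = int
    for x in data:
        # Assumption: remember the first non-int type seen in the input.
        if result_type is int and type(x) is not int:
            result_type = type(x)
        total += x  # int + bool yields a plain int, so the total itself is exact
    # Casting the exact total to bool collapses any nonzero value to True.
    return result_type(total)


def naive_mean(data):
    """Divide the coerced sum by the number of data points."""
    return coerced_sum(data) / len(data)


print(coerced_sum([True, 1027]))               # True (the exact total was 1028)
print(naive_mean([True, 1027]))                # 0.5
print(naive_mean([True, True, False, False]))  # 0.25
```

The total is computed correctly in both cases; the damage happens only in the final cast, because True behaves as 1 in the division that follows.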

I've attached a patch preventing the type cast when it would be to bool. I don't have time to write a separate test though so if somebody wants to take over .. :)
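
The attached patch is not reproduced here. As a rough sketch of the idea only, the following treats bool like int when choosing the result type, so the sum is never cast to bool. The function name sum_without_bool_cast and the "first non-int type wins" rule are assumptions made for illustration, not the module's code.

```python
def sum_without_bool_cast(data):
    """Hypothetical sketch: pick a result type, but never pick bool."""
    total = 0
    result_type = int
    for x in data:
        t = type(x)
        # Assumption: bool is treated like int and cannot become the result type.
        if result_type is int and t is not int and t is not bool:
            result_type = t
        total += x
    return result_type(total)


print(sum_without_bool_cast([True, 1027]))                   # 1028
print(sum_without_bool_cast([True, 1027]) / 2)               # 514.0
print(sum_without_bool_cast([True, True, False, False]) / 4) # 0.5
```

Assertions along these lines could also serve as a starting point for the missing test.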

I wonder if it would be better to reject bool data in this context?

It's not uncommon (and quite useful) in the NumPy world to compute basic statistics on arrays of boolean dtype: the sum of such an array gives a count of the Trues, and the mean gives the proportion of True entries. I think it would be handy to allow the statistics module to work with lists of bools, if possible.
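
For illustration, the same two quantities can be computed on a plain list with the built-in sum, without NumPy or the statistics module. The flags list is made-up example data.

```python
# Made-up example data: one boolean flag per observation.
flags = [True, True, False, True]

# bool is a subclass of int, so summing counts the True entries.
count_true = sum(flags)

# Dividing the count by the length gives the proportion of True entries.
proportion_true = count_true / len(flags)

print(count_true)       # 3
print(proportion_true)  # 0.75
```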