BUG: DataFrame.sample weights not required to sum to less than 1

Pandas version checks

Reproducible Example

import pandas as pd

data = {'w': [100, 1, 1]}
df = pd.DataFrame(data)

df.sample(n=2, weights=df.w, replace=False)

Issue Description

For probability-proportional-to-size (PPS) sampling without replacement to be feasible, every unit's selection probability must be less than 1, i.e.

$\frac{n \cdot w_i}{\sum_j w_j} < 1$

where $w_i$ is the weight of unit $i$ and $n$ is the total number of units to be sampled. This condition often fails when you select a sizable proportion of all units and unit sizes vary widely. For example, suppose you want to select 2 units with PPS without replacement from a sampling frame of 3 units with sizes 100, 1, and 1. Here $n \cdot w_1 / \sum_j w_j = 200/102 \approx 1.96$, which exceeds 1. There is no way to make the selection probability of the first unit 100x that of each of the other two units: the first unit's probability is at most 1, and because the three selection probabilities must sum to n = 2, at least one of the other units must have probability >= 0.5.
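The condition can be checked directly for the example. This is a minimal sketch, assuming NumPy is available; the helper name inclusion_probabilities is hypothetical and is not part of pandas:

```python
import numpy as np


def inclusion_probabilities(weights, n):
    """Return the target PPS selection probabilities n * w_i / sum(w)."""
    w = np.asarray(weights, dtype=float)
    return n * w / w.sum()


# Sampling frame from the example: sizes 100, 1, 1; select n = 2 units.
probs = inclusion_probabilities([100, 1, 1], n=2)

# Units whose target probability exceeds 1 make PPS without replacement infeasible.
infeasible = probs > 1

print(probs)
print(infeasible.any())
```

Only the first unit violates the condition, which is enough to make the requested design impossible.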

Unfortunately, pandas' DataFrame.sample method does not raise an error in this case.

Expected Behavior

The code above should raise an error such as "Some unit probabilities are larger than 1 and thus PPS sampling without replacement cannot be performed".
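One way the requested behavior could look from user code is sketched below. The wrapper sample_pps is hypothetical (it is not a pandas function, and this is not how pandas implements or would implement the check); it validates the weights first and only then delegates to DataFrame.sample:

```python
import numpy as np
import pandas as pd


def sample_pps(df, n, weights):
    """Hypothetical wrapper: check PPS feasibility, then call DataFrame.sample."""
    w = np.asarray(weights, dtype=float)
    probs = n * w / w.sum()  # target selection probability of each unit
    if (probs > 1).any():
        raise ValueError(
            "Some unit probabilities are larger than 1 and thus "
            "PPS sampling without replacement cannot be performed"
        )
    return df.sample(n=n, weights=weights, replace=False)


df = pd.DataFrame({"w": [100, 1, 1]})

# The example from this report is rejected before any sampling happens.
try:
    sample_pps(df, n=2, weights=df["w"])
    raised = False
except ValueError:
    raised = True

# A feasible frame (equal sizes) still samples normally.
df_ok = pd.DataFrame({"w": [1, 1, 1]})
out = sample_pps(df_ok, n=2, weights=df_ok["w"])
```

The check uses ValueError as an assumption, since that is what pandas generally raises for invalid arguments.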

Installed Versions

Replace this line with the output of pd.show_versions()