set_index on categorical fails with empty partitions

Reproducible example

import numpy as np
import pandas as pd
import dask.dataframe as dd

pdf = pd.DataFrame({
    "cat": pd.Categorical(np.repeat(list("ABC"), 20), ordered=True),
    "value": np.random.rand(60),
})
ddf = dd.from_pandas(pdf, npartitions=3)

Filter on category A; partitions 2 and 3 will then be empty.

ddf = ddf.loc[ddf["cat"] == "A"]

ddf.set_index("cat")

ValueError: zero-size array to reduction operation maximum which has no identity
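
The message itself is NumPy's standard error for taking a maximum over an empty array. A minimal sketch of that behaviour follows. It assumes, without confirmation from the traceback, that an empty partition ends up in such a reduction during set_index; the `empty` array is only a stand-in for the values of an empty partition.

```python
import numpy as np

# Stand-in for the values of an empty partition (an assumption for
# illustration, not taken from the dask internals).
empty = np.array([], dtype="int8")

try:
    empty.max()  # a maximum over zero elements has no identity value
except ValueError as err:
    message = str(err)
    print(message)
```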

Sounds similar to #2820

For now, the cull_empty_partitions workaround from Stack Overflow can be used.