API: Expand read_csv dtype for categoricals · Issue #14503 · pandas-dev/pandas

In #13406 Chris added support for read_csv(..., dtype={'col': 'category'}) (thanks!). This issue is for expanding that syntax to allow a more complete specification of the resulting categorical.


df = pd.read_csv(path, dtype={'col': pd.Categorical(['a', 'b', 'c'], ordered=True)})
df = pd.read_csv(path, dtype={'col': ['a', 'b', 'c']})  # shorthand, but unordered only

We would still accept dtype={'col': 'category'} as well, to infer the categories from the data.

Implementation-wise, I think we can keep all the parsing logic as is, and simply loop over dtype and call set_categories (and maybe as_ordered) on all the categoricals just before returning to the user.
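
A rough sketch of that post-processing step, written as a wrapper around the existing read_csv rather than as a change to the parser. The names read_csv_with_categories and apply_categorical_specs are hypothetical and only for illustration. The sketch assumes a spec is either a pd.Categorical or a plain list/tuple of categories; anything else is passed through to read_csv untouched.

```python
import io

import pandas as pd


def apply_categorical_specs(df, dtype):
    """Hypothetical helper: fix up categories on columns already parsed as category."""
    for col, spec in dtype.items():
        if isinstance(spec, pd.Categorical):
            categories, ordered = spec.categories, spec.ordered
        elif isinstance(spec, (list, tuple)):
            # Shorthand form: explicit categories, always unordered.
            categories, ordered = list(spec), False
        else:
            # 'category' (inferred) or a non-categorical dtype: nothing to do.
            continue
        values = df[col].cat.set_categories(categories)
        df[col] = values.cat.as_ordered() if ordered else values
    return df


def read_csv_with_categories(source, dtype):
    """Hypothetical wrapper: parse as today, then apply the requested categories."""
    # Parsing logic stays as is: every categorical spec is read as plain 'category'.
    parse_dtype = {
        col: "category" if isinstance(spec, (pd.Categorical, list, tuple)) else spec
        for col, spec in dtype.items()
    }
    df = pd.read_csv(source, dtype=parse_dtype)
    return apply_categorical_specs(df, dtype)


# Usage: 'b' never appears in the data but is still a category of the result.
spec = {"col": pd.Categorical(["a", "b", "c"], ordered=True)}
df = read_csv_with_categories(io.StringIO("col,x\na,1\nc,2\n"), spec)
print(df["col"].cat.categories, df["col"].cat.ordered)
```

Note that set_categories turns any value outside the requested categories into NaN, so a real implementation would need to decide whether that should be silent or raise.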

This would help a bit in dask, where their category type inference can fail if the first partition doesn't contain all the categories (see dask/dask#1705). This is why it'd be preferable to do it as an option to read_csv, rather than putting it on the user to follow up with a set_categories.
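
To illustrate the inference problem without dask, here is a small sketch that uses pandas' own chunksize option as a stand-in for partitions (an assumption for illustration; dask's actual inference code is not shown). Each chunk only sees its own rows, so the inferred categories differ until the user aligns them by hand.

```python
import io

import pandas as pd

csv_text = "col\na\nb\nc\n"

# Each chunk infers its categories only from the rows it contains.
chunks = list(
    pd.read_csv(io.StringIO(csv_text), dtype={"col": "category"}, chunksize=2)
)
inferred = [list(chunk["col"].cat.categories) for chunk in chunks]
print(inferred)  # the first chunk never sees 'c'

# The follow-up step that currently falls on the user.
full = ["a", "b", "c"]
aligned = [chunk["col"].cat.set_categories(full) for chunk in chunks]
print([list(s.cat.categories) for s in aligned])
```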