API: Expand read_csv dtype for categoricals · Issue #14503 · pandas-dev/pandas
In #13406 Chris added support for read_csv(..., dtype={'col': 'category'}) (thanks!). This issue is for expanding that syntax to allow a more complete specification of the resulting categorical.
df = pd.read_csv(path, dtype={'col': pd.Categorical(['a', 'b', 'c'], ordered=True)})
df = pd.read_csv(path, dtype={'col': ['a', 'b', 'c']})  # shorthand, but unordered only
We would still accept dtype={'col': 'category'} as well, to infer categories.
Implementation-wise, I think we can keep all the parsing logic as is, and simply loop over dtype and call set_categories (and maybe as_ordered) on all the categoricals just before returning to the user.
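A rough sketch of that post-processing step, written as a user-level wrapper rather than as a change to the parser. The helper name read_csv_with_categories and its categories/ordered arguments are made up for illustration and are not part of pandas; only read_csv, set_categories and as_ordered are existing calls.

```python
import io

import pandas as pd


def read_csv_with_categories(source, categories, ordered=False, **kwargs):
    """Hypothetical helper: parse as usual, then fix up the categoricals.

    `categories` maps column name -> full list of categories.
    """
    # Let the existing parsing logic infer categories from the data.
    dtype = {col: "category" for col in categories}
    df = pd.read_csv(source, dtype=dtype, **kwargs)

    # Loop over the requested columns just before returning to the user.
    for col, cats in categories.items():
        values = df[col].cat.set_categories(cats)
        if ordered:
            values = values.cat.as_ordered()
        df[col] = values
    return df


# The data only contains 'a' and 'b', but 'c' is declared up front.
csv_text = "col,x\na,1\nb,2\na,3\n"
df = read_csv_with_categories(
    io.StringIO(csv_text), {"col": ["a", "b", "c"]}, ordered=True
)
```

Values in the file that are not in the supplied list would become NaN after set_categories, so a real implementation would need to decide whether that should raise instead.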
This would help a bit in dask, where their category type inference can fail if the first partition doesn't contain all the categories (see dask/dask#1705). This is why it'd be preferable to do it as an option to read_csv, rather than putting it on the user to follow up with a set_categories.
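To make the partition problem concrete, here is a small illustration in plain pandas. The two in-memory CSV fragments stand in for partitions; this is not dask's actual inference code, just the same mismatch reproduced by hand along with the set_categories follow-up that users currently have to do themselves.

```python
import io

import pandas as pd

# Two "partitions" of the same column; neither sees every category.
part1 = pd.read_csv(io.StringIO("col\na\nb\n"), dtype={"col": "category"})
part2 = pd.read_csv(io.StringIO("col\nb\nc\n"), dtype={"col": "category"})

# Each partition infers its categories from its own data only.
cats1 = list(part1["col"].cat.categories)
cats2 = list(part2["col"].cat.categories)

# Manual follow-up: force both partitions onto the full category list.
full = ["a", "b", "c"]
fixed = [p["col"].cat.set_categories(full) for p in (part1, part2)]

# With identical categories, concatenation keeps the categorical dtype.
combined = pd.concat(fixed, ignore_index=True)
```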