Make categories and ordered part of CategoricalDtype · Issue #14711 · pandas-dev/pandas

This is to discuss pushing the Categorical.categories and
Categorical.ordered information into the extension type CategoricalDtype.

pd.CategoricalDtype(categories, ordered=False)

Note that there is no values argument. This is a type constructor that
isn't attached to any specific Categorical instance.

Why?

In several places now (read_csv(..., dtype=...), .astype(...), Series([], dtype=...))
we accept dtype='category'. Each of these takes the values available
to the method (the series, the column from the CSV, etc.)
and hands them off to the value constructor, with no control over the
categories and ordered arguments:

Categorical(values, categories=None, ordered=False)

The proposal here would add the categories and ordered
attributes / arguments to CategoricalDtype and provide a common API
for specifying non-default parameters for the Categorical constructor
in methods like read_csv, astype, etc.

t = pd.CategoricalDtype(['low', 'med', 'high'], ordered=True)
pd.read_csv('foo.csv', dtype={'A': int, 'B': t})
pd.Series(['high', 'low', 'high'], dtype=t)

s = pd.Series(['high', 'low', 'high'])
s.astype(t)

We would continue to accept dtype='category'.

This becomes even more important when doing operations on larger-than-memory datasets with something like dask or even read_csv(..., chunksize=N). Right now you don't have an easy way to specify the categories or ordered for columns (assuming you know them ahead of time).
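To make the chunked case concrete, here is a minimal sketch of the manual workaround that the proposal would replace: each chunk is converted by hand with the existing value constructor, Categorical(values, categories=..., ordered=...). The inline CSV text, the column names and the `levels` list are made up for illustration.

```python
import io

import pandas as pd

# Illustrative data; in practice this would be a file too large to load at once.
csv_text = "A,B\n1,high\n2,low\n3,high\n4,med\n"

# Categories known ahead of time (assumed for this example).
levels = ["low", "med", "high"]

chunks = []
for chunk in pd.read_csv(io.StringIO(csv_text), chunksize=2):
    # Without a parametrized dtype, every chunk has to be converted by hand
    # so that all chunks end up with the same categories and ordering.
    chunk["B"] = pd.Categorical(chunk["B"], categories=levels, ordered=True)
    chunks.append(chunk)

result = pd.concat(chunks, ignore_index=True)
print(result["B"].cat.categories)
print(result["B"].cat.ordered)
```

With dtype='category' alone, each chunk would infer its categories from only the values it happens to contain, which is why the per-chunk conversion is needed here.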

Issues

CategoricalDtype currently isn't part of the public API. Which methods
on it do we make public?
Equality semantics: For backwards compat, I think all instances
of CategoricalDtype should compare equal with all others. You can use
identity to check whether two types are the same:

t1 = pd.CategoricalDtype(['a', 'b'], ordered=True)
t2 = pd.CategoricalDtype(['a', 'b'], ordered=False)

t1 == t2  # True
t1 is t2  # False
t1 is t1  # True
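The equality rule floated here can be sketched with a toy class. `ProposedDtype` is a hypothetical stand-in written only for this illustration. It is not the pandas class and says nothing about how pandas would implement it.

```python
class ProposedDtype:
    """Toy stand-in for the proposed dtype (hypothetical, not pandas code)."""

    def __init__(self, categories=None, ordered=False):
        self.categories = None if categories is None else tuple(categories)
        self.ordered = ordered

    def __eq__(self, other):
        # Proposed rule: every instance compares equal to every other
        # instance, regardless of categories or ordered.
        return isinstance(other, ProposedDtype)

    def __hash__(self):
        # Objects that compare equal must hash equal, so the hash
        # cannot depend on the parameters either.
        return hash(ProposedDtype)


t1 = ProposedDtype(["a", "b"], ordered=True)
t2 = ProposedDtype(["a", "b"], ordered=False)

print(t1 == t2)  # True
print(t1 is t2)  # False
print(t1 is t1)  # True
```

One consequence of this rule is that such objects collapse to a single key in a dict or set, so any code that wants to tell parametrizations apart has to compare the attributes or use identity.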

Should the categories argument be required? Currently dtype='category'
means 1.) infer the categories from the values, and 2.) treat them as unordered.
Would CategoricalDtype(None, ordered=False) be allowed?
Strictness? If I say

pd.Series(['a', 'b', 'c'], dtype=pd.CategoricalDtype(['a', 'b']))

What happens? I would probably expect a TypeError or ValueError, as 'c'
isn't "supposed" to be there. Or do we replace 'c' with NA? Should
strict be another parameter to CategoricalDtype? (I don't think so.)
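The two candidate behaviours for the strictness question can be compared side by side. The lenient branch uses the existing value constructor, which already maps values outside the given categories to missing. The strict branch is a hypothetical helper, `strict_categorical`, written only for this sketch. It is not part of pandas or of the proposal.

```python
import pandas as pd

values = ["a", "b", "c"]
categories = ["a", "b"]

# Lenient: the existing value constructor turns out-of-category values
# into missing values, which show up as code -1.
lenient = pd.Categorical(values, categories=categories)
print(list(lenient.codes))


def strict_categorical(values, categories):
    """Hypothetical strict variant: refuse values outside the categories."""
    unexpected = [v for v in values if not pd.isna(v) and v not in categories]
    if unexpected:
        raise ValueError(f"values not in categories: {unexpected}")
    return pd.Categorical(values, categories=categories)


try:
    strict_categorical(values, categories)
except ValueError as err:
    print(err)
```

Raising keeps typos in the data from silently becoming NA. Replacing with NA matches what the value constructor already does, which is an argument for not adding a separate strict parameter.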

I'm willing to work on this over the next couple of weeks.

xref #14676 (astype)
xref #14503 (read_csv)