Make categories and ordered part of CategoricalDtype · Issue #14711 · pandas-dev/pandas
This is to discuss pushing the Categorical.categories and Categorical.ordered information into the extension type CategoricalDtype.
pd.CategoricalDtype(categories, ordered=False)
Note that there is no values argument. This is a type constructor that isn't attached to any specific Categorical instance.
Why?
Several times now (read_csv(..., dtype=...), .astype(...), Series([], dtype=...)) we have places where we accept dtype='category', which takes the values in the method (the series, or column from the CSV, etc.) and hands them off to the value constructor, with no control over the categories and ordered arguments.
Categorical(values, categories=None, ordered=False)
The proposal here would add the categories and ordered attributes / arguments to CategoricalDtype and provide a common API for specifying non-default parameters for the Categorical constructor in methods like read_csv, astype, etc.
t = pd.CategoricalDtype(['low', 'med', 'high'], ordered=True)
pd.read_csv('foo.csv', dtype={'A': int, 'B': t})
pd.Series(['high', 'low', 'high'], dtype=t)

s = pd.Series(['high', 'low', 'high'])
s.astype(t)
We would continue to accept dtype='category'.
This becomes even more important when doing operations on larger-than-memory datasets with something like dask or even read_csv(..., chunksize=N). Right now you don't have an easy way to specify the categories or ordered for columns (assuming you know them ahead of time).
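To make the chunking problem concrete, here is a minimal pure-Python sketch. It does not use pandas; `encode_chunk` is a hypothetical helper that mimics "infer the categories from the values" by taking the sorted unique values of each chunk, which is an assumption of this sketch rather than a statement about pandas internals.

```python
def encode_chunk(values, categories=None):
    """Return (categories, codes) for one chunk of values.

    Hypothetical helper: when no categories are given, they are
    inferred from the values of this chunk only.
    """
    if categories is None:
        categories = sorted(set(values))
    index = {cat: i for i, cat in enumerate(categories)}
    codes = [index[v] for v in values]
    return list(categories), codes


chunks = [['low', 'high', 'low'], ['med', 'high']]

# Inferred per chunk: each chunk ends up with its own category list,
# so the same integer code can stand for different labels.
inferred = [encode_chunk(chunk) for chunk in chunks]

# Declared ahead of time: every chunk shares one label-to-code mapping.
known = ['low', 'med', 'high']
declared = [encode_chunk(chunk, known) for chunk in chunks]
```

With inferred categories the two chunks disagree on what the categories are, so their results cannot be combined without re-encoding. With the categories declared up front, every chunk is directly compatible.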
Issues
- CategoricalDtype currently isn't part of the public API. Which methods on it do we make public?
- Equality semantics: For backwards compat, I think all instances of CategoricalDtype should compare equal with all others. You can use identity to check if two types are the same:
t1 = pd.CategoricalDtype(['a', 'b'], ordered=True)
t2 = pd.CategoricalDtype(['a', 'b'], ordered=False)
t1 == t2  # True
t1 is t2  # False
t1 is t1  # True
- Should the categories argument be required? Currently dtype='category' says 1.) infer the categories based on the values, and 2.) it's unordered. Would CategoricalDtype(None, ordered=False) be allowed?
- Strictness? If I say
pd.Series(['a', 'b', 'c'], dtype=pd.CategoricalDtype(['a', 'b']))
What happens? I would probably expect a TypeError or ValueError, as 'c' isn't "supposed" to be there. Or do we replace 'c' with NA? Should strict be another parameter to CategoricalDtype? (I don't think so.)
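The two strictness options can be compared side by side in a small pure-Python sketch. `encode_strict` and its `strict` flag are hypothetical names for illustration only, not a proposed pandas signature, and the use of -1 as the stand-in for NA is an assumption of this sketch.

```python
def encode_strict(values, categories, strict=False):
    """Map values to integer codes against a fixed list of categories.

    Hypothetical helper. With strict=True an undeclared value raises
    ValueError; otherwise it is encoded as -1, standing in for NA.
    """
    index = {cat: i for i, cat in enumerate(categories)}
    codes = []
    for v in values:
        if v in index:
            codes.append(index[v])
        elif strict:
            raise ValueError(
                f"{v!r} is not one of the declared categories {list(categories)!r}"
            )
        else:
            codes.append(-1)  # sentinel for a missing value in this sketch
    return codes


lenient = encode_strict(['a', 'b', 'c'], ['a', 'b'])

try:
    encode_strict(['a', 'b', 'c'], ['a', 'b'], strict=True)
    raised = False
except ValueError:
    raised = True
```

The lenient path silently turns 'c' into a missing value, while the strict path surfaces it as an error at construction time. Which of the two should be the default is exactly the open question above.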
I'm willing to work on this over the next couple weeks.