Discussion: feedback on the Categorical integration · Issue #8074 · pandas-dev/pandas
I finally had some time to read up on the discussions and to look at the implementation of the Categoricals integration, and I still have some questions and comments. I am sorry that this is rather late to the party, but I still think this is important to discuss (and I certainly don't want to disregard the really great work @JanSchulz and @jreback put into this! Thanks a lot for that!).
It is mainly about the public interface, and not about the internals.
Below I have summarized some remarks. To be clear, these are just some personal ideas to discuss!
1. The Categorical object vs the 'category' Series
Previously, there was already the Categorical class:
In [1]: pd.Categorical(["a","b","c","a"])
Out[1]:
a
b
c
a
Levels (3, object): [a < b < c]
Now, you can also put this in a Series:
In [2]: pd.Series(pd.Categorical(["a","b","c","a"]))
Out[2]:
0 a
1 b
2 c
3 a
dtype: category
Levels (3, object): [a < b < c]
To create this Series, you can either put an existing Categorical inside a Series (as above, or by assigning it to a column of a DataFrame, like df['cat'] = pd.Categorical(["a","b","c","a"])), or you can convert an existing Series to the 'category' dtype:
pd.Series(["a","b","c","a"]).astype('category')
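The construction routes described above can be compared side by side. This is only a minimal sketch, assuming a recent pandas release (the printed repr and some attribute names differ from the version discussed in this issue); the variable names are illustrative.

```python
import pandas as pd

values = ["a", "b", "c", "a"]

# Route 1: wrap an existing Categorical in a Series.
s_wrapped = pd.Series(pd.Categorical(values))

# Route 2: convert an ordinary Series to the 'category' dtype.
s_converted = pd.Series(values).astype("category")

# Route 3: assign a Categorical to a DataFrame column.
df = pd.DataFrame({"x": range(len(values))})
df["cat"] = pd.Categorical(values)

# All three end up as Series with the 'category' dtype.
print(s_wrapped.dtype, s_converted.dtype, df["cat"].dtype)
print(s_wrapped.equals(s_converted))
```

From the user's side the three results behave the same, which is the point behind the question of whether both objects need to appear in the user-facing API.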
So basically, there are now two different main objects to deal with categorical values (the Categorical object and the Series of 'category' dtype), and both are used and mixed in the docs.
This raises the question of whether this is needed (also touched on lightly here: ). Some remarks:
- The constructor pd.Series(pd.Categorical(...)) is a bit cumbersome, I think.
- In the documentation both Categorical and category Series are used. But what is the difference between the two (in user-facing interaction)? What is the advantage of one over the other, and in what circumstances? Why should I use a Categorical and not a category Series?
- Going further, are both possibilities needed (in user-facing API, docs, etc.)? Why not always just use a Series with category dtype?
I've also found the following in the discussion, by @JanSchulz (sorry if I misquote you guys :-)) (#8007 (comment)):
I've not found a usecase which would need to touch categoricals instead of Series(Categorical(...)))
and the response of @njsmith to that:
I don't really care what the data type for holding categorical data is, but I can certainly see the advantage of having just one data type. And if so then Series seems like a good choice for that.
So proposal: just use Series with 'category' dtype in all user-facing API/functions and documentation.
2. Naming issues (levels, labels, codes, categories, ..)
The concepts of a categorical:
- codes: numerical representation (previously called labels)
- levels: descriptive names
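To make the two concepts concrete, here is a small sketch of how the integer codes and the descriptive names relate. It assumes a recent pandas release, where the descriptive names are exposed through the categories attribute (in the version under discussion here the attribute was levels).

```python
import pandas as pd

cat = pd.Categorical(["a", "b", "c", "a"])

# Integer codes: one entry per value, pointing into the descriptive names.
codes = list(cat.codes)

# Descriptive names: the distinct values the categorical can hold.
names = list(cat.categories)

# The original values can be rebuilt by looking up each code.
rebuilt = list(cat.categories.take(cat.codes))

print(codes, names, rebuilt)
```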
For codes, I think this is indeed much better, as labels was very confusing (more logically, the labels would be the different values inside levels ..), and it has the advantage of having the same name as in R.
But the name levels is somewhat more problematic IMHO. level already has another, established meaning in pandas, namely the different levels of a hierarchical MultiIndex. In many methods, you have a level=.. keyword, and there are a lot of index methods to handle levels, like reorder_levels and droplevel.
In [14]: pd.MultiIndex.from_product([['a', 'b'],[1,2]], names=['A', 'B'])
Out[14]:
MultiIndex(levels=[[u'a', u'b'], [1, 2]],
labels=[[0, 0, 1, 1], [0, 1, 0, 1]],
names=[u'A', u'B'])
Here, there are two levels, a "first" and a "second" level, while in a categorical context the "first level" from above would itself consist of two levels, namely 'a' and 'b' ...
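The two meanings of "level" can be put next to each other in code. This is a sketch assuming a recent pandas release, where the categorical side is spelled categories; the variable names are illustrative.

```python
import pandas as pd

# MultiIndex meaning: each "level" is one layer of the hierarchical index.
mi = pd.MultiIndex.from_product([["a", "b"], [1, 2]], names=["A", "B"])
n_index_levels = mi.nlevels               # number of layers in the index
first_level_values = list(mi.levels[0])   # distinct values in the first layer

# Categorical meaning: the "levels" are the distinct values themselves.
cat = pd.Categorical(["a", "b", "a"])
n_categories = len(cat.categories)

print(n_index_levels, first_level_values, n_categories)
```

The index has two levels, while the first of those levels, read as a categorical, would also be said to have two levels. That is the ambiguity in question.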
Possible outcomes
What are the 'things' we now call 'levels'? They are the different classes, or the different categories, that are possible within the Categorical series. So maybe 'classes' or 'categories' would be an alternative name? In that regard I like the proposal of @immerrr (#7217 (comment)):
As for naming, I like the name "codes" for the numerical representation and would like to propose "categories" for the descriptive names.
@JanSchulz responded:
I wouldn't change the name levels to keep that aspect of R's factors.
A good argument, but I personally think the possible confusion between df.index.reorder_levels and df.cat.reorder_levels is important enough to reconsider this. Certainly if we were to have e.g. a CategoricalIndex in the future, reorder_levels would become totally dubious ...
So, at least I would go with categorical_levels to make the distinction (as mentioned by @JanSchulz here: #7217 (comment)), or go with another name like categories.
3. Return type of Series.values
At the moment, when you have a Series with 'category' dtype, Series.values will return the Categorical object, and not a numpy array:
- This does not seem very consistent with the other dtypes.
- The documentation of Series.values is also very clear on that: "returns Series as numpy.ndarray" (and this is also how it is printed in my head).
- What is a good reason to deviate from this rule? (certainly if you can have e.g. the s.cat attribute to return it)
Of course, if we wanted to return a numpy array, it would have to be decided what it should return (e.g. what is returned now from np.asarray()). You lose information with this (the levels), and I suppose this is the reason to return a Categorical? But I personally find the consistency more important here, certainly if you can do everything with the Series that you can do with the categorical (discussion above).
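The trade-off can be seen by comparing the two return values directly. This is a minimal sketch, assuming a recent pandas release together with NumPy; it only illustrates the behaviour described above and is not a proposal for the implementation.

```python
import numpy as np
import pandas as pd

s = pd.Series(["a", "b", "c", "a"], dtype="category")

# Series.values hands back the Categorical, so the categorical metadata survives.
vals = s.values
print(type(vals).__name__)

# np.asarray() materialises the plain values; the categorical metadata is gone.
arr = np.asarray(s)
print(type(arr).__name__, list(arr))
```

The array produced by np.asarray() carries only the values, so the information about which categories exist would have to be obtained separately from the Series.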
@JanSchulz @jreback @jseabold @njsmith @immerrr @cpcloud @hayd