Discussion: feedback on the Categorical integration · Issue #8074 · pandas-dev/pandas

I finally had some time to read up on the discussions and to look at the implementation of the Categoricals integration, and I still have some questions and comments. I am sorry to be rather late to the party, but I still think this is important to discuss (and I certainly don't want to disregard the really great work @JanSchulz and @jreback put into this! Thanks a lot for that!).
It is mainly about the public interface, and not about the internals.

Below I have summed up some remarks. To be clear, these are just some personal ideas to discuss!

1. The Categorical object vs the 'category' Series

Previously, there was already the Categorical class:

In [1]: pd.Categorical(["a","b","c","a"])
Out[1]:
 a
 b
 c
 a
Levels (3, object): [a < b < c]

Now, you can also put this in a Series:

In [2]: pd.Series(pd.Categorical(["a","b","c","a"]))
Out[2]:
0    a
1    b
2    c
3    a
dtype: category
Levels (3, object): [a < b < c]

To create this Series, you can either put an existing Categorical inside a Series (as above, or by assigning it to a column of a DataFrame, like df['cat'] = pd.Categorical(["a","b","c","a"])), or you can convert an existing Series to the 'category' dtype:

pd.Series(["a","b","c","a"]).astype('category')
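A minimal sketch of the construction routes described above, assuming a recent pandas release (the printed output may be formatted differently from the transcripts in this post). The variable names are only for illustration.

```python
import pandas as pd

values = ["a", "b", "c", "a"]

# Route 1: wrap an existing Categorical in a Series.
from_categorical = pd.Series(pd.Categorical(values))

# Route 2: convert an existing Series to the 'category' dtype.
from_astype = pd.Series(values).astype("category")

# Route 3: assign a Categorical to a DataFrame column.
df = pd.DataFrame({"x": range(len(values))})
df["cat"] = pd.Categorical(values)

# All three end up as a Series with the 'category' dtype.
print(from_categorical.dtype, from_astype.dtype, df["cat"].dtype)
```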

So basically, there are now two different main objects for dealing with categorical values (the Categorical object and the Series of 'category' dtype), and both are used and mixed in the docs.
This raises the question of whether this is needed (also touched on lightly here: ). Some remarks:

I also found the following in the discussion, by @JanSchulz (sorry if I misquote you guys :-)) (#8007 (comment)):

I've not found a usecase which would need to touch categoricals instead of Series(Categorical(...)))

and the response of @njsmith to that:

I don't really care what the data type for holding categorical data is, but I can certainly see the advantage of having just one data type. And if so then Series seems like a good choice for that.

So proposal: just use Series with 'category' dtype in all user facing API/functions and documentation.

2. Naming issues (levels, labels, codes, categories, ..)

The concepts of a categorical:

For codes, I think this is indeed much better, as labels was very confusing (it would be more logical for the labels to be the different values inside levels ..), and it has the advantage of being the same name as in R.

But the name levels is somewhat more problematic IMHO.
level already has another, established meaning in pandas, namely the different levels of a hierarchical MultiIndex. In many methods you have a level=.. keyword, and there are a lot of index methods to handle levels, like reorder_levels and droplevel.

In [14]: pd.MultiIndex.from_product([['a', 'b'],[1,2]], names=['A', 'B'])
Out[14]:
MultiIndex(levels=[[u'a', u'b'], [1, 2]],
           labels=[[0, 0, 1, 1], [0, 1, 0, 1]],
           names=[u'A', u'B'])

Here, there are two levels, a "first" and a "second" level. In a categorical context, however, the "first level" from above would itself consist of two levels, namely 'a' and 'b' ...
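To make the two meanings concrete, here is a small sketch that puts them side by side. It assumes a recent pandas release, where the attribute names differ from the output shown above: the distinct values of a Categorical are exposed as `categories`, and its integer representation as `codes`.

```python
import pandas as pd

# MultiIndex meaning: a "level" is one layer of the hierarchy.
mi = pd.MultiIndex.from_product([["a", "b"], [1, 2]], names=["A", "B"])
print(mi.nlevels)           # number of hierarchical levels
print(list(mi.levels[0]))   # distinct values stored in the first level

# Categorical meaning: the distinct values themselves are what this
# post calls "levels" (exposed as `categories` in recent pandas).
cat = pd.Categorical(["a", "b", "a", "b"])
print(list(cat.categories))
print(cat.codes.tolist())   # integer position of each value in `categories`
```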

Possible outcomes
What are the 'things' we now call 'levels'? They are the different classes, or the different categories, that are possible within the Categorical series. So maybe 'classes' or 'categories' would be an alternative name? In that regard I like the proposal of @immerrr (#7217 (comment)):

As for naming, I like the name "codes" for the numerical representation and would like to propose "categories" for the descriptive names.

@JanSchulz responded:

I wouldn't change the name levels to keep that aspect of R's factors.

A good argument, but I personally think the possible confusion between df.index.reorder_levels and df.cat.reorder_levels is important enough to reconsider this. Certainly if we get e.g. a CategoricalIndex in the future, reorder_levels will become totally ambiguous ...

So I would at least go with categorical_levels to make the distinction (as mentioned by @JanSchulz here: #7217 (comment)), or go with another name like categories.

3. Return type of Series.values

At the moment, when you have a Series with 'category' dtype, Series.values will return the Categorical object, and not a numpy array:

Of course, if we wanted to return a numpy array, it would have to be decided what it should return (e.g. what is now returned from np.asarray()). You lose information this way (the levels), and I suppose this is the reason to return a Categorical? But I personally find the consistency more important here, certainly if you can do everything with the Series that you can do with the Categorical (see the discussion above).
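The difference between the two return types can be checked with a short sketch like the following, assuming a recent pandas release and NumPy. It only inspects the types and does not claim anything about what the final API should be.

```python
import numpy as np
import pandas as pd

s = pd.Series(["a", "b", "c", "a"], dtype="category")

# .values hands back the Categorical itself, not a numpy array.
vals = s.values
print(type(vals).__name__)

# np.asarray() materializes the actual values as a plain ndarray;
# the set of possible categories is no longer attached to it.
arr = np.asarray(s)
print(type(arr).__name__, arr.tolist())
```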

@JanSchulz @jreback @jseabold @njsmith @immerrr @cpcloud @hayd