Discussion: feedback on the Categorical integration · Issue #8074 · pandas-dev/pandas

I finally had some time to read up on the discussions and to look at the implementation of the Categoricals integration, and I still have some questions and comments. I am sorry that this is rather late to the party, but I still think this is important to discuss (and I certainly don't want to disregard the really great work @JanSchulz and @jreback put into this! Thanks a lot for that!).
It is mainly about the public interface, and not about the internals.

Below I have summed up some remarks. To be clear, these are just some personal ideas to discuss!

1. The `Categorical` object vs the 'category' `Series`

Previously, there was already the Categorical class:

In [1]: pd.Categorical(["a","b","c","a"])
Out[1]:
 a
 b
 c
 a
Levels (3, object): [a < b < c]

Now, you can also put this in a Series:

In [2]: pd.Series(pd.Categorical(["a","b","c","a"]))
Out[2]:
0    a
1    b
2    c
3    a
dtype: category
Levels (3, object): [a < b < c]

To create this Series, you can either put an existing Categorical inside a Series (as above, or by assigning it to a column of a DataFrame, like df['cat'] = pd.Categorical(["a","b","c","a"])), or you can convert an existing Series to the 'category' dtype:

pd.Series(["a","b","c","a"]).astype('category')
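A minimal sketch comparing the construction routes described above. It assumes a reasonably recent pandas release (one that postdates this issue); the variable names are only illustrative.

```python
import pandas as pd

values = ["a", "b", "c", "a"]

# Route 1: wrap an existing Categorical in a Series.
from_categorical = pd.Series(pd.Categorical(values))

# Route 2: convert an existing Series to the 'category' dtype.
from_astype = pd.Series(values).astype("category")

# Route 3: assign a Categorical to a DataFrame column.
df = pd.DataFrame({"x": range(4)})
df["cat"] = pd.Categorical(values)

# All three routes end up with the same dtype and the same values.
same_dtype = from_categorical.dtype == from_astype.dtype == df["cat"].dtype
same_values = from_categorical.equals(from_astype)
print(same_dtype, same_values)
```

If the routes are interchangeable like this, the choice between them is mostly a matter of which spelling reads best.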

So basically, there are now two different main objects to deal with categorical values (the Categorical object and the Series of 'category' dtype), and both are used, and mixed, in the docs.
This raises the question of whether this is needed (also touched on lightly here: ). Some remarks:

The constructor pd.Series(pd.Categorical(...)) is a bit cumbersome, I think.
In the documentation, both Categorical and category Series are used. But what is the difference between the two (in user-facing interaction)? What is the advantage of one over the other, and in what circumstances? Why should I use a Categorical and not a category Series?
Going further, are both possibilities needed (in user-facing API, docs, etc)? Why not always just use a Series with category dtype?

I also found the following in the discussion, by @JanSchulz (sorry if I misquote you guys :-)) (#8007 (comment)):

I've not found a usecase which would need to touch categoricals instead of Series(Categorical(...)))

and the response of @njsmith to that:

I don't really care what the data type for holding categorical data is, but I can certainly see the advantage of having just one data type. And if so then Series seems like a good choice for that.

So proposal: just use Series with 'category' dtype in all user facing API/functions and documentation.

2. Naming issues (levels, labels, codes, categories, ..)

The concepts of a categorical:

codes: numerical representation (previously called labels)
levels: descriptive names

For codes, I think this is indeed much better, as labels was very confusing (it would be more logical for the labels to be the different values inside levels ..), and it has the advantage of being the same name as in R.
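To make the two concepts concrete, here is a small sketch. It assumes a recent pandas release, where the descriptive names (the "levels" of this issue) are exposed through the `categories` attribute; the local variable names are illustrative.

```python
import pandas as pd

cat = pd.Categorical(["a", "b", "c", "a"])

# codes: integer position of each value in the lookup table of names.
codes = cat.codes.tolist()

# The descriptive names (called "levels" in this discussion).
names = list(cat.categories)

# Looking up each code in the table of names rebuilds the original values.
rebuilt = [names[code] for code in codes]
print(codes, names, rebuilt)
```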

But the name levels is somewhat more problematic, IMHO.
level already has another, established meaning in pandas, namely the different levels of a hierarchical MultiIndex. Many methods have a level=.. keyword, and there are a lot of index methods to handle levels, like reorder_levels and droplevel.

In [14]: pd.MultiIndex.from_product([['a', 'b'],[1,2]], names=['A', 'B'])
Out[14]:
MultiIndex(levels=[[u'a', u'b'], [1, 2]],
           labels=[[0, 0, 1, 1], [0, 1, 0, 1]],
           names=[u'A', u'B'])

Here, there are two levels, a "first" and a "second" level. In a categorical context, however, the "first level" from above would itself consist of two levels, namely 'a' and 'b' ...
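The two meanings can be counted side by side. This sketch deliberately uses three values in the first layer (a variation on the example above, so that the two counts differ) and assumes a recent pandas release where the categorical names are available as `categories`.

```python
import pandas as pd

mi = pd.MultiIndex.from_product([["a", "b", "c"], [1, 2]], names=["A", "B"])

# MultiIndex sense: one "level" per layer of the hierarchy.
n_index_levels = mi.nlevels

# Categorical sense: treat the first layer as categorical data and
# count its distinct descriptive names.
first_layer = pd.Categorical(mi.get_level_values("A"))
n_categorical_levels = len(first_layer.categories)

print(n_index_levels, n_categorical_levels)
```

The same word would describe a count of 2 in one context and a count of 3 in the other.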

Possible outcomes
What are the 'things' we now call 'levels'? They are the different classes, or the different categories, that are possible within the Categorical series. So maybe 'classes' or 'categories' would be an alternative name? In that regard I like the proposal of @immerrr (#7217 (comment)):

As for naming, I like the name "codes" for the numerical representation and would like to propose "categories" for the descriptive names.

@JanSchulz responded:

I wouldn't change the name levels to keep that aspect of R's factors.

A good argument, but I personally think the possible confusion between df.index.reorder_levels and df.cat.reorder_levels is important enough to reconsider this. Certainly if we were to have e.g. a CategoricalIndex in the future, reorder_levels would become totally dubious ...

So, at least I would go with categorical_levels to make the distinction (as mentioned by @JanSchulz here: #7217 (comment)), or go with another name like categories.
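For illustration, the two reordering operations do quite different things. The sketch below assumes a recent pandas release, in which the categorical accessor method is spelled `reorder_categories`; it is not the API as it stood when this issue was written.

```python
import pandas as pd

mi = pd.MultiIndex.from_product([["a", "b"], [1, 2]], names=["A", "B"])
s = pd.Series(["a", "b", "c", "a"], dtype="category")

# Index sense: swap the order of the two hierarchy layers.
swapped = mi.reorder_levels(["B", "A"])

# Categorical sense: change the order of the descriptive names;
# the values in the Series stay the same.
reordered = s.cat.reorder_categories(["c", "b", "a"])

print(list(swapped.names), list(reordered.cat.categories), list(reordered))
```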

3. Return type of `Series.values`

At the moment, when you have a Series with 'category' dtype, Series.values will return the Categorical object, and not a numpy array.

This does not seem very consistent with the other dtypes.
The documentation of Series.values is also very clear on that: "returns Series as numpy.ndarray" (and this is also how it is printed in my head).
What is a good reason to deviate from this rule? (certainly if you can have e.g. the s.cat attribute to return it)

Of course, if we wanted to return a numpy array, it would have to be decided what it should return (e.g. what is returned now from np.asarray()). You lose information with this (the levels), and I suppose this is the reason to return a Categorical? But I personally find the consistency more important here, certainly if you can do everything with the Series that you can do with the Categorical (discussion above).
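A short sketch of the behaviour discussed in this section, assuming a recent pandas release (where the names are exposed as `categories` rather than levels): `Series.values` hands back the Categorical, while `np.asarray()` gives a plain array that no longer carries the names.

```python
import numpy as np
import pandas as pd

s = pd.Series(["a", "b", "c", "a"], dtype="category")

values = s.values          # a Categorical, not a numpy array
as_array = np.asarray(s)   # a plain numpy array of the values

print(type(values).__name__, type(as_array).__name__)

# The plain array has lost the table of descriptive names.
has_categories = hasattr(as_array, "categories")
print(has_categories)
```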

@JanSchulz @jreback @jseabold @njsmith @immerrr @cpcloud @hayd
