Categorical type by TomAugspurger · Pull Request #16015 · pandas-dev/pandas (original) (raw)

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service andprivacy statement. We’ll occasionally send you account related emails.

Already on GitHub?Sign in to your account

Conversation164 Commits7 Checks0 Files changed

Conversation

This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.Learn more about bidirectional Unicode characters

[ Show hidden characters]({{ revealButtonHref }})

TomAugspurger

We extended the CategoricalDtype to accept optional categories and ordered
argument. CategoricalDtype is now part of the public API. This allows users to
specify the desired categories and orderedness of an operation ahead of time.
The current behavior, which is still possible with categories=None, the
default, is to infer the categories from whatever is present.

This change will make it easy to implement support for specifying categories
that are know ahead of time in other places e.g. .astype, .read_csv, and the
Series constructor.

Closes #14711
Closes #15078
Closes #14676

Dunno if we would want to do this for 0.20 or not. It ended up not being too big a change, so maybe.

@jreback

.dtype on a Categorical/CI should construct this

Categorica.is_dtype_equal should just work

@jreback

can u run asv on this to make sure nothing much changing perf wise

@TomAugspurger

.dtype on a Categorical/CI should construct this

Ha, that's why it felt like so much less code this time, because I forgot to do the other half :)

I'll work on this a bit more today, but will probably push to 0.21

@jreback

yeah this feels like it should be in 0.21.0. maybe we also expose this in pandas.api.types.CategoricalDtype rather than at the top-level.

jreback

@@ -131,6 +171,33 @@ def __eq__(self, other):
return isinstance(other, CategoricalDtype)
@staticmethod
def _hash_categories(categories):
from pandas.tools.hashing import hash_array, _combine_hash_arrays

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you can just usehash_pandas_object. Its an ordered hash of the data.

gfyoung

@@ -356,6 +357,70 @@ def test_not_string(self):
self.assertFalse(is_string_dtype(PeriodDtype('D')))
class TestCategoricalDtypeParametrized(object):

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we can just call this TestCategoricalDtype (naming consistency)

gfyoung

result = pd.Series(['a', 'b'], dtype=pd.CategoricalDtype(['b', 'a']))
assert is_categorical_dtype(result)
tm.assert_index_equal(result.cat.categories, pd.Index(['b', 'a']))
assert result.cat.ordered is False

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What's more idiomatic: assert ... is False or assert not ... ?

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if you are asserting that the result is a boolean (False) , then the first is more appropriate. If its just not True then the 2nd is fine (e.g. imagine 0 satisfies 2nd but not 1st). In this case we do want to assert a boolean result.

gfyoung

def test_categories(self):
result = CategoricalDtype(['a', 'b', 'c'])
tm.assert_index_equal(result.categories, pd.Index(['a', 'b', 'c']))
assert result.ordered is False

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same comment as here

jreback added a commit to jreback/pandas that referenced this pull request

Apr 23, 2017

@jreback

jreback added a commit to jreback/pandas that referenced this pull request

Apr 24, 2017

@jreback

jreback added a commit that referenced this pull request

Apr 24, 2017

@jreback

pcluo pushed a commit to pcluo/pandas that referenced this pull request

May 22, 2017

@jreback @pcluo

@pep8speaks

Hello @TomAugspurger! Thanks for updating the PR.

Cheers ! There are no PEP8 issues in this Pull Request. 🍻

Comment last updated on September 23, 2017 at 16:34 Hours UTC

TomAugspurger

@@ -618,3 +618,8 @@ def test_categorical_equality(self, ordered, other, expected):
c1 = CategoricalDtype(['a', 'b'], ordered)
result = c1 == other
assert result == expected
def test_mixed(self):

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is going to fail at the moment. hash_object_array tries to work on the mixed array first, and if that fails it falls back to hash_object_array(a.astype(str)), which will cause a "collision" with the section set of categoricals. Not sure how I'll handle this yet.

jreback

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

looks pretty good. I don't like the MultiIndex change (already commened on that PR). instead I would simply just change it (rather than adding a kw); the small api break is fine.

CategoricalDtype
----------------
.. versionadded:: 0.21.0

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

versionchanged; it exists now.

@@ -6,6 +6,7 @@
from pandas.core.algorithms import factorize, unique, value_counts
from pandas.core.dtypes.missing import isna, isnull, notna, notnull

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think exposing this is a good idea here. We don't directly do this with other dtypes. instead the are in pandas.api.types

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, I could go either way on this. It would be the only type at the top level. On the other hand pd.api.types.CategoricalDtype(values) is quite long :)

I'll think about it some more, but you're probably right.

dtype = CategoricalDtype()
categories = getattr(dtype, 'categories', None)
ordered = getattr(dtype, 'ordered', False)
dtype = CategoricalDtype(categories=categories, ordered=ordered)

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is not really necessary if you allow CategoricalDtype(dtype) IOW, you can check whether the incoming categories is an instance of a CagetegoricalDtype and act accordingly.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

?

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure I like overloading the categories like that.

Type for categorical data with the categories and orderedness,
but not the values.
.. versionadded:: 0.20.0

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

versionchanged

Examples
--------
>>> t = CategoricalDtype(categories=['b', 'a'], ordered=True)
>>> s = Series(['a', 'a', 'b', 'b', 'a'])

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wouldn't show anything not related to the actual dtype here.

def __new__(cls, categories=None, ordered=False, fastpath=False):
from pandas.core.indexes.base import Index
if categories is not None:

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

do the import inside the function

# validation
cls._validate_categories(categories, fastpath=fastpath)
cls._validate_ordered(ordered)
# We have a choice when hashing pandas unordered categoricals

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

use blank lines to indicate different thoughts

def __new__(cls, categories=None, ordered=False, fastpath=False):
from pandas.core.indexes.base import Index
if categories is not None:

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

categories could be a CategoricalDtype (see my comment above)

@jorisvandenbossche

@TomAugspurger

@jorisvandenbossche Agreed about not having a fastpath in the public API. see 91752fe

Initially we used __new__ for initialization, since we cached the instance (even though it didn't take parameters). That commit cleans things up a bit.

jreback

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

look good. a few changes.

@@ -646,7 +646,10 @@ strings and apply several methods to it. These can be accessed like
Categorical
~~~~~~~~~~~
If the Series is of dtype ``category``, ``Series.cat`` can be used to change the the categorical
.. autoclass:: api.types.CategoricalDtype
:members: categories, ordered

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yep, we should use the auto* things more readily in other places. Maybe make an issue about this.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think we in general need to do this, as for most functions/methods, we already have the generated pages to link to

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I just did this instead of autosummary since there are a bunch of unrelated methods that are just there for NumPy duck-typing.

A categorical's type is fully described by
1. its categories: a sequence of unique values and no missing values

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe show as

1. ``categories``: ....
2. ``ordered``: .....

as these are the actual kwargs

expects a `dtype`. For example :func:`pandas.read_csv`,
:func:`pandas.DataFrame.astype`, or in the Series constructor.
As a convenience, you can use the string ``'category'`` in place of a

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe put this in a warning/note as this is the backward compat aspect here. (and then refer to this in the whatsnew)

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what is back incompat about doing dtype='category' ?

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think Jeff was saying that was the only way to do things in the past (backwards compatible). But I don't want to give the impression that we're thinking about deprecating dtype='category'.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, but you didn't do that actual deprecation (or not yet)?

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, but you didn't do that actual deprecation (or not yet)?

I don't think this is desirable, 'category' is a nice convenience

Sorry, I was speaking about astype(dtype='category', categories=[...]). I suppose I misread Jeff's comment (as it was about backwards compatibility, I thought it was about that part, as this is the only part that might be back incompat, if we deprecate categories=.. (not dtype='category')

@@ -10,6 +10,8 @@ users upgrade to this version.
Highlights include:
- Integration with `Apache Parquet https://parquet.apache.org/`__, including a new top-level :func:`read_parquet` and :func:`DataFrame.to_parquet` method, see :ref:`here <io.parquet>`.
- New user-facing :class:`pandas.api.types.CategoricalDtype` for specifying
categoricals independent of the data (:issue:`14711`, :issue:`15078`)

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would add the link the sub-section (and then list the issues there instead)

expanded to include the ``categories`` and ``ordered`` attributes. A
``CategoricalDtype`` can be used to specify the set of categories and
orderedness of an array, independent of the data themselves. This can be useful,
e.g., when converting string data to a ``Categorical``:

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this

data = self.categories._format_data(name=self.__class__.__name__)
return tpl.format(data, self.ordered)
def __repr__(self):

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

?

categories = list(categories) # breaks if a np.array of categories
cat_array = hash_tuples(categories)
else:
if categories.dtype == 'O':

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

?

# we want to reuse self.dtype if possible, i.e. neither are
# overridden.
if dtype is not None and (categories is not None or
ordered is not None):

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

msg?

(pd.Categorical(['a', 'b']), CategoricalDtype(['a', 'b'])),
(pd.CategoricalIndex(['a', 'b']).dtype, CategoricalDtype(['a', 'b'])),
(pd.CategoricalIndex(['a', 'b']), CategoricalDtype(['a', 'b'])),
(CategoricalDtype(), CategoricalDtype()),

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

leave original pd.Categorical(['a', 'b']).dtype, CategoricalDtype()) which should still work

assert result == expected
def test_invalid_raises(self):
with tm.assert_raises_regex(TypeError, 'ordered'):

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

might add a test for passing just a string for the first argument (should fail)
e.g.
CategoricalDtype('category') (see my comment about a from_string above, which we should add for compat.

@TomAugspurger

Correct, and I don't plan to deprecate `dtype'category'`.

On Mon, Sep 18, 2017 at 5:55 AM, Joris Van den Bossche < ***@***.***> wrote: ***@***.**** commented on this pull request. ------------------------------ In doc/source/categorical.rst <#16015 (comment)>: > +should be inferred from whatever is present in the data when the +:class:`pandas.Categorical` is created. + +.. ipython:: python + + from pandas.api.types import CategoricalDtype + + CategoricalDtype(['a', 'b', 'c']) + CategoricalDtype(['a', 'b', 'c'], ordered=True) + CategoricalDtype() + +A :class:`~pandas.api.types.CategoricalDtype` can be used in any place pandas +expects a `dtype`. For example :func:`pandas.read_csv`, +:func:`pandas.DataFrame.astype`, or in the Series constructor. + +As a convenience, you can use the string ``'category'`` in place of a Yes, but you didn't do that actual deprecation (or not yet)? — You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub <#16015 (comment)>, or mute the thread <https://github.com/notifications/unsubscribe-auth/ABQHItygdUADe7vz66Heen2kIDzUzdl9ks5sjkwVgaJpZM4M-hoW> .

@jreback

Yes, but you didn't do that actual deprecation (or not yet)?

I don't think this is desirable, 'category' is a nice convenience

@TomAugspurger

We extended the CategoricalDtype to accept optional categories and ordered argument.

pd.CategoricalDtype(categories=['a', 'b'], ordered=True

CategoricalDtype is now part of the public API. This allows users to specify the desired categories and orderedness of an operation ahead of time. The current behavior, which is still possible with categories=None, the default, is to infer the categories from whatever is present.

This change will make it easy to implement support for specifying categories that are know ahead of time in other places e.g. .astype, .read_csv, and the Series constructor.

Closes pandas-dev#14711 Closes pandas-dev#15078 Closes pandas-dev#14676

@TomAugspurger

@TomAugspurger

Get a valid instance of CategoricalDtype as early as possible, and use that throughout.

@TomAugspurger

416d1d7 has the changes to Categorical.__init__ that you requested @jreback. It's a bit messy since there are basically 3 ways for specifying the categories

  1. passing dtype=CategoricalDtype(...)
  2. passing categories=...
  3. passing a Categorical for values

I think all three are useful and (hopefully) well-tested now. I'll update the docstring to indicate the priority (basically moving that comment to the docstring)

@TomAugspurger

jreback

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

just a couple of comments, which can be put in a followup. let's merge on green.

Since ``dtype='category'`` is essentially ``CategoricalDtype(None, False)``,
and since all instances ``CategoricalDtype`` compare equal to ``'`category'``,
all instances of ``CategoricalDtype`` compare equal to a ``CategoricalDtype(None)``

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

so you should say CategoricalDtype(None, False) at the end.

categories = list(categories) # breaks if a np.array of categories
cat_array = hash_tuples(categories)
else:
if categories.dtype == 'O':

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

neh hash collisions basically don't happen in the space of things we are doing (normally). however you can collide by construction, e.g. code doing something wrong..

In any event we should move this code (can be a followup, maybe make an issue for things to do)

return hash(self) == hash(other)
def __unicode__(self):
tpl = u'CategoricalDtype(categories={}ordered={})'

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we have this hacky thing in Categorical to repr only 10 categories. CategoricalIndex actually correctly uses the option max_categories, so should do that here (again can do as a followup)

tm.assert_index_equal(result.categories, expected)
assert result.ordered is False
def test_constructor_tuples_datetimes(self):

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is janky. I don't think we should actually allow it. Where did this example come from? So we allow tuples as categories, I guess so.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This was extracted from a groupby test that was failing when I improperly handled tuples of level-1. Agreed that it's weird, but I'm inclined to support it.

jreback

expanded to include the ``categories`` and ``ordered`` attributes. A
``CategoricalDtype`` can be used to specify the set of categories and
orderedness of an array, independent of the data themselves. This can be useful,
e.g., when converting string data to a ``Categorical`` (:issue:`14711`, :issue:`15078`):

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

there is 1 more issue you listed in the top of the PR

@TomAugspurger

@TomAugspurger

Thanks. I fixed up the doc comments. I'll merge on green and do the other followups later today.

On Fri, Sep 22, 2017 at 7:31 PM, Jeff Reback ***@***.***> wrote: ***@***.**** commented on this pull request. ------------------------------ In doc/source/whatsnew/v0.21.0.txt <#16015 (comment)>: > @@ -89,6 +91,30 @@ This does not raise any obvious exceptions, but also does not create a new colum Setting a list-like data structure into a new attribute now raise a ``UserWarning`` about the potential for unexpected behavior. See :ref:`Attribute Access <indexing.attribute_access>`. +.. _whatsnew_0210.enhancements.categorical_dtype: + +``CategoricalDtype`` for specifying categoricals +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + +:class:`pandas.api.types.CategoricalDtype` has been added to the public API and +expanded to include the ``categories`` and ``ordered`` attributes. A +``CategoricalDtype`` can be used to specify the set of categories and +orderedness of an array, independent of the data themselves. This can be useful, +e.g., when converting string data to a ``Categorical`` (:issue:`14711`, :issue:`15078`): there is 1 more issue you listed in the top of the PR — You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub <#16015 (review)>, or mute the thread <https://github.com/notifications/unsubscribe-auth/ABQHIlwjFqIXM4vMTgsVZ20tYOzp9CQSks5slFFogaJpZM4M-hoW> .

@TomAugspurger

@TomAugspurger

@jreback

alanbato pushed a commit to alanbato/pandas that referenced this pull request

Nov 10, 2017

@TomAugspurger @alanbato

No-Stream pushed a commit to No-Stream/pandas that referenced this pull request

Nov 28, 2017

@TomAugspurger @No-Stream