Categorical type by TomAugspurger · Pull Request #16015 · pandas-dev/pandas (original) (raw)
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service andprivacy statement. We’ll occasionally send you account related emails.
Already on GitHub?Sign in to your account
Conversation164 Commits7 Checks0 Files changed
Conversation
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.Learn more about bidirectional Unicode characters
[ Show hidden characters]({{ revealButtonHref }})
We extended the CategoricalDtype to accept optional categories
and ordered
argument. CategoricalDtype is now part of the public API. This allows users to
specify the desired categories and orderedness of an operation ahead of time.
The current behavior, which is still possible with categories=None
, the
default, is to infer the categories from whatever is present.
This change will make it easy to implement support for specifying categories
that are know ahead of time in other places e.g. .astype
, .read_csv
, and theSeries
constructor.
Closes #14711
Closes #15078
Closes #14676
Dunno if we would want to do this for 0.20 or not. It ended up not being too big a change, so maybe.
.dtype on a Categorical/CI should construct this
Categorica.is_dtype_equal should just work
can u run asv on this to make sure nothing much changing perf wise
.dtype on a Categorical/CI should construct this
Ha, that's why it felt like so much less code this time, because I forgot to do the other half :)
I'll work on this a bit more today, but will probably push to 0.21
yeah this feels like it should be in 0.21.0. maybe we also expose this in pandas.api.types.CategoricalDtype
rather than at the top-level.
@@ -131,6 +171,33 @@ def __eq__(self, other): |
---|
return isinstance(other, CategoricalDtype) |
@staticmethod |
def _hash_categories(categories): |
from pandas.tools.hashing import hash_array, _combine_hash_arrays |
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
you can just usehash_pandas_object
. Its an ordered hash of the data.
@@ -356,6 +357,70 @@ def test_not_string(self): |
---|
self.assertFalse(is_string_dtype(PeriodDtype('D'))) |
class TestCategoricalDtypeParametrized(object): |
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we can just call this TestCategoricalDtype
(naming consistency)
result = pd.Series(['a', 'b'], dtype=pd.CategoricalDtype(['b', 'a'])) |
---|
assert is_categorical_dtype(result) |
tm.assert_index_equal(result.cat.categories, pd.Index(['b', 'a'])) |
assert result.cat.ordered is False |
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What's more idiomatic: assert ... is False
or assert not ...
?
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if you are asserting that the result is a boolean (False
) , then the first is more appropriate. If its just not True
then the 2nd is fine (e.g. imagine 0 satisfies 2nd but not 1st). In this case we do want to assert a boolean result.
def test_categories(self): |
---|
result = CategoricalDtype(['a', 'b', 'c']) |
tm.assert_index_equal(result.categories, pd.Index(['a', 'b', 'c'])) |
assert result.ordered is False |
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same comment as here
jreback added a commit to jreback/pandas that referenced this pull request
jreback added a commit to jreback/pandas that referenced this pull request
jreback added a commit that referenced this pull request
pcluo pushed a commit to pcluo/pandas that referenced this pull request
Hello @TomAugspurger! Thanks for updating the PR.
Cheers ! There are no PEP8 issues in this Pull Request. 🍻
Comment last updated on September 23, 2017 at 16:34 Hours UTC
@@ -618,3 +618,8 @@ def test_categorical_equality(self, ordered, other, expected): |
---|
c1 = CategoricalDtype(['a', 'b'], ordered) |
result = c1 == other |
assert result == expected |
def test_mixed(self): |
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is going to fail at the moment. hash_object_array
tries to work on the mixed array first, and if that fails it falls back to hash_object_array(a.astype(str))
, which will cause a "collision" with the section set of categoricals. Not sure how I'll handle this yet.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
looks pretty good. I don't like the MultiIndex change (already commened on that PR). instead I would simply just change it (rather than adding a kw); the small api break is fine.
CategoricalDtype |
---|
---------------- |
.. versionadded:: 0.21.0 |
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
versionchanged; it exists now.
@@ -6,6 +6,7 @@ |
---|
from pandas.core.algorithms import factorize, unique, value_counts |
from pandas.core.dtypes.missing import isna, isnull, notna, notnull |
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think exposing this is a good idea here. We don't directly do this with other dtypes. instead the are in pandas.api.types
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm, I could go either way on this. It would be the only type at the top level. On the other hand pd.api.types.CategoricalDtype(values)
is quite long :)
I'll think about it some more, but you're probably right.
dtype = CategoricalDtype() |
---|
categories = getattr(dtype, 'categories', None) |
ordered = getattr(dtype, 'ordered', False) |
dtype = CategoricalDtype(categories=categories, ordered=ordered) |
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this is not really necessary if you allow CategoricalDtype(dtype)
IOW, you can check whether the incoming categories
is an instance of a CagetegoricalDtype
and act accordingly.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
?
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure I like overloading the categories
like that.
Type for categorical data with the categories and orderedness, |
---|
but not the values. |
.. versionadded:: 0.20.0 |
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
versionchanged
Examples |
---|
-------- |
>>> t = CategoricalDtype(categories=['b', 'a'], ordered=True) |
>>> s = Series(['a', 'a', 'b', 'b', 'a']) |
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wouldn't show anything not related to the actual dtype here.
def __new__(cls, categories=None, ordered=False, fastpath=False): |
from pandas.core.indexes.base import Index |
if categories is not None: |
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
do the import inside the function
# validation |
---|
cls._validate_categories(categories, fastpath=fastpath) |
cls._validate_ordered(ordered) |
# We have a choice when hashing pandas unordered categoricals |
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
use blank lines to indicate different thoughts
def __new__(cls, categories=None, ordered=False, fastpath=False): |
from pandas.core.indexes.base import Index |
if categories is not None: |
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
categories could be a CategoricalDtype (see my comment above)
@jorisvandenbossche Agreed about not having a fastpath
in the public API. see 91752fe
Initially we used __new__
for initialization, since we cached the instance (even though it didn't take parameters). That commit cleans things up a bit.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
look good. a few changes.
@@ -646,7 +646,10 @@ strings and apply several methods to it. These can be accessed like |
---|
Categorical |
~~~~~~~~~~~ |
If the Series is of dtype ``category``, ``Series.cat`` can be used to change the the categorical |
.. autoclass:: api.types.CategoricalDtype |
:members: categories, ordered |
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yep, we should use the auto* things more readily in other places. Maybe make an issue about this.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think we in general need to do this, as for most functions/methods, we already have the generated pages to link to
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I just did this instead of autosummary
since there are a bunch of unrelated methods that are just there for NumPy duck-typing.
A categorical's type is fully described by |
1. its categories: a sequence of unique values and no missing values |
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
maybe show as
1. ``categories``: ....
2. ``ordered``: .....
as these are the actual kwargs
expects a `dtype`. For example :func:`pandas.read_csv`, |
---|
:func:`pandas.DataFrame.astype`, or in the Series constructor. |
As a convenience, you can use the string ``'category'`` in place of a |
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
maybe put this in a warning/note as this is the backward compat aspect here. (and then refer to this in the whatsnew)
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
what is back incompat about doing dtype='category'
?
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think Jeff was saying that was the only way to do things in the past (backwards compatible). But I don't want to give the impression that we're thinking about deprecating dtype='category'
.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, but you didn't do that actual deprecation (or not yet)?
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, but you didn't do that actual deprecation (or not yet)?
I don't think this is desirable, 'category' is a nice convenience
Sorry, I was speaking about astype(dtype='category', categories=[...])
. I suppose I misread Jeff's comment (as it was about backwards compatibility, I thought it was about that part, as this is the only part that might be back incompat, if we deprecate categories=..
(not dtype='category'
)
@@ -10,6 +10,8 @@ users upgrade to this version. |
---|
Highlights include: |
- Integration with `Apache Parquet https://parquet.apache.org/`__, including a new top-level :func:`read_parquet` and :func:`DataFrame.to_parquet` method, see :ref:`here <io.parquet>`. |
- New user-facing :class:`pandas.api.types.CategoricalDtype` for specifying |
categoricals independent of the data (:issue:`14711`, :issue:`15078`) |
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would add the link the sub-section (and then list the issues there instead)
expanded to include the ``categories`` and ``ordered`` attributes. A |
---|
``CategoricalDtype`` can be used to specify the set of categories and |
orderedness of an array, independent of the data themselves. This can be useful, |
e.g., when converting string data to a ``Categorical``: |
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this
data = self.categories._format_data(name=self.__class__.__name__) |
---|
return tpl.format(data, self.ordered) |
def __repr__(self): |
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
?
categories = list(categories) # breaks if a np.array of categories |
---|
cat_array = hash_tuples(categories) |
else: |
if categories.dtype == 'O': |
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
?
# we want to reuse self.dtype if possible, i.e. neither are |
---|
# overridden. |
if dtype is not None and (categories is not None or |
ordered is not None): |
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
msg?
(pd.Categorical(['a', 'b']), CategoricalDtype(['a', 'b'])), |
---|
(pd.CategoricalIndex(['a', 'b']).dtype, CategoricalDtype(['a', 'b'])), |
(pd.CategoricalIndex(['a', 'b']), CategoricalDtype(['a', 'b'])), |
(CategoricalDtype(), CategoricalDtype()), |
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
leave original pd.Categorical(['a', 'b']).dtype, CategoricalDtype())
which should still work
assert result == expected |
---|
def test_invalid_raises(self): |
with tm.assert_raises_regex(TypeError, 'ordered'): |
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
might add a test for passing just a string for the first argument (should fail)
e.g.CategoricalDtype('category')
(see my comment about a from_string
above, which we should add for compat.
Correct, and I don't plan to deprecate `dtype'category'`.
On Mon, Sep 18, 2017 at 5:55 AM, Joris Van den Bossche < ***@***.***> wrote: ***@***.**** commented on this pull request. ------------------------------ In doc/source/categorical.rst <#16015 (comment)>: > +should be inferred from whatever is present in the data when the +:class:`pandas.Categorical` is created. + +.. ipython:: python + + from pandas.api.types import CategoricalDtype + + CategoricalDtype(['a', 'b', 'c']) + CategoricalDtype(['a', 'b', 'c'], ordered=True) + CategoricalDtype() + +A :class:`~pandas.api.types.CategoricalDtype` can be used in any place pandas +expects a `dtype`. For example :func:`pandas.read_csv`, +:func:`pandas.DataFrame.astype`, or in the Series constructor. + +As a convenience, you can use the string ``'category'`` in place of a Yes, but you didn't do that actual deprecation (or not yet)? — You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub <#16015 (comment)>, or mute the thread <https://github.com/notifications/unsubscribe-auth/ABQHItygdUADe7vz66Heen2kIDzUzdl9ks5sjkwVgaJpZM4M-hoW> .
Yes, but you didn't do that actual deprecation (or not yet)?
I don't think this is desirable, 'category' is a nice convenience
We extended the CategoricalDtype to accept optional categories and ordered argument.
pd.CategoricalDtype(categories=['a', 'b'], ordered=True
CategoricalDtype is now part of the public API. This allows users to specify the desired categories and orderedness of an operation ahead of time. The current behavior, which is still possible with categories=None, the default, is to infer the categories from whatever is present.
This change will make it easy to implement support for specifying categories that are know ahead of time in other places e.g. .astype, .read_csv, and the Series constructor.
Closes pandas-dev#14711 Closes pandas-dev#15078 Closes pandas-dev#14676
Get a valid instance of CategoricalDtype
as early as possible, and use that
throughout.
416d1d7 has the changes to Categorical.__init__
that you requested @jreback. It's a bit messy since there are basically 3 ways for specifying the categories
- passing
dtype=CategoricalDtype(...)
- passing
categories=...
- passing a
Categorical
forvalues
I think all three are useful and (hopefully) well-tested now. I'll update the docstring to indicate the priority (basically moving that comment to the docstring)
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
just a couple of comments, which can be put in a followup. let's merge on green.
Since ``dtype='category'`` is essentially ``CategoricalDtype(None, False)``, |
and since all instances ``CategoricalDtype`` compare equal to ``'`category'``, |
all instances of ``CategoricalDtype`` compare equal to a ``CategoricalDtype(None)`` |
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
so you should say CategoricalDtype(None, False)
at the end.
categories = list(categories) # breaks if a np.array of categories |
---|
cat_array = hash_tuples(categories) |
else: |
if categories.dtype == 'O': |
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
neh hash collisions basically don't happen in the space of things we are doing (normally). however you can collide by construction, e.g. code doing something wrong..
In any event we should move this code (can be a followup, maybe make an issue for things to do)
return hash(self) == hash(other) |
---|
def __unicode__(self): |
tpl = u'CategoricalDtype(categories={}ordered={})' |
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we have this hacky thing in Categorical to repr only 10 categories. CategoricalIndex actually correctly uses the option max_categories
, so should do that here (again can do as a followup)
tm.assert_index_equal(result.categories, expected) |
---|
assert result.ordered is False |
def test_constructor_tuples_datetimes(self): |
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this is janky. I don't think we should actually allow it. Where did this example come from? So we allow tuples as categories, I guess so.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This was extracted from a groupby test that was failing when I improperly handled tuples of level-1. Agreed that it's weird, but I'm inclined to support it.
expanded to include the ``categories`` and ``ordered`` attributes. A |
---|
``CategoricalDtype`` can be used to specify the set of categories and |
orderedness of an array, independent of the data themselves. This can be useful, |
e.g., when converting string data to a ``Categorical`` (:issue:`14711`, :issue:`15078`): |
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
there is 1 more issue you listed in the top of the PR
Thanks. I fixed up the doc comments. I'll merge on green and do the other followups later today.
On Fri, Sep 22, 2017 at 7:31 PM, Jeff Reback ***@***.***> wrote: ***@***.**** commented on this pull request. ------------------------------ In doc/source/whatsnew/v0.21.0.txt <#16015 (comment)>: > @@ -89,6 +91,30 @@ This does not raise any obvious exceptions, but also does not create a new colum Setting a list-like data structure into a new attribute now raise a ``UserWarning`` about the potential for unexpected behavior. See :ref:`Attribute Access <indexing.attribute_access>`. +.. _whatsnew_0210.enhancements.categorical_dtype: + +``CategoricalDtype`` for specifying categoricals +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + +:class:`pandas.api.types.CategoricalDtype` has been added to the public API and +expanded to include the ``categories`` and ``ordered`` attributes. A +``CategoricalDtype`` can be used to specify the set of categories and +orderedness of an array, independent of the data themselves. This can be useful, +e.g., when converting string data to a ``Categorical`` (:issue:`14711`, :issue:`15078`): there is 1 more issue you listed in the top of the PR — You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub <#16015 (review)>, or mute the thread <https://github.com/notifications/unsubscribe-auth/ABQHIlwjFqIXM4vMTgsVZ20tYOzp9CQSks5slFFogaJpZM4M-hoW> .
alanbato pushed a commit to alanbato/pandas that referenced this pull request
No-Stream pushed a commit to No-Stream/pandas that referenced this pull request