BUG: Categorical comparison with unordered by TomAugspurger · Pull Request #16339 · pandas-dev/pandas (original) (raw)
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service andprivacy statement. We’ll occasionally send you account related emails.
Already on GitHub?Sign in to your account
Conversation18 Commits1 Checks0 Files changed
Conversation
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.Learn more about bidirectional Unicode characters
[ Show hidden characters]({{ revealButtonHref }})
Fixes categorical comparison operations improperly considering
ordering when two unordered categoricals are compared.
Closes #16014
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good!
result = c1 == c2 |
---|
assert result[0] |
assert not result[1] |
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I personally find tm.assert_array_equal(c1 == c2, np.array([True, False]))
clearer in this case (but that is maybe taste :-))
def test_unordered_different_order_equal(self): |
---|
# https://github.com/pandas-dev/pandas/issues/16014 |
c1 = Categorical(['a', 'b'], categories=['a', 'b'], ordered=False) |
c2 = Categorical(['a', 'b'], categories=['b', 'a'], ordered=False) |
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A lot of the tests above are with Series instead of Categorical.
Don't need to change all, but maybe good to test this with a Series as well
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
minor comments, otherwise lgtm.
@@ -453,6 +453,14 @@ the original values: |
---|
np.asarray(cat) > base |
When you compare two unordered categoricals with the same categories, the order is not considered: |
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
versionadded tag
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think not necessary since it's a bugfix.
msg = ("Categoricals can only be compared if " |
---|
"'categories' are the same") |
if len(self.categories) != len(other.categories): |
raise TypeError(msg) |
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
maybe this could be a more specialized message (e.g the len comparison)
c2 = Categorical(['a', 'c'], categories=['c', 'a'], ordered=False) |
---|
with pytest.raises(TypeError) as rec: |
c1 == c2 |
assert rec.match("Categoricals can only be compared") |
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we should choose a way to do this, this way or the tm.assert_raises_regex
na_mask = (self._codes == -1) | (other._codes == -1) |
---|
if not self.ordered: |
# Comparison uses codes, so align theirs to ours |
other_codes = _get_codes_for_values(other, self.categories) |
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is only needed when the categories are not equal?
Might be good to do a quick perf check for the basic case of comparing categoricals with equal categories
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
good catch. Just pushed a commit fixing this to only do the recoding when the categories don't match.
msg = ("Categoricals can only be compared if " |
---|
"'categories' are the same.") |
if len(self.categories) != len(other.categories): |
raise TypeError(msg + " Categories are different lengths") |
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think you are including the origninal message here, but it its a little awkward
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What did you have in mind? I like how it's currently "Categoricals can only be compared if 'categories' are the same. Categories are different lengths." Since it's the general problem (different categories) and a specific hint
as to what's wrong (different lengths)
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
actually this is fine.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
minor comment, lgtm.
if not (self.ordered == other.ordered): |
---|
raise TypeError("Categoricals can only be compared if " |
"'ordered' is the same") |
na_mask = (self._codes == -1) | (other._codes == -1) |
if not self.categories.equals(other.categories): |
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
maybe you can do a if not self.ordered and and not self.categories.equals(..)
to avoid doing this check if the ordered case (when the check is not needed, as it was already checked above)
Fixes categorical comparison operations improperly considering ordering when two unordered categoricals are compared.
Closes pandas-dev#16014
pawroman added a commit to pawroman/pandas that referenced this pull request
pcluo pushed a commit to pcluo/pandas that referenced this pull request
Fixes categorical comparison operations improperly considering ordering when two unordered categoricals are compared.
Closes pandas-dev#16014
TomAugspurger added a commit to TomAugspurger/pandas that referenced this pull request
Fixes categorical comparison operations improperly considering ordering when two unordered categoricals are compared.
Closes pandas-dev#16014 (cherry picked from commit 91e9e52)
TomAugspurger added a commit that referenced this pull request
Fixes categorical comparison operations improperly considering ordering when two unordered categoricals are compared.
Closes #16014 (cherry picked from commit 91e9e52)
stangirala pushed a commit to stangirala/pandas that referenced this pull request
Fixes categorical comparison operations improperly considering ordering when two unordered categoricals are compared.
Closes pandas-dev#16014