API: Make Categorical.searchsorted returns a scalar when supplied a scalar by topper-123 · Pull Request #23466 · pandas-dev/pandas
- closes BUG: CategoricalIndex.searchsorted doesn't return a scalar if input was scalar #21019
- tests added / passed
- passes git diff upstream/master -u -- "*.py" | flake8 --diff
- whatsnew entry
Categorical.searchsorted returns the wrong shape for scalar input. Numpy arrays and all other array types return a scalar if the input is a scalar, but Categorical does not.
For example:
    >>> import numpy as np
    >>> np.array([1, 2, 3]).searchsorted(1)
    0
    >>> np.array([1, 2, 3]).searchsorted([1])
    array([0])
    >>> import pandas as pd
    >>> d = pd.date_range('2018', periods=4)
    >>> d.searchsorted(d[0])
    0
    >>> d.searchsorted(d[:1])
    array([0])
    >>> n = 100_000
    >>> c = pd.Categorical(list('a' * n + 'b' * n + 'c' * n), ordered=True)
    >>> c.searchsorted('b')
    array([100000], dtype=int32)  # master
    100000  # this PR. Scalar input should lead to scalar output
    >>> c.searchsorted(['b'])
    array([100000], dtype=int32)  # master and this PR
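The scalar-in, scalar-out convention can be sketched without pandas. The following is a toy model only, not the pandas implementation: the `searchsorted` helper, its `isinstance(value, str)` scalar check, and the small `codes` list are all assumptions made for illustration. It maps each value to its category code and then binary-searches the sorted codes with the standard-library `bisect_left`.

```python
from bisect import bisect_left


def searchsorted(sorted_codes, categories, value):
    """Toy model: return an int for a scalar value, a list for a list of values."""
    code_of = {cat: code for code, cat in enumerate(categories)}
    if isinstance(value, str):  # toy scalar check; the PR diff uses is_scalar
        return bisect_left(sorted_codes, code_of[value])
    return [bisect_left(sorted_codes, code_of[v]) for v in value]


categories = ["a", "b", "c"]
codes = [0] * 4 + [1] * 4 + [2] * 4  # sorted codes for 'aaaabbbbcccc'

scalar_result = searchsorted(codes, categories, "b")
list_result = searchsorted(codes, categories, ["b"])
print(scalar_result)  # 4
print(list_result)    # [4]
```

The output shape follows the input shape: a single value gives a single position, a one-element list gives a one-element list.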
This new implementation is BTW quite a bit faster than the old implementation, because we avoid recoding the codes when doing the self.codes.searchsorted(code, ...) bit:
    >>> %timeit c.searchsorted('b')
    237 µs  # master
    6.12 µs  # this PR
A consequence of the new implementation is that KeyError is now raised when a key isn't found. Previously a ValueError was raised.
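Why the exception type changes can be illustrated with a small stand-alone sketch. It is not pandas code: `loc_style`, `indexer_style` and `exception_name` are hypothetical helpers, and a plain dict stands in for the categories index. A direct per-key lookup lets the missing key surface as a KeyError, whereas a batch lookup that marks misses with -1 needs an explicit check, which is where the earlier ValueError came from.

```python
CATEGORIES = ["a", "b", "c"]
_positions = {cat: i for i, cat in enumerate(CATEGORIES)}


def loc_style(key):
    # Direct lookup: a missing key propagates as KeyError.
    return _positions[key]


def indexer_style(keys):
    # Batch lookup: misses become -1, then an explicit check raises ValueError.
    codes = [_positions.get(k, -1) for k in keys]
    if -1 in codes:
        raise ValueError("Value(s) to be inserted must be in categories.")
    return codes


def exception_name(fn, arg):
    # Report which exception type (if any) a lookup raises.
    try:
        fn(arg)
    except (KeyError, ValueError) as err:
        return type(err).__name__
    return None


print(exception_name(loc_style, "z"))        # KeyError
print(exception_name(indexer_style, ["z"]))  # ValueError
```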
    if is_scalar(value):
        codes = self.categories.get_loc(value)
    else:
        codes = [self.categories.get_loc(val) for val in value]
you can use .get_indexer here
Unfortunately get_indexer is much slower than get_loc:

    >>> %timeit c.categories.get_loc('b')
    6.12 µs  # this PR
    >>> %timeit c.categories.get_indexer(['b'])
    257 µs
I've made the update to use .get_indexer anyway, and will use this as an opportunity to look for a way to make get_indexer faster, as that will yield benefits beyond .searchsorted. Alternatively I can roll back this last commit, and add the get_indexer part later, when I figure out why get_indexer is slow.
.get_indexer is for many items, where it will be much faster than an iteration of .get_loc, but for a small number of items the reverse may be true, i.e. there will be a cross-over point.
I don't think that is true here: get_loc makes a call to get_indexer, so get_indexer shouldn't be slower, or at the very least not this much slower. My guess is that there is some unneeded type conversion or parameter usage happening.
I'll look into it. If everything is in get_indexer for the right reasons, I just won't pursue the case further.
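Whether a cross-over point exists can be checked empirically. The harness below is a minimal sketch under stated assumptions: `compare_lookup_cost` and the toy stand-ins `toy_get_loc` and `toy_get_indexer` are hypothetical names introduced here so the block runs with the standard library alone, and the toy timings say nothing about pandas itself.

```python
import timeit


def compare_lookup_cost(single, batch, keys, sizes=(1, 10, 100), number=200):
    """Time a loop of single-key lookups against one batch lookup per size.

    Returns a list of (size, seconds_for_loop, seconds_for_batch) tuples.
    """
    results = []
    for size in sizes:
        # Build a sample of the requested size by cycling through the keys.
        sample = [keys[i % len(keys)] for i in range(size)]
        loop_time = timeit.timeit(lambda: [single(k) for k in sample], number=number)
        batch_time = timeit.timeit(lambda: batch(sample), number=number)
        results.append((size, loop_time, batch_time))
    return results


# Toy stand-ins (hypothetical, not pandas code) so the sketch runs anywhere.
_table = {"a": 0, "b": 1, "c": 2}


def toy_get_loc(key):
    return _table[key]


def toy_get_indexer(keys):
    return [_table.get(k, -1) for k in keys]


timings = compare_lookup_cost(toy_get_loc, toy_get_indexer, ["a", "b", "c"])
for size, loop_time, batch_time in timings:
    print(f"{size:>4} keys: loop {loop_time:.6f}s, batch {batch_time:.6f}s")
```

With pandas installed, passing `c.categories.get_loc` as `single` and `c.categories.get_indexer` as `batch` should measure the real methods across input sizes.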
looks fine, ping on green.
    if -1 in values_as_codes:
        raise ValueError("Value(s) to be inserted must be in categories.")
    if is_scalar(value):
i think this is confusing code because get_loc raises, i would rather just use .get_indexer here
The point of searchsorted is fast searching. get_indexer is currently very very slow, as it always creates an array. get_loc OTOH can return scalar or a slice, which is both faster to create and faster to use.
So I think we need to keep get_loc, unless get_indexer gets a redesign
and this actually makes a difference? show this specific case
i am sure that optimizing get_indexer would not be hard and is a better soln
    >>> c = pd.Categorical(list('a' + 'b' + 'c'))
    >>> %timeit c.categories.get_loc('b')
    1.19 µs
    >>> %timeit c.categories.get_indexer(['b'])
    261 µs
I can take a look at optimizing get_indexer
It turns out that making get_indexer fast is not easy. The issue is that the method needs an Index as its argument, or converts its input to an Index. Converting to Index is a very slow process, and probably it's best to make get_indexer use arrays/ExtensionArrays (lower overhead when creating, presumably), but that's a completely different issue.
So I've reverted to making minimal changes in searchsorted, and only do the changes in the API (scalar input leads to scalar output).
I'll take a look at making get_indexer faster in a separate PR and then, if I succeed, make searchsorted faster using get_indexer.
topper-123 changed the title from "API/PERF: Categorical.searchsorted is faster and returns a scalar, when supplied a scalar" to "API: Categorical.searchsorted returns a scalar, when supplied a scalar"
topper-123 changed the title from "API: Categorical.searchsorted returns a scalar, when supplied a scalar" to "API: Make Categorical.searchsorted returns a scalar when supplied a scalar"
tiny typo. ping on green.
@@ -960,6 +960,8 @@ Other API Changes
- Slicing a single row of a DataFrame with multiple ExtensionArrays of the same type now preserves the dtype, rather than coercing to object (:issue:`22784`)
- :class:`DateOffset` attribute `_cacheable` and method `_should_cache` have been removed (:issue:`23118`)
- Comparing :class:`Timedelta` to be less or greater than unknown types now raises a ``TypeError`` instead of returning ``False`` (:issue:`20829`)
- :meth:`Categorical.searchsorted`, when supplied a scalar value to search for, now returns a scalar instead of an array (:issue:`23466`).
- :meth:`Categorical.searchsorted` now raises a ``keyError`` rather that a ``ValueError``, if a searched for key is not found in its categories (:issue:`23466`).
KeyError
thoo added a commit to thoo/pandas that referenced this pull request
tm9k1 pushed a commit to tm9k1/pandas that referenced this pull request
Pingviinituutti pushed a commit to Pingviinituutti/pandas that referenced this pull request
Pingviinituutti pushed a commit to Pingviinituutti/pandas that referenced this pull request