ENH: cythonize groupby.count by cpcloud · Pull Request #7016 · pandas-dev/pandas (original) (raw)
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service andprivacy statement. We’ll occasionally send you account related emails.
Already on GitHub?Sign in to your account
Conversation38 Commits5 Checks0 Files changed
Conversation
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.Learn more about bidirectional Unicode characters
[ Show hidden characters]({{ revealButtonHref }})
closes #7003
- vbench
axis parameter(this only works in non-cython land)- tests for
objectdtype if there are none - datetime64/timedelta64 count
vbench results:
-------------------------------------------------------------------------------
Test name | head[ms] | base[ms] | ratio |
-------------------------------------------------------------------------------
groupby_multi_count | 7.3980 | 6814.2579 | 0.0011 |
-------------------------------------------------------------------------------
Test name | head[ms] | base[ms] | ratio |
-------------------------------------------------------------------------------
@jreback axis parameter has no effect here...i kept for back compat with test_multilevel... i think should prob be implemented (i'm happy to do it)
Yeh not sure what it does in a groupby
as an aside why do u still need the lambda expression in count? is it called?
oh actually u need to pass a npfunc argument to _groupby_function ... fallback
fallback if the dtype does work / match?
FYI I don't think group count will work for datetime64; need to convert to i8 view and then compare against tslib.iNaT for nulls
fallback is called if any exceptions occur in the cython call
usually that is a dtype mismatch
cpcloud changed the title
ENH: cython groupby.count ENH: cythonize groupby.count
@jreback supporting date-likes here i think requires more than i originally set out to do with this ... about to push object dtype for counts
I think datelikes DO need to be handled (you can't just take a i8 view when you pass in?) and then do your comparisons instead of
if val != val
if val != tslib.iNaT
for nan testing?
Yeah it's there. I was trying to implement ttimedelta arithmetic for groupby but the py26 oddities are in the way. I'll do it in another pr
Sorry the numpy 1.6 oddities are there
ok...no problem (or leave it in python land). is ok too for now; can't leave it hanging though because would break things, ahh yes, the 2.6 oddities, so maybe you can just make it use python and M8/m8 in 2.6 (e.g. have a branch in the code), ugly but what the hey
@jreback yep took out the cython attempt ... now just squashing ... making sure that the cython path is hit for dates/timedeltas (but only count); current behavior is to raise ... i think there's an issue somewhere about groupby with deltas
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
should float really be here? (as that should be the use_floats)? no?
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes float should be here use_(float|int)s is really only for the first two arguments which refer to the name and the ctype, but the result of a groupby operation is always a float (even count, which is astyped to int64 as the last method call before returning to the user).
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
oh..right..np
-------------------------------------------------------------------------------
Test name | head[ms] | base[ms] | ratio |
-------------------------------------------------------------------------------
groupby_multi_count_dates | 20.4120 | 15484.0871 | 0.0013 |
groupby_multi_count | 20.8774 | 15609.8307 | 0.0013 |
-------------------------------------------------------------------------------
Test name | head[ms] | base[ms] | ratio |
-------------------------------------------------------------------------------
758x speedup, think we're doing ok :)
ah object dtype is still slow , might be another one of my dumb typos
can you make the vbench a bit smaller (so the base case takes say 1/10 the time)? (shouldn't really affect the speedup much), which is nice!
when you do v0.14.0, put in the perf section FYI (release notes goes in improvements)
will do thanks for the heads up
@jreback can we get this in soon as green? want to avoid doc merge conflicts.
cpcloud added a commit that referenced this pull request
ENH: cythonize groupby.count
thinking about this
int types can never be nan at all
and only int64 need comparisons against iNaT, so if less than int64 you don't even need a routine you can just call size (eg it's just the size of the group themselves -/ can't have Nan's)
object types need comparison against the actual NaT value (I think)
wonder if any of that even matters
Right, really only float, object, and int64 and those are really a view on dates or time deltas which are just ints really. Could have size as the fallback function instead of current version which computes sum of nonnull. Then if I take out the generated for ints it will raise and use the fallback.
yep
sorry should have mentioned before
I can do some microbenchmarks. If size is done with Len then we go from linear in the number of rows to constant so def worth an investigation
Pandas isn't on iPhone so this may have to wait until tomorrow :/
hah
no size is already computed (when the groupby happens) so should be trivial
more worried about the issue of the type coercion in the current cython code vs iNaT which is a bigger value than say int8
though it prob does work
cpcloud deleted the groupby-count-cython branch
@jreback i have the pr with size do u want me to put it up?
can you also post the vbench results at the top of the PR?
oh...I meant the cythonized count vs before you merged it in! (in this PR)
because when someone clicks on the release notes they go to this issue
yep ... that's coming running it now