ENH: cythonize groupby.count by cpcloud · Pull Request #7016 · pandas-dev/pandas (original) (raw)

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service andprivacy statement. We’ll occasionally send you account related emails.

Already on GitHub?Sign in to your account

Conversation38 Commits5 Checks0 Files changed

Conversation

This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.Learn more about bidirectional Unicode characters

[ Show hidden characters]({{ revealButtonHref }})

@cpcloud

closes #7003

vbench results:

-------------------------------------------------------------------------------
Test name                                    | head[ms] | base[ms] |  ratio   |
-------------------------------------------------------------------------------
groupby_multi_count                          |   7.3980 | 6814.2579 |   0.0011 |
-------------------------------------------------------------------------------
Test name                                    | head[ms] | base[ms] |  ratio   |
-------------------------------------------------------------------------------

@cpcloud

@jreback axis parameter has no effect here...i kept for back compat with test_multilevel... i think should prob be implemented (i'm happy to do it)

@jreback

Yeh not sure what it does in a groupby

as an aside why do u still need the lambda expression in count? is it called?

@cpcloud

oh actually u need to pass a npfunc argument to _groupby_function ... fallback

@jreback

fallback if the dtype does work / match?

FYI I don't think group count will work for datetime64; need to convert to i8 view and then compare against tslib.iNaT for nulls

@cpcloud

fallback is called if any exceptions occur in the cython call

@cpcloud

usually that is a dtype mismatch

@cpcloud cpcloud changed the titleENH: cython groupby.count ENH: cythonize groupby.count

May 1, 2014

@cpcloud

@jreback supporting date-likes here i think requires more than i originally set out to do with this ... about to push object dtype for counts

@jreback

I think datelikes DO need to be handled (you can't just take a i8 view when you pass in?) and then do your comparisons instead of

if val != val

if val != tslib.iNaT

for nan testing?

@cpcloud

Yeah it's there. I was trying to implement ttimedelta arithmetic for groupby but the py26 oddities are in the way. I'll do it in another pr

@cpcloud

Sorry the numpy 1.6 oddities are there

@jreback

ok...no problem (or leave it in python land). is ok too for now; can't leave it hanging though because would break things, ahh yes, the 2.6 oddities, so maybe you can just make it use python and M8/m8 in 2.6 (e.g. have a branch in the code), ugly but what the hey

@cpcloud

@jreback yep took out the cython attempt ... now just squashing ... making sure that the cython path is hit for dates/timedeltas (but only count); current behavior is to raise ... i think there's an issue somewhere about groupby with deltas

jreback

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should float really be here? (as that should be the use_floats)? no?

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes float should be here use_(float|int)s is really only for the first two arguments which refer to the name and the ctype, but the result of a groupby operation is always a float (even count, which is astyped to int64 as the last method call before returning to the user).

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

oh..right..np

@jreback

@cpcloud

@cpcloud

@jreback

-------------------------------------------------------------------------------
Test name                                    | head[ms] | base[ms] |  ratio   |
-------------------------------------------------------------------------------
groupby_multi_count_dates                    |  20.4120 | 15484.0871 |   0.0013 |
groupby_multi_count                          |  20.8774 | 15609.8307 |   0.0013 |
-------------------------------------------------------------------------------
Test name                                    | head[ms] | base[ms] |  ratio   |
-------------------------------------------------------------------------------

758x speedup, think we're doing ok :)

@cpcloud

ah object dtype is still slow , might be another one of my dumb typos

@jreback

can you make the vbench a bit smaller (so the base case takes say 1/10 the time)? (shouldn't really affect the speedup much), which is nice!

@cpcloud

@jreback

when you do v0.14.0, put in the perf section FYI (release notes goes in improvements)

@cpcloud

will do thanks for the heads up

@cpcloud

@cpcloud

@jreback can we get this in soon as green? want to avoid doc merge conflicts.

@jreback

cpcloud added a commit that referenced this pull request

May 5, 2014

@cpcloud

ENH: cythonize groupby.count

@jreback

thinking about this

int types can never be nan at all
and only int64 need comparisons against iNaT, so if less than int64 you don't even need a routine you can just call size (eg it's just the size of the group themselves -/ can't have Nan's)

object types need comparison against the actual NaT value (I think)

wonder if any of that even matters

@cpcloud

Right, really only float, object, and int64 and those are really a view on dates or time deltas which are just ints really. Could have size as the fallback function instead of current version which computes sum of nonnull. Then if I take out the generated for ints it will raise and use the fallback.

@jreback

yep
sorry should have mentioned before

@cpcloud

I can do some microbenchmarks. If size is done with Len then we go from linear in the number of rows to constant so def worth an investigation

@cpcloud

Pandas isn't on iPhone so this may have to wait until tomorrow :/

@jreback

hah

no size is already computed (when the groupby happens) so should be trivial

more worried about the issue of the type coercion in the current cython code vs iNaT which is a bigger value than say int8
though it prob does work

@cpcloud cpcloud deleted the groupby-count-cython branch

May 6, 2014 03:45

@cpcloud

@jreback i have the pr with size do u want me to put it up?

@jreback

@jreback

can you also post the vbench results at the top of the PR?

@cpcloud

@jreback

oh...I meant the cythonized count vs before you merged it in! (in this PR)

because when someone clicks on the release notes they go to this issue

@cpcloud

yep ... that's coming running it now

@jreback

Labels

2 participants

@cpcloud @jreback