[ArrayManager] Add libreduction frame Slider for ArrayManager by jorisvandenbossche · Pull Request #40171 · pandas-dev/pandas
Conversation
xref #39146 and follow-up on #40050 (comment)
While getting groupby working for ArrayManager in #40050, I didn't yet use fast_apply / libreduction.apply_frame_axis0 because that relies on BlockManager internals. But libreduction is faster than the python fallback (luckily, otherwise we would have it for nothing ;)).
So this PR is adding a version of the BlockSlider but then for a frame backed by an ArrayManager.
I didn't combine BlockSlider / ArraySlider in a single class for now (I first wanted to get it working). I assume it would be "possible" with some gymnastics (several details differ, such as the strides/shape index used to move, access to blklocs/blknos, etc.).
So personally I would prefer keeping it as a separate class (certainly given that we don't intend to keep both long term, I think).
For benchmarks, we have groupby.Apply.time_scalar_function_single/multi_col that exercises this fast path, and this PR brings the ArrayManager code path back on par with the BlockManager version (before this PR, that benchmark saw a 3x slowdown because of using the python fallback).
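For readers new to the slider trick, a rough pure-Python sketch of the idea follows (hypothetical names; the real implementation is a cdef class in pandas/_libs/reduction.pyx that mutates a single reusable dummy frame in place rather than building a frame per group):

```python
import numpy as np
import pandas as pd

def iter_group_views(df, starts, ends):
    # Grab each column's buffer once; the numpy slices below are views,
    # so no per-group copy of the data is made.
    cols = {name: df[name].to_numpy() for name in df.columns}
    index = df.index
    for start, end in zip(starts, ends):
        chunk = {name: arr[start:end] for name, arr in cols.items()}
        # The cython slider avoids even this constructor call by
        # repointing one dummy frame at the sliced buffers.
        yield pd.DataFrame(chunk, index=index[start:end], copy=False)
```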
```cython
mgr._blknos = self.orig_blknos
```

```cython
cdef class ArraySlider(FrameSlider):
```
I doubt it will be big, but there might be some speedups available by doing @cython.final on both BlockSlider and ArraySlider.
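For illustration, that suggestion would look roughly like this in the .pyx file (a sketch; only the decorator is the point):

```cython
cimport cython

@cython.final  # no subclasses exist, so cython can devirtualize calls
cdef class ArraySlider(FrameSlider):
    pass
```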
```cython
# GH#35417 attributes we need to restore at each step in case
# the function modified them.
self.orig_arrays = self.dummy._mgr.arrays
self.arrays = list(self.dummy._mgr.arrays)
```
BlockSlider defines self.blk_values = [block.values for block in self.dummy._mgr.blocks], which I think can now use the arrays property to match self.arrays here.
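A before/after sketch of the suggested change in BlockSlider.__init__ (hedged; it assumes the manager's arrays property exposes the same buffers in a compatible order):

```python
# before:
self.blk_values = [block.values for block in self.dummy._mgr.blocks]
# after, mirroring ArraySlider (the reviewer's suggestion):
self.blk_values = list(self.dummy._mgr.arrays)
```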
```cython
for i, arr in enumerate(self.arrays):
    self.base_ptrs[i] = (<ndarray>arr).data
```

```cython
def __dealloc__(self):
```
Looks like this is the same in both classes; put it in the base class?
Skipping a bunch of in-line commenting: I think a bunch more of this can be shared between the classes.
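As a hedged sketch (the field names here are invented), the shared pieces could be hoisted into the base class like so:

```cython
from libc.stdlib cimport free

cdef class FrameSlider:
    cdef:
        Py_ssize_t nitems
        char** base_ptrs

    def __dealloc__(self):
        # identical in BlockSlider and ArraySlider today, so free the
        # pointer array here once instead of in each subclass
        free(self.base_ptrs)
```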
> For benchmarks, we have groupby.Apply.time_scalar_function_single/multi_col that exercises this fast path, and this PR brings the ArrayManager code path back on par with the BlockManager version (before this PR, that benchmark saw a 3x slowdown because of using the python fallback).
Have you happened to profile this? IIRC from when I tried to remove fast_apply, the big perf hit came in the BlockManager/Block/Index constructors, at least one of which I'd expect ArrayManager to have an easier time with.
Yes, I profiled it (because I was first trying to avoid doing this PR ;)), and indeed a significant part of the overhead comes from the constructors. But I am not sure this can be improved much. E.g. the _chop method in groupby/ops.py is already constructing a DataFrame from a manager, which is already a fastpath.
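For context, that fastpath is roughly the following (a simplified sketch; the exact signature in groupby/ops.py may differ between versions):

```python
def _chop(self, sdata, slice_obj):
    # slice the manager directly and hand it to the manager-based
    # DataFrame constructor, skipping the generic alignment machinery
    mgr = sdata._mgr.get_slice(slice_obj, axis=1)
    return sdata._constructor(mgr)
```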
Additionally, part of the overhead comes from slicing the Index, which turns out to be quite expensive. This is something that can be improved (my idea was to add an Index._slice() method that can be used in Manager.get_slice(), because the main Index.__getitem__ is harder to optimize in general).
Another part of overhead comes from checking whether the result is "like indexed" (getting the axes, comparing them) at:
```python
group_axes = group.axes
res = f(group)
if not _is_indexed_like(res, group_axes, axis):
    mutated = True
```
There are certainly some percentages here and there to shave off in the python implementation, but I think the cython version will always be faster. But we should maybe discuss how much slowdown we want to accept to get rid of the libreduction fast_apply altogether (but I would like to keep that separate from the ArrayManager performance evaluation).
> Yes, I profiled it
Mind posting the results?
> my idea was to add an Index._slice() method that can be used in Manager.get_slice(), because the main Index.__getitem__ is harder to optimize in general
I like this idea. IIUC it would allow us to go through the fastpath independent of what type of Index we have (and get rid of _index_data)
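A minimal sketch of the proposal (hypothetical; the real method would need to handle more Index subclasses and go through the _simple_new plumbing):

```python
class IndexSketch:
    def __init__(self, data, name=None):
        self._data = data  # the backing ndarray
        self.name = name

    def _slice(self, slobj):
        # a plain slice of the backing array is always a view with the
        # same dtype, so the generic __getitem__ dispatch (scalar vs
        # boolean vs fancy indexing) can be skipped entirely
        return type(self)(self._data[slobj], name=self.name)
```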
> But we should maybe discuss how much slowdown we want to accept to get rid of the libreduction fast_apply altogether
Sure. I'll throw out a number: if we could get the worst case down to within about 3x and most cases within 1.5x, I'd be open to removing the cython paths. They are a disproportionate producer of headaches. (From #36459 (possibly out of date): "entirely disabling the cython path leads to 4 currently-xfailed tests passing")
Besides which, if/when cython3 becomes available, we may have to get rid of these anyway.
> Mind posting the results?
I didn't yet find an easy way to post flamegraphs from snakeviz... But so it's just a profile of the groupby call in this snippet (taken from the benchmark case that was affected by using the python fallback for ArrayManager):
```python
import numpy as np
from pandas import DataFrame

N = 10 ** 4
labels = np.random.randint(0, 2000, size=N)
labels2 = np.random.randint(0, 3, size=N)
df = DataFrame(
    {
        "key": labels,
        "key2": labels2,
        "value1": np.random.randn(N),
        "value2": ["foo", "bar", "baz", "qux"] * (N // 4),
    }
)
df_am = df._as_manager("array")

df_am.groupby("key").apply(lambda x: 1)
```
I should note that this specific benchmark produces a resulting (aggregated) frame with about 2000 rows, starting from a dataframe with N = 10 ** 4, which means that on average the intermediate DataFrame that gets constructed has 5 rows... which is pretty small (and in addition it uses a dummy aggregation function that doesn't take much time itself).
And with this rather special (IMO) case, I only see a 3-4x slowdown (this is comparing BM with fast_apply vs AM with fallback, but my guess is that BM with fallback will be similar):
```
In [2]: %timeit df.groupby("key").apply(lambda x: 1)
6.65 ms ± 854 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

In [3]: %timeit df_am.groupby("key").apply(lambda x: 1)
21.8 ms ± 998 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
```
but when increasing N to 10 ** 6, it becomes less than a 2x slowdown:
```
In [6]: %timeit df.groupby("key").apply(lambda x: 1)
162 ms ± 5.25 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

In [7]: %timeit df_am.groupby("key").apply(lambda x: 1)
263 ms ± 11.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
```
(so that's already close to the numbers you cite ;))
> But so it's just a profile
I was thinking `%prun -s cumtime for i in range(100): do_thing()`
Actually, in this second example with bigger data (N = 10 ** 6), the slowdown seems to come from the fact that with the python fallback, we call splitter._get_sorted_data() twice...
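The shape of the fix is to compute the sorted frame once and reuse it, e.g. (a sketch with invented names; pandas' splitter exposes a cached sorted_data property that both callers could go through):

```python
from functools import cached_property

class SplitterSketch:
    def __init__(self, data, sort_idx):
        self._data = data
        self._sort_idx = sort_idx

    @cached_property
    def sorted_data(self):
        # computed once; every later access reuses the same object,
        # removing the duplicated _get_sorted_data() call
        return self._data.take(self._sort_idx)
```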
Removing this duplication gives:
```
In [2]: %timeit df.groupby("key").apply(lambda x: 1)
158 ms ± 4.87 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

In [3]: %timeit df_am.groupby("key").apply(lambda x: 1)
159 ms ± 5.32 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
```
Yeah, ideally you can create a common base class here. Of course, if we can blow away all the cython, even better :->
@jorisvandenbossche this is just perf, not a blocker for getting AM fully functional? If so, mind putting it on the backburner for a day or two while I try out my get-rid-of-libreduction idea?
Using the example #40171 (comment) (which IIRC is based on the asv benchmark that was most affected by disabling fast_apply the last time I looked at getting rid of libreduction):
```python
import numpy as np
import pandas as pd

N = 10 ** 4
labels = np.random.randint(0, 2000, size=N)
labels2 = np.random.randint(0, 3, size=N)
df = pd.DataFrame(
    {
        "key": labels,
        "key2": labels2,
        "value1": np.random.randn(N),
        "value2": ["foo", "bar", "baz", "qux"] * (N // 4),
    }
)

%prun -s cumtime for i in range(100): df.groupby("key").apply(lambda x: 1)
```
In master, where this goes through fast_apply:
```
ncalls tottime percall cumtime percall filename:lineno(function)
100 0.001 0.000 1.027 0.010 groupby.py:913(apply)
100 0.002 0.000 1.023 0.010 groupby.py:962(_python_apply_general)
100 0.002 0.000 0.975 0.010 ops.py:263(apply)
100 0.000 0.000 0.838 0.008 ops.py:1020(fast_apply)
100 0.310 0.003 0.829 0.008 {pandas._libs.reduction.apply_frame_axis0}
199001 0.124 0.000 0.287 0.000 base.py:725(_engine)
199101 0.125 0.000 0.206 0.000 base.py:4511(__getitem__)
408416/207611 0.060 0.000 0.100 0.000 {built-in method builtins.len}
199001 0.045 0.000 0.066 0.000 base.py:4345(_get_engine_target)
199002 0.047 0.000 0.063 0.000 common.py:157(cast_scalar_indexer)
401 0.003 0.000 0.058 0.000 series.py:277(__init__)
200503 0.040 0.000 0.056 0.000 base.py:751(__len__)
100 0.000 0.000 0.048 0.000 ops.py:232(_get_splitter)
100 0.000 0.000 0.047 0.000 ops.py:380(group_info)
100 0.000 0.000 0.047 0.000 ops.py:398(_get_compressed_codes)
100 0.000 0.000 0.046 0.000 ops.py:339(codes)
100 0.000 0.000 0.046 0.000 ops.py:341(<listcomp>)
200 0.000 0.000 0.046 0.000 grouper.py:582(codes)
100 0.000 0.000 0.046 0.000 grouper.py:603(_make_codes)
100 0.002 0.000 0.046 0.000 generic.py:1246(_wrap_applied_output)
100 0.000 0.000 0.045 0.000 ops.py:1003(sorted_data)
100 0.002 0.000 0.042 0.000 algorithms.py:559(factorize)
300 0.002 0.000 0.042 0.000 construction.py:455(sanitize_array)
100 0.000 0.000 0.040 0.000 generic.py:3533(take)
100 0.001 0.000 0.037 0.000 managers.py:1448(take)
600 0.001 0.000 0.035 0.000 take.py:23(take_nd)
100 0.001 0.000 0.032 0.000 cast.py:121(maybe_convert_platform)
600 0.003 0.000 0.031 0.000 take.py:75(_take_nd_ndarray)
```
If I just disable fast_apply but leave everything else as in master:
```
ncalls tottime percall cumtime percall filename:lineno(function)
100 0.001 0.000 5.545 0.055 groupby.py:913(apply)
100 0.002 0.000 5.540 0.055 groupby.py:962(_python_apply_general)
100 0.528 0.005 5.494 0.055 ops.py:263(apply)
198400 0.168 0.000 4.660 0.000 ops.py:990(__iter__)
198300 0.302 0.000 4.438 0.000 ops.py:1025(_chop)
198300 0.350 0.000 3.552 0.000 managers.py:788(get_slice)
198300 0.179 0.000 1.513 0.000 managers.py:794(<listcomp>)
594900 0.742 0.000 1.334 0.000 blocks.py:358(getitem_block)
198300 0.206 0.000 1.006 0.000 base.py:4511(__getitem__)
198400 0.275 0.000 0.684 0.000 managers.py:149(__init__)
198300 0.162 0.000 0.666 0.000 numeric.py:125(_shallow_copy)
198400 0.158 0.000 0.562 0.000 frame.py:550(__init__)
198300 0.131 0.000 0.504 0.000 base.py:648(_shallow_copy)
198800 0.276 0.000 0.363 0.000 generic.py:235(__init__)
198600 0.189 0.000 0.341 0.000 base.py:563(_simple_new)
594900 0.191 0.000 0.280 0.000 blocks.py:138(_simple_new)
198400 0.114 0.000 0.234 0.000 managers.py:155(<listcomp>)
2217700 0.191 0.000 0.194 0.000 {built-in method builtins.isinstance}
594900 0.192 0.000 0.192 0.000 blocks.py:353(_slice)
595400 0.122 0.000 0.174 0.000 managers.py:233(ndim)
793500 0.122 0.000 0.122 0.000 {built-in method __new__ of type object at 0x103788410}
397100 0.087 0.000 0.120 0.000 base.py:6112(ensure_index)
199800 0.120 0.000 0.120 0.000 {pandas._libs.lib.is_scalar}
198300 0.067 0.000 0.108 0.000 ops.py:956(_is_indexed_like)
198600 0.085 0.000 0.105 0.000 base.py:714(_reset_identity)
198800 0.087 0.000 0.087 0.000 flags.py:47(__init__)
596000 0.079 0.000 0.079 0.000 blocks.py:295(mgr_locs)
198400 0.071 0.000 0.071 0.000 frame.py:692(axes)
400 0.003 0.000 0.056 0.000 series.py:277(__init__)
603600/602300 0.053 0.000 0.054 0.000 {built-in method builtins.len}
```
3.552 seconds is spent in BlockManager.get_slice, which I think we can cut down significantly by, among other things, making the BlockManager/Block signatures stricter and skipping validation in their __init__ methods. For this to be safe, we need something like #40182 to keep that validation in place for downstream libraries.
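As a hedged illustration of that idea (hypothetical names), the pattern is a validating public constructor plus an unchecked classmethod fastpath for internal callers that already guarantee the invariants:

```python
class ManagerSketch:
    def __init__(self, blocks, axes):
        # public path: keep the expensive checks for downstream code
        self._verify_integrity(blocks, axes)
        self.blocks = blocks
        self.axes = axes

    @classmethod
    def _simple_new(cls, blocks, axes):
        # internal fastpath used in hot loops such as per-group slicing:
        # the caller promises blocks/axes are already consistent
        obj = object.__new__(cls)
        obj.blocks = blocks
        obj.axes = axes
        return obj

    @staticmethod
    def _verify_integrity(blocks, axes):
        # placeholder for the real shape/locs consistency checks
        if not isinstance(axes, (list, tuple)):
            raise TypeError("axes must be a sequence")
```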
So the question: how much overhead are we willing to trade for getting rid of libreduction?
> this is just perf, not a blocker for getting AM fully functional?
Indeed, just performance. No problem in first exploring the "get rid of reduce".
Can you move your comment above to a new issue to discuss this?
> Can you move your comment above to a new issue to discuss this?
sure
This pull request is stale because it has been open for thirty days with no activity. Please update or respond to this comment if you're still interested in working on this.
Yes, closing this while awaiting any decision on #40263.