PERF: faster placement creating extension blocks from arrays by jorisvandenbossche · Pull Request #32856 · pandas-dev/pandas (original) (raw)
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service andprivacy statement. We’ll occasionally send you account related emails.
Already on GitHub?Sign in to your account
Conversation24 Commits6 Checks0 Files changed
Conversation
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.Learn more about bidirectional Unicode characters
[ Show hidden characters]({{ revealButtonHref }})
When creating a DataFrame from many arrays stored in ExtensionBlocks, it seems quite some time is taken inside BlockPlacement using np.require
on the passed list. Specifying the placement as a slice instead gives a much faster creation of the BlockPlacement. This delays the conversion to an array, though, but afterwards the conversion of the slice to an array inside BlockPlacement when neeeded is faster than an initial creation of a BlockPlacement from a list/array of 1 element.
From investigating #32196 (comment)
@rth this reduces it with another third! (only from the dataframe creation, to be clear)
Using the same example from #32826. With:
arrays = [pd.arrays.SparseArray(np.random.randint(0, 2, 1000), dtype="float64") for _ in range(10000)]
index = pd.Index(range(len(arrays[0])))
columns = pd.Index(range(len(arrays)))
it gives
In [4]: %timeit pd.DataFrame._from_arrays(arrays, index=index, columns=columns)
113 ms ± 874 µs per loop (mean ± std. dev. of 7 runs, 10 loops each) <-- master
72.9 ms ± 648 µs per loop (mean ± std. dev. of 7 runs, 10 loops each) <-- PR
This is also indirectly covered by the sparse benchmark, but adding some benchmarks specifically for _from_arrays
:
before after ratio
[e31354f0] [6d6e822a]
<master> <perf-placement>
- 14.5±0.7ms 9.58±0.6ms 0.66 frame_ctor.FromArrays.time_frame_from_arrays_sparse
make_block(array, klass=ObjectValuesExtensionBlock, placement=[i]) |
---|
make_block( |
array, klass=ObjectValuesExtensionBlock, placement=slice(i, i + 1, 1) |
) |
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
rather than changing this here
simply convert a single integer into a slice at a lower level
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, indeed, I can create the slice inside BlockPlacement constructor. But don't we then want to explicitly pass a single integer instead of a list of 1 integer?
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we require list / slice
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To compare, I just pushed a commit that does both.
Personally, I like passing it as an integer. It's another 33% faster compared to passing it as a slice or 1-element list, and it makes it also explicit when constructing it that it is about a single column.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Whoops, I missed you proposed a single integer yourself (for some reason I thought you wanted to catch the single element list in BlockPlacement). Updated.
another option would be to pass BlockPlacement objects. If we do that consistently, we could remove the checks/casting in the constructor/property
Sorry, I don't understand that option. You still need to create BlockPlacement objects, right? And the question here is about how to create them (from an integer, or from a slice, or a 1-len list)
You still need to create BlockPlacement objects, right? And the question here is about how to create them (from an integer, or from a slice, or a 1-len list)
Yah, the thought was about creating the BlockPlacement object and passing it to make_block
rather than having it created later. Can ignore as orthogonal, but might trim some overhead.
passing it to make_block rather than having it created later.
It shouldn't matter much I think, it is just passed through until ExtensionBlock init, and there it's doing a if not isinstance(placement, BlockPlacement): placement = BlockPlacement(placement)
So we would first need to eliminate all other places where we pass a slice/array as placement to block creation, and then it would only elimiate one isinstance call
So we would first need to eliminate all other places where we pass a slice/array
Yes, it is not a trivial idea.
and then it would only elimiate one isinstance call
We could also make mgr_locs not-a-property, so get marginally faster lookups.
But again, ignore as orthogonal.
Yep, any other comments on the PR itself?
self.columns = pd.Index(range(N_cols)) |
---|
def time_frame_from_arrays_float(self): |
self.df = DataFrame._from_arrays( |
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wouldn't change this if no other feedback but I don't think you need the assignment here in any of the benchmarks
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
True (was only mimicking the other benchmarks in this file)
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
hmm this is wrong, but yeah can clean this up in a followup (if you want to create a followup issue or PR to do it)
self.columns = pd.Index(range(N_cols)) |
---|
def time_frame_from_arrays_float(self): |
self.df = DataFrame._from_arrays( |
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
hmm this is wrong, but yeah can clean this up in a followup (if you want to create a followup issue or PR to do it)
Related to non-user accessible pandas implementation
label
rth mentioned this pull request
SeeminSyed pushed a commit to CSCD01-team01/pandas that referenced this pull request
jbrockmendel pushed a commit to jbrockmendel/pandas that referenced this pull request
Labels
Related to non-user accessible pandas implementation
Memory or execution speed performance