PERF: faster placement creating extension blocks from arrays by jorisvandenbossche · Pull Request #32856 · pandas-dev/pandas (original) (raw)

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service andprivacy statement. We’ll occasionally send you account related emails.

Already on GitHub?Sign in to your account

Conversation24 Commits6 Checks0 Files changed

Conversation

This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.Learn more about bidirectional Unicode characters

[ Show hidden characters]({{ revealButtonHref }})

jorisvandenbossche

When creating a DataFrame from many arrays stored in ExtensionBlocks, it seems quite some time is taken inside BlockPlacement using np.require on the passed list. Specifying the placement as a slice instead gives a much faster creation of the BlockPlacement. This delays the conversion to an array, though, but afterwards the conversion of the slice to an array inside BlockPlacement when neeeded is faster than an initial creation of a BlockPlacement from a list/array of 1 element.

From investigating #32196 (comment)

@rth this reduces it with another third! (only from the dataframe creation, to be clear)

@jorisvandenbossche

@jorisvandenbossche

Using the same example from #32826. With:

arrays = [pd.arrays.SparseArray(np.random.randint(0, 2, 1000), dtype="float64") for _ in range(10000)]
index = pd.Index(range(len(arrays[0])))  
columns = pd.Index(range(len(arrays)))    

it gives

In [4]: %timeit pd.DataFrame._from_arrays(arrays, index=index, columns=columns)  
113 ms ± 874 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)   <-- master
72.9 ms ± 648 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)   <-- PR

@jorisvandenbossche

@jorisvandenbossche

This is also indirectly covered by the sparse benchmark, but adding some benchmarks specifically for _from_arrays:

       before           after         ratio
     [e31354f0]       [6d6e822a]
     <master>         <perf-placement>
-      14.5±0.7ms       9.58±0.6ms     0.66  frame_ctor.FromArrays.time_frame_from_arrays_sparse

rth

jreback

make_block(array, klass=ObjectValuesExtensionBlock, placement=[i])
make_block(
array, klass=ObjectValuesExtensionBlock, placement=slice(i, i + 1, 1)
)

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

rather than changing this here
simply convert a single integer into a slice at a lower level

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, indeed, I can create the slice inside BlockPlacement constructor. But don't we then want to explicitly pass a single integer instead of a list of 1 integer?

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we require list / slice

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To compare, I just pushed a commit that does both.
Personally, I like passing it as an integer. It's another 33% faster compared to passing it as a slice or 1-element list, and it makes it also explicit when constructing it that it is about a single column.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Whoops, I missed you proposed a single integer yourself (for some reason I thought you wanted to catch the single element list in BlockPlacement). Updated.

@jorisvandenbossche

@jorisvandenbossche

jbrockmendel

@jbrockmendel

another option would be to pass BlockPlacement objects. If we do that consistently, we could remove the checks/casting in the constructor/property

@jorisvandenbossche

Sorry, I don't understand that option. You still need to create BlockPlacement objects, right? And the question here is about how to create them (from an integer, or from a slice, or a 1-len list)

@jbrockmendel

You still need to create BlockPlacement objects, right? And the question here is about how to create them (from an integer, or from a slice, or a 1-len list)

Yah, the thought was about creating the BlockPlacement object and passing it to make_block rather than having it created later. Can ignore as orthogonal, but might trim some overhead.

@jorisvandenbossche

passing it to make_block rather than having it created later.

It shouldn't matter much I think, it is just passed through until ExtensionBlock init, and there it's doing a if not isinstance(placement, BlockPlacement): placement = BlockPlacement(placement)

So we would first need to eliminate all other places where we pass a slice/array as placement to block creation, and then it would only elimiate one isinstance call

@jbrockmendel

So we would first need to eliminate all other places where we pass a slice/array

Yes, it is not a trivial idea.

and then it would only elimiate one isinstance call

We could also make mgr_locs not-a-property, so get marginally faster lookups.

But again, ignore as orthogonal.

@jorisvandenbossche

Yep, any other comments on the PR itself?

@jbrockmendel

WillAyd

self.columns = pd.Index(range(N_cols))
def time_frame_from_arrays_float(self):
self.df = DataFrame._from_arrays(

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wouldn't change this if no other feedback but I don't think you need the assignment here in any of the benchmarks

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

True (was only mimicking the other benchmarks in this file)

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hmm this is wrong, but yeah can clean this up in a followup (if you want to create a followup issue or PR to do it)

@jorisvandenbossche

@jorisvandenbossche

jreback

self.columns = pd.Index(range(N_cols))
def time_frame_from_arrays_float(self):
self.df = DataFrame._from_arrays(

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hmm this is wrong, but yeah can clean this up in a followup (if you want to create a followup issue or PR to do it)

@jreback jreback added the Internals

Related to non-user accessible pandas implementation

label

Mar 21, 2020

@jreback

@rth rth mentioned this pull request

Mar 22, 2020

SeeminSyed pushed a commit to CSCD01-team01/pandas that referenced this pull request

Mar 22, 2020

@jorisvandenbossche @SeeminSyed

jbrockmendel pushed a commit to jbrockmendel/pandas that referenced this pull request

Mar 23, 2020

@jorisvandenbossche @jbrockmendel

Labels

Internals

Related to non-user accessible pandas implementation

Performance

Memory or execution speed performance