PERF: setting values via df.loc / df.iloc with pyarrow-backed columns by lukemanley · Pull Request #50248 · pandas-dev/pandas


Conversation (9) · Commits (8) · Checks (0) · Files changed

Conversation


lukemanley

Performance improvement in ArrowExtensionArray.__setitem__ when key is a null slice. The rationale for adding a fast path here is that internally null slices get used in a variety of DataFrame setitem operations. Here is an example:

import pandas as pd
import numpy as np

arr = pd.array(np.arange(10**6), dtype="int64[pyarrow]")
df = pd.DataFrame({'a': arr, 'b': arr})

%timeit df.iloc[0, 0] = 0

# 483 ms ± 74.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)    <- main
# 3.4 ms ± 590 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)  <- PR

The example above does not use a null slice; however, internally a null slice is used in ExtensionBlock.set_inplace.
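For context, a "null slice" is just `slice(None)`, i.e. the indexer produced by `arr[:]`. The check used in the fast path amounts to the following (a simplified sketch of `pandas.core.common.is_null_slice`, not the exact implementation):

```python
def is_null_slice(obj) -> bool:
    """Return True for slice(None, None, None), i.e. the `arr[:]` indexer."""
    return (
        isinstance(obj, slice)
        and obj.start is None
        and obj.stop is None
        and obj.step is None
    )

# Internal DataFrame setitem paths such as ExtensionBlock.set_inplace
# index the whole column with this kind of slice, which is why a fast
# path for it speeds up scalar assignments like df.iloc[0, 0] = 0.
assert is_null_slice(slice(None))
assert not is_null_slice(slice(0, 2))
assert not is_null_slice(0)
```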

ASV added:

       before           after         ratio
     [7ef6a71c]       [4e4b36d4]
                      <arrow-ea-setitem-null-slice>
-         802±2μs       57.4±0.5μs     0.07  array.ArrowStringArray.time_setitem_null_slice(False)
-     3.12±0.04ms          194±2μs     0.06  array.ArrowStringArray.time_setitem_null_slice(True)

@lukemanley

@lukemanley

@lukemanley lukemanley changed the title PERF: df.loc / df.iloc with pyarrow-backed columns → PERF: setting values via df.loc / df.iloc with pyarrow-backed columns

Dec 14, 2022

@lukemanley

mroeschke

value = self._maybe_convert_setitem_value(value)
# fast path (GH50248)
if com.is_null_slice(key):
if is_scalar(value) and not pa_version_under6p0:


Do we need to directly check not pa_version_under6p0? If a user has data with a pyarrow data type I think it's implicitly assumed that pyarrow is installed?


That path hits pc.if_else, which I believe was added in 6.0. pc.if_else was the fastest way I found to create an array of all the same value; there are slower options, of course.


We recently bumped the min pyarrow version to 6.0 so I think this should be safe to remove


done

mroeschke


  1. Is this robust to the case when the value being set is not the same type as the original array e.g. ser = pd.Series([1], dtype="int64"); ser[:] = "foo" should not work
  2. Related to the above, how did this case behave before in setitem?
  3. Could you add tests for the null slice if/else branches you added?

@lukemanley

@lukemanley

  1. Is this robust to the case when the value being set is not the same type as the original array e.g. ser = pd.Series([1], dtype="int64"); ser[:] = "foo" should not work
  2. Related to the above, how did this case behave before in setitem?

It is robust in the sense that it raises:

import pandas as pd
ser = pd.Series([1], dtype="int64[pyarrow]")
ser[:] = "foo"

# with this PR:
ArrowInvalid: Could not convert 'foo' with type str: tried to convert to int64

# Previously:
ArrowNotImplementedError: NumPy type not implemented: unrecognized type (19) in GetNumPyTypeName

However, different indexers raise different errors:

import pandas as pd
ser = pd.Series([1], dtype="int64[pyarrow]")
ser[0] = "foo"

# ArrowNotImplementedError: NumPy type not implemented: unrecognized type (19) in GetNumPyTypeName

I suspect that refactoring setitem to not go through numpy would help with consistency and might make it cleaner.

  1. Could you add tests for the null slice if/else branches you added?

added

@mroeschke

Okay, glad to see that Arrow is erroring in this case. Could you add a test for this invalid setting case?

Yeah I think generally ArrowInvalid: Could not convert 'foo' with type str: tried to convert to int64 is a better error message but the refactoring can be left for a separate PR

@lukemanley

@lukemanley

Okay, glad to see that Arrow is erroring in this case. Could you add a test for this invalid setting case?

Added

mroeschke

@mroeschke

phofl pushed a commit to phofl/pandas that referenced this pull request

Dec 17, 2022

@lukemanley @phofl

…pandas-dev#50248)

2 participants

@lukemanley @mroeschke