PERF: setting values via df.loc / df.iloc with pyarrow-backed columns by lukemanley · Pull Request #50248 · pandas-dev/pandas (original) (raw)
Conversation
- Tests added and passed if fixing a bug or adding a new feature
- All code checks passed.
- Added type annotations to new arguments/methods/functions.
- Added an entry in the latest doc/source/whatsnew/v2.0.0.rst file if fixing a bug or adding a new feature.
Performance improvement in ArrowExtensionArray.__setitem__ when the key is a null slice. The rationale for adding a fast path here is that null slices are used internally in a variety of DataFrame setitem operations. Here is an example:
import pandas as pd
import numpy as np
arr = pd.array(np.arange(10**6), dtype="int64[pyarrow]")
df = pd.DataFrame({'a': arr, 'b': arr})
%timeit df.iloc[0, 0] = 0
# 483 ms ± 74.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each) <- main
# 3.4 ms ± 590 µs per loop (mean ± std. dev. of 7 runs, 100 loops each) <- PR
The example above does not use a null slice directly; however, a null slice is used internally in ExtensionBlock.set_inplace:
ASV added:
before after ratio
[7ef6a71c] [4e4b36d4]
<arrow-ea-setitem-null-slice>
- 802±2μs 57.4±0.5μs 0.07 array.ArrowStringArray.time_setitem_null_slice(False)
- 3.12±0.04ms 194±2μs 0.06 array.ArrowStringArray.time_setitem_null_slice(True)
lukemanley changed the title from "PERF: df.loc / df.iloc with pyarrow-backed columns" to "PERF: setting values via df.loc / df.iloc with pyarrow-backed columns"
value = self._maybe_convert_setitem_value(value)

# fast path (GH50248)
if com.is_null_slice(key):
    if is_scalar(value) and not pa_version_under6p0:
Do we need to directly check not pa_version_under6p0? If a user has data with a pyarrow data type, I think it's implicitly assumed that pyarrow is installed?
That path hits pc.if_else, which I believe was added in 6.0. pc.if_else was the fastest way I found to create an array of all the same value. There are slower options, of course.
We recently bumped the min pyarrow version to 6.0 so I think this should be safe to remove
done
- Is this robust to the case when the value being set is not the same type as the original array? E.g. ser = pd.Series([1], dtype="int64"); ser[:] = "foo" should not work.
- Related to the above, was this the case before in setitem?
- Could you add tests for the null slice if/else branches you added?
It is robust in the sense that it raises:
import pandas as pd
ser = pd.Series([1], dtype="int64[pyarrow]")
ser[:] = "foo"
# with this PR:
ArrowInvalid: Could not convert 'foo' with type str: tried to convert to int64
# Previously:
ArrowNotImplementedError: NumPy type not implemented: unrecognized type (19) in GetNumPyTypeName
However, different indexers raise different errors:
import pandas as pd
ser = pd.Series([1], dtype="int64[pyarrow]")
ser[0] = "foo"
# ArrowNotImplementedError: NumPy type not implemented: unrecognized type (19) in GetNumPyTypeName
I suspect that refactoring setitem to not go through numpy would help with consistency and might make it cleaner.
- Could you add tests for the null slice if/else branches you added?
added
Okay, glad to see that Arrow is erroring in this case. Could you add a test for this invalid setting case?
Yeah, I think generally ArrowInvalid: Could not convert 'foo' with type str: tried to convert to int64 is a better error message, but the refactoring can be left for a separate PR.
Okay, glad to see that Arrow is erroring in this case. Could you add a test for this invalid setting case?
Added
phofl pushed a commit to phofl/pandas that referenced this pull request
perf: ArrowExtensionArray.setitem(null_slice)
gh refs
fix test
add test for setitem null slice paths
add test
remove version check
fix text