REF/PERF: ArrowExtensionArray.setitem by lukemanley · Pull Request #50632 · pandas-dev/pandas (original) (raw)
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service andprivacy statement. We’ll occasionally send you account related emails.
Already on GitHub?Sign in to your account
Conversation9 Commits8 Checks0 Files changed
Conversation
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.Learn more about bidirectional Unicode characters
[ Show hidden characters]({{ revealButtonHref }})
- Tests added and passed if fixing a bug or adding a new feature
- All code checks passed.
- Added type annotations to new arguments/methods/functions.
- Added an entry in the latest
doc/source/whatsnew/v2.0.0.rst
file if fixing a bug or adding a new feature.
Simplifies and improves performance of ArrowExtensionArray.__setitem__
via more use of pyarrow compute functions and less back and forth through numpy.
ASVs:
before after ratio
[059b2c18] [91049c37]
<main> <arrow-setitem>
- 382±2μs 336±1μs 0.88 array.ArrowStringArray.time_setitem_list(False)
- 900±5μs 566±10μs 0.63 array.ArrowStringArray.time_setitem_list(True)
- 14.0±0.3ms 6.12±0.06ms 0.44 array.ArrowStringArray.time_setitem(False)
- 345±6μs 99.6±20μs 0.29 array.ArrowStringArray.time_setitem_slice(False)
- 24.5±0.2ms 6.12±0.05ms 0.25 array.ArrowStringArray.time_setitem(True)
- 5.35±0.06ms 304±3μs 0.06 array.ArrowStringArray.time_setitem_slice(True)
Not for this PR but because I'm cranky: the fact that we have to implement a kludgy __setitem__
sucks. ideally pyarrow would implement it; failing that better to set into each of the elements of self._data.buffers()
(which besides being a PITA im not sure is even possible in mask-less cases)
Not for this PR but because I'm cranky: the fact that we have to implement a kludgy
__setitem__
sucks. ideally pyarrow would implement it; failing that better to set into each of the elements ofself._data.buffers()
(which besides being a PITA im not sure is even possible in mask-less cases)
This was a (failed?) attempt to make it a little less kludgy :). Agreed its still kludgy.
Part of the complexity is supporting older versions of pyarrow where the pyarrow.compute functions have bugs we need to work around. We could remove the fast paths if simplicity is preferred.
I'm not familiar with the buffers but could take a look.
I'm not familiar with the buffers but could take a look.
Definitely don't put time into this on my account. I've got a bug in my bonnet about preserving views (xref #45419) that becomes less important with CoW. Plus as mentioned, im not confident that the buffers() approach is even feasible.
@@ -172,7 +172,6 @@ def test_setitem(multiple_chunks, key, value, expected): |
---|
result[key] = value |
tm.assert_equal(result, expected) |
assert result._data.num_chunks == expected._data.num_chunks |
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So this no longer holds?
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The existing implementation iterated through the chunks and set values chunk by chunk. This implementation passes the entire ChunkedArray
to pyarrow's compute functions. At the moment it looks like pyarrow.compute.if_else
combines chunks (but still returns a ChunkedArray with one chunk) whereas pyarrow.compute.replace_with_mask
maintains the chunking layout of the input. I'm not sure if that behavior applies for all cases or based on inputs. Let me know if you think pandas should ensure chunking layout remains unchanged.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Generally I think we should try to maintain the chunking layout as much as possible, but it's more of an implementation detail and if the pyarrow compute functions don't necessarily maintain the chunking layout I suppose this is fine
if isinstance(data, pa.Array): |
data = pa.chunked_array([data]) |
self._data = data |
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Are we guaranteed here that self._data.type
matches self.dtype.pyarrow_dtype
?
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, I believe so. I addressed the TODO in ArrowExtensionArray._maybe_convert_setitem_value
so that it now boxes the setitem values will raise if the replacement values cannot be cast to the original self._data.type.