PERF: ArrowExtensionArray.to_numpy by lukemanley · Pull Request #49973 · pandas-dev/pandas
Conversation
- Tests added and passed if fixing a bug or adding a new feature
- All code checks passed.
- Added type annotations to new arguments/methods/functions.
- Added an entry in the latest doc/source/whatsnew/v2.0.0.rst file if fixing a bug or adding a new feature.
Moved and refactored to_numpy from ArrowStringArray to ArrowExtensionArray to improve performance for other pyarrow types.
before after ratio
<main> <arrow-ea-numpy>
- 63.3±0.5ms 3.17±0.2ms 0.05 array.ArrowExtensionArray.time_to_numpy('float64[pyarrow]', True)
- 63.8±1ms 3.10±0.2ms 0.05 array.ArrowExtensionArray.time_to_numpy('int64[pyarrow]', True)
- 62.1±1ms 1.42±0.02ms 0.02 array.ArrowExtensionArray.time_to_numpy('boolean[pyarrow]', True)
- 31.2±1ms 122±0.3μs 0.00 array.ArrowExtensionArray.time_to_numpy('boolean[pyarrow]', False)
- 31.0±0.4ms 33.9±0.8μs 0.00 array.ArrowExtensionArray.time_to_numpy('int64[pyarrow]', False)
- 31.7±1ms 30.6±1μs 0.00 array.ArrowExtensionArray.time_to_numpy('float64[pyarrow]', False)
This results in perf improvements for a number of ops that go through numpy:
import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.randn(10**5, 10), dtype="float64[pyarrow]")
%timeit df.corr()
402 ms ± 17.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each) <- main
26 ms ± 1.38 ms per loop (mean ± std. dev. of 7 runs, 10 loops each) <- PR
%timeit df.rolling(100).sum()
378 ms ± 9.79 ms per loop (mean ± std. dev. of 7 runs, 1 loop each) <- main
15.2 ms ± 503 µs per loop (mean ± std. dev. of 7 runs, 100 loops each) <- PR
pa_type = self._data.type
if pa.types.is_timestamp(pa_type) or pa.types.is_duration(pa_type):
    result = np.array(self.tolist(), dtype=np.object_)
I've looked at it and I think it would only be usable for a subset of cases. The inability to specify dtype and na_value makes it difficult for general usage in this interface. Also, we need to be careful with things like timestamps, as it will strip off the timezone.
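For illustration, a minimal sketch of the dtype/na_value flexibility being discussed, using the public ExtensionArray.to_numpy interface (a masked Int64 array here; the values are just examples):

```python
import numpy as np
import pandas as pd

arr = pd.array([1, None], dtype="Int64")

# dtype and na_value let the caller choose the numpy representation;
# pyarrow's Array.to_numpy has no equivalent knobs.
out = arr.to_numpy(dtype="float64", na_value=np.nan)
```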
Gotcha. Yeah, where EA.to_numpy() has more functionality (dtype and na_value) than pyarrow's to_numpy(), I agree we need our custom behavior. But otherwise I would lean towards using self._data.to_numpy(). IMO, at this stage of the ArrowExtensionArray it's better to let pyarrow behavior pass through (and potentially improve) rather than modifying its behavior and possibly diverging.
Agreed with trying to limit divergence from pyarrow.
I was attempting to maintain status quo behavior with better perf here. The current default behavior of ArrowExtensionArray.to_numpy is mostly consistent with other masked/nullable pandas dtypes. If it's preferred to let the pyarrow behavior flow through, the expected behavior in a number of tests would need to be changed (or xfailed).
Here is an example where behavior differs between the current ArrowExtensionArray.to_numpy and pyarrow's to_numpy:
import pandas as pd
data = [1, None]
pd.array(data, dtype="Int64").to_numpy() # -> array([1, <NA>], dtype=object)
pd.array(data, dtype="int64[pyarrow]").to_numpy() # -> array([1, <NA>], dtype=object)
pd.array(data, dtype="int64[pyarrow]")._data.to_numpy() # -> array([1.0, nan])
Let me know if you have thoughts or suggestions. Happy to modify or close this if you think it needs more consideration.
I'll also note that np.array(pa_array) goes through pyarrow's to_numpy method. pyarrow implements pa.Array.__array__, which is a small wrapper around the to_numpy method. So these should be equivalent and both go through pa.Array.to_numpy:
np.array(pa_array)
pa_array.to_numpy(zero_copy_only=False, writable=True)
Sounds good. If we go with that change and get closer to pyarrow behavior, is it ok that (by default) to_numpy will no longer represent missing values with pd.NA (and cast to object)? Thinking specifically about string[pyarrow], which would behave differently than string[python] if this change were made.
In [2]: pd.array(["foo", pd.NA], dtype="string[python]").to_numpy()
Out[2]: array(['foo', <NA>], dtype=object)
In [3]: pd.array(["foo", pd.NA], dtype="string[pyarrow]").to_numpy()
Out[3]: array(['foo', <NA>], dtype=object)
In [4]: pd.array(["foo", pd.NA], dtype="string[pyarrow]")._data.to_numpy()
Out[4]: array(['foo', None], dtype=object)
I just pushed the changes discussed above but left ArrowStringArray behavior as is with a TODO note (see my comment directly above regarding divergence from string[python]).
Agreed it would be nice if duration and timestamp followed the same default behavior. When I tried it, there were a number of failing tests due to loss of timezone, non-ns units, etc. I'd suggest we leave that for a follow-up.
Avoiding the cast to object further improves the timings in the OP.
is it ok that (by default) to_numpy will no longer represent missing values with pd.NA (and cast to object)?
Oh sorry, I think one aspect we want to make consistent is that missing values from to_numpy get returned as pd.NA (if na_value isn't specified), so object dtype would be necessary, i.e. we should diverge from pyarrow (unfortunately) when the data contains NA.
Agreed with tackling the duration and timestamp stuff in a followup
Got it. Updated to use pd.NA by default for missing values.
@@ -1406,3 +1406,20 @@ def test_astype_from_non_pyarrow(data):
    assert not isinstance(pd_array.dtype, ArrowDtype)
    assert isinstance(result.dtype, ArrowDtype)
    tm.assert_extension_array_equal(result, data)

def test_to_numpy_with_defaults(data, request):
Nit: Looks like request isn't needed anymore.
Removed it, thanks.
One comment, otherwise LGTM.
LGTM can merge on green
phofl pushed a commit to phofl/pandas that referenced this pull request