PERF: ArrowExtensionArray.to_numpy by lukemanley · Pull Request #49973 · pandas-dev/pandas


Conversation (18 comments, 10 commits)

lukemanley

Moved and refactored to_numpy from ArrowStringArray to ArrowExtensionArray to improve performance for other pyarrow types.

       before           after         ratio
     <main>           <arrow-ea-numpy>
-      63.3±0.5ms       3.17±0.2ms     0.05  array.ArrowExtensionArray.time_to_numpy('float64[pyarrow]', True)
-        63.8±1ms       3.10±0.2ms     0.05  array.ArrowExtensionArray.time_to_numpy('int64[pyarrow]', True)
-        62.1±1ms      1.42±0.02ms     0.02  array.ArrowExtensionArray.time_to_numpy('boolean[pyarrow]', True)
-        31.2±1ms        122±0.3μs     0.00  array.ArrowExtensionArray.time_to_numpy('boolean[pyarrow]', False)
-      31.0±0.4ms       33.9±0.8μs     0.00  array.ArrowExtensionArray.time_to_numpy('int64[pyarrow]', False)
-        31.7±1ms         30.6±1μs     0.00  array.ArrowExtensionArray.time_to_numpy('float64[pyarrow]', False)

This results in perf improvements for a number of ops that go through numpy:

import pandas as pd
import numpy as np

df = pd.DataFrame(np.random.randn(10**5, 10), dtype="float64[pyarrow]")

%timeit df.corr()
402 ms ± 17.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)     <- main
26 ms ± 1.38 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)    <- PR

%timeit df.rolling(100).sum()
378 ms ± 9.79 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)     <- main
15.2 ms ± 503 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)  <- PR


mroeschke

pa_type = self._data.type
if pa.types.is_timestamp(pa_type) or pa.types.is_duration(pa_type):
result = np.array(self.tolist(), dtype=np.object_)


I've looked at it and I think it would only be usable for a subset of cases. The inability to specify dtype and na_value makes it difficult for general usage in this interface. Also, we need to be careful with things like timestamps, as it will strip off the timezone.


Gotcha. Yeah, where EA.to_numpy() has more functionality (dtype and na_value) than pyarrow's to_numpy(), I agree we need our custom behavior.

But otherwise I would lean towards using self._data.to_numpy(). IMO, at this stage of the ArrowExtensionArray it's better to let pyarrow behavior pass through (and potentially improve) rather than modifying its behavior and possibly diverging.


Agreed with trying to limit divergence from pyarrow.

I was attempting to maintain the status quo behavior with better performance here. The current default behavior of ArrowExtensionArray.to_numpy is mostly consistent with the other masked/nullable pandas dtypes. If it's preferred to let the pyarrow behavior flow through, the expected behavior in a number of tests would need to be changed (or xfailed).

Here is an example where behavior differs between the current ArrowExtensionArray.to_numpy and pyarrow's to_numpy:

import pandas as pd

data = [1, None]

pd.array(data, dtype="Int64").to_numpy()                    # -> array([1, <NA>], dtype=object)
pd.array(data, dtype="int64[pyarrow]").to_numpy()           # -> array([1, <NA>], dtype=object)
pd.array(data, dtype="int64[pyarrow]")._data.to_numpy()     # -> array([1.0, nan])

Let me know if you have thoughts or suggestions. Happy to modify or close this if you think it needs more consideration.


I'll also note that np.array(pa_array) goes through pyarrow's to_numpy method: pyarrow implements pa.Array.__array__, which is a small wrapper around the to_numpy method.

So these should be equivalent and both go through pa.Array.to_numpy:

np.array(pa_array)
pa_array.to_numpy(zero_copy_only=False, writable=True)


Sounds good. If we go with that change and get closer to pyarrow behavior, is it ok that (by default) to_numpy will no longer represent missing values with pd.NA (and cast to object)? Thinking specifically about string[pyarrow], which would behave differently than string[python] if this change were made.

In [2]: pd.array(["foo", pd.NA], dtype="string[python]").to_numpy()        
Out[2]: array(['foo', <NA>], dtype=object)

In [3]: pd.array(["foo", pd.NA], dtype="string[pyarrow]").to_numpy()        
Out[3]: array(['foo', <NA>], dtype=object)

In [4]: pd.array(["foo", pd.NA], dtype="string[pyarrow]")._data.to_numpy()
Out[4]: array(['foo', None], dtype=object)


I just pushed the changes discussed above but left ArrowStringArray behavior as is with a TODO note. (see my comment directly above regarding divergence from string[python]).

Agreed it would be nice if duration and timestamp followed the same default behavior. When I tried it, there were a number of failing tests due to loss of timezone, non-ns units, etc. I'd suggest we leave that for a follow-up.

Avoiding the cast to object further improves the timings in the OP.


is it ok that (by default) to_numpy will no longer represent missing values with pd.NA (and cast to object)?

Oh sorry, I think one aspect we want to keep consistent is that missing values from to_numpy get returned as pd.NA (if na_value isn't specified), so object dtype would be necessary. I.e. we should diverge from pyarrow (unfortunately) when the data contains NA.


Agreed with tackling the duration and timestamp stuff in a follow-up.


Got it. Updated to use pd.NA by default for missing values.


mroeschke

@@ -1406,3 +1406,20 @@ def test_astype_from_non_pyarrow(data):
assert not isinstance(pd_array.dtype, ArrowDtype)
assert isinstance(result.dtype, ArrowDtype)
tm.assert_extension_array_equal(result, data)
def test_to_numpy_with_defaults(data, request):


Nit: Looks like request isn't needed anymore


Removed it, thanks.

mroeschke


One comment, otherwise LGTM.


mroeschke


LGTM, can merge on green.

phofl pushed a commit to phofl/pandas that referenced this pull request on Dec 17, 2022.
