PERF: ensure_string_array with non-numpy input array by topper-123 · Pull Request #37371 · pandas-dev/pandas

Conversation


@topper-123

Currently, if the input array to ensure_string_array is not a numpy array (e.g. it is an ExtensionArray), the function constantly switches between Python-level and Cython-level code, which is slow.

This PR fixes that by ensuring we always have a numpy array up front, avoiding the round trips to Python-level code.
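Conceptually, the fix converts the input to a numpy object array once, before the element-wise conversion loop runs. A minimal Python-level sketch of that idea, mirroring the reviewed lines further down (the real change lives in the Cython function in pandas/_libs/lib.pyx; ensure_object_ndarray is an illustrative name, not part of the PR):

import numpy as np

def ensure_object_ndarray(arr):
    # Materialize a non-numpy input (e.g. an ExtensionArray) as a numpy
    # object array once.  The element-wise loop then indexes a plain
    # ndarray instead of calling the ExtensionArray's Python-level
    # __getitem__ on every iteration.
    if hasattr(arr, "to_numpy"):
        arr = arr.to_numpy()
    elif not isinstance(arr, np.ndarray):
        arr = np.array(arr, dtype="object")
    return arr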

n = 50_000
cat = pd.Categorical([*['a', pd.NA] * n])

%timeit cat.astype("string")
447 ms ± 11.9 ms per loop   # master
5.43 ms ± 80.5 µs per loop  # this PR

xref #35519, #36464.

jreback


wow! assume we have sufficient benchmarks?

pls either add a whatsnew (or since this is similar to others that you have recently done for this type of function, ok to just add this PR number there). ping on green.

jreback

for i in range(n):
    val = arr[i]
    arr_val = arr[i]


do arr_val and result_val need to be declared with cdef?

@jorisvandenbossche

Specifically for Categorical, we can actually get something (possibly) faster by special casing it to astype the categories only:

In [16]: n = 50_000
    ...: cat = pd.Categorical([*['a', pd.NA] * n])

In [17]: %timeit cat.astype("string")
313 ms ± 8.74 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [18]: %timeit cat.categories.array.astype("string").take(cat.codes, allow_fill=True)
1.86 ms ± 50.8 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

(this is also something that can work in general for other dtypes as well)
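A hedged sketch of what that special-casing could look like as a standalone helper (the function name is illustrative only; this is not code from the PR):

import pandas as pd

def categorical_astype_via_categories(cat, dtype):
    # Convert only the (typically small) set of categories, then expand
    # back to the full length with take(); codes of -1 are filled with NA.
    converted = cat.categories.array.astype(dtype)
    return converted.take(cat.codes, allow_fill=True)

n = 50_000
cat = pd.Categorical([*['a', pd.NA] * n])
result = categorical_astype_via_categories(cat, "string")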

@topper-123

The underlying problem is in the string conversion, so it should be fixed there:

In [1]: n = 50_000
   ...: cat = pd.Categorical([*['a', pd.NA] * n])

In [2]: %timeit pd.array(cat, dtype="string")  # same underlying problem
508 ms ± 9.6 ms per loop  # master
6.03 ms ± 184 µs per loop  # this PR

@topper-123

I think the failures are unrelated. I'll update later with the whatsnew and timing runs, if needed.

@topper-123

pls either add a whatsnew (or since this is similar to others that you have recently done for this type of function, ok to just add this PR number there). ping on green.

I've added this in the whatsnew, but this actually fixes a performance regression coming from #36464 (the change in pandas/_libs/lib.pyx).

@topper-123

@jreback

Specifically for Categorical, we can actually get something (possibly) faster by special casing it to astype the categories only:

In [16]: n = 50_000
   ...: cat = pd.Categorical([*['a', pd.NA] * n])

In [17]: %timeit cat.astype("string")
313 ms ± 8.74 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [18]: %timeit cat.categories.array.astype("string").take(cat.codes, allow_fill=True)
1.86 ms ± 50.8 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

(this is also something that can work in general for other dtypes as well)

possibly but this is a separate issue and orthogonal here.

jreback

from pandas.core.dtypes.common import is_extension_array_dtype
if is_extension_array_dtype(arr):


can we avoid doing this, as it makes the imports circular? You can check if it has a .to_numpy method instead, or maybe @jbrockmendel has a better way here.


Ok, I've changed it to use hasattr(arr, "to_numpy").

jreback

@jreback

jbrockmendel

if hasattr(arr, "to_numpy"):
    arr = arr.to_numpy()
elif not isinstance(arr, np.ndarray):
    arr = np.array(arr, dtype="object")


this should probably be asarray

not a big deal, but my preference would have been to type the input arr as an ndarray and handle non-ndarray in the calling function
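For context on the asarray suggestion, np.asarray differs from np.array mainly in that it avoids a copy when the input is already an ndarray of the requested dtype; a small illustration (not code from the PR):

import numpy as np

data = ["a", None, "b"]
a1 = np.array(data, dtype="object")    # always allocates a new array
a2 = np.asarray(a1, dtype="object")    # returns the existing ndarray as-is
assert a2 is a1                        # no copy was made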

kesmit13 pushed a commit to kesmit13/pandas that referenced this pull request on Nov 2, 2020
