PERF: ensure_string_array with non-numpy input array by topper-123 · Pull Request #37371 · pandas-dev/pandas

Conversation


@topper-123

Currently, if the input array to ensure_string_array is not a numpy array (e.g. it is an ExtensionArray), the function constantly switches between Python-level and Cython-level code, which is slow.

This PR fixes that by ensuring we always have a numpy array up front, avoiding the round trips to Python-level code.
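Conceptually, the fix converts the input to a numpy object array once, before the element-wise conversion loop runs. A minimal Python-level sketch of that idea, mirroring the reviewed lines further down (the real change lives in the Cython function in pandas/_libs/lib.pyx; ensure_object_ndarray is an illustrative name, not part of the PR):

import numpy as np

def ensure_object_ndarray(arr):
    # Materialize a non-numpy input (e.g. an ExtensionArray) as a numpy
    # object array once.  The element-wise loop then indexes a plain
    # ndarray instead of calling the ExtensionArray's Python-level
    # __getitem__ on every iteration.
    if hasattr(arr, "to_numpy"):
        arr = arr.to_numpy()
    elif not isinstance(arr, np.ndarray):
        arr = np.array(arr, dtype="object")
    return arr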

n = 50_000
cat = pd.Categorical([*['a', pd.NA] * n])

%timeit cat.astype("string")
447 ms ± 11.9 ms per loop   # master
5.43 ms ± 80.5 µs per loop  # this PR

xref #35519, #36464.

jreback


wow! assume we have sufficient benchmarks?

pls either add a whatsnew (or since this is similar to others that you have recently done for this type of function, ok to just add this PR number there). ping on green.

jreback

for i in range(n):
    val = arr[i]
    arr_val = arr[i]


do arr_val and result_val need to be declared with cdef?

@jorisvandenbossche

Specifically for Categorical, we can actually get something (possibly) faster by special casing it to astype the categories only:

In [16]: n = 50_000
    ...: cat = pd.Categorical([*['a', pd.NA] * n])

In [17]: %timeit cat.astype("string")
313 ms ± 8.74 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [18]: %timeit cat.categories.array.astype("string").take(cat.codes, allow_fill=True)
1.86 ms ± 50.8 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

(this is also something that can work in general for other dtypes as well)
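A hedged sketch of what that special-casing could look like as a standalone helper (the function name is illustrative only; this is not code from the PR):

import pandas as pd

def categorical_astype_via_categories(cat, dtype):
    # Convert only the (typically small) set of categories, then expand
    # back to the full length with take(); codes of -1 are filled with NA.
    converted = cat.categories.array.astype(dtype)
    return converted.take(cat.codes, allow_fill=True)

n = 50_000
cat = pd.Categorical([*['a', pd.NA] * n])
result = categorical_astype_via_categories(cat, "string")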

@topper-123

The underlying problem is in the string conversion, so it should be fixed there:

In [1]: n = 50_000
   ...: cat = pd.Categorical([*['a', pd.NA] * n])

In [2]: %timeit pd.array(cat, dtype="string")  # same underlying problem
508 ms ± 9.6 ms per loop  # master
6.03 ms ± 184 µs per loop  # this PR

@topper-123

I think the failures are unrelated. I'll update later with the whatsnew and timing runs, if needed.

@topper-123

pls either add a whatsnew (or since this is similar to others that you have recently done for this type of function, ok to just add this PR number there). ping on green.

I've added this in the whatsnew, but this actually fixes a performance regression coming from #36464 (the change in pandas/_libs/lib.pyx).

@topper-123

@jreback

Specifically for Categorical, we can actually get something (possibly) faster by special casing it to astype the categories only:

In [16]: n = 50_000
   ...: cat = pd.Categorical([*['a', pd.NA] * n])

In [17]: %timeit cat.astype("string")
313 ms ± 8.74 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [18]: %timeit cat.categories.array.astype("string").take(cat.codes, allow_fill=True)
1.86 ms ± 50.8 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

(this is also something that can work in general for other dtypes as well)

possibly but this is a separate issue and orthogonal here.

jreback

from pandas.core.dtypes.common import is_extension_array_dtype
if is_extension_array_dtype(arr):


can we avoid doing this, as it makes the imports circular? You can check if it has a .to_numpy method instead, or maybe @jbrockmendel has a better way here.


Ok, I've changed it to use hasattr(arr, "to_numpy").

jreback

@jreback

jbrockmendel

if hasattr(arr, "to_numpy"):
    arr = arr.to_numpy()
elif not isinstance(arr, np.ndarray):
    arr = np.array(arr, dtype="object")


this should probably be asarray

not a big deal, but my preference would have been to type the input arr as an ndarray and handle non-ndarray in the calling function
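For context on the asarray suggestion, np.asarray differs from np.array mainly in that it avoids a copy when the input is already an ndarray of the requested dtype; a small illustration (not code from the PR):

import numpy as np

data = ["a", None, "b"]
a1 = np.array(data, dtype="object")    # always allocates a new array
a2 = np.asarray(a1, dtype="object")    # returns the existing ndarray as-is
assert a2 is a1                        # no copy was made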

kesmit13 pushed a commit to kesmit13/pandas that referenced this pull request on Nov 2, 2020
