BUG/PERF: Series.replace with dtype="category" by lukemanley · Pull Request #49404 · pandas-dev/pandas

Refactors Categorical._replace to fix two bugs in Series(..., dtype="category").replace and to improve performance.

BUG 1: wrong result when to_replace and value overlap:

Series([1, 2, 3], dtype="category").replace({1: 2, 2: 3, 3: 4})


# main:

0    4
1    4
2    4
dtype: category
Categories (1, int64): [4]


# PR:

0    2
1    3
2    4
dtype: category
Categories (3, int64): [2, 3, 4]
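
For illustration, here is a minimal pure-Python sketch of how overlapping keys and values can collapse into a single value. It assumes the old code path behaved as if each (old, new) pair were applied one after another. The PR text does not state this, although the `# main` output above is consistent with it. The function names `replace_sequential` and `replace_simultaneous` are hypothetical and are not pandas APIs.

```python
def replace_sequential(values, mapping):
    """Apply each (old, new) pair in turn, re-scanning the updated values.

    Assumed model of the buggy behaviour: a value replaced by an earlier
    pair can be matched again by a later pair, so replacements cascade.
    """
    out = list(values)
    for old, new in mapping.items():
        out = [new if v == old else v for v in out]
    return out


def replace_simultaneous(values, mapping):
    """Look up every original value exactly once, so nothing cascades."""
    return [mapping.get(v, v) for v in values]


mapping = {1: 2, 2: 3, 3: 4}
print(replace_sequential([1, 2, 3], mapping))    # [4, 4, 4]
print(replace_simultaneous([1, 2, 3], mapping))  # [2, 3, 4]
```

The second function matches the `# PR` output: each original value is mapped once, regardless of whether its replacement is itself a key in the mapping.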

BUG 2: nullable dtype of the underlying categories is lost:

Series(["a", "b"], dtype="string").astype("category").replace("b", "c")


# main:

0    a
1    c
dtype: category
Categories (2, object): ['a', 'c']


# PR:

0    a
1    c
dtype: category
Categories (2, string): [a, c]

Perf improvements:

import pandas as pd
import numpy as np

arr = np.repeat(np.arange(1000), 1000)
ser = pd.Series(arr, dtype="category")

%timeit ser.replace(np.arange(200), 5)

681 ms ± 9.65 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)   <- main
11 ms ± 690 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)  <- PR