BUG/PERF: Series.replace with dtype="category" by lukemanley · Pull Request #49404 · pandas-dev/pandas
- Tests added and passed if fixing a bug or adding a new feature
- All code checks passed.
- Added type annotations to new arguments/methods/functions.
- Added an entry in the latest doc/source/whatsnew/v2.0.0.rst file if fixing a bug or adding a new feature.
Refactor of Categorical._replace to fix a few bugs with Series(..., dtype="category").replace and improve performance.
BUG 1: overlap between to_replace and value:
Series([1, 2, 3], dtype="category").replace({1:2, 2:3, 3:4})
# main:
0 4
1 4
2 4
dtype: category
Categories (1, int64): [4]
# PR:
0 2
1 3
2 4
dtype: category
Categories (3, int64): [2, 3, 4]
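The main output above is consistent with replacements being applied one after another, so that a value produced by an earlier replacement gets picked up by a later one. The sketch below illustrates that difference in plain Python. It is not the pandas implementation: replace_sequential and replace_single_pass are hypothetical helper names used only for illustration.

```python
# Illustrative only: these helpers are hypothetical and are not pandas internals.

def replace_sequential(values, mapping):
    """Apply each old -> new pair in turn; earlier results can be replaced again."""
    out = list(values)
    for old, new in mapping.items():
        out = [new if v == old else v for v in out]
    return out


def replace_single_pass(values, mapping):
    """Look up each original value exactly once, so replacements never cascade."""
    return [mapping.get(v, v) for v in values]


values = [1, 2, 3]
mapping = {1: 2, 2: 3, 3: 4}

print(replace_sequential(values, mapping))   # [4, 4, 4], matches the main output
print(replace_single_pass(values, mapping))  # [2, 3, 4], matches the PR output
```

With a mapping whose keys and values overlap, only the single-pass lookup gives the result a user would expect from a dict-style replace.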
BUG 2: losing nullable dtypes of underlying categories:
Series(["a", "b"], dtype="string").astype("category").replace("b", "c")
# main:
0 a
1 c
dtype: category
Categories (2, object): ['a', 'c']
# PR:
0 a
1 c
dtype: category
Categories (2, string): [a, c]
Perf improvements:
import pandas as pd
import numpy as np
arr = np.repeat(np.arange(1000), 1000)
ser = pd.Series(arr, dtype="category")
%timeit ser.replace(np.arange(200), 5)
681 ms ± 9.65 ms per loop (mean ± std. dev. of 7 runs, 1 loop each) <- main
11 ms ± 690 µs per loop (mean ± std. dev. of 7 runs, 100 loops each) <- PR