
Tests added and passed if fixing a bug or adding a new feature.
All code checks passed.
Added type annotations to new arguments/methods/functions.
Added an entry in the latest doc/source/whatsnew/v2.0.0.rst file if fixing a bug or adding a new feature.

Refactors Categorical._replace to fix two bugs in Series(..., dtype="category").replace and to improve performance.

BUG 1: wrong result when to_replace and value overlap:

Series([1, 2, 3], dtype="category").replace({1: 2, 2: 3, 3: 4})


# main:

0    4
1    4
2    4
dtype: category
Categories (1, int64): [4]


# PR:

0    2
1    3
2    4
dtype: category
Categories (3, int64): [2, 3, 4]
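
The difference between the two outputs above is what you would expect if the replacements are chained rather than applied in one pass. The following is a plain-Python sketch of that distinction only. It is not the code in Categorical._replace, and the helper names replace_sequential and replace_simultaneous are hypothetical, chosen for illustration.

```python
def replace_sequential(values, mapping):
    """Apply each (old -> new) pair in turn; later pairs see earlier results."""
    out = list(values)
    for old, new in mapping.items():
        out = [new if v == old else v for v in out]
    return out


def replace_simultaneous(values, mapping):
    """Look up each original value exactly once, so replacements never chain."""
    return [mapping.get(v, v) for v in values]


mapping = {1: 2, 2: 3, 3: 4}
print(replace_sequential([1, 2, 3], mapping))    # [4, 4, 4]
print(replace_simultaneous([1, 2, 3], mapping))  # [2, 3, 4]
```

With chaining, 1 becomes 2, which then becomes 3, which then becomes 4, so every element ends up as 4. Looking up each original value once gives the result shown under "PR".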

BUG 2: the nullable dtype of the underlying categories is lost:

Series(["a", "b"], dtype="string").astype("category").replace("b", "c")


# main:

0    a
1    c
dtype: category
Categories (2, object): ['a', 'c']


# PR:

0    a
1    c
dtype: category
Categories (2, string): [a, c]

Performance improvements:

import pandas as pd
import numpy as np

arr = np.repeat(np.arange(1000), 1000)
ser = pd.Series(arr, dtype="category")

%timeit ser.replace(np.arange(200), 5)

681 ms ± 9.65 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)   <- main
11 ms ± 690 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)  <- PR
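
One way a speedup of this size can arise is by doing the replacement on the categories (1000 values in the benchmark above) instead of on every element (1,000,000 values), and then remapping the integer codes. The sketch below illustrates that general idea in plain Python under stated assumptions. It is not the implementation in this PR, and replace_in_categorical is a hypothetical helper. It assumes the pandas convention that a code of -1 marks a missing value.

```python
def replace_in_categorical(codes, categories, mapping):
    """Replace values in a (codes, categories) pair by touching only the categories.

    codes[i] is the position of element i's value in `categories`,
    or -1 for a missing value. Returns a new (codes, categories) pair.
    """
    # Step 1: map every category once (cost depends on the number of categories).
    mapped = [mapping.get(cat, cat) for cat in categories]

    # Step 2: deduplicate in first-seen order, since several old
    # categories may now share one value.
    new_categories = list(dict.fromkeys(mapped))
    position = {cat: i for i, cat in enumerate(new_categories)}

    # Step 3: build an old-code -> new-code table, then do one lookup per element.
    recode = [position[cat] for cat in mapped]
    new_codes = [recode[c] if c >= 0 else -1 for c in codes]
    return new_codes, new_categories


codes = [0, 1, 2, 0, 1, 2]
categories = [10, 20, 30]
new_codes, new_categories = replace_in_categorical(codes, categories, {10: 20})
print(new_categories)  # [20, 30]
print(new_codes)       # [0, 0, 1, 0, 0, 1]
```

Because each category is looked up once in step 1, overlapping keys and values in the mapping do not chain in this sketch either.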
