BUG/PERF: Series.combine_first converting int64 to float64 by lukemanley · Pull Request #51777 · pandas-dev/pandas

If the fix for this issue is to cast the values back to int after they have been upcast to float, that will not work correctly for some ints.
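To illustrate the concern, here is a minimal sketch (not taken from the PR) of why an int64 -> float64 -> int64 round trip can lose information: float64 has a 53-bit significand, so integers above 2**53 are not all exactly representable. The variable names are made up for illustration.

```python
import numpy as np

# Two int64 values: 2**53 is exactly representable in float64,
# 2**53 + 1 is not (float64 has a 53-bit significand).
big = np.array([2**53, 2**53 + 1], dtype="int64")

# Cast to float64 and back, as a "convert back to int" fix would do.
round_tripped = big.astype("float64").astype("int64")

# True wherever the round trip changed the value.
lost = big != round_tripped
print(lost)
```

Any approach that passes through float64 would have to special-case such values, which is why avoiding the upcast altogether is the safer route.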

Great catch, thanks! I've updated the PR to handle this and updated the test.
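For readers skimming the thread, a small sketch of the operation being discussed, using made-up data rather than the PR's actual test: `combine_first` takes values from the calling Series and fills the remaining labels from the other one, over the union of both indexes. The resulting dtype is printed without an expected value because it depends on the pandas version in use.

```python
import pandas as pd

# Illustrative data only: s1 lacks label 3, s2 lacks label 0.
s1 = pd.Series([1, 2, 3], index=[0, 1, 2], dtype="int64")
s2 = pd.Series([10, 20, 30], index=[1, 2, 3], dtype="int64")

# Values come from s1 where it has a label, otherwise from s2.
result = s1.combine_first(s2)

print(result.tolist())
# Whether this is int64 or float64 depends on the pandas version.
print(result.dtype)
```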

This also provides a nice perf improvement:

import pandas as pd
import numpy as np

N = 1_000_000

s1 = pd.Series(np.random.randint(0, N, N), dtype="int64")
s1 = s1.iloc[:-5]

s2 = pd.Series(np.random.randint(0, N, N), dtype="int64")
s2 = s2.iloc[5:]

%timeit s1.combine_first(s2)

# 59.4 ms ± 4.15 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)     -> main
# 1.37 ms ± 20.9 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)  -> PR

Note that the perf improvement is not specific to the integer case. Here is float64:

import pandas as pd
import numpy as np

N = 1_000_000

s1 = pd.Series(np.random.randn(N), dtype="float64")
s1 = s1.iloc[:-5]

s2 = pd.Series(np.random.randn(N), dtype="float64")
s2 = s2.iloc[5:]

%timeit s1.combine_first(s2)

# 48.4 ms ± 169 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)      -> main
# 1.64 ms ± 7.26 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)  -> PR
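The `%timeit` magic above only works in IPython/Jupyter. As a rough sketch for reproducing the comparison in a plain Python script, the standard-library `timeit` module can be used instead. The smaller `N` and the repeat counts here are assumptions chosen to keep the run short, so the absolute numbers will not match the ones quoted above.

```python
import timeit

import numpy as np
import pandas as pd

# Smaller than the 1_000_000 used above, purely to keep this quick.
N = 100_000

# Same shape of input as the benchmarks above: two mostly overlapping
# Series, with s1 missing the last 5 labels and s2 missing the first 5.
s1 = pd.Series(np.random.randint(0, N, N), dtype="int64").iloc[:-5]
s2 = pd.Series(np.random.randint(0, N, N), dtype="int64").iloc[5:]

result = s1.combine_first(s2)

# Best of 3 repeats, 5 calls each, reported per call in milliseconds.
best = min(timeit.repeat(lambda: s1.combine_first(s2), number=5, repeat=3)) / 5
print(f"{best * 1e3:.2f} ms per call")
```

Running this once on main and once on the PR branch gives a comparable pair of timings.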