
tests added / passed
passes git diff upstream/master | flake8 --diff
whatsnew entry

Replace repeated list.append with np.union1d in IntIndex.make_union. make_union is used in numeric ops on sparse arrays, such as sa + sb in the benchmark below.
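As a rough illustration of the idea, the sketch below compares a pure-Python merge that builds its result with list.append against a single np.union1d call. It is not the pandas implementation: `union_with_append` is a hypothetical stand-in written for this example, and the small index arrays are made up.

```python
import numpy as np


def union_with_append(xs, ys):
    """Hypothetical stand-in: merge two sorted, duplicate-free index
    arrays one element at a time, appending to a Python list."""
    out = []
    i = j = 0
    while i < len(xs) and j < len(ys):
        if xs[i] < ys[j]:
            out.append(xs[i])
            i += 1
        elif xs[i] > ys[j]:
            out.append(ys[j])
            j += 1
        else:
            # Index present in both inputs: keep it once.
            out.append(xs[i])
            i += 1
            j += 1
    # At most one of these tails is non-empty.
    out.extend(xs[i:])
    out.extend(ys[j:])
    return np.array(out, dtype=np.int32)


# Made-up sparse indices for illustration.
a_idx = np.array([0, 2, 5, 9], dtype=np.int32)
b_idx = np.array([1, 2, 7, 9], dtype=np.int32)

merged_loop = union_with_append(a_idx, b_idx)
# np.union1d returns the sorted, unique union in one vectorized call.
merged_np = np.union1d(a_idx, b_idx)
```

Both approaches should yield the same sorted index array; the vectorized call simply avoids the per-element Python overhead.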

NOTE: It is also possible to change IntIndex.intersect to use np.intersect1d, but that does not improve performance (because the result is shorter, so fewer appends are needed in the first place).
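To make the size argument concrete, here is a minimal sketch using made-up index arrays (not taken from the benchmark): the intersection can never be longer than the shorter input, while the union is at least as long as the longer one.

```python
import numpy as np

# Made-up sparse indices for illustration.
a_idx = np.array([0, 2, 5, 9], dtype=np.int32)
b_idx = np.array([1, 2, 7, 9], dtype=np.int32)

# Sorted, unique values present in both arrays.
common = np.intersect1d(a_idx, b_idx)
# Sorted, unique values present in either array.
merged = np.union1d(a_idx, b_idx)

print(len(common), len(merged))
```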

The microbenchmark below assumes that about 90% of each array is sparse (NaN).

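import numpy as np
import pandas as pd

np.random.seed(1)
N = 1000000
# Start from all-NaN arrays so that unfilled positions are sparse.
a = np.array([np.nan] * N)
b = np.array([np.nan] * N)

# Fill at most 10% of the positions (np.unique drops repeated draws).
# Integer division keeps the sample size an int on Python 3.
indexer_a = np.unique(np.random.randint(0, N, N // 10))
indexer_b = np.unique(np.random.randint(0, N, N // 10))
a[indexer_a] = np.random.randint(0, 100, len(indexer_a))
b[indexer_b] = np.random.randint(0, 100, len(indexer_b))

sa = pd.SparseArray(a)
sb = pd.SparseArray(b)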

On current master

%timeit sa.sp_index.make_union(sb.sp_index)
10 loops, best of 3: 52.7 ms per loop

%timeit sa + sb
10 loops, best of 3: 47.8 ms per loop

After this PR

%timeit sa.sp_index.make_union(sb.sp_index)
100 loops, best of 3: 11.6 ms per loop

%timeit sa + sb
100 loops, best of 3: 15.3 ms per loop