Index.difference performance · Issue #12044 · pandas-dev/pandas

I need to append several big Series to a big categorical Series.
While trying to update the categories quickly, I found that Index.difference uses Python's set, which is slow to build for LARGE inputs (I have up to 500k categories and 1.3M values).
numpy's setdiff1d is more than an order of magnitude faster (at least for a datetime64 Categorical):

import numpy as np
import pandas as pd
tmp_unique = tmp.unique()
new_cats = pd.Index(np.setdiff1d(tmp_unique[~pd.isnull(tmp_unique)], to.cat.categories))

Index.difference is not as fast:

new_cats = pd.Index(tmp_unique[~pd.isnull(tmp_unique)]).difference(to.cat.categories)
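
For anyone who wants to reproduce the comparison, here is a minimal sketch. The synthetic data, the sizes, and the helper names `new_cats_numpy` and `new_cats_index` are assumptions for illustration, not the data from this report. The relative timings will likely depend on the pandas and numpy versions installed, so no numbers are quoted here.

```python
import timeit

import numpy as np
import pandas as pd

# Synthetic stand-ins for `to` and `tmp` (sizes are arbitrary assumptions).
all_dates = pd.date_range("2000-01-01", periods=100_000, freq="min")
to = pd.Series(all_dates[:60_000]).astype("category")
tmp = pd.Series(all_dates[40_000:])
tmp.iloc[0] = pd.NaT  # include a missing value, as real data might

# Unique non-null values of the Series being appended, as a plain ndarray.
tmp_unique = np.asarray(tmp.unique())
candidates = tmp_unique[~pd.isnull(tmp_unique)]
existing = to.cat.categories


def new_cats_numpy():
    # NumPy route: set difference on the raw datetime64 values.
    return pd.Index(np.setdiff1d(candidates, existing.to_numpy()))


def new_cats_index():
    # pandas route: Index.difference, as in the slower variant above.
    return pd.Index(candidates).difference(existing)


# Both routes should agree on which categories are new.
assert new_cats_numpy().equals(new_cats_index())

for fn in (new_cats_numpy, new_cats_index):
    seconds = timeit.timeit(fn, number=5) / 5
    print(f"{fn.__name__}: {seconds * 1e3:.2f} ms per call")
```

Both helpers return a sorted index of the values not already present in `to.cat.categories`, so either result could then be passed to `to.cat.add_categories(...)` before appending.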