Index.difference performance · Issue #12044 · pandas-dev/pandas
I need to append several large Series to a large categorical Series.
While trying to update the categories quickly, I found that Index.difference uses Python's set, which is slow to build for large inputs (I have up to 500k categories and 1.3M values).
numpy's setdiff1d is more than an order of magnitude faster (measured on a datetime64 Categorical):
import numpy as np  # the pd.np alias was removed in pandas 2.0
import pandas as pd
tmp_unique = tmp.unique()
new_cats = pd.Index(np.setdiff1d(tmp_unique[~pd.isnull(tmp_unique)], to.cat.categories))
The slower version, using Index.difference:
new_cats = pd.Index(tmp_unique[~pd.isnull(tmp_unique)]).difference(to.cat.categories)
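
For anyone who wants to reproduce this, here is a minimal self-contained sketch of the two approaches on small made-up data. The helper names (new_cats_setdiff1d, new_cats_difference) and the sample values are hypothetical, not taken from my real dataset; `to` and `tmp` just mirror the names used above. It only checks that both variants return the same new categories and makes no claim about timings.

```python
import numpy as np
import pandas as pd


def new_cats_setdiff1d(values: pd.Series, categories: pd.Index) -> pd.Index:
    """Non-null unique values missing from `categories`, via numpy.setdiff1d."""
    uniq = values.dropna().unique()
    # np.asarray makes the inputs plain ndarrays regardless of pandas version.
    return pd.Index(np.setdiff1d(np.asarray(uniq), np.asarray(categories)))


def new_cats_difference(values: pd.Series, categories: pd.Index) -> pd.Index:
    """Same result, computed with Index.difference."""
    uniq = values.dropna().unique()
    return pd.Index(uniq).difference(categories)


# Hypothetical sample data: an existing datetime64 categorical and new values.
to = pd.Series(pd.to_datetime(["2016-01-01", "2016-01-02"])).astype("category")
tmp = pd.Series(pd.to_datetime(["2016-01-02", "2016-01-03", None, "2016-01-03"]))

fast = new_cats_setdiff1d(tmp, to.cat.categories)
slow = new_cats_difference(tmp, to.cat.categories)

# Both variants should agree on which categories are new.
print(list(fast) == list(slow))
```

To compare speed on realistic sizes, each helper can be wrapped in timeit (or %timeit in IPython) with inputs of the scale mentioned above.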