Categoricals hash consistently by jcrist · Pull Request #15143 · pandas-dev/pandas
Can you explain where you mean "like we do below"? I'm assuming you mean lines 125-130, where we handle object dtype.
For categoricals, what we want to do is:
- Get a hash for the `categories`. This should match what `hash_pandas_object(series, index=False)` would return for the un-categorized data. Meaning `hash_pandas_object(object_series) == hash_pandas_object(object_series.astype('category'))`.
- Remap the category hashes based on the codes.
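
To make the target behaviour concrete, here is a minimal check of that invariant. It assumes a pandas version that exposes `hash_pandas_object` publicly under `pandas.util`; the sample data is made up for illustration.

```python
import pandas as pd
from pandas.util import hash_pandas_object

# Plain object-dtype data and the same data stored as a categorical.
object_series = pd.Series(["a", "b", "a", "c"], dtype=object)
categorical_series = object_series.astype("category")

# index=False so that only the values contribute to the hashes.
plain_hashes = hash_pandas_object(object_series, index=False)
categorical_hashes = hash_pandas_object(categorical_series, index=False)

# With consistent categorical hashing, the two results agree row for row.
hashes_match = bool((plain_hashes == categorical_hashes).all())
```

If `hashes_match` is `False`, the categorical path is hashing something other than the underlying values (for example the codes themselves).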
`vals.categories.values` can be of any dtype, so we need to recurse through `hash_array` again to get the hashes of the categories. However, we also know that the categories are already unique, so we don't want to call `factorize` again. As such, we set `categorize=False` to skip that. We can't just call `hash_object_array`, as the categories may not be objects. And we need to do the remapping, so we can't just set `vals = something` and fall through like we did before.
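
A rough, library-free sketch of the hash-then-remap idea, in case it helps. This is not the code in the PR: `hash_value`, `hash_categorical`, and the `missing_hash` sentinel are hypothetical names invented for illustration, and SHA-256 stands in for whatever per-element hash is really used.

```python
import hashlib


def hash_value(value):
    """Hypothetical per-element hash: a 64-bit int taken from SHA-256."""
    digest = hashlib.sha256(repr(value).encode("utf8")).digest()
    return int.from_bytes(digest[:8], "little")


def hash_categorical(categories, codes, missing_hash=2**64 - 1):
    """Hash each unique category once, then remap through the codes."""
    # The categories are already unique, so there is no second
    # factorize-style pass: one hash per category.
    category_hashes = [hash_value(c) for c in categories]
    # Each row takes the hash of the category its code points at.
    # A code of -1 marks a missing value; missing_hash is a sentinel
    # chosen for this sketch.
    return [category_hashes[c] if c >= 0 else missing_hash for c in codes]


values = ["a", "b", "a", "c"]
categories = ["a", "b", "c"]
codes = [0, 1, 0, 2]

direct = [hash_value(v) for v in values]          # hash every row
remapped = hash_categorical(categories, codes)    # hash categories, remap
```

Here `direct == remapped`, which is the consistency property we're after. The remap step is also why we can't simply reassign `vals` and fall through: the hashes are computed on the categories, not on the rows.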
I don't think adding this extra keyword overly complicates things, and do think this is the simplest way to do this. I may not be understanding what you're trying to suggest here - perhaps if you could explain a bit better I might get it.