Categoricals hash consistently by jcrist · Pull Request #15143 · pandas-dev/pandas

Can you clarify which code you mean by "like we do below"? I'm assuming you mean lines 125-130, where we handle object dtype.

For categoricals, what we want to do is hash each category once, then remap those hashes back onto the values.

`vals.categories.values` can be of any dtype, so we need to recurse through `hash_array` again to get the hashes of the categories. We also know that the categories are already unique, so there is no point in calling `factorize` again; setting `categorize=False` skips that step. We can't just call `hash_object_array`, because the categories may not be objects. And because we need to do the remapping, we can't just set `vals = something` and fall through like we did before.
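A minimal sketch of the approach described above, written against the public `pandas.util.hash_array` function rather than the internals touched by this PR. The helper name `hash_categorical_sketch` and the sentinel used for missing values are assumptions made for illustration, not the code in the diff.

```python
import numpy as np
import pandas as pd
from pandas.util import hash_array


def hash_categorical_sketch(cat: pd.Categorical) -> np.ndarray:
    """Hash a Categorical by hashing each category once, then remapping.

    Hypothetical helper for illustration; not the implementation in the PR.
    """
    # The categories can be of any dtype, so go back through hash_array.
    # They are already unique, so categorize=False skips the factorize step.
    cat_hashed = hash_array(np.asarray(cat.categories), categorize=False)

    codes = np.asarray(cat.codes)
    missing = codes == -1  # pandas marks missing values with code -1

    # Assumption: missing values get a fixed sentinel hash, chosen here
    # purely for illustration.
    sentinel = np.iinfo(np.uint64).max
    if len(cat_hashed) == 0:
        return np.full(len(codes), sentinel, dtype=np.uint64)

    # Remap: look up each value's hash through its integer code.
    result = cat_hashed.take(np.where(missing, 0, codes))
    result[missing] = sentinel
    return result


values = np.array(["a", "b", "a", "c", "b"], dtype=object)
via_categorical = hash_categorical_sketch(pd.Categorical(values))
via_elementwise = hash_array(values, categorize=False)
```

Here `via_categorical` hashes only the three distinct categories and spreads the results over five values, while `via_elementwise` hashes every element directly. The two arrays should agree element for element, which is the consistency this PR is after.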

I don't think adding this extra keyword overly complicates things, and I do think it's the simplest way to do this. I may be misunderstanding your suggestion, though; if you could explain it a bit more, I might get it.