API/DOC: Deprecate and Advise against having `np.nan` in Categoricals

This came out of work on #10729

In the documentation, we mention that

> There are two ways a np.nan can be represented in categorical data: either the value is not available (“missing value”) or np.nan is a valid category.

In the first case, NaN is not in .categories, and in the second case it is. I think we should only
recommend the first.

The option of NaNs in the categories makes the code in #10729 less pleasant than it would be otherwise. I don't think we should error if NaNs are included, just advise against it in the docs. Perhaps a deprecation, but I worry that I'm missing some obvious reason why NaNs were allowed in .categories.

@JanSchulz do you remember the initial reason for allowing either representation?

Some bad things that come out of NaN in .categories:

- Can't rely on a value of NaN mapping to a code of -1:

      In [2]: s = pd.Categorical(['a', 'b', 'a', np.nan], categories=['a', 'b', np.nan])

      In [3]: s
      Out[3]:
      [a, b, a, NaN]
      Categories (3, object): [a, b, NaN]

      In [4]: s.categories
      Out[4]: Index(['a', 'b', nan], dtype='object')

      In [5]: s.codes
      Out[5]: array([0, 1, 0, 2], dtype=int8)

- Potentially have to upcast the index type or mix strings and floats (NaN) in the .categories Index.
- Extra code if you want to generically handle Categoricals that may or may not have NaN in categories.
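For contrast, here is a minimal sketch of the first representation, where NaN is a missing value and not a category. The variable names are only illustrative; the point is that the missing entry gets a code of -1 and `.categories` stays free of NaN, so a generic missing-value check can rely on the sentinel alone.

```python
import numpy as np
import pandas as pd

# NaN is passed as a value but is NOT listed in the categories,
# so it is treated as a missing value.
c = pd.Categorical(['a', 'b', 'a', np.nan], categories=['a', 'b'])

# The categories contain only the real labels.
print(c.categories)

# The missing entry is encoded with the sentinel code -1.
print(c.codes)

# A generic missing-value mask needs nothing beyond the sentinel.
is_missing = c.codes == -1
print(is_missing)
```

With this representation the mask built from `codes == -1` should agree with `pd.isna(c)`, which is the kind of assumption that NaN in `.categories` breaks.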
