API/DOC: Deprecate and Advise against having np.nan in Categoricals · Issue #10748 · pandas-dev/pandas

This came out of work on #10729.

In the documentation, we mention that:

There are two ways a np.nan can be represented in categorical data: either the value is not available (“missing value”) or np.nan is a valid category.

In the first case, NaN is not in .categories; in the second case it is. I think we should only recommend the first.
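For reference, a minimal sketch of the first representation, where NaN is a missing value and is not listed in .categories. The variable name `c` is only illustrative, and `pd.isna` assumes a reasonably recent pandas; the `-1` code for missing entries is how Categorical marks values that are not in .categories.

```python
import numpy as np
import pandas as pd

# NaN appears in the values but is NOT listed as a category.
c = pd.Categorical(['a', 'b', 'a', np.nan], categories=['a', 'b'])

# Only the real labels are categories.
categories = list(c.categories)   # ['a', 'b']

# The missing entry gets the sentinel code -1 instead of its own code.
codes = c.codes.tolist()          # [0, 1, 0, -1]

# Missing-value detection works without consulting .categories.
missing = pd.isna(c).tolist()     # [False, False, False, True]

print(categories, codes, missing)
```

With this layout, `codes == -1` and `pd.isna` agree on what is missing, which is the property that NaN in .categories breaks.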

The option of NaNs in the categories makes the code in #10729 less pleasant than it would be otherwise. I don't think we should raise an error if NaNs are included, just advise against it in the docs. Perhaps a deprecation, but I worry that I'm missing some obvious reason why NaNs were allowed in .categories.

@JanSchulz do you remember the initial reason for allowing either representation?

Some bad things that come out of NaN in .categories:

In [1]: import numpy as np
   ...: import pandas as pd

In [2]: s = pd.Categorical(['a', 'b', 'a', np.nan], categories=['a', 'b', np.nan])

In [3]: s
Out[3]:
[a, b, a, NaN]
Categories (3, object): [a, b, NaN]

In [4]: s.categories
Out[4]: Index(['a', 'b', nan], dtype='object')

In [5]: s.codes
Out[5]: array([0, 1, 0, 2], dtype=int8)