API/DOC: Deprecate and Advise against having np.nan in Categoricals · Issue #10748 · pandas-dev/pandas
This came out of work on #10729.
In the documentation, we mention that:
There are two ways a np.nan can be represented in categorical data: either the value is not available (“missing value”) or np.nan is a valid category.
In the first case, NaN is not in .categories, and in the second case it is. I think we should only recommend the first.
The option of NaNs in the categories makes the code in #10729 less pleasant than it would be otherwise. I don't think we should error if NaNs are included, just advise against it in the docs. Perhaps a deprecation, but I worry that I'm missing some obvious reason why NaNs were allowed in .categories.
@JanSchulz do you remember the initial reason for allowing either representation?
Some bad things that come out of NaN in .categories:
- Can't rely on a value of nan mapping to a code of -1:
In [2]: s = pd.Categorical(['a', 'b', 'a', np.nan], categories=['a', 'b', np.nan])

In [3]: s
Out[3]:
[a, b, a, NaN]
Categories (3, object): [a, b, NaN]

In [4]: s.categories
Out[4]: Index(['a', 'b', nan], dtype='object')

In [5]: s.codes
Out[5]: array([0, 1, 0, 2], dtype=int8)
- potentially have to upcast the index type or mix strings and floats (nan) in the .categories Index.
- extra code if you want to generically handle Categoricals that may or may not have NaN in categories.
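For contrast, here is a minimal sketch of the first representation (NaN as a missing value, not a category), plus the kind of extra code the last point refers to. The helper name missing_mask is hypothetical and only for illustration; it is not part of pandas. The NaN-in-categories form is not constructed here, since newer pandas versions may refuse it.

```python
import numpy as np
import pandas as pd

# First representation: NaN is a missing value and is NOT listed in categories.
cat = pd.Categorical(["a", "b", "a", np.nan], categories=["a", "b"])

# With this representation a missing value always maps to the code -1,
# and the categories Index holds only real labels (no str/float mix).
codes = cat.codes
has_nan_category = bool(cat.categories.isna().any())


def missing_mask(c):
    """Hypothetical helper: flag missing entries under either representation.

    It checks for the -1 sentinel and also for codes that point at a NaN
    category, which is the extra work needed if NaN may be in .categories.
    """
    codes = np.asarray(c.codes)
    nan_in_categories = np.asarray(c.categories.isna())
    mask = codes == -1                 # first representation: sentinel code
    valid = ~mask
    # second representation: a valid code whose category is itself NaN
    mask[valid] = nan_in_categories[codes[valid]]
    return mask


mask = missing_mask(cat)
```

If only the first representation is recommended, the helper collapses to a plain check of codes == -1, which is the simplification argued for above.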