Series.map on a categorical does not process missing values · Issue #22527 · pandas-dev/pandas

Code Sample

pd.Series(['Pandas', 'is', np.nan], dtype='category').map(lambda x: len(x) if x == x else -1)
0    6.0
1    2.0
2    NaN
dtype: category
Categories (2, int64): [6, 2]

pd.Series(['Pandas', 'is', np.nan], dtype='category').astype(object).map(lambda x: len(x) if x == x else -1)
0    6
1    2
2   -1
dtype: int64

Problem description

Series.map calls its function argument once for each category of the categorical, but never calls it on NaN, even when NaN is present in the series. This is inconsistent with how Series.map works on other dtypes, and is very surprising!
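
One way to see which inputs actually reach the function is to record every argument it receives. This is only a minimal sketch: `seen` and `record_len` are names made up for illustration, and what ends up in `seen` depends on the installed pandas version (the behaviour reported here was observed on 0.23.1).

```python
import numpy as np
import pandas as pd

seen = []  # every argument passed to the mapped function

def record_len(x):
    """Hypothetical helper: log the argument, then return len(x), or -1 for NaN."""
    seen.append(x)
    return len(x) if x == x else -1  # NaN is the only value with x != x

s = pd.Series(['Pandas', 'is', np.nan], dtype='category')
result = s.map(record_len)

# On versions affected by this issue, no NaN shows up in `seen`.
nan_was_passed = any(x != x for x in seen)
print(seen)
print("NaN passed to function:", nan_was_passed)
```

If `nan_was_passed` is False, the function never had a chance to supply a replacement for the missing value.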

I'm raising this issue even though #15706 already exists because that issue is asking for something different (they want the argument to .map to be called once per value in the series, rather than once per unique value).
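
To make the distinction with #15706 concrete, the number of calls can be counted for an object series and for a categorical built from the same values. This is a rough sketch; `count_calls` is a hypothetical helper, and the exact counts may vary between pandas versions.

```python
import pandas as pd

def count_calls(series):
    """Hypothetical helper: count how often Series.map invokes its function."""
    calls = []
    # list.append returns None, so the lambda still returns len(x)
    series.map(lambda x: calls.append(x) or len(x))
    return len(calls)

values = ['a', 'a', 'b', 'a']  # four elements, two distinct values

n_object = count_calls(pd.Series(values, dtype=object))
n_categorical = count_calls(pd.Series(values, dtype='category'))

print("object dtype calls:", n_object)
print("categorical calls: ", n_categorical)
```

The call count is what #15706 is about. This issue is about NaN never being among the arguments at all.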

Another related issue is #20714.

Expected Output

Categorical map should give values equal to those obtained by first converting to object. For any series s and function f we should have the invariant that:

s.map(f).astype(object).equals(s.astype(object).map(f).astype(object))
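
The invariant can be wrapped in a small check and tried on series with and without missing values. This is a sketch under assumptions: `map_matches_object_map` and `safe_len` are illustrative names, not part of pandas, and the second result depends on the pandas version in use.

```python
import numpy as np
import pandas as pd

def map_matches_object_map(s, f):
    """Hypothetical helper: True if s.map(f) agrees with mapping the object-dtype copy."""
    direct = s.map(f).astype(object)
    via_object = s.astype(object).map(f).astype(object)
    return direct.equals(via_object)

def safe_len(x):
    # NaN is the only value that is not equal to itself
    return len(x) if x == x else -1

no_missing = pd.Series(['Pandas', 'is'], dtype='category')
with_missing = pd.Series(['Pandas', 'is', np.nan], dtype='category')

print(map_matches_object_map(no_missing, safe_len))
# The next result is False on versions affected by this issue.
print(map_matches_object_map(with_missing, safe_len))
```

A check like this could serve as the basis for a regression test once the behaviour is fixed.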

Output of pd.show_versions()

INSTALLED VERSIONS
------------------
commit: None
python: 3.6.1.final.0
python-bits: 64
OS: Windows
OS-release: 10
machine: AMD64
processor: Intel64 Family 6 Model 63 Stepping 2, GenuineIntel
byteorder: little
LC_ALL: None
LANG: None
LOCALE: None.None

pandas: 0.23.1
pytest: 3.1.2
pip: 18.0
setuptools: 39.0.1
Cython: 0.27.2
numpy: 1.14.3
scipy: 1.1.0
pyarrow: 0.9.0
xarray: None
IPython: 6.1.0
sphinx: None
patsy: 0.4.1
dateutil: 2.7.2
pytz: 2018.3
blosc: None
bottleneck: 1.2.1
tables: None
numexpr: None
feather: None
matplotlib: 2.2.2
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml: 3.8.0
bs4: 4.6.0
html5lib: 0.9999999
sqlalchemy: 1.1.11
pymysql: None
psycopg2: None
jinja2: 2.9.6
s3fs: None
fastparquet: 0.1.5
pandas_gbq: None
pandas_datareader: None