BUG: pd.Categorical turns all values into NaN · Issue #43334 · pandas-dev/pandas


Code Sample

import pandas as pd

data = pd.read_excel('titanic.xlsx')

data['Survived'] = data['Survived'].astype('category')
data['Sex'] = data['Sex'].astype('category')

data.Survived.cat.categories = ['No', 'Yes']
data.Sex.cat.categories = ['female','male']

data.Survived = pd.Categorical(data.Survived, categories=['No', 'Yes'], ordered=False)
data.Sex = pd.Categorical(data.Sex, categories=['female','male'], ordered=False)

print(data.head(3))

Problem description

This code sample reads the popular Kaggle Titanic file. Reading titanic.xlsx produces the following DataFrame:

PassengerId Survived Pclass ... Fare Cabin Embarked
0 1 0 3 ... 7.2500 NaN S
1 2 1 1 ... 71.2833 C85 C
2 3 1 3 ... 7.9250 NaN S

When I execute the code above, the result displayed on the terminal is as follows:

PassengerId Survived Pclass ... Fare Cabin Embarked
0 1 NaN 3 ... 7.2500 NaN S
1 2 NaN 1 ... 71.2833 C85 C
2 3 NaN 3 ... 7.9250 NaN S

As can be seen, all values in the "Survived" Series are now NaN. The expected behavior would be for the values to become "Yes" or "No". Strangely, if I swap the penultimate and antepenultimate lines (the two pd.Categorical assignments), giving the following code sample:

import pandas as pd

data = pd.read_excel('titanic.xlsx')

data['Survived'] = data['Survived'].astype('category')
data['Sex'] = data['Sex'].astype('category')

data.Survived.cat.categories = ['No', 'Yes']
data.Sex.cat.categories = ['female','male']

data.Sex = pd.Categorical(data.Sex, categories=['female','male'], ordered=False)
data.Survived = pd.Categorical(data.Survived, categories=['No', 'Yes'], ordered=False)

print(data.head(3))

The result generated by the code sample above is the expected one, as shown below.

PassengerId Survived Pclass ... Fare Cabin Embarked
0 1 No 3 ... 7.2500 NaN S
1 2 Yes 1 ... 71.2833 C85 C
2 3 Yes 3 ... 7.9250 NaN S
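
For reference, here is a minimal sketch that does not depend on titanic.xlsx. It assumes a three-row stand-in DataFrame whose values are hypothetical, as are the names mismatch, survived and sex. The first part shows that pd.Categorical turns any value not listed in categories into NaN. That matches the NaN output reported above if the column still held 0 and 1 when it was passed in, but it does not explain why the order of the assignments matters. The second part is a possible workaround, not a fix: rename the categories with rename_categories and assign the result back, instead of setting .cat.categories in place.

```python
import pandas as pd

# Hypothetical stand-in for the first rows of titanic.xlsx (assumption, not the real file)
data = pd.DataFrame({
    "Survived": [0, 1, 1],
    "Sex": ["male", "female", "female"],
})

# Values that are not among the requested categories become NaN.
# Here the column holds 0/1, while the categories are 'No'/'Yes'.
mismatch = pd.Categorical(data["Survived"], categories=["No", "Yes"], ordered=False)
print(mismatch.isna().all())

# Possible workaround: build renamed categoricals and assign them back,
# rather than mutating .cat.categories on a column accessed via attribute.
survived = data["Survived"].astype("category").cat.rename_categories({0: "No", 1: "Yes"})
sex = data["Sex"].astype("category")

data["Survived"] = survived
data["Sex"] = sex

print(data.head(3))
```

With this approach the result should not depend on the order in which the two columns are assigned, since each column is converted from its own original values.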

Output of pd.show_versions()

INSTALLED VERSIONS

commit : 5f648bf
python : 3.9.6.final.0
python-bits : 64
OS : Darwin
OS-release : 19.6.0
Version : Darwin Kernel Version 19.6.0: Tue Jun 22 19:49:55 PDT 2021; root:xnu-6153.141.35~1/RELEASE_X86_64
machine : x86_64
processor : i386
byteorder : little
LC_ALL : None
LANG : pt_BR.UTF-8
LOCALE : pt_BR.UTF-8

pandas : 1.3.2
numpy : 1.21.1
pytz : 2021.1
dateutil : 2.8.1
pip : 21.1.3
setuptools : 57.0.0
Cython : None
pytest : None
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : 3.0.1
IPython : None
pandas_datareader: None
bs4 : None
bottleneck : None
fsspec : None
fastparquet : None
gcsfs : None
matplotlib : 3.4.3
numexpr : None
odfpy : None
openpyxl : 3.0.7
pandas_gbq : None
pyarrow : None
pyxlsb : None
s3fs : None
scipy : None
sqlalchemy : 1.4.15
tables : None
tabulate : None
xarray : None
xlrd : 2.0.1
xlwt : None
numba : None