BUG: pd.Categorical turns all values into NaN · Issue #43334 · pandas-dev/pandas

I have checked that this issue has not already been reported.
I have confirmed this bug exists on the latest version of pandas.
(optional) I have confirmed this bug exists on the master branch of pandas.

Code Sample

import pandas as pd
data = pd.read_excel('titanic.xlsx')

data['Survived'] = data['Survived'].astype('category')
data['Sex'] = data['Sex'].astype('category')

data.Survived.cat.categories = ['No', 'Yes']
data.Sex.cat.categories = ['female','male']

data.Survived = pd.Categorical(data.Survived, categories=['No', 'Yes'], ordered=False)
data.Sex = pd.Categorical(data.Sex, categories=['female','male'], ordered=False)

print(data.head(3))

Problem description

This code sample reads the popular Kaggle Titanic file. Reading titanic.xlsx produces the following DataFrame:

PassengerId	Survived	Pclass	...	Fare	Cabin	Embarked
0	1	0	3	...	7.2500	NaN	S
1	2	1	1	...	71.2833	C85	C
2	3	1	3	...	7.9250	NaN	S

When I execute the code above, the result displayed on the terminal is as follows:

PassengerId	Survived	Pclass	...	Fare	Cabin	Embarked
0	1	NaN	3	...	7.2500	NaN	S
1	2	NaN	1	...	71.2833	C85	C
2	3	NaN	3	...	7.9250	NaN	S

As can be seen, all values in the "Survived" column are now NaN. The expected behavior, however, would be for the values to become "Yes" or "No". Strangely, if I swap the two pd.Categorical assignments (the penultimate and antepenultimate lines), giving the following code sample:

import pandas as pd
data = pd.read_excel('titanic.xlsx')

data['Survived'] = data['Survived'].astype('category')
data['Sex'] = data['Sex'].astype('category')

data.Survived.cat.categories = ['No', 'Yes']
data.Sex.cat.categories = ['female','male']

data.Sex = pd.Categorical(data.Sex, categories=['female','male'], ordered=False)
data.Survived = pd.Categorical(data.Survived, categories=['No', 'Yes'], ordered=False)

print(data.head(3))

The result generated by the code sample above is the expected one, as shown below.

PassengerId	Survived	Pclass	...	Fare	Cabin	Embarked
0	1	No	3	...	7.2500	NaN	S
1	2	Yes	1	...	71.2833	C85	C
2	3	Yes	3	...	7.9250	NaN	S
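
For readers without titanic.xlsx, here is a minimal sketch of the NaN symptom on a small in-memory frame. The three rows are hypothetical stand-ins, not the real data. The sketch only illustrates the general rule that pd.Categorical turns values absent from `categories` into NaN, and that `rename_categories` relabels the existing categories instead. It is not a claim about why the order of the two assignments changes the result.

```python
import pandas as pd

# Hypothetical stand-in for the first rows of titanic.xlsx (not the real file).
data = pd.DataFrame({
    "PassengerId": [1, 2, 3],
    "Survived": [0, 1, 1],
    "Sex": ["male", "female", "female"],
})

# pd.Categorical keeps only values that appear in `categories`.
# The integers 0 and 1 are not in ['No', 'Yes'], so every entry becomes NaN.
mismatched = pd.Categorical(data["Survived"], categories=["No", "Yes"], ordered=False)
print(mismatched.isna().all())

# Renaming the categories relabels 0 as 'No' and 1 as 'Yes' instead of discarding them.
# The result is assigned back to the frame rather than set through an attribute.
data["Survived"] = (
    data["Survived"].astype("category").cat.rename_categories(["No", "Yes"])
)
print(data["Survived"].tolist())
```

If the raw 0/1 values reach pd.Categorical unchanged, this mismatch alone would be enough to produce a column of NaN.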

Output of `pd.show_versions()`

INSTALLED VERSIONS

commit : 5f648bf
python : 3.9.6.final.0
python-bits : 64
OS : Darwin
OS-release : 19.6.0
Version : Darwin Kernel Version 19.6.0: Tue Jun 22 19:49:55 PDT 2021; root:xnu-6153.141.35~1/RELEASE_X86_64
machine : x86_64
processor : i386
byteorder : little
LC_ALL : None
LANG : pt_BR.UTF-8
LOCALE : pt_BR.UTF-8

pandas : 1.3.2
numpy : 1.21.1
pytz : 2021.1
dateutil : 2.8.1
pip : 21.1.3
setuptools : 57.0.0
Cython : None
pytest : None
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : 3.0.1
IPython : None
pandas_datareader: None
bs4 : None
bottleneck : None
fsspec : None
fastparquet : None
gcsfs : None
matplotlib : 3.4.3
numexpr : None
odfpy : None
openpyxl : 3.0.7
pandas_gbq : None
pyarrow : None
pyxlsb : None
s3fs : None
scipy : None
sqlalchemy : 1.4.15
tables : None
tabulate : None
xarray : None
xlrd : 2.0.1
xlwt : None
numba : None
