BUG: pd.Categorical turns all values into NaN · Issue #43334 · pandas-dev/pandas
- I have checked that this issue has not already been reported.
- I have confirmed this bug exists on the latest version of pandas.
- (optional) I have confirmed this bug exists on the master branch of pandas.
Code Sample
import pandas as pd
data = pd.read_excel('titanic.xlsx')
data['Survived'] = data['Survived'].astype('category')
data['Sex'] = data['Sex'].astype('category')
data.Survived.cat.categories = ['No', 'Yes']
data.Sex.cat.categories = ['female', 'male']
data.Survived = pd.Categorical(data.Survived, categories=['No', 'Yes'], ordered=False)
data.Sex = pd.Categorical(data.Sex, categories=['female', 'male'], ordered=False)
print(data.head(3))
Problem description
This code sample reads the popular Kaggle Titanic dataset. Reading the titanic.xlsx file produces the following DataFrame:
| | PassengerId | Survived | Pclass | ... | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 3 | ... | 7.2500 | NaN | S |
| 1 | 2 | 1 | 1 | ... | 71.2833 | C85 | C |
| 2 | 3 | 1 | 3 | ... | 7.9250 | NaN | S |
When I execute the code above, the result displayed on the terminal is as follows:
| | PassengerId | Survived | Pclass | ... | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|
| 0 | 1 | NaN | 3 | ... | 7.2500 | NaN | S |
| 1 | 2 | NaN | 1 | ... | 71.2833 | C85 | C |
| 2 | 3 | NaN | 3 | ... | 7.9250 | NaN | S |
As can be seen, all values in the "Survived" Series are now NaN. The expected behavior, however, is for the values to become "Yes" or "No". Strangely, if I swap the order of the two pd.Categorical assignments, so that data.Sex is converted before data.Survived, I get the following code sample:
import pandas as pd
data = pd.read_excel('titanic.xlsx')
data['Survived'] = data['Survived'].astype('category')
data['Sex'] = data['Sex'].astype('category')
data.Survived.cat.categories = ['No', 'Yes']
data.Sex.cat.categories = ['female', 'male']
data.Sex = pd.Categorical(data.Sex, categories=['female', 'male'], ordered=False)
data.Survived = pd.Categorical(data.Survived, categories=['No', 'Yes'], ordered=False)
print(data.head(3))
The result generated by the code sample above is the expected one, as shown below.
| | PassengerId | Survived | Pclass | ... | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|
| 0 | 1 | No | 3 | ... | 7.2500 | NaN | S |
| 1 | 2 | Yes | 1 | ... | 71.2833 | C85 | C |
| 2 | 3 | Yes | 3 | ... | 7.9250 | NaN | S |
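For reference, here is a minimal sketch that does not depend on titanic.xlsx. The three-element Series is a hypothetical stand-in for the first rows of the "Survived" column, with values taken from the first table above. It shows only the documented behavior that pd.Categorical maps any value not listed in `categories` to NaN. That would be consistent with the column still holding 0/1 when pd.Categorical is called, but I have not confirmed this is the actual cause. It also shows one possible workaround, assuming the intent is simply to relabel 0/1 as "No"/"Yes": assign the result of `rename_categories` back instead of setting `.cat.categories` on the attribute.

```python
import pandas as pd

# Hypothetical stand-in for the first three "Survived" values (0, 1, 1).
survived = pd.Series([0, 1, 1], name="Survived").astype("category")

# Values that are not listed in `categories` become NaN:
# the underlying values are 0 and 1, not "No" and "Yes".
as_nan = pd.Categorical(survived, categories=["No", "Yes"], ordered=False)
print(as_nan.isna().all())

# Possible workaround: rename_categories returns a new Series whose
# categories are relabeled positionally (0 -> "No", 1 -> "Yes").
renamed = survived.cat.rename_categories(["No", "Yes"])
as_labels = pd.Categorical(renamed, categories=["No", "Yes"], ordered=False)
print(list(as_labels))
```

With a DataFrame, the equivalent would be to assign the renamed Series back to the column, e.g. `data['Survived'] = data['Survived'].cat.rename_categories(['No', 'Yes'])`, which avoids depending on the order of the later assignments.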
Output of pd.show_versions()
INSTALLED VERSIONS
commit : 5f648bf
python : 3.9.6.final.0
python-bits : 64
OS : Darwin
OS-release : 19.6.0
Version : Darwin Kernel Version 19.6.0: Tue Jun 22 19:49:55 PDT 2021; root:xnu-6153.141.35~1/RELEASE_X86_64
machine : x86_64
processor : i386
byteorder : little
LC_ALL : None
LANG : pt_BR.UTF-8
LOCALE : pt_BR.UTF-8
pandas : 1.3.2
numpy : 1.21.1
pytz : 2021.1
dateutil : 2.8.1
pip : 21.1.3
setuptools : 57.0.0
Cython : None
pytest : None
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : 3.0.1
IPython : None
pandas_datareader: None
bs4 : None
bottleneck : None
fsspec : None
fastparquet : None
gcsfs : None
matplotlib : 3.4.3
numexpr : None
odfpy : None
openpyxl : 3.0.7
pandas_gbq : None
pyarrow : None
pyxlsb : None
s3fs : None
scipy : None
sqlalchemy : 1.4.15
tables : None
tabulate : None
xarray : None
xlrd : 2.0.1
xlwt : None
numba : None