pandas.Categorical.from_codes
incorrectly converts NaN codes to 0. · Issue #21767 · pandas-dev/pandas (original) (raw)
Code Sample, a copy-pastable example if possible
import pandas
def test_categorical_from_codes(): "Demonstrate issue with pandas.Categorical.from_codes." codes = pandas.Series([1, 2, float('NaN')]) categories = ['foo', 'bar', 'baz'] c = pandas.Categorical.from_codes(codes, categories, ordered=True)
expected = pandas.Categorical.from_codes([1, 2, -1], categories,
ordered=True)
pandas.util.testing.assert_categorical_equal(c, expected)
Problem description
pandas.Categorical.from_codes
is incorrectly coercing NaN
code values into 0
values. I think it should be raising a ValueError
. I believe this is because pandas._libs.algos.ensure_int8/16/32/64
is behaving in an unanticipated manner on numpy
arrays.
The call chain is:
pandas.core.arrays.categorical.Categorical.from_codes
pandas.core.dtypes.cast.coerce_indexer_dtype
pandas._libs.algos.ensure_int8/16/32/64
(aliased topandas.core.dtypes.common.ensure_int8/16/32/64
)
When given a single value all the ensure functions behave correctly. But when given a numpy.array
they do not.
import pandas import pytest
def test_ensure_on_value(): """ Test that ensure_int functions raise a ValueError when converting a NaN.
These tests ALL PASS, and should pass.
"""
with pytest.raises(ValueError):
pandas._libs.algos.ensure_int8(float('NaN'))
with pytest.raises(ValueError):
pandas._libs.algos.ensure_int16(float('NaN'))
with pytest.raises(ValueError):
pandas._libs.algos.ensure_int32(float('NaN'))
with pytest.raises(ValueError):
pandas._libs.algos.ensure_int64(float('NaN'))
def test_ensure_on_array(): """ These tests ALL PASS.
I assume all of these are supposed to raise ValueErrors.
Interestingly I did not get 0 values with ensure_int32 and ensure_int64.
"""
a8 = numpy.array([float('NaN')])
c8 = pandas._libs.algos.ensure_int8(a8)
numpy.testing.assert_array_equal(c8, numpy.array([0]))
a16 = numpy.array([float('NaN')])
c16 = pandas._libs.algos.ensure_int16(a16)
numpy.testing.assert_array_equal(c16, numpy.array([0]))
a32 = numpy.array([float('NaN')])
c32 = pandas._libs.algos.ensure_int32(a32)
numpy.testing.assert_array_equal(c32, numpy.array([-2147483648]))
a64 = numpy.array([float('NaN')])
c64 = pandas._libs.algos.ensure_int64(a64)
numpy.testing.assert_array_equal(c64, numpy.array([-9223372036854775808]))
Expected Output
test_categorical_from_codes
should pass. Instead it contains the codes [1, 2, 0]
.
Output of pd.show_versions()
[paste the output of pd.show_versions()
here below this line]
pandas.show_versions()
/Users/miker985/miniconda3/envs/general3/lib/python3.6/site-packages/psycopg2/init.py:144: UserWarning: The psycopg2 wheel package will be renamed from release 2.8; in order to keep installing from binary please use "pip install psycopg2-binary" instead. For details see: http://initd.org/psycopg/docs/install.html#binary-install-from-pypi.
""")
INSTALLED VERSIONS
commit: None
python: 3.6.0.final.0
python-bits: 64
OS: Darwin
OS-release: 17.3.0
machine: x86_64
processor: i386
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8
pandas: 0.23.2
pytest: 3.6.3
pip: 10.0.1
setuptools: 27.2.0
Cython: None
numpy: 1.12.1
scipy: None
pyarrow: None
xarray: None
IPython: None
sphinx: None
patsy: None
dateutil: 2.6.0
pytz: 2017.2
blosc: None
bottleneck: None
tables: None
numexpr: None
feather: None
matplotlib: 2.0.0
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml: None
bs4: 4.6.0
html5lib: None
sqlalchemy: 1.2.6
pymysql: None
psycopg2: 2.7.4 (dt dec pq3 ext lo64)
jinja2: 2.10
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None