pandas.Categorical.from_codes incorrectly converts NaN codes to 0. · Issue #21767 · pandas-dev/pandas

Code Sample, a copy-pastable example if possible

import pandas


def test_categorical_from_codes():
    "Demonstrate issue with pandas.Categorical.from_codes."
    codes = pandas.Series([1, 2, float('NaN')])
    categories = ['foo', 'bar', 'baz']
    c = pandas.Categorical.from_codes(codes, categories, ordered=True)

    expected = pandas.Categorical.from_codes([1, 2, -1], categories,
                                             ordered=True)

    pandas.util.testing.assert_categorical_equal(c, expected)

Problem description

pandas.Categorical.from_codes is incorrectly coercing NaN code values into 0 values. I think it should be raising a ValueError. I believe this is because pandas._libs.algos.ensure_int8/16/32/64 is behaving in an unanticipated manner on numpy arrays.

The call chain is:

  1. pandas.core.arrays.categorical.Categorical.from_codes
  2. pandas.core.dtypes.cast.coerce_indexer_dtype
  3. pandas._libs.algos.ensure_int8/16/32/64 (aliased to pandas.core.dtypes.common.ensure_int8/16/32/64)

When given a single scalar value, all of the ensure functions behave correctly and raise a ValueError. When given a numpy array, they do not:

import numpy
import pandas
import pytest


def test_ensure_on_value():
    """
    Test that ensure_int functions raise a ValueError when converting a NaN.

    These tests ALL PASS, and should pass.
    """
    with pytest.raises(ValueError):
        pandas._libs.algos.ensure_int8(float('NaN'))

    with pytest.raises(ValueError):
        pandas._libs.algos.ensure_int16(float('NaN'))

    with pytest.raises(ValueError):
        pandas._libs.algos.ensure_int32(float('NaN'))

    with pytest.raises(ValueError):
        pandas._libs.algos.ensure_int64(float('NaN'))


def test_ensure_on_array():
    """
    These tests ALL PASS.

    I assume all of these are supposed to raise ValueErrors.

    Interestingly I did not get 0 values with ensure_int32 and ensure_int64.
    """
    # int8 and int16: the NaN is silently converted to 0.
    a8 = numpy.array([float('NaN')])
    c8 = pandas._libs.algos.ensure_int8(a8)
    numpy.testing.assert_array_equal(c8, numpy.array([0]))

    a16 = numpy.array([float('NaN')])
    c16 = pandas._libs.algos.ensure_int16(a16)
    numpy.testing.assert_array_equal(c16, numpy.array([0]))

    # int32 and int64: the NaN becomes the minimum value of the integer type.
    a32 = numpy.array([float('NaN')])
    c32 = pandas._libs.algos.ensure_int32(a32)
    numpy.testing.assert_array_equal(c32, numpy.array([-2147483648]))

    a64 = numpy.array([float('NaN')])
    c64 = pandas._libs.algos.ensure_int64(a64)
    numpy.testing.assert_array_equal(c64, numpy.array([-9223372036854775808]))
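
The scalar/array difference can also be reproduced without pandas. The sketch below assumes that the array path of the ensure_int functions amounts to an unchecked cast similar to numpy's astype; that is an assumption, not something confirmed from the pandas source. It shows that Python's scalar conversion refuses NaN, while astype returns an integer array without raising. The integer values produced are not asserted here because they can vary with dtype, platform, and numpy version.

```python
import warnings

import numpy as np

# Scalar path: Python's own float-to-int conversion refuses NaN.
try:
    int(float("nan"))
    scalar_raised = False
except ValueError:
    scalar_raised = True

# Array path: astype performs an unchecked cast and raises no exception.
# Newer numpy releases may emit a RuntimeWarning, which is silenced here.
nan_array = np.array([float("nan")])
with warnings.catch_warnings():
    warnings.simplefilter("ignore", RuntimeWarning)
    as_int8 = nan_array.astype(np.int8)
    as_int64 = nan_array.astype(np.int64)

print(scalar_raised)                  # True
print(as_int8.dtype, as_int64.dtype)  # int8 int64
```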

Expected Output

test_categorical_from_codes should pass. Instead, c contains the codes [1, 2, 0].
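
Until the coercion is fixed, codes can be checked before they reach from_codes. The following is a minimal sketch of such a guard. check_codes and its nan_as_missing flag are hypothetical names, not part of pandas. By default it raises a ValueError on NaN, as suggested in the problem description; with nan_as_missing=True it maps NaN to -1, the code that from_codes uses for a missing value.

```python
import math


def check_codes(codes, n_categories, nan_as_missing=False):
    """Return codes as a list of ints, refusing values that would be coerced silently.

    Hypothetical helper for illustration; not part of pandas.
    """
    checked = []
    for code in codes:
        if isinstance(code, float):
            if math.isnan(code):
                if not nan_as_missing:
                    raise ValueError("codes must not contain NaN")
                code = -1  # -1 marks a missing value in from_codes
            elif not code.is_integer():
                raise ValueError("codes must be integer-valued, got %r" % (code,))
        code = int(code)
        if not -1 <= code < n_categories:
            raise ValueError("code %d is out of range" % code)
        checked.append(code)
    return checked


print(check_codes([1, 2, float("nan")], 3, nan_as_missing=True))  # [1, 2, -1]
```

The returned list can then be passed as the codes argument of pandas.Categorical.from_codes in place of the original Series.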

Output of pd.show_versions()

pandas.show_versions()
/Users/miker985/miniconda3/envs/general3/lib/python3.6/site-packages/psycopg2/__init__.py:144: UserWarning: The psycopg2 wheel package will be renamed from release 2.8; in order to keep installing from binary please use "pip install psycopg2-binary" instead. For details see: http://initd.org/psycopg/docs/install.html#binary-install-from-pypi.
""")

INSTALLED VERSIONS

commit: None
python: 3.6.0.final.0
python-bits: 64
OS: Darwin
OS-release: 17.3.0
machine: x86_64
processor: i386
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8

pandas: 0.23.2
pytest: 3.6.3
pip: 10.0.1
setuptools: 27.2.0
Cython: None
numpy: 1.12.1
scipy: None
pyarrow: None
xarray: None
IPython: None
sphinx: None
patsy: None
dateutil: 2.6.0
pytz: 2017.2
blosc: None
bottleneck: None
tables: None
numexpr: None
feather: None
matplotlib: 2.0.0
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml: None
bs4: 4.6.0
html5lib: None
sqlalchemy: 1.2.6
pymysql: None
psycopg2: 2.7.4 (dt dec pq3 ext lo64)
jinja2: 2.10
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None