qcut: Using cut with IntervalIndex provided by qcut producing wrong NaN values · Issue #17284 · pandas-dev/pandas

xref #17282

Code Sample

>>> x.isnull().sum()
0

>>> x.value_counts()
0.000000     693
12.561725      1
13.568112      1
12.521249      1
13.007628      1
6.993961       1
14.815512      1
6.017280       1
12.944714      1
Name: 0, dtype: int64

>>> categorized = pd.qcut(x, 10, duplicates='drop')
>>> categorized.isnull().sum()
0

>>> categorized.cat.categories  # Notice how all values of x are contained in the only interval
IntervalIndex([(-0.001, 14.816]] closed='right', dtype='interval[float64]')

>>> res = pd.cut(x, categorized.cat.categories)
>>> res.isnull().sum()
701

Copy pastable

import pandas as pd

x = pd.read_csv('x.csv', header=None).iloc[:, 0]  # x.csv is provided in a comment below
categorized = pd.qcut(x, 10, duplicates='drop')
res = pd.cut(x, categorized.cat.categories)
res.isnull().sum()

Problem description

When I use qcut to get the IntervalIndex corresponding to the quantiles of a float64 series, and then pass that IntervalIndex as the bins of cut on the same series, it doesn't work. cut produces a new series with a lot of NaN values, even though the original series contains no NaN and all of its values fall inside the single interval of the IntervalIndex.
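For anyone without x.csv at hand, here is a rough sketch that rebuilds a similar series from the value_counts() output above. It is only an approximation: the non-zero values are the six-decimal figures as printed, and their order in the series is an assumption. It checks that every value lies inside the single interval before calling cut, so any NaN in the result cannot come from values falling outside the bins. The NaN count is printed rather than asserted because it depends on the pandas version in use.

```python
import pandas as pd

# Approximate reconstruction of x from the value_counts() output in this report:
# 693 zeros plus the eight distinct non-zero values (their order is an assumption).
nonzero = [12.561725, 13.568112, 12.521249, 13.007628,
           6.993961, 14.815512, 6.017280, 12.944714]
x = pd.Series([0.0] * 693 + nonzero)

# With this many zeros, the 0th through 90th percentile edges are all 0,
# so duplicates='drop' leaves a single interval.
categorized = pd.qcut(x, 10, duplicates='drop')
bins = categorized.cat.categories
interval = bins[0]

# The interval is closed on the right: left < value <= right.
inside = bool(((x > interval.left) & (x <= interval.right)).all())

# Reuse the qcut intervals as the bins of cut, as in the report.
res = pd.cut(x, bins)
n_missing = int(res.isnull().sum())

print(bins)
print("all values inside the interval:", inside)
print("NaN count after cut:", n_missing)
```

If `inside` is True and the NaN count is non-zero, the script shows the behaviour described here.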

Expected Output

qcut and cut should produce the same result when cut is given the bins that qcut computed, so res.isnull().sum() should be 0. Instead it is 701.

Output of pd.show_versions()

INSTALLED VERSIONS

commit: None
python: 3.6.1.final.0
python-bits: 64
OS: Linux
OS-release: 3.10.0-514.26.2.el7.x86_64
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8

pandas: 0.20.1
pytest: 3.0.7
pip: 9.0.1
setuptools: 27.2.0
Cython: 0.25.2
numpy: 1.12.1
scipy: 0.19.0
xarray: None
IPython: 5.3.0
sphinx: 1.5.6
patsy: 0.4.1
dateutil: 2.6.0
pytz: 2017.2
blosc: None
bottleneck: 1.2.1
tables: 3.3.0
numexpr: 2.6.2
feather: None
matplotlib: 2.0.2
openpyxl: 2.4.7
xlrd: 1.0.0
xlwt: 1.2.0
xlsxwriter: 0.9.6
lxml: 3.7.3
bs4: 4.6.0
html5lib: 0.9999999
sqlalchemy: 1.1.9
pymysql: None
psycopg2: None
jinja2: 2.9.6
s3fs: None
pandas_gbq: None
pandas_datareader: None