Concatenating two series of categoricals results in data corruption without warning · Issue #19096 · pandas-dev/pandas

Code Sample, a copy-pastable example if possible

I'm sadly unable to share the underlying data, and have not yet been able to produce a minimised reproduction.

In [202]: s1 = df1.symbol

In [203]: s2 = df2.symbol

In [204]: s1.dtype
Out[204]: CategoricalDtype(categories=['RE00012ME6MA', 'RE00002YE6MA', 'RE00018ME6MA', 'RE00012YE6MA', 'RE00013YE6MA', 'RE00010YE6MA', 'RE00014YE6MA', 'RE00015YE6MA', 'RE00016YE6MA', 'RE00017YE6MA', 'RE00018YE6MA', 'RE00019YE6MA', 'RE00020YE6MA', 'RE00025YE6MA', 'RE00011YE6MA', 'RE00003YE6MA', 'RE00005YE6MA', 'RE00009YE6MA', 'RE00004YE6MA', 'RE00008YE6MA', 'RE00006YE6MA', 'RE00007YE6MA', 'RE00030YE6MA'], ordered=False)

In [205]: s1.shape
Out[205]: (2084,)

In [206]: s2.dtype
Out[206]: CategoricalDtype(categories=['RE00030YE6MA', 'RE00008YE6MA', 'RE00016YE6MA', 'RE00015YE6MA', 'RE00018YE6MA', 'RE00017YE6MA', 'RE00020YE6MA', 'RE00006YE6MA', 'RE00005YE6MA', 'RE00004YE6MA', 'RE00014YE6MA', 'RE00025YE6MA', 'RE00003YE6MA', 'RE00013YE6MA', 'RE00002YE6MA', 'RE00009YE6MA', 'RE00018ME6MA', 'RE00011YE6MA', 'RE00019YE6MA', 'RE00010YE6MA', 'RE00007YE6MA', 'RE00012YE6MA', 'RE00012ME6MA'], ordered=False)

In [207]: s2.shape
Out[207]: (1030,)

In [208]: pd.concat([s1, s2]).astype('object') == pd.concat([s1.astype('object'), s2.astype('object')])
Out[208]:
0        True
1        True
2        True
3        True
4        True
        ...
1025    False
1026    False
1027    False
1028    False
1029    False
Name: symbol, Length: 3114, dtype: bool

In [209]: pd.concat([s1, s2], ignore_index=True).astype('object') == pd.concat([s1.astype('object'), s2.astype('object')], ignore_index=True)
Out[209]:
0        True
1        True
2        True
3        True
4        True
        ...
3109    False
3110    False
3111    False
3112    False
3113    False
Name: symbol, Length: 3114, dtype: bool

In [210]: pd.concat([s1.astype('object'), s2.astype('object')], ignore_index=True).iloc[-5:]
Out[210]:
3109    RE00012ME6MA
3110    RE00012ME6MA
3111    RE00005YE6MA
3112    RE00015YE6MA
3113    RE00015YE6MA
Name: symbol, dtype: object

In [211]: pd.concat([s1, s2], ignore_index=True).astype('object').iloc[-5:]
Out[211]:
3109    RE00030YE6MA
3110    RE00030YE6MA
3111    RE00016YE6MA
3112    RE00012YE6MA
3113    RE00012YE6MA
Name: symbol, dtype: object

Problem description

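The row values have changed without warning: the rows that came from s2 hold different values after a categorical concat than after an object concat. This seems to be extremely surprising behaviour!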
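The printed outputs above are consistent with one specific pattern: each corrupted value is what you get by taking the position of the true value in s2's category list and looking that position up in s1's category list. The sketch below checks this against the last five rows from Out[210] and Out[211], using only the category lists shown in Out[204] and Out[206]. It is an inference from the printed output, not a confirmed diagnosis of the pandas internals, and the helper name `reinterpret` is made up for illustration.

```python
# Category lists copied from Out[204] (s1) and Out[206] (s2) in this report.
s1_categories = [
    'RE00012ME6MA', 'RE00002YE6MA', 'RE00018ME6MA', 'RE00012YE6MA',
    'RE00013YE6MA', 'RE00010YE6MA', 'RE00014YE6MA', 'RE00015YE6MA',
    'RE00016YE6MA', 'RE00017YE6MA', 'RE00018YE6MA', 'RE00019YE6MA',
    'RE00020YE6MA', 'RE00025YE6MA', 'RE00011YE6MA', 'RE00003YE6MA',
    'RE00005YE6MA', 'RE00009YE6MA', 'RE00004YE6MA', 'RE00008YE6MA',
    'RE00006YE6MA', 'RE00007YE6MA', 'RE00030YE6MA',
]
s2_categories = [
    'RE00030YE6MA', 'RE00008YE6MA', 'RE00016YE6MA', 'RE00015YE6MA',
    'RE00018YE6MA', 'RE00017YE6MA', 'RE00020YE6MA', 'RE00006YE6MA',
    'RE00005YE6MA', 'RE00004YE6MA', 'RE00014YE6MA', 'RE00025YE6MA',
    'RE00003YE6MA', 'RE00013YE6MA', 'RE00002YE6MA', 'RE00009YE6MA',
    'RE00018ME6MA', 'RE00011YE6MA', 'RE00019YE6MA', 'RE00010YE6MA',
    'RE00007YE6MA', 'RE00012YE6MA', 'RE00012ME6MA',
]

# Last five rows of the object-dtype concat (Out[210]): the correct values.
expected_tail = ['RE00012ME6MA', 'RE00012ME6MA', 'RE00005YE6MA',
                 'RE00015YE6MA', 'RE00015YE6MA']
# Last five rows of the categorical concat (Out[211]): the corrupted values.
observed_tail = ['RE00030YE6MA', 'RE00030YE6MA', 'RE00016YE6MA',
                 'RE00012YE6MA', 'RE00012YE6MA']


def reinterpret(values, source_categories, target_categories):
    """Hypothetical helper: encode values as positions in source_categories,
    then decode those positions against target_categories."""
    codes = [source_categories.index(value) for value in values]
    return [target_categories[code] for code in codes]


# Encode with s2's category order, decode with s1's category order.
reinterpreted_tail = reinterpret(expected_tail, s2_categories, s1_categories)
print(reinterpreted_tail == observed_tail)
```

If the comparison holds, it suggests that the integer codes of s2 were kept as-is and decoded against s1's categories, which would also explain why only the rows from s2 differ.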

Expected Output

Concatenating two series whose categories contain the same values in different orders should not change the row values.
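The expectation can be written as a small check on synthetic data. This is a sketch, not a confirmed reproduction: the series `left` and `right` and their values are invented stand-ins for s1 and s2, and whether this small case triggers the problem on 0.22.0 is an assumption that has not been verified here. The second half shows a possible way to sidestep the issue by giving both series the same category order before concatenating, using `Series.cat.set_categories`.

```python
import pandas as pd

# Synthetic stand-ins for s1 and s2: same category values, different order.
left = pd.Series(
    pd.Categorical(["a", "b", "c"], categories=["a", "b", "c"]), name="symbol"
)
right = pd.Series(
    pd.Categorical(["c", "a", "a"], categories=["c", "b", "a"]), name="symbol"
)

# Reference result: concatenate as plain objects, so no codes are involved.
via_object = pd.concat(
    [left.astype("object"), right.astype("object")], ignore_index=True
)

# Result under test: concatenate as categoricals, then convert to objects.
via_category = pd.concat([left, right], ignore_index=True).astype("object")

# Expected to be True; the report above implies it can be False on 0.22.0.
values_preserved = via_category.equals(via_object)
print("values preserved by categorical concat:", values_preserved)

# Possible workaround: recode `right` to use the category order of `left`.
right_aligned = right.cat.set_categories(left.cat.categories)
via_aligned = pd.concat([left, right_aligned], ignore_index=True).astype("object")
workaround_preserved = via_aligned.equals(via_object)
print("values preserved after aligning categories:", workaround_preserved)
```

Aligning the categories first keeps the result categorical; converting both series to object before the concat, as in In [210], is the other option if the categorical dtype is not needed.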

Output of pd.show_versions()

INSTALLED VERSIONS

commit: None
python: 3.6.3.final.0
python-bits: 64
OS: Linux
OS-release: 4.9.0-4-amd64
machine: x86_64
processor:
byteorder: little
LC_ALL: None
LANG: C.UTF-8
LOCALE: en_US.UTF-8

pandas: 0.22.0
pytest: None
pip: 10.0.0.subpip_fix
setuptools: 36.5.0
Cython: None
numpy: 1.13.3
scipy: 1.0.0
pyarrow: None
xarray: None
IPython: 6.2.1
sphinx: None
patsy: None
dateutil: 2.6.1
pytz: 2017.3
blosc: None
bottleneck: None
tables: None
numexpr: None
feather: None
matplotlib: 2.1.1
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml: 4.1.0
bs4: 4.6.0
html5lib: 1.0b10
sqlalchemy: None
pymysql: None
psycopg2: None
jinja2: 2.10
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: 0.5.0