Concatenating two series of categoricals results in data corruption without warning · Issue #19096 · pandas-dev/pandas
Code Sample, a copy-pastable example if possible
I'm sadly unable to share the underlying data, and have not yet been able to produce a minimised reproduction.
In [202]: s1 = df1.symbol
In [203]: s2 = df2.symbol
In [204]: s1.dtype
Out[204]: CategoricalDtype(categories=['RE00012ME6MA', 'RE00002YE6MA', 'RE00018ME6MA', 'RE00012YE6MA', 'RE00013YE6MA', 'RE00010YE6MA', 'RE00014YE6MA', 'RE00015YE6MA', 'RE00016YE6MA', 'RE00017YE6MA', 'RE00018YE6MA', 'RE00019YE6MA', 'RE00020YE6MA', 'RE00025YE6MA', 'RE00011YE6MA', 'RE00003YE6MA', 'RE00005YE6MA', 'RE00009YE6MA', 'RE00004YE6MA', 'RE00008YE6MA', 'RE00006YE6MA', 'RE00007YE6MA', 'RE00030YE6MA'], ordered=False)
In [205]: s1.shape
Out[205]: (2084,)
In [206]: s2.dtype
Out[206]: CategoricalDtype(categories=['RE00030YE6MA', 'RE00008YE6MA', 'RE00016YE6MA', 'RE00015YE6MA', 'RE00018YE6MA', 'RE00017YE6MA', 'RE00020YE6MA', 'RE00006YE6MA', 'RE00005YE6MA', 'RE00004YE6MA', 'RE00014YE6MA', 'RE00025YE6MA', 'RE00003YE6MA', 'RE00013YE6MA', 'RE00002YE6MA', 'RE00009YE6MA', 'RE00018ME6MA', 'RE00011YE6MA', 'RE00019YE6MA', 'RE00010YE6MA', 'RE00007YE6MA', 'RE00012YE6MA', 'RE00012ME6MA'], ordered=False)
In [207]: s2.shape
Out[207]: (1030,)
In [208]: pd.concat([s1, s2]).astype('object') == pd.concat([s1.astype('object'), s2.astype('object')])
Out[208]:
0        True
1        True
2        True
3        True
4        True
        ...
1025    False
1026    False
1027    False
1028    False
1029    False
Name: symbol, Length: 3114, dtype: bool
In [209]: pd.concat([s1, s2], ignore_index=True).astype('object') == pd.concat([s1.astype('object'), s2.astype('object')], ignore_index=True)
Out[209]:
0        True
1        True
2        True
3        True
4        True
        ...
3109    False
3110    False
3111    False
3112    False
3113    False
Name: symbol, Length: 3114, dtype: bool
In [210]: pd.concat([s1.astype('object'), s2.astype('object')], ignore_index=True).iloc[-5:]
Out[210]:
3109    RE00012ME6MA
3110    RE00012ME6MA
3111    RE00005YE6MA
3112    RE00015YE6MA
3113    RE00015YE6MA
Name: symbol, dtype: object
In [211]: pd.concat([s1, s2], ignore_index=True).astype('object').iloc[-5:]
Out[211]:
3109    RE00030YE6MA
3110    RE00030YE6MA
3111    RE00016YE6MA
3112    RE00012YE6MA
3113    RE00012YE6MA
Name: symbol, dtype: object
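The mismatch in the last five rows follows a pattern that can be checked from the output alone. The sketch below uses only the category lists from Out[204] and Out[206] and the tails from Out[210] and Out[211]. It tests one reading of the data: that each s2 value keeps its position in s2's category list, but that position is then looked up in s1's category list. The helper name `reinterpret` is made up for illustration, and this is an observation about the printed output, not a confirmed diagnosis of the pandas internals.

```python
# Category orders copied from Out[204] (s1) and Out[206] (s2).
s1_categories = [
    'RE00012ME6MA', 'RE00002YE6MA', 'RE00018ME6MA', 'RE00012YE6MA',
    'RE00013YE6MA', 'RE00010YE6MA', 'RE00014YE6MA', 'RE00015YE6MA',
    'RE00016YE6MA', 'RE00017YE6MA', 'RE00018YE6MA', 'RE00019YE6MA',
    'RE00020YE6MA', 'RE00025YE6MA', 'RE00011YE6MA', 'RE00003YE6MA',
    'RE00005YE6MA', 'RE00009YE6MA', 'RE00004YE6MA', 'RE00008YE6MA',
    'RE00006YE6MA', 'RE00007YE6MA', 'RE00030YE6MA',
]
s2_categories = [
    'RE00030YE6MA', 'RE00008YE6MA', 'RE00016YE6MA', 'RE00015YE6MA',
    'RE00018YE6MA', 'RE00017YE6MA', 'RE00020YE6MA', 'RE00006YE6MA',
    'RE00005YE6MA', 'RE00004YE6MA', 'RE00014YE6MA', 'RE00025YE6MA',
    'RE00003YE6MA', 'RE00013YE6MA', 'RE00002YE6MA', 'RE00009YE6MA',
    'RE00018ME6MA', 'RE00011YE6MA', 'RE00019YE6MA', 'RE00010YE6MA',
    'RE00007YE6MA', 'RE00012YE6MA', 'RE00012ME6MA',
]

# Last five rows: the true values (Out[210]) and the values seen after
# concatenating the categorical Series directly (Out[211]).
expected_tail = ['RE00012ME6MA', 'RE00012ME6MA', 'RE00005YE6MA',
                 'RE00015YE6MA', 'RE00015YE6MA']
observed_tail = ['RE00030YE6MA', 'RE00030YE6MA', 'RE00016YE6MA',
                 'RE00012YE6MA', 'RE00012YE6MA']


def reinterpret(value):
    """Find value's position in s2's categories, then read that position from s1's."""
    return s1_categories[s2_categories.index(value)]


reinterpreted_tail = [reinterpret(v) for v in expected_tail]
matches = reinterpreted_tail == observed_tail
print(matches)  # prints True
```

All five corrupted values are reproduced by this lookup, which is consistent with the rows from s2 being decoded against s1's category order.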
Problem description
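The row values have changed without warning: the last rows of the concatenated result, which come from s2, no longer match the values obtained by concatenating the same data as objects. This seems to be extremely surprising behaviour!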
Expected Output
Concatenating two Series whose categoricals contain the same categories in different orders should not change the row values.
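A minimal sketch of this expectation on synthetic data follows. The three-letter categories are hypothetical stand-ins, not the reporter's data, and the script is not claimed to reproduce the failure on any particular pandas version. It also shows one defensive option: recoding s2 onto s1's category order with `Series.cat.set_categories` before concatenating, so that both inputs share an identical dtype.

```python
import pandas as pd

# Hypothetical stand-in data: the same three categories, listed in different orders.
s1 = pd.Series(
    pd.Categorical(["a", "b", "c", "a"], categories=["a", "b", "c"]),
    name="symbol",
)
s2 = pd.Series(
    pd.Categorical(["a", "b", "c"], categories=["c", "b", "a"]),
    name="symbol",
)

# Reference result: concatenate the plain object values.
expected = pd.concat(
    [s1.astype("object"), s2.astype("object")], ignore_index=True
)

# Defensive variant: recode s2 onto s1's category order, then concatenate.
s2_aligned = s2.cat.set_categories(s1.cat.categories)
combined = pd.concat([s1, s2_aligned], ignore_index=True)

# The expectation from this report: row values are unchanged by the concat.
values_preserved = combined.astype("object").tolist() == expected.tolist()
print(values_preserved)
```

Replacing `s2_aligned` with `s2` in the `pd.concat` call turns the same comparison into a check of the behaviour reported here.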
Output of pd.show_versions()
INSTALLED VERSIONS
commit: None
python: 3.6.3.final.0
python-bits: 64
OS: Linux
OS-release: 4.9.0-4-amd64
machine: x86_64
processor:
byteorder: little
LC_ALL: None
LANG: C.UTF-8
LOCALE: en_US.UTF-8
pandas: 0.22.0
pytest: None
pip: 10.0.0.subpip_fix
setuptools: 36.5.0
Cython: None
numpy: 1.13.3
scipy: 1.0.0
pyarrow: None
xarray: None
IPython: 6.2.1
sphinx: None
patsy: None
dateutil: 2.6.1
pytz: 2017.3
blosc: None
bottleneck: None
tables: None
numexpr: None
feather: None
matplotlib: 2.1.1
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml: 4.1.0
bs4: 4.6.0
html5lib: 1.0b10
sqlalchemy: None
pymysql: None
psycopg2: None
jinja2: 2.10
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: 0.5.0