pd.merge on Categoricals duplicating unique rows · Issue #16767 · pandas-dev/pandas (original) (raw)

Code Sample, a copy-pastable example if possible

import pandas as pd

create our dataframe

m = 5 temp = pd.DataFrame({ 'a': ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j'] * m, 'b': ['t', 'w', 'x', 'y', 'z'] * 2 * m, 'c': [letter for each in ['m', 'n', 'u', 'p', 'o'] for letter in [each] * 2 * m], 'd': [letter for each in ['aa', 'bb', 'cc', 'dd', 'ee', 'ff', 'gg', 'hh','ii', 'jj'] for letter in [each] * m],

})

change them all to categorical variables

for c in temp.columns: temp[c] = temp[c].astype('category')

get the dimensions before we do anything

print(temp.shape)

drop duplicates to make sure this is unqiue

it should be unique

id_df = temp.drop_duplicates() print(id_df.shape)

join a row-wise unique dataset to itself on all variables

when they're categorical variables it duplicates rows

when they're strings things behave as they're suppposed to

temp1 = pd.merge(temp, id_df, on = list(temp.columns)) print(temp1.shape)

Problem description

Using merge on Categorical dtypes doesn't appear to be checking equality correctly. Merging a unique dataframe to itself on 4 Categorical columns appears to duplicate rows. The above code example is simpler than what I experienced the issue on but the behavior is there.

The dataframe as it is created is a 50 row by 4 column dataframe of strings. Casting the strings to Categoricals to save on RAM appears to work well. Running the drop_duplicates method and checking the dimensions shows that each row is unique. Then simply merging the dataframes together results in a 54 row by 4 column dataframe.

My guess is that there is something about the way the values are assigned that underlie the labels differs and that the underlying values may be equal when the labels aren't. It appears to be a fairly specific case, as commenting any of those columns out results in what I'd expect in terms of output.

Expected Output

Running the same code and steps as illustrated in the prior paragraph without casting the columns to a Categorical dtype results in what I would expect: a 50 row by 4 column dataframe.

Output of pd.show_versions()

INSTALLED VERSIONS ------------------ commit: None python: 3.5.1.final.0 python-bits: 64 OS: Windows OS-release: 10 machine: AMD64 processor: Intel64 Family 6 Model 94 Stepping 3, GenuineIntel byteorder: little LC_ALL: None LANG: None LOCALE: None.None

pandas: 0.20.2
pytest: 2.8.5
pip: 9.0.1
setuptools: 20.3
Cython: 0.23.4
numpy: 1.13.0
scipy: 0.17.0
xarray: None
IPython: 4.1.2
sphinx: 1.3.1
patsy: 0.4.0
dateutil: 2.6.0
pytz: 2017.2
blosc: None
bottleneck: 1.0.0
tables: 3.2.2
numexpr: 2.5
feather: None
matplotlib: 1.5.1
openpyxl: 2.3.2
xlrd: 0.9.4
xlwt: 1.0.0
xlsxwriter: 0.8.4
lxml: 3.6.0
bs4: 4.4.1
html5lib: None
sqlalchemy: 1.0.12
pymysql: None
psycopg2: None
jinja2: 2.8
s3fs: None
pandas_gbq: None
pandas_datareader: None