pd.merge on Categoricals duplicating unique rows · Issue #16767 · pandas-dev/pandas
Code Sample, a copy-pastable example if possible
import pandas as pd

# create our dataframe
m = 5
temp = pd.DataFrame({
    'a': ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j'] * m,
    'b': ['t', 'w', 'x', 'y', 'z'] * 2 * m,
    'c': [letter for each in ['m', 'n', 'u', 'p', 'o'] for letter in [each] * 2 * m],
    'd': [letter for each in ['aa', 'bb', 'cc', 'dd', 'ee', 'ff', 'gg', 'hh', 'ii', 'jj'] for letter in [each] * m],
})

# change them all to categorical variables
for c in temp.columns:
    temp[c] = temp[c].astype('category')

# get the dimensions before we do anything
print(temp.shape)

# drop duplicates to make sure this is unique
# it should be unique
id_df = temp.drop_duplicates()
print(id_df.shape)

# join a row-wise unique dataset to itself on all variables
# when they're categorical variables it duplicates rows
# when they're strings things behave as they're supposed to
temp1 = pd.merge(temp, id_df, on=list(temp.columns))
print(temp1.shape)
Problem description
Using merge on Categorical dtypes doesn't appear to check equality correctly. Merging a unique dataframe to itself on 4 Categorical columns appears to duplicate rows. The code example above is simpler than the data I first hit the issue on, but it shows the same behavior.
As created, the dataframe has 50 rows and 4 columns of strings. Casting the strings to Categoricals to save on RAM appears to work well. Running the drop_duplicates method and checking the dimensions shows that each row is unique. Yet simply merging the two dataframes together results in a 54 row by 4 column dataframe.
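To see which rows were added by the merge, one option is to list the rows of the result that occur more than once. The sketch below uses a hypothetical helper, `extra_rows`, and a small made-up frame standing in for the 54-row output; it only illustrates the check and does not reproduce the bug itself.

```python
import pandas as pd

def extra_rows(merged):
    """Return every row of a merge result that occurs more than once."""
    # keep=False marks all copies of a repeated row, not just the later ones.
    return merged[merged.duplicated(keep=False)]

# Toy stand-in for a merge result in which one row has been duplicated.
toy = pd.DataFrame({'a': ['a', 'b', 'b'], 'd': ['aa', 'bb', 'bb']})
dupes = extra_rows(toy)
print(dupes)
```

Applied to `temp1` from the code sample, an empty result would mean the merge added no rows.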
My guess is that the values underlying the labels are assigned differently in the two frames, so that the underlying values may be equal when the labels aren't. It appears to be a fairly specific case, as commenting out any one of those columns gives the output I'd expect.
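To make that guess concrete, here is a minimal sketch of how a Categorical keeps integer codes separately from its labels. The names `left` and `right` are made up for illustration. It shows only that codes and labels can disagree between two Categoricals; it does not establish that this is what causes the extra rows.

```python
import pandas as pd

# Two categoricals holding the same labels, with the categories listed in a different order.
left = pd.Categorical(['x', 'y', 'z'], categories=['x', 'y', 'z'])
right = pd.Categorical(['x', 'y', 'z'], categories=['z', 'y', 'x'])

# The labels agree element by element...
labels_equal = list(left) == list(right)

# ...but the integer codes behind them do not.
left_codes = left.codes.tolist()    # positions within ['x', 'y', 'z']
right_codes = right.codes.tolist()  # positions within ['z', 'y', 'x']

print(labels_equal, left_codes, right_codes)
```

Any comparison that looked at the codes alone, without mapping them through a shared set of categories, could match or mismatch rows incorrectly.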
Expected Output
Running the same code and steps as described above without casting the columns to a Categorical dtype results in what I would expect: a 50 row by 4 column dataframe.
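For comparison, the string-based baseline can be run on its own. The sketch below builds the same frame as the code sample but leaves the columns as strings for the merge, then converts to `category` afterwards. Casting only after the merge is an assumption about a possible workaround, not something confirmed as a fix.

```python
import pandas as pd

m = 5
temp = pd.DataFrame({
    'a': ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j'] * m,
    'b': ['t', 'w', 'x', 'y', 'z'] * 2 * m,
    'c': [letter for each in ['m', 'n', 'u', 'p', 'o'] for letter in [each] * 2 * m],
    'd': [letter for each in ['aa', 'bb', 'cc', 'dd', 'ee', 'ff', 'gg', 'hh', 'ii', 'jj'] for letter in [each] * m],
})

# Merge while the columns are still plain strings.
id_df = temp.drop_duplicates()
merged = pd.merge(temp, id_df, on=list(temp.columns))
print(temp.shape, id_df.shape, merged.shape)

# Convert to categoricals only after the merge, to keep the memory savings.
merged_cat = merged.copy()
for c in merged_cat.columns:
    merged_cat[c] = merged_cat[c].astype('category')
```

All three printed shapes should be 50 rows by 4 columns, which matches the expected output described above.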
Output of pd.show_versions()
INSTALLED VERSIONS
------------------
commit: None
python: 3.5.1.final.0
python-bits: 64
OS: Windows
OS-release: 10
machine: AMD64
processor: Intel64 Family 6 Model 94 Stepping 3, GenuineIntel
byteorder: little
LC_ALL: None
LANG: None
LOCALE: None.None
pandas: 0.20.2
pytest: 2.8.5
pip: 9.0.1
setuptools: 20.3
Cython: 0.23.4
numpy: 1.13.0
scipy: 0.17.0
xarray: None
IPython: 4.1.2
sphinx: 1.3.1
patsy: 0.4.0
dateutil: 2.6.0
pytz: 2017.2
blosc: None
bottleneck: 1.0.0
tables: 3.2.2
numexpr: 2.5
feather: None
matplotlib: 1.5.1
openpyxl: 2.3.2
xlrd: 0.9.4
xlwt: 1.0.0
xlsxwriter: 0.8.4
lxml: 3.6.0
bs4: 4.4.1
html5lib: None
sqlalchemy: 1.0.12
pymysql: None
psycopg2: None
jinja2: 2.8
s3fs: None
pandas_gbq: None
pandas_datareader: None