BUG: merging with a boolean/int categorical column · Issue #17187 · pandas-dev/pandas
Code Sample, a copy-pastable example if possible
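import pandas as pd

# Left frame: ten rows, with colC as a plain boolean column
dfA = pd.DataFrame({'id': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
                    'colA': [3, 4, 2, 4, 3, 4, 5, 4, 5, 6],
                    'colB': [7, 6, 5, 6, 5, 7, 8, 7, 6, 7],
                    'colC': [False, True, True, False, False, True, False, True, True, True]})
# Convert colC to an ordered categorical (pandas 0.20.1 syntax);
# the merge below fails once this conversion is applied
dfA['colC'] = dfA['colC'].astype('category', categories=[True, False], ordered=True)
# Right frame: a subset of the ids plus one extra column
dfB = pd.DataFrame({'id': [2, 5, 7, 8], 'colD': [1, 9, 7, 3]})

print("Before\n====")
print('dfA dtypes\n------')
print(dfA.dtypes)
print('\ndfA\n---')
print(dfA)
print('\ndfB\n---')
print(dfB)

# Left merge on id; raises ValueError when colC is categorical
dfA = pd.merge(left=dfA, right=dfB, how='left', on='id')
print("\nAfter\n=====")
print(dfA)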
Problem description
This problem was originally asked on Stack Overflow at https://stackoverflow.com/questions/45538092/merging-pandas-dataframes-containing-a-categorical-variable-fails-with-valueerr where it was suggested that it is a bug.
Two dataframes containing different columns can be combined using the pandas.merge() function. This normally works well, but in the above example, converting one of the boolean columns in the left dataframe to a categorical variable causes the merge to fail with the following error:
/Users/.../env3/lib/python3.4/site-packages/pandas/core/internals.py in __init__(self, values, placement, ndim, fastpath)
    104                 ndim = values.ndim
    105             elif values.ndim != ndim:
--> 106                 raise ValueError('Wrong number of dimensions')
    107             self.ndim = ndim
    108
ValueError: Wrong number of dimensions
Checking df.ndim (a property, so it is read without parentheses) indicates that both dataframes have 2 dimensions.
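The dimensionality check mentioned above can be sketched as follows. This is a minimal illustration rather than part of the original report. It builds the categorical column with CategoricalDtype, which assumes pandas 0.21 or later instead of the 0.20.1 syntax used in the code sample, and it keeps only the columns needed for the check.

```python
import pandas as pd
from pandas.api.types import CategoricalDtype

dfA = pd.DataFrame({
    'id': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
    'colC': [False, True, True, False, False, True, False, True, True, True],
})
# Assumption: CategoricalDtype (pandas >= 0.21) stands in for the older
# astype('category', categories=..., ordered=...) call from the report.
dfA['colC'] = dfA['colC'].astype(CategoricalDtype(categories=[True, False], ordered=True))
dfB = pd.DataFrame({'id': [2, 5, 7, 8], 'colD': [1, 9, 7, 3]})

# ndim is a property, so it is read without parentheses.
print(dfA.ndim, dfB.ndim)
# The categorical column on its own is one-dimensional.
print(dfA['colC'].ndim)
```

Both frames being two-dimensional suggests that the mismatch reported in the traceback arises inside the merge internals rather than from the shape of the inputs.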
Expected Output
The expected output can be generated simply by commenting out the line in the above code that converts colC to a categorical variable (the astype('category', ...) call).
   colA  colB   colC  id  colD
0     3     7  False   1   NaN
1     4     6   True   2   1.0
2     2     5   True   3   NaN
3     4     6  False   4   NaN
4     3     5  False   5   9.0
5     4     7   True   6   NaN
6     5     8  False   7   7.0
7     4     7   True   8   3.0
8     5     6   True   9   NaN
9     6     7   True  10   NaN
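One way to obtain the expected values while still ending up with a categorical colC is to merge first and convert afterwards. This is a hedged sketch of a possible workaround, not something confirmed in the report. It assumes pandas 0.21 or later for CategoricalDtype, and the column order of the printed result may differ from the table above depending on the pandas version.

```python
import pandas as pd
from pandas.api.types import CategoricalDtype

dfA = pd.DataFrame({
    'id': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
    'colA': [3, 4, 2, 4, 3, 4, 5, 4, 5, 6],
    'colB': [7, 6, 5, 6, 5, 7, 8, 7, 6, 7],
    'colC': [False, True, True, False, False, True, False, True, True, True],
})
dfB = pd.DataFrame({'id': [2, 5, 7, 8], 'colD': [1, 9, 7, 3]})

# Merge while colC is still a plain boolean column.
merged = pd.merge(left=dfA, right=dfB, how='left', on='id')

# Convert to an ordered categorical only after the merge has succeeded.
merged['colC'] = merged['colC'].astype(
    CategoricalDtype(categories=[True, False], ordered=True)
)

print(merged)
print(merged.dtypes)
```

The trade-off is that the categorical ordering is not available during the merge itself, which is acceptable here because the join key is id rather than colC.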
Output of pd.show_versions()
INSTALLED VERSIONS
commit: None
python: 3.4.1.final.0
python-bits: 64
OS: Darwin
OS-release: 15.6.0
machine: x86_64
processor: i386
byteorder: little
LC_ALL: None
LANG: en_GB.UTF-8
LOCALE: en_GB.UTF-8
pandas: 0.20.1
pytest: None
pip: 9.0.1
setuptools: 34.1.0
Cython: None
numpy: 1.12.1
scipy: 0.16.1
xarray: None
IPython: 4.1.1
sphinx: None
patsy: None
dateutil: 2.6.0
pytz: 2017.2
blosc: None
bottleneck: None
tables: None
numexpr: None
feather: None
matplotlib: 1.5.3
openpyxl: 2.4.7
xlrd: None
xlwt: None
xlsxwriter: None
lxml: None
bs4: None
html5lib: None
sqlalchemy: None
pymysql: 0.7.11.None
psycopg2: None
jinja2: 2.8
s3fs: None
pandas_gbq: None
pandas_datareader: None