pd.get_dummies incorrectly encodes unicode characters in dataframe column names · Issue #22084 · pandas-dev/pandas

Problem description

In Python 2.x, calling pd.get_dummies on a DataFrame whose Unicode column names contain non-ASCII characters raises a UnicodeEncodeError. The problem first appeared in version 0.21.0 and is still present in 0.23.3, as well as on the master branch. It was introduced in this commit: 133a208#diff-fef81b7e498e469973b2da18d19ff6f3L1256.

The cause is that older pandas versions used the % formatting operator, which automatically produces a Unicode string if one or more of its arguments are Unicode strings, whereas the new code uses the .format method and chooses between a unicode and a str template based solely on the type of the level variable.
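To make the difference concrete, here is a minimal sketch of template selection that looks at all three parts rather than only at the level. The helper name make_dummy_name and the text_type alias are hypothetical and exist only for illustration; this is not the pandas implementation, only one way the old behaviour could be approximated with .format. It is written to run on both Python 2 and Python 3.

```python
import sys

# Text type on either major version; `unicode` only exists on Python 2.
text_type = str if sys.version_info[0] >= 3 else unicode  # noqa: F821


def make_dummy_name(prefix, sep, level):
    """Hypothetical helper: join prefix, separator and level into one name."""
    template = '{prefix}{sep}{level}'
    # Promote the template to unicode if *any* part is unicode, not only
    # `level`; this mirrors what the `%` operator used to do implicitly.
    if any(isinstance(part, text_type) for part in (prefix, sep, level)):
        template = text_type(template)
    return template.format(prefix=prefix, sep=sep, level=level)


# u'\xe4' is the a-umlaut character used in the report.
print(repr(make_dummy_name(u'\xe4', '_', 'a')))
```

Because the check covers the prefix and the separator as well, the same helper would also handle a Unicode prefix or prefix_sep combined with plain ASCII levels.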

Series.str.get_dummies is not affected, but it might be worth checking other .format calls for similar issues.

Code Sample, a copy-pastable example if possible

In pandas 0.23.3

pd.get_dummies(pd.DataFrame({u'ä': ['a']})).columns.tolist()

---------------------------------------------------------------------------
UnicodeEncodeError                        Traceback (most recent call last)
<ipython-input-17-61dd26c6814f> in <module>()
----> 1 pd.get_dummies(pd.DataFrame({u'ä': ['a']})).columns.tolist()

/usr/local/lib/python2.7/site-packages/pandas/core/reshape/reshape.pyc in get_dummies(data, prefix, prefix_sep, dummy_na, columns, sparse, drop_first, dtype)
    890             dummy = _get_dummies_1d(col[1], prefix=pre, prefix_sep=sep,
    891                                     dummy_na=dummy_na, sparse=sparse,
--> 892                                     drop_first=drop_first, dtype=dtype)
    893             with_dummies.append(dummy)
    894         result = concat(with_dummies, axis=1)

/usr/local/lib/python2.7/site-packages/pandas/core/reshape/reshape.pyc in _get_dummies_1d(data, prefix, prefix_sep, dummy_na, sparse, drop_first, dtype)
    942                       else '{prefix}{sep}{level}' for v in levels]
    943         dummy_cols = [dummy_str.format(prefix=prefix, sep=prefix_sep, level=v)
--> 944                       for dummy_str, v in zip(dummy_strs, levels)]
    945     else:
    946         dummy_cols = levels

UnicodeEncodeError: 'ascii' codec can't encode character u'\xe4' in position 0: ordinal not in range(128)

Expected Output

As in pandas 0.19.2:

pd.get_dummies(pd.DataFrame({u'ä': ['a']})).columns.tolist()

On a side note: setting the default system encoding to 'utf-8' with:

import sys
reload(sys)
sys.setdefaultencoding('utf-8')

(which, of course, is a bad idea anyway, but people still do it) makes the encoding problems even worse. get_dummies then encodes the Unicode string into a byte string, so it becomes impossible to look up the column later using the correct Unicode name. This hides the error and makes it very hard to debug, since the exception is raised far away from the root cause:

pandas v0.23.3

pd.get_dummies(pd.DataFrame({u'ä': ['a']})).columns.tolist()
['\xc3\xa4_a']

pd.get_dummies(pd.DataFrame({u'ä': ['a']}))[u'ä_a']
... traceback ...
KeyError: u'\xe4_a'
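The two names above are related: the byte string in the column list is the UTF-8 encoding of the Unicode name used for the lookup. The sketch below checks that relationship and shows a hypothetical helper, to_text, that decodes byte-string names back to text. It is an illustration only, not part of pandas, and assumes the names really are UTF-8 encoded.

```python
expected = u'\xe4_a'      # Unicode name the caller looks up
produced = b'\xc3\xa4_a'  # byte string reported in the column list

# The produced name is the UTF-8 encoding of the expected one.
assert expected.encode('utf-8') == produced
assert produced.decode('utf-8') == expected


def to_text(name, encoding='utf-8'):
    """Hypothetical helper: decode byte strings, pass other values through."""
    return name.decode(encoding) if isinstance(name, bytes) else name


# Normalising a list of column names so Unicode lookups match again.
columns = [produced, u'plain']
normalised = [to_text(c) for c in columns]
assert normalised == [expected, u'plain']
```

Applying something like to_text to the resulting column labels would only be a workaround; the underlying problem remains the choice of a str template inside get_dummies.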

A similar problem appears with a Unicode prefix or prefix_sep:

pd.get_dummies(pd.DataFrame({'a': ['a']}), prefix=u'ä').columns.tolist()

pd.get_dummies(pd.DataFrame({'a': ['a']}), prefix_sep=u'ä').columns.tolist()

Output of pd.show_versions()

commit: dfd58e8d1b32daddde18f40c289af1f77ad219b7
python: 2.7.15.final.0
python-bits: 64
OS: Darwin
OS-release: 17.5.0
machine: x86_64
processor: i386
byteorder: little
LC_ALL: en_US.UTF-8
LANG: en_US.UTF-8
LOCALE: None.None

pandas: 0.24.0.dev0+364.gdfd58e8d1.dirty
pytest: 3.6.3
pip: 10.0.1
setuptools: 39.2.0
Cython: 0.28.4
numpy: 1.14.5
scipy: None
pyarrow: None
xarray: None
IPython: None
sphinx: 1.7.6
patsy: None
dateutil: 2.7.3
pytz: 2018.5
blosc: None
bottleneck: None
tables: None
numexpr: None
feather: None
matplotlib: None
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml: None
bs4: None
html5lib: None
sqlalchemy: None
pymysql: None
psycopg2: None
jinja2: 2.10
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None
gcsfs: None

Since the cause is clear, creating a PR should be fairly simple. I will try to write one this weekend if this issue is confirmed as a bug.