pd.get_dummies incorrectly encodes unicode characters in dataframe column names · Issue #22084 · pandas-dev/pandas
Problem description
In Python 2.x, calling pd.get_dummies on a DataFrame whose column names are Unicode strings containing non-ASCII characters raises a UnicodeEncodeError. The problem first appeared in version 0.21.0 and is still present in 0.23.3 and on the master branch. It was introduced in this commit: 133a208#diff-fef81b7e498e469973b2da18d19ff6f3L1256.
The cause is that older pandas versions used the % formatting operator, which automatically promotes the result to a Unicode string if one or more of the arguments are Unicode strings, whereas the new code calls .format on a template whose type (unicode or str) is chosen solely from the type of the level variable.
Series.str.get_dummies is not affected, but it may be worth checking other .format calls for similar issues.
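One possible direction for a fix, sketched here only as an illustration and not as the actual pandas patch, is to promote the template to the text type whenever any of the three parts is a text string, which mimics what the % operator did implicitly. The names make_dummy_name and text_type are hypothetical helpers introduced for this sketch; the code is written to run on both Python 2 and Python 3.

```python
import sys

# Hypothetical alias: the text type is `unicode` on Python 2 and `str` on Python 3.
text_type = str if sys.version_info[0] >= 3 else unicode  # noqa: F821


def make_dummy_name(prefix, prefix_sep, level):
    """Build a dummy column name without losing Unicode-ness (illustrative helper)."""
    template = '{prefix}{sep}{level}'
    # Promote the template if ANY part is text, not only `level`.
    if any(isinstance(part, text_type) for part in (prefix, prefix_sep, level)):
        template = text_type(template)
    return template.format(prefix=prefix, sep=prefix_sep, level=level)


# Covers a Unicode column name, prefix, or separator.
name = make_dummy_name(u'\xe4', u'_', 'a')
```

With this approach the result is a text string whenever the column name, the prefix, or the separator is one, so all three failing calls in this report would be covered by the same check.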
Code Sample, a copy-pastable example if possible
In pandas 0.23.3
pd.get_dummies(pd.DataFrame({u'ä': ['a']})).columns.tolist()
---------------------------------------------------------------------------
UnicodeEncodeError Traceback (most recent call last)
<ipython-input-17-61dd26c6814f> in <module>()
----> 1 pd.get_dummies(pd.DataFrame({u'ä': ['a']})).columns.tolist()
/usr/local/lib/python2.7/site-packages/pandas/core/reshape/reshape.pyc in get_dummies(data, prefix, prefix_sep, dummy_na, columns, sparse, drop_first, dtype)
890 dummy = _get_dummies_1d(col[1], prefix=pre, prefix_sep=sep,
891 dummy_na=dummy_na, sparse=sparse,
--> 892 drop_first=drop_first, dtype=dtype)
893 with_dummies.append(dummy)
894 result = concat(with_dummies, axis=1)
/usr/local/lib/python2.7/site-packages/pandas/core/reshape/reshape.pyc in _get_dummies_1d(data, prefix, prefix_sep, dummy_na, sparse, drop_first, dtype)
942 else '{prefix}{sep}{level}' for v in levels]
943 dummy_cols = [dummy_str.format(prefix=prefix, sep=prefix_sep, level=v)
--> 944 for dummy_str, v in zip(dummy_strs, levels)]
945 else:
946 dummy_cols = levels
UnicodeEncodeError: 'ascii' codec can't encode character u'\xe4' in position 0: ordinal not in range(128)
Expected Output
As in pandas 0.19.2:
pd.get_dummies(pd.DataFrame({u'ä': ['a']})).columns.tolist()
As a side note: setting the default system encoding to 'utf-8' with:
import sys
reload(sys)
sys.setdefaultencoding('utf-8')
(which, of course, is a bad idea anyway, but people still do it) makes the encoding problems even worse. get_dummies will silently encode the Unicode string into a byte string, and it becomes impossible to look up the column later by its expected Unicode name. This hides the error and makes it very hard to debug, since the exception is raised far away from the root cause:
pandas v0.23.3
pd.get_dummies(pd.DataFrame({u'ä': ['a']})).columns.tolist()
['\xc3\xa4_a']
pd.get_dummies(pd.DataFrame({u'ä': ['a']}))[u'ä_a']
... traceback ...
KeyError: u'\xe4_a'
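The lookup failure can be reproduced without pandas. The following minimal sketch uses a plain dict as a stand-in for the column index (an assumption made purely for illustration) and shows that the UTF-8 encoded byte string is a different key from the Unicode string it was derived from.

```python
# The Unicode column name the caller expects to find.
text_name = u'\xe4_a'

# What ends up stored once the name has been encoded to UTF-8 bytes.
byte_name = text_name.encode('utf-8')

# A plain dict standing in for the column index (illustrative assumption).
columns = {byte_name: 0}

# The Unicode key does not match the stored byte-string key.
found_by_text = text_name in columns

# Decoding the stored key recovers the expected name.
recovered = byte_name.decode('utf-8')
```

Decoding the stored names back to text, as in the last line, is one way to confirm during debugging that a column was silently encoded.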
A similar problem appears with a Unicode prefix or prefix_sep:
pd.get_dummies(pd.DataFrame({'a': ['a']}), prefix=u'ä').columns.tolist()
pd.get_dummies(pd.DataFrame({'a': ['a']}), prefix_sep=u'ä').columns.tolist()
Output of pd.show_versions()
commit: dfd58e8d1b32daddde18f40c289af1f77ad219b7
python: 2.7.15.final.0
python-bits: 64
OS: Darwin
OS-release: 17.5.0
machine: x86_64
processor: i386
byteorder: little
LC_ALL: en_US.UTF-8
LANG: en_US.UTF-8
LOCALE: None.None
pandas: 0.24.0.dev0+364.gdfd58e8d1.dirty
pytest: 3.6.3
pip: 10.0.1
setuptools: 39.2.0
Cython: 0.28.4
numpy: 1.14.5
scipy: None
pyarrow: None
xarray: None
IPython: None
sphinx: 1.7.6
patsy: None
dateutil: 2.7.3
pytz: 2018.5
blosc: None
bottleneck: None
tables: None
numexpr: None
feather: None
matplotlib: None
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml: None
bs4: None
html5lib: None
sqlalchemy: None
pymysql: None
psycopg2: None
jinja2: 2.10
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None
gcsfs: None
Since the problem is clear, creating a PR should be fairly simple. I will try to write one this weekend if this issue is confirmed as a bug.