ENH: Added str.normalize to use unicodedata.normalize by sinhrks · Pull Request #10031 · pandas-dev/pandas

Derived from #9111. Can this be considered for v0.16.1? Otherwise I will change the milestone.

unicodedata.normalize is quite useful for standardizing multi-byte characters, such as converting full-width letters and digits to their ASCII equivalents. It would be nice if StringMethods.normalize could perform this.

import pandas as pd
s = pd.Series([u'ＡＢＣＤＥ', u'１２３４５'])
s
# 0    ＡＢＣＤＥ
# 1    １２３４５
# dtype: object

# A compatibility form such as NFKC is what maps full-width characters to ASCII.
s.str.normalize('NFKC')
# 0    ABCDE
# 1    12345
# dtype: object
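
For reference, the full-width to ASCII conversion above corresponds to a compatibility normalization form (NFKC or NFKD) in unicodedata. The following is only a minimal sketch of the element-wise behaviour using the standard library. The helper name `normalize_values` and the choice to pass non-string values through unchanged are assumptions for illustration, not the implementation in this PR.

```python
import unicodedata


def normalize_values(values, form="NFKC"):
    """Hypothetical helper: normalize each string, pass other values through."""
    return [
        unicodedata.normalize(form, v) if isinstance(v, str) else v
        for v in values
    ]


# Full-width letters and digits collapse to ASCII under NFKC.
print(normalize_values(["ＡＢＣＤＥ", "１２３４５", None]))
# ['ABCDE', '12345', None]

# A canonical form such as NFC leaves full-width characters untouched.
print(normalize_values(["ＡＢＣＤＥ"], form="NFC") == ["ＡＢＣＤＥ"])
# True
```

The canonical forms (NFC, NFD) do not apply compatibility mappings, so the form passed to unicodedata.normalize determines whether full-width characters are converted.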

Another point I'd like to discuss here is the condition under which Index.str can be used. Currently, inferred_type must be string. I think the preferable condition is:

The Index must be a regular Index, not a MultiIndex.
Its inferred_type should be one of string, unicode or mixed.

This PR currently adds unicode, but not mixed.

pd.Index([u'a', u'B']).inferred_type
# unicode
pd.Index(['a', u'B']).inferred_type
# mixed

# If we allow "mixed" to expose .str, we should exclude the MultiIndex case.
pd.MultiIndex.from_tuples([('a', 'a'), ('a', 'b')]).inferred_type
# mixed
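
The proposed condition could be sketched roughly as follows. This is a standalone illustration, not pandas internals: the function name `index_supports_str` and its two plain arguments are hypothetical, standing in for whatever check the accessor would actually perform on an Index.

```python
# Inferred types for which .str would be exposed under the proposal.
ALLOWED_INFERRED_TYPES = frozenset({"string", "unicode", "mixed"})


def index_supports_str(inferred_type, is_multi_index):
    """Hypothetical check: should Index.str be available?"""
    # A MultiIndex also reports "mixed", so it must be excluded first.
    if is_multi_index:
        return False
    return inferred_type in ALLOWED_INFERRED_TYPES


print(index_supports_str("unicode", is_multi_index=False))  # True
print(index_supports_str("mixed", is_multi_index=False))    # True
print(index_supports_str("mixed", is_multi_index=True))     # False
print(index_supports_str("integer", is_multi_index=False))  # False
```

Checking for MultiIndex before looking at inferred_type is what keeps the "mixed" case safe, since inferred_type alone cannot distinguish the two.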

CC: @mortada