ENH: Added str.normalize to use unicodedata.normalize by sinhrks · Pull Request #10031 · pandas-dev/pandas
Derived from #9111. Can this be considered for v0.16.1? Otherwise I will change the milestone.
unicodedata.normalize is quite useful for standardizing multi-byte characters. I think it would be nice if StringMethods.normalize could perform this.
import pandas as pd
s = pd.Series([u'ABCDE', u'12345'])
s
#0 ABCDE
#1 12345
# dtype: object
s.str.normalize()
#0 ABCDE
#1 12345
# dtype: object
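For reference, the element-wise behaviour that the method is meant to wrap can be sketched with the standard library alone. This is only an illustration, not the code in this PR: the helper name normalize_values, the choice of the NFKC form, and the full-width sample strings are all assumptions made for the example.

```python
import unicodedata


def normalize_values(values, form="NFKC"):
    """Hypothetical helper (not part of pandas): apply unicodedata.normalize
    to every string in `values` and pass non-string entries through unchanged."""
    return [
        unicodedata.normalize(form, v) if isinstance(v, str) else v
        for v in values
    ]


# Assumed sample data: full-width "ABCDE" and "12345", written as escapes
# so the example does not depend on the file encoding.
fullwidth = ["\uff21\uff22\uff23\uff24\uff25", "\uff11\uff12\uff13\uff14\uff15"]

normalized = normalize_values(fullwidth)
print(normalized)
```

With NFKC, the full-width letters and digits are mapped to their ASCII counterparts, which is the kind of standardization described above. Other forms (NFC, NFD, NFKD) can be passed through the form argument of the sketch.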
Another point I'd like to discuss here is the condition under which Index.str can be used. Currently, inferred_type must be string. I think the preferable condition is:
- Index must be a normal Index, not MultiIndex.
- Its inferred_type should be either string, unicode or mixed.
This PR currently adds unicode, but not mixed.
pd.Index([u'a', u'B']).inferred_type
# unicode
pd.Index(['a', u'B']).inferred_type
# mixed
# If we allow "mixed" to expose .str, we should exclude the MultiIndex case.
pd.MultiIndex.from_tuples([('a', 'a'), ('a', 'b')]).inferred_type
# mixed
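The proposed condition can be written down as a small predicate. This is a rough sketch for discussion, not the check implemented in pandas: the function name str_accessor_allowed and the allow_mixed flag are hypothetical, and the inputs are plain values standing in for Index.inferred_type and an "is this a MultiIndex" test.

```python
def str_accessor_allowed(inferred_type, is_multi_index, allow_mixed=False):
    """Hypothetical predicate for the condition discussed above.

    inferred_type  -- the value of Index.inferred_type, e.g. "string"
    is_multi_index -- True if the index is a MultiIndex
    allow_mixed    -- assumption: toggles whether "mixed" is accepted
    """
    allowed = {"string", "unicode"}
    if allow_mixed:
        allowed.add("mixed")
    # A MultiIndex is always rejected, even though its inferred_type is "mixed".
    return (not is_multi_index) and inferred_type in allowed


# The three cases from the examples above.
print(str_accessor_allowed("unicode", False))                  # plain unicode Index
print(str_accessor_allowed("mixed", False, allow_mixed=True))  # mixed str/unicode Index
print(str_accessor_allowed("mixed", True, allow_mixed=True))   # MultiIndex
```

Checking the MultiIndex case separately is what makes it safe to accept mixed later, since inferred_type alone cannot distinguish the second and third cases.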
CC: @mortada