ENH: Data formatting with unicode length by sinhrks · Pull Request #11102 · pandas-dev/pandas

OK, this PR should now work in all the cases I'm aware of. If anyone has concerns, I'd appreciate further test cases.

@shoyer Yes, East Asian users would prefer this to default to True. But it is roughly 2 times slower in the case below.

The DataFrame contains 10,000 values (100 rows * 100 columns), and each item contains 10 Unicode characters.
Display options are the defaults:
- pd.options.display.max_rows: 60
- pd.options.display.max_columns: 20

import numpy as np
import pandas as pd

chars = list(u'あいうえおかきくけこさしすせそたちつてとなにぬねの')

def rand_jp(x):
    return ''.join(np.random.choice(chars) for _ in range(x))

df = pd.DataFrame(np.empty((100, 100)))
df = df.applymap(lambda x: rand_jp(10))

%timeit unicode(df)
# 10 loops, best of 3: 177 ms per loop

# Enable Unicode handling
pd.options.display.unicode.east_asian_width = True

%timeit unicode(df)
# 1 loops, best of 3: 381 ms per loop

The effect is almost the same for all-ASCII data (same conditions, except that every character is ASCII):

Default: 10 loops, best of 3: 131 ms per loop
Enable Unicode handling: 1 loops, best of 3: 302 ms per loop

@kawochen I may not have understood properly, but reducing the number of East Asian Width categories will not affect performance, because a dict lookup is O(1).
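To illustrate the point about lookup cost, here is a minimal sketch of a per-character width lookup. It is not the code in this PR: `EAW_WIDTHS` and `display_width` are hypothetical names, and treating ambiguous (`A`) characters as one column wide is an assumption made only for this example.

```python
import unicodedata

# Hypothetical table: columns occupied per East Asian Width category.
# Fullwidth (F) and Wide (W) take two columns; everything else takes one.
# Treating Ambiguous (A) as one column is an assumption for this sketch.
EAW_WIDTHS = {'F': 2, 'W': 2, 'Na': 1, 'H': 1, 'N': 1, 'A': 1}


def display_width(text):
    """Return the number of terminal columns `text` is expected to occupy."""
    total = 0
    for ch in text:
        category = unicodedata.east_asian_width(ch)
        # One dict lookup per character, regardless of how many
        # categories the table holds.
        total += EAW_WIDTHS.get(category, 1)
    return total
```

The cost is one lookup per character, so it grows with the amount of text being formatted rather than with the size of the table.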

I think ambiguous characters need special handling (?).

Do you have any idea which characters are affected and what the special handling logic should be? My concern is that these characters cannot be aligned properly even if we try.
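For reference, a small sketch of how to see which characters fall into the ambiguous category, using only the standard library. The helper name `eaw_categories` and the sample characters are my own choices for illustration.

```python
import unicodedata


def eaw_categories(chars):
    """Map each character to its East Asian Width category code."""
    return {ch: unicodedata.east_asian_width(ch) for ch in chars}


# Sample characters: ASCII, hiragana, halfwidth katakana,
# fullwidth Latin, Greek alpha, plus-minus sign.
samples = [u'a', u'あ', u'ｱ', u'Ａ', u'α', u'±']
categories = eaw_categories(samples)

# Characters reported as 'A' (ambiguous) are the ones whose rendered
# width depends on the font and terminal settings.
ambiguous = [ch for ch in samples if categories[ch] == 'A']
```

Characters in `ambiguous` are rendered as one or two columns depending on the environment, which is why a fixed width table cannot guarantee alignment for them.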

CC: @ayapi