ENH: parse categoricals in read_csv by chris-b1 · Pull Request #13406 · pandas-dev/pandas

Closes #10153 (at least partly)

Adds the ability to parse a Categorical directly through the dtype parameter of read_csv. Currently it just uses whatever values are present as the categories. A possible enhancement would be to allow and enforce user-specified categories, though I'm not quite sure what the API would be.
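For reference, a minimal sketch of how the new option could be used, either for every column or per column through a dict. The in-memory CSV and the variable names are made up for illustration and are not part of this PR.

```python
import io

import pandas as pd

# Hypothetical in-memory CSV standing in for a file on disk.
csv_text = "a,b\nx,1\ny,2\nx,3\n"

# Parse every column as a categorical.
all_cat = pd.read_csv(io.StringIO(csv_text), dtype="category")

# Parse only column 'a' as a categorical; 'b' goes through normal inference.
some_cat = pd.read_csv(io.StringIO(csv_text), dtype={"a": "category"})

print(all_cat.dtypes)
print(some_cat["a"].cat.categories)
```

The categories are whatever distinct values appear in the column, so two files with different values will produce different categories.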

This only parses string categories. Originally I had an implementation that did type inference on the categories, but it added a lot of complication without much benefit, so the recommendation in the docs is now to convert the categories after parsing.
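A rough sketch of the "convert after parsing" step, assuming a numeric column that was read as string categories. The column name `col` and the sample data are hypothetical, and `rename_categories` is used here as one way to swap in the converted categories.

```python
import io

import pandas as pd

# Hypothetical CSV whose single column holds numbers.
csv_text = "col\n1\n2\n3\n2\n"

# With dtype='category' the categories come back as strings.
df = pd.read_csv(io.StringIO(csv_text), dtype="category")

# Convert the (few) categories rather than every row, then swap them in.
numeric_categories = pd.to_numeric(df["col"].cat.categories)
df["col"] = df["col"].cat.rename_categories(numeric_categories)

print(df["col"].cat.categories)
```

Only the categories are converted, so the cost scales with the number of distinct values rather than the number of rows.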

Here's an example timing. For reasonably sparse data (few distinct values relative to the number of rows), I'm typically seeing a speedup of slightly under 2x, along with much better memory usage.

import numpy as np
import pandas as pd

group1 = ['aaaaa', 'bbbbb', 'cccccc', 'ddddddd', 'eeeeeeee']

df = pd.DataFrame({'a': np.random.choice(group1, 10000000).astype('object'),
                   'b': np.random.choice(group1, 10000000).astype('object'),
                   'c': np.random.choice(group1, 10000000).astype('object')})
df.to_csv('strings.csv', index=False)

In [14]: %timeit pd.read_csv('strings.csv').apply(pd.Categorical)
1 loops, best of 3: 6.66 s per loop

In [13]: %timeit pd.read_csv('strings.csv', dtype='category')
1 loop, best of 3: 3.68 s per loop
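The memory side of the comparison can be checked with a sketch along these lines. It uses a much smaller, hypothetical data set held in memory rather than the 10-million-row file above, and the exact byte counts will vary with the pandas version and its default string storage.

```python
import io

import numpy as np
import pandas as pd

# Small, hypothetical low-cardinality column (not the benchmark data above).
rng = np.random.default_rng(0)
values = rng.choice(["aaaaa", "bbbbb", "cccccc"], size=100_000)
csv_text = pd.DataFrame({"a": values}).to_csv(index=False)

# Same CSV parsed with default inference and as a categorical.
as_default = pd.read_csv(io.StringIO(csv_text))
as_category = pd.read_csv(io.StringIO(csv_text), dtype="category")

# deep=True counts the string payloads, not just the pointers.
default_bytes = as_default.memory_usage(deep=True).sum()
category_bytes = as_category.memory_usage(deep=True).sum()
print(default_bytes, category_bytes)
```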