ENH: Add support for Categoricals in BlockManager · Issue #5313 · pandas-dev/pandas

@jtratner

tl;dr - add true support for Categoricals in NDFrame.

There was an issue on the mailing list about using cut and sorting the results that brought this to mind. The issue is that (I believe) a categorical loses its representation when you put it in a DataFrame, so the output of cut ends up as plain strings. I propose the following:
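To make the sorting problem concrete, here is a minimal pure-Python sketch. The bin labels and codes are made up for illustration (the mailing-list thread's actual data is not shown here), and the snippet does not call pandas at all. It only shows why labels stored as plain strings sort lexically, while integer codes that index into an ordered list of levels keep the bin order.

```python
# Hypothetical bin labels in their intended order, similar to what cut might produce.
levels = ["(0, 5]", "(5, 10]", "(10, 100]"]

# Integer codes index into `levels`; together they form the categorical representation.
codes = [2, 0, 1, 0]

# Once the values are stored as plain strings, the ordering information is gone.
as_strings = [levels[c] for c in codes]

# Sorting the strings compares characters, so "(10, 100]" lands before "(5, 10]".
sorted_strings = sorted(as_strings)

# Sorting the codes and mapping back to labels preserves the bin order.
sorted_by_code = [levels[c] for c in sorted(codes)]

print(sorted_strings)
print(sorted_by_code)
```

Keeping the codes and levels together is what the proposal below aims to support inside a DataFrame.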

  1. Add a CategoricalBlock (or FactorBlock) internally that can handle categoricals like those produced from cut. It could share most of MI's (MultiIndex's) internals: a 2D int ndarray with an associated list of indexes for each column. This is nearly the same as MI, except that most ops would work on just one 'level' and the underlying data could/would be 2D rather than a list of Int64Index. It would probably also mean abstracting common operations into a separate mixin class.
  2. Change Categorical to be a Series subclass whose SingleBlockManager holds a CategoricalBlock. This would not change its API, but it would gain the Series methods.
  3. Add a to_categorical method to Series (bonus points if we change convert_objects to detect when there are fewer than Some_Max labels and convert object dtypes to categoricals).
  4. Add a registration method to make_block so that it iterates over a set of functions that each return either a klass or None before falling back to ObjectBlock (that is, abstract the existing else clause into a function and make the list of functions semi-public).
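The registration idea in item 4, together with the codes-plus-levels storage from item 1, could look roughly like the sketch below. This is pure Python and every name in it is a hypothetical stand-in (`block_detectors`, `detect_categorical`, `MAX_LEVELS`, and the simplified `CategoricalBlock`, `ObjectBlock` and `make_block`), not pandas' actual internals or signatures. It is one-dimensional for brevity, whereas the proposal describes a 2D int ndarray.

```python
# All names here are illustrative assumptions, not pandas' real internals.

class ObjectBlock:
    """Fallback block that stores the values unchanged."""

    def __init__(self, values):
        self.values = list(values)


class CategoricalBlock:
    """Stores integer codes plus the list of levels they index into."""

    def __init__(self, values):
        # Levels are the distinct values in sorted order.
        self.levels = sorted(set(values))
        lookup = {level: i for i, level in enumerate(self.levels)}
        self.codes = [lookup[v] for v in values]

    def to_values(self):
        # Map the codes back to the original labels.
        return [self.levels[c] for c in self.codes]


MAX_LEVELS = 3  # stand-in for the "Some_Max" threshold mentioned in item 3


def detect_categorical(values):
    """Return CategoricalBlock for low-cardinality string data, else None."""
    if all(isinstance(v, str) for v in values) and len(set(values)) < MAX_LEVELS:
        return CategoricalBlock
    return None


# Semi-public list of detector functions; each returns a block class or None.
block_detectors = [detect_categorical]


def make_block(values):
    """Try each registered detector in order, then fall back to ObjectBlock."""
    values = list(values)
    for detect in block_detectors:
        klass = detect(values)
        if klass is not None:
            return klass(values)
    return ObjectBlock(values)


block = make_block(["lo", "hi", "lo"])
print(type(block).__name__, block.levels, block.codes)
```

With this shape, supporting a new block type would only require appending another detector to `block_detectors`, which is presumably the appeal of making the list semi-public.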

I'm going to work on this and I don't think it will be that difficult to implement, but it would make pandas more useful for representing level sets and other normalized data.