pandas (original) (raw)

update for 2019-10-07: We have a StringDtype extension dtype. It's memory model is the same as the old implementation, an object-dtype ndarray of strings. The next step is to store & process it natively.

xref #8627
xref #8643, #8350

Since we introduced Categorical in 0.15.0, I think we have found 2 main uses.

as a 'real' Categorical/Factor type to represent a limited of subset of values that the column can take on
as a memory saving representation for object dtypes.

I could see introducting a dtype='string' where String is a slightly specialized sub-class of Categroical, with 2 differences compared to a 'regular' Categorical:

it allows unions of arbitrary other string types, currently Categorical will complain if you do this:

In [1]: df = DataFrame({'A' : Series(list('abc'),dtype='category')})
In [2]: df2 = DataFrame({'A' : Series(list('abd'),dtype='category')})
In [3]: pd.concat([df,df2])
ValueError: incompatible levels in categorical block merge

Note that this works if they are Series (and prob should raise as well, side -issue)

But, if these were both 'string' dtypes, then its a simple matter to combine (efficiently).

you can restrict the 'sub-dtype' (e.g. the dtype of the categories) to string/unicode (iow, don't allow numbers / arbitrary objects), makes the constructor a bit simpler, but more importantly, you now have a 'real' non-object string dtype.

I don't think this would be that complicated to do. The big change here would be to essentially convert any object dtypes that are strings to dtype='string' e.g. on reading/conversion/etc. might be a perf issue for some things, but I think the memory savings greatly outweigh.

We would then have a 'real' looking object dtype (and object would be relegated to actual python object types, so would be used much less).

cc @shoyer
cc @JanSchulz
cc @jorisvandenbossche
cc @mwiebe
thoughts?