API/ENH: dtype='string' / pd.String · Issue #8640 · pandas-dev/pandas
Update for 2019-10-07: We have a StringDtype extension dtype. Its memory model is the same as the old implementation, an object-dtype ndarray of strings. The next step is to store & process it natively.
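For readers who have not used the extension dtype mentioned in the update, here is a minimal sketch. It assumes pandas 1.0 or later, where the 'string' alias resolves to StringDtype; the sample values are made up for illustration, and the internal storage may differ between pandas versions.

```python
import pandas as pd

# Assumption: pandas >= 1.0, where dtype="string" resolves to StringDtype.
names = pd.Series(["alice", "bob", None], dtype="string")

# The column reports a dedicated string dtype rather than plain object.
is_string_dtype = isinstance(names.dtype, pd.StringDtype)

# Missing values are tracked by the dtype itself.
missing_mask = names.isna().tolist()

# String methods keep the result in the string dtype.
upper = names.str.upper()

print(names.dtype, is_string_dtype, missing_mask)
print(upper)
```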
Since we introduced Categorical in 0.15.0, I think we have found 2 main uses:
- as a 'real' Categorical/Factor type to represent a limited subset of values that the column can take on
- as a memory saving representation for object dtypes.
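The second use can be checked with a rough sketch like the one below. The column contents and size are made up for illustration, and the exact byte counts depend on the pandas version and platform, so only the relative comparison is meaningful.

```python
import pandas as pd

# Illustrative data: a few distinct strings repeated many times.
colors = pd.Series(["red", "green", "blue"] * 10_000, dtype=object)
as_category = colors.astype("category")

# deep=True counts the Python string objects, not just the pointers.
object_bytes = colors.memory_usage(deep=True)
category_bytes = as_category.memory_usage(deep=True)

print("object dtype bytes:  ", object_bytes)
print("category dtype bytes:", category_bytes)
```

The saving comes from storing each distinct string once and keeping small integer codes per row, so it shrinks as the number of distinct values approaches the number of rows.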
I could see introducing a dtype='string', where String is a slightly specialized sub-class of Categorical, with 2 differences compared to a 'regular' Categorical:
- it allows unions of arbitrary other string types; currently Categorical will complain if you do this:
In [1]: df = DataFrame({'A' : Series(list('abc'),dtype='category')})
In [2]: df2 = DataFrame({'A' : Series(list('abd'),dtype='category')})
In [3]: pd.concat([df,df2])
ValueError: incompatible levels in categorical block merge
Note that this works if they are Series (and it probably should raise as well; side issue).
But if these were both 'string' dtypes, then it's a simple matter to combine them (efficiently).
- you can restrict the 'sub-dtype' (i.e. the dtype of the categories) to string/unicode (in other words, don't allow numbers or arbitrary objects). This makes the constructor a bit simpler, but more importantly, you now have a 'real' non-object string dtype.
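To make the two differences concrete, here is a pure-Python sketch of the idea, not pandas code. The helper names encode and concat_encoded are hypothetical and invented for this illustration: the first accepts only strings (the restricted 'sub-dtype'), and the second combines two encoded columns by taking the union of their categories and remapping the integer codes.

```python
from typing import Dict, List, Sequence, Tuple

Encoded = Tuple[List[str], List[int]]  # (categories, codes)


def encode(values: Sequence[str]) -> Encoded:
    """Dictionary-encode strings; reject anything that is not a str."""
    categories: List[str] = []
    positions: Dict[str, int] = {}
    codes: List[int] = []
    for value in values:
        if not isinstance(value, str):
            raise TypeError(f"only strings are allowed, got {type(value).__name__}")
        if value not in positions:
            positions[value] = len(categories)
            categories.append(value)
        codes.append(positions[value])
    return categories, codes


def concat_encoded(left: Encoded, right: Encoded) -> Encoded:
    """Combine two encoded columns by unioning their categories."""
    left_cats, left_codes = left
    right_cats, right_codes = right
    categories = list(left_cats)
    positions = {cat: i for i, cat in enumerate(categories)}
    # Map each right-hand code to its position in the unioned categories.
    remap: List[int] = []
    for cat in right_cats:
        if cat not in positions:
            positions[cat] = len(categories)
            categories.append(cat)
        remap.append(positions[cat])
    codes = list(left_codes) + [remap[code] for code in right_codes]
    return categories, codes


combined = concat_encoded(encode(list("abc")), encode(list("abd")))
print(combined)
```

Only the small category list is compared and extended; the per-row work is an integer remap, which is why combining two such columns can be cheap.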
I don't think this would be that complicated to do. The big change here would be to essentially convert any object dtypes that are strings to dtype='string', e.g. on reading/conversion/etc. This might be a perf issue for some things, but I think the memory savings greatly outweigh that.
We would then have a 'real' looking object dtype (and object would be relegated to actual Python object types, so it would be used much less).
cc @shoyer
cc @JanSchulz
cc @jorisvandenbossche
cc @mwiebe
thoughts?