DOC: Add info on dtype strings · Issue #30590 · pandas-dev/pandas

Problem description

I've been studying the new string, boolean and Intxx dtypes and think it would be worthwhile to add something about the strings that you are allowed to use with extension arrays in specifying the dtypes. It could be an additional column in the dtypes table here:
https://dev.pandas.io/docs/getting_started/basics.html#dtypes

I think the following table is correct:

| Data Type | Array | Possible Strings |
| --- | --- | --- |
| DatetimeTZDtype | DatetimeArray | `'datetime64[ns, <tz>]'` |
| CategoricalDtype | Categorical | `'category'` |
| PeriodDtype | PeriodArray | `'period[<freq>]'` or `'Period[<freq>]'` |
| SparseDtype | SparseArray | `'Sparse'`, `'Sparse[int]'`, `'Sparse[int32, 0]'`, `'Sparse[int64, 0]'`, `'Sparse[float64, nan]'`, `'Sparse[float32, nan]'` |
| IntervalDtype | IntervalArray | `'interval'`, `'Interval'`, `'Interval[<np.numeric>]'`, `'Interval[datetime64[ns, <tz>]]'`, `'Interval[timedelta64[<freq>]]'` |
| Int64Dtype (and others) | IntegerArray | `'Int8'`, `'Int16'`, `'Int32'`, `'Int64'`, `'UInt8'`, `'UInt16'`, `'UInt32'`, `'UInt64'` |
| StringDtype | StringArray | `'string'` |
| BooleanDtype | BooleanArray | `'boolean'` |
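
As a quick sanity check of a few rows, here is a minimal sketch that resolves one string per dtype with `pandas.api.types.pandas_dtype` and compares the result with the expected dtype class. The concrete parameters (`D` as the period frequency, `UTC` as the time zone) are illustrative choices of mine, and the behaviour may differ between pandas versions.

```python
import pandas as pd

# One example string per dtype class from the table.
# "D" and "UTC" are illustrative parameters, not part of the table.
examples = {
    "datetime64[ns, UTC]": pd.DatetimeTZDtype,
    "category": pd.CategoricalDtype,
    "period[D]": pd.PeriodDtype,
    "Sparse[int64, 0]": pd.SparseDtype,
    "interval": pd.IntervalDtype,
    "Int64": pd.Int64Dtype,
    "string": pd.StringDtype,
    "boolean": pd.BooleanDtype,
}

# Resolve each string to a dtype object.
resolved = {alias: pd.api.types.pandas_dtype(alias) for alias in examples}

for alias, dtype in resolved.items():
    print(f"{alias!r:>24} -> {type(dtype).__name__}")

# The same strings can be passed as the dtype argument when building data.
nullable_ints = pd.Series([1, None, 3], dtype="Int64")
```
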

I also think we may want to make it clear that if you specify a string not in that table, it needs to be a string acceptable as a numpy dtype.
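
To illustrate the fallback, the sketch below checks whether a string resolves to a plain NumPy dtype rather than an extension dtype. `is_numpy_alias` is a hypothetical helper written for this example, not a pandas function, and I am assuming that unrecognised strings raise `TypeError`.

```python
import numpy as np
import pandas as pd


def is_numpy_alias(alias):
    """Hypothetical helper: True if `alias` resolves to a plain NumPy dtype."""
    try:
        dtype = pd.api.types.pandas_dtype(alias)
    except TypeError:
        # Neither an extension dtype string nor something NumPy understands.
        return False
    return isinstance(dtype, np.dtype)


for alias in ["int64", "Int64", "bool", "boolean", "not_a_dtype"]:
    print(f"{alias!r:>14} numpy dtype: {is_numpy_alias(alias)}")
```

Note that `'int64'` and `'Int64'` differ only in case but resolve to a NumPy dtype and a pandas extension dtype respectively.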

If people like @TomAugspurger and @jorisvandenbossche think this is useful, I'll add a column to that table in the docs (or maybe have to use a separate table because of the length of the last column above).

Also, should we consider allowing 'Boolean', 'String' and 'Category', i.e. type names with a leading capital letter? We're inconsistent about which capitalizations are accepted for dtype strings in different places (compare period/Period and interval/Interval).
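
A small script along these lines could be used to see which capitalizations a given pandas version accepts. `resolves` is a hypothetical helper for this example, and `D` is an illustrative frequency. I have deliberately not listed expected results for the capitalized `Category`, `String` and `Boolean` forms, since those may depend on the installed pandas and NumPy versions.

```python
import pandas as pd


def resolves(alias):
    """Hypothetical helper: True if pandas accepts `alias` as a dtype string."""
    try:
        pd.api.types.pandas_dtype(alias)
    except (TypeError, ValueError):
        return False
    return True


# Lower-case form paired with its capitalized variant.
pairs = [
    ("period[D]", "Period[D]"),
    ("interval", "Interval"),
    ("category", "Category"),
    ("string", "String"),
    ("boolean", "Boolean"),
]

accepted = {alias: resolves(alias) for pair in pairs for alias in pair}

for lower, upper in pairs:
    print(f"{lower!r}: {accepted[lower]}   {upper!r}: {accepted[upper]}")
```
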