
So I did a rough inventory of the different internal use cases of infer_dtype. The main groups I see:

Many use cases are to infer a specific subset (e.g. "string", or "floating"/"integer"/"mixed-integer-float", or "boolean") from a list or object-dtype array
-> since in those cases we know we don't start with an array of a specific dtype (other than object dtype), this will never take the EA path
Infer "period"/"interval"/"datetime"/... dtype from a non-EA (again not impacted by this discussion)
Infer "mixed-integer" for sorting (also not impacted by this discussion)
Infer dtype from an np.ndarray (likewise not impacted)
Infer "integer" key type for indexing
-> here we can potentially pass any EA (once we can use them for indexing), so for this use case it is actually important that infer_dtype(EA) doesn't raise an error (and the actual return value then doesn't matter for non-integer EAs)
...
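
For reference, the categories named in the list above are the strings that the public pandas.api.types.infer_dtype returns for plain lists. A quick sketch, assuming a reasonably recent pandas is installed (the sample labels are only for illustration):

```python
from pandas.api.types import infer_dtype

# Plain Python lists are scanned element by element, like object-dtype
# arrays, so none of these calls involves an ExtensionArray.
samples = {
    "strings": ["a", "b"],
    "integers": [1, 2, 3],
    "floats": [1.0, 2.0, 3.5],
    "ints and floats": [1, 2, 3.5],
    "booleans": [True, False],
    "ints and strings": ["a", 1],
}

inferred = {
    label: infer_dtype(values, skipna=True)
    for label, values in samples.items()
}

for label, kind in inferred.items():
    print(f"{label}: {kind}")
```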

So from that, I think it would actually be good to change infer_dtype(EA) so that it never raises an error (as it currently does on master for unknown array types).

The question is then which value to return. Infer by converting to an object-dtype numpy array, return the existing value "mixed", or return a new value like "unknown-array"? Or let the EA dtype register something?

Given the potentially expensive nature of coercing to object dtype, that might be something to avoid.
Given that "mixed" is already being used and has some use cases (eg the validation for the str accessor), it might be better to not re-use that.

So two ideas:

1. if we get an EA, can just return values.dtype.name or something like that

2. part of the register_dtype process could add stuff to _TYPE_MAP (may just be a more complicated version of 1?)
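
To make the two ideas concrete, here is a rough pure-Python sketch. Everything in it is a hypothetical stand-in (FakeDtype, FakeExtensionArray, TYPE_MAP, register_inferred and the two infer functions are not pandas API); it only shows the difference in shape between "return the dtype name" and "look up a registered category".

```python
# Hypothetical stand-ins for an extension dtype and array; not pandas classes.
class FakeDtype:
    def __init__(self, name):
        self.name = name


class FakeExtensionArray:
    def __init__(self, dtype):
        self.dtype = dtype


# Idea 1: simply report the dtype's own name.
def infer_from_name(values):
    return values.dtype.name


# Idea 2: registering a dtype also records which inferred category it maps to.
TYPE_MAP = {}


def register_inferred(dtype_name, category):
    TYPE_MAP[dtype_name] = category


def infer_from_map(values):
    # Returns None for a dtype that never registered a category.
    return TYPE_MAP.get(values.dtype.name)


register_inferred("Int64", "integer")

nullable_ints = FakeExtensionArray(FakeDtype("Int64"))
ip_addresses = FakeExtensionArray(FakeDtype("ip"))

print(infer_from_name(ip_addresses))   # the raw dtype name
print(infer_from_map(nullable_ints))   # a registered category
print(infer_from_map(ip_addresses))    # nothing registered
```

In this sketch, idea 1 can hand back any string the dtype chooses, while idea 2 only returns whatever was registered.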

If we let the EA control this, I think it can only make sense if they return one of the existing categories? (what would we otherwise ever do with it, except ignore it?)
So that would rule out the first option, I think?

Long term, it might be useful to let the dtype register its "inferred_dtype", but then I think we should first have a better idea of some specific use cases for which this would be useful (currently, many of the use cases I checked do dtype inference without yet having an array-like to start with).

So in the short term, maybe we can use the "unknown-array" return value? No caller would check for that value in practice, so it would basically be ignored, but at least nothing would raise an error.
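
A minimal sketch of how that fallback would behave for the indexing use case. The names here (UnknownEA, infer_dtype_sketch, is_integer_key) are hypothetical, and the inference logic is heavily simplified compared to the real infer_dtype; the only point is that an unknown array yields a value that callers can safely ignore.

```python
class UnknownEA:
    """Hypothetical third-party array type that infer_dtype knows nothing about."""
    dtype = "custom"


def infer_dtype_sketch(values):
    """Simplified stand-in for infer_dtype that never raises on array-likes."""
    if isinstance(values, list):
        # bool is a subclass of int, so exclude it explicitly.
        if values and all(
            isinstance(v, int) and not isinstance(v, bool) for v in values
        ):
            return "integer"
        return "mixed"  # simplification: every other list is "mixed" here
    if hasattr(values, "dtype"):
        # Proposed short-term behaviour: a sentinel instead of an error.
        return "unknown-array"
    return "mixed"


def is_integer_key(key):
    # The indexing code only asks "is this an integer key?", so any
    # non-"integer" answer, including "unknown-array", is treated the same.
    return infer_dtype_sketch(key) == "integer"


print(is_integer_key([1, 2, 3]))
print(infer_dtype_sketch(UnknownEA()))
print(is_integer_key(UnknownEA()))
```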