BUG: Fix concat of frames with extension types (no reindexed columns) by jorisvandenbossche · Pull Request #34339 · pandas-dev/pandas

This PR tries to fix the case of concatenating DataFrames where a column has an extension dtype in one frame and a non-extension dtype in another, as in this example (which gives object dtype on master instead of preserving the Int64 dtype):

In [15]: df1 = pd.DataFrame({"a": pd.array([1, 2, 3], dtype="Int64")}) 
    ...: df2 = pd.DataFrame({"a": np.array([4, 5, 6])})  

In [16]: pd.concat([df1, df2]) 
Out[16]: 
   a
0  1
1  2
2  3
0  4
1  5
2  6

In [17]: pd.concat([df1, df2]).dtypes 
Out[17]: 
a    object
dtype: object
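
For comparison, the intended behaviour can be written as a small standalone check. This is only a sketch, assuming a pandas build that includes this fix; on master before the fix the assertion would fail because the column is object dtype.

```python
import numpy as np
import pandas as pd

# Same frames as in the session above: one nullable-integer column,
# one plain NumPy integer column.
df1 = pd.DataFrame({"a": pd.array([1, 2, 3], dtype="Int64")})
df2 = pd.DataFrame({"a": np.array([4, 5, 6])})

result = pd.concat([df1, df2])

# With the fix, the common dtype of Int64 and a NumPy integer dtype
# should be Int64 rather than object.
assert result["a"].dtype == "Int64", result["a"].dtype
```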

Up to now, the _get_common_dtype machinery of the ExtensionArrays was only used for concatenating Series (axis=0 in concat_compat). Now I also try to make use of it for DataFrame columns if any of the to-concat arrays has an extension dtype. This is checked in _concatenate_join_units. We also need to handle the dimension mismatch (2D blocks for non-EAs versus 1D extension arrays). I decided to handle that inside the internals code (reducing the dimension for non-EAs and, if the concat result is not an EA, adding back a dimension), so concat_compat does not need to be aware of this.
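
The dimension handling described above can be illustrated roughly as follows. This is not the code in this PR: `concat_column_blocks` is a hypothetical helper, it assumes each non-EA input is a 2D NumPy block of shape (1, n), and it goes through the public Series concat path as a stand-in for concat_compat and _get_common_dtype.

```python
import numpy as np
import pandas as pd
from pandas.api.extensions import ExtensionArray


def concat_column_blocks(blocks):
    """Hypothetical sketch: concatenate the values of one column.

    Each element of ``blocks`` is assumed to be either a 1D ExtensionArray
    or a 2D NumPy array of shape (1, n).
    """
    if not any(isinstance(b, ExtensionArray) for b in blocks):
        # No extension dtype involved: plain 2D concat along the row axis.
        return np.concatenate(blocks, axis=1)

    # Reduce the dimension of the non-EA blocks so everything is 1D.
    one_dim = [b if isinstance(b, ExtensionArray) else b[0] for b in blocks]

    # Stand-in for the 1D concat: concatenating Series resolves a common
    # dtype for the extension arrays involved.
    result = pd.concat([pd.Series(a) for a in one_dim], ignore_index=True).values

    if not isinstance(result, ExtensionArray):
        # The common dtype is a NumPy dtype: add the dimension back.
        result = np.atleast_2d(result)
    return result


ea_block = pd.array([1, 2, 3], dtype="Int64")
np_block = np.array([[4, 5, 6]])
combined = concat_column_blocks([ea_block, np_block])
print(combined.dtype)
```

Keeping the reshape on the caller side means the 1D concat routine never sees 2D input, which is the design choice the paragraph above argues for.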

Partly overlapping with #33522 (cc @TomAugspurger), but this PR does not yet fix the original case the other PR is fixing: concat with reindexed (thus all-missing) columns becoming object dtype instead of preserving the extension dtype.