What’s new in 0.23.0 (May 15, 2018)

This is a major release from 0.22.0 and includes a number of API changes, deprecations, new features, enhancements, and performance improvements along with a large number of bug fixes. We recommend that all users upgrade to this version.

Highlights include:

- Round-trippable JSON format with orient='table'
- Instantiation from dicts respects dict insertion order for Python 3.6+
- Dependent column arguments for DataFrame.assign()
- Merging / sorting on a combination of columns and index levels
- Extending pandas with custom types
- Excluding unobserved categories from groupby
- Changes to make output shape of DataFrame.apply consistent
- Deprecation of Panel

Check the API Changes and deprecations before updating.

Warning

Starting January 1, 2019, pandas feature releases will support Python 3 only. See Dropping Python 2.7 for more.


New features#

JSON read/write round-trippable with orient='table'#

A DataFrame can now be written to and subsequently read back via JSON while preserving metadata through usage of the orient='table' argument (see GH 18912 and GH 9146). Previously, none of the available orient values guaranteed the preservation of dtypes and index names, amongst other metadata.

In [1]: df = pd.DataFrame({'foo': [1, 2, 3, 4],
   ...:                    'bar': ['a', 'b', 'c', 'd'],
   ...:                    'baz': pd.date_range('2018-01-01', freq='d', periods=4),
   ...:                    'qux': pd.Categorical(['a', 'b', 'c', 'c'])},
   ...:                   index=pd.Index(range(4), name='idx'))

In [2]: df
Out[2]:
     foo bar        baz qux
idx
0      1   a 2018-01-01   a
1      2   b 2018-01-02   b
2      3   c 2018-01-03   c
3      4   d 2018-01-04   c

[4 rows x 4 columns]

In [3]: df.dtypes
Out[3]:
foo             int64
bar            object
baz    datetime64[ns]
qux          category
Length: 4, dtype: object

In [4]: df.to_json('test.json', orient='table')

In [5]: new_df = pd.read_json('test.json', orient='table')

In [6]: new_df
Out[6]:
     foo bar        baz qux
idx
0      1   a 2018-01-01   a
1      2   b 2018-01-02   b
2      3   c 2018-01-03   c
3      4   d 2018-01-04   c

[4 rows x 4 columns]

In [7]: new_df.dtypes
Out[7]:
foo             int64
bar            object
baz    datetime64[ns]
qux          category
Length: 4, dtype: object

Please note that the string 'index' is not supported as an index name with the round-trip format, as it is used by default in write_json to indicate a missing index name.

In [8]: df.index.name = 'index'

In [9]: df.to_json('test.json', orient='table')

In [10]: new_df = pd.read_json('test.json', orient='table')

In [11]: new_df
Out[11]:
  foo bar        baz qux
0   1   a 2018-01-01   a
1   2   b 2018-01-02   b
2   3   c 2018-01-03   c
3   4   d 2018-01-04   c

[4 rows x 4 columns]

In [12]: new_df.dtypes
Out[12]:
foo             int64
bar            object
baz    datetime64[ns]
qux          category
Length: 4, dtype: object

Method .assign() accepts dependent arguments#

DataFrame.assign() now accepts dependent keyword arguments for Python versions 3.6 and later (see also PEP 468). Later keyword arguments may now refer to earlier ones if the argument is a callable. See the documentation here. (GH 14207)

In [13]: df = pd.DataFrame({'A': [1, 2, 3]})

In [14]: df
Out[14]:
   A
0  1
1  2
2  3

[3 rows x 1 columns]

In [15]: df.assign(B=df.A, C=lambda x: x['A'] + x['B'])
Out[15]:
   A  B  C
0  1  1  2
1  2  2  4
2  3  3  6

[3 rows x 3 columns]

Warning

This may subtly change the behavior of your code when you’re using .assign() to update an existing column. Previously, callables referring to other variables being updated would get the “old” values.

Previous behavior:

In [2]: df = pd.DataFrame({"A": [1, 2, 3]})

In [3]: df.assign(A=lambda df: df.A + 1, C=lambda df: df.A * -1)
Out[3]:
   A  C
0  2 -1
1  3 -2
2  4 -3

New behavior:

In [16]: df.assign(A=df.A + 1, C=lambda df: df.A * -1)
Out[16]:
   A  C
0  2 -2
1  3 -3
2  4 -4

[3 rows x 2 columns]
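
If you relied on the previous semantics, one way to keep them under the new behavior is to split the update into two chained .assign() calls so the callable for C still sees the original A (a minimal sketch; the intermediate copy is the cost of this approach):

# Compute C from the original A, then update A in a second step.
df.assign(C=lambda d: d.A * -1).assign(A=lambda d: d.A + 1)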

Merging on a combination of columns and index levels#

Strings passed to DataFrame.merge() as the on, left_on, and right_on parameters may now refer to either column names or index level names. This enables merging DataFrame instances on a combination of index levels and columns without resetting indexes. See the Merge on columns and levels documentation section. (GH 14355)

In [17]: left_index = pd.Index(['K0', 'K0', 'K1', 'K2'], name='key1')

In [18]: left = pd.DataFrame({'A': ['A0', 'A1', 'A2', 'A3'],
   ....:                      'B': ['B0', 'B1', 'B2', 'B3'],
   ....:                      'key2': ['K0', 'K1', 'K0', 'K1']},
   ....:                     index=left_index)

In [19]: right_index = pd.Index(['K0', 'K1', 'K2', 'K2'], name='key1')

In [20]: right = pd.DataFrame({'C': ['C0', 'C1', 'C2', 'C3'],
   ....:                       'D': ['D0', 'D1', 'D2', 'D3'],
   ....:                       'key2': ['K0', 'K0', 'K0', 'K1']},
   ....:                      index=right_index)

In [21]: left.merge(right, on=['key1', 'key2'])
Out[21]:
       A   B key2   C   D
key1
K0    A0  B0   K0  C0  D0
K1    A2  B2   K0  C1  D1
K2    A3  B3   K1  C3  D3

[3 rows x 5 columns]
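
Previously, the same result required materializing the index level as a column on both sides, merging, and then restoring the index; a sketch of the pre-0.23 workaround:

# Pre-0.23 equivalent: reset the index, merge on plain columns, re-index.
left.reset_index().merge(right.reset_index(),
                         on=['key1', 'key2']).set_index('key1')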

Sorting by a combination of columns and index levels#

Strings passed to DataFrame.sort_values() as the by parameter may now refer to either column names or index level names. This enables sorting DataFrame instances by a combination of index levels and columns without resetting indexes. See the Sorting by Indexes and Values documentation section. (GH 14353)

Build MultiIndex

In [22]: idx = pd.MultiIndex.from_tuples([('a', 1), ('a', 2), ('a', 2),
   ....:                                   ('b', 2), ('b', 1), ('b', 1)])

In [23]: idx.names = ['first', 'second']

Build DataFrame

In [24]: df_multi = pd.DataFrame({'A': np.arange(6, 0, -1)},
   ....:                         index=idx)

In [25]: df_multi
Out[25]:
              A
first second
a     1       6
      2       5
      2       4
b     2       3
      1       2
      1       1

[6 rows x 1 columns]

Sort by 'second' (index) and 'A' (column)

In [26]: df_multi.sort_values(by=['second', 'A'])
Out[26]:
              A
first second
b     1       1
      1       2
a     1       6
b     2       3
a     2       4
      2       5

[6 rows x 1 columns]
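
Before 0.23, mixing index levels and columns in a sort likewise required a round trip through reset_index(); a sketch of the old workaround:

# Pre-0.23 equivalent: turn the index levels into columns, sort, restore.
(df_multi.reset_index()
         .sort_values(by=['second', 'A'])
         .set_index(['first', 'second']))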

Extending pandas with custom types (experimental)#

pandas now supports storing array-like objects that aren’t necessarily 1-D NumPy arrays as columns in a DataFrame or values in a Series. This allows third-party libraries to implement extensions to NumPy’s types, similar to how pandas implemented categoricals, datetimes with timezones, periods, and intervals.

As a demonstration, we’ll use cyberpandas, which provides an IPArray type for storing IP addresses.

In [1]: from cyberpandas import IPArray

In [2]: values = IPArray([
   ...:     0,
   ...:     3232235777,
   ...:     42540766452641154071740215577757643572
   ...: ])

IPArray isn’t a normal 1-D NumPy array, but because it’s a pandas ExtensionArray, it can be stored properly inside pandas’ containers.

In [3]: ser = pd.Series(values)

In [4]: ser
Out[4]:
0                         0.0.0.0
1                     192.168.1.1
2    2001:db8:85a3::8a2e:370:7334
dtype: ip

Notice that the dtype is ip. The missing value semantics of the underlying array are respected:

In [5]: ser.isna()
Out[5]:
0     True
1    False
2    False
dtype: bool

For more, see the extension types documentation. If you build an extension array, publicize it on the ecosystem page.
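
To give a feel for the interface, here is a minimal, hypothetical sketch of a custom type built on pandas.api.extensions. The UnitDtype/UnitArray names are illustrative only, the implementation is not production-ready, and a real array needs more methods than shown; see the extension types documentation for the authoritative list of what must be implemented.

# A minimal, hypothetical sketch of the extension interface; UnitDtype and
# UnitArray are illustrative names, not part of any real library.
import numpy as np
import pandas as pd
from pandas.api.extensions import ExtensionArray, ExtensionDtype

class UnitDtype(ExtensionDtype):
    # The dtype's name is what is shown as ser.dtype (like 'ip' above).
    name = 'unit'
    type = float
    na_value = np.nan

    @classmethod
    def construct_array_type(cls):
        return UnitArray

    @classmethod
    def construct_from_string(cls, string):
        if string == cls.name:
            return cls()
        raise TypeError("Cannot construct a 'UnitDtype' from '%s'" % string)

class UnitArray(ExtensionArray):
    # Backed by a plain 1-D float ndarray; real arrays may use any storage.
    def __init__(self, values):
        self._data = np.asarray(values, dtype=float)

    @classmethod
    def _from_sequence(cls, scalars, dtype=None, copy=False):
        return cls(scalars)

    @classmethod
    def _concat_same_type(cls, to_concat):
        return cls(np.concatenate([arr._data for arr in to_concat]))

    def __getitem__(self, item):
        result = self._data[item]
        return result if np.isscalar(result) else type(self)(result)

    def __len__(self):
        return len(self._data)

    @property
    def dtype(self):
        return UnitDtype()

    @property
    def nbytes(self):
        return self._data.nbytes

    def isna(self):
        # NaN serves as this sketch's missing-value sentinel.
        return np.isnan(self._data)

    def take(self, indices, allow_fill=False, fill_value=None):
        # Emulate pandas take semantics: -1 means "missing" when allow_fill.
        indices = np.asarray(indices)
        if allow_fill:
            fill_value = self.dtype.na_value if fill_value is None else fill_value
            out = self._data.take(np.where(indices < 0, 0, indices))
            out[indices < 0] = fill_value
            return type(self)(out)
        return type(self)(self._data.take(indices))

    def copy(self):
        return type(self)(self._data.copy())

As with IPArray above, such an array can then be wrapped in a Series via pd.Series(UnitArray([1.0, np.nan, 3.0])), and the dtype would report the custom name.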

New observed keyword for excluding unobserved categories in GroupBy#

Grouping by a categorical includes the unobserved categories in the output. When grouping by multiple categorical columns, this means you get the cartesian product of all the categories, including combinations where there are no observations, which can result in a large number of groups. We have added a keyword observed to control this behavior; it defaults to observed=False for backward compatibility. (GH 14942, GH 8138, GH 15217, GH 17594, GH 8669, GH 20583, GH 20902)

In [27]: cat1 = pd.Categorical(["a", "a", "b", "b"],
   ....:                       categories=["a", "b", "z"], ordered=True)

In [28]: cat2 = pd.Categorical(["c", "d", "c", "d"],
   ....:                       categories=["c", "d", "y"], ordered=True)

In [29]: df = pd.DataFrame({"A": cat1, "B": cat2, "values": [1, 2, 3, 4]})

In [30]: df['C'] = ['foo', 'bar'] * 2

In [31]: df
Out[31]:
   A  B  values    C
0  a  c       1  foo
1  a  d       2  bar
2  b  c       3  foo
3  b  d       4  bar

[4 rows x 4 columns]

To show all values, the previous behavior:

In [32]: df.groupby(['A', 'B', 'C'], observed=False).count()
Out[32]:
         values
A B C
a c bar       0
    foo       1
  d bar       1
    foo       0
  y bar       0
...         ...
z c foo       0
  d bar       0
    foo       0
  y bar       0
    foo       0

[18 rows x 1 columns]

To show only observed values:

In [33]: df.groupby(['A', 'B', 'C'], observed=True).count()
Out[33]:
         values
A B C
a c foo       1
  d bar       1
b c foo       1
  d bar       1

[4 rows x 1 columns]

For pivoting operations, this behavior is already controlled by the dropna keyword:

In [34]: cat1 = pd.Categorical(["a", "a", "b", "b"],
   ....:                       categories=["a", "b", "z"], ordered=True)

In [35]: cat2 = pd.Categorical(["c", "d", "c", "d"],
   ....:                       categories=["c", "d", "y"], ordered=True)

In [36]: df = pd.DataFrame({"A": cat1, "B": cat2, "values": [1, 2, 3, 4]})

In [37]: df
Out[37]:
   A  B  values
0  a  c       1
1  a  d       2
2  b  c       3
3  b  d       4

[4 rows x 3 columns]

In [1]: pd.pivot_table(df, values='values', index=['A', 'B'], dropna=True)
Out[1]:
     values
A B
a c     1.0
  d     2.0
b c     3.0
  d     4.0

In [2]: pd.pivot_table(df, values='values', index=['A', 'B'], dropna=False)
Out[2]:
     values
A B
a c     1.0
  d     2.0
  y     NaN
b c     3.0
  d     4.0
  y     NaN
z c     NaN
  d     NaN
  y     NaN

Rolling/Expanding.apply() accepts raw=False to pass a Series to the function#

Series.rolling().apply(), DataFrame.rolling().apply(), Series.expanding().apply(), and DataFrame.expanding().apply() have gained a raw=None parameter. This is similar to DataFrame.apply(). If raw=True, a np.ndarray is sent to the applied function; if raw=False, a Series is passed. The default is None, which preserves backward compatibility, so this currently defaults to True, sending an np.ndarray. In a future version the default will be changed to False, sending a Series. (GH 5071, GH 20584)

In [38]: s = pd.Series(np.arange(5), np.arange(5) + 1)

In [39]: s
Out[39]:
1    0
2    1
3    2
4    3
5    4
Length: 5, dtype: int64

Pass a Series:

In [40]: s.rolling(2, min_periods=1).apply(lambda x: x.iloc[-1], raw=False)
Out[40]:
1    0.0
2    1.0
3    2.0
4    3.0
5    4.0
Length: 5, dtype: float64

Mimic the original behavior of passing a ndarray:

In [41]: s.rolling(2, min_periods=1).apply(lambda x: x[-1], raw=True)
Out[41]:
1    0.0
2    1.0
3    2.0
4    3.0
5    4.0
Length: 5, dtype: float64

DataFrame.interpolate has gained the limit_area kwarg#

DataFrame.interpolate() has gained a limit_area parameter to allow further control over which NaNs are replaced. Use limit_area='inside' to fill only NaNs surrounded by valid values, or use limit_area='outside' to fill only NaNs outside the existing valid values while preserving those inside. See the full documentation here. (GH 16284)

In [42]: ser = pd.Series([np.nan, np.nan, 5, np.nan, np.nan,
   ....:                  np.nan, 13, np.nan, np.nan])

In [43]: ser
Out[43]:
0     NaN
1     NaN
2     5.0
3     NaN
4     NaN
5     NaN
6    13.0
7     NaN
8     NaN
Length: 9, dtype: float64

Fill one consecutive inside value in both directions

In [44]: ser.interpolate(limit_direction='both', limit_area='inside', limit=1)
Out[44]:
0     NaN
1     NaN
2     5.0
3     7.0
4     NaN
5    11.0
6    13.0
7     NaN
8     NaN
Length: 9, dtype: float64

Fill all consecutive outside values backward

In [45]: ser.interpolate(limit_direction='backward', limit_area='outside')
Out[45]:
0     5.0
1     5.0
2     5.0
3     NaN
4     NaN
5     NaN
6    13.0
7     NaN
8     NaN
Length: 9, dtype: float64

Fill all consecutive outside values in both directions

In [46]: ser.interpolate(limit_direction='both', limit_area='outside')
Out[46]:
0     5.0
1     5.0
2     5.0
3     NaN
4     NaN
5     NaN
6    13.0
7    13.0
8    13.0
Length: 9, dtype: float64

Function get_dummies now supports dtype argument#

get_dummies() now accepts a dtype argument, which specifies a dtype for the new columns. The default remains uint8. (GH 18330)

In [47]: df = pd.DataFrame({'a': [1, 2], 'b': [3, 4], 'c': [5, 6]})

In [48]: pd.get_dummies(df, columns=['c']).dtypes
Out[48]:
a      int64
b      int64
c_5     bool
c_6     bool
Length: 4, dtype: object

In [49]: pd.get_dummies(df, columns=['c'], dtype=bool).dtypes
Out[49]:
a      int64
b      int64
c_5     bool
c_6     bool
Length: 4, dtype: object

Timedelta mod method#

mod (%) and divmod operations are now defined on Timedelta objects when operating with either timedelta-like or with numeric arguments. See the documentation here. (GH 19365)

In [50]: td = pd.Timedelta(hours=37)

In [51]: td % pd.Timedelta(minutes=45)
Out[51]: Timedelta('0 days 00:15:00')
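
Since divmod is also defined, the quotient and the remainder can be obtained in one step; a small sketch continuing the example above (37 hours is 49 full 45-minute blocks with 15 minutes left over):

# divmod pairs the floor-division quotient with the % remainder shown above;
# the result is roughly (49, Timedelta('0 days 00:15:00')).
divmod(td, pd.Timedelta(minutes=45))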

Method .rank() handles inf values when NaN are present#

In previous versions, .rank() would assign inf elements NaN as their ranks. Now ranks are calculated properly. (GH 6945)

In [52]: s = pd.Series([-np.inf, 0, 1, np.nan, np.inf])

In [53]: s
Out[53]:
0   -inf
1    0.0
2    1.0
3    NaN
4    inf
Length: 5, dtype: float64

Previous behavior:

In [11]: s.rank()
Out[11]:
0    1.0
1    2.0
2    3.0
3    NaN
4    NaN
dtype: float64

Current behavior:

In [54]: s.rank()
Out[54]:
0    1.0
1    2.0
2    3.0
3    NaN
4    4.0
Length: 5, dtype: float64

Furthermore, previously if you ranked inf or -inf values together with NaN values, the calculation wouldn’t distinguish NaN from infinity when using the 'top' or 'bottom' argument.

In [55]: s = pd.Series([np.nan, np.nan, -np.inf, -np.inf])

In [56]: s
Out[56]:
0    NaN
1    NaN
2   -inf
3   -inf
Length: 4, dtype: float64

Previous behavior:

In [15]: s.rank(na_option='top')
Out[15]:
0    2.5
1    2.5
2    2.5
3    2.5
dtype: float64

Current behavior:

In [57]: s.rank(na_option='top')
Out[57]:
0    1.5
1    1.5
2    3.5
3    3.5
Length: 4, dtype: float64

These bugs were squashed:

Series.str.cat has gained the join kwarg#

Previously, Series.str.cat() did not – in contrast to most of pandas – align Series on their index before concatenation (see GH 18657). The method has now gained a keyword join to control the manner of alignment; see the examples below and here.

In v0.23, join defaults to None (meaning no alignment), but this default will change to 'left' in a future version of pandas.

In [58]: s = pd.Series(['a', 'b', 'c', 'd'])

In [59]: t = pd.Series(['b', 'd', 'e', 'c'], index=[1, 3, 4, 2])

In [60]: s.str.cat(t)
Out[60]:
0    NaN
1     bb
2     cc
3     dd
Length: 4, dtype: object

In [61]: s.str.cat(t, join='left', na_rep='-')
Out[61]:
0    a-
1    bb
2    cc
3    dd
Length: 4, dtype: object
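
The other alignment modes follow the usual join semantics. For instance, assuming join='outer' (the keyword also accepted 'inner' and 'right'), index labels present in only one operand survive and are filled with na_rep; a sketch:

# Index 4 exists only in t, so the left side is filled with na_rep.
s.str.cat(t, join='outer', na_rep='-')
# 0    a-
# 1    bb
# 2    cc
# 3    dd
# 4    -e
# dtype: object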

Furthermore, Series.str.cat() now works for CategoricalIndex as well (previously raised a ValueError; see GH 20842).

DataFrame.astype performs column-wise conversion to Categorical#

DataFrame.astype() can now perform column-wise conversion to Categorical by supplying the string 'category' or a CategoricalDtype. Previously, attempting this would raise a NotImplementedError. See the Object creation section of the documentation for more details and examples. (GH 12860, GH 18099)

Supplying the string 'category' performs column-wise conversion, with only labels appearing in a given column set as categories:

In [62]: df = pd.DataFrame({'A': list('abca'), 'B': list('bccd')})

In [63]: df = df.astype('category')

In [64]: df['A'].dtype
Out[64]: CategoricalDtype(categories=['a', 'b', 'c'], ordered=False, categories_dtype=object)

In [65]: df['B'].dtype
Out[65]: CategoricalDtype(categories=['b', 'c', 'd'], ordered=False, categories_dtype=object)

Supplying a CategoricalDtype will make the categories in each column consistent with the supplied dtype:

In [66]: from pandas.api.types import CategoricalDtype

In [67]: df = pd.DataFrame({'A': list('abca'), 'B': list('bccd')})

In [68]: cdt = CategoricalDtype(categories=list('abcd'), ordered=True)

In [69]: df = df.astype(cdt)

In [70]: df['A'].dtype
Out[70]: CategoricalDtype(categories=['a', 'b', 'c', 'd'], ordered=True, categories_dtype=object)

In [71]: df['B'].dtype
Out[71]: CategoricalDtype(categories=['a', 'b', 'c', 'd'], ordered=True, categories_dtype=object)

Other enhancements#

Backwards incompatible API changes#

Dependencies have increased minimum versions#

We have updated our minimum supported versions of dependencies (GH 15184). If installed, we now require:

Package          Minimum Version  Required  Issue
python-dateutil  2.5.0            X         GH 15184
openpyxl         2.4.0                      GH 15184
beautifulsoup4   4.2.1                      GH 20082
setuptools       24.2.0                     GH 20698

Instantiation from dicts preserves dict insertion order for Python 3.6+#

Until Python 3.6, dicts in Python had no formally defined ordering. For Python 3.6 and later, dicts are ordered by insertion order; see PEP 468. pandas will use the dict’s insertion order when creating a Series or DataFrame from a dict if you’re using Python 3.6 or higher. (GH 19884)

Previous behavior (and current behavior if on Python < 3.6):

In [16]: pd.Series({'Income': 2000,
   ....:            'Expenses': -1500,
   ....:            'Taxes': -200,
   ....:            'Net result': 300})
Out[16]:
Expenses     -1500
Income        2000
Net result     300
Taxes         -200
dtype: int64

Note the Series above is ordered alphabetically by the index values.

New behavior (for Python >= 3.6):

In [72]: pd.Series({'Income': 2000,
   ....:            'Expenses': -1500,
   ....:            'Taxes': -200,
   ....:            'Net result': 300})
Out[72]:
Income        2000
Expenses     -1500
Taxes         -200
Net result     300
Length: 4, dtype: int64

Notice that the Series is now ordered by insertion order. This new behavior is used for all relevant pandas types (Series, DataFrame, SparseSeries and SparseDataFrame).

If you wish to retain the old behavior while using Python >= 3.6, you can use .sort_index():

In [73]: pd.Series({'Income': 2000,
   ....:            'Expenses': -1500,
   ....:            'Taxes': -200,
   ....:            'Net result': 300}).sort_index()
Out[73]:
Expenses     -1500
Income        2000
Net result     300
Taxes         -200
Length: 4, dtype: int64
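
On Python versions before 3.6, a collections.OrderedDict can be used to get the same insertion-ordered result (a sketch; a plain dict on those versions falls back to the sorted output shown earlier):

from collections import OrderedDict

# Keys come out in insertion order regardless of Python version.
pd.Series(OrderedDict([('Income', 2000), ('Expenses', -1500),
                       ('Taxes', -200), ('Net result', 300)]))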

Deprecate Panel#

Panel was deprecated in the 0.20.x release, showing as a DeprecationWarning. Using Panel will now show a FutureWarning. The recommended way to represent 3-D data is with a MultiIndex on a DataFrame via the to_frame() method or with the xarray package. pandas provides a to_xarray() method to automate this conversion (GH 13563, GH 18324).

In [75]: import pandas._testing as tm

In [76]: p = tm.makePanel()

In [77]: p
Out[77]:
<class 'pandas.core.panel.Panel'>
Dimensions: 3 (items) x 3 (major_axis) x 4 (minor_axis)
Items axis: ItemA to ItemC
Major_axis axis: 2000-01-03 00:00:00 to 2000-01-05 00:00:00
Minor_axis axis: A to D

Convert to a MultiIndex DataFrame

In [78]: p.to_frame()
Out[78]:
                     ItemA     ItemB     ItemC
major      minor
2000-01-03 A      0.469112  0.721555  0.404705
           B     -1.135632  0.271860 -1.039268
           C      0.119209  0.276232 -1.344312
           D     -2.104569  0.113648 -0.109050
2000-01-04 A     -0.282863 -0.706771  0.577046
           B      1.212112 -0.424972 -0.370647
           C     -1.044236 -1.087401  0.844885
           D     -0.494929 -1.478427  1.643563
2000-01-05 A     -1.509059 -1.039575 -1.715002
           B     -0.173215  0.567020 -1.157892
           C     -0.861849 -0.673690  1.075770
           D      1.071804  0.524988 -1.469388

[12 rows x 3 columns]

Convert to an xarray DataArray

In [79]: p.to_xarray()
Out[79]:
<xarray.DataArray (items: 3, major_axis: 3, minor_axis: 4)>
array([[[ 0.469112, -1.135632,  0.119209, -2.104569],
        [-0.282863,  1.212112, -1.044236, -0.494929],
        [-1.509059, -0.173215, -0.861849,  1.071804]],

       [[ 0.721555,  0.27186 ,  0.276232,  0.113648],
        [-0.706771, -0.424972, -1.087401, -1.478427],
        [-1.039575,  0.56702 , -0.67369 ,  0.524988]],

       [[ 0.404705, -1.039268, -1.344312, -0.10905 ],
        [ 0.577046, -0.370647,  0.844885,  1.643563],
        [-1.715002, -1.157892,  1.07577 , -1.469388]]])
Coordinates:
  * items       (items) object 'ItemA' 'ItemB' 'ItemC'
  * major_axis  (major_axis) datetime64[ns] 2000-01-03 2000-01-04 2000-01-05
  * minor_axis  (minor_axis) object 'A' 'B' 'C' 'D'

pandas.core.common removals#

The following error & warning messages are removed from pandas.core.common (GH 13634, GH 19769):

These are available for import from pandas.errors (and have been since 0.19.0).
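
For example, assuming PerformanceWarning is among the relocated names (it has been importable from pandas.errors since 0.19.0):

# Import from the public pandas.errors location, not pandas.core.common.
from pandas.errors import PerformanceWarning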

Changes to make output of DataFrame.apply consistent#

DataFrame.apply() was inconsistent when applying an arbitrary user-defined function that returned a list-like with axis=1. Several bugs and inconsistencies are resolved. If the applied function returns a Series, then pandas will return a DataFrame; otherwise a Series will be returned. This includes the case where a list-like (e.g. a tuple or list) is returned. (GH 16353, GH 17437, GH 17970, GH 17348, GH 17892, GH 18573, GH 17602, GH 18775, GH 18901, GH 18919)

In [74]: df = pd.DataFrame(np.tile(np.arange(3), 6).reshape(6, -1) + 1,
   ....:                   columns=['A', 'B', 'C'])

In [75]: df
Out[75]:
   A  B  C
0  1  2  3
1  1  2  3
2  1  2  3
3  1  2  3
4  1  2  3
5  1  2  3

[6 rows x 3 columns]

Previous behavior: if the returned shape happened to match the length of original columns, this would return a DataFrame. If the return shape did not match, a Series with lists was returned.

In [3]: df.apply(lambda x: [1, 2, 3], axis=1)
Out[3]:
   A  B  C
0  1  2  3
1  1  2  3
2  1  2  3
3  1  2  3
4  1  2  3
5  1  2  3

In [4]: df.apply(lambda x: [1, 2], axis=1)
Out[4]:
0    [1, 2]
1    [1, 2]
2    [1, 2]
3    [1, 2]
4    [1, 2]
5    [1, 2]
dtype: object

New behavior: When the applied function returns a list-like, this will now always return a Series.

In [76]: df.apply(lambda x: [1, 2, 3], axis=1)
Out[76]:
0    [1, 2, 3]
1    [1, 2, 3]
2    [1, 2, 3]
3    [1, 2, 3]
4    [1, 2, 3]
5    [1, 2, 3]
Length: 6, dtype: object

In [77]: df.apply(lambda x: [1, 2], axis=1)
Out[77]:
0    [1, 2]
1    [1, 2]
2    [1, 2]
3    [1, 2]
4    [1, 2]
5    [1, 2]
Length: 6, dtype: object

To have expanded columns, you can use result_type='expand':

In [78]: df.apply(lambda x: [1, 2, 3], axis=1, result_type='expand')
Out[78]:
   0  1  2
0  1  2  3
1  1  2  3
2  1  2  3
3  1  2  3
4  1  2  3
5  1  2  3

[6 rows x 3 columns]

To broadcast the result across the original columns (the old behaviour for list-likes of the correct length), you can use result_type='broadcast'. The shape must match the original columns.

In [79]: df.apply(lambda x: [1, 2, 3], axis=1, result_type='broadcast')
Out[79]:
   A  B  C
0  1  2  3
1  1  2  3
2  1  2  3
3  1  2  3
4  1  2  3
5  1  2  3

[6 rows x 3 columns]

Returning a Series allows one to control the exact return structure and column names:

In [80]: df.apply(lambda x: pd.Series([1, 2, 3], index=['D', 'E', 'F']), axis=1)
Out[80]:
   D  E  F
0  1  2  3
1  1  2  3
2  1  2  3
3  1  2  3
4  1  2  3
5  1  2  3

[6 rows x 3 columns]

Concatenation will no longer sort#

In a future version of pandas, pandas.concat() will no longer sort the non-concatenation axis when it is not already aligned. The current behavior is the same as the previous (sorting), but a warning is now issued when sort is not specified and the non-concatenation axis is not aligned (GH 4588).

In [81]: df1 = pd.DataFrame({"a": [1, 2], "b": [1, 2]}, columns=['b', 'a'])

In [82]: df2 = pd.DataFrame({"a": [4, 5]})

In [83]: pd.concat([df1, df2])
Out[83]:
     b  a
0  1.0  1
1  2.0  2
0  NaN  4
1  NaN  5

[4 rows x 2 columns]

To keep the previous behavior (sorting) and silence the warning, pass sort=True:

In [84]: pd.concat([df1, df2], sort=True)
Out[84]:
   a    b
0  1  1.0
1  2  2.0
0  4  NaN
1  5  NaN

[4 rows x 2 columns]

To accept the future behavior (no sorting), pass sort=False:
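
A sketch of the future-proof call; with sort=False the column order follows the first DataFrame instead of being sorted:

pd.concat([df1, df2], sort=False)
#      b  a
# 0  1.0  1
# 1  2.0  2
# 0  NaN  4
# 1  NaN  5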

Note that this change also applies to DataFrame.append(), which has also received a sort keyword for controlling this behavior.

Build changes#

Index division by zero fills correctly#

Division operations on Index and subclasses will now fill division of positive numbers by zero with np.inf, division of negative numbers by zero with -np.inf and 0 / 0 with np.nan. This matches existing Series behavior. (GH 19322, GH 19347)

Previous behavior:

In [6]: index = pd.Int64Index([-1, 0, 1])

In [7]: index / 0
Out[7]: Int64Index([0, 0, 0], dtype='int64')

Previous behavior yielded different results depending on the type of zero in the divisor:

In [8]: index / 0.0
Out[8]: Float64Index([-inf, nan, inf], dtype='float64')

In [9]: index = pd.UInt64Index([0, 1])

In [10]: index / np.array([0, 0], dtype=np.uint64)
Out[10]: UInt64Index([0, 0], dtype='uint64')

In [11]: pd.RangeIndex(1, 5) / 0
ZeroDivisionError: integer division or modulo by zero

Current behavior:

In [12]: index = pd.Int64Index([-1, 0, 1])

# division by zero gives -infinity where negative,
# +infinity where positive, and NaN for 0 / 0
In [13]: index / 0

# The result of division by zero should not depend on
# whether the zero is int or float
In [14]: index / 0.0

In [15]: index = pd.UInt64Index([0, 1])

In [16]: index / np.array([0, 0], dtype=np.uint64)

In [17]: pd.RangeIndex(1, 5) / 0

Default value for the ordered parameter of CategoricalDtype#

The default value of the ordered parameter for CategoricalDtype has changed from False to None to allow updating of categories without impacting ordered. Behavior should remain consistent for downstream objects, such as Categorical. (GH 18790)

In previous versions, the default value for the ordered parameter was False. This could potentially lead to the ordered parameter unintentionally being changed from True to False when users attempt to update categories if ordered is not explicitly specified, as it would silently default to False. The new behavior for ordered=None is to retain the existing value of ordered.

New behavior:

In [2]: from pandas.api.types import CategoricalDtype

In [3]: cat = pd.Categorical(list('abcaba'), ordered=True, categories=list('cba'))

In [4]: cat
Out[4]:
[a, b, c, a, b, a]
Categories (3, object): [c < b < a]

In [5]: cdt = CategoricalDtype(categories=list('cbad'))

In [6]: cat.astype(cdt)
Out[6]:
[a, b, c, a, b, a]
Categories (4, object): [c < b < a < d]

Notice in the example above that the converted Categorical has retained ordered=True. Had the default value for ordered remained as False, the converted Categorical would have become unordered, despite ordered=False never being explicitly specified. To change the value of ordered, explicitly pass it to the new dtype, e.g. CategoricalDtype(categories=list('cbad'), ordered=False).

Note that the unintentional conversion of ordered discussed above did not arise in previous versions due to separate bugs that prevented astype from doing any type of category to category conversion (GH 10696, GH 18593). These bugs have been fixed in this release, and motivated changing the default value of ordered.

Better pretty-printing of DataFrames in a terminal#

Previously, the default value for the maximum number of columns was pd.options.display.max_columns=20. This meant that relatively wide data frames would not fit within the terminal width, and pandas would introduce line breaks to display these 20 columns. This resulted in an output that was relatively difficult to read:

[figure: print_df_old.png — the old wrapped output of a wide DataFrame]

If Python runs in a terminal, the maximum number of columns is now determined automatically so that the printed data frame fits within the current terminal width (pd.options.display.max_columns=0) (GH 17023). If Python runs as a Jupyter kernel (such as the Jupyter QtConsole or a Jupyter notebook, as well as in many IDEs), this value cannot be inferred automatically and is thus set to 20 as in previous versions. In a terminal, this results in a much nicer output:

[figure: print_df_new.png — the same DataFrame truncated to the terminal width]

Note that if you don’t like the new default, you can always set this option yourself. To revert to the old setting, you can run this line:

pd.options.display.max_columns = 20

Datetimelike API changes#

Other API changes#

Deprecations#

Removal of prior version deprecations/changes#

Performance improvements#

Documentation changes#

Thanks to all of the contributors who participated in the pandas Documentation Sprint, which took place on March 10th. We had about 500 participants from over 30 locations across the world. You should notice that many of the API docstrings have greatly improved.

There were too many simultaneous contributions to include a release note for each improvement, but this GitHub search should give you an idea of how many docstrings were improved.

Special thanks to Marc Garcia for organizing the sprint. For more information, read the NumFOCUS blogpost recapping the sprint.

Bug fixes#

Categorical#

Warning

A class of bugs was introduced in pandas 0.21 with CategoricalDtype that affects the correctness of operations like merge, concat, and indexing when comparing multiple unordered Categorical arrays that have the same categories, but in a different order. We highly recommend upgrading or manually aligning your categories before doing these operations.
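
One way to align categories by hand before such operations is Categorical.set_categories(); a minimal sketch with illustrative c1/c2 names:

# Two unordered categoricals with the same categories in different orders.
c1 = pd.Categorical(['a', 'b'], categories=['a', 'b'])
c2 = pd.Categorical(['a', 'b'], categories=['b', 'a'])

# Reorder c2's categories to match c1 before merge/concat/indexing.
c2 = c2.set_categories(c1.categories)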

Datetimelike#

Timedelta#

Timezones#

Offsets#

Numeric#

Strings#

Indexing#

MultiIndex#

IO#

Plotting#

GroupBy/resample/rolling#

Sparse#

Reshaping#

Other#

Contributors#

A total of 328 people contributed patches to this release. People with a “+” by their names contributed a patch for the first time.