What’s new in 2.1.0 (Aug 30, 2023)#

These are the changes in pandas 2.1.0. See Release notes for a full changelog including other versions of pandas.

Enhancements#

PyArrow will become a required dependency with pandas 3.0#

PyArrow will become a required dependency of pandas starting with pandas 3.0. This decision was made based on PDEP 10.

This will enable more changes that are hugely beneficial to pandas users, including but not limited to:

  * inferring strings as PyArrow backed strings by default, enabling a significant reduction of the memory footprint and huge performance improvements
  * inferring more complex dtypes with PyArrow by default, like Decimal, lists, bytes, structured data and more
  * better interoperability with other dataframe libraries that are based on Arrow

We are collecting feedback on this decision here.

Avoid NumPy object dtype for strings by default#

Previously, all strings were stored in columns with NumPy object dtype by default. This release introduces an option future.infer_string that infers all strings as PyArrow backed strings with dtype "string[pyarrow_numpy]" instead. This is a new string dtype implementation that follows NumPy semantics in comparison operations and will return np.nan as the missing value indicator. Setting the option will also infer the dtype "string" as a StringDtype with storage set to "pyarrow_numpy", ignoring the value of the option mode.string_storage.

This option only works if PyArrow is installed. PyArrow backed strings have a significantly reduced memory footprint and provide a big performance improvement compared to NumPy object (GH 54430).
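As a rough illustration of the footprint difference (a sketch, assuming PyArrow is installed; exact numbers vary by platform and pandas/PyArrow version):

import pandas as pd

ser_object = pd.Series(["pandas"] * 1_000, dtype=object)
ser_arrow = pd.Series(["pandas"] * 1_000, dtype="string[pyarrow_numpy]")

# object dtype stores one Python str object per row; the PyArrow backed
# dtype stores the data in a contiguous Arrow buffer.
print(ser_object.memory_usage(deep=True))
print(ser_arrow.memory_usage(deep=True))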

The option can be enabled with:

pd.options.future.infer_string = True

This behavior will become the default with pandas 3.0.
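For example, with the option enabled, newly created string data is inferred as the PyArrow backed dtype instead of NumPy object (a minimal sketch, assuming PyArrow is installed):

import pandas as pd

pd.options.future.infer_string = True

ser = pd.Series(["a", "b", "c"])
# Expected dtype: string (a StringDtype with storage "pyarrow_numpy"),
# rather than object.
print(ser.dtype)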

DataFrame reductions preserve extension dtypes#

In previous versions of pandas, the results of DataFrame reductions (DataFrame.sum(), DataFrame.mean(), etc.) had NumPy dtypes, even when the DataFrames were of extension dtypes. pandas can now keep the dtypes when doing reductions over DataFrame columns with a common dtype (GH 52788).

Old Behavior

In [1]: df = pd.DataFrame({"a": [1, 1, 2, 1], "b": [np.nan, 2.0, 3.0, 4.0]}, dtype="Int64")

In [2]: df.sum()
Out[2]:
a    5
b    9
dtype: int64

In [3]: df = df.astype("int64[pyarrow]")

In [4]: df.sum()
Out[4]:
a    5
b    9
dtype: int64

New Behavior

In [1]: df = pd.DataFrame({"a": [1, 1, 2, 1], "b": [np.nan, 2.0, 3.0, 4.0]}, dtype="Int64")

In [2]: df.sum()
Out[2]:
a    5
b    9
dtype: Int64

In [3]: df = df.astype("int64[pyarrow]")

In [4]: df.sum()
Out[4]:
a    5
b    9
dtype: int64[pyarrow]

Notice that the dtype is now a masked dtype and PyArrow dtype, respectively, while previously it was a NumPy integer dtype.

To allow DataFrame reductions to preserve extension dtypes, ExtensionArray._reduce() has gotten a new keyword parameter keepdims. Calling ExtensionArray._reduce() with keepdims=True should return an array of length 1 along the reduction axis. In order to maintain backward compatibility, the parameter is not required, but it will become required in the future. If the parameter is not found in the signature, DataFrame reductions cannot preserve extension dtypes; in that case a FutureWarning will be emitted, and type checkers like mypy may complain that the signature is not compatible with ExtensionArray._reduce().
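To illustrate the keepdims semantics, here is a sketch that calls the private method directly on a nullable integer array (for demonstration only; ExtensionArray authors would instead add the parameter to their own _reduce implementation):

import pandas as pd

arr = pd.array([1, 2, None], dtype="Int64")

# keepdims=False (the default) returns a scalar ...
print(arr._reduce("sum", skipna=True))                 # 3
# ... while keepdims=True returns a length-1 array of the same extension
# dtype, which is what lets DataFrame reductions preserve the dtype.
print(arr._reduce("sum", skipna=True, keepdims=True))  # <IntegerArray> [3]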

Copy-on-Write improvements#

New DataFrame.map() method and support for ExtensionArrays#

DataFrame.map() has been added and DataFrame.applymap() has been deprecated. DataFrame.map() has the same functionality as DataFrame.applymap(), but the new name better communicates that this is the DataFrame version of Series.map() (GH 52353).

When given a callable, Series.map() applies the callable to all elements of the Series. Similarly, DataFrame.map() applies the callable to all elements of the DataFrame, while Index.map() applies the callable to all elements of the Index.
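For example (a small sketch of the shared elementwise semantics across the three containers):

import pandas as pd

ser = pd.Series([1.12, 2.56])
df = pd.DataFrame({"a": [1.12, 2.56]})
idx = pd.Index([1.12, 2.56])

# The same callable applied elementwise to each container:
print(ser.map(lambda x: round(x, 1)))
print(df.map(lambda x: round(x, 1)))   # previously df.applymap(...)
print(idx.map(lambda x: round(x, 1)))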

Frequently, it is not desirable to apply the callable to NaN-like values of the array; to avoid that, the map method can be called with na_action="ignore", i.e. ser.map(func, na_action="ignore"). However, na_action="ignore" was not implemented for many ExtensionArray and Index types, and it did not work correctly for any ExtensionArray subclass except the nullable numeric ones (i.e. with dtype Int64 etc.).

na_action="ignore" now works for all array types (GH 52219, GH 51645, GH 51809, GH 51936, GH 52033; GH 52096).

Previous behavior:

In [1]: ser = pd.Series(["a", "b", np.nan], dtype="category")

In [2]: ser.map(str.upper, na_action="ignore")
NotImplementedError

In [3]: df = pd.DataFrame(ser)

In [4]: df.applymap(str.upper, na_action="ignore")  # worked for DataFrame
     0
0    A
1    B
2  NaN

In [5]: idx = pd.Index(ser)

In [6]: idx.map(str.upper, na_action="ignore")
TypeError: CategoricalIndex.map() got an unexpected keyword argument 'na_action'

New behavior:

In [5]: ser = pd.Series(["a", "b", np.nan], dtype="category")

In [6]: ser.map(str.upper, na_action="ignore")
Out[6]:
0      A
1      B
2    NaN
dtype: category
Categories (2, object): ['A', 'B']

In [7]: df = pd.DataFrame(ser)

In [8]: df.map(str.upper, na_action="ignore")
Out[8]:
     0
0    A
1    B
2  NaN

In [9]: idx = pd.Index(ser)

In [10]: idx.map(str.upper, na_action="ignore")
Out[10]: CategoricalIndex(['A', 'B', nan], categories=['A', 'B'], ordered=False, dtype='category')

Also note that Categorical.map() has implicitly had its default na_action set to "ignore". This has been deprecated, and the default for Categorical.map() will change to na_action=None, consistent with all the other array types.
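To write forward-compatible code, pass na_action explicitly instead of relying on the deprecated implicit default (a minimal sketch):

import pandas as pd

cat = pd.Categorical(["a", "b", None])

# Explicit na_action avoids the deprecation warning and keeps the current
# behavior of skipping missing values rather than passing them to the callable.
print(cat.map(str.upper, na_action="ignore"))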

New implementation of DataFrame.stack()#

pandas has reimplemented DataFrame.stack(). To use the new implementation, pass the argument future_stack=True. This will become the only option in pandas 3.0.

The previous implementation had two main behavioral downsides.

  1. The previous implementation would unnecessarily introduce NA values into the result. The user could have NA values automatically removed by passing dropna=True (the default), but doing this could also remove NA values from the result that existed in the input. See the examples below.
  2. The previous implementation with sort=True (the default) would sometimes sort part of the resulting index, and sometimes not. If the input’s columns are not a MultiIndex, then the resulting index would never be sorted. If the columns are a MultiIndex, then in most cases the level(s) in the resulting index that come from stacking the column level(s) would be sorted. In rare cases such level(s) would be sorted in a non-standard order, depending on how the columns were created.

The new implementation (future_stack=True) will no longer unnecessarily introduce NA values when stacking multiple levels and will never sort. As such, the arguments dropna and sort are not utilized and must remain unspecified when using future_stack=True. These arguments will be removed in the next major release.

In [11]: columns = pd.MultiIndex.from_tuples([("B", "d"), ("A", "c")])

In [12]: df = pd.DataFrame([[0, 2], [1, 3]], index=["z", "y"], columns=columns)

In [13]: df
Out[13]:
   B  A
   d  c
z  0  2
y  1  3

In the previous version (future_stack=False), the default of dropna=True would remove unnecessarily introduced NA values but still coerce the dtype to float64 in the process. In the new version, no NAs are introduced and so there is no coercion of the dtype.

In [14]: df.stack([0, 1], future_stack=False, dropna=True)
Out[14]:
z  A  c    2.0
   B  d    0.0
y  A  c    3.0
   B  d    1.0
dtype: float64

In [15]: df.stack([0, 1], future_stack=True)
Out[15]:
z  B  d    0
   A  c    2
y  B  d    1
   A  c    3
dtype: int64

If the input contains NA values, the previous version would drop those as well with dropna=True or introduce new NA values with dropna=False. The new version persists all values from the input.

In [16]: df = pd.DataFrame([[0, 2], [np.nan, np.nan]], columns=columns)

In [17]: df
Out[17]:
     B    A
     d    c
0  0.0  2.0
1  NaN  NaN

In [18]: df.stack([0, 1], future_stack=False, dropna=True)
Out[18]:
0  A  c    2.0
   B  d    0.0
dtype: float64

In [19]: df.stack([0, 1], future_stack=False, dropna=False)
Out[19]:
0  A  d    NaN
      c    2.0
   B  d    0.0
      c    NaN
1  A  d    NaN
      c    NaN
   B  d    NaN
      c    NaN
dtype: float64

In [20]: df.stack([0, 1], future_stack=True)
Out[20]:
0  B  d    0.0
   A  c    2.0
1  B  d    NaN
   A  c    NaN
dtype: float64
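If the old dropna=True result is desired, one option under the new implementation is to drop missing values explicitly after stacking (a sketch of one migration path, not the only one):

import numpy as np
import pandas as pd

columns = pd.MultiIndex.from_tuples([("B", "d"), ("A", "c")])
df = pd.DataFrame([[0, 2], [np.nan, np.nan]], columns=columns)

# Stack with the new implementation, then remove missing values explicitly,
# since the dropna argument cannot be used with future_stack=True.
print(df.stack([0, 1], future_stack=True).dropna())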

Other enhancements#

Backwards incompatible API changes#

Increased minimum version for Python#

pandas 2.1.0 supports Python 3.9 and higher.

Increased minimum versions for dependencies#

Some minimum supported versions of dependencies were updated. If installed, we now require:

Package                 Minimum Version   Required   Changed
numpy                   1.22.4            X          X
mypy (dev)              1.4.1                        X
beautifulsoup4          4.11.1                       X
bottleneck              1.3.4                        X
dataframe-api-compat    0.1.7                        X
fastparquet             0.8.1                        X
fsspec                  2022.05.0                    X
hypothesis              6.46.1                       X
gcsfs                   2022.05.0                    X
jinja2                  3.1.2                        X
lxml                    4.8.0                        X
numba                   0.55.2                       X
numexpr                 2.8.0                        X
openpyxl                3.0.10                       X
pandas-gbq              0.17.5                       X
psycopg2                2.9.3                        X
pyreadstat              1.1.5                        X
pyqt5                   5.15.6                       X
pytables                3.7.0                        X
pytest                  7.3.2                        X
python-snappy           0.6.1                        X
pyxlsb                  1.0.9                        X
s3fs                    2022.05.0                    X
scipy                   1.8.1                        X
sqlalchemy              1.4.36                       X
tabulate                0.8.10                       X
xarray                  2022.03.0                    X
xlsxwriter              3.0.3                        X
zstandard               0.17.0                       X

For optional libraries the general recommendation is to use the latest version.

See Dependencies and Optional dependencies for more.

Other API changes#

Deprecations#

Deprecated silent upcasting in setitem-like Series operations#

PDEP-6: https://pandas.pydata.org/pdeps/0006-ban-upcasting.html

Setitem-like operations on Series (or DataFrame columns) which silently upcast the dtype are deprecated and show a warning. Examples of affected operations are:

  * ser.fillna('foo')
  * ser.where(ser.isna(), 'foo')
  * ser.iloc[indexer] = 'foo'
  * ser.loc[indexer] = 'foo'
  * df.iloc[indexer, 0] = 'foo'
  * df.loc[indexer, 'a'] = 'foo'
  * ser[indexer] = 'foo'

where ser is a Series, df is a DataFrame, and indexer could be a slice, a mask, a single value, a list or array of values, or any other allowed indexer.

In a future version, these will raise an error and you should cast to a common dtype first.

Previous behavior:

In [1]: ser = pd.Series([1, 2, 3])

In [2]: ser
Out[2]:
0    1
1    2
2    3
dtype: int64

In [3]: ser[0] = 'not an int64'

In [4]: ser
Out[4]:
0    not an int64
1               2
2               3
dtype: object

New behavior:

In [1]: ser = pd.Series([1, 2, 3])

In [2]: ser
Out[2]:
0    1
1    2
2    3
dtype: int64

In [3]: ser[0] = 'not an int64'
FutureWarning:
  Setting an item of incompatible dtype is deprecated and will raise an error in a future version of pandas. Value 'not an int64' has dtype incompatible with int64, please explicitly cast to a compatible dtype first.

In [4]: ser
Out[4]:
0    not an int64
1               2
2               3
dtype: object

To retain the current behaviour, in the case above you could cast ser to object dtype first:

In [21]: ser = pd.Series([1, 2, 3])

In [22]: ser = ser.astype('object')

In [23]: ser[0] = 'not an int64'

In [24]: ser
Out[24]:
0    not an int64
1               2
2               3
dtype: object

Depending on the use-case, it might be more appropriate to cast to a different dtype. In the following, for example, we cast to float64:

In [25]: ser = pd.Series([1, 2, 3])

In [26]: ser = ser.astype('float64')

In [27]: ser[0] = 1.1

In [28]: ser
Out[28]:
0    1.1
1    2.0
2    3.0
dtype: float64

For further reading, please see https://pandas.pydata.org/pdeps/0006-ban-upcasting.html.

Deprecated parsing datetimes with mixed time zones#

Parsing datetimes with mixed time zones is deprecated and shows a warning unless the user passes utc=True to to_datetime() (GH 50887).

Previous behavior:

In [7]: data = ["2020-01-01 00:00:00+06:00", "2020-01-01 00:00:00+01:00"]

In [8]: pd.to_datetime(data, utc=False)
Out[8]: Index([2020-01-01 00:00:00+06:00, 2020-01-01 00:00:00+01:00], dtype='object')

New behavior:

In [9]: pd.to_datetime(data, utc=False)
FutureWarning: In a future version of pandas, parsing datetimes with mixed time zones will raise an error unless utc=True. Please specify utc=True to opt in to the new behaviour and silence this warning. To create a Series with mixed offsets and object dtype, please use apply and datetime.datetime.strptime.
Index([2020-01-01 00:00:00+06:00, 2020-01-01 00:00:00+01:00], dtype='object')

In order to silence this warning and avoid an error in a future version of pandas, please specify utc=True:

In [29]: data = ["2020-01-01 00:00:00+06:00", "2020-01-01 00:00:00+01:00"]

In [30]: pd.to_datetime(data, utc=True)
Out[30]: DatetimeIndex(['2019-12-31 18:00:00+00:00', '2019-12-31 23:00:00+00:00'], dtype='datetime64[ns, UTC]', freq=None)

To create a Series with mixed offsets and object dtype, please use apply and datetime.datetime.strptime:

In [31]: import datetime as dt

In [32]: data = ["2020-01-01 00:00:00+06:00", "2020-01-01 00:00:00+01:00"]

In [33]: pd.Series(data).apply(lambda x: dt.datetime.strptime(x, '%Y-%m-%d %H:%M:%S%z'))
Out[33]:
0    2020-01-01 00:00:00+06:00
1    2020-01-01 00:00:00+01:00
dtype: object

Other Deprecations#

Performance improvements#

Bug fixes#

Categorical#

Datetimelike#

Timedelta#

Timezones#

Numeric#

Conversion#

Strings#

Interval#

Indexing#

Missing#

MultiIndex#

I/O#

Period#

Plotting#

Groupby/resample/rolling#

Reshaping#

Sparse#

ExtensionArray#

Styler#

Metadata#

Other#

Contributors#

A total of 266 people contributed patches to this release. People with a “+” by their names contributed a patch for the first time.