What’s new in 1.5.0 (September 19, 2022)

These are the changes in pandas 1.5.0. See Release notes for a full changelog including other versions of pandas.

Enhancements#

pandas-stubs#

The pandas-stubs library is now supported by the pandas development team, providing type stubs for the pandas API. Please visit pandas-dev/pandas-stubs for more information.

We thank VirtusLab and Microsoft for their initial, significant contributions to pandas-stubs.
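As a rough, hypothetical illustration: with the stubs installed (for example via pip install pandas-stubs), static type checkers such as mypy can validate annotated code that uses the pandas API.

# example.py -- a hypothetical snippet checked with `mypy example.py`
import pandas as pd

def total(frame: pd.DataFrame, column: str) -> float:
    # pandas-stubs supplies the signatures that let mypy verify these calls.
    return float(frame[column].sum())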

Native PyArrow-backed ExtensionArray#

With pyarrow installed, users can now create pandas objects that are backed by a pyarrow.ChunkedArray and pyarrow.DataType.

The dtype argument can accept a string of a pyarrow data type with "pyarrow" in brackets, e.g. "int64[pyarrow]", or, for pyarrow data types that take parameters, an ArrowDtype initialized with a pyarrow.DataType.

In [1]: import pyarrow as pa

In [2]: ser_float = pd.Series([1.0, 2.0, None], dtype="float32[pyarrow]")

In [3]: ser_float
Out[3]: 
0     1.0
1     2.0
2    <NA>
dtype: float[pyarrow]

In [4]: list_of_int_type = pd.ArrowDtype(pa.list_(pa.int64()))

In [5]: ser_list = pd.Series([[1, 2], [3, None]], dtype=list_of_int_type)

In [6]: ser_list
Out[6]: 
0      [1. 2.]
1    [ 3. nan]
dtype: list<item: int64>[pyarrow]

In [7]: ser_list.take([1, 0])
Out[7]: 
1    [ 3. nan]
0      [1. 2.]
dtype: list<item: int64>[pyarrow]

In [8]: ser_float * 5
Out[8]: 
0     5.0
1    10.0
2    <NA>
dtype: float[pyarrow]

In [9]: ser_float.mean()
Out[9]: 1.5

In [10]: ser_float.dropna()
Out[10]: 
0    1.0
1    2.0
dtype: float[pyarrow]

Most operations are supported and have been implemented using pyarrow compute functions. We recommend installing the latest version of PyArrow to access the most recently implemented compute functions.

Warning

This feature is experimental, and the API can change in a future release without warning.

DataFrame interchange protocol implementation#

pandas now implements the DataFrame interchange API spec. See the full details on the API at https://data-apis.org/dataframe-protocol/latest/index.html

The protocol consists of two parts:

- A new method DataFrame.__dataframe__() which produces the interchange object. It effectively "exports" a pandas DataFrame as an interchange object, so any other library which implements the protocol can "import" that DataFrame without knowing anything about the producer other than that it complies with the API spec.
- A new function pandas.api.interchange.from_dataframe() which can take the interchange object from an arbitrary conforming library and construct a pandas DataFrame out of it.
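A minimal sketch of the round trip, using an illustrative DataFrame:

import pandas as pd

df = pd.DataFrame({"x": [1, 2, 3], "y": ["a", "b", "c"]})

# "Export" the pandas DataFrame as an interchange object ...
interchange_object = df.__dataframe__()

# ... and "import" it back; any library implementing the protocol could
# consume the same object without knowing it was produced by pandas.
roundtripped = pd.api.interchange.from_dataframe(interchange_object)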

Styler#

The most notable development is the new method Styler.concat(), which allows adding customised footer rows to visualise additional calculations on the data, e.g. totals and counts (GH 43875, GH 46186)

Additionally there is an alternative output method Styler.to_string(), which allows using the Styler’s formatting methods to create, for example, CSVs (GH 44502).

A new feature Styler.relabel_index() is also made available to provide full customisation of the display of index or column headers (GH 47864).
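A hedged sketch combining these methods, using made-up data:

import pandas as pd

df = pd.DataFrame({"Mike": [4, 1, 3], "Jim": [6, 9, 4]})

# Build a one-row "Totals" footer and append it below the main Styler.
totals = df.agg(["sum"]).style.relabel_index(["Totals"])
styler = df.style.concat(totals)

# Render with the Styler's formatting as delimited text, e.g. a CSV-like dump.
print(styler.to_string(delimiter=","))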

Minor feature improvements are:

Control of index with group_keys in DataFrame.resample()#

The argument group_keys has been added to the method DataFrame.resample(). As with DataFrame.groupby(), this argument controls whether each group is added to the index in the resample when Resampler.apply() is used.

Warning

Not specifying the group_keys argument will retain the previous behavior and emit a warning if the result will change by specifying group_keys=False. In a future version of pandas, not specifying group_keys will default to the same behavior as group_keys=False.

In [11]: df = pd.DataFrame(
   ....:     {'a': range(6)},
   ....:     index=pd.date_range("2021-01-01", periods=6, freq="8H")
   ....: )

In [12]: df.resample("D", group_keys=True).apply(lambda x: x)
Out[12]: 
                                a
2021-01-01 2021-01-01 00:00:00  0
           2021-01-01 08:00:00  1
           2021-01-01 16:00:00  2
2021-01-02 2021-01-02 00:00:00  3
           2021-01-02 08:00:00  4
           2021-01-02 16:00:00  5

In [13]: df.resample("D", group_keys=False).apply(lambda x: x)
Out[13]: 
                     a
2021-01-01 00:00:00  0
2021-01-01 08:00:00  1
2021-01-01 16:00:00  2
2021-01-02 00:00:00  3
2021-01-02 08:00:00  4
2021-01-02 16:00:00  5

Previously, the resulting index would depend upon the values returned by apply, as seen in the following example.

In [1]: # pandas 1.3
In [2]: df.resample("D").apply(lambda x: x)
Out[2]: 
                     a
2021-01-01 00:00:00  0
2021-01-01 08:00:00  1
2021-01-01 16:00:00  2
2021-01-02 00:00:00  3
2021-01-02 08:00:00  4
2021-01-02 16:00:00  5

In [3]: df.resample("D").apply(lambda x: x.reset_index())
Out[3]: 
                               index  a
2021-01-01 0 2021-01-01 00:00:00  0
           1 2021-01-01 08:00:00  1
           2 2021-01-01 16:00:00  2
2021-01-02 0 2021-01-02 00:00:00  3
           1 2021-01-02 08:00:00  4
           2 2021-01-02 16:00:00  5

from_dummies#

Added new function from_dummies() to convert a dummy coded DataFrame into a categorical DataFrame.

In [11]: import pandas as pd

In [12]: df = pd.DataFrame({"col1_a": [1, 0, 1], "col1_b": [0, 1, 0],
   ....:                    "col2_a": [0, 1, 0], "col2_b": [1, 0, 0],
   ....:                    "col2_c": [0, 0, 1]})

In [13]: pd.from_dummies(df, sep="_")
Out[13]: 
  col1 col2
0    a    b
1    b    a
2    a    c

Writing to ORC files#

The new method DataFrame.to_orc() allows writing to ORC files (GH 43864).

This functionality depends on the pyarrow library. For more details, see the IO docs on ORC.

Warning

df = pd.DataFrame(data={"col1": [1, 2], "col2": [3, 4]})
df.to_orc("./out.orc")
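As a hedged follow-up (also requiring pyarrow), the file can be read back with read_orc():

# Round-trip check: read the ORC file written above back into a DataFrame.
df_roundtrip = pd.read_orc("./out.orc")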

Reading directly from TAR archives#

I/O methods like read_csv() or DataFrame.to_json() now allow reading from and writing to TAR archives directly (GH 44787).

df = pd.read_csv("./movement.tar.gz")

...

df.to_csv("./out.tar.gz")

This supports .tar, .tar.gz, .tar.bz2 and .tar.xz archives. The compression method used is inferred from the filename. If the compression method cannot be inferred, use the compression argument:

df = pd.read_csv(some_file_obj, compression={"method": "tar", "mode": "r:gz"}) # noqa F821

(mode being one of tarfile.open’s modes: https://docs.python.org/3/library/tarfile.html#tarfile.open)

read_xml now supports dtype, converters, and parse_dates#

Similar to other IO methods, pandas.read_xml() now supports assigning specific dtypes to columns, applying converter methods, and parsing dates (GH 43567).

In [14]: from io import StringIO

In [15]: xml_dates = """<?xml version='1.0' encoding='utf-8'?>
   ....: <data>
   ....:   <row>
   ....:     <shape>square</shape>
   ....:     <degrees>00360</degrees>
   ....:     <sides>4.0</sides>
   ....:     <date>2020-01-01</date>
   ....:   </row>
   ....:   <row>
   ....:     <shape>circle</shape>
   ....:     <degrees>00360</degrees>
   ....:     <sides/>
   ....:     <date>2021-01-01</date>
   ....:   </row>
   ....:   <row>
   ....:     <shape>triangle</shape>
   ....:     <degrees>00180</degrees>
   ....:     <sides>3.0</sides>
   ....:     <date>2022-01-01</date>
   ....:   </row>
   ....: </data>"""

In [16]: df = pd.read_xml(
   ....:     StringIO(xml_dates),
   ....:     dtype={'sides': 'Int64'},
   ....:     converters={'degrees': str},
   ....:     parse_dates=['date']
   ....: )

In [17]: df
Out[17]: 
      shape degrees  sides       date
0    square   00360      4 2020-01-01
1    circle   00360   <NA> 2021-01-01
2  triangle   00180      3 2022-01-01

In [18]: df.dtypes
Out[18]: 
shape              object
degrees            object
sides               Int64
date       datetime64[ns]
dtype: object

read_xml now supports large XML using iterparse#

For very large XML files that can range from hundreds of megabytes to gigabytes, pandas.read_xml() now supports parsing such sizeable files using lxml's iterparse and etree's iterparse, which are memory-efficient methods to iterate through XML trees and extract specific elements and attributes without holding the entire tree in memory (GH 45442).

In [1]: df = pd.read_xml(
   ...:     "/path/to/downloaded/enwikisource-latest-pages-articles.xml",
   ...:     iterparse={"page": ["title", "ns", "id"]}
   ...: )

In [2]: df
Out[2]: 
                                                     title   ns        id
0                                       Gettysburg Address    0     21450
1                                                Main Page    0     42950
2                            Declaration by United Nations    0      8435
3             Constitution of the United States of America    0      8435
4                     Declaration of Independence (Israel)    0     17858
...                                                    ...  ...       ...
3578760               Page:Black cat 1897 07 v2 n10.pdf/17  104    219649
3578761               Page:Black cat 1897 07 v2 n10.pdf/43  104    219649
3578762               Page:Black cat 1897 07 v2 n10.pdf/44  104    219649
3578763      The History of Tom Jones, a Foundling/Book IX    0  12084291
3578764  Page:Shakespeare of Stratford (1926) Yale.djvu/91  104     21450

[3578765 rows x 3 columns]

Copy on Write#

A new feature copy_on_write was added (GH 46958). Copy on write ensures that any DataFrame or Series derived from another in any way always behaves as a copy. Copy on write disallows updating any object other than the one the method was applied to.

Copy on write can be enabled through:

pd.set_option("mode.copy_on_write", True)
pd.options.mode.copy_on_write = True

Alternatively, copy on write can be enabled locally through:

with pd.option_context("mode.copy_on_write", True): ...

Without copy on write, the parent DataFrame is updated when updating a child DataFrame that was derived from this DataFrame.

In [19]: df = pd.DataFrame({"foo": [1, 2, 3], "bar": 1})

In [20]: view = df["foo"]

In [21]: view.iloc[0]
Out[21]: 1

In [22]: df
Out[22]: 
   foo  bar
0    1    1
1    2    1
2    3    1

With copy on write enabled, df won’t be updated anymore:

In [23]: with pd.option_context("mode.copy_on_write", True):
   ....:     df = pd.DataFrame({"foo": [1, 2, 3], "bar": 1})
   ....:     view = df["foo"]
   ....:     view.iloc[0]
   ....:     df
   ....: 

A more detailed explanation can be found here.

Other enhancements#

Notable bug fixes#

These are bug fixes that might have notable behavior changes.

Using dropna=True with groupby transforms#

A transform is an operation whose result has the same size as its input. When the result is a DataFrame or Series, it is also required that the index of the result matches that of the input. In pandas 1.4, using DataFrameGroupBy.transform() or SeriesGroupBy.transform() with null values in the groups and dropna=True gave incorrect results. Demonstrated by the examples below, the incorrect results either contained incorrect values, or the result did not have the same index as the input.

In [24]: df = pd.DataFrame({'a': [1, 1, np.nan], 'b': [2, 3, 4]})

Old behavior:

In [3]: # Value in the last row should be np.nan
   ...: df.groupby('a', dropna=True).transform('sum')
Out[3]: 
   b
0  5
1  5
2  5

In [3]: # Should have one additional row with the value np.nan
   ...: df.groupby('a', dropna=True).transform(lambda x: x.sum())
Out[3]: 
   b
0  5
1  5

In [3]: # The value in the last row is np.nan interpreted as an integer
   ...: df.groupby('a', dropna=True).transform('ffill')
Out[3]: 
                     b
0                    2
1                    3
2 -9223372036854775808

In [3]: # Should have one additional row with the value np.nan
   ...: df.groupby('a', dropna=True).transform(lambda x: x)
Out[3]: 
   b
0  2
1  3

New behavior:

In [25]: df.groupby('a', dropna=True).transform('sum')
Out[25]: 
     b
0  5.0
1  5.0
2  NaN

In [26]: df.groupby('a', dropna=True).transform(lambda x: x.sum())
Out[26]: 
     b
0  5.0
1  5.0
2  NaN

In [27]: df.groupby('a', dropna=True).transform('ffill')
Out[27]: 
     b
0  2.0
1  3.0
2  NaN

In [28]: df.groupby('a', dropna=True).transform(lambda x: x)
Out[28]: 
     b
0  2.0
1  3.0
2  NaN

Serializing tz-naive Timestamps with to_json() with iso_dates=True#

DataFrame.to_json(), Series.to_json(), and Index.to_json() would incorrectly localize DatetimeArrays/DatetimeIndexes with tz-naive Timestamps to UTC (GH 38760).

Note that this patch does not fix the localization of tz-aware Timestamps to UTC upon serialization. (Related issue GH 12997)

Old Behavior

In [32]: index = pd.date_range(
   ....:     start='2020-12-28 00:00:00',
   ....:     end='2020-12-28 02:00:00',
   ....:     freq='1H',
   ....: )

In [33]: a = pd.Series(
   ....:     data=range(3),
   ....:     index=index,
   ....: )

In [4]: from io import StringIO

In [5]: a.to_json(date_format='iso')
Out[5]: '{"2020-12-28T00:00:00.000Z":0,"2020-12-28T01:00:00.000Z":1,"2020-12-28T02:00:00.000Z":2}'

In [6]: pd.read_json(StringIO(a.to_json(date_format='iso')), typ="series").index == a.index
Out[6]: array([False, False, False])

New Behavior

In [34]: from io import StringIO

In [35]: a.to_json(date_format='iso')
Out[35]: '{"2020-12-28T00:00:00.000Z":0,"2020-12-28T01:00:00.000Z":1,"2020-12-28T02:00:00.000Z":2}'

Roundtripping now works

In [36]: pd.read_json(StringIO(a.to_json(date_format='iso')), typ="series").index == a.index
Out[36]: array([ True,  True,  True])

DataFrameGroupBy.value_counts with non-grouping categorical columns and observed=True#

Calling DataFrameGroupBy.value_counts() with observed=True would incorrectly drop non-observed categories of non-grouping columns (GH 46357).

In [6]: df = pd.DataFrame(["a", "b", "c"], dtype="category").iloc[0:2]

In [7]: df
Out[7]: 
   0
0  a
1  b

Old Behavior

In [8]: df.groupby(level=0, observed=True).value_counts()
Out[8]: 
0  a    1
1  b    1
dtype: int64

New Behavior

In [9]: df.groupby(level=0, observed=True).value_counts()
Out[9]: 
0  a    1
1  a    0
   b    1
0  b    0
   c    0
1  c    0
dtype: int64

Backwards incompatible API changes#

Increased minimum versions for dependencies#

Some minimum supported versions of dependencies were updated. If installed, we now require:

Package Minimum Version Required Changed
numpy 1.20.3 X X
mypy (dev) 0.971 X
beautifulsoup4 4.9.3 X
blosc 1.21.0 X
bottleneck 1.3.2 X
fsspec 2021.07.0 X
hypothesis 6.13.0 X
gcsfs 2021.07.0 X
jinja2 3.0.0 X
lxml 4.6.3 X
numba 0.53.1 X
numexpr 2.7.3 X
openpyxl 3.0.7 X
pandas-gbq 0.15.0 X
psycopg2 2.8.6 X
pymysql 1.0.2 X
pyreadstat 1.1.2 X
pyxlsb 1.0.8 X
s3fs 2021.08.0 X
scipy 1.7.1 X
sqlalchemy 1.4.16 X
tabulate 0.8.9 X
xarray 0.19.0 X
xlsxwriter 1.4.3 X

For optional libraries the general recommendation is to use the latest version. The following table lists the lowest version per library that is currently being tested throughout the development of pandas. Optional libraries below the lowest tested version may still work, but are not considered supported.

Package Minimum Version Changed
beautifulsoup4 4.9.3 X
blosc 1.21.0 X
bottleneck 1.3.2 X
brotlipy 0.7.0
fastparquet 0.4.0
fsspec 2021.08.0 X
html5lib 1.1
hypothesis 6.13.0 X
gcsfs 2021.08.0 X
jinja2 3.0.0 X
lxml 4.6.3 X
matplotlib 3.3.2
numba 0.53.1 X
numexpr 2.7.3 X
odfpy 1.4.1
openpyxl 3.0.7 X
pandas-gbq 0.15.0 X
psycopg2 2.8.6 X
pyarrow 1.0.1
pymysql 1.0.2 X
pyreadstat 1.1.2 X
pytables 3.6.1
python-snappy 0.6.0
pyxlsb 1.0.8 X
s3fs 2021.08.0 X
scipy 1.7.1 X
sqlalchemy 1.4.16 X
tabulate 0.8.9 X
tzdata 2022a
xarray 0.19.0 X
xlrd 2.0.1
xlsxwriter 1.4.3 X
xlwt 1.3.0
zstandard 0.15.2

See Dependencies and Optional dependencies for more.

Other API changes#

Deprecations#

Warning

In the next major version release, 2.0, several larger API changes are being considered without a formal deprecation such as making the standard library zoneinfo the default timezone implementation instead of pytz, having the Index support all data types instead of having multiple subclasses (CategoricalIndex, Int64Index, etc.), and more. The changes under consideration are logged in this GitHub issue, and any feedback or concerns are welcome.

Label-based integer slicing on a Series with an Int64Index or RangeIndex#

In a future version, integer slicing on a Series with an Int64Index or RangeIndex will be treated as label-based, not positional. This will make the behavior consistent with other Series.__getitem__() and Series.__setitem__() behaviors (GH 45162).

For example:

In [29]: ser = pd.Series([1, 2, 3, 4, 5], index=[2, 3, 5, 7, 11])

In the old behavior, ser[2:4] treats the slice as positional:

Old behavior:

In [3]: ser[2:4]
Out[3]: 
5    3
7    4
dtype: int64

In a future version, this will be treated as label-based:

Future behavior:

In [4]: ser.loc[2:4]
Out[4]: 
2    1
3    2
dtype: int64

To retain the old behavior, use series.iloc[i:j]. To get the future behavior, use series.loc[i:j].

Slicing on a DataFrame will not be affected.

ExcelWriter attributes#

All attributes of ExcelWriter were previously documented as not public. However some third party Excel engines documented accessing ExcelWriter.book or ExcelWriter.sheets, and users were utilizing these and possibly other attributes. Previously these attributes were not safe to use; e.g. modifications to ExcelWriter.book would not update ExcelWriter.sheets and conversely. In order to support this, pandas has made some attributes public and improved their implementations so that they may now be safely used (GH 45572).

The following attributes are now public and considered safe to access: ExcelWriter.book and ExcelWriter.sheets.

The following attributes have been deprecated. They now raise a FutureWarning when accessed and will be removed in a future version. Users should be aware that their usage is considered unsafe, and can lead to unexpected results.

See the documentation of ExcelWriter for further details.
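As a hedged sketch (the file name and data are made up, and an engine such as openpyxl must be installed), the now-public attributes can be used like this:

import pandas as pd

df = pd.DataFrame({"a": [1, 2]})

with pd.ExcelWriter("report.xlsx") as writer:
    df.to_excel(writer, sheet_name="data")
    # Both attributes are now public and kept consistent with each other.
    workbook = writer.book        # the engine's workbook object
    worksheets = writer.sheets    # mapping of sheet names to worksheet objects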

Using group_keys with transformers in DataFrameGroupBy.apply() and SeriesGroupBy.apply()#

In previous versions of pandas, if it was inferred that the function passed to DataFrameGroupBy.apply() or SeriesGroupBy.apply() was a transformer (i.e. the resulting index was equal to the input index), the group_keys argument of DataFrame.groupby() and Series.groupby() was ignored and the group keys would never be added to the index of the result. In the future, the group keys will be added to the index when the user specifies group_keys=True.

As group_keys=True is the default value of DataFrame.groupby() and Series.groupby(), not specifying group_keys with a transformer will raise a FutureWarning. This can be silenced and the previous behavior retained by specifying group_keys=False.
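A hedged sketch of what this means in practice (the frame here is made up):

import pandas as pd

df = pd.DataFrame({"key": ["a", "a", "b"], "val": [1, 2, 3]})

# The lambda is a transformer: its result has the same index as its input,
# so leaving group_keys unspecified emits a FutureWarning in pandas 1.5.
df.groupby("key").apply(lambda g: g["val"] / g["val"].sum())

# Silence the warning and keep the previous behavior ...
df.groupby("key", group_keys=False).apply(lambda g: g["val"] / g["val"].sum())

# ... or request that the group keys be added to the index (the future default).
df.groupby("key", group_keys=True).apply(lambda g: g["val"] / g["val"].sum())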

Inplace operation when setting values with loc and iloc#

Most of the time setting values with DataFrame.iloc() attempts to set values inplace, only falling back to inserting a new array if necessary. There are some cases where this rule is not followed, for example when setting an entire column from an array with different dtype:

In [30]: df = pd.DataFrame({'price': [11.1, 12.2]}, index=['book1', 'book2'])

In [31]: original_prices = df['price']

In [32]: new_prices = np.array([98, 99])

Old behavior:

In [3]: df.iloc[:, 0] = new_prices

In [4]: df.iloc[:, 0]
Out[4]: 
book1    98
book2    99
Name: price, dtype: int64

In [5]: original_prices
Out[5]: 
book1    11.1
book2    12.2
Name: price, dtype: float64

This behavior is deprecated. In a future version, setting an entire column with iloc will attempt to operate inplace.

Future behavior:

In [3]: df.iloc[:, 0] = new_prices

In [4]: df.iloc[:, 0]
Out[4]: 
book1    98.0
book2    99.0
Name: price, dtype: float64

In [5]: original_prices
Out[5]: 
book1    98.0
book2    99.0
Name: price, dtype: float64

To get the old behavior, use DataFrame.__setitem__() directly:

In [3]: df[df.columns[0]] = new_prices

In [4]: df.iloc[:, 0]
Out[4]: 
book1    98
book2    99
Name: price, dtype: int64

In [5]: original_prices
Out[5]: 
book1    11.1
book2    12.2
Name: price, dtype: float64

To get the old behaviour when df.columns is not unique and you want to change a single column by index, you can use DataFrame.isetitem(), which has been added in pandas 1.5:

In [3]: df_with_duplicated_cols = pd.concat([df, df], axis='columns')

In [3]: df_with_duplicated_cols.isetitem(0, new_prices)

In [4]: df_with_duplicated_cols.iloc[:, 0]
Out[4]: 
book1    98
book2    99
Name: price, dtype: int64

In [5]: original_prices
Out[5]: 
book1    11.1
book2    12.2
Name: 0, dtype: float64

numeric_only default value#

Across the DataFrame, DataFrameGroupBy, and Resampler operations such as min, sum, and idxmax, the default value of the numeric_only argument, if it exists at all, was inconsistent. Furthermore, operations with the default value None can lead to surprising results (GH 46560).

In [1]: df = pd.DataFrame({"a": [1, 2], "b": ["x", "y"]})

In [2]: # Reading the next line without knowing the contents of df, one would
   ...: # expect the result to contain the products for both columns a and b.
   ...: df[["a", "b"]].prod()
Out[2]: 
a    2
dtype: int64

To avoid this behavior, specifying the value numeric_only=None has been deprecated, and will be removed in a future version of pandas. In the future, all operations with a numeric_only argument will default to False. Users should either call the operation only with columns that can be operated on, or specify numeric_only=True to operate only on Boolean, integer, and float columns.
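As a hedged illustration using the df defined above, the forward-compatible spellings are:

# Call the operation only on columns that can be operated on ...
df[["a"]].prod()

# ... or restrict the operation to Boolean, integer, and float columns.
df[["a", "b"]].prod(numeric_only=True)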

In order to support the transition to the new behavior, the following methods have gained the numeric_only argument.

Other Deprecations#

Performance improvements#

Bug fixes#

Categorical#

Datetimelike#

Timedelta#

Time Zones#

Numeric#

Conversion#

Strings#

Interval#

Indexing#

Missing#

MultiIndex#

I/O#

Period#

Plotting#

Groupby/resample/rolling#

Reshaping#

Sparse#

ExtensionArray#

Styler#

Metadata#

Other#

Contributors#

A total of 271 people contributed patches to this release. People with a “+” by their names contributed a patch for the first time.