What’s new in 1.3.0 (July 2, 2021)

These are the changes in pandas 1.3.0. See Release notes for a full changelog including other versions of pandas.

Warning

When reading new Excel 2007+ (.xlsx) files, the default argument engine=None to read_excel() will now result in using the openpyxl engine in all cases when the option io.excel.xlsx.reader is set to "auto". Previously, some cases would use the xlrd engine instead. See What’s new 1.2.0 for background on this change.
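A minimal sketch of the two ways to make the engine choice explicit (the file name below is hypothetical):

import pandas as pd

# With io.excel.xlsx.reader left at "auto", engine=None now resolves to openpyxl.
df = pd.read_excel("report.xlsx")

# Pin the engine explicitly if you want to be unambiguous about which reader is used.
df = pd.read_excel("report.xlsx", engine="openpyxl")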

Enhancements#

Read and write XML documents#

We added I/O support to read and render shallow versions of XML documents with read_xml() and DataFrame.to_xml(). Using lxml as the parser, both XPath 1.0 and XSLT 1.0 are available. (GH 27554)

In [1]: xml = """<?xml version='1.0' encoding='utf-8'?>
   ...: <data>
   ...:   <row>
   ...:     <shape>square</shape>
   ...:     <degrees>360</degrees>
   ...:     <sides>4.0</sides>
   ...:   </row>
   ...:   <row>
   ...:     <shape>circle</shape>
   ...:     <degrees>360</degrees>
   ...:   </row>
   ...:   <row>
   ...:     <shape>triangle</shape>
   ...:     <degrees>180</degrees>
   ...:     <sides>3.0</sides>
   ...:   </row>
   ...: </data>"""

In [2]: df = pd.read_xml(xml)

In [3]: df
Out[3]:
      shape  degrees  sides
0    square      360    4.0
1    circle      360    NaN
2  triangle      180    3.0

In [4]: df.to_xml()
Out[4]:
<?xml version='1.0' encoding='utf-8'?>
<data>
  <row>
    <index>0</index>
    <shape>square</shape>
    <degrees>360</degrees>
    <sides>4.0</sides>
  </row>
  <row>
    <index>1</index>
    <shape>circle</shape>
    <degrees>360</degrees>
    <sides/>
  </row>
  <row>
    <index>2</index>
    <shape>triangle</shape>
    <degrees>180</degrees>
    <sides>3.0</sides>
  </row>
</data>

For more, see Writing XML in the user guide on IO tools.
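Because lxml backs the parser, the xpath keyword of read_xml() can be used to select which nodes become rows. A small sketch reusing the xml string above (the XPath expression is only an illustration):

# Keep only the <row> elements whose <degrees> child equals 360 (XPath 1.0 via lxml).
df_360 = pd.read_xml(xml, xpath="//row[degrees=360]")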

Styler enhancements#

We provided some focused development on Styler. See also the Styler documentation, which has been revised and improved (GH 39720, GH 39317, GH 40493).

DataFrame constructor honors copy=False with dict#

When passing a dictionary to DataFrame with copy=False, a copy will no longer be made (GH 32960).

In [1]: arr = np.array([1, 2, 3])

In [2]: df = pd.DataFrame({"A": arr, "B": arr.copy()}, copy=False)

In [3]: df
Out[3]:
   A  B
0  1  1
1  2  2
2  3  3

df["A"] remains a view on arr:

In [4]: arr[0] = 0

In [5]: assert df.iloc[0, 0] == 0

The default behavior when not passing copy will remain unchanged, i.e. a copy will be made.
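For contrast, a brief sketch of the default behavior (no copy keyword; the names arr2 and df2 are illustrative): the dictionary data is copied, so mutating the source array does not affect the frame.

arr2 = np.array([1, 2, 3])
df2 = pd.DataFrame({"A": arr2})   # copy not passed -> a copy is made

arr2[0] = 99
assert df2.iloc[0, 0] == 1        # df2 is unaffected by the mutation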

PyArrow backed string data type#

We’ve enhanced the StringDtype, an extension type dedicated to string data. (GH 39908)

It is now possible to specify a storage keyword option to StringDtype. Use pandas options or specify the dtype using dtype='string[pyarrow]' to allow the StringArray to be backed by a PyArrow array instead of a NumPy array of Python objects.

The PyArrow backed StringArray requires pyarrow 1.0.0 or greater to be installed.

Warning

string[pyarrow] is currently considered experimental. The implementation and parts of the API may change without warning.

In [6]: pd.Series(['abc', None, 'def'], dtype=pd.StringDtype(storage="pyarrow"))
Out[6]:
0     abc
1    <NA>
2     def
dtype: string

You can use the alias "string[pyarrow]" as well.

In [7]: s = pd.Series(['abc', None, 'def'], dtype="string[pyarrow]")

In [8]: s
Out[8]:
0     abc
1    <NA>
2     def
dtype: string

You can also create a PyArrow backed string array using pandas options.

In [9]: with pd.option_context("string_storage", "pyarrow"):
   ...:     s = pd.Series(['abc', None, 'def'], dtype="string")
   ...:

In [10]: s
Out[10]:
0     abc
1    <NA>
2     def
dtype: string

The usual string accessor methods work. Where appropriate, the return type of the Series or columns of a DataFrame will also have string dtype.

In [11]: s.str.upper()
Out[11]:
0     ABC
1    <NA>
2     DEF
dtype: string

In [12]: s.str.split('b', expand=True).dtypes
Out[12]:
0    string[pyarrow]
1    string[pyarrow]
dtype: object

String accessor methods returning integers will return a value with Int64Dtype:

In [13]: s.str.count("a")
Out[13]:
0       1
1    <NA>
2       0
dtype: Int64

Centered datetime-like rolling windows#

When performing rolling calculations on DataFrame and Series objects with a datetime-like index, a centered datetime-like window can now be used (GH 38780). For example:

In [14]: df = pd.DataFrame(
   ....:     {"A": [0, 1, 2, 3, 4]}, index=pd.date_range("2020", periods=5, freq="1D")
   ....: )
   ....:

In [15]: df
Out[15]:
            A
2020-01-01  0
2020-01-02  1
2020-01-03  2
2020-01-04  3
2020-01-05  4

In [16]: df.rolling("2D", center=True).mean()
Out[16]:
              A
2020-01-01  0.5
2020-01-02  1.5
2020-01-03  2.5
2020-01-04  3.5
2020-01-05  4.0
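For comparison, omitting center=True gives the usual trailing window. The values in the comment below are a sketch of what a 2-day trailing mean over the same frame should give:

# Trailing (right-anchored) 2-day window over the same df, for comparison.
df.rolling("2D").mean()
#               A
# 2020-01-01  0.0
# 2020-01-02  0.5
# 2020-01-03  1.5
# 2020-01-04  2.5
# 2020-01-05  3.5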

Other enhancements#

Notable bug fixes#

These are bug fixes that might have notable behavior changes.

Categorical.unique now always maintains same dtype as original#

Previously, when calling Categorical.unique() with categorical data, unused categories in the new array would be removed, making the dtype of the new array different from that of the original (GH 18291).

As an example of this, given:

In [17]: dtype = pd.CategoricalDtype(['bad', 'neutral', 'good'], ordered=True)

In [18]: cat = pd.Categorical(['good', 'good', 'bad', 'bad'], dtype=dtype)

In [19]: original = pd.Series(cat)

In [20]: unique = original.unique()

Previous behavior:

In [1]: unique
['good', 'bad']
Categories (2, object): ['bad' < 'good']

In [2]: original.dtype == unique.dtype
False

New behavior:

In [21]: unique
Out[21]:
['good', 'bad']
Categories (3, object): ['bad' < 'neutral' < 'good']

In [22]: original.dtype == unique.dtype
Out[22]: True

Preserve dtypes in DataFrame.combine_first()#

DataFrame.combine_first() will now preserve dtypes (GH 7509)

In [23]: df1 = pd.DataFrame({"A": [1, 2, 3], "B": [1, 2, 3]}, index=[0, 1, 2])

In [24]: df1
Out[24]:
   A  B
0  1  1
1  2  2
2  3  3

In [25]: df2 = pd.DataFrame({"B": [4, 5, 6], "C": [1, 2, 3]}, index=[2, 3, 4])

In [26]: df2
Out[26]:
   B  C
2  4  1
3  5  2
4  6  3

In [27]: combined = df1.combine_first(df2)

Previous behavior:

In [1]: combined.dtypes
Out[1]:
A    float64
B    float64
C    float64
dtype: object

New behavior:

In [28]: combined.dtypes
Out[28]:
A    float64
B      int64
C    float64
dtype: object

Groupby methods agg and transform no longer change return dtype for callables#

Previously the methods DataFrameGroupBy.aggregate(), SeriesGroupBy.aggregate(), DataFrameGroupBy.transform(), and SeriesGroupBy.transform() might cast the result dtype when the argument func is callable, possibly leading to undesirable results (GH 21240). The cast would occur if the result is numeric and casting back to the input dtype does not change any values as measured by np.allclose. Now no such casting occurs.

In [29]: df = pd.DataFrame({'key': [1, 1], 'a': [True, False], 'b': [True, True]})

In [30]: df
Out[30]:
   key      a     b
0    1   True  True
1    1  False  True

Previous behavior:

In [5]: df.groupby('key').agg(lambda x: x.sum())
Out[5]:
        a  b
key
1    True  2

New behavior:

In [31]: df.groupby('key').agg(lambda x: x.sum())
Out[31]:
     a  b
key
1    1  2

float result for DataFrameGroupBy.mean(), DataFrameGroupBy.median(), and DataFrameGroupBy.var(), SeriesGroupBy.mean(), SeriesGroupBy.median(), and SeriesGroupBy.var()#

Previously, these methods could result in different dtypes depending on the input values. Now, these methods will always return a float dtype. (GH 41137)

In [32]: df = pd.DataFrame({'a': [True], 'b': [1], 'c': [1.0]})

Previous behavior:

In [5]: df.groupby(df.index).mean()
Out[5]:
      a  b    c
0  True  1  1.0

New behavior:

In [33]: df.groupby(df.index).mean()
Out[33]:
     a    b    c
0  1.0  1.0  1.0

Try operating inplace when setting values with loc and iloc#

When setting an entire column using loc or iloc, pandas will try to insert the values into the existing data rather than create an entirely new array.

In [34]: df = pd.DataFrame(range(3), columns=["A"], dtype="float64")

In [35]: values = df.values

In [36]: new = np.array([5, 6, 7], dtype="int64")

In [37]: df.loc[[0, 1, 2], "A"] = new

In both the new and old behavior, the data in values is overwritten, but in the old behavior the dtype of df["A"] changed to int64.

Previous behavior:

In [1]: df.dtypes
Out[1]:
A    int64
dtype: object

In [2]: np.shares_memory(df["A"].values, new)
Out[2]: False

In [3]: np.shares_memory(df["A"].values, values)
Out[3]: False

In pandas 1.3.0, df continues to share data with values

New behavior:

In [38]: df.dtypes
Out[38]:
A    float64
dtype: object

In [39]: np.shares_memory(df["A"], new)
Out[39]: False

In [40]: np.shares_memory(df["A"], values)
Out[40]: True

Never operate inplace when setting frame[keys] = values#

When setting multiple columns using frame[keys] = values, new arrays will replace the pre-existing arrays for these keys, which will not be overwritten (GH 39510). As a result, the columns will retain the dtype(s) of values, never casting to the dtypes of the existing arrays.

In [41]: df = pd.DataFrame(range(3), columns=["A"], dtype="float64")

In [42]: df[["A"]] = 5

In the old behavior, 5 was cast to float64 and inserted into the existing array backing df:

Previous behavior:

In [1]: df.dtypes
Out[1]:
A    float64

In the new behavior, we get a new array, and retain an integer-dtyped 5:

New behavior:

In [43]: df.dtypes
Out[43]:
A    int64
dtype: object
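If the goal is instead to keep writing into the existing float64 array, assigning to the full column through loc (per the previous section) is one way to do that; a brief sketch:

df = pd.DataFrame(range(3), columns=["A"], dtype="float64")
df.loc[:, "A"] = 5   # pandas tries to write into the existing array
df.dtypes            # "A" stays float64; the value is stored as 5.0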

Consistent casting with setting into Boolean Series#

Setting non-boolean values into a Series with dtype=bool now consistently casts to dtype=object (GH 38709)

In [1]: orig = pd.Series([True, False])

In [2]: ser = orig.copy()

In [3]: ser.iloc[1] = np.nan

In [4]: ser2 = orig.copy()

In [5]: ser2.iloc[1] = 2.0

Previous behavior:

In [1]: ser
Out[1]:
0    1.0
1    NaN
dtype: float64

In [2]: ser2
Out[2]:
0    True
1     2.0
dtype: object

New behavior:

In [1]: ser
Out[1]:
0    True
1     NaN
dtype: object

In [2]: ser2
Out[2]:
0    True
1     2.0
dtype: object

DataFrameGroupBy.rolling and SeriesGroupBy.rolling no longer return grouped-by column in values#

The group-by column will now be dropped from the result of a groupby.rolling operation (GH 32262).

In [44]: df = pd.DataFrame({"A": [1, 1, 2, 3], "B": [0, 1, 2, 3]})

In [45]: df
Out[45]:
   A  B
0  1  0
1  1  1
2  2  2
3  3  3

Previous behavior:

In [1]: df.groupby("A").rolling(2).sum()
Out[1]:
         A    B
A
1 0    NaN  NaN
  1    2.0  1.0
2 2    NaN  NaN
3 3    NaN  NaN

New behavior:

In [46]: df.groupby("A").rolling(2).sum()
Out[46]:
       B
A
1 0  NaN
  1  1.0
2 2  NaN
3 3  NaN

Removed artificial truncation in rolling variance and standard deviation#

Rolling.std() and Rolling.var() will no longer artificially truncate results that are less than ~1e-8 and ~1e-15 respectively to zero (GH 37051, GH 40448, GH 39872).

However, floating point artifacts may now exist in the results when rolling over larger values.

In [47]: s = pd.Series([7, 5, 5, 5])

In [48]: s.rolling(3).var()
Out[48]:
0         NaN
1         NaN
2    1.333333
3    0.000000
dtype: float64
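A hedged illustration of the trade-off noted above: scaling the same data up by a large factor may leave a small nonzero residue in the last window instead of exactly 0.0 (the exact magnitude depends on floating point rounding; s_large is an illustrative name):

s_large = pd.Series([7, 5, 5, 5]) * 1e8

# Previously the final window's variance could be truncated to exactly 0.0;
# now a small floating point residue may remain for values of this magnitude.
s_large.rolling(3).var()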

DataFrameGroupBy.rolling and SeriesGroupBy.rolling with MultiIndex no longer drop levels in the result#

DataFrameGroupBy.rolling() and SeriesGroupBy.rolling() will no longer drop levels of a DataFrame with a MultiIndex in the result. This can lead to a perceived duplication of levels in the resulting MultiIndex, but this change restores the behavior that was present in version 1.1.3 (GH 38787, GH 38523).

In [49]: index = pd.MultiIndex.from_tuples([('idx1', 'idx2')], names=['label1', 'label2'])

In [50]: df = pd.DataFrame({'a': [1], 'b': [2]}, index=index)

In [51]: df
Out[51]:
               a  b
label1 label2
idx1   idx2    1  2

Previous behavior:

In [1]: df.groupby('label1').rolling(1).sum()
Out[1]:
          a    b
label1
idx1    1.0  2.0

New behavior:

In [52]: df.groupby('label1').rolling(1).sum()
Out[52]:
                        a    b
label1 label1 label2
idx1   idx1   idx2    1.0  2.0

Backwards incompatible API changes#

Increased minimum versions for dependencies#

Some minimum supported versions of dependencies were updated. If installed, we now require:

Package            Minimum Version   Required   Changed
numpy              1.17.3            X          X
pytz               2017.3            X
python-dateutil    2.7.3             X
bottleneck         1.2.1
numexpr            2.7.0                        X
pytest (dev)       6.0                          X
mypy (dev)         0.812                        X
setuptools         38.6.0                       X

For optional libraries the general recommendation is to use the latest version. The following table lists the lowest version per library that is currently being tested throughout the development of pandas. Optional libraries below the lowest tested version may still work, but are not considered supported.

Package            Minimum Version   Changed
beautifulsoup4     4.6.0
fastparquet        0.4.0             X
fsspec             0.7.4
gcsfs              0.6.0
lxml               4.3.0
matplotlib         2.2.3
numba              0.46.0
openpyxl           3.0.0             X
pyarrow            0.17.0            X
pymysql            0.8.1             X
pytables           3.5.1
s3fs               0.4.0
scipy              1.2.0
sqlalchemy         1.3.0             X
tabulate           0.8.7             X
xarray             0.12.0
xlrd               1.2.0
xlsxwriter         1.0.2
xlwt               1.3.0
pandas-gbq         0.12.0

See Dependencies and Optional dependencies for more.
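One way to compare the minimum versions above against your own environment is pd.show_versions(), which is part of the public pandas API:

import pandas as pd

# Prints the installed versions of pandas and its required/optional dependencies.
pd.show_versions()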

Other API changes#

Build#

Deprecations#

Deprecated dropping nuisance columns in DataFrame reductions and DataFrameGroupBy operations#

When calling a reduction (e.g. .min, .max, .sum) on a DataFrame with numeric_only=None (the default), columns where the reduction raises a TypeError are silently ignored and dropped from the result.

This behavior is deprecated. In a future version, the TypeError will be raised, and users will need to select only valid columns before calling the function.

For example:

In [53]: df = pd.DataFrame({"A": [1, 2, 3, 4], "B": pd.date_range("2016-01-01", periods=4)})

In [54]: df
Out[54]:
   A          B
0  1 2016-01-01
1  2 2016-01-02
2  3 2016-01-03
3  4 2016-01-04

Old behavior:

In [3]: df.prod()
Out[3]:
A    24
dtype: int64

Future behavior:

In [4]: df.prod()
...
TypeError: 'DatetimeArray' does not implement reduction 'prod'

In [5]: df[["A"]].prod()
Out[5]:
A    24
dtype: int64
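A more general way to keep only the columns a numeric reduction can handle, instead of listing them by name, is select_dtypes; a brief sketch reusing the df above:

# Keep only numeric columns before reducing, so no TypeError is raised.
# For this frame it is equivalent to df[["A"]].prod().
df.select_dtypes("number").prod()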

Similarly, when applying a function to DataFrameGroupBy, columns on which the function raises TypeError are currently silently ignored and dropped from the result.

This behavior is deprecated. In a future version, the TypeError will be raised, and users will need to select only valid columns before calling the function.

For example:

In [55]: df = pd.DataFrame({"A": [1, 2, 3, 4], "B": pd.date_range("2016-01-01", periods=4)})

In [56]: gb = df.groupby([1, 1, 2, 2])

Old behavior:

In [4]: gb.prod(numeric_only=False)
Out[4]:
    A
1   2
2  12

Future behavior:

In [5]: gb.prod(numeric_only=False)
...
TypeError: datetime64 type does not support prod operations

In [6]: gb[["A"]].prod(numeric_only=False)
Out[6]:
    A
1   2
2  12

Other Deprecations#

Performance improvements#

Bug fixes#

Categorical#

Datetimelike#

Timedelta#

Timezones#

Numeric#

Conversion#

Strings#

Interval#

Indexing#

Missing#

MultiIndex#

I/O#

Period#

Plotting#

Groupby/resample/rolling#

Reshaping#

Sparse#

ExtensionArray#

Styler#

Other#

Contributors#

A total of 251 people contributed patches to this release. People with a “+” by their names contributed a patch for the first time.