What’s new in 1.4.0 (January 22, 2022)

These are the changes in pandas 1.4.0. See Release notes for a full changelog including other versions of pandas.

Enhancements#

Improved warning messages#

Previously, warning messages may have pointed to lines within the pandas library. Running the script setting_with_copy_warning.py

import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3]})
df[:2].loc[:, 'a'] = 5

with pandas 1.3 resulted in:

.../site-packages/pandas/core/indexing.py:1951: SettingWithCopyWarning: A value is trying to be set on a copy of a slice from a DataFrame.

This made it difficult to determine where the warning was being generated from. Now pandas will inspect the call stack, reporting the first line outside of the pandas library that gave rise to the warning. The output of the above script is now:

setting_with_copy_warning.py:4: SettingWithCopyWarning: A value is trying to be set on a copy of a slice from a DataFrame.
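Conceptually, this amounts to computing an appropriate stacklevel for warnings.warn(). The helper below is a simplified, hypothetical sketch of the idea; pandas' actual implementation uses its own internal utility and differs in detail.

import inspect
import os
import warnings

import pandas as pd

def first_external_stacklevel(package_dir: str) -> int:
    # Count stack frames, starting from our caller, until the first one whose
    # source file lives outside the given package directory.
    level = 1
    for frame_info in inspect.stack()[1:]:
        if not frame_info.filename.startswith(package_dir):
            break
        level += 1
    return level

def emit_setting_with_copy_warning():
    # Hypothetical library-internal function: point the warning at the first
    # caller outside the pandas install directory.
    warnings.warn(
        "A value is trying to be set on a copy of a slice from a DataFrame.",
        stacklevel=first_external_stacklevel(os.path.dirname(pd.__file__)),
    )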

Index can hold arbitrary ExtensionArrays#

Until now, passing a custom ExtensionArray to pd.Index would cast the array to object dtype. Now Index can directly hold arbitrary ExtensionArrays (GH 43930).

Previous behavior:

In [1]: arr = pd.array([1, 2, pd.NA])

In [2]: idx = pd.Index(arr)

In the old behavior, idx would be object-dtype:

Previous behavior:

In [1]: idx
Out[1]: Index([1, 2, <NA>], dtype='object')

With the new behavior, we keep the original dtype:

New behavior:

In [3]: idx
Out[3]: Index([1, 2, <NA>], dtype='Int64')

One exception to this is SparseArray, which will continue to cast to numpy dtype until pandas 2.0. At that point it will retain its dtype like other ExtensionArrays.
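If code relied on the old object-dtype result, requesting the dtype explicitly should restore it; a minimal sketch:

import pandas as pd

arr = pd.array([1, 2, pd.NA])           # nullable Int64 ExtensionArray

idx = pd.Index(arr)                     # pandas >= 1.4: the Int64 dtype is kept
obj_idx = pd.Index(arr, dtype=object)   # explicit object dtype, matching the old behavior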

Styler#

Styler has been further developed in 1.4.0. The following general enhancements have been made:

Additionally, there are enhancements specific to the HTML rendering:

There are also some LaTeX specific enhancements:

Multi-threaded CSV reading with a new CSV Engine based on pyarrow#

pandas.read_csv() now accepts engine="pyarrow" (requires at least pyarrow 1.0.1) as an argument, allowing for faster CSV parsing on multicore machines with pyarrow installed. See the I/O docs for more information. (GH 23697, GH 43706)
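A minimal usage sketch; the file path is a placeholder, and not every read_csv option is supported by this engine:

import pandas as pd

# "large_file.csv" is a hypothetical path; pyarrow >= 1.0.1 must be installed.
df = pd.read_csv("large_file.csv", engine="pyarrow")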

Rank function for rolling and expanding windows#

Added rank function to Rolling and Expanding. The new function supports the method, ascending, and pct flags of DataFrame.rank(). The method argument supports min, max, and average ranking methods. Example:

In [4]: s = pd.Series([1, 4, 2, 3, 5, 3])

In [5]: s.rolling(3).rank()
Out[5]: 
0    NaN
1    NaN
2    2.0
3    2.0
4    3.0
5    1.5
dtype: float64

In [6]: s.rolling(3).rank(method="max")
Out[6]: 
0    NaN
1    NaN
2    2.0
3    2.0
4    3.0
5    2.0
dtype: float64
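The same method is available on expanding windows, and the ascending and pct flags behave as in DataFrame.rank(); a short sketch:

import pandas as pd

s = pd.Series([1, 4, 2, 3, 5, 3])

# Expanding-window rank, expressed as a fraction of the window (pct=True).
expanding_pct = s.expanding().rank(pct=True)

# Rolling rank in descending order.
rolling_desc = s.rolling(3).rank(ascending=False)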

Groupby positional indexing#

It is now possible to specify positional ranges relative to the ends of each group.

Negative arguments for DataFrameGroupBy.head(), SeriesGroupBy.head(), DataFrameGroupBy.tail(), and SeriesGroupBy.tail() now work correctly and result in ranges relative to the end and start of each group, respectively. Previously, negative arguments returned empty frames.

In [7]: df = pd.DataFrame([["g", "g0"], ["g", "g1"], ["g", "g2"], ["g", "g3"],
   ...:                    ["h", "h0"], ["h", "h1"]], columns=["A", "B"])
   ...: 

In [8]: df.groupby("A").head(-1) Out[8]: A B 0 g g0 1 g g1 2 g g2 4 h h0

DataFrameGroupBy.nth() and SeriesGroupBy.nth() now accept a slice or list of integers and slices.

In [9]: df.groupby("A").nth(slice(1, -1)) Out[9]: A B 1 g g1 2 g g2

In [10]: df.groupby("A").nth([slice(None, 1), slice(-1, None)]) Out[10]: A B 0 g g0 3 g g3 4 h h0 5 h h1

DataFrameGroupBy.nth() and SeriesGroupBy.nth() now accept index notation.

In [11]: df.groupby("A").nth[1, -1] Out[11]: A B 1 g g1 3 g g3 5 h h1

In [12]: df.groupby("A").nth[1:-1] Out[12]: A B 1 g g1 2 g g2

In [13]: df.groupby("A").nth[:1, -1:] Out[13]: A B 0 g g0 3 g g3 4 h h0 5 h h1

DataFrame.from_dict and DataFrame.to_dict have new 'tight' option#

A new 'tight' dictionary format that preserves MultiIndex entries and names is now available with the DataFrame.from_dict() and DataFrame.to_dict() methods and can be used with the standard json library to produce a tight representation of DataFrame objects (GH 4889).

In [14]: df = pd.DataFrame.from_records(
   ....:     [[1, 3], [2, 4]],
   ....:     index=pd.MultiIndex.from_tuples([("a", "b"), ("a", "c")],
   ....:                                     names=["n1", "n2"]),
   ....:     columns=pd.MultiIndex.from_tuples([("x", 1), ("y", 2)],
   ....:                                       names=["z1", "z2"]),
   ....: )
   ....: 

In [15]: df
Out[15]: 
z1     x  y
z2     1  2
n1 n2      
a  b   1  3
   c   2  4

In [16]: df.to_dict(orient='tight')
Out[16]: 
{'index': [('a', 'b'), ('a', 'c')],
 'columns': [('x', 1), ('y', 2)],
 'data': [[1, 3], [2, 4]],
 'index_names': ['n1', 'n2'],
 'column_names': ['z1', 'z2']}
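A rough sketch of a json round trip using the df constructed above; note that json turns the index and column tuples into lists, which from_dict() should accept when rebuilding the frame:

import json

import pandas as pd

# Serialize the MultiIndexed frame with the standard library, then rebuild it,
# index/column names included, from the decoded payload.
payload = json.dumps(df.to_dict(orient="tight"))
restored = pd.DataFrame.from_dict(json.loads(payload), orient="tight")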

Other enhancements#

Notable bug fixes#

These are bug fixes that might have notable behavior changes.

Inconsistent date string parsing#

The dayfirst option of to_datetime() isn’t strict, and this can lead to surprising behavior:

In [17]: pd.to_datetime(["31-12-2021"], dayfirst=False)
Out[17]: DatetimeIndex(['2021-12-31'], dtype='datetime64[s]', freq=None)

Now, a warning will be raised if a date string cannot be parsed in accordance with the given dayfirst value when the value is a delimited date string (e.g. 31-12-2012).
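A small illustrative sketch: the first call warns because the string cannot be a month-first date, while the second parses silently since the string agrees with the flag.

import pandas as pd

# "31-12-2021" cannot be parsed month-first, so dayfirst=False now emits a
# warning before falling back to day-first parsing.
pd.to_datetime(["31-12-2021"], dayfirst=False)

# With dayfirst=True the string is consistent with the flag and parses silently.
pd.to_datetime(["31-12-2021"], dayfirst=True)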

Ignoring dtypes in concat with empty or all-NA columns#

Note

This behaviour change has been reverted in pandas 1.4.3.

When using concat() to concatenate two or more DataFrame objects, if one of the DataFrames was empty or had all-NA values, its dtype was sometimes ignored when finding the concatenated dtype. These are now consistently not ignored (GH 43507).

In [3]: df1 = pd.DataFrame({"bar": [pd.Timestamp("2013-01-01")]}, index=range(1))

In [4]: df2 = pd.DataFrame({"bar": np.nan}, index=range(1, 2))

In [5]: res = pd.concat([df1, df2])

Previously, the float-dtype in df2 would be ignored so the result dtype would be datetime64[ns]. As a result, the np.nan would be cast to NaT.

Previous behavior:

In [6]: res
Out[6]: 
         bar
0 2013-01-01
1        NaT

Now the float-dtype is respected. Since the common dtype for these DataFrames is object, the np.nan is retained.

New behavior:

In [6]: res
Out[6]: 
                   bar
0  2013-01-01 00:00:00
1                  NaN
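If the previous datetime64[ns] result is desired, one option is to give the all-NA frame an explicit datetime dtype rather than relying on it being ignored; a hedged sketch:

import pandas as pd

df1 = pd.DataFrame({"bar": [pd.Timestamp("2013-01-01")]}, index=range(1))

# An explicitly datetime64[ns] all-NA column keeps the old result: the
# concatenated column stays datetime64[ns] and the missing value is NaT.
df2 = pd.DataFrame({"bar": pd.Series([pd.NaT], dtype="datetime64[ns]")}, index=range(1, 2))

res = pd.concat([df1, df2])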

Null-values are no longer coerced to NaN-value in value_counts and mode#

Series.value_counts() and Series.mode() no longer coerce None, NaT and other null-values to a NaN-value for np.object_-dtype. This behavior is now consistent with unique, isin and others (GH 42688).

In [18]: s = pd.Series([True, None, pd.NaT, None, pd.NaT, None])

In [19]: res = s.value_counts(dropna=False)

Previously, all null-values were replaced by a NaN-value.

Previous behavior:

In [3]: res
Out[3]: 
NaN     5
True    1
dtype: int64

Now null-values are no longer mangled.

New behavior:

In [20]: res
Out[20]: 
None    3
NaT     2
True    1
Name: count, dtype: int64
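The default dropna=True is unaffected and continues to exclude all null values; if only an overall missing count is needed, isna() still treats the different null values alike. A small sketch:

import pandas as pd

s = pd.Series([True, None, pd.NaT, None, pd.NaT, None])

# The default dropna=True still excludes every kind of null value.
counts = s.value_counts()

# isna() continues to treat None and NaT alike.
n_missing = s.isna().sum()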

mangle_dupe_cols in read_csv no longer renames unique columns conflicting with target names#

read_csv() no longer renames unique column labels which conflict with the target names of duplicated columns. Already existing columns are skipped, i.e. the next available index is used for the target column name (GH 14704).

In [21]: import io

In [22]: data = "a,a,a.1\n1,2,3"

In [23]: res = pd.read_csv(io.StringIO(data))

Previously, the second column was called a.1, while the third column was also renamed to a.1.1.

Previous behavior:

In [3]: res
Out[3]: 
   a  a.1  a.1.1
0  1    2      3

Now the renaming checks if a.1 already exists when changing the name of the second column and skips this index. The second column is instead renamed to a.2.

New behavior:

In [24]: res
Out[24]: 
   a  a.2  a.1
0  1    2    3

unstack and pivot_table no longer raise ValueError for results that would exceed the int32 limit#

Previously DataFrame.pivot_table() and DataFrame.unstack() would raise a ValueError if the operation could produce a result with more than 2**31 - 1 elements. This operation now raises an errors.PerformanceWarning instead (GH 26314).

Previous behavior:

In [3]: df = pd.DataFrame({"ind1": np.arange(2 ** 16), "ind2": np.arange(2 ** 16), "count": 0})

In [4]: df.pivot_table(index="ind1", columns="ind2", values="count", aggfunc="count")
ValueError: Unstacked DataFrame is too big, causing int32 overflow

New behavior:

In [4]: df.pivot_table(index="ind1", columns="ind2", values="count", aggfunc="count")
PerformanceWarning: The following operation may generate 4294967296 cells in the resulting pandas object.
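Code that knowingly produces very wide results can silence the new warning with the standard warnings machinery; a sketch (note that actually materializing this pivot needs a very large amount of memory):

import warnings

import numpy as np
import pandas as pd

df = pd.DataFrame({"ind1": np.arange(2 ** 16), "ind2": np.arange(2 ** 16), "count": 0})

# Suppress the size warning for this known-large pivot; the result really does
# hold roughly 2**32 cells, so this is illustrative rather than practical.
with warnings.catch_warnings():
    warnings.simplefilter("ignore", pd.errors.PerformanceWarning)
    wide = df.pivot_table(index="ind1", columns="ind2", values="count", aggfunc="count")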

groupby.apply consistent transform detection#

DataFrameGroupBy.apply() and SeriesGroupBy.apply() are designed to be flexible, allowing users to perform aggregations, transformations, filters, and use it with user-defined functions that might not fall into any of these categories. As part of this, apply will attempt to detect when an operation is a transform, and in such a case, the result will have the same index as the input. In order to determine if the operation is a transform, pandas compares the input’s index to the result’s and determines if it has been mutated. Previously in pandas 1.3, different code paths used different definitions of “mutated”: some would use Python’s is whereas others would test only up to equality.

This inconsistency has been removed; pandas now tests up to equality.

In [25]: def func(x):
   ....:     return x.copy()
   ....: 

In [26]: df = pd.DataFrame({'a': [1, 2], 'b': [3, 4], 'c': [5, 6]})

In [27]: df
Out[27]: 
   a  b  c
0  1  3  5
1  2  4  6

Previous behavior:

In [3]: df.groupby(['a']).apply(func)
Out[3]: 
     a  b  c
a          
1 0  1  3  5
2 1  2  4  6

In [4]: df.set_index(['a', 'b']).groupby(['a']).apply(func)
Out[4]: 
     c
a b   
1 3  5
2 4  6

In the examples above, the first uses a code path where pandas uses is and determines that func is not a transform whereas the second tests up to equality and determines that func is a transform. In the first case, the result’s index is not the same as the input’s.

New behavior:

In [5]: df.groupby(['a']).apply(func)
Out[5]: 
   a  b  c
0  1  3  5
1  2  4  6

In [6]: df.set_index(['a', 'b']).groupby(['a']).apply(func)
Out[6]: 
     c
a b   
1 3  5
2 4  6

Now in both cases it is determined that func is a transform. In each case, the result has the same index as the input.

Backwards incompatible API changes#

Increased minimum version for Python#

pandas 1.4.0 supports Python 3.8 and higher.

Increased minimum versions for dependencies#

Some minimum supported versions of dependencies were updated. If installed, we now require:

For optional libraries the general recommendation is to use the latest version. The following table lists the lowest version per library that is currently being tested throughout the development of pandas. Optional libraries below the lowest tested version may still work, but are not considered supported.

See Dependencies and Optional dependencies for more.

Other API changes#

Deprecations#

Deprecated Int64Index, UInt64Index & Float64Index#

Int64Index, UInt64Index and Float64Index have been deprecated in favor of the base Index class and will be removed in pandas 2.0 (GH 43028).

For constructing a numeric index, you can use the base Index class instead, specifying the data type (which will also work on older pandas releases):

replace

pd.Int64Index([1, 2, 3])

with

pd.Index([1, 2, 3], dtype="int64")

For checking the data type of an index object, you can replace isinstance checks with checking the dtype:

replace

isinstance(idx, pd.Int64Index)

with

idx.dtype == "int64"

Currently, in order to maintain backward compatibility, calls to Index will continue to return Int64Index, UInt64Index and Float64Index when given numeric data, but in the future, an Index will be returned.

Current behavior:

In [1]: pd.Index([1, 2, 3], dtype="int32")
Out [1]: Int64Index([1, 2, 3], dtype='int64')

In [1]: pd.Index([1, 2, 3], dtype="uint64")
Out [1]: UInt64Index([1, 2, 3], dtype='uint64')

Future behavior:

In [3]: pd.Index([1, 2, 3], dtype="int32")
Out [3]: Index([1, 2, 3], dtype='int32')

In [4]: pd.Index([1, 2, 3], dtype="uint64")
Out [4]: Index([1, 2, 3], dtype='uint64')

Deprecated DataFrame.append and Series.append#

DataFrame.append() and Series.append() have been deprecated and will be removed in a future version. Use pandas.concat() instead (GH 35407).

Deprecated syntax

In [1]: pd.Series([1, 2]).append(pd.Series([3, 4]))
Out [1]: 
<stdin>:1: FutureWarning: The series.append method is deprecated and will be removed from pandas in a future version. Use pandas.concat instead.
0    1
1    2
0    3
1    4
dtype: int64

In [2]: df1 = pd.DataFrame([[1, 2], [3, 4]], columns=list('AB'))

In [3]: df2 = pd.DataFrame([[5, 6], [7, 8]], columns=list('AB'))

In [4]: df1.append(df2)
Out [4]: 
<stdin>:1: FutureWarning: The frame.append method is deprecated and will be removed from pandas in a future version. Use pandas.concat instead.
   A  B
0  1  2
1  3  4
0  5  6
1  7  8

Recommended syntax

In [28]: pd.concat([pd.Series([1, 2]), pd.Series([3, 4])])
Out[28]: 
0    1
1    2
0    3
1    4
dtype: int64

In [29]: df1 = pd.DataFrame([[1, 2], [3, 4]], columns=list('AB'))

In [30]: df2 = pd.DataFrame([[5, 6], [7, 8]], columns=list('AB'))

In [31]: pd.concat([df1, df2])
Out[31]: 
   A  B
0  1  2
1  3  4
0  5  6
1  7  8
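The common append() pattern of adding a single row also translates directly; a short sketch:

import pandas as pd

df = pd.DataFrame([[1, 2], [3, 4]], columns=list('AB'))

# Wrap the new row in a one-row DataFrame and concatenate, resetting the
# index if a fresh RangeIndex is desired.
new_row = pd.DataFrame([[5, 6]], columns=list('AB'))
df = pd.concat([df, new_row], ignore_index=True)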

Other Deprecations#

Performance improvements#

Bug fixes#

Categorical#

Datetimelike#

Timedelta#

Time Zones#

Numeric#

Conversion#

Strings#

Interval#

Indexing#

Missing#

MultiIndex#

I/O#

Period#

Plotting#

Groupby/resample/rolling#

Reshaping#

Sparse#

ExtensionArray#

Styler#

Other#

Contributors#

A total of 275 people contributed patches to this release. People with a “+” by their names contributed a patch for the first time.