What’s new in 1.4.0 (January 22, 2022)
These are the changes in pandas 1.4.0. See Release notes for a full changelog including other versions of pandas.
Enhancements#
Improved warning messages#
Previously, warning messages may have pointed to lines within the pandas library. Running the script setting_with_copy_warning.py
import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3]})
df[:2].loc[:, 'a'] = 5
with pandas 1.3 resulted in:
.../site-packages/pandas/core/indexing.py:1951: SettingWithCopyWarning: A value is trying to be set on a copy of a slice from a DataFrame.
This made it difficult to determine where the warning was being generated from. Now pandas will inspect the call stack, reporting the first line outside of the pandas library that gave rise to the warning. The output of the above script is now:
setting_with_copy_warning.py:4: SettingWithCopyWarning: A value is trying to be set on a copy of a slice from a DataFrame.
Index can hold arbitrary ExtensionArrays#
Until now, passing a custom ExtensionArray to pd.Index would cast the array to object dtype. Now Index can directly hold arbitrary ExtensionArrays (GH 43930).
Previous behavior:
In [1]: arr = pd.array([1, 2, pd.NA])
In [2]: idx = pd.Index(arr)
In the old behavior, idx would be object-dtype:
Previous behavior:
In [1]: idx
Out[1]: Index([1, 2, <NA>], dtype='object')
With the new behavior, we keep the original dtype:
New behavior:
In [3]: idx
Out[3]: Index([1, 2, <NA>], dtype='Int64')
One exception to this is SparseArray, which will continue to cast to numpy dtype until pandas 2.0. At that point it will retain its dtype like other ExtensionArrays.
Styler#
Styler has been further developed in 1.4.0. The following general enhancements have been made:
- Styling and formatting of indexes has been added, with Styler.apply_index(), Styler.applymap_index() and Styler.format_index(). These mirror the signature of the methods already used to style and format data values, and work with HTML, LaTeX and Excel formats (GH 41893, GH 43101, GH 41993, GH 41995)
- The new method Styler.hide() deprecates Styler.hide_index() and Styler.hide_columns() (GH 43758)
- The keyword arguments level and names have been added to Styler.hide() (and implicitly to the deprecated methods Styler.hide_index() and Styler.hide_columns()) for additional control of visibility of MultiIndexes and of Index names (GH 25475, GH 43404, GH 43346)
- Styler.export() and Styler.use() have been updated to address all of the added functionality from v1.2.0 and v1.3.0 (GH 40675)
- Global options under the category pd.options.styler have been extended to configure default Styler properties which address formatting, encoding, and HTML and LaTeX rendering. Note that formerly Styler relied on display.html.use_mathjax, which has now been replaced by styler.html.mathjax (GH 41395)
- Validation of certain keyword arguments, e.g. caption (GH 43368)
- Various bug fixes as recorded below
Additionally there are specific enhancements to the HTML specific rendering:
- Styler.bar() introduces additional arguments to control alignment and display (GH 26070, GH 36419), and it also validates the input arguments width and height (GH 42511)
- Styler.to_html() introduces keyword arguments sparse_index, sparse_columns, bold_headers, caption, max_rows and max_columns (GH 41946, GH 43149, GH 42972)
- Styler.to_html() omits CSSStyle rules for hidden table elements as a performance enhancement (GH 43619)
- Custom CSS classes can now be directly specified without string replacement (GH 43686)
- Ability to render hyperlinks automatically via a new hyperlinks formatting keyword argument (GH 45058)
There are also some LaTeX specific enhancements:
- Styler.to_latex() introduces keyword argument environment, which also allows a specific “longtable” entry through a separate jinja2 template (GH 41866)
- Naive sparsification is now possible for LaTeX without the necessity of including the multirow package (GH 43369)
- cline support has been added for MultiIndex row sparsification through a keyword argument (GH 45138)
Multi-threaded CSV reading with a new CSV Engine based on pyarrow#
pandas.read_csv() now accepts engine="pyarrow" (requires at least pyarrow 1.0.1) as an argument, allowing for faster CSV parsing on multicore machines with pyarrow installed. See the I/O docs for more info. (GH 23697, GH 43706)
Rank function for rolling and expanding windows#
Added rank function to Rolling and Expanding. The new function supports the method, ascending, and pct flags of DataFrame.rank(). The method argument supports min, max, and average ranking methods. Example:
In [4]: s = pd.Series([1, 4, 2, 3, 5, 3])
In [5]: s.rolling(3).rank()
Out[5]:
0    NaN
1    NaN
2    2.0
3    2.0
4    3.0
5    1.5
dtype: float64

In [6]: s.rolling(3).rank(method="max")
Out[6]:
0    NaN
1    NaN
2    2.0
3    2.0
4    3.0
5    2.0
dtype: float64
Groupby positional indexing#
It is now possible to specify positional ranges relative to the ends of each group.
Negative arguments for DataFrameGroupBy.head(), SeriesGroupBy.head(), DataFrameGroupBy.tail(), and SeriesGroupBy.tail() now work correctly and result in ranges relative to the end and start of each group, respectively. Previously, negative arguments returned empty frames.
In [7]: df = pd.DataFrame([["g", "g0"], ["g", "g1"], ["g", "g2"], ["g", "g3"],
   ...:                    ["h", "h0"], ["h", "h1"]], columns=["A", "B"])
   ...:

In [8]: df.groupby("A").head(-1)
Out[8]:
   A   B
0  g  g0
1  g  g1
2  g  g2
4  h  h0
DataFrameGroupBy.nth() and SeriesGroupBy.nth() now accept a slice or list of integers and slices.
In [9]: df.groupby("A").nth(slice(1, -1))
Out[9]:
   A   B
1  g  g1
2  g  g2

In [10]: df.groupby("A").nth([slice(None, 1), slice(-1, None)])
Out[10]:
   A   B
0  g  g0
3  g  g3
4  h  h0
5  h  h1
DataFrameGroupBy.nth() and SeriesGroupBy.nth() now accept index notation.
In [11]: df.groupby("A").nth[1, -1]
Out[11]:
   A   B
1  g  g1
3  g  g3
5  h  h1

In [12]: df.groupby("A").nth[1:-1]
Out[12]:
   A   B
1  g  g1
2  g  g2

In [13]: df.groupby("A").nth[:1, -1:]
Out[13]:
   A   B
0  g  g0
3  g  g3
4  h  h0
5  h  h1
DataFrame.from_dict and DataFrame.to_dict have new 'tight' option#
A new 'tight' dictionary format that preserves MultiIndex entries and names is now available with the DataFrame.from_dict() and DataFrame.to_dict() methods and can be used with the standard json library to produce a tight representation of DataFrame objects (GH 4889).
In [14]: df = pd.DataFrame.from_records(
   ....:     [[1, 3], [2, 4]],
   ....:     index=pd.MultiIndex.from_tuples([("a", "b"), ("a", "c")],
   ....:                                     names=["n1", "n2"]),
   ....:     columns=pd.MultiIndex.from_tuples([("x", 1), ("y", 2)],
   ....:                                       names=["z1", "z2"]),
   ....: )
   ....:

In [15]: df
Out[15]:
z1     x  y
z2     1  2
n1 n2
a  b   1  3
   c   2  4

In [16]: df.to_dict(orient='tight')
Out[16]:
{'index': [('a', 'b'), ('a', 'c')],
 'columns': [('x', 1), ('y', 2)],
 'data': [[1, 3], [2, 4]],
 'index_names': ['n1', 'n2'],
 'column_names': ['z1', 'z2']}
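The JSON use case mentioned above can be sketched as a round trip. This is a minimal illustration rather than a prescribed workflow; the variable names payload and restored are chosen here for the example, and it assumes that every label and value in the frame is JSON-serializable.

```python
import json

import pandas as pd

df = pd.DataFrame(
    [[1, 3], [2, 4]],
    index=pd.MultiIndex.from_tuples([("a", "b"), ("a", "c")], names=["n1", "n2"]),
    columns=pd.MultiIndex.from_tuples([("x", 1), ("y", 2)], names=["z1", "z2"]),
)

# Serialize: the tuples under 'index' and 'columns' become JSON arrays.
payload = json.dumps(df.to_dict(orient="tight"))

# Deserialize and rebuild the frame, including both MultiIndexes and their names.
restored = pd.DataFrame.from_dict(json.loads(payload), orient="tight")

print(restored)
```

The other to_dict orientations do not record index and column level names, which is why 'tight' is the orientation used for this kind of round trip.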
Other enhancements#
- concat() will preserve the attrs when it is the same for all objects and discard the attrs when they are different (GH 41828)
- DataFrameGroupBy operations with as_index=False now correctly retain ExtensionDtype dtypes for columns being grouped on (GH 41373)
- Add support for assigning values to by argument in DataFrame.plot.hist() and DataFrame.plot.box() (GH 15079)
- Series.sample(), DataFrame.sample(), DataFrameGroupBy.sample(), and SeriesGroupBy.sample() now accept a np.random.Generator as input to random_state. A generator will be more performant, especially with replace=False (GH 38100)
- Series.ewm() and DataFrame.ewm() now support a method argument with a 'table' option that performs the windowing operation over an entire DataFrame. See Window Overview for performance and functional benefits (GH 42273)
- DataFrameGroupBy.cummin(), SeriesGroupBy.cummin(), DataFrameGroupBy.cummax(), and SeriesGroupBy.cummax() now support the argument skipna (GH 34047)
- read_table() now supports the argument storage_options (GH 39167)
- DataFrame.to_stata() and StataWriter() now accept the keyword only argument value_labels to save labels for non-categorical columns (GH 38454)
- Methods that relied on hashmap based algos such as DataFrameGroupBy.value_counts(), DataFrameGroupBy.count() and factorize() ignored imaginary component for complex numbers (GH 17927)
- Add Series.str.removeprefix() and Series.str.removesuffix() introduced in Python 3.9 to remove pre-/suffixes from string-type Series (GH 36944)
- Attempting to write into a file in a missing parent directory with DataFrame.to_csv(), DataFrame.to_html(), DataFrame.to_excel(), DataFrame.to_feather(), DataFrame.to_parquet(), DataFrame.to_stata(), DataFrame.to_json(), DataFrame.to_pickle(), and DataFrame.to_xml() now raises an error that explicitly mentions the missing parent directory; the same is true for the Series counterparts (GH 24306)
- Indexing with .loc and .iloc now supports Ellipsis (GH 37750)
- IntegerArray.all(), IntegerArray.any(), FloatingArray.any(), and FloatingArray.all() use Kleene logic (GH 41967)
- Added support for nullable boolean and integer types in DataFrame.to_stata(), StataWriter, StataWriter117, and StataWriterUTF8 (GH 40855)
- DataFrame.__pos__() and DataFrame.__neg__() now retain ExtensionDtype dtypes (GH 43883)
- The error raised when an optional dependency can’t be imported now includes the original exception, for easier investigation (GH 43882)
- Added ExponentialMovingWindow.sum() (GH 13297)
- Series.str.split() now supports a regex argument that explicitly specifies whether the pattern is a regular expression. Default is None (GH 43563, GH 32835, GH 25549)
- DataFrame.dropna() now accepts a single label as subset along with array-like (GH 41021)
- Added DataFrameGroupBy.value_counts() (GH 43564)
- read_csv() now accepts a callable function in on_bad_lines when engine="python" for custom handling of bad lines (GH 5686)
- ExcelWriter argument if_sheet_exists="overlay" option added (GH 40231)
- read_excel() now accepts a decimal argument that allows the user to specify the decimal point when parsing string columns to numeric (GH 14403)
- DataFrameGroupBy.mean(), SeriesGroupBy.mean(), DataFrameGroupBy.std(), SeriesGroupBy.std(), DataFrameGroupBy.var(), SeriesGroupBy.var(), DataFrameGroupBy.sum(), and SeriesGroupBy.sum() now support Numba execution with the engine keyword (GH 43731, GH 44862, GH 44939)
- Timestamp.isoformat() now handles the timespec argument from the base datetime class (GH 26131)
- NaT.to_numpy() dtype argument is now respected, so np.timedelta64 can be returned (GH 44460)
- New option display.max_dir_items customizes the number of columns added to DataFrame.__dir__() and suggested for tab completion (GH 37996)
- Added “Juneteenth National Independence Day” to USFederalHolidayCalendar (GH 44574)
- Rolling.var(), Expanding.var(), Rolling.std(), and Expanding.std() now support Numba execution with the engine keyword (GH 44461)
- Series.info() has been added, for compatibility with DataFrame.info() (GH 5167)
- Implemented IntervalArray.min() and IntervalArray.max(), as a result of which min and max now work for IntervalIndex, Series and DataFrame with IntervalDtype (GH 44746)
- UInt64Index.map() now retains dtype where possible (GH 44609)
- read_json() can now parse unsigned long long integers (GH 26068)
- DataFrame.take() now raises a TypeError when passed a scalar for the indexer (GH 42875)
- is_list_like() now identifies duck-arrays as list-like unless .ndim == 0 (GH 35131)
- ExtensionDtype and ExtensionArray are now (de)serialized when exporting a DataFrame with DataFrame.to_json() using orient='table' (GH 20612, GH 44705)
- Add support for Zstandard compression to DataFrame.to_pickle()/read_pickle() and friends (GH 43925)
- DataFrame.to_sql() now returns an int of the number of written rows (GH 23998)
Notable bug fixes#
These are bug fixes that might have notable behavior changes.
Inconsistent date string parsing#
The dayfirst option of to_datetime() isn’t strict, and this can lead to surprising behavior:
In [17]: pd.to_datetime(["31-12-2021"], dayfirst=False)
Out[17]: DatetimeIndex(['2021-12-31'], dtype='datetime64[s]', freq=None)
Now, a warning will be raised if a date string cannot be parsed in accordance with the given dayfirst value when the value is a delimited date string (e.g. 31-12-2012).
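One way to sidestep the ambiguity altogether is to pass an explicit format instead of relying on dayfirst. The sketch below assumes the input strings are all day-month-year with dashes; the sample values are invented for illustration.

```python
import pandas as pd

# With an explicit format there is nothing to guess: day comes first, then month.
parsed = pd.to_datetime(["31-12-2021", "01-02-2022"], format="%d-%m-%Y")

# "01-02-2022" is read as 1 February, not 2 January.
print(parsed)
```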
Ignoring dtypes in concat with empty or all-NA columns#
Note
This behaviour change has been reverted in pandas 1.4.3.
When using concat() to concatenate two or more DataFrame objects, if one of the DataFrames was empty or had all-NA values, its dtype was sometimes ignored when finding the concatenated dtype. These are now consistently not ignored (GH 43507).
In [3]: df1 = pd.DataFrame({"bar": [pd.Timestamp("2013-01-01")]}, index=range(1))
In [4]: df2 = pd.DataFrame({"bar": np.nan}, index=range(1, 2))
In [5]: res = pd.concat([df1, df2])
Previously, the float-dtype in df2 would be ignored so the result dtype would be datetime64[ns]. As a result, the np.nan would be cast to NaT.
Previous behavior:
In [6]: res
Out[6]:
         bar
0 2013-01-01
1        NaT
Now the float-dtype is respected. Since the common dtype for these DataFrames is object, the np.nan is retained.
New behavior:
In [6]: res
Out[6]:
                   bar
0  2013-01-01 00:00:00
1                  NaN
Null-values are no longer coerced to NaN-value in value_counts and mode#
Series.value_counts() and Series.mode() no longer coerce None, NaT and other null-values to a NaN-value for np.object_-dtype. This behavior is now consistent with unique, isin and others (GH 42688).
In [18]: s = pd.Series([True, None, pd.NaT, None, pd.NaT, None])
In [19]: res = s.value_counts(dropna=False)
Previously, all null-values were replaced by a NaN-value.
Previous behavior:
In [3]: res
Out[3]:
NaN     5
True    1
dtype: int64
Now null-values are no longer mangled.
New behavior:
In [20]: res
Out[20]:
None    3
NaT     2
True    1
Name: count, dtype: int64
mangle_dupe_cols in read_csv no longer renames unique columns conflicting with target names#
read_csv() no longer renames unique column labels which conflict with the target names of duplicated columns. Already existing columns are skipped, i.e. the next available index is used for the target column name (GH 14704).
In [21]: import io
In [22]: data = "a,a,a.1\n1,2,3"
In [23]: res = pd.read_csv(io.StringIO(data))
Previously, the second column was called a.1, while the third column was also renamed to a.1.1.
Previous behavior:
In [3]: res
Out[3]:
   a  a.1  a.1.1
0  1    2      3
Now the renaming checks whether a.1 already exists when changing the name of the second column and skips this index. The second column is instead renamed to a.2.
New behavior:
In [24]: res
Out[24]:
   a  a.2  a.1
0  1    2    3
unstack and pivot_table no longer raises ValueError for result that would exceed int32 limit#
Previously DataFrame.pivot_table() and DataFrame.unstack() would raise a ValueError if the operation could produce a result with more than 2**31 - 1 elements. This operation now raises an errors.PerformanceWarning instead (GH 26314).
Previous behavior:
In [3]: df = pd.DataFrame({"ind1": np.arange(2 ** 16), "ind2": np.arange(2 ** 16), "count": 0})
In [4]: df.pivot_table(index="ind1", columns="ind2", values="count", aggfunc="count")
ValueError: Unstacked DataFrame is too big, causing int32 overflow
New behavior:
In [4]: df.pivot_table(index="ind1", columns="ind2", values="count", aggfunc="count")
PerformanceWarning: The following operation may generate 4294967296 cells in the resulting pandas object.
groupby.apply consistent transform detection#
DataFrameGroupBy.apply() and SeriesGroupBy.apply() are designed to be flexible, allowing users to perform aggregations, transformations, filters, and use it with user-defined functions that might not fall into any of these categories. As part of this, apply will attempt to detect when an operation is a transform, and in such a case, the result will have the same index as the input. In order to determine if the operation is a transform, pandas compares the input’s index to the result’s and determines if it has been mutated. Previously in pandas 1.3, different code paths used different definitions of “mutated”: some would use Python’s is whereas others would test only up to equality.
This inconsistency has been removed; pandas now tests up to equality.
In [25]: def func(x):
   ....:     return x.copy()
   ....:

In [26]: df = pd.DataFrame({'a': [1, 2], 'b': [3, 4], 'c': [5, 6]})

In [27]: df
Out[27]:
   a  b  c
0  1  3  5
1  2  4  6
Previous behavior:
In [3]: df.groupby(['a']).apply(func)
Out[3]:
     a  b  c
a
1 0  1  3  5
2 1  2  4  6

In [4]: df.set_index(['a', 'b']).groupby(['a']).apply(func)
Out[4]:
     c
a b
1 3  5
2 4  6
In the examples above, the first uses a code path where pandas uses is and determines that func is not a transform, whereas the second tests up to equality and determines that func is a transform. In the first case, the result’s index is not the same as the input’s.
New behavior:
In [5]: df.groupby(['a']).apply(func)
Out[5]:
   a  b  c
0  1  3  5
1  2  4  6

In [6]: df.set_index(['a', 'b']).groupby(['a']).apply(func)
Out[6]:
     c
a b
1 3  5
2 4  6
Now in both cases it is determined that func is a transform. In each case, the result has the same index as the input.
Backwards incompatible API changes#
Increased minimum version for Python#
pandas 1.4.0 supports Python 3.8 and higher.
Increased minimum versions for dependencies#
Some minimum supported versions of dependencies were updated. If installed, we now require:
For optional libraries the general recommendation is to use the latest version. The following table lists the lowest version per library that is currently being tested throughout the development of pandas. Optional libraries below the lowest tested version may still work, but are not considered supported.
See Dependencies and Optional dependencies for more.
Other API changes#
- Index.get_indexer_for() no longer accepts keyword arguments (other than target); in the past these would be silently ignored if the index was not unique (GH 42310)
- Change in the position of the min_rows argument in DataFrame.to_string() due to change in the docstring (GH 44304)
- Reduction operations for DataFrame or Series now raise a ValueError when None is passed for skipna (GH 44178)
- read_csv() and read_html() no longer raise an error when one of the header rows consists only of Unnamed: columns (GH 13054)
- Changed the name attribute of several holidays in USFederalHolidayCalendar to match official federal holiday names, specifically:
  - “New Year’s Day” gains the possessive apostrophe
  - “Presidents Day” becomes “Washington’s Birthday”
  - “Martin Luther King Jr. Day” is now “Birthday of Martin Luther King, Jr.”
  - “July 4th” is now “Independence Day”
  - “Thanksgiving” is now “Thanksgiving Day”
  - “Christmas” is now “Christmas Day”
  - Added “Juneteenth National Independence Day”
Deprecations#
Deprecated Int64Index, UInt64Index & Float64Index#
Int64Index, UInt64Index and Float64Index have been deprecated in favor of the base Index class and will be removed in pandas 2.0 (GH 43028).
For constructing a numeric index, you can use the base Index class instead, specifying the data type (which will also work on older pandas releases):
replace
pd.Int64Index([1, 2, 3])
with
pd.Index([1, 2, 3], dtype="int64")
For checking the data type of an index object, you can replace isinstance checks with checking the dtype:
replace
isinstance(idx, pd.Int64Index)
with
idx.dtype == "int64"
Currently, in order to maintain backward compatibility, calls to Index will continue to return Int64Index, UInt64Index and Float64Index when given numeric data, but in the future, an Index will be returned.
Current behavior:
In [1]: pd.Index([1, 2, 3], dtype="int32")
Out [1]: Int64Index([1, 2, 3], dtype='int64')
In [1]: pd.Index([1, 2, 3], dtype="uint64")
Out [1]: UInt64Index([1, 2, 3], dtype='uint64')
Future behavior:
In [3]: pd.Index([1, 2, 3], dtype="int32")
Out [3]: Index([1, 2, 3], dtype='int32')
In [4]: pd.Index([1, 2, 3], dtype="uint64")
Out [4]: Index([1, 2, 3], dtype='uint64')
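The dtype-based replacements described above can be combined into a check that does not reference the deprecated classes at all. This is a small sketch; the use of pd.api.types.is_integer_dtype for a looser "any integer width" test is a suggestion added here, not something these notes prescribe.

```python
import pandas as pd

# Construct through the base class with an explicit dtype.
idx = pd.Index([1, 2, 3], dtype="int64")

# Exact check, replacing isinstance(idx, pd.Int64Index).
is_int64 = idx.dtype == "int64"

# Looser check that accepts any integer width (an added suggestion).
is_any_integer = pd.api.types.is_integer_dtype(idx)

print(is_int64, is_any_integer)
```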
Deprecated DataFrame.append and Series.append#
DataFrame.append() and Series.append() have been deprecated and will be removed in a future version. Use pandas.concat() instead (GH 35407).
Deprecated syntax
In [1]: pd.Series([1, 2]).append(pd.Series([3, 4]))
Out [1]:
<stdin>:1: FutureWarning: The series.append method is deprecated and will be removed from pandas in a future version. Use pandas.concat instead.
0    1
1    2
0    3
1    4
dtype: int64

In [2]: df1 = pd.DataFrame([[1, 2], [3, 4]], columns=list('AB'))
In [3]: df2 = pd.DataFrame([[5, 6], [7, 8]], columns=list('AB'))
In [4]: df1.append(df2)
Out [4]:
<stdin>:1: FutureWarning: The frame.append method is deprecated and will be removed from pandas in a future version. Use pandas.concat instead.
   A  B
0  1  2
1  3  4
0  5  6
1  7  8
Recommended syntax
In [28]: pd.concat([pd.Series([1, 2]), pd.Series([3, 4])])
Out[28]:
0    1
1    2
0    3
1    4
dtype: int64
In [29]: df1 = pd.DataFrame([[1, 2], [3, 4]], columns=list('AB'))
In [30]: df2 = pd.DataFrame([[5, 6], [7, 8]], columns=list('AB'))
In [31]: pd.concat([df1, df2])
Out[31]:
   A  B
0  1  2
1  3  4
0  5  6
1  7  8
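Code that called append repeatedly inside a loop can usually be migrated by collecting the pieces in a list and concatenating once at the end. The sketch below uses invented data; passing ignore_index=True is a choice made for this example to get a fresh 0..n-1 index, and can be dropped if the original labels matter.

```python
import pandas as pd

# Collect the pieces first instead of growing a DataFrame with append().
chunks = []
for start in range(0, 6, 2):
    chunks.append(pd.DataFrame({"A": [start, start + 1]}))

# A single concat call at the end builds the result.
result = pd.concat(chunks, ignore_index=True)

print(result)
```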
Other Deprecations#
- Deprecated Index.is_type_compatible() (GH 42113)
- Deprecated method argument in Index.get_loc(), use index.get_indexer([label], method=...) instead (GH 42269)
- Deprecated treating integer keys in Series.__setitem__() as positional when the index is a Float64Index not containing the key, an IntervalIndex with no entries containing the key, or a MultiIndex with leading Float64Index level not containing the key (GH 33469)
- Deprecated treating numpy.datetime64 objects as UTC times when passed to the Timestamp constructor along with a timezone. In a future version, these will be treated as wall-times. To retain the old behavior, use Timestamp(dt64).tz_localize("UTC").tz_convert(tz) (GH 24559)
- Deprecated ignoring missing labels when indexing with a sequence of labels on a level of a MultiIndex (GH 42351)
- Creating an empty Series without a dtype will now raise a more visible FutureWarning instead of a DeprecationWarning (GH 30017)
- Deprecated the kind argument in Index.get_slice_bound(), Index.slice_indexer(), and Index.slice_locs(); in a future version passing kind will raise (GH 42857)
- Deprecated dropping of nuisance columns in Rolling, Expanding, and EWM aggregations (GH 42738)
- Deprecated Index.reindex() with a non-unique Index (GH 42568)
- Deprecated Styler.render() in favor of Styler.to_html() (GH 42140)
- Deprecated Styler.hide_index() and Styler.hide_columns() in favor of Styler.hide() (GH 43758)
- Deprecated passing in a string column label into times in DataFrame.ewm() (GH 43265)
- Deprecated the include_start and include_end arguments in DataFrame.between_time(); in a future version passing include_start or include_end will raise (GH 40245)
- Deprecated the squeeze argument to read_csv(), read_table(), and read_excel(). Users should squeeze the DataFrame afterwards with .squeeze("columns") instead (GH 43242)
- Deprecated the index argument to SparseArray construction (GH 23089)
- Deprecated the closed argument in date_range() and bdate_range() in favor of inclusive argument; in a future version passing closed will raise (GH 40245)
- Deprecated Rolling.validate(), Expanding.validate(), and ExponentialMovingWindow.validate() (GH 43665)
- Deprecated silent dropping of columns that raised a TypeError in Series.transform and DataFrame.transform when used with a dictionary (GH 43740)
- Deprecated silent dropping of columns that raised a TypeError, DataError, and some cases of ValueError in Series.aggregate(), DataFrame.aggregate(), Series.groupby.aggregate(), and DataFrame.groupby.aggregate() when used with a list (GH 43740)
- Deprecated casting behavior when setting timezone-aware value(s) into a timezone-aware Series or DataFrame column when the timezones do not match. Previously this cast to object dtype. In a future version, the values being inserted will be converted to the series or column’s existing timezone (GH 37605)
- Deprecated casting behavior when passing an item with mismatched-timezone to DatetimeIndex.insert(), DatetimeIndex.putmask(), DatetimeIndex.where(), DatetimeIndex.fillna(), Series.mask(), Series.where(), Series.fillna(), Series.shift(), Series.replace(), Series.reindex() (and DataFrame column analogues). In the past this has cast to object dtype. In a future version, these will cast the passed item to the index or series’s timezone (GH 37605, GH 44940)
- Deprecated the prefix keyword argument in read_csv() and read_table(); in a future version the argument will be removed (GH 43396)
- Deprecated passing non boolean argument to sort in concat() (GH 41518)
- Deprecated passing arguments as positional for read_fwf() other than filepath_or_buffer (GH 41485)
- Deprecated passing arguments as positional for read_xml() other than path_or_buffer (GH 45133)
- Deprecated passing skipna=None for DataFrame.mad() and Series.mad(), pass skipna=True instead (GH 44580)
- Deprecated the behavior of to_datetime() with the string “now” with utc=False; in a future version this will match Timestamp("now"), which in turn matches Timestamp.now() returning the local time (GH 18705)
- Deprecated DateOffset.apply(), use offset + other instead (GH 44522)
- Deprecated parameter names in Index.copy() (GH 44916)
- A deprecation warning is now shown for DataFrame.to_latex() indicating that the arguments signature may change to more closely emulate the arguments of Styler.to_latex() in future versions (GH 44411)
- Deprecated behavior of concat() between objects with bool-dtype and numeric-dtypes; in a future version these will cast to object dtype instead of coercing bools to numeric values (GH 39817)
- Deprecated Categorical.replace(), use Series.replace() instead (GH 44929)
- Deprecated passing set or dict as indexer for DataFrame.loc.__setitem__(), DataFrame.loc.__getitem__(), Series.loc.__setitem__(), Series.loc.__getitem__(), DataFrame.__getitem__(), Series.__getitem__() and Series.__setitem__() (GH 42825)
- Deprecated Index.__getitem__() with a bool key; use index.values[key] to get the old behavior (GH 44051)
- Deprecated downcasting column-by-column in DataFrame.where() with integer-dtypes (GH 44597)
- Deprecated DatetimeIndex.union_many(), use DatetimeIndex.union() instead (GH 44091)
- Deprecated Groupby.pad() in favor of Groupby.ffill() (GH 33396)
- Deprecated Groupby.backfill() in favor of Groupby.bfill() (GH 33396)
- Deprecated Resample.pad() in favor of Resample.ffill() (GH 33396)
- Deprecated Resample.backfill() in favor of Resample.bfill() (GH 33396)
- Deprecated numeric_only=None in DataFrame.rank(); in a future version numeric_only must be either True or False (the default) (GH 45036)
- Deprecated the behavior of Timestamp.utcfromtimestamp(), in the future it will return a timezone-aware UTC Timestamp (GH 22451)
- Deprecated NaT.freq() (GH 45071)
- Deprecated behavior of Series and DataFrame construction when passed float-dtype data containing NaN and an integer dtype ignoring the dtype argument; in a future version this will raise (GH 40110)
- Deprecated the behaviour of Series.to_frame() and Index.to_frame() to ignore the name argument when name=None. Currently, this means to preserve the existing name, but in the future explicitly passing name=None will set None as the name of the column in the resulting DataFrame (GH 44212)
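Several of the deprecations above name a direct replacement. The following sketch shows three of them side by side with invented data: squeezing after read_csv() instead of passing squeeze, using inclusive instead of closed in date_range(), and calling ffill() on a groupby instead of pad().

```python
import io

import pandas as pd

# Instead of read_csv(..., squeeze=True): read normally, then squeeze the columns.
series = pd.read_csv(io.StringIO("value\n1\n2\n3")).squeeze("columns")

# Instead of date_range(..., closed="left"): use the inclusive argument.
days = pd.date_range("2022-01-01", "2022-01-04", inclusive="left")

# Instead of groupby(...).pad(): use ffill(), which fills forward within each group.
df = pd.DataFrame({"key": ["a", "a", "b", "b"], "val": [1.0, None, 2.0, None]})
filled = df.groupby("key")["val"].ffill()

print(type(series).__name__, len(days), filled.tolist())
```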
Performance improvements#
- Performance improvement in DataFrameGroupBy.sample() and SeriesGroupBy.sample(), especially when weights argument provided (GH 34483)
- Performance improvement when converting non-string arrays to string arrays (GH 34483)
- Performance improvement in DataFrameGroupBy.transform() and SeriesGroupBy.transform() for user-defined functions (GH 41598)
- Performance improvement in constructing DataFrame objects (GH 42631, GH 43142, GH 43147, GH 43307, GH 43144, GH 44826)
- Performance improvement in DataFrameGroupBy.shift() and SeriesGroupBy.shift() when fill_value argument is provided (GH 26615)
- Performance improvement in DataFrame.corr() for method=pearson on data without missing values (GH 40956)
- Performance improvement in some DataFrameGroupBy.apply() and SeriesGroupBy.apply() operations (GH 42992, GH 43578)
- Performance improvement in read_stata() (GH 43059, GH 43227)
- Performance improvement in read_sas() (GH 43333)
- Performance improvement in to_datetime() with uint dtypes (GH 42606)
- Performance improvement in to_datetime() with infer_datetime_format set to True (GH 43901)
- Performance improvement in Series.sparse.to_coo() (GH 42880)
- Performance improvement in indexing with a UInt64Index (GH 43862)
- Performance improvement in indexing with a Float64Index (GH 43705)
- Performance improvement in indexing with a non-unique Index (GH 43792)
- Performance improvement in indexing with a listlike indexer on a MultiIndex (GH 43370)
- Performance improvement in indexing with a MultiIndex indexer on another MultiIndex (GH 43370)
- Performance improvement in DataFrameGroupBy.quantile() and SeriesGroupBy.quantile() (GH 43469, GH 43725)
- Performance improvement in DataFrameGroupBy.count() and SeriesGroupBy.count() (GH 43730, GH 43694)
- Performance improvement in DataFrameGroupBy.any(), SeriesGroupBy.any(), DataFrameGroupBy.all(), and SeriesGroupBy.all() (GH 43675, GH 42841)
- Performance improvement in DataFrameGroupBy.std() and SeriesGroupBy.std() (GH 43115, GH 43576)
- Performance improvement in DataFrameGroupBy.cumsum() and SeriesGroupBy.cumsum() (GH 43309)
- SparseArray.min() and SparseArray.max() no longer require converting to a dense array (GH 43526)
- Indexing into a SparseArray with a slice with step=1 no longer requires converting to a dense array (GH 43777)
- Performance improvement in SparseArray.take() with allow_fill=False (GH 43654)
- Performance improvement in Rolling.mean(), Expanding.mean(), Rolling.sum(), Expanding.sum(), Rolling.max(), Expanding.max(), Rolling.min() and Expanding.min() with engine="numba" (GH 43612, GH 44176, GH 45170)
- Improved performance of pandas.read_csv() with memory_map=True when file encoding is UTF-8 (GH 43787)
- Performance improvement in RangeIndex.sort_values() overriding Index.sort_values() (GH 43666)
- Performance improvement in RangeIndex.insert() (GH 43988)
- Performance improvement in Index.insert() (GH 43953)
- Performance improvement in DatetimeIndex.tolist() (GH 43823)
- Performance improvement in DatetimeIndex.union() (GH 42353)
- Performance improvement in Series.nsmallest() (GH 43696)
- Performance improvement in DataFrame.insert() (GH 42998)
- Performance improvement in DataFrame.dropna() (GH 43683)
- Performance improvement in DataFrame.fillna() (GH 43316)
- Performance improvement in DataFrame.values() (GH 43160)
- Performance improvement in DataFrame.select_dtypes() (GH 42611)
- Performance improvement in DataFrame reductions (GH 43185, GH 43243, GH 43311, GH 43609)
- Performance improvement in Series.unstack() and DataFrame.unstack() (GH 43335, GH 43352, GH 42704, GH 43025)
- Performance improvement in Series.to_frame() (GH 43558)
- Performance improvement in Series.mad() (GH 43010)
- Performance improvement in merge() (GH 43332)
- Performance improvement in to_csv() when index column is a datetime and is formatted (GH 39413)
- Performance improvement in to_csv() when MultiIndex contains a lot of unused levels (GH 37484)
- Performance improvement in read_csv() when index_col was set with a numeric column (GH 44158)
- Performance improvement in concat() (GH 43354)
- Performance improvement in SparseArray.__getitem__() (GH 23122)
- Performance improvement in constructing a DataFrame from array-like objects like a PyTorch tensor (GH 44616)
Bug fixes#
Categorical#
- Bug in setting dtype-incompatible values into a Categorical (or Series or DataFrame backed by Categorical) raising ValueError instead of TypeError (GH 41919)
- Bug in Categorical.searchsorted() when passing a dtype-incompatible value raising KeyError instead of TypeError (GH 41919)
- Bug in Categorical.astype() casting datetimes and Timestamp to int for dtype object (GH 44930)
- Bug in Series.where() with CategoricalDtype when passing a dtype-incompatible value raising ValueError instead of TypeError (GH 41919)
- Bug in Categorical.fillna() when passing a dtype-incompatible value raising ValueError instead of TypeError (GH 41919)
- Bug in Categorical.fillna() with a tuple-like category raising ValueError instead of TypeError when filling with a non-category tuple (GH 41919)
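The exception type described in the entries above can be checked with a short sketch. It assumes pandas 1.4 or later; the category values are made up for illustration.

```python
import pandas as pd

cat = pd.Categorical(["a", "b", None], categories=["a", "b"])

# Setting a value that is not one of the categories raises TypeError.
try:
    cat[0] = "z"
    setitem_error = None
except TypeError as err:
    setitem_error = err

# Filling missing entries with a non-category value raises TypeError as well.
try:
    cat.fillna("z")
    fillna_error = None
except TypeError as err:
    fillna_error = err

print(type(setitem_error).__name__, type(fillna_error).__name__)
```

Catching TypeError (rather than ValueError) is therefore the pattern to use when guarding such assignments.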
Datetimelike#
- Bug in DataFrame constructor unnecessarily copying non-datetimelike 2D object arrays (GH 39272)
- Bug in to_datetime() with format and pandas.NA was raising ValueError (GH 42957)
- to_datetime() would silently swap MM/DD/YYYY and DD/MM/YYYY formats if the given dayfirst option could not be respected; now a warning is raised in the case of delimited date strings (e.g. 31-12-2012) (GH 12585)
- Bug in date_range() and bdate_range() not returning the right bound when start = end and the set is closed on one side (GH 43394)
- Bug in inplace addition and subtraction of DatetimeIndex or TimedeltaIndex with DatetimeArray or TimedeltaArray (GH 43904)
- Bug in calling np.isnan, np.isfinite, or np.isinf on a timezone-aware DatetimeIndex incorrectly raising TypeError (GH 43917)
- Bug in constructing a Series from datetime-like strings with mixed timezones incorrectly partially-inferring datetime values (GH 40111)
- Bug in addition of a Tick object and a np.timedelta64 object incorrectly raising instead of returning Timedelta (GH 44474)
- np.maximum.reduce and np.minimum.reduce now correctly return Timestamp and Timedelta objects when operating on Series, DataFrame, or Index with datetime64[ns] or timedelta64[ns] dtype (GH 43923)
- Bug in adding a np.timedelta64 object to a BusinessDay or CustomBusinessDay object incorrectly raising (GH 44532)
- Bug in Index.insert() for inserting np.datetime64, np.timedelta64 or tuple into Index with dtype='object' with negative loc adding None and replacing existing value (GH 44509)
- Bug in Timestamp.to_pydatetime() failing to retain the fold attribute (GH 45087)
- Bug in Series.mode() with DatetimeTZDtype incorrectly returning timezone-naive and PeriodDtype incorrectly raising (GH 41927)
- Fixed regression in reindex() raising an error when using an incompatible fill value with a datetime-like dtype (or not raising a deprecation warning for using a datetime.date as fill value) (GH 42921)
- Bug in DateOffset addition with Timestamp where offset.nanoseconds would not be included in the result (GH 43968, GH 36589)
- Bug in Timestamp.fromtimestamp() not supporting the tz argument (GH 45083)
- Bug in DataFrame construction from dict of Series with mismatched index dtypes sometimes raising depending on the ordering of the passed dict (GH 44091)
- Bug in Timestamp hashing during some DST transitions caused a segmentation fault (GH 33931 and GH 40817)
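As an illustration of the np.isnan entry above, the following sketch calls the ufunc on a timezone-aware DatetimeIndex. It assumes pandas 1.4 or later with a reasonably recent NumPy; the dates are arbitrary.

```python
import numpy as np
import pandas as pd

# A timezone-aware index with one missing entry (None becomes NaT).
idx = pd.DatetimeIndex(["2022-01-01", None], tz="UTC")

# The ufunc call no longer raises; np.asarray normalizes the result
# to a plain boolean ndarray.
missing = np.asarray(np.isnan(idx))
print(missing)
```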
Timedelta#
- Bug in division of all-NaT TimedeltaIndex, Series or DataFrame column with object-dtype array-like of numbers failing to infer the result as timedelta64-dtype (GH 39750)
- Bug in floor division of timedelta64[ns] data with a scalar returning garbage values (GH 44466)
- Timedelta now properly takes into account any nanoseconds contribution of any kwarg (GH 43764, GH 45227)
Time Zones#
- Bug in to_datetime() with infer_datetime_format=True failing to parse zero UTC offset (Z) correctly (GH 41047)
- Bug in Series.dt.tz_convert() resetting index in a Series with CategoricalIndex (GH 43080)
- Bug in Timestamp and DatetimeIndex incorrectly raising a TypeError when subtracting two timezone-aware objects with mismatched timezones (GH 31793)
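The last entry can be illustrated with two Timestamp objects in different timezones. This is a minimal sketch assuming pandas 1.4 or later; fixed-offset timezones from the standard library are used only to keep the example independent of a timezone database.

```python
from datetime import timedelta, timezone

import pandas as pd

utc_time = pd.Timestamp("2022-01-01 12:00", tz=timezone.utc)
offset_time = pd.Timestamp("2022-01-01 12:00", tz=timezone(timedelta(hours=5)))

# 12:00 at UTC+5 is 07:00 UTC, so the difference is five hours.
difference = utc_time - offset_time
print(difference)
```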
Numeric#
- Bug in floor-dividing a list or tuple of integers by a Series incorrectly raising (GH 44674)
- Bug in DataFrame.rank() raising ValueError with object columns and method="first" (GH 41931)
- Bug in DataFrame.rank() treating missing values and extreme values as equal (for example np.nan and np.inf), causing incorrect results when na_option="bottom" or na_option="top" is used (GH 41931)
- Bug in numexpr engine still being used when the option compute.use_numexpr is set to False (GH 32556)
- Bug in DataFrame arithmetic ops with a subclass whose _constructor() attribute is a callable other than the subclass itself (GH 43201)
- Bug in arithmetic operations involving RangeIndex where the result would have the incorrect name (GH 43962)
- Bug in arithmetic operations involving Series where the result could have the incorrect name when the operands have matching NA or matching tuple names (GH 44459)
- Bug in division with IntegerDtype or BooleanDtype array and NA scalar incorrectly raising (GH 44685)
- Bug in multiplying a Series with FloatingDtype with a timedelta-like scalar incorrectly raising (GH 44772)
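Two of the fixes above (floor-dividing a list by a Series, and dividing a nullable integer array by NA) can be sketched as follows. The example assumes pandas 1.4 or later; the numbers are arbitrary.

```python
import pandas as pd

# Floor-dividing a plain list by a Series works element-wise.
quotient = [10, 20] // pd.Series([3, 4])

# Dividing a nullable integer array by the NA scalar no longer raises;
# every element of the result is missing.
ints = pd.array([1, 2, None], dtype="Int64")
divided = ints / pd.NA

print(quotient.tolist())
print(divided)
```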
Conversion#
- Bug in UInt64Index constructor when passing a list containing both positive integers small enough to cast to int64 and integers too large to hold in int64 (GH 42201)
- Bug in Series constructor returning 0 for missing values with dtype int64 and False for dtype bool (GH 43017, GH 43018)
- Bug in constructing a DataFrame from a PandasArray containing Series objects behaving differently than an equivalent np.ndarray (GH 43986)
- Bug in IntegerDtype not allowing coercion from string dtype (GH 25472)
- Bug in to_datetime() with arg:xr.DataArray and unit="ns" specified raising TypeError (GH 44053)
- Bug in DataFrame.convert_dtypes() not returning the correct type when a subclass does not overload _constructor_sliced() (GH 43201)
- Bug in DataFrame.astype() not propagating attrs from the original DataFrame (GH 44414)
- Bug in DataFrame.convert_dtypes() result losing columns.names (GH 41435)
- Bug in constructing an IntegerArray from pyarrow data failing to validate dtypes (GH 44891)
- Bug in Series.astype() not allowing converting from a PeriodDtype to datetime64 dtype, inconsistent with the PeriodIndex behavior (GH 45038)
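The attrs propagation and the string-to-integer coercion mentioned above can be sketched like this, assuming pandas 1.4 or later. The "source" key and its value are hypothetical metadata chosen for illustration.

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2]})
df.attrs["source"] = "sensor-1"  # hypothetical metadata

# astype() carries the attrs dictionary over to the result.
converted = df.astype("float64")

# Nullable string data can be coerced to a nullable integer dtype;
# the missing entry stays missing.
strings = pd.Series(["1", "2", None], dtype="string")
as_int = strings.astype("Int64")

print(converted.attrs)
print(as_int)
```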
Strings#
- Bug in checking for string[pyarrow] dtype incorrectly raising an ImportError when pyarrow is not installed (GH 44276)
Interval#
- Bug in Series.where() with IntervalDtype incorrectly raising when the where call should not replace anything (GH 44181)
Indexing#
- Bug in Series.rename() with MultiIndex when level is provided (GH 43659)
- Bug in DataFrame.truncate() and Series.truncate() when the object’s Index has a length greater than one but only one unique value (GH 42365)
- Bug in Series.loc() and DataFrame.loc() with a MultiIndex when indexing with a tuple in which one of the levels is also a tuple (GH 27591)
- Bug in Series.loc() with a MultiIndex whose first level contains only np.nan values (GH 42055)
- Bug in indexing on a Series or DataFrame with a DatetimeIndex when passing a string, the return type depended on whether the index was monotonic (GH 24892)
- Bug in indexing on a MultiIndex failing to drop scalar levels when the indexer is a tuple containing a datetime-like string (GH 42476)
- Bug in DataFrame.sort_values() and Series.sort_values() when passing an ascending value, failing to raise or incorrectly raising ValueError (GH 41634)
- Bug in updating values of pandas.Series using boolean index, created by using pandas.DataFrame.pop() (GH 42530)
- Bug in Index.get_indexer_non_unique() when index contains multiple np.nan (GH 35392)
- Bug in DataFrame.query() did not handle the degree sign in a backticked column name, such as `Temp(°C)`, used in an expression to query a DataFrame (GH 42826)
- Bug in DataFrame.drop() where the error message did not show missing labels with commas when raising KeyError (GH 42881)
- Bug in DataFrame.query() where method calls in query strings led to errors when the numexpr package was installed (GH 22435)
- Bug in DataFrame.nlargest() and Series.nlargest() where sorted result did not count indexes containing np.nan (GH 28984)
- Bug in indexing on a non-unique object-dtype Index with an NA scalar (e.g. np.nan) (GH 43711)
- Bug in DataFrame.__setitem__() incorrectly writing into an existing column’s array rather than setting a new array when the new dtype and the old dtype match (GH 43406)
- Bug in setting floating-dtype values into a Series with integer dtype failing to set inplace when those values can be losslessly converted to integers (GH 44316)
- Bug in Series.__setitem__() with object dtype when setting an array with matching size and dtype='datetime64[ns]' or dtype='timedelta64[ns]' incorrectly converting the datetime/timedeltas to integers (GH 43868)
- Bug in DataFrame.sort_index() where ignore_index=True was not being respected when the index was already sorted (GH 43591)
- Bug in Index.get_indexer_non_unique() when index contains multiple np.datetime64("NaT") and np.timedelta64("NaT") (GH 43869)
- Bug in setting a scalar Interval value into a Series with IntervalDtype when the scalar’s sides are floats and the values’ sides are integers (GH 44201)
- Bug when setting string-backed Categorical values that can be parsed to datetimes into a DatetimeArray or Series or DataFrame column backed by DatetimeArray failing to parse these strings (GH 44236)
- Bug in Series.__setitem__() with an integer dtype other than int64 setting with a range object unnecessarily upcasting to int64 (GH 44261)
- Bug in Series.__setitem__() with a boolean mask indexer setting a listlike value of length 1 incorrectly broadcasting that value (GH 44265)
- Bug in Series.reset_index() not ignoring name argument when drop and inplace are set to True (GH 44575)
- Bug in DataFrame.loc.__setitem__() and DataFrame.iloc.__setitem__() with mixed dtypes sometimes failing to operate in-place (GH 44345)
- Bug in DataFrame.loc.__getitem__() incorrectly raising KeyError when selecting a single column with a boolean key (GH 44322)
- Bug in setting DataFrame.iloc() with a single ExtensionDtype column and setting 2D values e.g. df.iloc[:] = df.values incorrectly raising (GH 44514)
- Bug in setting values with DataFrame.iloc() with a single ExtensionDtype column and a tuple of arrays as the indexer (GH 44703)
- Bug in indexing on columns with loc or iloc using a slice with a negative step with ExtensionDtype columns incorrectly raising (GH 44551)
- Bug in DataFrame.loc.__setitem__() changing dtype when indexer was completely False (GH 37550)
- Bug in IntervalIndex.get_indexer_non_unique() returning boolean mask instead of array of integers for a non unique and non monotonic index (GH 44084)
- Bug in IntervalIndex.get_indexer_non_unique() not handling targets of dtype 'object' with NaNs correctly (GH 44482)
- Fixed regression where a single column np.matrix was no longer coerced to a 1d np.ndarray when added to a DataFrame (GH 42376)
- Bug in Series.__getitem__() with a CategoricalIndex of integers treating lists of integers as positional indexers, inconsistent with the behavior with a single scalar integer (GH 15470, GH 14865)
- Bug in Series.__setitem__() when setting floats or integers into integer-dtype Series failing to upcast when necessary to retain precision (GH 45121)
- Bug in DataFrame.iloc.__setitem__() ignoring the axis argument (GH 45032)
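Two of the indexing fixes above (the degree sign in a backticked query() column name, and sort_index() with ignore_index=True on an already sorted index) can be sketched as follows. The example assumes pandas 1.4 or later; the temperature readings are invented.

```python
import pandas as pd

readings = pd.DataFrame({"Temp(°C)": [18.5, 22.0, 25.5]}, index=[10, 20, 30])

# Backticks let query() refer to a column whose name is not a valid
# Python identifier, including the degree sign.
warm = readings.query("`Temp(°C)` > 20")

# The index is already sorted, but ignore_index=True still resets it.
renumbered = readings.sort_index(ignore_index=True)

print(warm)
print(renumbered.index.tolist())
```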
Missing#
- Bug in DataFrame.fillna() with limit and no method ignoring axis='columns' or axis=1 (GH 40989, GH 17399)
- Bug in DataFrame.fillna() not replacing missing values when using a dict-like value and duplicate column names (GH 43476)
- Bug in constructing a DataFrame with a dictionary with np.datetime64 as a value and dtype='timedelta64[ns]', or vice-versa, incorrectly casting instead of raising (GH 44428)
- Bug in Series.interpolate() and DataFrame.interpolate() with inplace=True not writing to the underlying array(s) in-place (GH 44749)
- Bug in Index.fillna() incorrectly returning an unfilled Index when NA values are present and downcast argument is specified. This now raises NotImplementedError instead; do not pass downcast argument (GH 44873)
- Bug in DataFrame.dropna() changing Index even if no entries were dropped (GH 41965)
- Bug in Series.fillna() with an object-dtype incorrectly ignoring downcast="infer" (GH 44241)
MultiIndex#
- Bug in MultiIndex.get_loc() where the first level is a DatetimeIndex and a string key is passed (GH 42465)
- Bug in MultiIndex.reindex() when passing a level that corresponds to an ExtensionDtype level (GH 42043)
- Bug in MultiIndex.get_loc() raising TypeError instead of KeyError on nested tuple (GH 42440)
- Bug in MultiIndex.union() setting wrong sortorder causing errors in subsequent indexing operations with slices (GH 44752)
- Bug in MultiIndex.putmask() where the other value was also a MultiIndex (GH 43212)
- Bug in MultiIndex.dtypes() where duplicate level names returned only one dtype per name (GH 45174)
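The dtypes fix in the last entry can be checked with a MultiIndex whose levels share a name. This is a small sketch assuming pandas 1.4 or later; the level name "level" is arbitrary.

```python
import pandas as pd

# Both levels deliberately share the same name.
mi = pd.MultiIndex.from_arrays([[1, 2], ["a", "b"]], names=["level", "level"])

# One dtype entry is reported per level, even with duplicate names.
level_dtypes = mi.dtypes
print(level_dtypes)
```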
I/O#
- Bug in read_excel() attempting to read chart sheets from .xlsx files (GH 41448)
- Bug in json_normalize() where errors=ignore could fail to ignore missing values of meta when record_path has a length greater than one (GH 41876)
- Bug in read_csv() with multi-header input and arguments referencing column names as tuples (GH 42446)
- Bug in read_fwf(), where difference in lengths of colspecs and names was not raising ValueError (GH 40830)
- Bug in Series.to_json() and DataFrame.to_json() where some attributes were skipped when serializing plain Python objects to JSON (GH 42768, GH 33043)
- Column headers are dropped when constructing a DataFrame from a sqlalchemy Row object (GH 40682)
- Bug in unpickling an Index with object dtype incorrectly inferring numeric dtypes (GH 43188)
- Bug in read_csv() where reading multi-header input with unequal lengths incorrectly raised IndexError (GH 43102)
- Bug in read_csv() raising ParserError when reading file in chunks and some chunk blocks have fewer columns than header for engine="c" (GH 21211)
- Bug in read_csv(), changed exception class when expecting a file path name or file-like object from OSError to TypeError (GH 43366)
- Bug in read_csv() and read_fwf() ignoring all skiprows except first when nrows is specified for engine='python' (GH 44021, GH 10261)
- Bug in read_csv() keeping the original column in object format when keep_date_col=True is set (GH 13378)
- Bug in read_json() not handling non-numpy dtypes correctly (especially category) (GH 21892, GH 33205)
- Bug in json_normalize() where multi-character sep parameter is incorrectly prefixed to every key (GH 43831)
- Bug in json_normalize() where reading data with missing multi-level metadata would not respect errors="ignore" (GH 44312)
- Bug in read_csv() using the second row to guess implicit index if header was set to None for engine="python" (GH 22144)
- Bug in read_csv() not recognizing bad lines when names were given for engine="c" (GH 22144)
- Bug in read_csv() with float_precision="round_trip" which did not skip initial/trailing whitespace (GH 43713)
- Bug when Python is built without the lzma module: a warning was raised at the pandas import time, even if the lzma capability isn’t used (GH 43495)
- Bug in read_csv() not applying dtype for index_col (GH 9435)
- Bug in dumping/loading a DataFrame with yaml.dump(frame) (GH 42748)
- Bug in read_csv() raising ValueError when names was longer than header but equal to data rows for engine="python" (GH 38453)
- Bug in ExcelWriter, where engine_kwargs were not passed through to all engines (GH 43442)
- Bug in read_csv() raising ValueError when parse_dates was used with MultiIndex columns (GH 8991)
- Bug in read_csv() not raising a ValueError when \n was specified as delimiter or sep which conflicts with lineterminator (GH 43528)
- Bug in to_csv() converting datetimes in categorical Series to integers (GH 40754)
- Bug in read_csv() converting columns to numeric after date parsing failed (GH 11019)
- Bug in read_csv() not replacing NaN values with np.nan before attempting date conversion (GH 26203)
- Bug in read_csv() raising AttributeError when attempting to read a .csv file and infer index column dtype from a nullable integer type (GH 44079)
- Bug in to_csv() always coercing datetime columns with different formats to the same format (GH 21734)
- DataFrame.to_csv() and Series.to_csv() with compression set to 'zip' no longer create a zip file containing a file ending with “.zip”. Instead, they try to infer the inner file name more smartly (GH 39465)
- Bug in read_csv() where reading a mixed column of booleans and missing values to a float type results in the missing values becoming 1.0 rather than NaN (GH 42808, GH 34120)
- Bug in to_xml() raising error for pd.NA with extension array dtype (GH 43903)
- Bug in read_csv() when passing simultaneously a parser in date_parser and parse_dates=False, the parsing was still called (GH 44366)
- Bug in read_csv() not setting name of MultiIndex columns correctly when index_col is not the first column (GH 38549)
- Bug in read_csv() silently ignoring errors when failing to create a memory-mapped file (GH 44766)
- Bug in read_csv() when passing a tempfile.SpooledTemporaryFile opened in binary mode (GH 44748)
- Bug in read_json() raising ValueError when attempting to parse json strings containing “://” (GH 36271)
- Bug in read_csv() when engine="c" and encoding_errors=None which caused a segfault (GH 45180)
- Bug in read_csv() where an invalid value of usecols led to an unclosed file handle (GH 45384)
- Fixed memory leak in DataFrame.to_json() (GH 43877)
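Two of the I/O entries above lend themselves to a quick check: a multi-character sep in json_normalize(), and the ValueError raised when "\n" is passed as sep to read_csv(). This is a sketch assuming pandas 1.4 or later; the nested record and the CSV text are made up.

```python
from io import StringIO

import pandas as pd

# A multi-character separator joins nested keys without being
# prefixed to every key.
flat = pd.json_normalize({"a": {"b": 1, "c": 2}}, sep="__")
columns = sorted(flat.columns)

# Using the line terminator as the separator is rejected up front.
try:
    pd.read_csv(StringIO("a\n1\n2\n"), sep="\n")
    sep_error = None
except ValueError as err:
    sep_error = err

print(columns)
print(type(sep_error).__name__)
```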
Period#
- Bug in adding a Period object to a np.timedelta64 object incorrectly raising TypeError (GH 44182)
- Bug in PeriodIndex.to_timestamp() when the index has freq="B" inferring freq="D" for its result instead of freq="B" (GH 44105)
- Bug in Period constructor incorrectly allowing np.timedelta64("NaT") (GH 44507)
- Bug in PeriodIndex.to_timestamp() giving incorrect values for indexes with non-contiguous data (GH 44100)
- Bug in Series.where() with PeriodDtype incorrectly raising when the where call should not replace anything (GH 45135)
Plotting#
- When given non-numeric data, DataFrame.boxplot() now raises a ValueError rather than a cryptic KeyError or ZeroDivisionError, in line with other plotting functions like DataFrame.hist() (GH 43480)
Groupby/resample/rolling#
- Bug in SeriesGroupBy.apply() where passing an unrecognized string argument failed to raise TypeError when the underlying Series is empty (GH 42021)
- Bug in Series.rolling.apply(), DataFrame.rolling.apply(), Series.expanding.apply() and DataFrame.expanding.apply() with engine="numba" where *args were being cached with the user passed function (GH 42287)
- Bug in DataFrameGroupBy.max(), SeriesGroupBy.max(), DataFrameGroupBy.min(), and SeriesGroupBy.min() with nullable integer dtypes losing precision (GH 41743)
- Bug in DataFrame.groupby.rolling.var() would calculate the rolling variance only on the first group (GH 42442)
- Bug in DataFrameGroupBy.shift() and SeriesGroupBy.shift() that would return the grouping columns if fill_value was not None (GH 41556)
- Bug in SeriesGroupBy.nlargest() and SeriesGroupBy.nsmallest() would have an inconsistent index when the input Series was sorted and n was greater than or equal to all group sizes (GH 15272, GH 16345, GH 29129)
- Bug in pandas.DataFrame.ewm(), where non-float64 dtypes were silently failing (GH 42452)
- Bug in pandas.DataFrame.rolling() operation along rows (axis=1) incorrectly omitting columns containing float16 and float32 (GH 41779)
- Bug in Resampler.aggregate() did not allow the use of Named Aggregation (GH 32803)
- Bug in Series.rolling() when the Series dtype was Int64 (GH 43016)
- Bug in DataFrame.rolling.corr() when the DataFrame columns was a MultiIndex (GH 21157)
- Bug in DataFrame.groupby.rolling() when specifying on and calling __getitem__ would subsequently return incorrect results (GH 43355)
- Bug in DataFrameGroupBy.apply() and SeriesGroupBy.apply() with time-based Grouper objects incorrectly raising ValueError in corner cases where the grouping vector contains a NaT (GH 43500, GH 43515)
- Bug in DataFrameGroupBy.mean() and SeriesGroupBy.mean() failing with complex dtype (GH 43701)
- Bug in Series.rolling() and DataFrame.rolling() not calculating window bounds correctly for the first row when center=True and index is decreasing (GH 43927)
- Bug in Series.rolling() and DataFrame.rolling() for centered datetimelike windows with uneven nanosecond (GH 43997)
- Bug in DataFrameGroupBy.mean() and SeriesGroupBy.mean() raising KeyError when column was selected at least twice (GH 44924)
- Bug in DataFrameGroupBy.nth() and SeriesGroupBy.nth() failing on axis=1 (GH 43926)
- Bug in Series.rolling() and DataFrame.rolling() not respecting right bound on centered datetime-like windows, if the index contains duplicates (GH 3944)
- Bug in Series.rolling() and DataFrame.rolling() when using a pandas.api.indexers.BaseIndexer subclass that returned unequal start and end arrays would segfault instead of raising a ValueError (GH 44470)
- Bug in Groupby.nunique() not respecting observed=True for categorical grouping columns (GH 45128)
- Bug in DataFrameGroupBy.head(), SeriesGroupBy.head(), DataFrameGroupBy.tail(), and SeriesGroupBy.tail() not dropping groups with NaN when dropna=True (GH 45089)
- Bug in GroupBy.__iter__() after selecting a subset of columns in a GroupBy object, which returned all columns instead of the chosen subset (GH 44821)
- Bug in Groupby.rolling() when non-monotonic data is passed, failing to correctly raise ValueError (GH 43909)
- Bug where grouping by a Series that has a categorical data type and length unequal to the axis of grouping raised ValueError (GH 44179)
Reshaping#
- Improved error message when creating a DataFrame column from a multi-dimensional numpy.ndarray (GH 42463)
- Bug in concat() creating MultiIndex with duplicate level entries when concatenating a DataFrame with duplicates in Index and multiple keys (GH 42651)
- Bug in pandas.cut() on Series with duplicate indices and non-exact pandas.CategoricalIndex() (GH 42185, GH 42425)
- Bug in DataFrame.append() failing to retain dtypes when appended columns do not match (GH 43392)
- Bug in concat() of bool and boolean dtypes resulting in object dtype instead of boolean dtype (GH 42800)
- Bug in crosstab() when inputs are categorical Series, there are categories that are not present in one or both of the Series, and margins=True. Previously the margin value for missing categories was NaN. It is now correctly reported as 0 (GH 43505)
- Bug in concat() would fail when the objs argument all had the same index and the keys argument contained duplicates (GH 43595)
- Bug in concat() which ignored the sort parameter (GH 43375)
- Bug in merge() with MultiIndex as column index for the on argument returning an error when assigning a column internally (GH 43734)
- Bug in crosstab() would fail when inputs are lists or tuples (GH 44076)
- Bug in DataFrame.append() failing to retain index.name when appending a list of Series objects (GH 44109)
- Fixed metadata propagation in DataFrame.apply() method, consequently fixing the same issue for DataFrame.transform(), DataFrame.nunique() and DataFrame.mode() (GH 28283)
- Bug in concat() casting levels of MultiIndex to float if all levels only consist of missing values (GH 44900)
- Bug in DataFrame.stack() with ExtensionDtype columns incorrectly raising (GH 43561)
- Bug in merge() raising KeyError when joining over differently named indexes with on keywords (GH 45094)
- Bug in Series.unstack() with object doing unwanted type inference on resulting columns (GH 44595)
- Bug in MultiIndex.join() with overlapping IntervalIndex levels (GH 44096)
- Bug in DataFrame.replace() and Series.replace() where the result had a different dtype depending on the regex parameter (GH 44864)
- Bug in DataFrame.pivot() with index=None when the DataFrame index was a MultiIndex (GH 23955)
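The concat() entry for bool and boolean dtypes can be sketched as follows, assuming pandas 1.4 or later; the values are arbitrary.

```python
import pandas as pd

plain = pd.Series([True, False], dtype="bool")        # NumPy bool
nullable = pd.Series([True, pd.NA], dtype="boolean")  # nullable boolean

# The combined result keeps the nullable boolean dtype instead of
# falling back to object.
combined = pd.concat([plain, nullable], ignore_index=True)
print(combined.dtype)
```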
Sparse#
- Bug in DataFrame.sparse.to_coo() raising AttributeError when column names are not unique (GH 29564)
- Bug in SparseArray.max() and SparseArray.min() raising ValueError for arrays with 0 non-null elements (GH 43527)
- Bug in DataFrame.sparse.to_coo() silently converting non-zero fill values to zero (GH 24817)
- Bug in SparseArray comparison methods with an array-like operand of mismatched length raising AssertionError or unclear ValueError depending on the input (GH 43863)
- Bug in SparseArray arithmetic methods floordiv and mod behaviors when dividing by zero not matching the non-sparse Series behavior (GH 38172)
- Bug in SparseArray unary methods as well as SparseArray.isna() not recalculating indexes (GH 44955)
ExtensionArray#
- Bug in array() failing to preserve PandasArray (GH 43887)
- NumPy ufuncs np.abs, np.positive, np.negative now correctly preserve dtype when called on ExtensionArrays that implement __abs__, __pos__, __neg__, respectively. In particular this is fixed for TimedeltaArray (GH 43899, GH 23316)
- NumPy ufuncs np.minimum.reduce, np.maximum.reduce, np.add.reduce, and np.prod.reduce now work correctly instead of raising NotImplementedError on Series with IntegerDtype or FloatDtype (GH 43923, GH 44793)
- NumPy ufuncs with out keyword are now supported by arrays with IntegerDtype and FloatingDtype (GH 45122)
- Avoid raising PerformanceWarning about fragmented DataFrame when using many columns with an extension dtype (GH 44098)
- Bug in IntegerArray and FloatingArray construction incorrectly coercing mismatched NA values (e.g. np.timedelta64("NaT")) to numeric NA (GH 44514)
- Bug in BooleanArray.__eq__() and BooleanArray.__ne__() raising TypeError on comparison with an incompatible type (like a string). This caused DataFrame.replace() to sometimes raise a TypeError if a nullable boolean column was included (GH 44499)
- Bug in array() incorrectly raising when passed a ndarray with float16 dtype (GH 44715)
- Bug in calling np.sqrt on BooleanArray returning a malformed FloatingArray (GH 44715)
- Bug in Series.where() with ExtensionDtype when other is a NA scalar incompatible with the Series dtype (e.g. NaT with a numeric dtype) incorrectly casting to a compatible NA value (GH 44697)
- Bug in Series.replace() where explicitly passing value=None is treated as if no value was passed, and None not being in the result (GH 36984, GH 19998)
- Bug in Series.replace() with unwanted downcasting being done in no-op replacements (GH 44498)
- Bug in Series.replace() with FloatDtype, string[python], or string[pyarrow] dtype not being preserved when possible (GH 33484, GH 40732, GH 31644, GH 41215, GH 25438)
Styler#
- Bug in Styler where the uuid at initialization maintained a floating underscore (GH 43037)
- Bug in Styler.to_html() where the Styler object was updated if the to_html method was called with some args (GH 43034)
- Bug in Styler.copy() where uuid was not previously copied (GH 40675)
- Bug in Styler.apply() where functions which returned Series objects were not correctly handled in terms of aligning their index labels (GH 13657, GH 42014)
- Bug when rendering an empty DataFrame with a named Index (GH 43305)
- Bug when rendering a single level MultiIndex (GH 43383)
- Bug when combining non-sparse rendering and Styler.hide_columns() or Styler.hide_index() (GH 43464)
- Bug setting a table style when using multiple selectors in Styler (GH 44011)
- Bugs where row trimming and column trimming failed to reflect hidden rows (GH 43703, GH 44247)
Other#
- Bug in DataFrame.astype() with non-unique columns and a Series dtype argument (GH 44417)
- Bug in CustomBusinessMonthBegin.__add__() (CustomBusinessMonthEnd.__add__()) not applying the extra offset parameter when beginning (end) of the target month is already a business day (GH 41356)
- Bug in RangeIndex.union() with another RangeIndex with matching (even) step and starts differing by strictly less than step / 2 (GH 44019)
- Bug in RangeIndex.difference() with sort=None and step<0 failing to sort (GH 44085)
- Bug in Series.replace() and DataFrame.replace() with value=None and ExtensionDtypes (GH 44270, GH 37899)
- Bug in FloatingArray.equals() failing to consider two arrays equal if they contain np.nan values (GH 44382)
- Bug in DataFrame.shift() with axis=1 and ExtensionDtype columns incorrectly raising when an incompatible fill_value is passed (GH 44564)
- Bug in DataFrame.shift() with axis=1 and periods larger than len(frame.columns) producing an invalid DataFrame (GH 44978)
- Bug in DataFrame.diff() when passing a NumPy integer object instead of an int object (GH 44572)
- Bug in Series.replace() raising ValueError when using regex=True with a Series containing np.nan values (GH 43344)
- Bug in DataFrame.to_records() where an incorrect n was used when missing names were replaced by level_n (GH 44818)
- Bug in DataFrame.eval() where resolvers argument was overriding the default resolvers (GH 34966)
- Series.__repr__() and DataFrame.__repr__() no longer replace all null-values in indexes with “NaN” but use their real string-representations. “NaN” is used only for float("nan") (GH 45263)
Contributors#
A total of 275 people contributed patches to this release. People with a “+” by their names contributed a patch for the first time.
- Abhishek R
- Albert Villanova del Moral
- Alessandro Bisiani +
- Alex Lim
- Alex-Gregory-1 +
- Alexander Gorodetsky
- Alexander Regueiro +
- Alexey Györi
- Alexis Mignon
- Aleš Erjavec
- Ali McMaster
- Alibi +
- Andrei Batomunkuev +
- Andrew Eckart +
- Andrew Hawyrluk
- Andrew Wood
- Anton Lodder +
- Armin Berres +
- Arushi Sharma +
- Benedikt Heidrich +
- Beni Bienz +
- Benoît Vinot
- Bert Palm +
- Boris Rumyantsev +
- Brian Hulette
- Brock
- Bruno Costa +
- Bryan Racic +
- Caleb Epstein
- Calvin Ho
- ChristofKaufmann +
- Christopher Yeh +
- Chuliang Xiao +
- ClaudiaSilver +
- DSM
- Daniel Coll +
- Daniel Schmidt +
- Dare Adewumi
- David +
- David Sanders +
- David Wales +
- Derzan Chiang +
- DeviousLab +
- Dhruv B Shetty +
- Digres45 +
- Dominik Kutra +
- Drew Levitt +
- DriesS
- EdAbati
- Elle
- Elliot Rampono
- Endre Mark Borza
- Erfan Nariman
- Evgeny Naumov +
- Ewout ter Hoeven +
- Fangchen Li
- Felix Divo
- Felix Dulys +
- Francesco Andreuzzi +
- Francois Dion +
- Frans Larsson +
- Fred Reiss
- GYvan
- Gabriel Di Pardi Arruda +
- Gesa Stupperich
- Giacomo Caria +
- Greg Siano +
- Griffin Ansel
- Hiroaki Ogasawara +
- Horace +
- Horace Lai +
- Irv Lustig
- Isaac Virshup
- JHM Darbyshire (MBP)
- JHM Darbyshire (iMac)
- JHM Darbyshire +
- Jack Liu
- Jacob Skwirsk +
- Jaime Di Cristina +
- James Holcombe +
- Janosh Riebesell +
- Jarrod Millman
- Jason Bian +
- Jeff Reback
- Jernej Makovsek +
- Jim Bradley +
- Joel Gibson +
- Joeperdefloep +
- Johannes Mueller +
- John S Bogaardt +
- John Zangwill +
- Jon Haitz Legarreta Gorroño +
- Jon Wiggins +
- Jonas Haag +
- Joris Van den Bossche
- Josh Friedlander
- José Duarte +
- Julian Fleischer +
- Julien de la Bruère-T
- Justin McOmie
- Kadatatlu Kishore +
- Kaiqi Dong
- Kashif Khan +
- Kavya9986 +
- Kendall +
- Kevin Sheppard
- Kiley Hewitt
- Koen Roelofs +
- Krishna Chivukula
- KrishnaSai2020
- Leonardo Freua +
- Leonardus Chen
- Liang-Chi Hsieh +
- Loic Diridollou +
- Lorenzo Maffioli +
- Luke Manley +
- LunarLanding +
- Marc Garcia
- Marcel Bittar +
- Marcel Gerber +
- Marco Edward Gorelli
- Marco Gorelli
- MarcoGorelli
- Marvin +
- Mateusz Piotrowski +
- Mathias Hauser +
- Matt Richards +
- Matthew Davis +
- Matthew Roeschke
- Matthew Zeitlin
- Matthias Bussonnier
- Matti Picus
- Mauro Silberberg +
- Maxim Ivanov
- Maximilian Carr +
- MeeseeksMachine
- Michael Sarrazin +
- Michael Wang +
- Michał Górny +
- Mike Phung +
- Mike Taves +
- Mohamad Hussein Rkein +
- NJOKU OKECHUKWU VALENTINE +
- Neal McBurnett +
- Nick Anderson +
- Nikita Sobolev +
- Olivier Cavadenti +
- PApostol +
- Pandas Development Team
- Patrick Hoefler
- Peter
- Peter Tillmann +
- Prabha Arivalagan +
- Pradyumna Rahul
- Prerana Chakraborty
- Prithvijit +
- Rahul Gaikwad +
- Ray Bell
- Ricardo Martins +
- Richard Shadrach
- Robbert-jan ‘t Hoen +
- Robert Voyer +
- Robin Raymond +
- Rohan Sharma +
- Rohan Sirohia +
- Roman Yurchak
- Ruan Pretorius +
- Sam James +
- Scott Talbert
- Shashwat Sharma +
- Sheogorath27 +
- Shiv Gupta
- Shoham Debnath
- Simon Hawkins
- Soumya +
- Stan West +
- Stefanie Molin +
- Stefano Alberto Russo +
- Stephan Heßelmann
- Stephen
- Suyash Gupta +
- Sven
- Swanand01 +
- Sylvain Marié +
- TLouf
- Tania Allard +
- Terji Petersen
- TheDerivator +
- Thomas Dickson
- Thomas Kastl +
- Thomas Kluyver
- Thomas Li
- Thomas Smith
- Tim Swast
- Tim Tran +
- Tobias McNulty +
- Tobias Pitters
- Tomoki Nakagawa +
- Tony Hirst +
- Torsten Wörtwein
- V.I. Wood +
- Vaibhav K +
- Valentin Oliver Loftsson +
- Varun Shrivastava +
- Vivek Thazhathattil +
- Vyom Pathak
- Wenjun Si
- William Andrea +
- William Bradley +
- Wojciech Sadowski +
- Yao-Ching Huang +
- Yash Gupta +
- Yiannis Hadjicharalambous +
- Yoshiki Vázquez Baeza
- Yuanhao Geng
- Yury Mikhaylov
- Yvan Gatete +
- Yves Delley +
- Zach Rait
- Zbyszek Królikowski +
- Zero +
- Zheyuan
- Zhiyi Wu +
- aiudirog
- ali sayyah +
- aneesh98 +
- aptalca
- arw2019 +
- attack68
- brendandrury +
- bubblingoak +
- calvinsomething +
- claws +
- deponovo +
- dicristina
- el-g-1 +
- evensure +
- fotino21 +
- fshi01 +
- gfkang +
- github-actions[bot]
- i-aki-y
- jbrockmendel
- jreback
- juliandwain +
- jxb4892 +
- kendall smith +
- lmcindewar +
- lrepiton
- maximilianaccardo +
- michal-gh
- neelmraman
- partev
- phofl +
- pratyushsharan +
- quantumalaviya +
- rafael +
- realead
- rocabrera +
- rosagold
- saehuihwang +
- salomondush +
- shubham11941140 +
- srinivasan +
- stphnlyd
- suoniq
- trevorkask +
- tushushu
- tyuyoshi +
- usersblock +
- vernetya +
- vrserpa +
- willie3838 +
- zeitlinv +
- zhangxiaoxing +