What’s new in 2.0.0 (April 3, 2023)
These are the changes in pandas 2.0.0. See Release notes for a full changelog including other versions of pandas.
Enhancements#
Index can now hold numpy numeric dtypes#
It is now possible to use any numpy numeric dtype in an Index (GH 42717).
Previously it was only possible to use int64, uint64 and float64 dtypes:
In [1]: pd.Index([1, 2, 3], dtype=np.int8)
Out[1]: Int64Index([1, 2, 3], dtype="int64")
In [2]: pd.Index([1, 2, 3], dtype=np.uint16)
Out[2]: UInt64Index([1, 2, 3], dtype="uint64")
In [3]: pd.Index([1, 2, 3], dtype=np.float32)
Out[3]: Float64Index([1.0, 2.0, 3.0], dtype="float64")
Int64Index, UInt64Index and Float64Index were deprecated in pandas version 1.4 and have now been removed. Index should be used directly instead, and it can now take all numpy numeric dtypes, i.e. int8/int16/int32/int64/uint8/uint16/uint32/uint64/float32/float64 dtypes:
In [1]: pd.Index([1, 2, 3], dtype=np.int8)
Out[1]: Index([1, 2, 3], dtype='int8')
In [2]: pd.Index([1, 2, 3], dtype=np.uint16)
Out[2]: Index([1, 2, 3], dtype='uint16')
In [3]: pd.Index([1, 2, 3], dtype=np.float32)
Out[3]: Index([1.0, 2.0, 3.0], dtype='float32')
The ability for Index to hold the numpy numeric dtypes has meant some changes in pandas functionality. In particular, operations that previously were forced to create 64-bit indexes can now create indexes with lower bit sizes, e.g. 32-bit indexes.
Below is a possibly non-exhaustive list of changes:
- Instantiating using a numpy numeric array now follows the dtype of the numpy array. Previously, all indexes created from numpy numeric arrays were forced to 64-bit. Now, for example, Index(np.array([1, 2, 3])) will be int32 on 32-bit systems, where it previously would have been int64 even on 32-bit systems. Instantiating Index using a list of numbers will still return 64-bit dtypes, e.g. Index([1, 2, 3]) will have an int64 dtype, which is the same as previously.
- The various numeric datetime attributes of DatetimeIndex (day, month, year etc.) were previously of dtype int64, while they were int32 for arrays.DatetimeArray. They are now int32 on DatetimeIndex also:
In [4]: idx = pd.date_range(start='1/1/2018', periods=3, freq='ME')
In [5]: idx.array.year
Out[5]: array([2018, 2018, 2018], dtype=int32)
In [6]: idx.year
Out[6]: Index([2018, 2018, 2018], dtype='int32')
- Level dtypes on Indexes from Series.sparse.from_coo() are now of dtype int32, the same as they are on the rows/cols on a scipy sparse matrix. Previously they were of dtype int64.
In [7]: from scipy import sparse
In [8]: A = sparse.coo_matrix(
...: ([3.0, 1.0, 2.0], ([1, 0, 0], [0, 2, 3])), shape=(3, 4)
...: )
...:
In [9]: ser = pd.Series.sparse.from_coo(A)
In [10]: ser.index.dtypes
Out[10]:
level_0 int32
level_1 int32
dtype: object
- Index cannot be instantiated using a float16 dtype. Previously instantiating an Index using dtype float16 resulted in a Float64Index with a float64 dtype. It now raises a NotImplementedError:
In [11]: pd.Index([1, 2, 3], dtype=np.float16)
NotImplementedError Traceback (most recent call last)
Cell In[11], line 1
----> 1 pd.Index([1, 2, 3], dtype=np.float16)
File ~/work/pandas/pandas/pandas/core/indexes/base.py:576, in Index.__new__(cls, data, dtype, copy, name, tupleize_cols)
572 arr = ensure_wrapped_if_datetimelike(arr)
574 klass = cls._dtype_to_subclass(arr.dtype)
--> 576 arr = klass._ensure_array(arr, arr.dtype, copy=False)
577 result = klass._simple_new(arr, name, refs=refs)
578 if dtype is None and is_pandas_object and data_dtype == np.object_:
File ~/work/pandas/pandas/pandas/core/indexes/base.py:601, in Index._ensure_array(cls, data, dtype, copy)
598 raise ValueError("Index data must be 1-dimensional")
599 elif dtype == np.float16:
600 # float16 not supported (no indexing engine)
--> 601 raise NotImplementedError("float16 indexes are not supported")
603 if copy:
604 # asarray_tuplesafe does not always copy underlying data,
605 # so need to make sure that this happens
606 data = data.copy()
NotImplementedError: float16 indexes are not supported
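The Index dtype changes listed above can be checked with a short script. This is a minimal sketch, assuming pandas 2.0 or later and NumPy are installed; the variable names are illustrative only.

```python
import numpy as np
import pandas as pd

# An explicitly requested small dtype is kept instead of being upcast to 64-bit.
small = pd.Index([1, 2, 3], dtype=np.int8)
print(small.dtype)

# A plain list of Python integers still produces a 64-bit index.
from_list = pd.Index([1, 2, 3])
print(from_list.dtype)

# A NumPy array passes its own dtype through to the index.
from_array = pd.Index(np.array([1, 2, 3], dtype=np.uint16))
print(from_array.dtype)

# float16 is rejected rather than silently converted to float64.
float16_rejected = False
try:
    pd.Index([1, 2, 3], dtype=np.float16)
except NotImplementedError as exc:
    float16_rejected = True
    print("float16 rejected:", exc)
```

Code that relied on every numeric index being 64-bit may want to call `.astype("int64")` explicitly where the width matters.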
Argument dtype_backend, to return pyarrow-backed or numpy-backed nullable dtypes#
The following functions gained a new keyword dtype_backend (GH 36712):
- read_csv()
- read_clipboard()
- read_fwf()
- read_excel()
- read_html()
- read_xml()
- read_json()
- read_sql()
- read_sql_query()
- read_sql_table()
- read_parquet()
- read_orc()
- read_feather()
- read_spss()
- to_numeric()
- DataFrame.convert_dtypes()
- Series.convert_dtypes()
When this keyword is set to "numpy_nullable", these functions return a DataFrame that is backed by nullable dtypes.
When this keyword is set to "pyarrow", these functions return pyarrow-backed nullable ArrowDtype DataFrames (GH 48957, GH 49997):
In [12]: import io
In [13]: data = io.StringIO("""a,b,c,d,e,f,g,h,i
   ....:     1,2.5,True,a,,,,,
   ....:     3,4.5,False,b,6,7.5,True,a,
   ....: """)
   ....:
In [14]: df = pd.read_csv(data, dtype_backend="pyarrow")
In [15]: df.dtypes
Out[15]:
a     int64[pyarrow]
b    double[pyarrow]
c      bool[pyarrow]
d    string[pyarrow]
e     int64[pyarrow]
f    double[pyarrow]
g      bool[pyarrow]
h    string[pyarrow]
i      null[pyarrow]
dtype: object
In [16]: data.seek(0)
Out[16]: 0
In [17]: df_pyarrow = pd.read_csv(data, dtype_backend="pyarrow", engine="pyarrow")
In [18]: df_pyarrow.dtypes
Out[18]:
a     int64[pyarrow]
b    double[pyarrow]
c      bool[pyarrow]
d    string[pyarrow]
e     int64[pyarrow]
f    double[pyarrow]
g      bool[pyarrow]
h    string[pyarrow]
i      null[pyarrow]
dtype: object
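The example above uses the "pyarrow" backend. For comparison, here is a minimal sketch of the "numpy_nullable" backend, which does not need pyarrow installed. It assumes pandas 2.0 or later; the CSV content is made up for illustration.

```python
import io

import pandas as pd

# The second row has a missing value in column "a".
csv_text = "a,b,c\n1,2.5,True\n,4.5,False\n"

df = pd.read_csv(io.StringIO(csv_text), dtype_backend="numpy_nullable")

# Columns use the nullable extension dtypes, so the missing integer
# is stored as <NA> without forcing the column to float.
print(df.dtypes)
print(df["a"].isna().tolist())
```

With the default backend, column "a" would be read as float64 because of the missing value.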
Copy-on-Write improvements#
- A new lazy copy mechanism that defers the copy until the object in question is modified was added to the methods listed in Copy-on-Write optimizations. These methods return views when Copy-on-Write is enabled, which provides a significant performance improvement compared to the regular execution (GH 49473).
- Accessing a single column of a DataFrame as a Series (e.g. df["col"]) now always returns a new object every time it is constructed when Copy-on-Write is enabled (not returning multiple times an identical, cached Series object). This ensures that those Series objects correctly follow the Copy-on-Write rules (GH 49450)
- The Series constructor will now create a lazy copy (deferring the copy until a modification to the data happens) when constructing a Series from an existing Series with the default of copy=False (GH 50471)
- The DataFrame constructor will now create a lazy copy (deferring the copy until a modification to the data happens) when constructing from an existing DataFrame with the default of copy=False (GH 51239)
- The DataFrame constructor, when constructing a DataFrame from a dictionary of Series objects and specifying copy=False, will now use a lazy copy of those Series objects for the columns of the DataFrame (GH 50777)
- The DataFrame constructor, when constructing a DataFrame from a Series or Index and specifying copy=False, will now respect Copy-on-Write.
- The DataFrame and Series constructors, when constructing from a NumPy array, will now copy the array by default to avoid mutating the DataFrame / Series when mutating the array. Specify copy=False to get the old behavior. When setting copy=False pandas does not guarantee correct Copy-on-Write behavior when the NumPy array is modified after creation of the DataFrame / Series.
- DataFrame.from_records() will now respect Copy-on-Write when called with a DataFrame.
- Trying to set values using chained assignment (for example, df["a"][1:3] = 0) will now always raise a warning when Copy-on-Write is enabled. In this mode, chained assignment can never work because we are always setting into a temporary object that is the result of an indexing operation (getitem), which under Copy-on-Write always behaves as a copy. Thus, assigning through a chain can never update the original Series or DataFrame. Therefore, an informative warning is raised to the user to avoid silently doing nothing (GH 49467)
- DataFrame.replace() will now respect the Copy-on-Write mechanism when inplace=True.
- DataFrame.transpose() will now respect the Copy-on-Write mechanism.
- Arithmetic operations that can be inplace, e.g. ser *= 2, will now respect the Copy-on-Write mechanism.
- DataFrame.__getitem__() will now respect the Copy-on-Write mechanism when the DataFrame has MultiIndex columns.
- Series.__getitem__() will now respect the Copy-on-Write mechanism when the Series has a MultiIndex.
- Series.view() will now respect the Copy-on-Write mechanism.
Copy-on-Write can be enabled through one of:
pd.set_option("mode.copy_on_write", True)
pd.options.mode.copy_on_write = True
Alternatively, Copy-on-Write can be enabled locally through:
with pd.option_context("mode.copy_on_write", True):
    ...
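To make the effect of the option concrete, the following is a minimal sketch, assuming pandas 2.0 or later. It enables Copy-on-Write locally and shows that modifying a column pulled out of a DataFrame leaves the parent untouched. The frame and its values are illustrative only.

```python
import pandas as pd

with pd.option_context("mode.copy_on_write", True):
    df = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]})

    # df["a"] shares data with df until one of them is modified.
    col = df["a"]

    # Writing into the Series triggers the deferred copy, so only
    # `col` changes and `df` keeps its original values.
    col.iloc[0] = 100

    original_values = df["a"].tolist()
    modified_values = col.tolist()

print(original_values)
print(modified_values)
```

To change the parent under Copy-on-Write, assign through it directly, for example `df.loc[0, "a"] = 100`.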
Other enhancements#
- Added support for str accessor methods when using ArrowDtype with a pyarrow.string type (GH 50325)
- Added support for dt accessor methods when using ArrowDtype with a pyarrow.timestamp type (GH 50954)
- read_sas() now supports using encoding='infer' to correctly read and use the encoding specified by the sas file. (GH 48048)
- DataFrameGroupBy.quantile(), SeriesGroupBy.quantile() and DataFrameGroupBy.std() now preserve nullable dtypes instead of casting to numpy dtypes (GH 37493)
- DataFrameGroupBy.std(), SeriesGroupBy.std() now support datetime64, timedelta64, and DatetimeTZDtype dtypes (GH 48481)
- Series.add_suffix(), DataFrame.add_suffix(), Series.add_prefix() and DataFrame.add_prefix() support an axis argument. If axis is set, the default behaviour of which axis to consider can be overwritten (GH 47819)
- testing.assert_frame_equal() now shows the first element where the DataFrames differ, analogously to pytest’s output (GH 47910)
- Added index parameter to DataFrame.to_dict() (GH 46398)
- Added support for extension array dtypes in merge() (GH 44240)
- Added metadata propagation for binary operators on DataFrame (GH 28283)
- Added cumsum, cumprod, cummin and cummax to the ExtensionArray interface via _accumulate (GH 28385)
- CategoricalConversionWarning, InvalidComparison, InvalidVersion, LossySetitemError, and NoBufferPresent are now exposed in pandas.errors (GH 27656)
- Fix test optional_extra by adding missing test package pytest-asyncio (GH 48361)
- The exception message raised by DataFrame.astype() now includes the column name when type conversion is not possible. (GH 47571)
- date_range() now supports a unit keyword (“s”, “ms”, “us”, or “ns”) to specify the desired resolution of the output index (GH 49106)
- timedelta_range() now supports a unit keyword (“s”, “ms”, “us”, or “ns”) to specify the desired resolution of the output index (GH 49824)
- DataFrame.to_json() now supports a mode keyword with supported inputs 'w' and 'a'. Defaulting to 'w', 'a' can be used when lines=True and orient='records' to append record oriented json lines to an existing json file. (GH 35849)
- Added name parameter to IntervalIndex.from_breaks(), IntervalIndex.from_arrays() and IntervalIndex.from_tuples() (GH 48911)
- Improved exception message when using testing.assert_frame_equal() on a DataFrame to include the column that is compared (GH 50323)
- Improved error message for merge_asof() when join-columns were duplicated (GH 50102)
- Added support for extension array dtypes to get_dummies() (GH 32430)
- Added Index.infer_objects() analogous to Series.infer_objects() (GH 50034)
- Added copy parameter to Series.infer_objects() and DataFrame.infer_objects(), passing False will avoid making copies for series or columns that are already non-object or where no better dtype can be inferred (GH 50096)
- DataFrame.plot.hist() now recognizes xlabel and ylabel arguments (GH 49793)
- Series.drop_duplicates() has gained ignore_index keyword to reset index (GH 48304)
- Series.dropna() and DataFrame.dropna() have gained ignore_index keyword to reset index (GH 31725)
- Improved error message in to_datetime() for non-ISO8601 formats, informing users about the position of the first error (GH 50361)
- Improved error message when trying to align DataFrame objects (for example, in DataFrame.compare()) to clarify that “identically labelled” refers to both index and columns (GH 50083)
- Added support for Index.min() and Index.max() for pyarrow string dtypes (GH 51397)
- Added DatetimeIndex.as_unit() and TimedeltaIndex.as_unit() to convert to different resolutions; supported resolutions are “s”, “ms”, “us”, and “ns” (GH 50616)
- Added Series.dt.unit() and Series.dt.as_unit() to convert to different resolutions; supported resolutions are “s”, “ms”, “us”, and “ns” (GH 51223)
- Added new argument dtype to read_sql() to be consistent with read_sql_query() (GH 50797)
- read_csv(), read_table(), read_fwf() and read_excel() now accept date_format (GH 50601)
- to_datetime() now accepts "ISO8601" as an argument to format, which will match any ISO8601 string (but possibly not identically-formatted) (GH 50411)
- to_datetime() now accepts "mixed" as an argument to format, which will infer the format for each element individually (GH 50972)
- Added new argument engine to read_json() to support parsing JSON with pyarrow by specifying engine="pyarrow" (GH 48893)
- Added support for SQLAlchemy 2.0 (GH 40686)
- Added support for decimal parameter when engine="pyarrow" in read_csv() (GH 51302)
- Index set operations Index.union(), Index.intersection(), Index.difference(), and Index.symmetric_difference() now support sort=True, which will always return a sorted result, unlike the default sort=None which does not sort in some cases (GH 25151)
- Added new escape mode “latex-math” to avoid escaping “$” in formatter (GH 50040)
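A few of the enhancements listed above can be tried together. This is a minimal sketch, assuming pandas 2.0 or later, covering the unit keyword of date_range(), the ignore_index keyword of Series.dropna(), and sort=True in Index.union(). The sample values are illustrative.

```python
import pandas as pd

# date_range() with an explicit resolution for the resulting index.
idx = pd.date_range("2020-01-01", periods=3, freq="D", unit="s")
print(idx.dtype)

# dropna() with ignore_index=True relabels the result 0..n-1.
cleaned = pd.Series([1.0, None, 3.0]).dropna(ignore_index=True)
print(cleaned.index.tolist())

# union() with sort=True always returns a sorted result.
merged = pd.Index([3, 1]).union(pd.Index([2]), sort=True)
print(merged.tolist())
```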
Notable bug fixes#
These are bug fixes that might have notable behavior changes.
DataFrameGroupBy.cumsum() and DataFrameGroupBy.cumprod() overflow instead of lossy casting to float#
In previous versions we cast to float when applying cumsum and cumprod, which led to incorrect results even if the result could be held by the int64 dtype. Additionally, the aggregation now overflows consistently with numpy and the regular DataFrame.cumprod() and DataFrame.cumsum() methods when the limit of int64 is reached (GH 37493).
Old Behavior
In [1]: df = pd.DataFrame({"key": ["b"] * 7, "value": 625})
In [2]: df.groupby("key")["value"].cumprod()[5]
Out[2]: 5.960464477539062e+16
We return incorrect results with the 6th value.
New Behavior
In [19]: df = pd.DataFrame({"key": ["b"] * 7, "value": 625})
In [20]: df.groupby("key")["value"].cumprod()
Out[20]:
0                   625
1                390625
2             244140625
3          152587890625
4        95367431640625
5     59604644775390625
6    359414837200037393
Name: value, dtype: int64
We overflow with the 7th value, but the 6th value is still correct.
DataFrameGroupBy.nth() and SeriesGroupBy.nth() now behave as filtrations#
In previous versions of pandas, DataFrameGroupBy.nth() and SeriesGroupBy.nth() acted as if they were aggregations. However, for most inputs n, they may return either zero or multiple rows per group. This means that they are filtrations, similar to e.g. DataFrameGroupBy.head(). pandas now treats them as filtrations (GH 13666).
In [21]: df = pd.DataFrame({"a": [1, 1, 2, 1, 2], "b": [np.nan, 2.0, 3.0, 4.0, 5.0]})
In [22]: gb = df.groupby("a")
Old Behavior
In [5]: gb.nth(n=1)
Out[5]:
   A    B
1  1  2.0
4  2  5.0
New Behavior
In [23]: gb.nth(n=1)
Out[23]:
   a    b
1  1  2.0
4  2  5.0
In particular, the index of the result is derived from the input by selecting the appropriate rows. Also, when n is larger than the group, no rows are returned instead of NaN.
Old Behavior
In [5]: gb.nth(n=3, dropna="any")
Out[5]:
    B
A
1 NaN
2 NaN
New Behavior
In [24]: gb.nth(n=3, dropna="any")
Out[24]:
Empty DataFrame
Columns: [a, b]
Index: []
Backwards incompatible API changes#
Construction with datetime64 or timedelta64 dtype with unsupported resolution#
In past versions, when constructing a Series or DataFrame and passing a “datetime64” or “timedelta64” dtype with unsupported resolution (i.e. anything other than “ns”), pandas would silently replace the given dtype with its nanosecond analogue:
Previous behavior:
In [5]: pd.Series(["2016-01-01"], dtype="datetime64[s]")
Out[5]:
0   2016-01-01
dtype: datetime64[ns]
In [6]: pd.Series(["2016-01-01"], dtype="datetime64[D]")
Out[6]:
0   2016-01-01
dtype: datetime64[ns]
In pandas 2.0 we support resolutions “s”, “ms”, “us”, and “ns”. When passing a supported dtype (e.g. “datetime64[s]”), the result now has exactly the requested dtype:
New behavior:
In [25]: pd.Series(["2016-01-01"], dtype="datetime64[s]")
Out[25]:
0   2016-01-01
dtype: datetime64[s]
With an unsupported dtype, pandas now raises instead of silently swapping in a supported dtype:
New behavior:
In [26]: pd.Series(["2016-01-01"], dtype="datetime64[D]")
TypeError Traceback (most recent call last)
Cell In[26], line 1
----> 1 pd.Series(["2016-01-01"], dtype="datetime64[D]")
File ~/work/pandas/pandas/pandas/core/series.py:584, in Series.__init__(self, data, index, dtype, name, copy, fastpath)
582 data = data.copy()
583 else:
--> 584 data = sanitize_array(data, index, dtype, copy)
586 manager = _get_option("mode.data_manager", silent=True)
587 if manager == "block":
File ~/work/pandas/pandas/pandas/core/construction.py:651, in sanitize_array(data, index, dtype, copy, allow_2d)
648 subarr = np.array([], dtype=np.float64)
650 elif dtype is not None:
--> 651 subarr = _try_cast(data, dtype, copy)
653 else:
654 subarr = maybe_convert_platform(data)
File ~/work/pandas/pandas/pandas/core/construction.py:811, in _try_cast(arr, dtype, copy)
806 return lib.ensure_string_array(arr, convert_na_value=False, copy=copy).reshape(
807 shape
808 )
810 elif dtype.kind in "mM":
--> 811 return maybe_cast_to_datetime(arr, dtype)
813 # GH#15832: Check if we are requesting a numeric dtype and
814 # that we can convert the data to the requested dtype.
815 elif dtype.kind in "iu":
816 # this will raise if we have e.g. floats
File ~/work/pandas/pandas/pandas/core/dtypes/cast.py:1218, in maybe_cast_to_datetime(value, dtype)
1214 raise TypeError("value must be listlike")
1216 # TODO: _from_sequence would raise ValueError in cases where
1217 # _ensure_nanosecond_dtype raises TypeError
-> 1218 _ensure_nanosecond_dtype(dtype)
1220 if lib.is_np_dtype(dtype, "m"):
1221 res = TimedeltaArray._from_sequence(value, dtype=dtype)
File ~/work/pandas/pandas/pandas/core/dtypes/cast.py:1275, in _ensure_nanosecond_dtype(dtype)
1272 raise ValueError(msg)
1273 # TODO: ValueError or TypeError? existing test
1274 # test_constructor_generic_timestamp_bad_frequency expects TypeError
-> 1275 raise TypeError(
1276 f"dtype={dtype} is not supported. Supported resolutions are 's', "
1277 "'ms', 'us', and 'ns'"
1278 )
TypeError: dtype=datetime64[D] is not supported. Supported resolutions are 's', 'ms', 'us', and 'ns'
Value counts sets the resulting name to count#
In past versions, when running Series.value_counts(), the result would inherit the original object’s name, and the result index would be nameless. This would cause confusion when resetting the index, and the column names would not correspond with the column values. Now, the result name will be 'count' (or 'proportion' if normalize=True was passed), and the index will be named after the original object (GH 49497).
Previous behavior:
In [8]: pd.Series(['quetzal', 'quetzal', 'elk'], name='animal').value_counts()
Out[8]:
quetzal    2
elk        1
Name: animal, dtype: int64
New behavior:
In [27]: pd.Series(['quetzal', 'quetzal', 'elk'], name='animal').value_counts()
Out[27]:
animal
quetzal    2
elk        1
Name: count, dtype: int64
Likewise for other value_counts methods (for example, DataFrame.value_counts()).
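The practical benefit shows up when resetting the index, which is the case the text above describes as confusing. A minimal sketch, assuming pandas 2.0 or later:

```python
import pandas as pd

animals = pd.Series(["quetzal", "quetzal", "elk"], name="animal")

counts = animals.value_counts()

# After reset_index(), the column holding the values is named after the
# original Series and the column holding the tallies is named "count".
table = counts.reset_index()
print(table)

# With normalize=True the result is named "proportion" instead.
proportions = animals.value_counts(normalize=True)
print(proportions.name)
```

Code that looked up the counts column by the original Series name (here "animal") needs to use "count" instead.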
Disallow astype conversion to non-supported datetime64/timedelta64 dtypes#
In previous versions, converting a Series or DataFrame from datetime64[ns] to a different datetime64[X] dtype would return with datetime64[ns] dtype instead of the requested dtype. In pandas 2.0, support is added for “datetime64[s]”, “datetime64[ms]”, and “datetime64[us]” dtypes, so converting to those dtypes gives exactly the requested dtype:
In [28]: idx = pd.date_range("2016-01-01", periods=3)
In [29]: ser = pd.Series(idx)
Previous behavior:
In [4]: ser.astype("datetime64[s]")
Out[4]:
0   2016-01-01
1   2016-01-02
2   2016-01-03
dtype: datetime64[ns]
With the new behavior, we get exactly the requested dtype:
New behavior:
In [30]: ser.astype("datetime64[s]")
Out[30]:
0   2016-01-01
1   2016-01-02
2   2016-01-03
dtype: datetime64[s]
For non-supported resolutions e.g. “datetime64[D]”, we raise instead of silently ignoring the requested dtype:
New behavior:
In [31]: ser.astype("datetime64[D]")
TypeError Traceback (most recent call last)
Cell In[31], line 1
----> 1 ser.astype("datetime64[D]")
File ~/work/pandas/pandas/pandas/core/generic.py:6643, in NDFrame.astype(self, dtype, copy, errors)
6637 results = [
6638 ser.astype(dtype, copy=copy, errors=errors) for _, ser in self.items()
6639 ]
6641 else:
6642 # else, only a single dtype is given
-> 6643 new_data = self._mgr.astype(dtype=dtype, copy=copy, errors=errors)
6644 res = self._constructor_from_mgr(new_data, axes=new_data.axes)
6645 return res.__finalize__(self, method="astype")
File ~/work/pandas/pandas/pandas/core/internals/managers.py:430, in BaseBlockManager.astype(self, dtype, copy, errors)
427 elif using_copy_on_write():
428 copy = False
--> 430 return self.apply(
431 "astype",
432 dtype=dtype,
433 copy=copy,
434 errors=errors,
435 using_cow=using_copy_on_write(),
436 )
File ~/work/pandas/pandas/pandas/core/internals/managers.py:363, in BaseBlockManager.apply(self, f, align_keys, **kwargs)
361 applied = b.apply(f, **kwargs)
362 else:
--> 363 applied = getattr(b, f)(**kwargs)
364 result_blocks = extend_blocks(applied, result_blocks)
366 out = type(self).from_blocks(result_blocks, self.axes)
File ~/work/pandas/pandas/pandas/core/internals/blocks.py:758, in Block.astype(self, dtype, copy, errors, using_cow, squeeze)
755 raise ValueError("Can not squeeze with more than one column.")
756 values = values[0, :] # type: ignore[call-overload]
--> 758 new_values = astype_array_safe(values, dtype, copy=copy, errors=errors)
760 new_values = maybe_coerce_values(new_values)
762 refs = None
File ~/work/pandas/pandas/pandas/core/dtypes/astype.py:237, in astype_array_safe(values, dtype, copy, errors)
234 dtype = dtype.numpy_dtype
236 try:
--> 237 new_values = astype_array(values, dtype, copy=copy)
238 except (ValueError, TypeError):
239 # e.g. _astype_nansafe can fail on object-dtype of strings
240 # trying to convert to float
241 if errors == "ignore":
File ~/work/pandas/pandas/pandas/core/dtypes/astype.py:179, in astype_array(values, dtype, copy)
175 return values
177 if not isinstance(values, np.ndarray):
178 # i.e. ExtensionArray
--> 179 values = values.astype(dtype, copy=copy)
181 else:
182 values = _astype_nansafe(values, dtype, copy=copy)
File ~/work/pandas/pandas/pandas/core/arrays/datetimes.py:739, in DatetimeArray.astype(self, dtype, copy)
737 elif isinstance(dtype, PeriodDtype):
738 return self.to_period(freq=dtype.freq)
--> 739 return dtl.DatetimeLikeArrayMixin.astype(self, dtype, copy)
File ~/work/pandas/pandas/pandas/core/arrays/datetimelike.py:494, in DatetimeLikeArrayMixin.astype(self, dtype, copy)
490 elif (dtype.kind in "mM" and self.dtype != dtype) or dtype.kind == "f":
491 # disallow conversion between datetime/timedelta,
492 # and conversions for any datetimelike to float
493 msg = f"Cannot cast {type(self).__name__} to dtype {dtype}"
--> 494 raise TypeError(msg)
495 else:
496 return np.asarray(self, dtype=dtype)
TypeError: Cannot cast DatetimeArray to dtype datetime64[D]
For conversion from timedelta64[ns] dtypes, the old behavior converted to a floating point format.
In [32]: idx = pd.timedelta_range("1 Day", periods=3)
In [33]: ser = pd.Series(idx)
Previous behavior:
In [7]: ser.astype("timedelta64[s]")
Out[7]:
0     86400.0
1    172800.0
2    259200.0
dtype: float64
In [8]: ser.astype("timedelta64[D]")
Out[8]:
0    1.0
1    2.0
2    3.0
dtype: float64
The new behavior, as for datetime64, either gives exactly the requested dtype or raises:
New behavior:
In [34]: ser.astype("timedelta64[s]")
Out[34]:
0   1 days
1   2 days
2   3 days
dtype: timedelta64[s]
In [35]: ser.astype("timedelta64[D]")
ValueError Traceback (most recent call last)
Cell In[35], line 1
----> 1 ser.astype("timedelta64[D]")
File ~/work/pandas/pandas/pandas/core/generic.py:6643, in NDFrame.astype(self, dtype, copy, errors)
6637 results = [
6638 ser.astype(dtype, copy=copy, errors=errors) for _, ser in self.items()
6639 ]
6641 else:
6642 # else, only a single dtype is given
-> 6643 new_data = self._mgr.astype(dtype=dtype, copy=copy, errors=errors)
6644 res = self._constructor_from_mgr(new_data, axes=new_data.axes)
6645 return res.__finalize__(self, method="astype")
File ~/work/pandas/pandas/pandas/core/internals/managers.py:430, in BaseBlockManager.astype(self, dtype, copy, errors)
427 elif using_copy_on_write():
428 copy = False
--> 430 return self.apply(
431 "astype",
432 dtype=dtype,
433 copy=copy,
434 errors=errors,
435 using_cow=using_copy_on_write(),
436 )
File ~/work/pandas/pandas/pandas/core/internals/managers.py:363, in BaseBlockManager.apply(self, f, align_keys, **kwargs)
361 applied = b.apply(f, **kwargs)
362 else:
--> 363 applied = getattr(b, f)(**kwargs)
364 result_blocks = extend_blocks(applied, result_blocks)
366 out = type(self).from_blocks(result_blocks, self.axes)
File ~/work/pandas/pandas/pandas/core/internals/blocks.py:758, in Block.astype(self, dtype, copy, errors, using_cow, squeeze)
755 raise ValueError("Can not squeeze with more than one column.")
756 values = values[0, :] # type: ignore[call-overload]
--> 758 new_values = astype_array_safe(values, dtype, copy=copy, errors=errors)
760 new_values = maybe_coerce_values(new_values)
762 refs = None
File ~/work/pandas/pandas/pandas/core/dtypes/astype.py:237, in astype_array_safe(values, dtype, copy, errors)
234 dtype = dtype.numpy_dtype
236 try:
--> 237 new_values = astype_array(values, dtype, copy=copy)
238 except (ValueError, TypeError):
239 # e.g. _astype_nansafe can fail on object-dtype of strings
240 # trying to convert to float
241 if errors == "ignore":
File ~/work/pandas/pandas/pandas/core/dtypes/astype.py:179, in astype_array(values, dtype, copy)
175 return values
177 if not isinstance(values, np.ndarray):
178 # i.e. ExtensionArray
--> 179 values = values.astype(dtype, copy=copy)
181 else:
182 values = _astype_nansafe(values, dtype, copy=copy)
File ~/work/pandas/pandas/pandas/core/arrays/timedeltas.py:358, in TimedeltaArray.astype(self, dtype, copy)
354 return type(self)._simple_new(
355 res_values, dtype=res_values.dtype, freq=self.freq
356 )
357 else:
--> 358 raise ValueError(
359 f"Cannot convert from {self.dtype} to {dtype}. "
360 "Supported resolutions are 's', 'ms', 'us', 'ns'"
361 )
363 return dtl.DatetimeLikeArrayMixin.astype(self, dtype, copy=copy)
ValueError: Cannot convert from timedelta64[ns] to timedelta64[D]. Supported resolutions are 's', 'ms', 'us', 'ns'
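Code that must run against data of unknown resolution can guard these conversions. The sketch below assumes pandas 2.0 or later. It casts to a supported resolution and records which exception an unsupported one raises; the `errors` dictionary is only an illustration device.

```python
import pandas as pd

dates = pd.Series(pd.date_range("2016-01-01", periods=3))
deltas = pd.Series(pd.timedelta_range("1 Day", periods=3))

# Supported resolutions give exactly the requested dtype.
dates_s = dates.astype("datetime64[s]")
deltas_s = deltas.astype("timedelta64[s]")
print(dates_s.dtype, deltas_s.dtype)

# Unsupported resolutions such as "D" raise instead of being ignored.
errors = {}
for label, ser, dtype in [
    ("datetime", dates, "datetime64[D]"),
    ("timedelta", deltas, "timedelta64[D]"),
]:
    try:
        ser.astype(dtype)
    except (TypeError, ValueError) as exc:
        errors[label] = type(exc).__name__

print(errors)
```

Catching both TypeError and ValueError covers the two tracebacks shown above.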
UTC and fixed-offset timezones default to standard-library tzinfo objects#
In previous versions, the default tzinfo object used to represent UTC was pytz.UTC. In pandas 2.0, we default to datetime.timezone.utc instead. Similarly, for timezones that represent fixed UTC offsets, we use datetime.timezone objects instead of pytz.FixedOffset objects (GH 34916).
Previous behavior:
In [2]: ts = pd.Timestamp("2016-01-01", tz="UTC")
In [3]: type(ts.tzinfo)
Out[3]: pytz.UTC
In [4]: ts2 = pd.Timestamp("2016-01-01 04:05:06-07:00")
In [5]: type(ts2.tzinfo)
Out[5]: pytz._FixedOffset
New behavior:
In [36]: ts = pd.Timestamp("2016-01-01", tz="UTC")
In [37]: type(ts.tzinfo)
Out[37]: datetime.timezone
In [38]: ts2 = pd.Timestamp("2016-01-01 04:05:06-07:00")
In [39]: type(ts2.tzinfo)
Out[39]: datetime.timezone
For timezones that are neither UTC nor fixed offsets, e.g. “US/Pacific”, we continue to default to pytz objects.
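Code that compared tzinfo objects against pytz types may need adjusting. A minimal sketch, assuming pandas 2.0 or later, that checks the timezone by its offset rather than by its concrete class:

```python
import datetime

import pandas as pd

ts_utc = pd.Timestamp("2016-01-01", tz="UTC")
ts_offset = pd.Timestamp("2016-01-01 04:05:06-07:00")

# Both tzinfo objects are now standard-library datetime.timezone instances.
print(type(ts_utc.tzinfo), type(ts_offset.tzinfo))
is_stdlib_utc = isinstance(ts_utc.tzinfo, datetime.timezone)

# Comparing UTC offsets works regardless of which tzinfo class is used.
offset = ts_offset.utcoffset()
print(offset)
```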
Empty DataFrames/Series will now default to have a RangeIndex#
Before, constructing an empty (where data is None or an empty list-like argument) Series or DataFrame without specifying the axes (index=None, columns=None) would return the axes as empty Index with object dtype.
Now, the axes return an empty RangeIndex (GH 49572).
Previous behavior:
In [8]: pd.Series().index
Out[8]: Index([], dtype='object')
In [9]: pd.DataFrame().axes
Out[9]: [Index([], dtype='object'), Index([], dtype='object')]
New behavior:
In [40]: pd.Series().index
Out[40]: RangeIndex(start=0, stop=0, step=1)
In [41]: pd.DataFrame().axes
Out[41]: [RangeIndex(start=0, stop=0, step=1), RangeIndex(start=0, stop=0, step=1)]
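If downstream code depends on the earlier object-dtype axes, an explicit index can be passed. This is a minimal sketch, assuming pandas 2.0 or later; the name `legacy_like` is illustrative only.

```python
import pandas as pd

empty_ser = pd.Series(dtype="float64")
empty_df = pd.DataFrame()

# Without explicit axes, empty objects now get a RangeIndex.
print(type(empty_ser.index).__name__)
print([type(ax).__name__ for ax in empty_df.axes])

# Passing an explicit empty object-dtype Index reproduces the earlier axes.
legacy_like = pd.Series(dtype="float64", index=pd.Index([], dtype=object))
print(legacy_like.index.dtype)
```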
DataFrame to LaTeX has a new render engine#
The existing DataFrame.to_latex() has been restructured to utilise the extended implementation previously available under Styler.to_latex(). The arguments signature is similar, albeit col_space has been removed since it is ignored by LaTeX engines. This render engine also requires jinja2 as a dependency which needs to be installed, since rendering is based upon jinja2 templates.
The pandas latex options below are no longer used and have been removed. The generic max rows and columns arguments remain but for this functionality should be replaced by the Styler equivalents. The alternative options giving similar functionality are indicated below:
- display.latex.escape: replaced with styler.format.escape,
- display.latex.longtable: replaced with styler.latex.environment,
- display.latex.multicolumn, display.latex.multicolumn_format and display.latex.multirow: replaced with styler.sparse.rows, styler.sparse.columns, styler.latex.multirow_align and styler.latex.multicol_align,
- display.latex.repr: replaced with styler.render.repr,
- display.max_rows and display.max_columns: replaced with styler.render.max_rows, styler.render.max_columns and styler.render.max_elements.
Note that due to this change some defaults have also changed:
- multirow now defaults to True.
- multirow_align defaults to “r” instead of “l”.
- multicol_align defaults to “r” instead of “l”.
- escape now defaults to False.
Note that the behaviour of _repr_latex_ has also changed. Previously, setting display.latex.repr would generate LaTeX only when using nbconvert for a Jupyter Notebook, and not when the user is running the notebook. Now the styler.render.repr option allows control of the specific output within Jupyter Notebooks for operations (not just on nbconvert). See GH 39911.
Increased minimum versions for dependencies#
Some minimum supported versions of dependencies were updated. If installed, we now require:
Package | Minimum Version | Required | Changed |
---|---|---|---|
mypy (dev) | 1.0 | X | |
pytest (dev) | 7.0.0 | X | |
pytest-xdist (dev) | 2.2.0 | X | |
hypothesis (dev) | 6.34.2 | X | |
python-dateutil | 2.8.2 | X | X |
tzdata | 2022.1 | X | X |
For optional libraries the general recommendation is to use the latest version. The following table lists the lowest version per library that is currently being tested throughout the development of pandas. Optional libraries below the lowest tested version may still work, but are not considered supported.
Package | Minimum Version | Changed |
---|---|---|
pyarrow | 7.0.0 | X |
matplotlib | 3.6.1 | X |
fastparquet | 0.6.3 | X |
xarray | 0.21.0 | X |
See Dependencies and Optional dependencies for more.
Datetimes are now parsed with a consistent format#
In the past, to_datetime() guessed the format for each element independently. This was appropriate for some cases where elements had mixed date formats - however, it would regularly cause problems when users expected a consistent format but the function would switch formats between elements. As of version 2.0.0, parsing will use a consistent format, determined by the first non-NA value (unless the user specifies a format, in which case that is used).
Old behavior:
In [1]: ser = pd.Series(['13-01-2000', '12-01-2000'])
In [2]: pd.to_datetime(ser)
Out[2]:
0   2000-01-13
1   2000-12-01
dtype: datetime64[ns]
New behavior:
In [42]: ser = pd.Series(['13-01-2000', '12-01-2000'])
In [43]: pd.to_datetime(ser)
Out[43]:
0   2000-01-13
1   2000-01-12
dtype: datetime64[ns]
Note that this affects read_csv() as well.
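For read_csv(), one way to avoid relying on format inference is to state the format explicitly. The sketch below is a minimal example that assumes a pandas 2.x installation; the CSV text and the column names are made up for illustration, and date_format is the keyword that this release introduces as the replacement for date_parser.

```python
import io

import pandas as pd

# Hypothetical CSV content with day-first dates.
csv_text = "when,value\n13-01-2000,1\n12-01-2000,2\n"

# date_format is applied consistently to every element of the parsed column.
df = pd.read_csv(
    io.StringIO(csv_text),
    parse_dates=["when"],
    date_format="%d-%m-%Y",
)

print(df["when"].dt.day.tolist())
print(df["when"].dt.month.tolist())
```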
If you still need to parse dates with inconsistent formats, you can use format='mixed' (possibly alongside dayfirst):
ser = pd.Series(['13-01-2000', '12 January 2000'])
pd.to_datetime(ser, format='mixed', dayfirst=True)
or, if your formats are all ISO8601 (but possibly not identically-formatted)
ser = pd.Series(['2020-01-01', '2020-01-01 03:00'])
pd.to_datetime(ser, format='ISO8601')
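To make the consistent-format rule concrete, the following sketch contrasts the default behaviour with format='mixed' on the same input. It assumes pandas 2.x; the variable names are illustrative, and errors="coerce" is used only so that the mismatching element shows up as NaT instead of raising.

```python
import pandas as pd

values = ["13-01-2000", "12 January 2000"]

# The first element determines the format (day-month-year here), so the
# second element does not match it and is coerced to NaT.
strict = pd.to_datetime(values, dayfirst=True, errors="coerce")

# format="mixed" opts back in to inferring the format element by element.
mixed = pd.to_datetime(values, format="mixed", dayfirst=True)

print(strict.isna().tolist())
print(mixed.day.tolist())
```

Without errors="coerce", the first call would raise instead, which is usually the safer default when a single format is expected.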
Other API changes#
- The freq, tz, nanosecond, and unit keywords in the Timestamp constructor are now keyword-only (GH 45307, GH 32526)
- Passing nanoseconds greater than 999 or less than 0 in Timestamp now raises a ValueError (GH 48538, GH 48255)
- read_csv(): specifying an incorrect number of columns with index_col now raises ParserError instead of IndexError when using the c parser.
- Default value of dtype in get_dummies() is changed to bool from uint8 (GH 45848)
- DataFrame.astype(), Series.astype(), and DatetimeIndex.astype() casting datetime64 data to any of “datetime64[s]”, “datetime64[ms]”, “datetime64[us]” will return an object with the given resolution instead of coercing back to “datetime64[ns]” (GH 48928)
- DataFrame.astype(), Series.astype(), and DatetimeIndex.astype() casting timedelta64 data to any of “timedelta64[s]”, “timedelta64[ms]”, “timedelta64[us]” will return an object with the given resolution instead of coercing to “float64” dtype (GH 48963)
- DatetimeIndex.astype(), TimedeltaIndex.astype(), PeriodIndex.astype(), Series.astype(), DataFrame.astype() with datetime64, timedelta64 or PeriodDtype dtypes no longer allow converting to integer dtypes other than “int64”, do obj.astype('int64', copy=False).astype(dtype) instead (GH 49715)
- Index.astype() now allows casting from float64 dtype to datetime-like dtypes, matching Series behavior (GH 49660)
- Passing data with dtype of “timedelta64[s]”, “timedelta64[ms]”, or “timedelta64[us]” to TimedeltaIndex, Series, or DataFrame constructors will now retain that dtype instead of casting to “timedelta64[ns]”; timedelta64 data with lower resolution will be cast to the lowest supported resolution “timedelta64[s]” (GH 49014)
- Passing dtype of “timedelta64[s]”, “timedelta64[ms]”, or “timedelta64[us]” to TimedeltaIndex, Series, or DataFrame constructors will now retain that dtype instead of casting to “timedelta64[ns]”; passing a dtype with lower resolution for Series or DataFrame will be cast to the lowest supported resolution “timedelta64[s]” (GH 49014)
- Passing a np.datetime64 object with non-nanosecond resolution to Timestamp will retain the input resolution if it is “s”, “ms”, “us”, or “ns”; otherwise it will be cast to the closest supported resolution (GH 49008)
- Passing datetime64 values with resolution other than nanosecond to to_datetime() will retain the input resolution if it is “s”, “ms”, “us”, or “ns”; otherwise it will be cast to the closest supported resolution (GH 50369)
- Passing integer values and a non-nanosecond datetime64 dtype (e.g. “datetime64[s]”) to DataFrame, Series, or Index will treat the values as multiples of the dtype’s unit, matching the behavior of e.g. Series(np.array(values, dtype="M8[s]")) (GH 51092)
- Passing a string in ISO-8601 format to Timestamp will retain the resolution of the parsed input if it is “s”, “ms”, “us”, or “ns”; otherwise it will be cast to the closest supported resolution (GH 49737)
- The other argument in DataFrame.mask() and Series.mask() now defaults to no_default instead of np.nan, consistent with DataFrame.where() and Series.where(). Entries will be filled with the corresponding NULL value (np.nan for numpy dtypes, pd.NA for extension dtypes). (GH 49111)
- Changed behavior of Series.quantile() and DataFrame.quantile() with SparseDtype to retain sparse dtype (GH 49583)
- When creating a Series with an object-dtype Index of datetime objects, pandas no longer silently converts the index to a DatetimeIndex (GH 39307, GH 23598)
- pandas.testing.assert_index_equal() with parameter exact="equiv" now considers two indexes equal when both are either a RangeIndex or Index with an int64 dtype. Previously it meant either a RangeIndex or an Int64Index (GH 51098)
- Series.unique() with dtype “timedelta64[ns]” or “datetime64[ns]” now returns TimedeltaArray or DatetimeArray instead of numpy.ndarray (GH 49176)
- to_datetime() and DatetimeIndex now allow sequences containing both datetime objects and numeric entries, matching Series behavior (GH 49037, GH 50453)
- pandas.api.types.is_string_dtype() now only returns True for array-likes with dtype=object when the elements are inferred to be strings (GH 15585)
- Passing a sequence containing datetime objects and date objects to Series constructor will return with object dtype instead of datetime64[ns] dtype, consistent with Index behavior (GH 49341)
- Passing strings that cannot be parsed as datetimes to Series or DataFrame with dtype="datetime64[ns]" will raise instead of silently ignoring the keyword and returning object dtype (GH 24435)
- Passing a sequence containing a type that cannot be converted to Timedelta to to_timedelta() or to the Series or DataFrame constructor with dtype="timedelta64[ns]" or to TimedeltaIndex now raises TypeError instead of ValueError (GH 49525)
- Changed behavior of Index constructor with sequence containing at least one NaT and everything else either None or NaN to infer datetime64[ns] dtype instead of object, matching Series behavior (GH 49340)
- read_stata() with parameter index_col set to None (the default) will now set the index on the returned DataFrame to a RangeIndex instead of an Int64Index (GH 49745)
- Changed behavior of Index, Series, and DataFrame arithmetic methods when working with object-dtypes; the results no longer do type inference on the result of the array operations, use result.infer_objects(copy=False) to do type inference on the result (GH 49999, GH 49714)
- Changed behavior of Index constructor with an object-dtype numpy.ndarray containing all-bool values or all-complex values; this will now retain object dtype, consistent with the Series behavior (GH 49594)
- Changed behavior of Series.astype() from object-dtype containing bytes objects to string dtypes; this now does val.decode() on bytes objects instead of str(val), matching Index.astype() behavior (GH 45326)
- Added "None" to default na_values in read_csv() (GH 50286)
- Changed behavior of Series and DataFrame constructors when given an integer dtype and floating-point data that is not round numbers; this now raises ValueError instead of silently retaining the float dtype; do Series(data) or DataFrame(data) to get the old behavior, and Series(data).astype(dtype) or DataFrame(data).astype(dtype) to get the specified dtype (GH 49599)
- Changed behavior of DataFrame.shift() with axis=1, an integer fill_value, and homogeneous datetime-like dtype; this now fills new columns with integer dtypes instead of casting to datetimelike (GH 49842)
- Files are now closed when encountering an exception in read_json() (GH 49921)
- Changed behavior of read_csv(), read_json() & read_fwf(), where the index will now always be a RangeIndex, when no index is specified. Previously the index would be an Index with dtype object if the new DataFrame/Series has length 0 (GH 49572)
- DataFrame.values, DataFrame.to_numpy(), DataFrame.xs(), DataFrame.reindex(), DataFrame.fillna(), and DataFrame.replace() no longer silently consolidate the underlying arrays; do df = df.copy() to ensure consolidation (GH 49356)
- Creating a new DataFrame using a full slice on both axes with loc or iloc (thus, df.loc[:, :] or df.iloc[:, :]) now returns a new DataFrame (shallow copy) instead of the original DataFrame, consistent with other methods to get a full slice (for example df.loc[:] or df[:]) (GH 49469)
- The Series and DataFrame constructors will now return a shallow copy (i.e. share data, but not attributes) when passed a Series and DataFrame, respectively, and with the default of copy=False (and if no other keyword triggers a copy). Previously, the new Series or DataFrame would share the index attribute (e.g. df.index = ... would also update the index of the parent or child) (GH 49523)
- Disallow computing cumprod for Timedelta object; previously this returned incorrect values (GH 50246)
- DataFrame objects read from a HDFStore file without an index now have a RangeIndex instead of an int64 index (GH 51076)
- Instantiating an Index with a numeric numpy dtype with data containing NA and/or NaT now raises a ValueError. Previously a TypeError was raised (GH 51050)
- Loading a JSON file with duplicate columns using read_json(orient='split') renames columns to avoid duplicates, as read_csv() and the other readers do (GH 50370)
- The levels of the index of the Series returned from Series.sparse.from_coo now always have dtype int32. Previously they had dtype int64 (GH 50926)
- to_datetime() with unit of either “Y” or “M” will now raise if a sequence contains a non-round float value, matching the Timestamp behavior (GH 50301)
- The methods Series.round(), DataFrame.__invert__(), Series.__invert__(), DataFrame.swapaxes(), DataFrame.first(), DataFrame.last(), Series.first(), Series.last() and DataFrame.align() will now always return new objects (GH 51032)
- DataFrame and DataFrameGroupBy aggregations (e.g. “sum”) with object-dtype columns no longer infer non-object dtypes for their results; explicitly call result.infer_objects(copy=False) on the result to obtain the old behavior (GH 51205, GH 49603)
- Division by zero with ArrowDtype dtypes returns -inf, nan, or inf depending on the numerator, instead of raising (GH 51541)
- Added pandas.api.types.is_any_real_numeric_dtype() to check for real numeric dtypes (GH 51152)
- value_counts() now returns data with ArrowDtype with pyarrow.int64 type instead of "Int64" type (GH 51462)
- factorize() and unique() preserve the original dtype when passed numpy timedelta64 or datetime64 with non-nanosecond resolution (GH 48670)
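A few of the API changes listed above can be observed directly. The sketch below is illustrative only and assumes pandas 2.x with NumPy installed; all variable names are made up for the example.

```python
import numpy as np
import pandas as pd

# get_dummies() now produces bool columns rather than uint8.
dummies = pd.get_dummies(pd.Series(["a", "b", "a"]))

# astype() keeps the requested second resolution instead of returning to ns.
seconds = pd.Series(pd.to_datetime(["2020-01-01", "2020-01-02"])).astype(
    "datetime64[s]"
)

# Timestamp retains the resolution of a non-nanosecond np.datetime64.
stamp = pd.Timestamp(np.datetime64("2020-01-01T00:00:00", "s"))

# An integer dtype combined with non-round floating-point data now raises.
try:
    pd.Series([1.5, 2.0], dtype="int64")
    raised = False
except ValueError:
    raised = True

# mask() fills with pd.NA for extension dtypes when `other` is not given.
masked = pd.Series([1, 2, 3], dtype="Int64").mask([True, False, False])

print(dummies.dtypes.tolist(), seconds.dtype, stamp.unit, raised)
print(masked.isna().tolist())
```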
Note
A current PDEP proposes the deprecation and removal of the keywords inplace and copy for all but a small subset of methods from the pandas API. The discussion is currently taking place here. The keywords won’t be necessary anymore in the context of Copy-on-Write. If this proposal is accepted, both keywords would be deprecated in the next release of pandas and removed in pandas 3.0.
Deprecations#
- Deprecated parsing datetime strings with system-local timezone to tzlocal, pass a tz keyword or explicitly call tz_localize instead (GH 50791)
- Deprecated argument infer_datetime_format in to_datetime() and read_csv(), as a strict version of it is now the default (GH 48621)
- Deprecated behavior of to_datetime() with unit when parsing strings; in a future version these will be parsed as datetimes (matching unit-less behavior) instead of cast to floats. To retain the old behavior, cast strings to numeric types before calling to_datetime() (GH 50735)
- Deprecated pandas.io.sql.execute() (GH 50185)
- Index.is_boolean() has been deprecated. Use pandas.api.types.is_bool_dtype() instead (GH 50042)
- Index.is_integer() has been deprecated. Use pandas.api.types.is_integer_dtype() instead (GH 50042)
- Index.is_floating() has been deprecated. Use pandas.api.types.is_float_dtype() instead (GH 50042)
- Index.holds_integer() has been deprecated. Use pandas.api.types.infer_dtype() instead (GH 50243)
- Index.is_numeric() has been deprecated. Use pandas.api.types.is_any_real_numeric_dtype() instead (GH 50042, GH 51152)
- Index.is_categorical() has been deprecated. Use pandas.api.types.is_categorical_dtype() instead (GH 50042)
- Index.is_object() has been deprecated. Use pandas.api.types.is_object_dtype() instead (GH 50042)
- Index.is_interval() has been deprecated. Use pandas.api.types.is_interval_dtype() instead (GH 50042)
- Deprecated argument date_parser in read_csv(), read_table(), read_fwf(), and read_excel() in favour of date_format (GH 50601)
- Deprecated all and any reductions with datetime64 and DatetimeTZDtype dtypes, use e.g. (obj != pd.Timestamp(0, tz=obj.tz)).all() instead (GH 34479)
- Deprecated unused arguments *args and **kwargs in Resampler (GH 50977)
- Deprecated calling float or int on a single element Series to return a float or int respectively. Extract the element before calling float or int instead (GH 51101)
- Deprecated Grouper.groups(), use Groupby.groups() instead (GH 51182)
- Deprecated Grouper.grouper(), use Groupby.grouper() instead (GH 51182)
- Deprecated Grouper.obj(), use Groupby.obj() instead (GH 51206)
- Deprecated Grouper.indexer(), use Resampler.indexer() instead (GH 51206)
- Deprecated Grouper.ax(), use Resampler.ax() instead (GH 51206)
- Deprecated keyword use_nullable_dtypes in read_parquet(), use dtype_backend instead (GH 51853)
- Deprecated Series.pad() in favor of Series.ffill() (GH 33396)
- Deprecated Series.backfill() in favor of Series.bfill() (GH 33396)
- Deprecated DataFrame.pad() in favor of DataFrame.ffill() (GH 33396)
- Deprecated DataFrame.backfill() in favor of DataFrame.bfill() (GH 33396)
- Deprecated close(). Use StataReader as a context manager instead (GH 49228)
- Deprecated producing a scalar when iterating over a DataFrameGroupBy or a SeriesGroupBy that has been grouped by a level parameter that is a list of length 1; a tuple of length one will be returned instead (GH 51583)
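Several of the deprecations above come with a suggested replacement. The sketch below shows a handful of them side by side; it assumes pandas 2.x, and the data and variable names are purely illustrative.

```python
import pandas as pd
from pandas.api.types import is_bool_dtype, is_integer_dtype

idx = pd.Index([1, 2, 3])

# Instead of idx.is_integer() / idx.is_boolean():
index_is_integer = is_integer_dtype(idx)
index_is_boolean = is_bool_dtype(idx)

# Instead of ser.pad() / ser.backfill():
ser = pd.Series([1.0, None, 3.0])
forward_filled = ser.ffill()
backward_filled = ser.bfill()

# Instead of float(single), extract the element first:
single = pd.Series([2.5])
value = float(single.iloc[0])

# Instead of dates.all() on timezone-aware datetime data:
dates = pd.Series(pd.to_datetime(["2020-01-01", "2021-06-01"]).tz_localize("UTC"))
all_nonzero = (dates != pd.Timestamp(0, tz="UTC")).all()

print(index_is_integer, index_is_boolean, value, all_nonzero)
print(forward_filled.tolist(), backward_filled.tolist())
```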
Removal of prior version deprecations/changes#
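Many entries in this section name a direct replacement for the removed API. As a rough migration aid, the sketch below exercises a few of those replacements (concat(), items, isocalendar().week, get_indexer() and is_monotonic_increasing). It assumes pandas 2.x; the data and variable names are illustrative.

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2]})
extra = pd.DataFrame({"a": [3]})

# Instead of df.append(extra):
combined = pd.concat([df, extra], ignore_index=True)

# Instead of ser.iteritems():
pairs = list(pd.Series([10, 20], index=["x", "y"]).items())

# Instead of the .dt.week / .dt.weekofyear attributes:
weeks = pd.Series(pd.to_datetime(["2023-04-03"])).dt.isocalendar().week

# Instead of idx.get_loc(19, method="nearest", tolerance=2):
idx = pd.Index([10, 20, 30])
positions = idx.get_indexer([19], method="nearest", tolerance=2)

# Instead of idx.is_monotonic:
increasing = idx.is_monotonic_increasing

print(combined["a"].tolist(), pairs, int(weeks.iloc[0]))
print(positions.tolist(), increasing)
```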
- Removed Int64Index, UInt64Index and Float64Index. See also here for more information (GH 42717)
- Removed deprecated Timestamp.freq, Timestamp.freqstr and argument freq from the Timestamp constructor and Timestamp.fromordinal() (GH 14146)
- Removed deprecated CategoricalBlock, Block.is_categorical(), require datetime64 and timedelta64 values to be wrapped in DatetimeArray or TimedeltaArray before passing to Block.make_block_same_class(), require DatetimeTZBlock.values to have the correct ndim when passing to the BlockManager constructor, and removed the “fastpath” keyword from the SingleBlockManager constructor (GH 40226, GH 40571)
- Removed deprecated global option use_inf_as_null in favor of use_inf_as_na (GH 17126)
- Removed deprecated module pandas.core.index (GH 30193)
- Removed deprecated alias pandas.core.tools.datetimes.to_time, import the function directly from pandas.core.tools.times instead (GH 34145)
- Removed deprecated alias pandas.io.json.json_normalize, import the function directly from pandas.json_normalize instead (GH 27615)
- Removed deprecated Categorical.to_dense(), use np.asarray(cat) instead (GH 32639)
- Removed deprecated Categorical.take_nd() (GH 27745)
- Removed deprecated Categorical.mode(), use Series(cat).mode() instead (GH 45033)
- Removed deprecated Categorical.is_dtype_equal() and CategoricalIndex.is_dtype_equal() (GH 37545)
- Removed deprecated CategoricalIndex.take_nd() (GH 30702)
- Removed deprecated Index.is_type_compatible() (GH 42113)
- Removed deprecated Index.is_mixed(), check index.inferred_type directly instead (GH 32922)
- Removed deprecated pandas.api.types.is_categorical(); use pandas.api.types.is_categorical_dtype() instead (GH 33385)
- Removed deprecated Index.asi8() (GH 37877)
- Enforced deprecation changing behavior when passing datetime64[ns] dtype data and timezone-aware dtype to Series, interpreting the values as wall-times instead of UTC times, matching DatetimeIndex behavior (GH 41662)
- Enforced deprecation changing behavior when applying a numpy ufunc on multiple non-aligned (on the index or columns) DataFrame that will now align the inputs first (GH 39239)
- Removed deprecated DataFrame._AXIS_NUMBERS(), DataFrame._AXIS_NAMES(), Series._AXIS_NUMBERS(), Series._AXIS_NAMES() (GH 33637)
- Removed deprecated Index.to_native_types(), use obj.astype(str) instead (GH 36418)
- Removed deprecated Series.iteritems(), DataFrame.iteritems(), use obj.items instead (GH 45321)
- Removed deprecated DataFrame.lookup() (GH 35224)
- Removed deprecated Series.append(), DataFrame.append(), use concat() instead (GH 35407)
- Removed deprecated Series.iteritems(), DataFrame.iteritems() and HDFStore.iteritems(), use obj.items instead (GH 45321)
- Removed deprecated DatetimeIndex.union_many() (GH 45018)
- Removed deprecated weekofyear and week attributes of DatetimeArray, DatetimeIndex and dt accessor in favor of isocalendar().week (GH 33595)
- Removed deprecated RangeIndex._start(), RangeIndex._stop(), RangeIndex._step(), use start, stop, step instead (GH 30482)
- Removed deprecated DatetimeIndex.to_perioddelta(), use dtindex - dtindex.to_period(freq).to_timestamp() instead (GH 34853)
- Removed deprecated Styler.hide_index() and Styler.hide_columns() (GH 49397)
- Removed deprecated Styler.set_na_rep() and Styler.set_precision() (GH 49397)
- Removed deprecated Styler.where() (GH 49397)
- Removed deprecated Styler.render() (GH 49397)
- Removed deprecated argument col_space in DataFrame.to_latex() (GH 47970)
- Removed deprecated argument null_color in Styler.highlight_null() (GH 49397)
- Removed deprecated argument check_less_precise in testing.assert_frame_equal(), testing.assert_extension_array_equal(), testing.assert_series_equal(), testing.assert_index_equal() (GH 30562)
- Removed deprecated null_counts argument in DataFrame.info(). Use show_counts instead (GH 37999)
- Removed deprecated Index.is_monotonic() and Series.is_monotonic(); use obj.is_monotonic_increasing instead (GH 45422)
- Removed deprecated Index.is_all_dates() (GH 36697)
- Enforced deprecation disallowing passing a timezone-aware Timestamp and dtype="datetime64[ns]" to Series or DataFrame constructors (GH 41555)
- Enforced deprecation disallowing passing a sequence of timezone-aware values and dtype="datetime64[ns]" to Series or DataFrame constructors (GH 41555)
- Enforced deprecation disallowing numpy.ma.mrecords.MaskedRecords in the DataFrame constructor; pass {name: data[name] for name in data.dtype.names} instead (GH 40363)
- Enforced deprecation disallowing unit-less “datetime64” dtype in Series.astype() and DataFrame.astype() (GH 47844)
- Enforced deprecation disallowing using .astype to convert a datetime64[ns] Series, DataFrame, or DatetimeIndex to timezone-aware dtype, use obj.tz_localize or ser.dt.tz_localize instead (GH 39258)
- Enforced deprecation disallowing using .astype to convert a timezone-aware Series, DataFrame, or DatetimeIndex to timezone-naive datetime64[ns] dtype, use obj.tz_localize(None) or obj.tz_convert("UTC").tz_localize(None) instead (GH 39258)
- Enforced deprecation disallowing passing non boolean argument to sort in concat() (GH 44629)
- Removed Date parser functions parse_date_time(), parse_date_fields(), parse_all_fields() and generic_parser() (GH 24518)
- Removed argument index from the core.arrays.SparseArray constructor (GH 43523)
- Removed argument squeeze from DataFrame.groupby() and Series.groupby() (GH 32380)
- Removed deprecated apply, apply_index, __call__, onOffset, and isAnchored attributes from DateOffset (GH 34171)
- Removed keep_tz argument in DatetimeIndex.to_series() (GH 29731)
- Removed arguments names and dtype from Index.copy() and levels and codes from MultiIndex.copy() (GH 35853, GH 36685)
- Removed argument inplace from MultiIndex.set_levels() and MultiIndex.set_codes() (GH 35626)
- Removed arguments verbose and encoding from DataFrame.to_excel() and Series.to_excel() (GH 47912)
- Removed argument line_terminator from DataFrame.to_csv() and Series.to_csv(), use lineterminator instead (GH 45302)
- Removed argument inplace from DataFrame.set_axis() and Series.set_axis(), use obj = obj.set_axis(..., copy=False) instead (GH 48130)
- Disallow passing positional arguments to MultiIndex.set_levels() and MultiIndex.set_codes() (GH 41485)
- Disallow parsing to Timedelta strings with components with units “Y”, “y”, or “M”, as these do not represent unambiguous durations (GH 36838)
- Removed MultiIndex.is_lexsorted() and MultiIndex.lexsort_depth() (GH 38701)
- Removed argument how from PeriodIndex.astype(), use PeriodIndex.to_timestamp() instead (GH 37982)
- Removed argument try_cast from DataFrame.mask(), DataFrame.where(), Series.mask() and Series.where() (GH 38836)
- Removed argument tz from Period.to_timestamp(), use obj.to_timestamp(...).tz_localize(tz) instead (GH 34522)
- Removed argument sort_columns in DataFrame.plot() and Series.plot() (GH 47563)
- Removed argument is_copy from DataFrame.take() and Series.take() (GH 30615)
- Removed argument kind from Index.get_slice_bound(), Index.slice_indexer() and Index.slice_locs() (GH 41378)
- Removed arguments prefix, squeeze, error_bad_lines and warn_bad_lines from read_csv() (GH 40413, GH 43427)
- Removed argument squeeze from read_excel() (GH 43427)
- Removed argument datetime_is_numeric from DataFrame.describe() and Series.describe() as datetime data will always be summarized as numeric data (GH 34798)
- Disallow passing list key to Series.xs() and DataFrame.xs(), pass a tuple instead (GH 41789)
- Disallow subclass-specific keywords (e.g. “freq”, “tz”, “names”, “closed”) in the Index constructor (GH 38597)
- Removed argument inplace from Categorical.remove_unused_categories() (GH 37918)
- Disallow passing non-round floats to Timestamp with unit="M" or unit="Y" (GH 47266)
- Removed keywords convert_float and mangle_dupe_cols from read_excel() (GH 41176)
- Removed keyword mangle_dupe_cols from read_csv() and read_table() (GH 48137)
- Removed errors keyword from DataFrame.where(), Series.where(), DataFrame.mask() and Series.mask() (GH 47728)
- Disallow passing non-keyword arguments to read_excel() except io and sheet_name (GH 34418)
- Disallow passing non-keyword arguments to DataFrame.drop() and Series.drop() except labels (GH 41486)
- Disallow passing non-keyword arguments to DataFrame.fillna() and Series.fillna() except value (GH 41485)
- Disallow passing non-keyword arguments to StringMethods.split() and StringMethods.rsplit() except for pat (GH 47448)
- Disallow passing non-keyword arguments to DataFrame.set_index() except keys (GH 41495)
- Disallow passing non-keyword arguments to Resampler.interpolate() except method (GH 41699)
- Disallow passing non-keyword arguments to DataFrame.reset_index() and Series.reset_index() except level (GH 41496)
- Disallow passing non-keyword arguments to DataFrame.dropna() and Series.dropna() (GH 41504)
- Disallow passing non-keyword arguments to ExtensionArray.argsort() (GH 46134)
- Disallow passing non-keyword arguments to Categorical.sort_values() (GH 47618)
- Disallow passing non-keyword arguments to Index.drop_duplicates() and Series.drop_duplicates() (GH 41485)
- Disallow passing non-keyword arguments to DataFrame.drop_duplicates() except for subset (GH 41485)
- Disallow passing non-keyword arguments to DataFrame.sort_index() and Series.sort_index() (GH 41506)
- Disallow passing non-keyword arguments to DataFrame.interpolate() and Series.interpolate() except for method (GH 41510)
- Disallow passing non-keyword arguments to DataFrame.any() and Series.any() (GH 44896)
- Disallow passing non-keyword arguments to Index.set_names() except for names (GH 41551)
- Disallow passing non-keyword arguments to Index.join() except for other (GH 46518)
- Disallow passing non-keyword arguments to concat() except for objs (GH 41485)
- Disallow passing non-keyword arguments to pivot() except for data (GH 48301)
- Disallow passing non-keyword arguments to DataFrame.pivot() (GH 48301)
- Disallow passing non-keyword arguments to read_html() except for io (GH 27573)
- Disallow passing non-keyword arguments to read_json() except for path_or_buf (GH 27573)
- Disallow passing non-keyword arguments to read_sas() except for filepath_or_buffer (GH 47154)
- Disallow passing non-keyword arguments to read_stata() except for filepath_or_buffer (GH 48128)
- Disallow passing non-keyword arguments to read_csv() except filepath_or_buffer (GH 41485)
- Disallow passing non-keyword arguments to read_table() except filepath_or_buffer (GH 41485)
- Disallow passing non-keyword arguments to read_fwf() except filepath_or_buffer (GH 44710)
- Disallow passing non-keyword arguments to read_xml() except for path_or_buffer (GH 45133)
- Disallow passing non-keyword arguments to Series.mask() and DataFrame.mask() except cond and other (GH 41580)
- Disallow passing non-keyword arguments to DataFrame.to_stata() except for path (GH 48128)
- Disallow passing non-keyword arguments to DataFrame.where() and Series.where() except for cond and other (GH 41523)
- Disallow passing non-keyword arguments to Series.set_axis() and DataFrame.set_axis() except for labels (GH 41491)
- Disallow passing non-keyword arguments to Series.rename_axis() and DataFrame.rename_axis() except for mapper (GH 47587)
- Disallow passing non-keyword arguments to Series.clip() and DataFrame.clip() except lower and upper (GH 41511)
- Disallow passing non-keyword arguments to Series.bfill(), Series.ffill(), DataFrame.bfill() and DataFrame.ffill() (GH 41508)
- Disallow passing non-keyword arguments to DataFrame.replace(), Series.replace() except for to_replace and value (GH 47587)
- Disallow passing non-keyword arguments to DataFrame.sort_values() except for by (GH 41505)
- Disallow passing non-keyword arguments to Series.sort_values() (GH 41505)
- Disallow passing non-keyword arguments to DataFrame.reindex() except for labels (GH 17966)
- Disallow Index.reindex() with non-unique Index objects (GH 42568)
- Disallowed constructing Categorical with scalar data (GH 38433)
- Disallowed constructing CategoricalIndex without passing data (GH 38944)
- Removed Rolling.validate(), Expanding.validate(), and ExponentialMovingWindow.validate() (GH 43665)
- Removed Rolling.win_type returning "freq" (GH 38963)
- Removed Rolling.is_datetimelike (GH 38963)
- Removed the level keyword in DataFrame and Series aggregations; use groupby instead (GH 39983)
- Removed deprecated Timedelta.delta(), Timedelta.is_populated(), and Timedelta.freq (GH 46430, GH 46476)
- Removed deprecated NaT.freq (GH 45071)
- Removed deprecated Categorical.replace(), use Series.replace() instead (GH 44929)
- Removed the numeric_only keyword from Categorical.min() and Categorical.max() in favor of skipna (GH 48821)
- Changed behavior of DataFrame.median() and DataFrame.mean() with numeric_only=None to not exclude datetime-like columns (this note will become irrelevant once the numeric_only=None deprecation is enforced) (GH 29941)
- Removed is_extension_type() in favor of is_extension_array_dtype() (GH 29457)
- Removed ExponentialMovingWindow.vol (GH 39220)
- Removed Index.get_value() and Index.set_value() (GH 33907, GH 28621)
- Removed Series.slice_shift() and DataFrame.slice_shift() (GH 37601)
- Removed DataFrameGroupBy.pad() and DataFrameGroupBy.backfill() (GH 45076)
- Removed numpy argument from read_json() (GH 30636)
- Disallow passing abbreviations for orient in DataFrame.to_dict() (GH 32516)
- Disallow partial slicing on a non-monotonic DatetimeIndex with keys which are not in Index. This now raises a KeyError (GH 18531)
- Removed get_offset in favor of to_offset() (GH 30340)
- Removed the warn keyword in infer_freq() (GH 45947)
- Removed the include_start and include_end arguments in DataFrame.between_time() in favor of inclusive (GH 43248)
- Removed the closed argument in date_range() and bdate_range() in favor of inclusive argument (GH 40245)
- Removed the center keyword in DataFrame.expanding() (GH 20647)
- Removed the truediv keyword from eval() (GH 29812)
- Removed the method and tolerance arguments in Index.get_loc(). Use index.get_indexer([label], method=..., tolerance=...) instead (GH 42269)
- Removed the pandas.datetime submodule (GH 30489)
- Removed the pandas.np submodule (GH 30296)
- Removed pandas.util.testing in favor of pandas.testing (GH 30745)
- Removed Series.str.__iter__() (GH 28277)
- Removed pandas.SparseArray in favor of arrays.SparseArray (GH 30642)
- Removed pandas.SparseSeries and pandas.SparseDataFrame, including pickle support. (GH 30642)
- Enforced disallowing passing an integer fill_value to DataFrame.shift() and Series.shift() with datetime64, timedelta64, or period dtypes (GH 32591)
- Enforced disallowing a string column label into times in DataFrame.ewm() (GH 43265)
- Enforced disallowing passing True and False into inclusive in Series.between() in favor of "both" and "neither" respectively (GH 40628)
- Enforced disallowing using usecols with out of bounds indices for read_csv with engine="c" (GH 25623)
- Enforced disallowing the use of **kwargs in ExcelWriter; use the keyword argument engine_kwargs instead (GH 40430)
- Enforced disallowing a tuple of column labels into DataFrameGroupBy.__getitem__() (GH 30546)
- Enforced disallowing missing labels when indexing with a sequence of labels on a level of a MultiIndex. This now raises a KeyError (GH 42351)
- Enforced disallowing setting values with .loc using a positional slice. Use .loc with labels or .iloc with positions instead (GH 31840)
- Enforced disallowing positional indexing with a float key even if that key is a round number, manually cast to integer instead (GH 34193)
- Enforced disallowing using a DataFrame indexer with .iloc, use .loc instead for automatic alignment (GH 39022)
- Enforced disallowing set or dict indexers in __getitem__ and __setitem__ methods (GH 42825)
- Enforced disallowing indexing on an Index or positional indexing on a Series producing multi-dimensional objects e.g. obj[:, None], convert to numpy before indexing instead (GH 35141)
- Enforced disallowing dict or set objects in suffixes in merge() (GH 34810)
- Enforced disallowing merge() to produce duplicated columns through the suffixes keyword and already existing columns (GH 22818)
- Enforced disallowing using merge() or join() on a different number of levels (GH 34862)
- Enforced disallowing value_name argument in DataFrame.melt() to match an element in the DataFrame columns (GH 35003)
- Enforced disallowing passing showindex into **kwargs in DataFrame.to_markdown() and Series.to_markdown() in favor of index (GH 33091)
- Removed setting Categorical._codes directly (GH 41429)
- Removed setting Categorical.categories directly (GH 47834)
- Removed argument inplace from Categorical.add_categories(), Categorical.remove_categories(), Categorical.set_categories(), Categorical.rename_categories(), Categorical.reorder_categories(), Categorical.set_ordered(), Categorical.as_ordered(), Categorical.as_unordered() (GH 37981, GH 41118, GH 41133, GH 47834)
- Enforced Rolling.count() with min_periods=None to default to the size of the window (GH 31302)
- Renamed fname to path in DataFrame.to_parquet(), DataFrame.to_stata() and DataFrame.to_feather() (GH 30338)
- Enforced disallowing indexing a Series with a single item list with a slice (e.g. ser[[slice(0, 2)]]). Either convert the list to tuple, or pass the slice directly instead (GH 31333)
- Changed behavior of indexing on a DataFrame with a DatetimeIndex index using a string indexer; previously this operated as a slice on rows, now it operates like any other column key; use frame.loc[key] for the old behavior (GH 36179)
- Enforced the display.max_colwidth option to not accept negative integers (GH 31569)
- Removed the display.column_space option in favor of df.to_string(col_space=...) (GH 47280)
- Removed the deprecated method mad from pandas classes (GH 11787)
- Removed the deprecated method tshift from pandas classes (GH 11631)
- Changed behavior of empty data passed into Series; the default dtype will be object instead of float64 (GH 29405)
- Changed the behavior of DatetimeIndex.union(), DatetimeIndex.intersection(), and DatetimeIndex.symmetric_difference() with mismatched timezones to convert to UTC instead of casting to object dtype (GH 39328)
- Changed the behavior of to_datetime() with argument “now” with utc=False to match Timestamp("now") (GH 18705)
- Changed the behavior of indexing on a timezone-aware DatetimeIndex with a timezone-naive datetime object or vice-versa; these now behave like any other non-comparable type by raising KeyError (GH 36148)
- Changed the behavior of Index.reindex(), Series.reindex(), and DataFrame.reindex() with a datetime64 dtype and a datetime.date object for fill_value; these are no longer considered equivalent to datetime.datetime objects so the reindex casts to object dtype (GH 39767)
- Changed behavior of SparseArray.astype() when given a dtype that is not explicitly SparseDtype, cast to the exact requested dtype rather than silently using a SparseDtype instead (GH 34457)
- Changed behavior of Index.ravel() to return a view on the original Index instead of a np.ndarray (GH 36900)
- Changed behavior of Series.to_frame() and Index.to_frame() with explicit name=None to use None for the column name instead of the index’s name or default 0 (GH 45523)
- Changed behavior of concat() with one array of bool-dtype and another of integer dtype, this now returns object dtype instead of integer dtype; explicitly cast the bool object to integer before concatenating to get the old behavior (GH 45101)
- Changed behavior of DataFrame constructor given floating-point data and an integer dtype, when the data cannot be cast losslessly, the floating point dtype is retained, matching Series behavior (GH 41170)
- Changed behavior of Index constructor when given a np.ndarray with object-dtype containing numeric entries; this now retains object dtype rather than inferring a numeric dtype, consistent with Series behavior (GH 42870)
- Changed behavior of Index.__and__(), Index.__or__() and Index.__xor__() to behave as logical operations (matching Series behavior) instead of aliases for set operations (GH 37374)
- Changed behavior of DataFrame constructor when passed a list whose first element is a Categorical, this now treats the elements as rows casting to object dtype, consistent with behavior for other types (GH 38845)
- Changed behavior of DataFrame constructor when passed a dtype (other than int) that the data cannot be cast to; it now raises instead of silently ignoring the dtype (GH 41733)
- Changed the behavior of Series constructor, it will no longer infer a datetime64 or timedelta64 dtype from string entries (GH 41731)
- Changed behavior of Timestamp constructor with a np.datetime64 object and a tz passed to interpret the input as a wall-time as opposed to a UTC time (GH 42288)
- Changed behavior of Timestamp.utcfromtimestamp() to return a timezone-aware object satisfying Timestamp.utcfromtimestamp(val).timestamp() == val (GH 45083)
- Changed behavior of Index constructor when passed a SparseArray or SparseDtype to retain that dtype instead of casting to numpy.ndarray (GH 43930)
- Changed behavior of setitem-like operations (__setitem__, fillna, where, mask, replace, insert, fill_value for shift) on an object with DatetimeTZDtype when using a value with a non-matching timezone, the value will be cast to the object’s timezone instead of casting both to object-dtype (GH 44243)
- Changed behavior of Index, Series, DataFrame constructors with floating-dtype data and a DatetimeTZDtype, the data are now interpreted as UTC-times instead of wall-times, consistent with how integer-dtype data are treated (GH 45573)
- Changed behavior of Series and DataFrame constructors with integer dtype and floating-point data containing NaN, this now raises IntCastingNaNError (GH 40110)
- Changed behavior of Series and DataFrame constructors with an integer dtype and values that are too large to losslessly cast to this dtype, this now raises ValueError (GH 41734)
- Changed behavior of Series and DataFrame constructors with an integer dtype and values having either datetime64 or timedelta64 dtypes, this now raises TypeError, use values.view("int64") instead (GH 41770)
- Removed the deprecated base and loffset arguments from pandas.DataFrame.resample(), pandas.Series.resample() and pandas.Grouper. Use offset or origin instead (GH 31809)
- Changed behavior of Series.fillna() and DataFrame.fillna() with timedelta64[ns] dtype and an incompatible fill_value; this now casts to object dtype instead of raising, consistent with the behavior with other dtypes (GH 45746)
- Changed the default argument of regex for Series.str.replace() from True to False. Additionally, a single character pat with regex=True is now treated as a regular expression instead of a string literal. (GH 36695, GH 24804)
- Changed behavior of DataFrame.any() and DataFrame.all() with bool_only=True; object-dtype columns with all-bool values will no longer be included, manually cast to bool dtype first (GH 46188)
- Changed behavior of DataFrame.max(), DataFrame.min, DataFrame.mean, DataFrame.median, DataFrame.skew, DataFrame.kurt with axis=None to return a scalar applying the aggregation across both axes (GH 45072)
- Changed behavior of comparison of a Timestamp with a datetime.date object; these now compare as un-equal and raise on inequality comparisons, matching the datetime.datetime behavior (GH 36131)
- Changed behavior of comparison of NaT with a datetime.date object; these now raise on inequality comparisons (GH 39196)
- Enforced deprecation of silently dropping columns that raised a TypeError in Series.transform and DataFrame.transform when used with a list or dictionary (GH 43740)
- Changed behavior of DataFrame.apply() with list-like so that any partial failure will raise an error (GH 43740)
- Changed behaviour of DataFrame.to_latex() to now use the Styler implementation via Styler.to_latex() (GH 47970)
- Changed behavior of Series.__setitem__() with an integer key and a Float64Index when the key is not present in the index; previously we treated the key as positional (behaving like series.iloc[key] = val), now we treat it as a label (behaving like series.loc[key] = val), consistent with Series.__getitem__() behavior (GH 33469)
- Removed na_sentinel argument from factorize(), Index.factorize(), and ExtensionArray.factorize() (GH 47157)
- Changed behavior of Series.diff() and DataFrame.diff() with ExtensionDtype dtypes whose arrays do not implement diff, these now raise TypeError rather than casting to numpy (GH 31025)
- Enforced deprecation of calling numpy “ufunc”s on DataFrame with method="outer"; this now raises NotImplementedError (GH 36955)
- Enforced deprecation disallowing passing numeric_only=True to Series reductions (rank, any, all, …) with non-numeric dtype (GH 47500)
- Changed behavior of DataFrameGroupBy.apply() and SeriesGroupBy.apply() so that group_keys is respected even if a transformer is detected (GH 34998)
- Comparisons between a DataFrame and a Series where the frame’s columns do not match the series’s index raise ValueError instead of automatically aligning; do left, right = left.align(right, axis=1, copy=False) before comparing (GH 36795)
- Enforced deprecation of numeric_only=None (the default) in DataFrame reductions that would silently drop columns that raised; numeric_only now defaults to False (GH 41480)
- Changed default of numeric_only to False in all DataFrame methods with that argument (GH 46096, GH 46906)
- Changed default of numeric_only to False in Series.rank() (GH 47561)
- Enforced deprecation of silently dropping nuisance columns in groupby and resample operations when numeric_only=False (GH 41475)
- Enforced deprecation of silently dropping nuisance columns in Rolling, Expanding, and ExponentialMovingWindow ops. This will now raise an errors.DataError (GH 42834)
- Changed behavior in setting values with df.loc[:, foo] = bar or df.iloc[:, foo] = bar, these now always attempt to set values inplace before falling back to casting (GH 45333)
- Changed default of numeric_only in various DataFrameGroupBy methods; all methods now default to numeric_only=False (GH 46072)
- Changed default of numeric_only to False in Resampler methods (GH 47177)
- Using the method DataFrameGroupBy.transform() with a callable that returns DataFrames will align to the input’s index (GH 47244)
- When providing a list of columns of length one to DataFrame.groupby(), the keys that are returned by iterating over the resulting DataFrameGroupBy object will now be tuples of length one (GH 47761)
- Removed deprecated methods ExcelWriter.write_cells(), ExcelWriter.save(), ExcelWriter.cur_sheet(), ExcelWriter.handles(), ExcelWriter.path() (GH 45795)
- The ExcelWriter attribute book can no longer be set; it is still available to be accessed and mutated (GH 48943)
- Removed unused *args and **kwargs in Rolling, Expanding, and ExponentialMovingWindow ops (GH 47851)
- Removed the deprecated argument line_terminator from DataFrame.to_csv() (GH 45302)
- Removed the deprecated argument label from lreshape() (GH 30219)
- Arguments after expr in DataFrame.eval() and DataFrame.query() are keyword-only (GH 47587)
- Removed Index._get_attributes_dict() (GH 50648)
- Removed Series.__array_wrap__() (GH 50648)
- Changed behavior of DataFrame.value_counts() to return a Series with MultiIndex for any list-like (one element or not) but an Index for a single label (GH 50829)
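A few of the behavior changes in this list are easy to check interactively. The following is a minimal sketch, assuming pandas 2.0 or later is installed; the Series s, the frame df and its column names are made up for illustration and do not come from the release notes.

```python
import pandas as pd

# Series.str.replace() now treats the pattern as a literal string by default.
s = pd.Series(["a.b", "c.d"])
literal = s.str.replace(".", "-")                # regex=False is the default
as_regex = s.str.replace(".", "-", regex=True)   # "." matches any character

# Reductions called with axis=None aggregate over both axes and return a scalar.
df = pd.DataFrame({"a": [1, 2], "b": [3, 4]})
overall_max = df.max(axis=None)

# Grouping by a list of length one yields keys that are tuples of length one.
keys = [key for key, _ in df.groupby(["a"])]

print(literal.tolist(), as_regex.tolist(), overall_max, keys)
```

Code written against pandas 1.x that relied on the regex default or on scalar group keys would need regex=True or key[0] respectively.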
Performance improvements#
- Performance improvement in DataFrameGroupBy.median() and SeriesGroupBy.median() and DataFrameGroupBy.cumprod() for nullable dtypes (GH 37493)
- Performance improvement in DataFrameGroupBy.all(), DataFrameGroupBy.any(), SeriesGroupBy.all(), and SeriesGroupBy.any() for object dtype (GH 50623)
- Performance improvement in MultiIndex.argsort() and MultiIndex.sort_values() (GH 48406)
- Performance improvement in MultiIndex.size() (GH 48723)
- Performance improvement in MultiIndex.union() without missing values and without duplicates (GH 48505, GH 48752)
- Performance improvement in MultiIndex.difference() (GH 48606)
- Performance improvement in MultiIndex set operations with sort=None (GH 49010)
- Performance improvement in DataFrameGroupBy.mean(), SeriesGroupBy.mean(), DataFrameGroupBy.var(), and SeriesGroupBy.var() for extension array dtypes (GH 37493)
- Performance improvement in MultiIndex.isin() when level=None (GH 48622, GH 49577)
- Performance improvement in MultiIndex.putmask() (GH 49830)
- Performance improvement in Index.union() and MultiIndex.union() when index contains duplicates (GH 48900)
- Performance improvement in Series.rank() for pyarrow-backed dtypes (GH 50264)
- Performance improvement in Series.searchsorted() for pyarrow-backed dtypes (GH 50447)
- Performance improvement in Series.fillna() for extension array dtypes (GH 49722, GH 50078)
- Performance improvement in Index.join(), Index.intersection() and Index.union() for masked and arrow dtypes when Index is monotonic (GH 50310, GH 51365)
- Performance improvement for Series.value_counts() with nullable dtype (GH 48338)
- Performance improvement for Series constructor passing integer numpy array with nullable dtype (GH 48338)
- Performance improvement for DatetimeIndex constructor passing a list (GH 48609)
- Performance improvement in merge() and DataFrame.join() when joining on a sorted MultiIndex (GH 48504)
- Performance improvement in to_datetime() when parsing strings with timezone offsets (GH 50107)
- Performance improvement in DataFrame.loc() and Series.loc() for tuple-based indexing of a MultiIndex (GH 48384)
- Performance improvement for Series.replace() with categorical dtype (GH 49404)
- Performance improvement for MultiIndex.unique() (GH 48335)
- Performance improvement for indexing operations with nullable and arrow dtypes (GH 49420, GH 51316)
- Performance improvement for concat() with extension array backed indexes (GH 49128, GH 49178)
- Performance improvement for api.types.infer_dtype() (GH 51054)
- Reduce memory usage of DataFrame.to_pickle()/Series.to_pickle() when using BZ2 or LZMA (GH 49068)
- Performance improvement for StringArray constructor passing a numpy array with type np.str_ (GH 49109)
- Performance improvement in from_tuples() (GH 50620)
- Performance improvement in factorize() (GH 49177)
- Performance improvement in __setitem__() (GH 50248, GH 50632)
- Performance improvement in ArrowExtensionArray comparison methods when array contains NA (GH 50524)
- Performance improvement in to_numpy() (GH 49973, GH 51227)
- Performance improvement when parsing strings to BooleanDtype (GH 50613)
- Performance improvement in DataFrame.join() when joining on a subset of a MultiIndex (GH 48611)
- Performance improvement for MultiIndex.intersection() (GH 48604)
- Performance improvement in DataFrame.__setitem__() (GH 46267)
- Performance improvement in var and std for nullable dtypes (GH 48379)
- Performance improvement when iterating over pyarrow and nullable dtypes (GH 49825, GH 49851)
- Performance improvements to read_sas() (GH 47403, GH 47405, GH 47656, GH 48502)
- Memory improvement in RangeIndex.sort_values() (GH 48801)
- Performance improvement in Series.to_numpy() if copy=True by avoiding copying twice (GH 24345)
- Performance improvement in Series.rename() with MultiIndex (GH 21055)
- Performance improvement in DataFrameGroupBy and SeriesGroupBy when by is a categorical type and sort=False (GH 48976)
- Performance improvement in DataFrameGroupBy and SeriesGroupBy when by is a categorical type and observed=False (GH 49596)
- Performance improvement in read_stata() with parameter index_col set to None (the default). Now the index will be a RangeIndex instead of Int64Index (GH 49745)
- Performance improvement in merge() when not merging on the index; the new index will now be a RangeIndex instead of Int64Index (GH 49478)
- Performance improvement in DataFrame.to_dict() and Series.to_dict() when using any non-object dtypes (GH 46470)
- Performance improvement in read_html() when there are multiple tables (GH 49929)
- Performance improvement in Period constructor when constructing from a string or integer (GH 38312)
- Performance improvement in to_datetime() when using '%Y%m%d' format (GH 17410)
- Performance improvement in to_datetime() when format is given or can be inferred (GH 50465)
- Performance improvement in Series.median() for nullable dtypes (GH 50838)
- Performance improvement in read_csv() when passing to_datetime() lambda-function to date_parser and inputs have mixed timezone offsets (GH 35296)
- Performance improvement in isna() and isnull() (GH 50658)
- Performance improvement in SeriesGroupBy.value_counts() with categorical dtype (GH 46202)
- Fixed a reference leak in read_hdf() (GH 37441)
- Fixed a memory leak in DataFrame.to_json() and Series.to_json() when serializing datetimes and timedeltas (GH 40443)
- Decreased memory usage in many DataFrameGroupBy methods (GH 51090)
- Performance improvement in DataFrame.round() for an integer decimal parameter (GH 17254)
- Performance improvement in DataFrame.replace() and Series.replace() when using a large dict for to_replace (GH 6697)
- Memory improvement in StataReader when reading seekable files (GH 48922)
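One of the improvements above has a visible side effect: merging on columns now produces a RangeIndex rather than a materialized integer index. A minimal sketch, assuming pandas 2.0 or later; the frames left and right and their columns are hypothetical.

```python
import pandas as pd

left = pd.DataFrame({"key": [1, 2, 3], "x": ["a", "b", "c"]})
right = pd.DataFrame({"key": [2, 3, 4], "y": [10.0, 20.0, 30.0]})

# Merging on a column (not on the index) gives the result a fresh default index.
merged = pd.merge(left, right, on="key")

print(type(merged.index).__name__)
```

A RangeIndex stores only start, stop and step, so it is cheaper to build and hold than an explicit array of positions.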
Bug fixes#
Categorical#
- Bug in Categorical.set_categories() losing dtype information (GH 48812)
- Bug in Series.replace() with categorical dtype when to_replace values overlap with new values (GH 49404)
- Bug in Series.replace() with categorical dtype losing nullable dtypes of underlying categories (GH 49404)
- Bug in DataFrame.groupby() and Series.groupby() would reorder categories when used as a grouper (GH 48749)
- Bug in Categorical constructor when constructing from a Categorical object and dtype="category" losing ordered-ness (GH 49309)
- Bug in SeriesGroupBy.min(), SeriesGroupBy.max(), DataFrameGroupBy.min(), and DataFrameGroupBy.max() with unordered CategoricalDtype with no groups failing to raise TypeError (GH 51034)
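The constructor fix for ordered-ness can be illustrated with a short sketch, assuming pandas 2.0 or later; the category labels are made up for illustration.

```python
import pandas as pd

ordered_cat = pd.Categorical(
    ["low", "high", "low"], categories=["low", "high"], ordered=True
)

# Re-wrapping with dtype="category" is expected to keep the ordered flag
# and the existing categories.
rebuilt = pd.Categorical(ordered_cat, dtype="category")

print(rebuilt.ordered, list(rebuilt.categories))
```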
Datetimelike#
- Bug in pandas.infer_freq(), raising TypeError when inferred on RangeIndex (GH 47084)
- Bug in to_datetime() incorrectly raising OverflowError with string arguments corresponding to large integers (GH 50533)
- Bug in to_datetime() was raising on invalid offsets with errors='coerce' and infer_datetime_format=True (GH 48633)
- Bug in DatetimeIndex constructor failing to raise when tz=None is explicitly specified in conjunction with timezone-aware dtype or data (GH 48659)
- Bug in subtracting a datetime scalar from DatetimeIndex failing to retain the original freq attribute (GH 48818)
- Bug in pandas.tseries.holiday.Holiday where a half-open date interval causes inconsistent return types from USFederalHolidayCalendar.holidays() (GH 49075)
- Bug in rendering DatetimeIndex and Series and DataFrame with timezone-aware dtypes with dateutil or zoneinfo timezones near daylight-savings transitions (GH 49684)
- Bug in to_datetime() was raising ValueError when parsing Timestamp, datetime.datetime, datetime.date, or np.datetime64 objects when non-ISO8601 format was passed (GH 49298, GH 50036)
- Bug in to_datetime() was raising ValueError when parsing empty string and non-ISO8601 format was passed. Now, empty strings will be parsed as NaT, for compatibility with how this is done for ISO8601 formats (GH 50251)
- Bug in Timestamp was showing UserWarning, which was not actionable by users, when parsing non-ISO8601 delimited date strings (GH 50232)
- Bug in to_datetime() was showing misleading ValueError when parsing dates with format containing ISO week directive and ISO weekday directive (GH 50308)
- Bug in Timestamp.round() when the freq argument has zero-duration (e.g. “0ns”) returning incorrect results instead of raising (GH 49737)
- Bug in to_datetime() was not raising ValueError when invalid format was passed and errors was 'ignore' or 'coerce' (GH 50266)
- Bug in DateOffset was throwing TypeError when constructing with milliseconds and another super-daily argument (GH 49897)
- Bug in to_datetime() was not raising ValueError when parsing string with decimal date with format '%Y%m%d' (GH 50051)
- Bug in to_datetime() was not converting None to NaT when parsing mixed-offset date strings with ISO8601 format (GH 50071)
- Bug in to_datetime() was not returning input when parsing out-of-bounds date string with errors='ignore' and format='%Y%m%d' (GH 14487)
- Bug in to_datetime() was converting timezone-naive datetime.datetime to timezone-aware when parsing with timezone-aware strings, ISO8601 format, and utc=False (GH 50254)
- Bug in to_datetime() was throwing ValueError when parsing dates with ISO8601 format where some values were not zero-padded (GH 21422)
- Bug in to_datetime() was giving incorrect results when using format='%Y%m%d' and errors='ignore' (GH 26493)
- Bug in to_datetime() was failing to parse date strings 'today' and 'now' if format was not ISO8601 (GH 50359)
- Bug in Timestamp.utctimetuple() raising a TypeError (GH 32174)
- Bug in to_datetime() was raising ValueError when parsing mixed-offset Timestamp with errors='ignore' (GH 50585)
- Bug in to_datetime() was incorrectly handling floating-point inputs within 1 unit of the overflow boundaries (GH 50183)
- Bug in to_datetime() with unit of “Y” or “M” giving incorrect results, not matching pointwise Timestamp results (GH 50870)
- Bug in Series.interpolate() and DataFrame.interpolate() with datetime or timedelta dtypes incorrectly raising ValueError (GH 11312)
- Bug in to_datetime() was not returning input with errors='ignore' when input was out-of-bounds (GH 50587)
- Bug in DataFrame.from_records() when given a DataFrame input with timezone-aware datetime64 columns incorrectly dropping the timezone-awareness (GH 51162)
- Bug in to_datetime() was raising decimal.InvalidOperation when parsing date strings with errors='coerce' (GH 51084)
- Bug in to_datetime() with both unit and origin specified returning incorrect results (GH 42624)
- Bug in Series.astype() and DataFrame.astype() when converting an object-dtype object containing timezone-aware datetimes or strings to datetime64[ns] incorrectly localizing as UTC instead of raising TypeError (GH 50140)
- Bug in DataFrameGroupBy.quantile() and SeriesGroupBy.quantile() with datetime or timedelta dtypes giving incorrect results for groups containing NaT (GH 51373)
- Bug in DataFrameGroupBy.quantile() and SeriesGroupBy.quantile() incorrectly raising with PeriodDtype or DatetimeTZDtype (GH 51373)
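As an illustration of the empty-string handling described in this list, here is a minimal sketch assuming pandas 2.0 or later; the input strings and the day-first format are chosen only for the example.

```python
import pandas as pd

# A non-ISO8601 format combined with an empty string: the empty entry
# is parsed as NaT instead of raising ValueError.
parsed = pd.to_datetime(["01/02/2023", ""], format="%d/%m/%Y")

print(parsed)
```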
Timedelta#
- Bug in to_timedelta() raising error when input has nullable dtype Float64 (GH 48796)
- Bug in Timedelta constructor incorrectly raising instead of returning NaT when given a np.timedelta64("nat") (GH 48898)
- Bug in Timedelta constructor failing to raise when passed both a Timedelta object and keywords (e.g. days, seconds) (GH 48898)
- Bug in Timedelta comparisons with very large datetime.timedelta objects incorrectly raising OutOfBoundsTimedelta (GH 49021)
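The NaT handling in the Timedelta constructor can be checked with a short sketch, assuming pandas 2.0 or later together with NumPy.

```python
import numpy as np
import pandas as pd

# A not-a-time numpy timedelta is converted to pandas' NaT singleton
# rather than raising.
td = pd.Timedelta(np.timedelta64("NaT"))

print(td is pd.NaT)
```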
Timezones#
- Bug in Series.astype() and DataFrame.astype() with object-dtype containing multiple timezone-aware datetime objects with heterogeneous timezones to a DatetimeTZDtype incorrectly raising (GH 32581)
- Bug in to_datetime() was failing to parse date strings with timezone name when format was specified with %Z (GH 49748)
- Better error message when passing invalid values to ambiguous parameter in Timestamp.tz_localize() (GH 49565)
- Bug in string parsing incorrectly allowing a Timestamp to be constructed with an invalid timezone, which would raise when trying to print (GH 50668)
- Corrected TypeError message in objects_to_datetime64ns() to inform that DatetimeIndex has mixed timezones (GH 50974)
Numeric#
- Bug in DataFrame.add() where a ufunc could not be applied when inputs contain mixed DataFrame type and Series type (GH 39853)
- Bug in arithmetic operations on Series not propagating mask when combining masked dtypes and numpy dtypes (GH 45810, GH 42630)
- Bug in DataFrame.sem() and Series.sem() where an erroneous TypeError would always raise when using data backed by an ArrowDtype (GH 49759)
- Bug in Series.__add__() casting to object for list and masked Series (GH 22962)
- Bug in mode() where dropna=False was not respected when there were NA values (GH 50982)
- Bug in DataFrame.query() with engine="numexpr" where column names min or max would raise a TypeError (GH 50937)
- Bug in DataFrame.min() and DataFrame.max() with tz-aware data containing pd.NaT and axis=1 would return incorrect results (GH 51242)
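To illustrate mask propagation when masked and numpy dtypes are combined, here is a minimal sketch assuming pandas 2.0 or later; the two Series are hypothetical and only show the resulting behavior, not the original failing case from the issue.

```python
import pandas as pd

masked = pd.Series([1, pd.NA, 3], dtype="Int64")  # nullable (masked) integers
plain = pd.Series([10, 20, 30])                   # numpy-backed int64

# The missing entry stays missing and the result keeps the nullable dtype.
total = masked + plain

print(total)
```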
Conversion#
- Bug in constructing Series with int64 dtype from a string list raising instead of casting (GH 44923)
- Bug in constructing Series with masked dtype and boolean values with NA raising (GH 42137)
- Bug in DataFrame.eval() incorrectly raising an AttributeError when there are negative values in function call (GH 46471)
- Bug in Series.convert_dtypes() not converting dtype to nullable dtype when Series contains NA and has dtype object (GH 48791)
- Bug where any ExtensionDtype subclass with kind="M" would be interpreted as a timezone type (GH 34986)
- Bug in arrays.ArrowExtensionArray that would raise NotImplementedError when passed a sequence of strings or binary (GH 49172)
- Bug in Series.astype() raising pyarrow.ArrowInvalid when converting from a non-pyarrow string dtype to a pyarrow numeric type (GH 50430)
- Bug in DataFrame.astype() modifying input array inplace when converting to string and copy=False (GH 51073)
- Bug in Series.to_numpy() converting to NumPy array before applying na_value (GH 48951)
- Bug in DataFrame.astype() not copying data when converting to pyarrow dtype (GH 50984)
- Bug in to_datetime() was not respecting exact argument when format was an ISO8601 format (GH 12649)
- Bug in TimedeltaArray.astype() raising TypeError when converting to a pyarrow duration type (GH 49795)
- Bug in DataFrame.eval() and DataFrame.query() raising for extension array dtypes (GH 29618, GH 50261, GH 31913)
- Bug in Series() not copying data when created from Index and dtype is equal to dtype from Index (GH 52008)
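The na_value argument of Series.to_numpy() mentioned above can be sketched as follows, assuming pandas 2.0 or later with NumPy; the Series s is illustrative only.

```python
import numpy as np
import pandas as pd

s = pd.Series([1, pd.NA, 3], dtype="Int64")

# na_value replaces missing entries as part of the conversion, so the
# nullable integers can go straight to a float array.
arr = s.to_numpy(dtype="float64", na_value=np.nan)

print(arr)
```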
Strings#
- Bug in pandas.api.types.is_string_dtype() that would not return True for StringDtype or ArrowDtype with pyarrow.string() (GH 15585)
- Bug in converting string dtypes to “datetime64[ns]” or “timedelta64[ns]” incorrectly raising TypeError (GH 36153)
- Bug in setting values in a string-dtype column with an array, mutating the array as side effect when it contains missing values (GH 51299)
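A quick check of the is_string_dtype() fix, as a sketch assuming pandas 2.0 or later. Only the StringDtype case is shown so that the example does not depend on pyarrow being installed.

```python
import pandas as pd
from pandas.api.types import is_string_dtype

s = pd.Series(["a", "b"], dtype="string")

# Both the dtype object and a Series that uses it are recognized as strings.
checks = (is_string_dtype(pd.StringDtype()), is_string_dtype(s))

print(checks)
```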
Interval#
- Bug in IntervalIndex.is_overlapping() returning incorrect output if interval has duplicate left boundaries (GH 49581)
- Bug in Series.infer_objects() failing to infer IntervalDtype for an object series of Interval objects (GH 50090)
- Bug in Series.shift() with IntervalDtype and invalid null fill_value failing to raise TypeError (GH 51258)
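The infer_objects() fix can be illustrated with a short sketch, assuming pandas 2.0 or later; the intervals are arbitrary.

```python
import pandas as pd

# An object-dtype Series that holds only Interval objects.
obj = pd.Series([pd.Interval(0, 1), pd.Interval(1, 2)], dtype=object)

# infer_objects() is expected to recover an interval dtype.
inferred = obj.infer_objects()

print(obj.dtype, inferred.dtype)
```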
Indexing#
- Bug in DataFrame.__setitem__() raising when indexer is a DataFrame with boolean dtype (GH 47125)
- Bug in DataFrame.reindex() filling with wrong values when indexing columns and index for uint dtypes (GH 48184)
- Bug in DataFrame.loc() when setting DataFrame with different dtypes coercing values to single dtype (GH 50467)
- Bug in DataFrame.sort_values() where None was not returned when by is empty list and inplace=True (GH 50643)
- Bug in DataFrame.loc() coercing dtypes when setting values with a list indexer (GH 49159)
- Bug in Series.loc() raising error for out of bounds end of slice indexer (GH 50161)
- Bug in DataFrame.loc() raising ValueError with all False bool indexer and empty object (GH 51450)
- Bug in DataFrame.loc() raising ValueError with bool indexer and MultiIndex (GH 47687)
- Bug in DataFrame.loc() raising IndexError when setting values for a pyarrow-backed column with a non-scalar indexer (GH 50085)
- Bug in DataFrame.__getitem__(), Series.__getitem__(), DataFrame.__setitem__() and Series.__setitem__() when indexing on indexes with extension float dtypes (Float64 & Float64) or complex dtypes using integers (GH 51053)
- Bug in DataFrame.loc() modifying object when setting incompatible value with an empty indexer (GH 45981)
- Bug in DataFrame.__setitem__() raising ValueError when right hand side is DataFrame with MultiIndex columns (GH 49121)
- Bug in DataFrame.reindex() casting dtype to object when DataFrame has single extension array column when re-indexing columns and index (GH 48190)
- Bug in DataFrame.iloc() raising IndexError when indexer is a Series with numeric extension array dtype (GH 49521)
- Bug in describe() when formatting percentiles in the resulting index showed more decimals than needed (GH 46362)
- Bug in DataFrame.compare() not recognizing differences when comparing NA with value in nullable dtypes (GH 48939)
- Bug in Series.rename() with MultiIndex losing extension array dtypes (GH 21055)
- Bug in DataFrame.isetitem() coercing extension array dtypes in DataFrame to object (GH 49922)
- Bug in Series.__getitem__() returning corrupt object when selecting from an empty pyarrow backed object (GH 51734)
- Bug in BusinessHour would cause creation of DatetimeIndex to fail when no opening hour was included in the index (GH 49835)
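The sort_values() return value fix can be checked with a minimal sketch, assuming pandas 2.0 or later; the frame df is made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({"a": [3, 1, 2]})

# With inplace=True the method returns None, also when no sort keys are given.
result = df.sort_values(by=[], inplace=True)

print(result, df["a"].tolist())
```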
Missing#
- Bug in Index.equals() raising TypeError when Index consists of tuples that contain NA (GH 48446)
- Bug in Series.map() caused incorrect result when data has NaNs and defaultdict mapping was used (GH 48813)
- Bug in NA raising a TypeError instead of returning NA when performing a binary operation with a bytes object (GH 49108)
- Bug in DataFrame.update() with overwrite=False raising TypeError when self has column with NaT values and column not present in other (GH 16713)
- Bug in Series.replace() raising RecursionError when replacing value in object-dtype Series containing NA (GH 47480)
- Bug in Series.replace() raising RecursionError when replacing value in numeric Series with NA (GH 50758)
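The NA propagation with bytes operands can be sketched as follows, assuming pandas 2.0 or later; the bytes literal is arbitrary.

```python
import pandas as pd

# A binary operation between NA and a bytes object propagates NA
# instead of raising TypeError.
result = pd.NA + b"abc"

print(result is pd.NA)
```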
MultiIndex#
- Bug in MultiIndex.get_indexer() not matching NaN values (GH 29252, GH 37222, GH 38623, GH 42883, GH 43222, GH 46173, GH 48905)
- Bug in MultiIndex.argsort() raising TypeError when index contains NA (GH 48495)
- Bug in MultiIndex.difference() losing extension array dtype (GH 48606)
- Bug in MultiIndex.set_levels raising IndexError when setting empty level (GH 48636)
- Bug in MultiIndex.unique() losing extension array dtype (GH 48335)
- Bug in MultiIndex.intersection() losing extension array (GH 48604)
- Bug in MultiIndex.union() losing extension array (GH 48498, GH 48505, GH 48900)
- Bug in MultiIndex.union() not sorting when sort=None and index contains missing values (GH 49010)
- Bug in MultiIndex.append() not checking names for equality (GH 48288)
- Bug in MultiIndex.symmetric_difference() losing extension array (GH 48607)
- Bug in MultiIndex.join() losing dtypes when MultiIndex has duplicates (GH 49830)
- Bug in MultiIndex.putmask() losing extension array (GH 49830)
- Bug in MultiIndex.value_counts() returning a Series indexed by flat index of tuples instead of a MultiIndex (GH 49558)
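The value_counts() change is easy to observe. A minimal sketch, assuming pandas 2.0 or later; the level names num and let are hypothetical.

```python
import pandas as pd

mi = pd.MultiIndex.from_tuples(
    [(1, "a"), (1, "a"), (2, "b")], names=["num", "let"]
)

# The counts are indexed by a MultiIndex rather than a flat index of tuples.
counts = mi.value_counts()

print(type(counts.index).__name__)
print(counts.loc[(1, "a")])
```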
I/O#
- Bug in read_sas() caused fragmentation of DataFrame and raised errors.PerformanceWarning (GH 48595)
- Improved error message in read_excel() by including the offending sheet name when an exception is raised while reading a file (GH 48706)
- Bug when pickling a subset of PyArrow-backed data that would serialize the entire data instead of the subset (GH 42600)
- Bug in read_sql_query() ignoring dtype argument when chunksize is specified and result is empty (GH 50245)
- Bug in read_csv() for a single-line csv with fewer columns than names raised errors.ParserError with engine="c" (GH 47566)
- Bug in read_json() raising with orient="table" and NA value (GH 40255)
- Bug in displaying string dtypes not showing storage option (GH 50099)
- Bug in DataFrame.to_string() with header=False that printed the index name on the same line as the first row of the data (GH 49230)
- Bug in DataFrame.to_string() ignoring float formatter for extension arrays (GH 39336)
- Fixed memory leak which stemmed from the initialization of the internal JSON module (GH 49222)
- Fixed issue where json_normalize() would incorrectly remove leading characters from column names that matched the sep argument (GH 49861)
- Bug in read_csv() unnecessarily overflowing for extension array dtype when containing NA (GH 32134)
- Bug in DataFrame.to_dict() not converting NA to None (GH 50795)
- Bug in DataFrame.to_json() where it would segfault when failing to encode a string (GH 50307)
- Bug in DataFrame.to_html() with na_rep set when the DataFrame contains non-scalar data (GH 47103)
- Bug in read_xml() where file-like objects failed when iterparse is used (GH 50641)
- Bug in read_csv() when engine="pyarrow" where encoding parameter was not handled correctly (GH 51302)
- Bug in read_xml() ignored repeated elements when iterparse is used (GH 51183)
- Bug in ExcelWriter leaving file handles open if an exception occurred during instantiation (GH 51443)
- Bug in DataFrame.to_parquet() where non-string index or columns were raising a ValueError when engine="pyarrow" (GH 52036)
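The json_normalize() fix concerns keys that begin with the separator character. A minimal sketch, assuming pandas 2.0 or later; the record and its keys are invented for the example.

```python
import pandas as pd

# The top-level key starts with the same character that is used as sep.
record = {"_id": {"a": 1}, "x": 2}

flat = pd.json_normalize(record, sep="_")

# The leading underscore of "_id" is expected to survive in the column name.
print(sorted(flat.columns))
```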
Period#
- Bug in Period.strftime() and PeriodIndex.strftime(), raising UnicodeDecodeError when a locale-specific directive was passed (GH 46319)
- Bug in adding a Period object to an array of DateOffset objects incorrectly raising TypeError (GH 50162)
- Bug in Period where passing a string with finer resolution than nanosecond would result in a KeyError instead of dropping the extra precision (GH 50417)
- Bug in parsing strings representing Week-periods e.g. “2017-01-23/2017-01-29” as minute-frequency instead of week-frequency (GH 50803)
- Bug in DataFrameGroupBy.sum(), DataFrameGroupBy.cumsum(), DataFrameGroupBy.prod(), DataFrameGroupBy.cumprod() with PeriodDtype failing to raise TypeError (GH 51040)
- Bug in parsing empty string with Period incorrectly raising ValueError instead of returning NaT (GH 51349)
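The week-period parsing fix can be illustrated with the string quoted in the list above. This is a sketch assuming pandas 2.0 or later.

```python
import pandas as pd

# A start/end string spanning Monday to Sunday is parsed as a weekly period.
p = pd.Period("2017-01-23/2017-01-29")

print(p.freqstr)
```

The range ends on a Sunday, so the inferred frequency is anchored on that day.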
Plotting#
- Bug in DataFrame.plot.hist(), not dropping elements of weights corresponding to NaN values in data (GH 48884)
- ax.set_xlim was sometimes raising UserWarning which users couldn’t address due to set_xlim not accepting parsing arguments; the converter now uses Timestamp() instead (GH 49148)
Groupby/resample/rolling#
- Bug in ExponentialMovingWindow with online not raising a NotImplementedError for unsupported operations (GH 48834)
- Bug in DataFrameGroupBy.sample() raises ValueError when the object is empty (GH 48459)
- Bug in Series.groupby() raises ValueError when an entry of the index is equal to the name of the index (GH 48567)
- Bug in DataFrameGroupBy.resample() produces inconsistent results when passing empty DataFrame (GH 47705)
- Bug in DataFrameGroupBy and SeriesGroupBy would not include unobserved categories in result when grouping by categorical indexes (GH 49354)
- Bug in DataFrameGroupBy and SeriesGroupBy would change result order depending on the input index when grouping by categoricals (GH 49223)
- Bug in DataFrameGroupBy and SeriesGroupBy when grouping on categorical data would sort result values even when used with sort=False (GH 42482)
- Bug in DataFrameGroupBy.apply() and SeriesGroupBy.apply with as_index=False would not attempt the computation without using the grouping keys when using them failed with a TypeError (GH 49256)
- Bug in DataFrameGroupBy.describe() would describe the group keys (GH 49256)
- Bug in SeriesGroupBy.describe() with as_index=False would have the incorrect shape (GH 49256)
- Bug in DataFrameGroupBy and SeriesGroupBy with dropna=False would drop NA values when the grouper was categorical (GH 36327)
- Bug in SeriesGroupBy.nunique() would incorrectly raise when the grouper was an empty categorical and observed=True (GH 21334)
- Bug in SeriesGroupBy.nth() would raise when grouper contained NA values after subsetting from a DataFrameGroupBy (GH 26454)
- Bug in DataFrame.groupby() would not include a Grouper specified by key in the result when as_index=False (GH 50413)
- Bug in DataFrameGroupBy.value_counts() would raise when used with a TimeGrouper (GH 50486)
- Bug in Resampler.size() caused a wide DataFrame to be returned instead of a Series with MultiIndex (GH 46826)
- Bug in DataFrameGroupBy.transform() and SeriesGroupBy.transform() would raise incorrectly when grouper had axis=1 for "idxmin" and "idxmax" arguments (GH 45986)
- Bug in DataFrameGroupBy would raise when used with an empty DataFrame, categorical grouper, and dropna=False (GH 50634)
- Bug in SeriesGroupBy.value_counts() did not respect sort=False (GH 50482)
- Bug in DataFrameGroupBy.resample() raises KeyError when getting the result from a key list when resampling on time index (GH 50840)
- Bug in DataFrameGroupBy.transform() and SeriesGroupBy.transform() would raise incorrectly when grouper had axis=1 for "ngroup" argument (GH 45986)
- Bug in DataFrameGroupBy.describe() produced incorrect results when data had duplicate columns (GH 50806)
- Bug in DataFrameGroupBy.agg() with engine="numba" failing to respect as_index=False (GH 51228)
- Bug in DataFrameGroupBy.agg(), SeriesGroupBy.agg(), and Resampler.agg() would ignore arguments when passed a list of functions (GH 50863)
- Bug in DataFrameGroupBy.ohlc() ignoring as_index=False (GH 51413)
- Bug in DataFrameGroupBy.agg() after subsetting columns (e.g. .groupby(...)[["a", "b"]]) would not include groupings in the result (GH 51186)
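The unobserved-categories fix (GH 49354) can be sketched as follows, assuming pandas 2.0 or later. The data and the names idx, ser and result are made up for illustration; observed=False is passed explicitly so the example does not depend on the default.

```python
import pandas as pd

# Category "c" is declared but never appears in the index.
idx = pd.CategoricalIndex(["a", "a", "b"], categories=["a", "b", "c"])
ser = pd.Series([1, 2, 3], index=idx)

# Group by the categorical index level and keep unobserved categories.
result = ser.groupby(level=0, observed=False).sum()

print(result.index.tolist())  # ['a', 'b', 'c']
print(result.tolist())        # [3, 3, 0]
```

The unobserved category "c" shows up with the empty-sum value 0 instead of being dropped from the result.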
Reshaping#
- Bug in DataFrame.pivot_table() raising TypeError for nullable dtype and margins=True (GH 48681)
- Bug in DataFrame.unstack() and Series.unstack() unstacking wrong level of MultiIndex when MultiIndex has mixed names (GH 48763)
- Bug in DataFrame.melt() losing extension array dtype (GH 41570)
- Bug in DataFrame.pivot() not respecting None as column name (GH 48293)
- Bug in DataFrame.join() when left_on or right_on is or includes a CategoricalIndex incorrectly raising AttributeError (GH 48464)
- Bug in DataFrame.pivot_table() raising ValueError with parameter margins=True when result is an empty DataFrame (GH 49240)
- Clarified error message in merge() when passing invalid validate option (GH 49417)
- Bug in DataFrame.explode() raising ValueError on multiple columns with NaN values or empty lists (GH 46084)
- Bug in DataFrame.transpose() with IntervalDtype column with timedelta64[ns] endpoints (GH 44917)
- Bug in DataFrame.agg() and Series.agg() would ignore arguments when passed a list of functions (GH 50863)
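A minimal sketch of the DataFrame.melt() fix (GH 41570), assuming pandas 2.0 or later: two nullable Int64 columns are melted and the value column keeps the extension dtype. The frame and its column names are invented for the example.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "id": [1, 2],
        "x": pd.array([1, None], dtype="Int64"),
        "y": pd.array([3, 4], dtype="Int64"),
    }
)

# Both value columns share the nullable Int64 dtype, so the melted
# "value" column is expected to keep it instead of becoming object.
long = df.melt(id_vars="id", value_vars=["x", "y"])

print(long["value"].dtype)  # Int64
```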
Sparse#
- Bug in Series.astype() when converting a SparseDtype with datetime64[ns] subtype to int64 dtype raising, inconsistent with the non-sparse behavior (GH 49631, GH 50087)
- Bug in Series.astype() when converting from datetime64[ns] to Sparse[datetime64[ns]] incorrectly raising (GH 50082)
- Bug in Series.sparse.to_coo() raising SystemError when MultiIndex contains an ExtensionArray (GH 50996)
ExtensionArray#
- Bug in Series.mean() overflowing unnecessarily with nullable integers (GH 48378)
- Bug in Series.tolist() for nullable dtypes returning numpy scalars instead of python scalars (GH 49890)
- Bug in Series.round() for pyarrow-backed dtypes raising AttributeError (GH 50437)
- Bug when concatenating an empty DataFrame with an ExtensionDtype to another DataFrame with the same ExtensionDtype, the resulting dtype turned into object (GH 48510)
- Bug in array.PandasArray.to_numpy() raising with NA value when na_value is specified (GH 40638)
- Bug in api.types.is_numeric_dtype() where a custom ExtensionDtype would not return True if _is_numeric returned True (GH 50563)
- Bug in api.types.is_integer_dtype(), api.types.is_unsigned_integer_dtype(), api.types.is_signed_integer_dtype(), api.types.is_float_dtype() where a custom ExtensionDtype would not return True if kind returned the corresponding NumPy type (GH 50667)
- Bug in Series constructor unnecessarily overflowing for nullable unsigned integer dtypes (GH 38798, GH 25880)
- Bug in setting non-string value into StringArray raising ValueError instead of TypeError (GH 49632)
- Bug in DataFrame.reindex() not honoring the default copy=True keyword in case of columns with ExtensionDtype (and as a result also selecting multiple columns with getitem ([]) didn’t correctly result in a copy) (GH 51197)
- Bug in ArrowExtensionArray logical operations & and | raising KeyError (GH 51688)
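The Series.tolist() change (GH 49890) can be checked with a short sketch like the one below, assuming pandas 2.0 or later. The series contents are arbitrary.

```python
import pandas as pd

ser = pd.Series([1, 2, None], dtype="Int64")
values = ser.tolist()

# Non-missing entries should come back as built-in Python ints,
# not numpy integer scalars; the missing entry stays pd.NA.
print([type(v).__name__ for v in values[:2]])  # ['int', 'int']
```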
Styler#
- Fix background_gradient() for nullable dtype Series with NA values (GH 50712)
Metadata#
- Fixed metadata propagation in DataFrame.corr() and DataFrame.cov() (GH 28283)
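One way to observe this, assuming the propagated metadata includes the experimental DataFrame.attrs dictionary (the note itself does not say which metadata is meant), is sketched below for pandas 2.0 or later. The attrs key and value are hypothetical.

```python
import pandas as pd

df = pd.DataFrame({"a": [1.0, 2.0, 3.0], "b": [2.0, 4.0, 7.0]})

# Attach illustrative metadata to the frame.
df.attrs["source"] = "sensor-42"

# The correlation matrix is expected to carry the same attrs.
corr = df.corr()
print(corr.attrs)
```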
Other#
- Bug in incorrectly accepting dtype strings containing “[pyarrow]” more than once (GH 51548)
- Bug in Series.searchsorted() inconsistent behavior when accepting DataFrame as parameter value (GH 49620)
- Bug in array() failing to raise on DataFrame inputs (GH 51167)
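The array() fix (GH 51167) can be sketched as below, assuming pandas 2.0 or later. The note does not state which exception is raised, so the example catches both TypeError and ValueError and only reports what it sees; the raised flag is an illustrative helper.

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2]})

try:
    pd.array(df)  # a DataFrame is not valid 1-dimensional input
    raised = False
except (TypeError, ValueError) as err:
    raised = True
    print(type(err).__name__, err)
```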
Contributors#
A total of 260 people contributed patches to this release. People with a “+” by their names contributed a patch for the first time.
- 5j9 +
- ABCPAN-rank +
- Aarni Koskela +
- Aashish KC +
- Abubeker Mohammed +
- Adam Mróz +
- Adam Ormondroyd +
- Aditya Anulekh +
- Ahmed Ibrahim
- Akshay Babbar +
- Aleksa Radojicic +
- Alex +
- Alex Buzenet +
- Alex Kirko
- Allison Kwan +
- Amay Patel +
- Ambuj Pawar +
- Amotz +
- Andreas Schwab +
- Andrew Chen +
- Anton Shevtsov
- Antonio Ossa Guerra +
- Antonio Ossa-Guerra +
- Anushka Bishnoi +
- Arda Kosar
- Armin Berres
- Asadullah Naeem +
- Asish Mahapatra
- Bailey Lissington +
- BarkotBeyene
- Ben Beasley
- Bhavesh Rajendra Patil +
- Bibek Jha +
- Bill +
- Bishwas +
- CarlosGDCJ +
- Carlotta Fabian +
- Chris Roth +
- Chuck Cadman +
- Corralien +
- DG +
- Dan Hendry +
- Daniel Isaac
- David Kleindienst +
- David Poznik +
- David Rudel +
- DavidKleindienst +
- Dea María Léon +
- Deepak Sirohiwal +
- Dennis Chukwunta
- Douglas Lohmann +
- Dries Schaumont
- Dustin K +
- Edoardo Abati +
- Eduardo Chaves +
- Ege Özgüroğlu +
- Ekaterina Borovikova +
- Eli Schwartz +
- Elvis Lim +
- Emily Taylor +
- Emma Carballal Haire +
- Erik Welch +
- Fangchen Li
- Florian Hofstetter +
- Flynn Owen +
- Fredrik Erlandsson +
- Gaurav Sheni
- Georeth Chow +
- George Munyoro +
- Guilherme Beltramini
- Gulnur Baimukhambetova +
- H L +
- Hans
- Hatim Zahid +
- HighYoda +
- Hiki +
- Himanshu Wagh +
- Hugo van Kemenade +
- Idil Ismiguzel +
- Irv Lustig
- Isaac Chung
- Isaac Virshup
- JHM Darbyshire
- JHM Darbyshire (iMac)
- JMBurley
- Jaime Di Cristina
- Jan Koch
- JanVHII +
- Janosh Riebesell
- JasmandeepKaur +
- Jeremy Tuloup
- Jessica M +
- Jonas Haag
- Joris Van den Bossche
- João Meirelles +
- Julia Aoun +
- Justus Magin +
- Kang Su Min +
- Kevin Sheppard
- Khor Chean Wei
- Kian Eliasi
- Kostya Farber +
- KotlinIsland +
- Lakmal Pinnaduwage +
- Lakshya A Agrawal +
- Lawrence Mitchell +
- Levi Ob +
- Loic Diridollou
- Lorenzo Vainigli +
- Luca Pizzini +
- Lucas Damo +
- Luke Manley
- Madhuri Patil +
- Marc Garcia
- Marco Edward Gorelli
- Marco Gorelli
- MarcoGorelli
- Maren Westermann +
- Maria Stazherova +
- Marie K +
- Marielle +
- Mark Harfouche +
- Marko Pacak +
- Martin +
- Matheus Cerqueira +
- Matheus Pedroni +
- Matteo Raso +
- Matthew Roeschke
- MeeseeksMachine +
- Mehdi Mohammadi +
- Michael Harris +
- Michael Mior +
- Natalia Mokeeva +
- Neal Muppidi +
- Nick Crews
- Nishu Choudhary +
- Noa Tamir
- Noritada Kobayashi
- Omkar Yadav +
- P. Talley +
- Pablo +
- Pandas Development Team
- Parfait Gasana
- Patrick Hoefler
- Pedro Nacht +
- Philip +
- Pietro Battiston
- Pooja Subramaniam +
- Pranav Saibhushan Ravuri +
- Pranav. P. A +
- Ralf Gommers +
- RaphSku +
- Richard Shadrach
- Robsdedude +
- Roger
- Roger Thomas
- RogerThomas +
- SFuller4 +
- Salahuddin +
- Sam Rao
- Sean Patrick Malloy +
- Sebastian Roll +
- Shantanu
- Shashwat +
- Shashwat Agrawal +
- Shiko Wamwea +
- Shoham Debnath
- Shubhankar Lohani +
- Siddhartha Gandhi +
- Simon Hawkins
- Soumik Dutta +
- Sowrov Talukder +
- Stefanie Molin
- Stefanie Senger +
- Stepfen Shawn +
- Steven Rotondo
- Stijn Van Hoey
- Sudhansu +
- Sven
- Sylvain MARIE
- Sylvain Marié
- Tabea Kossen +
- Taylor Packard
- Terji Petersen
- Thierry Moisan
- Thomas H +
- Thomas Li
- Torsten Wörtwein
- Tsvika S +
- Tsvika Shapira +
- Vamsi Verma +
- Vinicius Akira +
- William Andrea
- William Ayd
- William Blum +
- Wilson Xing +
- Xiao Yuan +
- Xnot +
- Yasin Tatar +
- Yuanhao Geng
- Yvan Cywan +
- Zachary Moon +
- Zhengbo Wang +
- abonte +
- adrienpacifico +
- alm
- amotzop +
- andyjessen +
- anonmouse1 +
- bang128 +
- bishwas jha +
- calhockemeyer +
- carla-alves-24 +
- carlotta +
- casadipietra +
- catmar22 +
- cfabian +
- codamuse +
- dataxerik
- davidleon123 +
- dependabot[bot] +
- fdrocha +
- github-actions[bot]
- himanshu_wagh +
- iofall +
- jakirkham +
- jbrockmendel
- jnclt +
- joelchen +
- joelsonoda +
- joshuabello2550
- joycewamwea +
- kathleenhang +
- krasch +
- ltoniazzi +
- luke396 +
- milosz-martynow +
- minat-hub +
- mliu08 +
- monosans +
- nealxm
- nikitaved +
- paradox-lab +
- partev
- raisadz +
- ram vikram singh +
- rebecca-palmer
- sarvaSanjay +
- seljaks +
- silviaovo +
- smij720 +
- soumilbaldota +
- stellalin7 +
- strawberry beach sandals +
- tmoschou +
- uzzell +
- yqyqyq-W +
- yun +
- Ádám Lippai
- 김동현 (Daniel Donghyun Kim) +