What’s new in 3.0.0 (Month XX, 2025)#
These are the changes in pandas 3.0.0. See Release notes for a full changelog including other versions of pandas.
Enhancements#
Dedicated string data type by default#
Historically, pandas represented string columns with the NumPy object data type. This representation has numerous problems: it is not specific to strings (any Python object can be stored in an object-dtype array, not just strings) and it is often inefficient, both in performance and in memory usage.
Starting with pandas 3.0, a dedicated string data type is enabled by default (backed by PyArrow under the hood, if installed, otherwise falling back to being backed by NumPy object-dtype). This means that pandas will start inferring columns containing string data as the new str data type when creating pandas objects, such as in constructors or IO functions.
Old behavior:
ser = pd.Series(["a", "b"])
0    a
1    b
dtype: object
New behavior:
ser = pd.Series(["a", "b"])
0    a
1    b
dtype: str
The string data type that is used in these scenarios will mostly behave as the NumPy object dtype would, including missing value semantics and general operations on these columns.
The main characteristics of the new string data type:
- It is inferred by default for string data (instead of object dtype).
- The str dtype can only hold strings (or missing values), in contrast to object dtype; setting a non-string value raises.
- The missing value sentinel is always NaN (np.nan) and follows the same missing value semantics as the other default dtypes.
Those intentional changes can have breaking consequences, for example when checking for the .dtype being object dtype or checking the exact missing value sentinel. See the Migration guide for the new string data type (pandas 3.0) for more details on the behaviour changes and how to adapt your code to the new default.
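For code that branched on the old object dtype, here is a minimal sketch of how such checks might look under the new default (the exact exception raised by a non-string setitem is not specified above, so it is caught broadly here):

```python
import pandas as pd

ser = pd.Series(["a", "b"])

# Previously inferred as object dtype; under pandas 3.0 this is the new str dtype.
print(ser.dtype)            # str
print(ser.dtype == object)  # False -- dtype checks written against object need updating

# The str dtype holds only strings (or missing values); assigning anything else fails.
try:
    ser[0] = 123
except (TypeError, ValueError) as err:
    print("non-string setitem rejected:", err)
```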
Copy-on-Write#
The new “copy-on-write” behaviour in pandas 3.0 brings changes in behavior in how pandas operates with respect to copies and views. A summary of the changes:
- The result of any indexing operation (subsetting a DataFrame or Series in any way, i.e. including accessing a DataFrame column as a Series) or any method returning a new DataFrame or Series, always behaves as if it were a copy in terms of user API.
- As a consequence, if you want to modify an object (DataFrame or Series), the only way to do this is to directly modify that object itself.
The main goal of this change is to make the user API more consistent and predictable. There is now a clear rule: any subset or returned series/dataframe always behaves as a copy of the original, and thus never modifies the original (before pandas 3.0, whether a derived object would be a copy or a view depended on the exact operation performed, which was often confusing).
Because every single indexing step now behaves as a copy, this also means that “chained assignment” (updating a DataFrame with multiple setitem steps) will stop working. Because this now consistently never works, the SettingWithCopyWarning is removed.
The new behavioral semantics are explained in more detail in the user guide about Copy-on-Write.
A secondary goal is to improve performance by avoiding unnecessary copies. As mentioned above, every new DataFrame or Series returned from an indexing operation or method behaves as a copy, but under the hood pandas will use views as much as possible, and only copy when needed to guarantee the “behaves as a copy” behaviour (this is the actual “copy-on-write” mechanism used as an implementation detail).
Some of the behaviour changes described above are breaking changes in pandas 3.0. When upgrading to pandas 3.0, it is recommended to first upgrade to pandas 2.3 to get deprecation warnings for a subset of those changes. The migration guide explains the upgrade process in more detail.
Setting the option mode.copy_on_write no longer has any impact. The option is deprecated and will be removed in pandas 4.0.
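As a short illustration of the chained-assignment change described above (a sketch; the only point is that the original frame is no longer modified through an intermediate object):

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]})

# Chained assignment: the first indexing step behaves as a copy, so this
# no longer updates df (and SettingWithCopyWarning no longer exists to warn about it).
df[df["a"] > 1]["b"] = 10

# Modify the object directly instead, e.g. with a single .loc step:
df.loc[df["a"] > 1, "b"] = 10
```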
pd.col syntax can now be used in DataFrame.assign() and DataFrame.loc()#
You can now use pd.col to create callables for use in dataframe methods which accept them. For example, if you have a dataframe
In [1]: df = pd.DataFrame({'a': [1, 1, 2], 'b': [4, 5, 6]})
and you want to create a new column 'c' by summing 'a' and 'b', then instead of
In [2]: df.assign(c = lambda df: df['a'] + df['b'])
Out[2]:
   a  b  c
0  1  4  5
1  1  5  6
2  2  6  8
you can now write:
In [3]: df.assign(c = pd.col('a') + pd.col('b'))
Out[3]:
   a  b  c
0  1  4  5
1  1  5  6
2  2  6  8
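The section title also mentions DataFrame.loc(); presumably the same expressions can stand in for callable filters there. A hedged sketch (not taken from the release notes themselves):

```python
# Select the rows where column 'a' exceeds 1, without writing a lambda:
df.loc[pd.col("a") > 1]
# equivalent to: df.loc[lambda df: df["a"] > 1]
```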
New Deprecation Policy#
pandas 3.0.0 introduces a new 3-stage deprecation policy: using DeprecationWarning initially, then switching to FutureWarning for broader visibility in the last minor version before the next major release, and then removal of the deprecated functionality in the major release. This was done to give downstream packages more time to adjust to pandas deprecations, which should reduce the number of warnings that a user gets from code that isn’t theirs. See PDEP 17 for more details.
All warnings for upcoming changes in pandas will have the base class pandas.errors.PandasChangeWarning. Users may also use the following subclasses to control warnings.
- pandas.errors.Pandas4Warning: Warnings which will be enforced in pandas 4.0.
- pandas.errors.Pandas5Warning: Warnings which will be enforced in pandas 5.0.
- pandas.errors.PandasPendingDeprecationWarning: Base class of all warnings which emit a PendingDeprecationWarning, independent of the version in which they will be enforced.
- pandas.errors.PandasDeprecationWarning: Base class of all warnings which emit a DeprecationWarning, independent of the version in which they will be enforced.
- pandas.errors.PandasFutureWarning: Base class of all warnings which emit a FutureWarning, independent of the version in which they will be enforced.
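For example, a test suite or downstream library might target these classes with the standard warnings machinery (a minimal sketch; which subclass to filter depends on how far ahead you want to be notified):

```python
import warnings
import pandas as pd

# Escalate every upcoming pandas change to an error, e.g. in CI:
warnings.filterwarnings("error", category=pd.errors.PandasChangeWarning)

# Or silence only the warnings about behaviour that changes in pandas 4.0:
warnings.filterwarnings("ignore", category=pd.errors.Pandas4Warning)
```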
Other enhancements#
- pandas.NamedAgg now supports passing
*argsand**kwargsto calls ofaggfunc(GH 58283) - pandas.merge() propagates the
attrsattribute to the result if all inputs have identicalattrs, as has so far already been the case forpandas.concat(). pandas.api.typing.FrozenListis available for typing the outputs of MultiIndex.names, MultiIndex.codes and MultiIndex.levels (GH 58237)pandas.api.typing.NoDefaultis available for typingno_default(GH 60696)pandas.api.typing.SASReaderis available for typing the output of read_sas() (GH 55689)- DataFrame.to_excel() now raises a
UserWarningwhen the character count in a cell exceeds Excel’s limitation of 32767 characters (GH 56954) - pandas.merge() now validates the
howparameter input (merge type) (GH 59435) - pandas.merge(), DataFrame.merge() and DataFrame.join() now support anti joins (
left_antiandright_anti) in thehowparameter (GH 42916) - read_spss() now supports kwargs to be passed to
pyreadstat(GH 56356) - read_stata() now returns
datetime64resolutions better matching those natively stored in the stata format (GH 55642) - Styler.set_tooltips() provides alternative method to storing tooltips by using title attribute of td elements. (GH 56981)
- DataFrame.agg() called with
axis=1and afuncwhich relabels the result index now raises aNotImplementedError(GH 58807). - Index.get_loc() now accepts also subclasses of
tupleas keys (GH 57922) - Added Styler.to_typst() to write Styler objects to file, buffer or string in Typst format (GH 57617)
- Added missing pandas.Series.info() to API reference (GH 60926)
- Added missing parameter
weightsin DataFrame.plot.kde() for the estimation of the PDF (GH 59337) - Allow dictionaries to be passed to Series.str.replace() via
patparameter (GH 51748) - Support passing a Series input to json_normalize() that retains the Index (GH 51452)
- Support reading value labels from Stata 108-format (Stata 6) and earlier files (GH 58154)
- Users can globally disable any
PerformanceWarningby setting the optionmode.performance_warningstoFalse(GH 56920) - Styler.format_index_names() can now be used to format the index and column names (GH 48936 and GH 47489)
- errors.DtypeWarning improved to include column names when mixed data types are detected (GH 58174)
- Series now supports the Arrow PyCapsule Interface for export (GH 59518)
- DataFrame.to_excel() argument merge_cells now accepts a value of "columns" to only merge MultiIndex column header cells (GH 35384)
- set_option() now accepts a dictionary of options, simplifying configuration of multiple settings at once (GH 61093); see the sketch after this list
- DataFrame.corrwith() now accepts
min_periodsas optional arguments, as in DataFrame.corr() and Series.corr() (GH 9490) - DataFrame.cummin(), DataFrame.cummax(), DataFrame.cumprod() and DataFrame.cumsum() methods now have a
numeric_onlyparameter (GH 53072) - DataFrame.ewm() now allows
adjust=Falsewhentimesis provided (GH 54328) - DataFrame.fillna() and Series.fillna() can now accept
value=None; for non-object dtype the corresponding NA value will be used (GH 57723) - DataFrame.pivot_table() and pivot_table() now allow the passing of keyword arguments to
aggfuncthrough**kwargs(GH 57884) - DataFrame.to_json() now encodes
Decimalas strings instead of floats (GH 60698) - Series.cummin() and Series.cummax() now supports CategoricalDtype (GH 52335)
- Series.plot() now correctly handle the
ylabelparameter for pie charts, allowing for explicit control over the y-axis label (GH 58239) - Added Rolling.pipe() and Expanding.pipe() (GH 57076)
- DataFrame.plot.scatter() argument
cnow accepts a column of strings, where rows with the same string are colored identically (GH 16827 and GH 16485) - Easter has gained a new constructor argument
methodwhich specifies the method used to calculate Easter — for example, Orthodox Easter (GH 61665) - ArrowDtype now supports
pyarrow.JsonType(GH 60958) DataFrameGroupByandSeriesGroupBymethodssum,mean,median,prod,min,max,std,varandsemnow acceptskipnaparameter (GH 15675)Holidayconstructor argumentdays_of_weekwill raise aValueErrorwhen type is something other thanNoneortuple(GH 61658)Holidayhas gained the constructor argument and fieldexclude_datesto exclude specific datetimes from a custom holiday calendar (GH 54382)- DataFrame.to_excel() has a new
autofilterparameter to add automatic filters to all columns (GH 61194) - read_parquet() accepts
to_pandas_kwargswhich are forwarded to pyarrow.Table.to_pandas() which enables passing additional keywords to customize the conversion to pandas, such asmaps_as_pydictsto read the Parquet map data type as python dictionaries (GH 56842) - DataFrameGroupBy.transform(), SeriesGroupBy.transform(), DataFrameGroupBy.agg(), SeriesGroupBy.agg(),
RollingGroupby.apply(),ExpandingGroupby.apply(), Rolling.apply(), Expanding.apply(), DataFrame.apply() withengine="numba"now supports positional arguments passed as kwargs (GH 58995) - DataFrameGroupBy.transform(), SeriesGroupBy.transform(), DataFrameGroupBy.agg(), SeriesGroupBy.agg(), SeriesGroupBy.apply(), DataFrameGroupBy.apply() now support
kurt(GH 40139) - Rolling.aggregate(), Expanding.aggregate() and
ExponentialMovingWindow.aggregate()now accept NamedAgg aggregations through**kwargs(GH 28333) - DataFrame.apply() supports using third-party execution engines like the Bodo.ai JIT compiler (GH 60668)
- DataFrame.iloc() and Series.iloc() now support boolean masks in
__getitem__for more consistent indexing behavior (GH 60994) - DataFrame.to_csv() and Series.to_csv() now support f-strings (e.g.,
"{:.6f}") for thefloat_formatparameter, in addition to the%format strings and callables (GH 49580) - Series.map() can now accept kwargs to pass on to func (GH 59814)
- Series.map() now accepts an
engineparameter to allow execution with a third-party execution engine (GH 61125) - Series.nlargest() uses stable sort internally and will preserve original ordering in the case of equality (GH 55767)
- Series.rank() and DataFrame.rank() with numpy-nullable dtypes preserve
NAvalues and returnUInt64dtype where appropriate instead of castingNAtoNaNwithfloat64dtype (GH 62043) - Series.str.get_dummies() now accepts a
dtypeparameter to specify the dtype of the resulting DataFrame (GH 47872) - pandas.concat() will raise a
ValueErrorwhenignore_index=Trueandkeysis notNone(GH 59274) - frozenset elements in pandas objects are now natively printed (GH 60690)
- Added Rolling.first(), Rolling.last(), Expanding.first(), and Expanding.last() (GH 33155)
- Added Rolling.nunique() and Expanding.nunique() (GH 26958)
- Added Series.str.isascii() (GH 59091)
- Added
"delete_rows"option toif_existsargument in DataFrame.to_sql() deleting all records of the table before inserting data (GH 37210). - Added half-year offset classes HalfYearBegin, HalfYearEnd, BHalfYearBegin and BHalfYearEnd (GH 60928)
- Added support for
axis=1withdictor Series arguments in DataFrame.fillna() (GH 4514) - Added support to read and write from and to Apache Iceberg tables with the new read_iceberg() and DataFrame.to_iceberg() functions (GH 61383)
- Errors occurring during SQL I/O will now throw a generic DatabaseError instead of the raw Exception type from the underlying driver manager library (GH 60748)
- Improve error reporting through outputting the first few duplicates when merge() validation fails (GH 62742)
- Improve the resulting dtypes in DataFrame.where() and DataFrame.mask() with
ExtensionDtypeother(GH 62038) - Improved deprecation message for offset aliases (GH 60820)
- Many type aliases are now exposed in the new submodule
pandas.api.typing.aliases(GH 55231) - Multiplying two DateOffset objects will now raise a
TypeErrorinstead of aRecursionError(GH 59442) - Restore support for reading Stata 104-format and enable reading 103-format dta files (GH 58554)
- Support passing a
Iterable[Hashable]input to DataFrame.drop_duplicates() (GH 59237) - Support reading Stata 102-format (Stata 1) dta files (GH 58978)
- Support reading Stata 110-format (Stata 7) dta files (GH 47176)
- Switched wheel upload to PyPI Trusted Publishing (OIDC) for release-tag pushes in
wheels.yml. (GH 61718) - Added a new DataFrame.from_arrow() method to import any Arrow-compatible tabular data object into a pandas DataFrame through theArrow PyCapsule Protocol (GH 59631)
Notable bug fixes#
These are bug fixes that might have notable behavior changes.
Improved behavior in groupby for observed=False#
A number of bugs have been fixed due to improved handling of unobserved groups. All remarks in this section equally impact SeriesGroupBy. (GH 55738)
In previous versions of pandas, a single grouping with DataFrameGroupBy.apply() or DataFrameGroupBy.agg() would pass the unobserved groups to the provided function, resulting correctly in 0 below.
In [4]: df = pd.DataFrame(
   ...:     {
   ...:         "key1": pd.Categorical(list("aabb"), categories=list("abc")),
   ...:         "key2": [1, 1, 1, 2],
   ...:         "values": [1, 2, 3, 4],
   ...:     }
   ...: )

In [5]: df
Out[5]:
  key1  key2  values
0    a     1       1
1    a     1       2
2    b     1       3
3    b     2       4
In [6]: gb = df.groupby("key1", observed=False)
In [7]: gb[["values"]].apply(lambda x: x.sum())
Out[7]:
values
key1
a 3
b 7
c 0
However this was not the case when using multiple groupings, resulting in NaN below.
In [1]: gb = df.groupby(["key1", "key2"], observed=False)

In [2]: gb[["values"]].apply(lambda x: x.sum())
Out[2]:
           values
key1 key2
a    1        3.0
     2        NaN
b    1        3.0
     2        4.0
c    1        NaN
     2        NaN
Now using multiple groupings will also pass the unobserved groups to the provided function.
In [8]: gb = df.groupby(["key1", "key2"], observed=False)
In [9]: gb[["values"]].apply(lambda x: x.sum())
Out[9]:
values
key1 key2
a 1 3
2 0
b 1 3
2 4
c 1 0
2 0
Similarly:
- In previous versions of pandas the method DataFrameGroupBy.sum() would result in 0 for unobserved groups, but DataFrameGroupBy.prod(), DataFrameGroupBy.all(), and DataFrameGroupBy.any() would all result in NA values. Now these methods result in 1, True, and False respectively.
- DataFrameGroupBy.groups() did not include unobserved groups and now does.
These improvements also fixed certain bugs in groupby:
- DataFrameGroupBy.agg() would fail when there are multiple groupings, unobserved groups, and as_index=False (GH 36698)
- DataFrameGroupBy.groups() with sort=False would sort groups; they now occur in the order they are observed (GH 56966)
- DataFrameGroupBy.nunique() would fail when there are multiple groupings, unobserved groups, and as_index=False (GH 52848)
- DataFrameGroupBy.sum() would have incorrect values when there are multiple groupings, unobserved groups, and non-numeric data (GH 43891)
- DataFrameGroupBy.value_counts() would produce incorrect results when used with some categorical and some non-categorical groupings and observed=False (GH 56016)
Backwards incompatible API changes#
Datetime resolution inference#
Converting a sequence of strings, datetime objects, or np.datetime64 objects to a datetime64 dtype now performs inference on the appropriate resolution (AKA unit) for the output dtype. This affects Series, DataFrame, Index, DatetimeIndex, and to_datetime().
Previously, these would always give nanosecond resolution:
In [1]: dt = pd.Timestamp("2024-03-22 11:36").to_pydatetime()

In [2]: pd.to_datetime([dt]).dtype
Out[2]: dtype('<M8[ns]')

In [3]: pd.Index([dt]).dtype
Out[3]: dtype('<M8[ns]')

In [4]: pd.DatetimeIndex([dt]).dtype
Out[4]: dtype('<M8[ns]')

In [5]: pd.Series([dt]).dtype
Out[5]: dtype('<M8[ns]')
This now infers the microsecond unit “us” from the pydatetime object, matching the scalar Timestamp behavior:
In [10]: dt = pd.Timestamp("2024-03-22 11:36").to_pydatetime()

In [11]: pd.to_datetime([dt]).dtype
Out[11]: dtype('<M8[us]')

In [12]: pd.Index([dt]).dtype
Out[12]: dtype('<M8[us]')

In [13]: pd.DatetimeIndex([dt]).dtype
Out[13]: dtype('<M8[us]')

In [14]: pd.Series([dt]).dtype
Out[14]: dtype('<M8[us]')
Similarly, when passed a sequence of np.datetime64 objects, the resolution of the passed objects will be retained (or for lower-than-second resolution, second resolution will be used).
When passing strings, the resolution will depend on the precision of the string, again matching the Timestamp behavior. Previously:
In [2]: pd.to_datetime(["2024-03-22 11:43:01"]).dtype
Out[2]: dtype('<M8[ns]')

In [3]: pd.to_datetime(["2024-03-22 11:43:01.002"]).dtype
Out[3]: dtype('<M8[ns]')

In [4]: pd.to_datetime(["2024-03-22 11:43:01.002003"]).dtype
Out[4]: dtype('<M8[ns]')

In [5]: pd.to_datetime(["2024-03-22 11:43:01.002003004"]).dtype
Out[5]: dtype('<M8[ns]')
The inferred resolution now matches that of the input strings for nanosecond-precision strings, otherwise defaulting to microseconds:
In [15]: pd.to_datetime(["2024-03-22 11:43:01"]).dtype
Out[15]: dtype('<M8[us]')

In [16]: pd.to_datetime(["2024-03-22 11:43:01.002"]).dtype
Out[16]: dtype('<M8[us]')

In [17]: pd.to_datetime(["2024-03-22 11:43:01.002003"]).dtype
Out[17]: dtype('<M8[us]')

In [18]: pd.to_datetime(["2024-03-22 11:43:01.002003004"]).dtype
Out[18]: dtype('<M8[ns]')
This is also a change for the Timestamp constructor with a string input: in version 2.x it could give second or millisecond unit, behavior that users generally disliked (GH 52653).
In cases with mixed-resolution inputs, the highest resolution is used:
In [2]: pd.to_datetime([pd.Timestamp("2024-03-22 11:43:01"), "2024-03-22 11:43:01.002"]).dtype
Out[2]: dtype('<M8[ns]')
Warning
Many users will now get “M8[us]” dtype data in cases when they used to get “M8[ns]”. For most use cases they should not notice a difference. One big exception is converting to integers, which will give integers 1000x smaller.
Similarly, the Timedelta constructor and to_timedelta() with a string input now default to a microsecond unit, using nanosecond unit only in cases that actually have nanosecond precision.
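Code that depends on nanosecond-backed values (for example when round-tripping through integers, as the warning above notes) can request the unit explicitly; a short sketch:

```python
import pandas as pd

idx = pd.to_datetime(["2024-03-22 11:43:01"])
print(idx.dtype)                      # datetime64[us] under the new inference

# Opt back into nanosecond resolution when integer round-trips must stay in nanoseconds:
idx_ns = idx.astype("datetime64[ns]")
print(idx_ns.astype("int64")[0])      # nanoseconds since the epoch, as in pandas 2.x
```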
concat() no longer ignores sort when all objects have a DatetimeIndex#
When all objects passed to concat() have a DatetimeIndex, passing sort=False will now result in the non-concatenation axis not being sorted. Previously, the result would always be sorted along the non-concatenation axis even when sort=False is passed. (GH 57335)
If you do not specify the sort argument, pandas will continue to return a sorted result but this behavior is deprecated and you will receive a warning. In order to make this less noisy for users, pandas checks if not sorting would impact the result and only warns when it would. This check can be expensive, and users can skip the check by explicitly specifying sort=True or sort=False.
This deprecation can also impact pandas’ internal usage of concat(). Here, cases where concat() was sorting a DatetimeIndex but not other indexes are considered bugs and have been fixed as noted below. However, it is possible some have been missed. In order to be cautious here, pandas has not added sort=False to any internal calls where we believe behavior should not change. If we have missed something, users will not experience a behavior change but they will receive a warning about concat() even though they are not directly calling this function. If this does occur, we ask users to open an issue so that we may address any potential behavior changes.
In [19]: idx1 = pd.date_range("2025-01-02", periods=3, freq="h")
In [20]: df1 = pd.DataFrame({"a": [1, 2, 3]}, index=idx1)
In [21]: df1
Out[21]:
                     a
2025-01-02 00:00:00  1
2025-01-02 01:00:00  2
2025-01-02 02:00:00  3
In [22]: idx2 = pd.date_range("2025-01-01", periods=3, freq="h")
In [23]: df2 = pd.DataFrame({"b": [1, 2, 3]}, index=idx2)
In [24]: df2
Out[24]:
                     b
2025-01-01 00:00:00  1
2025-01-01 01:00:00  2
2025-01-01 02:00:00  3
Old behavior
In [3]: pd.concat([df1, df2], axis=1, sort=False)
Out[3]:
                       a    b
2025-01-01 00:00:00  NaN  1.0
2025-01-01 01:00:00  NaN  2.0
2025-01-01 02:00:00  NaN  3.0
2025-01-02 00:00:00  1.0  NaN
2025-01-02 01:00:00  2.0  NaN
2025-01-02 02:00:00  3.0  NaN
New behavior
In [25]: pd.concat([df1, df2], axis=1, sort=False)
Out[25]:
                       a    b
2025-01-02 00:00:00  1.0  NaN
2025-01-02 01:00:00  2.0  NaN
2025-01-02 02:00:00  3.0  NaN
2025-01-01 00:00:00  NaN  1.0
2025-01-01 01:00:00  NaN  2.0
2025-01-01 02:00:00  NaN  3.0
Cases where pandas’ internal usage of concat() resulted in inconsistent sorting that are now fixed in this release are as follows.
- Series.apply() and DataFrame.apply() with a list-like or dict-like func argument.
- Series.shift(), DataFrame.shift(), SeriesGroupBy.shift(), DataFrameGroupBy.shift() with the periods argument a list of length greater than 1.
- DataFrame.join() with other a list of one or more Series or DataFrames and how="inner", how="left", or how="right".
- Series.str.cat() with others a Series or DataFrame.
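To opt out of the deprecation warning ahead of time (and skip the potentially expensive check), pass sort explicitly; a short sketch using frames like the ones above:

```python
import pandas as pd

df1 = pd.DataFrame({"a": [1, 2, 3]}, index=pd.date_range("2025-01-02", periods=3, freq="h"))
df2 = pd.DataFrame({"b": [1, 2, 3]}, index=pd.date_range("2025-01-01", periods=3, freq="h"))

pd.concat([df1, df2], axis=1, sort=True)   # keep the old, sorted result
pd.concat([df1, df2], axis=1, sort=False)  # opt in to the new, unsorted result
```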
Changed behavior in DataFrame.value_counts() and DataFrameGroupBy.value_counts() when sort=False#
In previous versions of pandas, DataFrame.value_counts() with sort=False would sort the result by row labels (as was documented). This was nonintuitive and inconsistent with Series.value_counts() which would maintain the order of the input. Now DataFrame.value_counts() will maintain the order of the input. (GH 59745)
In [26]: df = pd.DataFrame( ....: { ....: "a": [2, 2, 2, 2, 1, 1, 1, 1], ....: "b": [2, 1, 3, 1, 2, 3, 1, 1], ....: } ....: ) ....:
In [27]: df Out[27]: a b 0 2 2 1 2 1 2 2 3 3 2 1 4 1 2 5 1 3 6 1 1 7 1 1
Old behavior
In [3]: df.value_counts(sort=False) Out[3]: a b 1 1 2 2 1 3 1 2 1 2 2 1 3 1 Name: count, dtype: int64
New behavior
In [28]: df.value_counts(sort=False) Out[28]: a b 2 2 1 1 2 3 1 1 2 1 3 1 1 2 Name: count, dtype: int64
This change also applies to DataFrameGroupBy.value_counts(). Here, there are two options for sorting: one sort passed to DataFrame.groupby() and one passed directly to DataFrameGroupBy.value_counts(). The former will determine whether to sort the groups, the latter whether to sort the counts. All non-grouping columns will maintain the order of the input within groups.
Old behavior
In [5]: df.groupby("a", sort=True).value_counts(sort=False) Out[5]: a b 1 1 2 2 1 3 1 2 1 2 2 1 3 1 dtype: int64
New behavior
In [29]: df.groupby("a", sort=True).value_counts(sort=False) Out[29]: a b 1 2 1 3 1 1 2 2 2 1 3 1 1 2 Name: count, dtype: int64
Changed behavior of pd.offsets.Day to always represent calendar-day#
In previous versions of pandas, offsets.Day represented a fixed span of 24 hours, disregarding Daylight Saving Time transitions. It now consistently behaves as a calendar day, preserving the time of day across DST transitions. (GH 61985)
Old behavior
In [5]: ts = pd.Timestamp("2025-03-08 08:00", tz="US/Eastern")

In [6]: ts + pd.offsets.Day(1)
Out[6]: Timestamp('2025-03-09 09:00:00-0400', tz='US/Eastern')
New behavior
In [30]: ts = pd.Timestamp("2025-03-08 08:00", tz="US/Eastern")
In [31]: ts + pd.offsets.Day(1)
Out[31]: Timestamp('2025-03-09 08:00:00-0400', tz='US/Eastern')
This change fixes a long-standing bug in date_range() (GH 51716, GH 35388), but causes several small behavior differences as collateral:
- pd.offsets.Day(n) no longer compares as equal to pd.offsets.Hour(24*n)
- offsets.Day no longer supports division
- Timedelta no longer accepts Day objects as inputs
- tseries.frequencies.to_offset() on a Timedelta object returns an offsets.Hour object in cases where it used to return a Day object.
- Adding or subtracting a scalar from a timezone-aware DatetimeIndex with a Day freq no longer preserves that freq attribute.
- Adding or subtracting a Day with a Timedelta is no longer supported.
- Adding or subtracting a Day offset to a timezone-aware Timestamp or datetime-like may lead to an ambiguous or non-existent time, which will raise.
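Code that relied on Day being a fixed 24-hour span can switch to an explicit duration; a short sketch contrasting the two around the DST example above:

```python
import pandas as pd

ts = pd.Timestamp("2025-03-08 08:00", tz="US/Eastern")

ts + pd.offsets.Day(1)       # 2025-03-09 08:00-04:00 -- same wall-clock time (calendar day)
ts + pd.Timedelta(hours=24)  # 2025-03-09 09:00-04:00 -- a fixed 24-hour span
```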
Changed treatment of NaN values in pyarrow and numpy-nullable floating dtypes#
Previously, when dealing with a nullable dtype (e.g. Float64Dtype or int64[pyarrow]), NaN was treated as interchangeable with NA in some circumstances but not others. This was done to make adoption easier, but caused some confusion (GH 32265). In 3.0, this behaviour is made consistent: by default, NaN is treated as equivalent to NA in all cases.
By default, NaN can be passed to constructors, __setitem__, and __contains__ and will be treated the same as NA. The only change users will see is that arithmetic and np.ufunc operations that previously introduced NaN entries now produce NA entries instead.
Old behavior:
NaN in the input gets converted to NA:

In [1]: ser = pd.Series([0, np.nan], dtype=pd.Float64Dtype())

In [2]: ser
Out[2]:
0     0.0
1    <NA>
dtype: Float64

NaN produced by arithmetic (0/0) remained NaN:

In [3]: ser / 0
Out[3]:
0     NaN
1    <NA>
dtype: Float64

and the NaN value is not considered missing:

In [4]: (ser / 0).isna()
Out[4]:
0    False
1     True
dtype: bool
New behavior:
In [32]: ser = pd.Series([0, np.nan], dtype=pd.Float64Dtype())
In [33]: ser
Out[33]:
0     0.0
1    <NA>
dtype: Float64

In [34]: ser / 0
Out[34]:
0    <NA>
1    <NA>
dtype: Float64

In [35]: (ser / 0).isna()
Out[35]:
0    True
1    True
dtype: bool
In the future, the intention is to consider NaN and NA as distinct values, and an option to control this behaviour is added in 3.0 through pd.options.future.distinguish_nan_and_na. When enabled, NaN is always considered distinct from NA and is treated specifically as a floating-point value. As a consequence, it cannot be used with integer dtypes.
Old behavior:
In [2]: ser = pd.Series([1, np.nan], dtype=pd.Float64Dtype())

In [3]: ser[1]
Out[3]: <NA>
New behavior:
In [36]: with pd.option_context("future.distinguish_nan_and_na", True):
   ....:     ser = pd.Series([1, np.nan], dtype=pd.Float64Dtype())
   ....:     print(ser[1])
   ....:
nan
If we had passed pd.Int64Dtype() or "int64[pyarrow]" for the dtype in the latter example, this would raise, as a float NaN cannot be held by an integer dtype.
With "future.distinguish_nan_and_na" enabled, ser.to_numpy() (and frame.values and np.asarray(obj)) will convert to object dtype if NA entries are present, where before they would coerce to NaN. To retain a float numpy dtype, explicitly pass na_value=np.nan to Series.to_numpy().
Note that the option is experimental and subject to change in future releases.
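A short sketch of the NumPy-conversion note above, run under the experimental option:

```python
import numpy as np
import pandas as pd

ser = pd.Series([1.5, pd.NA], dtype="Float64")

with pd.option_context("future.distinguish_nan_and_na", True):
    print(ser.to_numpy().dtype)                 # object, because an NA entry is present
    print(ser.to_numpy(na_value=np.nan).dtype)  # keeps a float dtype, per the note above
```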
The __module__ attribute now points to public modules#
The __module__ attribute on functions and classes in the public API has been updated to refer to the preferred public module from which to access the object, rather than the module in which the object happens to be defined (GH 55178).
This produces more informative displays in the Python console for classes, e.g., instead of <class 'pandas.core.frame.DataFrame'> you now see <class 'pandas.DataFrame'>, and in interactive tools such as IPython, e.g., instead of <function pandas.io.parsers.readers.read_csv(...)> you now see <function pandas.read_csv(...)>.
This may break code that relies on the previous __module__ values (e.g. doctests inspecting the type() of a DataFrame object).
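A quick way to see (and write assertions against) the new values:

```python
import pandas as pd

# The objects themselves are unchanged; only the reported module is now the public one.
print(pd.DataFrame.__module__)  # "pandas" rather than "pandas.core.frame"
print(pd.read_csv.__module__)   # "pandas" rather than "pandas.io.parsers.readers"
```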
Increased minimum version for Python#
pandas 3.0.0 supports Python 3.11 and higher.
Increased minimum versions for dependencies#
Some minimum supported versions of dependencies were updated. The following required dependencies were updated:
For optional libraries the general recommendation is to use the latest version. The following table lists the lowest version per library that is currently being tested throughout the development of pandas. Optional libraries below the lowest tested version may still work, but are not considered supported.
See Dependencies and Optional dependencies for more.
pytz now an optional dependency#
pandas now uses zoneinfo from the standard library as the default timezone implementation when passing a timezone string to various methods. (GH 34916)
Old behavior:
In [1]: ts = pd.Timestamp(2024, 1, 1).tz_localize("US/Pacific")

In [2]: ts.tz
Out[2]: <DstTzInfo 'US/Pacific' LMT-1 day, 16:07:00 STD>
New behavior:
In [37]: ts = pd.Timestamp(2024, 1, 1).tz_localize("US/Pacific")
In [38]: ts.tz
Out[38]: zoneinfo.ZoneInfo(key='US/Pacific')
pytz timezone objects are still supported when passed directly, but they will no longer be returned by default from string inputs. Moreover, pytz is no longer a required dependency of pandas, but can be installed with the timezone pip extra (pip install pandas[timezone]).
Additionally, pandas no longer throws pytz exceptions for timezone operations leading to ambiguous or nonexistent times. These cases will now raise a ValueError.
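A short sketch of both changes (the second part assumes the default nonexistent="raise" behaviour of tz_localize):

```python
import zoneinfo
import pandas as pd

# Timezone strings now resolve to the standard-library zoneinfo implementation:
ts = pd.Timestamp(2024, 1, 1).tz_localize("US/Pacific")
assert isinstance(ts.tz, zoneinfo.ZoneInfo)

# Nonexistent (or ambiguous) local times now raise ValueError instead of pytz exceptions:
try:
    pd.Timestamp("2025-03-09 02:30").tz_localize("US/Eastern")
except ValueError as err:
    print("nonexistent local time:", err)
```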
Other API changes#
- 3rd party
py.pathobjects are no longer explicitly supported in IO methods. Use pathlib.Path objects instead (GH 57091) - read_table()’s
parse_datesargument defaults toNoneto improve consistency with read_csv() (GH 57476) - All classes inheriting from builtin
tuple(including types created with collections.namedtuple()) are now hashed and compared as builtintupleduring indexing operations (GH 57922) - Made
dtypea required argument inExtensionArray._from_sequence_of_strings()(GH 56519) - Passing a Series input to json_normalize() will now retain the Series Index, previously output had a new RangeIndex (GH 51452)
- Pickle and HDF (
.h5) files created with Python 2 are no longer explicitly supported (GH 57387) - Pickled objects from pandas version less than
1.0.0are no longer supported (GH 57155) - Removed
Index.sort()which always raised aTypeError. This attribute is not defined and will raise anAttributeError(GH 59283) - Unused
dtypeargument has been removed from the MultiIndex constructor (GH 60962) - Updated DataFrame.to_excel() so that the output spreadsheet has no styling. Custom styling can still be done using Styler.to_excel() (GH 54154)
- When comparing the indexes in testing.assert_series_equal(),
check_exactdefaults to True if an Index is of integer dtype. (GH 57386) - Index set operations (like union or intersection) will now ignore the dtype of an empty
RangeIndexor emptyIndexwith object dtype when determining the dtype of the resulting Index (GH 60797) - IncompatibleFrequency now subclasses
TypeErrorinstead ofValueError. As a result, joins with mismatched frequencies now cast to object like other non-comparable joins, and arithmetic with indexes with mismatched frequencies align (GH 55782) - Series “flex” methods like Series.add() no longer allow passing a DataFrame for
other; use the DataFrame reversed method instead (GH 46179) - date_range() and timedelta_range() no longer default to
unit="ns", instead will infer a unit from thestart,end, andfreqparameters. Explicitly specify a desiredunitto override these (GH 59031) - CategoricalIndex.append() no longer attempts to cast different-dtype indexes to the caller’s dtype (GH 41626)
ExtensionDtype.construct_array_type()is now a regular method instead of aclassmethod(GH 58663)- Arithmetic operations between a Series, Index, or ExtensionArray with a
listnow consistently wrap that list with an array equivalent toSeries(my_list).array. To do any other kind of type inference or casting, do so explicitly before operating (GH 62552) - Comparison operations between Index and Series now consistently return Series regardless of which object is on the left or right (GH 36759)
- NumPy functions like
np.isinfthat return a bool dtype when called on a Index object now return a bool-dtype Index instead ofnp.ndarray(GH 52676) - Methods that can operate in-place (replace(), fillna(),ffill(), bfill(), interpolate(),where(), mask(), clip()) now return the modified DataFrame or Series (
self) instead ofNonewheninplace=True(GH 63207)
Deprecations#
Copy keyword#
The copy keyword argument in the following methods is deprecated and will be removed in a future version. (GH 57347)
- DataFrame.truncate() / Series.truncate()
- DataFrame.tz_convert() / Series.tz_convert()
- DataFrame.tz_localize() / Series.tz_localize()
- DataFrame.infer_objects() / Series.infer_objects()
- DataFrame.align() / Series.align()
- DataFrame.astype() / Series.astype()
- DataFrame.reindex() / Series.reindex()
- DataFrame.reindex_like() / Series.reindex_like()
- DataFrame.set_axis() / Series.set_axis()
- DataFrame.to_period() / Series.to_period()
- DataFrame.to_timestamp() / Series.to_timestamp()
- DataFrame.rename() / Series.rename()
- DataFrame.transpose()
- DataFrame.swaplevel()
- DataFrame.merge() /
pd.merge()
Copy-on-Write utilizes a lazy copy mechanism that defers copying the data until necessary. Use .copy() to trigger an eager copy. The copy keyword has no effect starting with 3.0, so it can be safely removed from your code.
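A sketch of what that looks like in practice, with an explicit eager copy for the cases that genuinely need one:

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3]})

# Before: df.rename(columns={"a": "b"}, copy=True) -- the keyword is now a deprecated no-op.
renamed = df.rename(columns={"a": "b"})

# An explicit .copy() gives an eager, independent copy when one is really required:
renamed_copy = df.rename(columns={"a": "b"}).copy()
```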
Other Deprecations#
- Deprecated
core.internals.api.make_block(), use public APIs instead (GH 56815) - Deprecated
DataFrameGroupby.corrwith()(GH 57158) - Deprecated Timestamp.utcfromtimestamp(), use
Timestamp.fromtimestamp(ts, "UTC")instead (GH 56680) - Deprecated Timestamp.utcnow(), use
Timestamp.now("UTC")instead (GH 56680) - Deprecated
pd.core.internals.api.maybe_infer_ndim(GH 40226) - Deprecated allowing constructing or casting to Categorical with non-NA values that are not present in specified
dtype.categories(GH 40996) - Deprecated allowing non-keyword arguments in DataFrame.all(), DataFrame.min(), DataFrame.max(), DataFrame.sum(), DataFrame.prod(), DataFrame.mean(), DataFrame.median(), DataFrame.sem(), DataFrame.var(), DataFrame.std(), DataFrame.skew(), DataFrame.kurt(), Series.all(), Series.min(), Series.max(), Series.sum(), Series.prod(), Series.mean(), Series.median(), Series.sem(), Series.var(), Series.std(), Series.skew(), and Series.kurt(). (GH 57087)
- Deprecated allowing non-keyword arguments in DataFrame.groupby() and Series.groupby() except
byandlevel. (GH 62102) - Deprecated allowing non-keyword arguments in Series.to_markdown() except
buf. (GH 57280) - Deprecated allowing non-keyword arguments in Series.to_string() except
buf. (GH 57280) - Deprecated behavior of DataFrameGroupBy.groups() and SeriesGroupBy.groups(), in a future version
groupsby one element list will return tuple instead of scalar. (GH 58858) - Deprecated behavior of Series.dt.to_pytimedelta(), in a future version this will return a Series containing python
datetime.timedeltaobjects instead of anndarrayof timedelta; this matches the behavior of other Series.dt() properties. (GH 57463) - Deprecated converting object-dtype columns of
datetime.datetimeobjects to datetime64 when writing to stata (GH 56536) - Deprecated lowercase strings
d,bandcdenoting frequencies in Day, BusinessDay and CustomBusinessDay in favour ofD,BandC(GH 58998) - Deprecated lowercase strings
w,w-mon,w-tue, etc. denoting frequencies in Week in favour ofW,W-MON,W-TUE, etc. (GH 58998) - Deprecated parameter
methodin DataFrame.reindex_like() / Series.reindex_like() (GH 58667) - Deprecated strings
w,d,MIN,MS,USandNSdenoting units in Timedelta in favour ofW,D,min,ms,usandns(GH 59051) - Deprecated the
argparameter of Series.map(); pass the addedfuncargument instead. (GH 61260) - Deprecated the
verify_integritykeyword in DataFrame.set_index(); directly check the result forobj.index.is_uniqueinstead (GH 62919) - Deprecated the keyword
check_datetimelike_compatin testing.assert_frame_equal() and testing.assert_series_equal() (GH 55638) - Deprecated using
epochdate format in DataFrame.to_json() and Series.to_json(), useisoinstead. (GH 57063) - Deprecated allowing
fill_valuethat cannot be held in the original dtype (excepting NA values for integer and bool dtypes) in Series.unstack() and DataFrame.unstack() (GH 12189, GH 53868) - Deprecated allowing
fill_valuethat cannot be held in the original dtype (excepting NA values for integer and bool dtypes) in Series.shift() and DataFrame.shift() (GH 53802) - Deprecated allowing strings representing full dates in DataFrame.at_time() and Series.at_time() (GH 50839)
- Deprecated backward-compatibility behavior for DataFrame.select_dtypes() matching
strdtype whennp.object_is specified (GH 61916) - Deprecated option
future.no_silent_downcasting, as it is no longer used. In a future version accessing this option will raise (GH 59502) - Deprecated passing non-Index types to Index.join(); explicitly convert to Index first (GH 62897)
- Deprecated silent casting of non-datetime
otherto datetime in Series.combine_first() (GH 62931) - Deprecated silently casting strings to Timedelta in binary operations with Timedelta (GH 59653)
- Deprecated slicing on a Series or DataFrame with a DatetimeIndex using a
datetime.dateobject, explicitly cast to Timestamp instead (GH 35830) - Deprecated support for the Dataframe Interchange Protocol (GH 56732)
- Deprecated the
inplacekeyword fromResampler.interpolate(), as passingTrueraisesAttributeError(GH 58690)
Removal of prior version deprecations/changes#
Enforced deprecation of aliases M, Q, Y, etc. in favour of ME, QE, YE, etc. for offsets#
Renamed the following offset aliases (GH 57986):
Other Removals#
- DataFrameGroupBy.idxmin(), DataFrameGroupBy.idxmax(), SeriesGroupBy.idxmin(), and SeriesGroupBy.idxmax() will now raise a
ValueErrorwhen a group has all NA values, or when used withskipna=Falseand any NA value is encountered (GH 10694, GH 57745) - concat() no longer ignores empty objects when determining output dtypes (GH 39122)
- concat() with all-NA entries no longer ignores the dtype of those entries when determining the result dtype (GH 40893)
- read_excel(), read_json(), read_html(), and read_xml() no longer accept raw string or byte representation of the data. That type of data must be wrapped in a
StringIOorBytesIO(GH 53767) - to_datetime() with a
unitspecified no longer parses strings into floats, instead parses them the same way as withoutunit(GH 50735) - SeriesGroupBy.agg() no longer pins the name of the group to the input passed to the provided
func(GH 51703) - DataFrame.groupby() with
as_index=Falseand aggregation methods will no longer exclude from the result the groupings that do not arise from the input (GH 49519) ExtensionArray._reduce()now requires akeepdims: bool = Falseparameter in the signature (GH 52788)- Series.dt.to_pydatetime() now returns a Series of datetime.datetime objects (GH 52459)
- All arguments except
namein Index.rename() are now keyword only (GH 56493) - All arguments except the first
path-like argument in IO writers are now keyword only (GH 54229) - Changed behavior of
Series.__getitem__()andSeries.__setitem__()to always treat integer keys as labels, never as positional, consistent with DataFrame behavior (GH 50617) - Changed behavior of
Series.__getitem__(),Series.__setitem__(),DataFrame.__getitem__(),DataFrame.__setitem__()with an integer slice on objects with a floating-dtype index. This is now treated as positional indexing (GH 49612) - Disallow a callable argument to Series.iloc() to return a
tuple(GH 53769) - Disallow allowing logical operations (
||,&,^) between pandas objects and dtype-less sequences (e.g.list,tuple); wrap the objects in Series, Index, ornp.arrayfirst instead (GH 52264) - Disallow automatic casting to object in Series logical operations (
&,^,||) between series with mismatched indexes and dtypes other thanobjectorbool(GH 52538) - Disallow calling Series.replace() or DataFrame.replace() without a
valueand with non-dict-liketo_replace(GH 33302) - Disallow constructing a arrays.SparseArray with scalar data (GH 53039)
- Disallow indexing an Index with a boolean indexer of length zero, it now raises
ValueError(GH 55820) - Disallow non-standard (
np.ndarray, Index, ExtensionArray, or Series) toisin(), unique(), factorize() (GH 52986) - Disallow passing a pandas type to Index.view() (GH 55709)
- Disallow units other than “s”, “ms”, “us”, “ns” for datetime64 and timedelta64 dtypes in array() (GH 53817)
- Removed
Block,DatetimeTZBlock,ExtensionBlock,create_block_manager_from_blocksfrompandas.core.internalsandpandas.core.internals.api(GH 55139) - Removed
fastpathkeyword in Categorical constructor (GH 20110) - Removed
freqkeyword from PeriodArray constructor, use “dtype” instead (GH 52462) - Removed
kindkeyword in Series.resample() and DataFrame.resample() (GH 58125) - Removed alias
arrays.PandasArrayfor arrays.NumpyExtensionArray (GH 53694) - Removed deprecated
methodandlimitkeywords from Series.replace() and DataFrame.replace() (GH 53492) - Removed extension test classes
BaseNoReduceTests,BaseNumericReduceTests,BaseBooleanReduceTests(GH 54663) - Removed the
closedandnormalizekeywords in DatetimeIndex constructor (GH 52628) - Removed the deprecated
delim_whitespacekeyword in read_csv() and read_table(), usesep=r"\s+"instead (GH 55569) - Require
SparseDtype.fill_value()to be a valid value for theSparseDtype.subtype()(GH 53043) - Stopped automatically casting non-datetimelike values (mainly strings) in Series.isin() and Index.isin() with
datetime64,timedelta64, and PeriodDtype dtypes (GH 53111) - Stopped performing dtype inference in Index, Series and DataFrame constructors when given a pandas object (Series, Index, ExtensionArray), call
.infer_objectson the input to keep the current behavior (GH 56012) - Stopped performing dtype inference when setting a Index into a DataFrame (GH 56102)
- Stopped performing dtype inference with in Index.insert() with object-dtype index; this often affects the index/columns that result when setting new entries into an empty Series or DataFrame (GH 51363)
- Removed the
closedandunitkeywords in TimedeltaIndex constructor (GH 52628, GH 55499) - All arguments in Index.sort_values() are now keyword only (GH 56493)
- All arguments in Series.to_dict() are now keyword only (GH 56493)
- Changed the default value of
na_actionin Categorical.map() toNone(GH 51645) - Changed the default value of
observedin DataFrame.groupby() and Series.groupby() toTrue(GH 51811) - Enforce banning of upcasting in in-place setitem-like operations; see PDEP6 (GH 59007)
- Enforce deprecation in testing.assert_series_equal() and testing.assert_frame_equal() with object dtype and mismatched null-like values, which are now considered not-equal (GH 18463)
- Enforced deprecation
allandanyreductions withdatetime64, DatetimeTZDtype, and PeriodDtype dtypes (GH 58029) - Enforced deprecation allowing non-
booland NA values fornain str.contains(), str.startswith(), and str.endswith() (GH 59615) - Enforced deprecation disallowing
floatfor theperiodsargument in date_range(), period_range(), timedelta_range(), interval_range(), (GH 56036) - Enforced deprecation disallowing parsing datetimes with mixed time zones unless user passes
utc=Trueto to_datetime() (GH 57275) - Enforced deprecation in Series.value_counts() and Index.value_counts() with object dtype performing dtype inference on the
.indexof the result (GH 56161) - Enforced deprecation of DataFrameGroupBy.get_group() and SeriesGroupBy.get_group() allowing the
nameargument to be a non-tuple when grouping by a list of length 1 (GH 54155) - Enforced deprecation of Series.interpolate() and DataFrame.interpolate() for object-dtype (GH 57820)
- Enforced deprecation of
offsets.Tick.delta(), usepd.Timedelta(obj)instead (GH 55498) - Enforced deprecation of
axis=Noneacting the same asaxis=0in the DataFrame reductionssum,prod,std,var, andsem, passingaxis=Nonewill now reduce over both axes; this is particularly the case when doing e.g.numpy.sum(df)(GH 21597) - Enforced deprecation of
core.internalsmemberDatetimeTZBlock(GH 58467) - Enforced deprecation of
date_parserin read_csv(), read_table(), read_fwf(), and read_excel() in favour ofdate_format(GH 50601) - Enforced deprecation of
keep_date_colkeyword in read_csv() (GH 55569) - Enforced deprecation of
quantilekeyword in Rolling.quantile() and Expanding.quantile(), renamed toqinstead. (GH 52550) - Enforced deprecation of argument
infer_datetime_formatin read_csv(), as a strict version of it is now the default (GH 48621) - Enforced deprecation of combining parsed datetime columns in read_csv() in
parse_dates(GH 55569) - Enforced deprecation of non-standard (
np.ndarray, ExtensionArray, Index, or Series) argument toapi.extensions.take()(GH 52981) - Enforced deprecation of parsing system timezone strings to
tzlocal, which depended on system timezone, pass thetzkeyword instead (GH 50791) - Enforced deprecation of passing a dictionary to SeriesGroupBy.agg() (GH 52268)
- Enforced deprecation of string
ASdenoting frequency in YearBegin and stringsAS-DEC,AS-JAN, etc. denoting annual frequencies with various fiscal year starts (GH 57793) - Enforced deprecation of string
Adenoting frequency in YearEnd and stringsA-DEC,A-JAN, etc. denoting annual frequencies with various fiscal year ends (GH 57699) - Enforced deprecation of string
BASdenoting frequency in BYearBegin and stringsBAS-DEC,BAS-JAN, etc. denoting annual frequencies with various fiscal year starts (GH 57793) - Enforced deprecation of string
BAdenoting frequency in BYearEnd and stringsBA-DEC,BA-JAN, etc. denoting annual frequencies with various fiscal year ends (GH 57793) - Enforced deprecation of strings
H,BH, andCBHdenoting frequencies in Hour, BusinessHour, CustomBusinessHour (GH 59143) - Enforced deprecation of strings
H,BH, andCBHdenoting units in Timedelta (GH 59143) - Enforced deprecation of strings
T,L,U, andNdenoting frequencies inMinute,Milli,Micro,Nano(GH 57627) - Enforced deprecation of strings
T,L,U, andNdenoting units in Timedelta (GH 57627) - Enforced deprecation of the behavior of concat() when
len(keys) != len(objs)would truncate to the shorter of the two. Now this raises aValueError(GH 43485) - Enforced deprecation of the behavior of DataFrame.replace() and Series.replace() with CategoricalDtype that would introduce new categories. (GH 58270)
- Enforced deprecation of the behavior of Series.argsort() in the presence of NA values (GH 58232)
- Enforced deprecation of values “pad”, “ffill”, “bfill”, and “backfill” for Series.interpolate() and DataFrame.interpolate() (GH 57869)
- Enforced deprecation removing
Categorical.to_list(), useobj.tolist()instead (GH 51254) - Enforced silent-downcasting deprecation for all relevant methods (GH 54710)
- In DataFrame.stack(), the default value of
future_stackis nowTrue; specifyingFalsewill raise aFutureWarning(GH 55448) - Iterating over a
DataFrameGroupByorSeriesGroupBywill return tuples of length 1 for the groups when grouping bylevela list of length 1 (GH 50064) - Methods
apply,agg, andtransformwill no longer replace NumPy functions (e.g.np.sum) and built-in functions (e.g.min) with the equivalent pandas implementation; use string aliases (e.g."sum"and"min") if you desire to use the pandas implementation (GH 53974) - Passing both
freqandfill_valuein DataFrame.shift() and Series.shift() and DataFrameGroupBy.shift() now raises aValueError(GH 54818) - Removed DataFrameGroupBy.quantile() and SeriesGroupBy.quantile() supporting bool dtype (GH 53975)
- Removed
DateOffset.is_anchored()andoffsets.Tick.is_anchored()(GH 56594) - Removed
DataFrame.applymap,Styler.applymapandStyler.applymap_index(GH 52364) - Removed
DataFrame.boolandSeries.bool(GH 51756) - Removed
DataFrame.firstandDataFrame.last(GH 53710) - Removed
DataFrame.swapaxesandSeries.swapaxes(GH 51946) - Removed
DataFrameGroupBy.grouperandSeriesGroupBy.grouper(GH 56521) - Removed
DataFrameGroupby.fillnaandSeriesGroupBy.fillna`(GH 55719) - Removed
Index.format, use Index.astype() withstror Index.map() with aformatterfunction instead (GH 55439) - Removed
Resample.fillna(GH 55719) - Removed
Series.__int__andSeries.__float__. Callint(Series.iloc[0])orfloat(Series.iloc[0])instead. (GH 51131) - Removed
Series.ravel(GH 56053) - Removed
Series.view(GH 56054) - Removed
StataReader.close(GH 49228) - Removed
_datafrom DataFrame, Series, arrays.ArrowExtensionArray (GH 52003) - Removed
axisargument from DataFrame.groupby(), Series.groupby(), DataFrame.rolling(), Series.rolling(), DataFrame.resample(), and Series.resample() (GH 51203) - Removed
axisargument from all groupby operations (GH 50405) - Removed
convert_dtypefrom Series.apply() (GH 52257) - Removed
method,limitfill_axisandbroadcast_axiskeywords from DataFrame.align() (GH 51968) - Removed
pandas.api.types.is_intervalandpandas.api.types.is_period, useisinstance(obj, pd.Interval)andisinstance(obj, pd.Period)instead (GH 55264) - Removed
pandas.io.sql.execute(GH 50185) - Removed
pandas.value_counts, use Series.value_counts() instead (GH 53493) - Removed
read_gbqandDataFrame.to_gbq. Usepandas_gbq.read_gbqandpandas_gbq.to_gbqinstead https://pandas-gbq.readthedocs.io/en/latest/api.html (GH 55525) - Removed
use_nullable_dtypesfrom read_parquet() (GH 51853) - Removed
year,month,quarter,day,hour,minute, andsecondkeywords in the PeriodIndex constructor, use PeriodIndex.from_fields() instead (GH 55960) - Removed argument
limitfrom DataFrame.pct_change(), Series.pct_change(), DataFrameGroupBy.pct_change(), and SeriesGroupBy.pct_change(); the argumentmethodmust be set toNoneand will be removed in a future version of pandas (GH 53520) - Removed deprecated argument
objin DataFrameGroupBy.get_group() and SeriesGroupBy.get_group() (GH 53545) - Removed deprecated behavior of Series.agg() using Series.apply() (GH 53325)
- Removed deprecated keyword
methodon Series.fillna(), DataFrame.fillna() (GH 57760) - Removed option
mode.use_inf_as_na, convert inf entries toNaNbefore instead (GH 51684) - Removed support for DataFrame in DataFrame.from_records() (GH 51697)
- Removed support for
errors="ignore"in to_datetime(), to_timedelta() and to_numeric() (GH 55734) - Removed support for
slicein DataFrame.take() (GH 51539) - Removed the
ArrayManager(GH 55043) - Removed the
fastpathargument from the Series constructor (GH 55466) - Removed the
is_boolean,is_integer,is_floating,holds_integer,is_numeric,is_categorical,is_object, andis_intervalattributes of Index (GH 50042) - Removed the
ordinalkeyword in PeriodIndex, use PeriodIndex.from_ordinals() instead (GH 55960) - Removed unused arguments
*argsand**kwargsinResamplermethods (GH 50977) - Unrecognized timezones when parsing strings to datetimes now raises a
ValueError(GH 51477) - Removed the Grouper attributes
ax,groups,indexer, andobj(GH 51206, GH 51182) - Removed deprecated keyword
verboseon read_csv() and read_table() (GH 56556) - Removed the
methodkeyword in ExtensionArray.fillna(), implementExtensionArray._pad_or_backfillinstead (GH 53621) - Removed the attribute
dtypesfromDataFrameGroupBy(GH 51997) - Enforced deprecation of
argmin,argmax,idxmin, andidxmaxreturning a result whenskipna=Falseand an NA value is encountered or all values are NA values; these operations will now raise in such cases (GH 33941, GH 51276) - Enforced deprecation of storage option “pyarrow_numpy” for StringDtype (GH 60152)
- Removed specifying
include_groups=Truein DataFrameGroupBy.apply() and Resampler.apply() (GH 7155)
Performance improvements#
- Eliminated circular reference in to original pandas object in accessor attributes (e.g. Series.str). However, accessor instantiation is no longer cached (GH 47667, GH 41357)
- Categorical.categories returns a RangeIndex columns instead of an Index if the constructed
valueswas arange. (GH 57787) - DataFrame returns a RangeIndex columns when possible when
datais adict(GH 57943) - Series returns a RangeIndex index when possible when
datais adict(GH 58118) - concat() returns a RangeIndex column when possible when
objscontains Series and DataFrame andaxis=0(GH 58119) - concat() returns a RangeIndex level in the MultiIndex result when
keysis arangeor RangeIndex (GH 57542) RangeIndex.append()returns a RangeIndex instead of a Index when appending values that could continue the RangeIndex (GH 57467)- Series.nlargest() has improved performance when there are duplicate values in the index (GH 55767)
- Series.str.extract() returns a RangeIndex columns instead of an Index column when possible (GH 57542)
- Series.str.partition() with ArrowDtype returns a RangeIndex columns instead of an Index column when possible (GH 57768)
- Performance improvement in DataFrame when
datais adictandcolumnsis specified (GH 24368) - Performance improvement in MultiIndex when setting MultiIndex.names doesn’t invalidate all cached operations (GH 59578)
- Performance improvement in DataFrameGroupBy.ffill(), DataFrameGroupBy.bfill(), SeriesGroupBy.ffill(), and SeriesGroupBy.bfill() (GH 56902)
- Performance improvement in DataFrame.join() for sorted but non-unique indexes (GH 56941)
- Performance improvement in DataFrame.join() when left and/or right are non-unique and how is "left", "right", or "inner" (GH 56817)
- Performance improvement in DataFrame.join() with how="left" or how="right" and sort=True (GH 56919)
- Performance improvement in DataFrame.to_csv() when index=False (GH 59312)
- Performance improvement in Index.join() by propagating cached attributes in cases where the result matches one of the inputs (GH 57023)
- Performance improvement in Index.take() when indices is a full range indexer from zero to the length of the index (GH 56806)
- Performance improvement in Index.to_frame() returning RangeIndex columns when possible (GH 58018)
- Performance improvement in MultiIndex._engine() to use smaller dtypes if possible (GH 58411)
- Performance improvement in MultiIndex.equals() for equal-length indexes (GH 56990)
- Performance improvement in MultiIndex.memory_usage() to ignore the index engine when it isn’t already cached (GH 58385)
- Performance improvement in RangeIndex.__getitem__() with a boolean mask or integers returning a RangeIndex instead of an Index when possible (GH 57588)
- Performance improvement in RangeIndex.append() when appending the same index (GH 57252)
- Performance improvement in RangeIndex.argmin() and RangeIndex.argmax() (GH 57823)
- Performance improvement in RangeIndex.insert() returning a RangeIndex instead of an Index when the RangeIndex is empty (GH 57833)
- Performance improvement in RangeIndex.round() returning a RangeIndex instead of an Index when possible (GH 57824)
- Performance improvement in RangeIndex.searchsorted() (GH 58376)
- Performance improvement in RangeIndex.to_numpy() when specifying an na_value (GH 58376)
- Performance improvement in RangeIndex.value_counts() (GH 58376)
- Performance improvement in RangeIndex.join() returning a RangeIndex instead of an Index when possible (GH 57651, GH 57752)
- Performance improvement in RangeIndex.reindex() returning a RangeIndex instead of an Index when possible (GH 57647, GH 57752)
- Performance improvement in RangeIndex.take() returning a RangeIndex instead of an Index when possible (GH 57445, GH 57752)
- Performance improvement in merge() if hash-join can be used (GH 57970)
- Performance improvement in merge() when join keys have different dtypes and need to be upcast (GH 62902)
- Performance improvement in CategoricalDtype.update_dtype() when dtype is a CategoricalDtype with non-None categories and ordered (GH 59647)
- Performance improvement in DataFrame.__getitem__() when key is a DataFrame with many columns (GH 61010)
- Performance improvement in DataFrame.astype() when converting to extension floating dtypes, e.g. "Float64" (GH 60066)
- Performance improvement in DataFrame.stack() when using future_stack=True and the DataFrame does not have a MultiIndex (GH 58391)
- Performance improvement in DataFrame.to_hdf() avoiding unnecessary reopenings of the HDF5 file, which speeds up adding data to files with a very large number of groups (GH 58248)
- Performance improvement in DataFrame.where() when cond is a DataFrame with many columns (GH 61010)
- Performance improvement in DataFrameGroupBy.__len__ and SeriesGroupBy.__len__ (GH 57595)
- Performance improvement in indexing operations for string dtypes (GH 56997)
- Performance improvement in unary methods on a RangeIndex returning a RangeIndex instead of an Index when possible (GH 57825); a short sketch of this RangeIndex-preserving behaviour follows this list
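A minimal sketch (not taken from the release notes themselves) of the RangeIndex-preserving improvements listed above: operations are expected to keep a RangeIndex, rather than materializing an integer Index, whenever the result can still be described as a range.

```python
import pandas as pd

idx = pd.RangeIndex(10)

# RangeIndex.round() can stay a RangeIndex rather than becoming an Index
print(type(idx.round()))

# A boolean mask selecting a regular stride may also return a RangeIndex
print(type(idx[idx % 2 == 0]))
```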
Bug fixes#
Categorical#
- Bug in Categorical when constructing with an Index with ArrowDtype (GH 60563)
- Bug in Categorical where constructing from a pandas Series or Index with dtype='object' did not preserve the categories’ dtype as object; now categories.dtype is preserved as object for these cases, while numpy arrays and Python sequences with dtype='object' continue to infer the most specific dtype (for example, str if all elements are strings) (GH 61778); a short sketch follows this list
- Bug in pandas.Categorical displaying string categories without quotes when using "string" dtype (GH 63045)
- Bug in Series.apply() where nan was ignored for CategoricalDtype (GH 59938)
- Bug in bdate_range() raising ValueError with frequency freq="cbh" (GH 62849)
- Bug in testing.assert_index_equal() raising TypeError instead of AssertionError for incomparable CategoricalIndex when check_categorical=True and exact=False (GH 61935)
- Bug in Categorical.astype() where copy=False would still trigger a copy of the codes (GH 62000)
- Bug in DataFrame.pivot() and DataFrame.set_index() raising an ArrowNotImplementedError for columns with pyarrow dictionary dtype (GH 53051)
- Bug in Series.convert_dtypes() with dtype_backend="pyarrow" where an empty CategoricalDtype Series raised an error or got converted to null[pyarrow] (GH 59934)
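A minimal sketch of the behaviour described for GH 61778 (assumed, not an official example): object-dtype Series or Index input keeps object-dtype categories, while a plain Python list may still infer the most specific dtype.

```python
import pandas as pd

ser = pd.Series(["a", "b", "a"], dtype="object")
print(pd.Categorical(ser).categories.dtype)        # expected: object

print(pd.Categorical(["a", "b", "a"]).categories.dtype)  # expected: str under the new string default
```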
Datetimelike#
- Bug in is_year_start where a DatetimeIndex constructed via date_range() with frequency "MS" wouldn’t have the correct year or quarter start attributes (GH 57377)
- Bug in DataFrame raising ValueError when dtype is timedelta64 and data is a list containing None (GH 60064)
- Bug in Timestamp constructor failing to raise when tz=None is explicitly specified in conjunction with timezone-aware tzinfo or data (GH 48688); a short sketch follows this list
- Bug in Timestamp constructor failing to raise when given a np.datetime64 object with non-standard unit (GH 25611)
- Bug in date_range() where the last valid timestamp would sometimes not be produced (GH 56134)
- Bug in date_range() where using a negative frequency value would not include all points between the start and end values (GH 56147)
- Bug in infer_freq() with a Series with ArrowDtype timestamp dtype incorrectly raising TypeError (GH 58403)
- Bug in to_datetime() where passing an lxml.etree._ElementUnicodeResult together with format raised TypeError; subclasses of str are now handled (GH 60933)
- Bug in tseries.api.guess_datetime_format() failing to infer the time format when "%Y" == "%H%M" (GH 57452)
- Bug in tseries.frequencies.to_offset() failing to parse frequency strings starting with "LWOM" (GH 59218)
- Bug in DateOffset.rollback() (and subclass methods) with normalize=True rolling back one offset too far (GH 32616)
- Bug in DataFrame.agg() with missing values resulting in IndexError (GH 58810)
- Bug in DataFrame.fillna() raising an AssertionError instead of OutOfBoundsDatetime when filling a datetime64[ns] column with an out-of-bounds timestamp; it now correctly raises OutOfBoundsDatetime (GH 61208)
- Bug in DataFrame.min() and DataFrame.max() casting datetime64 and timedelta64 columns to float64 and losing precision (GH 60850)
- Bug in DatetimeIndex.asof() with a string key giving incorrect results (GH 50946)
- Bug in DatetimeIndex.is_year_start() and DatetimeIndex.is_quarter_start() not raising on custom business day frequencies bigger than "1C" (GH 58664)
- Bug in DatetimeIndex.is_year_start() and DatetimeIndex.is_quarter_start() returning False on double-digit frequencies (GH 58523)
- Bug in DatetimeIndex.union() and DatetimeIndex.intersection() when unit was non-nanosecond (GH 59036)
- Bug in DatetimeIndex.where() and TimedeltaIndex.where() failing to set freq=None in some cases (GH 24555)
- Bug in Index.union() with a pyarrow timestamp dtype incorrectly returning object dtype (GH 58421)
- Bug in Series.dt.microsecond() producing incorrect results for pyarrow-backed Series (GH 59154)
- Bug in Timestamp.normalize() and DatetimeArray.normalize() returning incorrect results instead of raising on integer overflow for very small (distant past) values (GH 60583)
- Bug in Timestamp.replace() failing to update the unit attribute when the replacement introduces a non-zero nanosecond or microsecond (GH 57749)
- Bug in to_datetime() not respecting dayfirst if an uncommon date string was passed (GH 58859)
- Bug in to_datetime() on a float array with missing values throwing FloatingPointError (GH 58419)
- Bug in to_datetime() on float32 data with year, month, day etc. columns leading to precision issues and incorrect results (GH 60506)
- Bug in to_datetime() reporting an incorrect index in case of any failure scenario (GH 58298)
- Bug in to_datetime() with format="ISO8601" and utc=True where naive timestamps incorrectly inherited the timezone offset from previous timestamps in a series (GH 61389)
- Bug in to_datetime() wrongly converting when arg is a np.datetime64 object with unit of ps (GH 60341)
- Bug in comparison between objects with np.datetime64 dtype and timestamp[pyarrow] dtypes incorrectly raising TypeError (GH 60937)
- Bug in comparison between objects with pyarrow date dtype and timestamp[pyarrow] or np.datetime64 dtype failing to consider these as non-comparable (GH 62157)
- Bug in constructing arrays with ArrowDtype with timestamp type incorrectly allowing Decimal("NaN") (GH 61773)
- Bug in constructing arrays with a timezone-aware ArrowDtype from timezone-naive datetime objects incorrectly treating those as UTC times instead of wall times like DatetimeTZDtype (GH 61775)
- Bug in retaining frequency in value_counts() specifically for DatetimeIndex() and TimedeltaIndex() (GH 33830)
- Bug in setting scalar values with mismatched resolution into arrays with non-nanosecond datetime64, timedelta64 or DatetimeTZDtype incorrectly truncating those scalars (GH 56410)
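A minimal sketch for GH 48688 (assumed): explicitly passing tz=None together with timezone-aware input is now expected to raise instead of silently keeping the tzinfo. The exact exception type is not spelled out in the note, so a broad except is used here.

```python
from datetime import datetime, timezone

import pandas as pd

aware = datetime(2024, 1, 1, tzinfo=timezone.utc)
try:
    pd.Timestamp(aware, tz=None)  # tz=None conflicts with the aware input
except Exception as err:
    print(type(err).__name__, err)
```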
Timedelta#
- Accuracy improvement in Timedelta.to_pytimedelta() to round microseconds consistently for large nanosecond based Timedelta (GH 57841)
- Bug in Timedelta constructor failing to raise when passed an invalid keyword (GH 53801)
- Bug in DataFrame.cumsum() which was raising IndexError if dtype is timedelta64[ns] (GH 57956)
- Bug in multiplication operations with timedelta64 dtype failing to raise TypeError when multiplying by bool objects or dtypes (GH 58054); see the short sketch after this list
- Bug in multiplication operations with timedelta64 dtype incorrectly raising when multiplying by numpy-nullable dtypes or pyarrow integer dtypes (GH 58054)
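A minimal sketch for GH 58054 (assumed): multiplying timedelta64 data by boolean objects or dtypes is now expected to raise TypeError, while multiplying by nullable or pyarrow integer dtypes no longer raises.

```python
import numpy as np
import pandas as pd

tdi = pd.to_timedelta(["1 day", "2 days"])
try:
    tdi * np.array([True, False])  # boolean multiplier is rejected
except TypeError as err:
    print(err)
```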
Timezones#
- Bug in DatetimeIndex.union(), DatetimeIndex.intersection(), and DatetimeIndex.symmetric_difference() changing timezone to UTC when merging two DatetimeIndex objects with the same timezone but different units (GH 60080)
- Bug in Series.dt.tz_localize() with a timezone-aware ArrowDtype incorrectly converting to UTC when tz=None (GH 61780)
- Fixed bug in date_range() where tz-aware endpoints with calendar offsets (e.g. "MS") failed on DST fall-back; these now respect ambiguous/nonexistent (GH 52908)
Numeric#
- Bug in api.types.infer_dtype() returning "mixed" for a mix of complex and pd.NA (GH 61976)
- Bug in api.types.infer_dtype() returning "mixed-integer-float" for a mix of float and pd.NA (GH 61621)
- Bug in DataFrame.combine_first() where Int64 and UInt64 integers with absolute value greater than 2**53 would lose precision after the operation (GH 60128)
- Bug in DataFrame.corr() where numerical precision errors resulted in correlations above 1.0 (GH 61120)
- Bug in DataFrame.cov(); it now raises a TypeError instead of returning potentially incorrect results or other errors (GH 53115)
- Bug in DataFrame.quantile() where the column type was not preserved when numeric_only=True with a list-like q produced an empty result (GH 59035)
- Bug in Series.dot() returning object dtype for ArrowDtype and nullable-dtype data (GH 61375)
- Bug in Series.std() and Series.var() when using complex-valued data (GH 61645)
- Bug in np.matmul with Index inputs raising a TypeError (GH 57079)
- Bug in arithmetic operations between objects with numpy-nullable dtype and ArrowDtype incorrectly raising (GH 58602)
Conversion#
- to_numeric() on big integers now converts to object dtype holding Python integers when not coercing (GH 51295)
- Bug in DataFrame.astype() not casting values for Arrow-based dictionary dtype correctly (GH 58479)
- Bug in DataFrame.update() where bool dtype was being converted to object (GH 55509)
- Bug in Series.astype() that might modify a read-only array in place when casting to a string dtype (GH 57212)
- Bug in Series.convert_dtypes() and DataFrame.convert_dtypes() raising TypeError when called on data with complex dtype (GH 60129)
- Bug in Series.convert_dtypes() and DataFrame.convert_dtypes() removing timezone information for objects with ArrowDtype (GH 60237)
- Bug in Series.reindex() not maintaining float32 type when a reindex introduces a missing value (GH 45857)
- Bug in to_datetime() and to_timedelta() with input None returning None instead of NaT, inconsistent with other conversion methods (GH 23055); a short sketch follows this list
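A minimal sketch for GH 23055 (assumed): a scalar None now converts to NaT, consistent with the other conversion functions.

```python
import pandas as pd

print(pd.to_datetime(None))    # expected: NaT
print(pd.to_timedelta(None))   # expected: NaT
```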
Strings#
- Bug in Series.str.match() failing to raise when given a compiled re.Pattern object and conflicting case or flags arguments (GH 62240)
- Bug in Series.str.zfill() raising AttributeError for ArrowDtype (GH 61485)
- Bug in Series.value_counts() not respecting sort=False for series having string dtype (GH 55224); see the short sketch after this list
- Bug in multiplication with a StringDtype incorrectly allowing multiplying by bools; explicitly cast to integers instead (GH 62595)
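A minimal sketch for GH 55224 (assumed): with the new default string dtype, sort=False is now respected, so the counts are no longer reordered by frequency.

```python
import pandas as pd

ser = pd.Series(["b", "a", "b", "c"], dtype="str")
print(ser.value_counts(sort=False))  # counts reported without sorting by frequency
```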
Interval#
- Index.is_monotonic_decreasing(), Index.is_monotonic_increasing(), and Index.is_unique() could incorrectly be False for an Index created from a slice of another Index (GH 57911)
- Bug in Index, Series, and DataFrame constructors when given a sequence of Interval subclass objects casting them to Interval (GH 46945)
- Bug in interval_range() where start and end numeric types were always cast to 64 bit (GH 57268)
- Bug in pandas.interval_range() incorrectly inferring int64 dtype when np.float32 and int are used for start and freq (GH 58964)
- Bug in IntervalIndex.get_indexer() and IntervalIndex.drop() when one of the sides of the index is non-unique (GH 52245)
- Construction of IntervalArray and IntervalIndex from arrays with mismatched signed/unsigned integer dtypes (e.g., int64 and uint64) now raises a TypeError instead of proceeding silently (GH 55715); a short sketch follows this list
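A minimal sketch for GH 55715 (assumed): mixing signed and unsigned integer sides now raises TypeError instead of constructing a possibly lossy result.

```python
import numpy as np
import pandas as pd

left = np.array([0, 1], dtype="int64")
right = np.array([1, 2], dtype="uint64")
try:
    pd.IntervalIndex.from_arrays(left, right)  # mismatched signedness is rejected
except TypeError as err:
    print(err)
```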
Indexing#
- Bug in DataFrame.__getitem__() raising an OverflowError when slicing a DataFrame with many rows (GH 59531)
- Bug in DataFrame.__setitem__() on an empty DataFrame with a tuple corrupting the frame (GH 54385)
- Bug in DataFrame.from_records() throwing a ValueError when passed an empty list in index (GH 58594)
- Bug in DataFrame.loc() and DataFrame.iloc() returning incorrect dtype when selecting from a DataFrame with mixed data types (GH 60600)
- Bug in DataFrame.loc() with inconsistent behavior of loc-set with 2 given indexes to Series (GH 59933)
- Bug in Index.equals() when comparing between Series with string dtype Index (GH 61099)
- Bug in Index.get_indexer() and similar methods when NaN is located at or after position 128 (GH 58924)
- Bug in MultiIndex.insert() when a new value inserted into a datetime-like level gets cast to NaT and fails indexing (GH 60388)
- Bug in Series.__setitem__() where assigning a boolean series with a boolean indexer would raise LossySetitemError (GH 57338)
- Bug in indexing obj.loc[start:stop] with a DatetimeIndex and Timestamp endpoints with higher resolution than the index (GH 63262)
- Bug in printing Index.names and MultiIndex.levels not escaping single quotes (GH 60190)
- Bug in reindexing of DataFrame with PeriodDtype columns in case of consolidated block (GH 60980, GH 60273)
- Bug in DataFrame.loc.__getitem__() and DataFrame.iloc.__getitem__() with a CategoricalDtype column with integer categories raising when trying to index a row containing a NaN entry (GH 58954)
- Bug in Index.__getitem__() incorrectly raising with a 0-dim np.ndarray key (GH 55601)
- Bug in Index.get_indexer() not casting missing values correctly for the new string datatype (GH 55833)
- Bug in adding new rows with DataFrame.loc.__setitem__() or Series.loc.__setitem__ which failed to retain dtype on the object’s index in some cases (GH 41626)
- Bug in indexing on a DatetimeIndex with a timestamp[pyarrow] dtype or on a TimedeltaIndex with a duration[pyarrow] dtype (GH 62277)
Missing#
- Bug in DataFrame.fillna() and Series.fillna() that would ignore the limit argument on ExtensionArray dtypes (GH 58001)
- Bug in MultiIndex.fillna() error message referring to isna instead of fillna (GH 60974)
- Bug in NA.__and__(), NA.__or__() and NA.__xor__() when operating with np.bool_ objects (GH 58427)
- Bug in divmod between NA and Int64 dtype objects (GH 62196)
- Fixed bug in Series.replace() and DataFrame.replace() when trying to replace NA values in a Float64Dtype object with np.nan; this now works with pd.set_option("mode.distinguish_nan_and_na", True) and is irrelevant otherwise (GH 55127)
- Fixed bug in Series.replace() and DataFrame.replace() when trying to replace np.nan values in an Int64Dtype object with NA; this is now a no-op with pd.set_option("mode.distinguish_nan_and_na", True) and is irrelevant otherwise (GH 51237)
MultiIndex#
- DataFrame.loc() with axis=0 and MultiIndex when setting a value adds extra columns (GH 58116)
- DataFrame.melt() would not accept multiple names in var_name when the columns were a MultiIndex (GH 58033)
- MultiIndex.insert() would not insert NA values correctly at the unified location of index -1 (GH 59003)
- MultiIndex.get_level_values() accessing a DatetimeIndex does not carry the frequency attribute along (GH 58327, GH 57949)
- Bug in DataFrame arithmetic operations in case of unaligned MultiIndex columns (GH 60498)
- Bug in DataFrame arithmetic operations with Series in case of unaligned MultiIndex (GH 61009)
- Bug in MultiIndex.union() raising when indexes have duplicates with differing names (GH 62059)
- Bug in MultiIndex.from_tuples() causing wrong output when the input tuples contain NaN values (GH 60695, GH 60988)
- Bug in DataFrame.__setitem__() where column alignment logic would reindex the assigned value with an empty index, incorrectly setting all values to NaN (GH 61841)
- Bug in DataFrame.reindex() and Series.reindex() where reindexing an Index to a MultiIndex would incorrectly set all values to NaN (GH 60923)
I/O#
- Bug in DataFrame and Series repr of collections.abc.Mapping elements (GH 57915)
- Bug in DataFrame.to_hdf() and read_hdf() with timedelta64 dtypes with non-nanosecond resolution failing to round-trip correctly (GH 63239)
- Fixed bug in the on_bad_lines callable when returning too many fields: a ParserWarning is now emitted and the extra fields are truncated regardless of index_col (GH 61837)
- Bug in pandas.json_normalize() inconsistently handling non-dict items in data when max_level was set; the function will now raise a TypeError if data is a list containing non-dict items (GH 62829)
- Bug in pandas.json_normalize() raising TypeError when meta contained a non-string key (e.g., int) and record_path was specified, which was inconsistent with the behavior when record_path was None (GH 63019)
- Bug in DataFrame.to_json() when the index argument was a value in the DataFrame.columns and Index.name was None; this now fails with a ValueError (GH 58925)
- Bug in io.common.is_fsspec_url() not recognizing chained fsspec URLs (GH 48978)
- Bug in DataFrame._repr_html_() which ignored the "display.float_format" option (GH 59876)
- Bug in DataFrame.from_records() ignoring the columns and index parameters when data is an empty iterator and nrows=0 (GH 61140)
- Bug in DataFrame.from_records() not initializing subclasses properly (GH 57008)
- Bug in DataFrame.from_records() where the columns parameter with a numpy structured array was not reordering and filtering out the columns (GH 59717)
- Bug in DataFrame.to_dict() raising an unnecessary UserWarning when columns are not unique and orient='tight' (GH 58281)
- Bug in DataFrame.to_excel() when writing an empty DataFrame with a MultiIndex on both axes (GH 57696)
- Bug in DataFrame.to_excel() where the MultiIndex index with a period level was not a date (GH 60099)
- Bug in DataFrame.to_stata() when exporting a column containing both long strings (Stata strL) and pd.NA values (GH 23633)
- Bug in DataFrame.to_stata() when the input encoded length and normal length are mismatched (GH 61583)
- Bug in DataFrame.to_stata() when writing a DataFrame with byteorder="big" (GH 58969)
- Bug in DataFrame.to_stata() when writing more than 32,000 value labels (GH 60107)
- Bug in DataFrame.to_string() that raised StopIteration with nested DataFrames (GH 16098)
- Bug in HDFStore.get() failing to save data of dtype datetime64[s] correctly (GH 59004)
- Bug in HDFStore.select() causing queries on categorical string columns to return unexpected results (GH 57608)
- Bug in MultiIndex.factorize() incorrectly raising on length-0 indexes (GH 57517)
- Bug in read_csv() causing a segmentation fault when encoding_errors is not a string (GH 59059)
- Bug in read_csv() for the c and python engines where parsing numbers with large exponents caused overflows; numbers with large positive exponents are now parsed as inf or -inf depending on the sign of the mantissa, while those with large negative exponents are parsed as 0.0 (GH 62617, GH 38794, GH 62740)
- Bug in DataFrame.to_csv() where quotechar is not escaped when escapechar is not None (GH 61407)
- Bug in read_csv() raising TypeError when index_col is specified and na_values is a dict containing the key None (GH 57547)
- Bug in read_csv() raising TypeError when nrows and iterator are specified without specifying a chunksize (GH 59079)
- Bug in read_csv() where a chained fsspec TAR file and compression="infer" fails with tarfile.ReadError (GH 60028)
- Bug in read_csv() where it did not appropriately skip a line when instructed, causing an EmptyDataError (GH 62739)
- Bug in read_csv() where the order of the na_values caused an inconsistency when na_values is a list of non-string values (GH 59303)
- Bug in read_csv() with the c and python engines reading big integers as strings; these are now read as Python integers (GH 51295); see the short sketch after this list
- Bug in read_csv() with engine="c" reading large float numbers with preceding integers as strings; these are now read as floats (GH 51295)
- Bug in read_csv() with engine="pyarrow" and dtype="Int64" losing precision (GH 56136)
- Bug in read_excel() raising ValueError when passing an array of boolean values when dtype="boolean" (GH 58159)
- Bug in read_html() where rowspan in the header row causes incorrect conversion to DataFrame (GH 60210)
- Bug in read_json() ignoring the given dtype when engine="pyarrow" (GH 59516)
- Bug in read_json() not validating that the typ argument is exactly "frame" or "series" (GH 59124)
- Bug in read_json() where extreme value integers in string format were incorrectly parsed as a different integer number (GH 20608)
- Bug in read_stata() raising KeyError when the input file is stored in big-endian format and contains strL data (GH 58638)
- Bug in read_stata() where extreme value integers were incorrectly interpreted as missing for format versions 111 and prior (GH 58130)
- Bug in read_stata() where the missing code for double was not recognised for format versions 105 and prior (GH 58149)
- Bug in set_option() where setting the pandas option display.html.use_mathjax to False has no effect (GH 59884)
- Bug in to_excel() where MultiIndex columns would be merged to a single row when merge_cells=False is passed (GH 60274)
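A minimal sketch for GH 51295 (assumed): integers that do not fit in int64 are now parsed as Python ints (object dtype) rather than as strings.

```python
import io

import pandas as pd

buf = io.StringIO("x\n99999999999999999999\n")
df = pd.read_csv(buf)
print(df["x"].dtype, type(df["x"].iloc[0]))   # expected: object, <class 'int'>
```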
Period#
- Fixed error message when passing invalid period alias to PeriodIndex.to_timestamp() (GH 58974)
Plotting#
- Bug in DataFrameGroupBy.boxplot() failing when there were multiple groupings (GH 14701)
- Bug in DataFrame.plot.bar() when subplots and stacked=True are used in conjunction, which causes incorrect stacking (GH 61018)
- Bug in DataFrame.plot.bar() with stacked=True where labels on stacked bars with zero-height segments were incorrectly positioned at the base instead of at the label position of the previous segment (GH 59429)
- Bug in DataFrame.plot.line() raising ValueError when setting both color and a dict style (GH 59461)
- Bug in DataFrame.plot() that causes a shift to the right when the frequency multiplier is greater than one (GH 57587)
- Bug in DataFrame.plot() where title would require extra titles when plotting more than one column per subplot (GH 61019)
- Bug in Series.plot() preventing a line and bar from being aligned on the same plot (GH 61161)
- Bug in Series.plot() preventing a line and scatter plot from being aligned (GH 61005)
- Bug in Series.plot() with kind="pie" with ArrowDtype (GH 59192)
- Bug in plotting with a TimedeltaIndex with non-nanosecond resolution displaying incorrect labels (GH 63237)
Groupby/resample/rolling#
- Bug in DataFrameGroupBy reductions where non-Boolean values were allowed for the numeric_only argument; passing a non-Boolean value will now raise (GH 62778)
- Bug in DataFrameGroupBy.__len__() and SeriesGroupBy.__len__() that would raise when the grouping contained NA values and dropna=False (GH 58644)
- Bug in DataFrameGroupBy.agg() and SeriesGroupBy.agg() that was returning numpy dtype values when input values are pyarrow dtype values, instead of returning pyarrow dtype values (GH 53030)
- Bug in DataFrameGroupBy.agg() that raises AttributeError when there is dictionary input and duplicated columns, instead of returning a DataFrame with the aggregation of all duplicate columns (GH 55041)
- Bug in DataFrameGroupBy.agg() where applying a user-defined function to an empty DataFrame returned a Series instead of an empty DataFrame (GH 61503)
- Bug in DataFrameGroupBy.any() that returned True for groups where all Timedelta values are NaT (GH 59712)
- Bug in DataFrameGroupBy.apply() and SeriesGroupBy.apply() for an empty data frame with group_keys=False still creating the output index using group keys (GH 60471)
- Bug in DataFrameGroupBy.apply() and SeriesGroupBy.apply() not preserving _metadata attributes from subclassed DataFrames and Series (GH 62134)
- Bug in DataFrameGroupBy.apply() that was returning a completely empty DataFrame when all return values of func were None instead of returning an empty DataFrame with the original columns and dtypes (GH 57775)
- Bug in DataFrameGroupBy.apply() with as_index=False that was returning a MultiIndex instead of an Index (GH 58291)
- Bug in DataFrameGroupBy.cumsum() and DataFrameGroupBy.cumprod() where the numeric_only parameter was passed indirectly through kwargs instead of being passed directly (GH 58811)
- Bug in DataFrameGroupBy.cumsum() where it did not return the correct dtype when the label contained None (GH 58811)
- Bug in DataFrameGroupBy.groups() and SeriesGroupBy.groups() that would not respect the groupby argument dropna (GH 55919)
- Bug in DataFrameGroupBy.groups() and SeriesGroupBy.groups() that would fail when the groups were Categorical with an NA value (GH 61356)
- Bug in DataFrameGroupBy.median() where NaT values gave an incorrect result (GH 57926)
- Bug in DataFrameGroupBy.quantile() when interpolation="nearest" being inconsistent with DataFrame.quantile() (GH 47942)
- Bug in DataFrameGroupBy.sum() and SeriesGroupBy.sum() returning NaN on overflow; these methods now return inf or -inf on overflow (GH 60303); a short sketch follows this list
- Bug in DataFrameGroupBy.transform() and SeriesGroupBy.transform() with a reducer and observed=False that coerces dtype to float when there are unobserved categories (GH 55326)
- Bug in Resampler.asfreq() where fixed-frequency indexes with origin ignored alignment and returned incorrect values; now origin and offset are respected (GH 62725)
- Bug in Resampler.interpolate() on a DataFrame with non-uniform sampling and/or indices not aligning with the resulting resampled index resulting in wrong interpolation (GH 21351)
- Bug in Rolling.apply() for method="table" where column order was not being respected due to the columns getting sorted by default (GH 59666)
- Bug in Rolling.apply() where the applied function could be called on fewer than min_period periods if method="table" (GH 58868)
- Bug in Rolling.sem() computing incorrect results because it divided by sqrt((n - 1) * (n - ddof)) instead of sqrt(n * (n - ddof)) (GH 63180)
- Bug in Series.rolling() when used with a BaseIndexer subclass and computing min/max (GH 46726)
- Bug in DataFrame.ewm() and Series.ewm() when passed times and aggregation functions other than mean (GH 51695)
- Bug in DataFrame.resample() and Series.resample() not keeping the index name when the index had ArrowDtype timestamp dtype (GH 61222)
- Bug in DataFrame.resample() changing index type to MultiIndex when the dataframe is empty and using an upsample method (GH 55572)
- Bug in Rolling.skew() and in Rolling.kurt() incorrectly computing skewness and kurtosis, respectively, for windows following outliers due to numerical instability. The calculation now properly handles catastrophic cancellation by recomputing affected windows (GH 47461, GH 61416)
- Bug in Rolling.skew() and in Rolling.kurt() where results varied with input length despite identical data and window contents (GH 54380)
- Bug in Series.resample() could raise when the date range ended shortly before a non-existent time. (GH 58380)
- Bug in Series.resample() raising error when resampling non-nanosecond resolutions out of bounds for nanosecond precision (GH 57427)
- Bug in Rolling.var() and Rolling.std() computing incorrect results due to numerical instability. (GH 47721, GH 52407, GH 54518, GH 55343)
- Bug in DataFrame.groupby() methods when operating on NumPy-nullable data failing when the NA mask was not C-contiguous (GH 61031)
- Bug in DataFrame.groupby() when grouping by a Series and that Series was modified after calling DataFrame.groupby() but prior to the groupby operation (GH 63219)
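A minimal sketch for GH 60303 (assumed): a grouped sum that overflows float64 now returns inf (or -inf) instead of NaN.

```python
import numpy as np
import pandas as pd

ser = pd.Series([np.finfo(np.float64).max, np.finfo(np.float64).max])
print(ser.groupby([0, 0]).sum())   # expected: inf for the single group
```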
Reshaping#
- Bug in concat() with mixed integer and bool dtypes incorrectly casting the bools to integers (GH 45101)
- Bug in qcut() where values at the quantile boundaries could be incorrectly assigned (GH 59355)
- Bug in DataFrame.combine_first() not preserving the column order (GH 60427)
- Bug in DataFrame.combine_first() with non-unique columns incorrectly raising (GH 29135)
- Bug in DataFrame.combine() with non-unique columns incorrectly raising (GH 51340)
- Bug in DataFrame.explode() producing incorrect result for pyarrow.large_list type (GH 61091)
- Bug in DataFrame.join() inconsistently setting the result index name (GH 55815)
- Bug in DataFrame.join() not producing the correct row order when joining with a list of Series/DataFrames (GH 62954)
- Bug in DataFrame.join() where a DataFrame with a MultiIndex would raise an AssertionError when MultiIndex.names contained None (GH 58721)
- Bug in DataFrame.merge() where merging on a column containing only NaN values resulted in an out-of-bounds array access (GH 59421)
- Bug in Series.combine_first() incorrectly replacing None entries with NaN (GH 58977)
- Bug in DataFrame.unstack() producing incorrect results when sort=False (GH 54987, GH 55516)
- Bug in DataFrame.unstack() raising an error with indexes containing NaN with sort=False (GH 61221)
- Bug in DataFrame.merge() when merging two DataFrames on intc or uintc types on Windows (GH 60091, GH 58713)
- Bug in DataFrame.pivot_table() incorrectly subaggregating results when called without an index argument (GH 58722)
- Bug in DataFrame.pivot_table() incorrectly ignoring the values argument when also supplied to the index or columns parameters (GH 57876, GH 61292)
- Bug in DataFrame.pivot_table() where margins=True did not correctly include groups with NaN values in the index or columns when dropna=False was explicitly passed (GH 61509)
- Bug in DataFrame.stack() with future_stack=True where ValueError is raised when level=[] (GH 60740)
- Bug in DataFrame.unstack() producing incorrect results when manipulating an empty DataFrame with an ExtensionDtype (GH 59123)
- Bug in concat() where concatenating a DataFrame and a Series with ignore_index=True drops the Series name (GH 60723, GH 56257)
- Bug in melt() where calling with duplicate column names in id_vars raised a misleading AttributeError (GH 61475)
- Bug in DataFrame.merge() where specifying both right_on and right_index did not raise a MergeError if left_on is also specified; a MergeError is now raised in such cases (GH 63242)
- Bug in DataFrame.merge() where user-provided suffixes could result in duplicate column names if the resulting names matched existing columns; a MergeError is now raised in such cases (GH 61402); a short sketch follows this list
- Bug in DataFrame.merge() with CategoricalDtype columns incorrectly raising RecursionError (GH 56376)
- Bug in DataFrame.merge() with a float32 index incorrectly casting the index to float64 (GH 41626)
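A minimal sketch for GH 61402 (assumed): the default suffixes ("_x", "_y") would rename the overlapping column "a" to "a_x", which already exists on the left frame, so merge() is now expected to raise MergeError instead of producing duplicate column names.

```python
import pandas as pd

left = pd.DataFrame({"key": [1], "a": [1], "a_x": [9]})
right = pd.DataFrame({"key": [1], "a": [2]})
try:
    left.merge(right, on="key")  # suffixed name "a_x" would collide
except pd.errors.MergeError as err:
    print(err)
```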
Sparse#
- Bug in SparseDtype equality comparison with an NA fill value (GH 54770)
- Bug in DataFrame.sparse.from_spmatrix() which hard-coded an invalid fill_value for certain subtypes (GH 59063)
- Bug in DataFrame.sparse.to_dense() which ignored subclassing and always returned an instance of DataFrame (GH 59913)
- Bug in SparseArray.cumsum() with integer data causing a maximum recursion depth error (GH 62669)
ExtensionArray#
- Bug in arrays.ArrowExtensionArray.__setitem__() which caused wrong behavior when using an integer array with repeated values as a key (GH 58530)
- Bug in ArrowExtensionArray.factorize() where NA values were dropped when the input was dictionary-encoded even when dropna was set to False (GH 60567)
- Bug in NDArrayBackedExtensionArray.take() which produced arrays whose dtypes didn’t match their underlying data when called with integer arrays (GH 62448)
- Bug in api.types.is_datetime64_any_dtype() where a custom ExtensionDtype would return False for array-likes (GH 57055)
- Bug in comparison between an object with ArrowDtype and an incompatibly-dtyped object (e.g. string vs bool) incorrectly raising instead of returning all-False (for ==) or all-True (for !=) (GH 59505)
- Bug in constructing pandas data structures when passing into dtype a string of the type followed by [pyarrow] while PyArrow is not installed, which would raise NameError rather than ImportError (GH 57928)
- Bug in various DataFrame reductions for pyarrow temporal dtypes returning incorrect dtype when the result was null (GH 59234)
- Fixed flex arithmetic with ExtensionArray operands raising when fill_value was passed (GH 62467)
Styler#
- Fixed bug in Styler.to_latex() when styling column headers in combination with a hidden index or hidden index levels.
Other#
- Bug in DataFrame when passing a dict with an NA scalar and columns that would always return np.nan (GH 57205)
- Bug in Series ignoring errors when trying to convert Series input data to the given dtype (GH 60728)
- Bug in eval() on ExtensionArray where including division / failed with a TypeError (GH 58748)
- Bug in eval() where method calls on binary operations like (x + y).dropna() would raise AttributeError: 'BinOp' object has no attribute 'value' (GH 61175)
- Bug in eval() where the names of the Series were not preserved when using engine="numexpr" (GH 10239)
- Bug in eval() with engine="numexpr" returning an unexpected result for float division (GH 59736)
- Bug in to_numeric() raising TypeError when arg is a Timedelta or Timestamp scalar (GH 59944)
- Bug in unique() on Index not always returning Index (GH 57043)
- Bug in DataFrame.apply() raising RecursionError when passing func=list[int] (GH 61565)
- Bug in DataFrame.apply() where passing engine="numba" ignored args passed to the applied function (GH 58712)
- Bug in DataFrame.eval() and DataFrame.query() which caused an exception when using NumPy attributes via @ notation, e.g. df.eval("@np.floor(a)") (GH 58041)
- Bug in DataFrame.eval() and DataFrame.query() which did not allow using the tan function (GH 55091)
- Bug in DataFrame.query() where using duplicate column names led to a TypeError (GH 59950)
- Bug in DataFrame.query() which raised an exception or produced incorrect results when expressions contained backtick-quoted column names containing the hash character #, backticks, or characters that fall outside the ASCII range (U+0001..U+007F) (GH 59285, GH 49633)
- Bug in DataFrame.query() which raised an exception when querying integer column names using backticks (GH 60494); a short sketch follows this list
- Bug in DataFrame.rename() and Series.rename() when passed a mapper, index, or columns argument that is a Series with a non-unique ser.index producing a corrupted result instead of raising ValueError (GH 58621)
- Bug in DataFrame.sample() with replace=False and (n * max(weights) / sum(weights)) > 1 where the method would return biased results; a ValueError is now raised (GH 61516)
- Bug in DataFrame.shift() where passing a freq on a DataFrame with no columns did not shift the index correctly (GH 60102)
- Bug in DataFrame.sort_index() when passing axis="columns" and ignore_index=True and ascending=False not returning RangeIndex columns (GH 57293)
- Bug in DataFrame.sort_values() where sorting by a column explicitly named None raised a KeyError instead of sorting by the column as expected (GH 61512)
- Bug in DataFrame.transform() that was returning the wrong order unless the index was monotonically increasing (GH 57069)
- Bug in DataFrame.where() where using a non-bool type array in the function would return a ValueError instead of a TypeError (GH 56330)
- Bug in Index.sort_values() when passing a key function that turns values into tuples, e.g. key=natsort.natsort_key, would raise TypeError (GH 56081)
- Bug in Series.describe() where the median percentile was always included when the percentiles argument was passed (GH 60550)
- Bug in Series.diff() allowing non-integer values for the periods argument (GH 56607)
- Bug in Series.dt() methods in ArrowDtype that were returning incorrect values (GH 57355)
- Bug in Series.isin() raising TypeError when the series is large (>10**6) and values contains NA (GH 60678)
- Bug in Series.kurt() and Series.skew() resulting in zero for low variance arrays (GH 57972)
- Bug in Series.list() accessor methods not preserving the original Index (GH 58425)
- Bug in Series.list() accessor methods not preserving the original name (GH 60522)
- Bug in Series.map() with a timestamp[pyarrow] dtype or duration[pyarrow] dtype incorrectly returning all-NaN entries (GH 61231)
- Bug in Series.mode() where an exception was raised when taking the mode with nullable types with no null values in the series (GH 58926)
- Bug in Series.rank() that doesn’t preserve missing values for nullable integers when na_option='keep' (GH 56976)
- Bug in Series.replace() and DataFrame.replace() throwing ValueError when regex=True and all values are NA (GH 60688)
- Bug in Series.replace() when the Series was created from an Index and Copy-on-Write is enabled (GH 61622)
- Bug in Series.to_string() when the series contains complex floats with exponents (GH 60405)
- Bug in the DataFrame Interchange Protocol implementation returning incorrect results for data buffers’ associated dtype, for string and datetime columns (GH 54781)
- Bug in divmod and rdivmod with DataFrame, Series, and Index with bool dtypes failing to raise, which was inconsistent with __floordiv__ behavior (GH 46043)
- Bug in printing a DataFrame with a DataFrame stored in DataFrame.attrs raising a ValueError (GH 60455)
- Bug in printing a Series with a DataFrame stored in Series.attrs raising a ValueError (GH 60568)
- Bug when calling copy.copy() on a DataFrame or Series which would return a deep copy instead of a shallow copy (GH 62971)
- Fixed bug in Series.rank() with object dtype and extremely small float values (GH 62036)
- Fixed bug where the DataFrame constructor misclassified array-like objects with a .name attribute as Series or Index (GH 61443)
- Accessing the underlying NumPy array of a DataFrame or Series will return a read-only array if the array shares data with the original DataFrame or Series (Read-only NumPy arrays). This logic is expanded to accessing the underlying pandas ExtensionArray through .array (or .values depending on the dtype) as well (GH 61925)
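A minimal sketch for GH 59285 and GH 60494 (assumed): backtick-quoted column names containing "#" and integer column names can now be used in DataFrame.query().

```python
import pandas as pd

df = pd.DataFrame({"col #1": [1, 2, 3], 0: [4, 5, 6]})
print(df.query("`col #1` > 1"))   # hash character inside backticks
print(df.query("`0` == 5"))       # integer column name quoted with backticks
```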