What’s new in 1.3.0 (July 2, 2021)
These are the changes in pandas 1.3.0. See Release notes for a full changelog including other versions of pandas.
Warning
When reading new Excel 2007+ (.xlsx) files, the default argument engine=None to read_excel() will now result in using the openpyxl engine in all cases when the option io.excel.xlsx.reader is set to "auto". Previously, some cases would use the xlrd engine instead. See What’s new 1.2.0 for background on this change.
Enhancements#
Read and write XML documents#
We added I/O support to read and render shallow versions of XML documents with read_xml() and DataFrame.to_xml(). Using lxml as the parser, both XPath 1.0 and XSLT 1.0 are available. (GH 27554)
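In [1]: xml = """<?xml version='1.0' encoding='utf-8'?>
   ...: <data>
   ...:  <row>
   ...:     <shape>square</shape>
   ...:     <degrees>360</degrees>
   ...:     <sides>4.0</sides>
   ...:  </row>
   ...:  <row>
   ...:     <shape>circle</shape>
   ...:     <degrees>360</degrees>
   ...:     <sides/>
   ...:  </row>
   ...:  <row>
   ...:     <shape>triangle</shape>
   ...:     <degrees>180</degrees>
   ...:     <sides>3.0</sides>
   ...:  </row>
   ...: </data>"""

In [2]: df = pd.read_xml(xml)

In [3]: df
Out[3]:
      shape  degrees  sides
0    square      360    4.0
1    circle      360    NaN
2  triangle      180    3.0

In [4]: df.to_xml()
Out[4]:
<?xml version='1.0' encoding='utf-8'?>
<data>
  <row>
    <index>0</index>
    <shape>square</shape>
    <degrees>360</degrees>
    <sides>4.0</sides>
  </row>
  <row>
    <index>1</index>
    <shape>circle</shape>
    <degrees>360</degrees>
    <sides/>
  </row>
  <row>
    <index>2</index>
    <shape>triangle</shape>
    <degrees>180</degrees>
    <sides>3.0</sides>
  </row>
</data>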
For more, see Writing XML in the user guide on IO tools.
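The round trip above can also be sketched without lxml. This is a minimal example that assumes the standard-library parser is selected with parser="etree" and that the document is wrapped in io.StringIO, which is a choice made here for portability across pandas versions rather than something the release note prescribes.

```python
from io import StringIO

import pandas as pd

# Same shallow document as above, without the XML declaration.
xml = """<data>
  <row>
    <shape>square</shape>
    <degrees>360</degrees>
    <sides>4.0</sides>
  </row>
  <row>
    <shape>circle</shape>
    <degrees>360</degrees>
    <sides/>
  </row>
  <row>
    <shape>triangle</shape>
    <degrees>180</degrees>
    <sides>3.0</sides>
  </row>
</data>"""

# parser="etree" uses xml.etree from the standard library instead of lxml.
df = pd.read_xml(StringIO(xml), parser="etree")

# Write the frame back out; index=False omits the <index> elements.
out = df.to_xml(index=False, parser="etree")
print(df)
print(out)
```

The empty sides element is read back as a missing value, matching the NaN shown in the session above.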
Styler enhancements#
We provided some focused development on Styler. See also the Styler documentation which has been revised and improved (GH 39720, GH 39317, GH 40493).
- The method Styler.set_table_styles() can now accept more natural CSS language for arguments, such as 'color:red;' instead of [('color', 'red')] (GH 39563)
- The methods Styler.highlight_null(), Styler.highlight_min(), and Styler.highlight_max() now allow custom CSS highlighting instead of the default background coloring (GH 40242)
- Styler.apply() now accepts functions that return an ndarray when axis=None, making it consistent with the axis=0 and axis=1 behavior (GH 39359)
- When incorrectly formatted CSS is given via Styler.apply() or Styler.applymap(), an error is now raised upon rendering (GH 39660)
- Styler.format() now accepts the keyword argument escape for optional HTML and LaTeX escaping (GH 40388, GH 41619)
- Styler.background_gradient() has gained the argument gmap to supply a specific gradient map for shading (GH 22727)
- Styler.clear() now clears Styler.hidden_index and Styler.hidden_columns as well (GH 40484)
- Added the method Styler.highlight_between() (GH 39821)
- Added the method Styler.highlight_quantile() (GH 40926)
- Added the method Styler.text_gradient() (GH 41098)
- Added the method Styler.set_tooltips() to allow hover tooltips; this can be used to enhance interactive displays (GH 21266, GH 40284)
- Added the parameter precision to the method Styler.format() to control the display of floating point numbers (GH 40134)
- Styler rendered HTML output now follows the w3 HTML Style Guide (GH 39626)
- Many features of the Styler class are now either partially or fully usable on a DataFrame with non-unique indexes or columns (GH 41143)
- One has greater control of the display through separate sparsification of the index or columns using the new styler options, which are also usable via option_context() (GH 41142)
- Added the option styler.render.max_elements to avoid browser overload when styling large DataFrames (GH 40712)
- Added the method Styler.to_latex() (GH 21673, GH 42320), which also allows some limited CSS conversion (GH 40731)
- Added the method Styler.to_html() (GH 13379)
- Added the method Styler.set_sticky() to make index and column headers permanently visible in scrolling HTML frames (GH 29072)
DataFrame constructor honors copy=False with dict#
When passing a dictionary to DataFrame with copy=False, a copy will no longer be made (GH 32960).
In [1]: arr = np.array([1, 2, 3])
In [2]: df = pd.DataFrame({"A": arr, "B": arr.copy()}, copy=False)
In [3]: df
Out[3]:
   A  B
0  1  1
1  2  2
2  3  3
df["A"] remains a view on arr:
In [4]: arr[0] = 0
In [5]: assert df.iloc[0, 0] == 0
The default behavior when not passing copy will remain unchanged, i.e. a copy will be made.
PyArrow backed string data type#
We’ve enhanced the StringDtype, an extension type dedicated to string data. (GH 39908)
It is now possible to specify a storage keyword option to StringDtype. Use pandas options or specify the dtype using dtype='string[pyarrow]' to allow the StringArray to be backed by a PyArrow array instead of a NumPy array of Python objects.
The PyArrow backed StringArray requires pyarrow 1.0.0 or greater to be installed.
Warning
string[pyarrow] is currently considered experimental. The implementation and parts of the API may change without warning.
In [6]: pd.Series(['abc', None, 'def'], dtype=pd.StringDtype(storage="pyarrow"))
Out[6]:
0     abc
1    <NA>
2     def
dtype: string

You can use the alias "string[pyarrow]" as well.

In [7]: s = pd.Series(['abc', None, 'def'], dtype="string[pyarrow]")

In [8]: s
Out[8]:
0     abc
1    <NA>
2     def
dtype: string

You can also create a PyArrow backed string array using pandas options.

In [9]: with pd.option_context("string_storage", "pyarrow"):
   ...:     s = pd.Series(['abc', None, 'def'], dtype="string")
   ...:

In [10]: s
Out[10]:
0     abc
1    <NA>
2     def
dtype: string

The usual string accessor methods work. Where appropriate, the return type of the Series or columns of a DataFrame will also have string dtype.

In [11]: s.str.upper()
Out[11]:
0     ABC
1    <NA>
2     DEF
dtype: string

In [12]: s.str.split('b', expand=True).dtypes
Out[12]:
0    string
1    string
dtype: object

String accessor methods returning integers will return a value with Int64Dtype.

In [13]: s.str.count("a")
Out[13]:
0       1
1    <NA>
2       0
dtype: Int64
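Because pyarrow is an optional dependency, code that wants the PyArrow backed storage when available may need a fallback. The sketch below picks the storage at runtime; the find_spec check is an assumption made for this example, not a pandas recommendation.

```python
import importlib.util

import pandas as pd

# Use PyArrow backed storage only if pyarrow can be imported; otherwise
# fall back to the NumPy object-backed implementation.
storage = "pyarrow" if importlib.util.find_spec("pyarrow") else "python"

s = pd.Series(["abc", None, "def"], dtype=pd.StringDtype(storage=storage))

upper = s.str.upper()      # stays a string dtype
counts = s.str.count("a")  # integer results use the nullable Int64 dtype

print(storage, upper.dtype, counts.dtype)
```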
Centered datetime-like rolling windows#
When performing rolling calculations on DataFrame and Series objects with a datetime-like index, a centered datetime-like window can now be used (GH 38780). For example:
In [14]: df = pd.DataFrame(
   ....:     {"A": [0, 1, 2, 3, 4]}, index=pd.date_range("2020", periods=5, freq="1D")
   ....: )
   ....:

In [15]: df
Out[15]:
            A
2020-01-01  0
2020-01-02  1
2020-01-03  2
2020-01-04  3
2020-01-05  4

In [16]: df.rolling("2D", center=True).mean()
Out[16]:
              A
2020-01-01  0.5
2020-01-02  1.5
2020-01-03  2.5
2020-01-04  3.5
2020-01-05  4.0
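To make the effect of center=True concrete, the sketch below compares the centered window with the default trailing window on the same data. Reading the centered output as "the current day and the next day" is an inference from the values shown above, not a statement of the documented window bounds.

```python
import pandas as pd

df = pd.DataFrame(
    {"A": [0, 1, 2, 3, 4]}, index=pd.date_range("2020", periods=5, freq="1D")
)

# Default: the 2-day window trails each timestamp (previous day and current day).
trailing = df.rolling("2D").mean()

# Centered: the 2-day window is placed around each timestamp instead.
centered = df.rolling("2D", center=True).mean()

print(pd.concat({"trailing": trailing["A"], "centered": centered["A"]}, axis=1))
```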
Other enhancements#
- DataFrame.rolling(), Series.rolling(), DataFrame.expanding(), and Series.expanding() now support a method argument with a 'table' option that performs the windowing operation over an entire DataFrame. See Window Overview for performance and functional benefits (GH 15095, GH 38995)
- ExponentialMovingWindow now supports an online method that can perform mean calculations in an online fashion. See Window Overview (GH 41673)
- Added MultiIndex.dtypes() (GH 37062)
- Added end and end_day options for the origin argument in DataFrame.resample() (GH 37804)
- Improved error message when usecols and names do not match for read_csv() and engine="c" (GH 29042)
- Improved consistency of error messages when passing an invalid win_type argument in Window methods (GH 15969)
- read_sql_query() now accepts a dtype argument to cast the columnar data from the SQL database based on user input (GH 10285)
- read_csv() now raises ParserWarning if the length of the header or given names does not match the length of the data when usecols is not specified (GH 21768)
- Improved integer type mapping from pandas to SQLAlchemy when using DataFrame.to_sql() (GH 35076)
- to_numeric() now supports downcasting of nullable ExtensionDtype objects (GH 33013)
- Added support for dict-like names in MultiIndex.set_names and MultiIndex.rename (GH 20421)
- read_excel() can now auto-detect .xlsb files and older .xls files (GH 35416, GH 41225)
- ExcelWriter now accepts an if_sheet_exists parameter to control the behavior of append mode when writing to existing sheets (GH 40230)
- Rolling.sum(), Expanding.sum(), Rolling.mean(), Expanding.mean(), ExponentialMovingWindow.mean(), Rolling.median(), Expanding.median(), Rolling.max(), Expanding.max(), Rolling.min(), and Expanding.min() now support Numba execution with the engine keyword (GH 38895, GH 41267)
- DataFrame.apply() can now accept NumPy unary operators as strings, e.g. df.apply("sqrt"), which was already the case for Series.apply() (GH 39116)
- DataFrame.apply() can now accept non-callable DataFrame properties as strings, e.g. df.apply("size"), which was already the case for Series.apply() (GH 39116)
- DataFrame.applymap() can now accept kwargs to pass on to the user-provided func (GH 39987)
- Passing a DataFrame indexer to iloc is now disallowed for Series.__getitem__() and DataFrame.__getitem__() (GH 39004)
- Series.apply() can now accept list-like or dictionary-like arguments that aren’t lists or dictionaries, e.g. ser.apply(np.array(["sum", "mean"])), which was already the case for DataFrame.apply() (GH 39140)
- DataFrame.plot.scatter() can now accept a categorical column for the argument c (GH 12380, GH 31357)
- Series.loc now raises a helpful error message when the Series has a MultiIndex and the indexer has too many dimensions (GH 35349)
- read_stata() now supports reading data from compressed files (GH 26599)
- Added support for parsing ISO 8601-like timestamps with negative signs to Timedelta (GH 37172)
- Added support for unary operators in FloatingArray (GH 38749)
- RangeIndex can now be constructed by passing a range object directly, e.g. pd.RangeIndex(range(3)) (GH 12067)
- Series.round() and DataFrame.round() now work with nullable integer and floating dtypes (GH 38844)
- read_csv() and read_json() expose the argument encoding_errors to control how encoding errors are handled (GH 39450)
- DataFrameGroupBy.any(), SeriesGroupBy.any(), DataFrameGroupBy.all(), and SeriesGroupBy.all() use Kleene logic with nullable data types (GH 37506)
- DataFrameGroupBy.any(), SeriesGroupBy.any(), DataFrameGroupBy.all(), and SeriesGroupBy.all() return a BooleanDtype for columns with nullable data types (GH 33449)
- Fixed DataFrameGroupBy.any(), SeriesGroupBy.any(), DataFrameGroupBy.all(), and SeriesGroupBy.all() raising with object data containing pd.NA even when skipna=True (GH 37501)
- DataFrameGroupBy.rank() and SeriesGroupBy.rank() now support object-dtype data (GH 38278)
- Constructing a DataFrame or Series with the data argument being a Python iterable that is not a NumPy ndarray consisting of NumPy scalars will now result in a dtype whose precision is the maximum of the NumPy scalars; this was already the case when data is a NumPy ndarray (GH 40908)
- Add keyword sort to pivot_table() to allow non-sorting of the result (GH 39143)
- Add keyword dropna to DataFrame.value_counts() to allow counting rows that include NA values (GH 41325)
- Series.replace() will now cast results to PeriodDtype where possible instead of object dtype (GH 41526)
- Improved error message in corr and cov methods on Rolling, Expanding, and ExponentialMovingWindow when other is not a DataFrame or Series (GH 41741)
- Series.between() can now accept left or right as arguments to inclusive to include only the left or right boundary (GH 40245)
- DataFrame.explode() now supports exploding multiple columns. Its column argument now also accepts a list of str or tuples for exploding on multiple columns at the same time (GH 39240)
- DataFrame.sample() now accepts the ignore_index argument to reset the index after sampling, similar to DataFrame.drop_duplicates() and DataFrame.sort_values() (GH 38581)
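A few of the smaller enhancements in this list are easy to try out. The following sketch uses made-up data to exercise inclusive="left" in Series.between(), multi-column DataFrame.explode(), ignore_index in DataFrame.sample(), and constructing a RangeIndex from a range.

```python
import pandas as pd

# Series.between: include only the left boundary, i.e. 1 <= x < 3.
ser = pd.Series([1, 2, 3])
left_only = ser.between(1, 3, inclusive="left")

# DataFrame.explode: explode two list-like columns in lockstep.
# The lists in "A" and "B" must have matching lengths row by row.
df = pd.DataFrame({"A": [[1, 2], [3]], "B": [["a", "b"], ["c"]], "C": [10, 20]})
exploded = df.explode(["A", "B"])

# DataFrame.sample: reset the index of the sampled rows to 0..n-1.
shuffled = df.sample(frac=1, random_state=0, ignore_index=True)

# RangeIndex: build directly from a range object.
idx = pd.RangeIndex(range(3))

print(left_only.tolist())
print(exploded)
print(shuffled.index.tolist(), idx.tolist())
```

Note that explode keeps the original row labels, so the exploded frame has a repeated index unless it is reset.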
Notable bug fixes#
These are bug fixes that might have notable behavior changes.
Categorical.unique now always maintains same dtype as original#
Previously, when calling Categorical.unique() with categorical data, unused categories in the new array would be removed, making the dtype of the new array different than the original (GH 18291)
As an example of this, given:
In [17]: dtype = pd.CategoricalDtype(['bad', 'neutral', 'good'], ordered=True)
In [18]: cat = pd.Categorical(['good', 'good', 'bad', 'bad'], dtype=dtype)
In [19]: original = pd.Series(cat)
In [20]: unique = original.unique()
Previous behavior:
In [1]: unique
['good', 'bad']
Categories (2, object): ['bad' < 'good']
In [2]: original.dtype == unique.dtype
False

New behavior:

In [21]: unique
Out[21]:
['good', 'bad']
Categories (3, str): ['bad' < 'neutral' < 'good']

In [22]: original.dtype == unique.dtype
Out[22]: True
Preserve dtypes in DataFrame.combine_first()#
DataFrame.combine_first() will now preserve dtypes (GH 7509)
In [23]: df1 = pd.DataFrame({"A": [1, 2, 3], "B": [1, 2, 3]}, index=[0, 1, 2])
In [24]: df1
Out[24]:
   A  B
0  1  1
1  2  2
2  3  3

In [25]: df2 = pd.DataFrame({"B": [4, 5, 6], "C": [1, 2, 3]}, index=[2, 3, 4])

In [26]: df2
Out[26]:
   B  C
2  4  1
3  5  2
4  6  3

In [27]: combined = df1.combine_first(df2)

Previous behavior:

In [1]: combined.dtypes
Out[1]:
A    float64
B    float64
C    float64
dtype: object

New behavior:

In [28]: combined.dtypes
Out[28]:
A    float64
B      int64
C    float64
dtype: object
Groupby methods agg and transform no longer change return dtype for callables#
Previously the methods DataFrameGroupBy.aggregate(), SeriesGroupBy.aggregate(), DataFrameGroupBy.transform(), and SeriesGroupBy.transform() might cast the result dtype when the argument func is callable, possibly leading to undesirable results (GH 21240). The cast would occur if the result is numeric and casting back to the input dtype does not change any values as measured by np.allclose. Now no such casting occurs.
In [29]: df = pd.DataFrame({'key': [1, 1], 'a': [True, False], 'b': [True, True]})
In [30]: df
Out[30]:
   key      a     b
0    1   True  True
1    1  False  True

Previous behavior:

In [5]: df.groupby('key').agg(lambda x: x.sum())
Out[5]:
        a  b
key
1    True  2

New behavior:

In [31]: df.groupby('key').agg(lambda x: x.sum())
Out[31]:
     a  b
key
1    1  2
Float result for DataFrameGroupBy.mean(), DataFrameGroupBy.median(), DataFrameGroupBy.var(), SeriesGroupBy.mean(), SeriesGroupBy.median(), and SeriesGroupBy.var()#
Previously, these methods could result in different dtypes depending on the input values. Now, these methods will always return a float dtype. (GH 41137)
In [32]: df = pd.DataFrame({'a': [True], 'b': [1], 'c': [1.0]})
Previous behavior:
In [5]: df.groupby(df.index).mean()
Out[5]:
      a  b    c
0  True  1  1.0

New behavior:

In [33]: df.groupby(df.index).mean()
Out[33]:
     a    b    c
0  1.0  1.0  1.0
Try operating inplace when setting values with loc and iloc#
When setting an entire column using loc or iloc, pandas will try to insert the values into the existing data rather than create an entirely new array.
In [34]: df = pd.DataFrame(range(3), columns=["A"], dtype="float64")
In [35]: values = df.values
In [36]: new = np.array([5, 6, 7], dtype="int64")
In [37]: df.loc[[0, 1, 2], "A"] = new
In both the new and old behavior, the data in values is overwritten, but in the old behavior the dtype of df["A"] changed to int64.
Previous behavior:
In [1]: df.dtypes
Out[1]:
A    int64
dtype: object
In [2]: np.shares_memory(df["A"].values, new)
Out[2]: False
In [3]: np.shares_memory(df["A"].values, values)
Out[3]: False

In pandas 1.3.0, df continues to share data with values.

New behavior:

In [38]: df.dtypes
Out[38]:
A    float64
dtype: object

In [39]: np.shares_memory(df["A"], new)
Out[39]: False

In [40]: np.shares_memory(df["A"], values)
Out[40]: True
Never operate inplace when setting frame[keys] = values#
When setting multiple columns using frame[keys] = values new arrays will replace pre-existing arrays for these keys, which will not be over-written (GH 39510). As a result, the columns will retain the dtype(s) of values, never casting to the dtypes of the existing arrays.
In [41]: df = pd.DataFrame(range(3), columns=["A"], dtype="float64")
In [42]: df[["A"]] = 5
In the old behavior, 5 was cast to float64 and inserted into the existing array backing df:
Previous behavior:
In [1]: df.dtypes
Out[1]:
A    float64

In the new behavior, we get a new array, and retain an integer-dtyped 5:

New behavior:

In [43]: df.dtypes
Out[43]:
A    int64
dtype: object
Consistent casting with setting into Boolean Series#
Setting non-boolean values into a Series with dtype=bool now consistently casts to dtype=object (GH 38709)
In [1]: orig = pd.Series([True, False])
In [2]: ser = orig.copy()
In [3]: ser.iloc[1] = np.nan
In [4]: ser2 = orig.copy()
In [5]: ser2.iloc[1] = 2.0
Previous behavior:
In [1]: ser
Out[1]:
0    1.0
1    NaN
dtype: float64

In [2]: ser2
Out[2]:
0    True
1     2.0
dtype: object

New behavior:

In [1]: ser
Out[1]:
0    True
1     NaN
dtype: object

In [2]: ser2
Out[2]:
0    True
1     2.0
dtype: object
DataFrameGroupBy.rolling and SeriesGroupBy.rolling no longer return grouped-by column in values#
The group-by column will now be dropped from the result of a groupby.rolling operation (GH 32262)
In [44]: df = pd.DataFrame({"A": [1, 1, 2, 3], "B": [0, 1, 2, 3]})
In [45]: df
Out[45]:
   A  B
0  1  0
1  1  1
2  2  2
3  3  3

Previous behavior:

In [1]: df.groupby("A").rolling(2).sum()
Out[1]:
       A    B
A
1 0  NaN  NaN
  1  2.0  1.0
2 2  NaN  NaN
3 3  NaN  NaN

New behavior:

In [46]: df.groupby("A").rolling(2).sum()
Out[46]:
       B
A
1 0  NaN
  1  1.0
2 2  NaN
3 3  NaN
Removed artificial truncation in rolling variance and standard deviation#
Rolling.std() and Rolling.var() will no longer artificially truncate results that are less than ~1e-8 and ~1e-15 respectively to zero (GH 37051, GH 40448, GH 39872).
However, floating point artifacts may now exist in the results when rolling over larger values.
In [47]: s = pd.Series([7, 5, 5, 5])
In [48]: s.rolling(3).var()
Out[48]:
0         NaN
1         NaN
2    1.333333
3    0.000000
dtype: float64
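Since tiny non-zero values are no longer truncated, code that tests for a constant window may be more robust with a tolerance-based comparison than with an exact == 0 check. The sketch below shows one way to do that; the tolerance of 1e-9 is an arbitrary value chosen for this example and should be adapted to the scale of the data.

```python
import numpy as np
import pandas as pd

s = pd.Series([7, 5, 5, 5])
var = s.rolling(3).var()

# Compare against zero with an absolute tolerance instead of `var == 0`.
# NaN entries (incomplete windows) compare as False.
is_constant = np.isclose(var.to_numpy(), 0.0, atol=1e-9)

print(var)
print(is_constant)
```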
DataFrameGroupBy.rolling and SeriesGroupBy.rolling with MultiIndex no longer drop levels in the result#
DataFrameGroupBy.rolling() and SeriesGroupBy.rolling() will no longer drop levels of a DataFrame with a MultiIndex in the result. This can lead to a perceived duplication of levels in the resulting MultiIndex, but this change restores the behavior that was present in version 1.1.3 (GH 38787, GH 38523).
In [49]: index = pd.MultiIndex.from_tuples([('idx1', 'idx2')], names=['label1', 'label2'])
In [50]: df = pd.DataFrame({'a': [1], 'b': [2]}, index=index)
In [51]: df
Out[51]:
               a  b
label1 label2
idx1   idx2    1  2

Previous behavior:

In [1]: df.groupby('label1').rolling(1).sum()
Out[1]:
          a    b
label1
idx1    1.0  2.0

New behavior:

In [52]: df.groupby('label1').rolling(1).sum()
Out[52]:
                        a    b
label1 label1 label2
idx1   idx1   idx2    1.0  2.0
Backwards incompatible API changes#
Increased minimum versions for dependencies#
Some minimum supported versions of dependencies were updated. If installed, we now require:
For optional libraries the general recommendation is to use the latest version. The following table lists the lowest version per library that is currently being tested throughout the development of pandas. Optional libraries below the lowest tested version may still work, but are not considered supported.
See Dependencies and Optional dependencies for more.
Other API changes#
- Partially initialized CategoricalDtype objects (i.e. those with categories=None) will no longer compare as equal to fully initialized dtype objects (GH 38516)
- Accessing _constructor_expanddim on a DataFrame and _constructor_sliced on a Series now raise an AttributeError. Previously a NotImplementedError was raised (GH 38782)
- Added new engine and **engine_kwargs parameters to DataFrame.to_sql() to support other future “SQL engines”. Currently we still only use SQLAlchemy under the hood, but more engines are planned to be supported such as turbodbc (GH 36893)
- Removed redundant freq from PeriodIndex string representation (GH 41653)
- ExtensionDtype.construct_array_type() is now a required method instead of an optional one for ExtensionDtype subclasses (GH 24860)
- Calling hash on non-hashable pandas objects will now raise TypeError with the built-in error message (e.g. unhashable type: 'Series'). Previously it would raise a custom message such as 'Series' objects are mutable, thus they cannot be hashed. Furthermore, isinstance(<Series>, collections.abc.Hashable) will now return False (GH 40013)
- Styler.from_custom_template() now has two new arguments for template names, and removed the old name, due to template inheritance having been introduced for better parsing (GH 42053). Subclassing modifications to Styler attributes are also needed.
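The hashing change in the list above can be checked directly. This is a small sketch; the exact wording of the message comes from Python itself, so the code only looks for the "unhashable type" fragment quoted in the note.

```python
from collections.abc import Hashable

import pandas as pd

ser = pd.Series([1, 2, 3])

# hash() on a Series raises the built-in TypeError.
try:
    hash(ser)
except TypeError as err:
    message = str(err)
else:
    message = None

# The abstract base class check agrees that a Series is not hashable.
is_hashable = isinstance(ser, Hashable)

print(message, is_hashable)
```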
Build#
- Documentation in .pptx and .pdf formats is no longer included in wheels or source distributions. (GH 30741)
Deprecations#
Deprecated dropping nuisance columns in DataFrame reductions and DataFrameGroupBy operations#
When calling a reduction (e.g. .min, .max, .sum) on a DataFrame with numeric_only=None (the default), columns where the reduction raises a TypeError are silently ignored and dropped from the result.
This behavior is deprecated. In a future version, the TypeError will be raised, and users will need to select only valid columns before calling the function.
For example:
In [53]: df = pd.DataFrame({"A": [1, 2, 3, 4], "B": pd.date_range("2016-01-01", periods=4)})
In [54]: df
Out[54]:
   A          B
0  1 2016-01-01
1  2 2016-01-02
2  3 2016-01-03
3  4 2016-01-04

Old behavior:

In [3]: df.prod()
Out[3]:
A    24
dtype: int64

Future behavior:

In [4]: df.prod()
...
TypeError: 'DatetimeArray' does not implement reduction 'prod'

In [5]: df[["A"]].prod()
Out[5]:
A    24
dtype: int64
Similarly, when applying a function to DataFrameGroupBy, columns on which the function raises TypeError are currently silently ignored and dropped from the result.
This behavior is deprecated. In a future version, the TypeError will be raised, and users will need to select only valid columns before calling the function.
For example:
In [55]: df = pd.DataFrame({"A": [1, 2, 3, 4], "B": pd.date_range("2016-01-01", periods=4)})
In [56]: gb = df.groupby([1, 1, 2, 2])
Old behavior:
In [4]: gb.prod(numeric_only=False)
Out[4]:
    A
1   2
2  12

Future behavior:

In [5]: gb.prod(numeric_only=False)
...
TypeError: datetime64 type does not support prod operations

In [6]: gb[["A"]].prod(numeric_only=False)
Out[6]:
    A
1   2
2  12
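When the valid columns are not known by name, one option is to select them by dtype before reducing. The sketch below uses DataFrame.select_dtypes() for this; it is one possible way to "select only valid columns", not the only one, and which dtypes count as valid depends on the reduction being applied.

```python
import pandas as pd

df = pd.DataFrame({"A": [1, 2, 3, 4], "B": pd.date_range("2016-01-01", periods=4)})

# Keep only numeric columns so that prod() is defined for every column.
numeric = df.select_dtypes(include="number")

total = numeric.prod()
grouped_total = numeric.groupby([1, 1, 2, 2]).prod()

print(total)
print(grouped_total)
```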
Other Deprecations#
- Deprecated allowing scalars to be passed to the Categorical constructor (GH 38433)
- Deprecated constructing CategoricalIndex without passing list-like data (GH 38944)
- Deprecated allowing subclass-specific keyword arguments in the Index constructor, use the specific subclass directly instead (GH 14093, GH 21311, GH 22315, GH 26974)
- Deprecated the astype() method of datetimelike (timedelta64[ns], datetime64[ns], DatetimeTZDtype, PeriodDtype) to convert to integer dtypes, use values.view(...) instead (GH 38544). This deprecation was later reverted in pandas 1.4.0.
- Deprecated MultiIndex.is_lexsorted() and MultiIndex.lexsort_depth(), use MultiIndex.is_monotonic_increasing() instead (GH 32259)
- Deprecated keyword try_cast in Series.where(), Series.mask(), DataFrame.where(), DataFrame.mask(); cast results manually if desired (GH 38836)
- Deprecated comparison of Timestamp objects with datetime.date objects. Instead of e.g. ts <= mydate use ts <= pd.Timestamp(mydate) or ts.date() <= mydate (GH 36131)
- Deprecated Rolling.win_type returning "freq" (GH 38963)
- Deprecated Rolling.is_datetimelike (GH 38963)
- Deprecated DataFrame indexer for Series.__setitem__() and DataFrame.__setitem__() (GH 39004)
- Deprecated ExponentialMovingWindow.vol() (GH 39220)
- Using .astype to convert between datetime64[ns] dtype and DatetimeTZDtype is deprecated and will raise in a future version, use obj.tz_localize or obj.dt.tz_localize instead (GH 38622)
- Deprecated casting datetime.date objects to datetime64 when used as fill_value in DataFrame.unstack(), DataFrame.shift(), Series.shift(), and DataFrame.reindex(), pass pd.Timestamp(dateobj) instead (GH 39767)
- Deprecated Styler.set_na_rep() and Styler.set_precision() in favor of Styler.format() with na_rep and precision as existing and new input arguments respectively (GH 40134, GH 40425)
- Deprecated Styler.where() in favor of using an alternative formulation with Styler.applymap() (GH 40821)
- Deprecated allowing partial failure in Series.transform() and DataFrame.transform() when func is list-like or dict-like and raises anything but TypeError; func raising anything but a TypeError will raise in a future version (GH 40211)
- Deprecated arguments error_bad_lines and warn_bad_lines in read_csv() and read_table() in favor of argument on_bad_lines (GH 15122)
- Deprecated support for np.ma.mrecords.MaskedRecords in the DataFrame constructor, pass {name: data[name] for name in data.dtype.names} instead (GH 40363)
- Deprecated using merge(), DataFrame.merge(), and DataFrame.join() on a different number of levels (GH 34862)
- Deprecated the use of **kwargs in ExcelWriter; use the keyword argument engine_kwargs instead (GH 40430)
- Deprecated the level keyword for DataFrame and Series aggregations; use groupby instead (GH 39983)
- Deprecated the inplace parameter of Categorical.remove_categories(), Categorical.add_categories(), Categorical.reorder_categories(), Categorical.rename_categories(), and Categorical.set_categories(); it will be removed in a future version (GH 37643)
- Deprecated merge() producing duplicated columns through the suffixes keyword and already existing columns (GH 22818)
- Deprecated setting Categorical._codes, create a new Categorical with the desired codes instead (GH 40606)
- Deprecated the convert_float optional argument in read_excel() and ExcelFile.parse() (GH 41127)
- Deprecated behavior of DatetimeIndex.union() with mixed timezones; in a future version both will be cast to UTC instead of object dtype (GH 39328)
- Deprecated using usecols with out of bounds indices for read_csv() with engine="c" (GH 25623)
- Deprecated special treatment of lists with first element a Categorical in the DataFrame constructor; pass as pd.DataFrame({col: categorical, ...}) instead (GH 38845)
- Deprecated behavior of DataFrame constructor when a dtype is passed and the data cannot be cast to that dtype. In a future version, this will raise instead of being silently ignored (GH 24435)
- Deprecated the Timestamp.freq attribute. For the properties that use it (is_month_start, is_month_end, is_quarter_start, is_quarter_end, is_year_start, is_year_end), when you have a freq, use e.g. freq.is_month_start(ts) (GH 15146)
- Deprecated construction of Series or DataFrame with DatetimeTZDtype data and datetime64[ns] dtype. Use Series(data).dt.tz_localize(None) instead (GH 41555, GH 33401)
- Deprecated behavior of Series construction with large-integer values and small-integer dtype silently overflowing; use Series(data).astype(dtype) instead (GH 41734)
- Deprecated behavior of DataFrame construction with floating data and integer dtype casting even when lossy; in a future version this will remain floating, matching Series behavior (GH 41770)
- Deprecated inference of timedelta64[ns], datetime64[ns], or DatetimeTZDtype dtypes in Series construction when data containing strings is passed and no dtype is passed (GH 33558)
- In a future version, constructing Series or DataFrame with datetime64[ns] data and DatetimeTZDtype will treat the data as wall-times instead of as UTC times (matching DatetimeIndex behavior). To treat the data as UTC times, use pd.Series(data).dt.tz_localize("UTC").dt.tz_convert(dtype.tz) or pd.Series(data.view("int64"), dtype=dtype) (GH 33401)
- Deprecated passing lists as key to DataFrame.xs() and Series.xs() (GH 41760)
- Deprecated boolean arguments of inclusive in Series.between() to have {"left", "right", "neither", "both"} as standard argument values (GH 40628)
- Deprecated passing arguments as positional for all of the following, with exceptions noted (GH 41485):
  - concat() (other than objs)
  - read_csv() (other than filepath_or_buffer)
  - read_table() (other than filepath_or_buffer)
  - DataFrame.clip() and Series.clip() (other than upper and lower)
  - DataFrame.drop_duplicates() (except for subset), Series.drop_duplicates(), Index.drop_duplicates() and MultiIndex.drop_duplicates()
  - DataFrame.drop() (other than labels) and Series.drop()
  - DataFrame.dropna() and Series.dropna()
  - DataFrame.ffill(), Series.ffill(), DataFrame.bfill(), and Series.bfill()
  - DataFrame.fillna() and Series.fillna() (apart from value)
  - DataFrame.interpolate() and Series.interpolate() (other than method)
  - DataFrame.mask() and Series.mask() (other than cond and other)
  - DataFrame.reset_index() (other than level) and Series.reset_index()
  - DataFrame.set_axis() and Series.set_axis() (other than labels)
  - DataFrame.set_index() (other than keys)
  - DataFrame.sort_index() and Series.sort_index()
  - DataFrame.sort_values() (other than by) and Series.sort_values()
  - DataFrame.where() and Series.where() (other than cond and other)
  - Index.set_names() and MultiIndex.set_names() (except for names)
  - MultiIndex.set_codes() (except for codes)
  - MultiIndex.set_levels() (except for levels)
  - Resampler.interpolate() (other than method)
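Several of these deprecations come with a direct replacement. The sketch below illustrates three of them with made-up data: on_bad_lines in read_csv(), comparing a Timestamp with a datetime.date via an explicit conversion, and passing arguments by keyword rather than by position.

```python
from datetime import date
from io import StringIO

import pandas as pd

# on_bad_lines replaces error_bad_lines / warn_bad_lines.
# The second data row has too many fields and is skipped.
raw = "a,b\n1,2\n3,4,5\n6,7\n"
df = pd.read_csv(StringIO(raw), on_bad_lines="skip")

# Compare a Timestamp with a date by converting one side explicitly.
ts = pd.Timestamp("2021-07-02 12:00")
mydate = date(2021, 7, 2)
on_or_after = ts >= pd.Timestamp(mydate)  # Timestamp vs Timestamp
same_day = ts.date() == mydate            # date vs date

# Pass everything except the allowed positional arguments as keywords.
deduped = pd.DataFrame({"k": [1, 1, 2]}).drop_duplicates(subset="k", keep="last")

print(df)
print(on_or_after, same_day)
print(deduped)
```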
Performance improvements#
- Performance improvement in IntervalIndex.isin() (GH 38353)
- Performance improvement in Series.mean() for nullable data types (GH 34814)
- Performance improvement in Series.isin() for nullable data types (GH 38340)
- Performance improvement in DataFrame.fillna() with method="pad" or method="backfill" for nullable floating and nullable integer dtypes (GH 39953)
- Performance improvement in DataFrame.corr() for method=kendall (GH 28329)
- Performance improvement in DataFrame.corr() for method=spearman (GH 40956, GH 41885)
- Performance improvement in Rolling.corr() and Rolling.cov() (GH 39388)
- Performance improvement in RollingGroupby.corr(), ExpandingGroupby.corr(), and ExpandingGroupby.cov() (GH 39591)
- Performance improvement in unique() for object data type (GH 37615)
- Performance improvement in json_normalize() for basic cases (including separators) (GH 40035, GH 15621)
- Performance improvement in ExpandingGroupby aggregation methods (GH 39664)
- Performance improvement in Styler where render times are reduced by more than 50% and now match DataFrame.to_html() (GH 39972, GH 39952, GH 40425)
- The method Styler.set_td_classes() is now as performant as Styler.apply() and Styler.applymap(), and even more so in some cases (GH 40453)
- Performance improvement in ExponentialMovingWindow.mean() with times (GH 39784)
- Performance improvement in DataFrameGroupBy.apply() and SeriesGroupBy.apply() when requiring the Python fallback implementation (GH 40176)
- Performance improvement in the conversion of a PyArrow Boolean array to a pandas nullable Boolean array (GH 41051)
- Performance improvement for concatenation of data with type CategoricalDtype (GH 40193)
- Performance improvement in DataFrameGroupBy.cummin(), SeriesGroupBy.cummin(), DataFrameGroupBy.cummax(), and SeriesGroupBy.cummax() with nullable data types (GH 37493)
- Performance improvement in Series.nunique() with nan values (GH 40865)
- Performance improvement in DataFrame.transpose(), Series.unstack() with DatetimeTZDtype (GH 40149)
- Performance improvement in Series.plot() and DataFrame.plot() with entry point lazy loading (GH 41492)
Bug fixes#
Categorical#
- Bug in CategoricalIndex incorrectly failing to raise TypeError when scalar data is passed (GH 38614)
- Bug in CategoricalIndex.reindex failing when the Index passed was not categorical but whose values were all labels in the category (GH 28690)
- Bug where constructing a Categorical from an object-dtype array of date objects did not round-trip correctly with astype (GH 38552)
- Bug in constructing a DataFrame from an ndarray and a CategoricalDtype (GH 38857)
- Bug in setting categorical values into an object-dtype column in a DataFrame (GH 39136)
- Bug in DataFrame.reindex() raising an IndexError when the new index contained duplicates and the old index was a CategoricalIndex (GH 38906)
- Bug in Categorical.fillna() with a tuple-like category raising NotImplementedError instead of ValueError when filling with a non-category tuple (GH 41914)
Datetimelike#
- Bug in DataFrame and Series constructors sometimes dropping nanoseconds from Timestamp (resp. Timedelta) data, with dtype=datetime64[ns] (resp. timedelta64[ns]) (GH 38032)
- Bug in DataFrame.first() and Series.first() with an offset of one month returning an incorrect result when the first day is the last day of a month (GH 29623)
- Bug in constructing a DataFrame or Series with mismatched datetime64 data and timedelta64 dtype, or vice-versa, failing to raise a TypeError (GH 38575, GH 38764, GH 38792)
- Bug in constructing a Series or DataFrame with a datetime object out of bounds for datetime64[ns] dtype or a timedelta object out of bounds for timedelta64[ns] dtype (GH 38792, GH 38965)
- Bug in DatetimeIndex.intersection(), DatetimeIndex.symmetric_difference(), PeriodIndex.intersection(), PeriodIndex.symmetric_difference() always returning object-dtype when operating with CategoricalIndex (GH 38741)
- Bug in DatetimeIndex.intersection() giving incorrect results with non-Tick frequencies with n != 1 (GH 42104)
- Bug in Series.where() incorrectly casting datetime64 values to int64 (GH 37682)
- Bug in Categorical incorrectly typecasting datetime object to Timestamp (GH 38878)
- Bug in comparisons between Timestamp object and datetime64 objects just outside the implementation bounds for nanosecond datetime64 (GH 39221)
- Bug in Timestamp.round(), Timestamp.floor(), Timestamp.ceil() for values near the implementation bounds of Timestamp (GH 39244)
- Bug in Timedelta.round(), Timedelta.floor(), Timedelta.ceil() for values near the implementation bounds of Timedelta (GH 38964)
- Bug in date_range() incorrectly creating DatetimeIndex containing NaT instead of raising OutOfBoundsDatetime in corner cases (GH 24124)
- Bug in infer_freq() incorrectly failing to infer ‘H’ frequency of DatetimeIndex if the latter has a timezone and crosses DST boundaries (GH 39556)
- Bug in Series backed by DatetimeArray or TimedeltaArray sometimes failing to set the array’s freq to None (GH 41425)
Timedelta#
- Bug in constructing Timedelta from np.timedelta64 objects with non-nanosecond units that are out of bounds for timedelta64[ns] (GH 38965)
- Bug in constructing a TimedeltaIndex incorrectly accepting np.datetime64("NaT") objects (GH 39462)
- Bug in constructing Timedelta from an input string with only symbols and no digits failed to raise an error (GH 39710)
- Bug in TimedeltaIndex and to_timedelta() failing to raise when passed non-nanosecond timedelta64 arrays that overflow when converting to timedelta64[ns] (GH 40008)
Timezones#
- Bug in different tzinfo objects representing UTC not being treated as equivalent (GH 39216)
- Bug in dateutil.tz.gettz("UTC") not being recognized as equivalent to other UTC-representing tzinfos (GH 39276)
Numeric#
- Bug in DataFrame.quantile(), DataFrame.sort_values() causing incorrect subsequent indexing behavior (GH 38351)
- Bug in DataFrame.sort_values() raising an IndexError for empty by (GH 40258)
- Bug in DataFrame.select_dtypes() with include=np.number would drop numeric ExtensionDtype columns (GH 35340)
- Bug in DataFrame.mode() and Series.mode() not keeping consistent integer Index for empty input (GH 33321)
- Bug in DataFrame.rank() when the DataFrame contained np.inf (GH 32593)
- Bug in DataFrame.rank() with axis=0 and columns holding incomparable types raising an IndexError (GH 38932)
- Bug in Series.rank(), DataFrame.rank(), DataFrameGroupBy.rank(), and SeriesGroupBy.rank() treating the most negative int64 value as missing (GH 32859)
- Bug in DataFrame.select_dtypes() different behavior between Windows and Linux with include="int" (GH 36596)
- Bug in DataFrame.apply() and DataFrame.agg() when passed the argument func="size" would operate on the entire DataFrame instead of rows or columns (GH 39934)
- Bug in DataFrame.transform() would raise a SpecificationError when passed a dictionary and columns were missing; will now raise a KeyError instead (GH 40004)
- Bug in DataFrameGroupBy.rank() and SeriesGroupBy.rank() giving incorrect results with pct=True and equal values between consecutive groups (GH 40518)
- Bug in Series.count() would result in an int32 result on 32-bit platforms when argument level=None (GH 40908)
- Bug in Series and DataFrame reductions with methods any and all not returning Boolean results for object data (GH 12863, GH 35450, GH 27709)
- Bug in Series.clip() would fail if the Series contains NA values and has nullable int or float as a data type (GH 40851)
- Bug in UInt64Index.where() and UInt64Index.putmask() with an np.int64 dtype other incorrectly raising TypeError (GH 41974)
- Bug in DataFrame.agg() not sorting the aggregated axis in the order of the provided aggregation functions when one or more aggregation function fails to produce results (GH 33634)
- Bug in DataFrame.clip() not interpreting missing values as no threshold (GH 40420)
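The DataFrame.transform() entry above (GH 40004) can be illustrated with a minimal sketch. The frame and the column name "missing" are hypothetical and chosen only to trigger the error path; the sketch assumes pandas 1.3 or later.

```python
import pandas as pd

# Hypothetical frame; "missing" is deliberately not one of its columns.
df = pd.DataFrame({"a": [-1, 2], "b": [3, -4]})

try:
    df.transform({"missing": "abs"})
except KeyError as exc:
    # Per the note above, a missing column in a dictionary argument
    # surfaces as a KeyError rather than a SpecificationError.
    error = exc
else:
    error = None

print(type(error).__name__)
```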
Conversion#
- Bug in Series.to_dict() with orient='records' now returns Python native types (GH 25969)
- Bug in Series.view() and Index.view() when converting between datetime-like (datetime64[ns], datetime64[ns, tz], timedelta64, period) dtypes (GH 39788)
- Bug in creating a DataFrame from an empty np.recarray not retaining the original dtypes (GH 40121)
- Bug in DataFrame failing to raise a TypeError when constructing from a frozenset (GH 40163)
- Bug in Index construction silently ignoring a passed dtype when the data cannot be cast to that dtype (GH 21311)
- Bug in StringArray.astype() falling back to NumPy and raising when converting to dtype='categorical' (GH 40450)
- Bug in factorize() where, when given an array with a numeric NumPy dtype lower than int64, uint64 and float64, the unique values did not keep their original dtype (GH 41132)
- Bug in DataFrame construction with a dictionary containing an array-like with ExtensionDtype and copy=True failing to make a copy (GH 38939)
- Bug in qcut() raising error when taking Float64DType as input (GH 40730)
- Bug in DataFrame and Series construction with datetime64[ns] data and dtype=object resulting in datetime objects instead of Timestamp objects (GH 41599)
- Bug in DataFrame and Series construction with timedelta64[ns] data and dtype=object resulting in np.timedelta64 objects instead of Timedelta objects (GH 41599)
- Bug in DataFrame construction when given a two-dimensional object-dtype np.ndarray of Period or Interval objects failing to cast to PeriodDtype or IntervalDtype, respectively (GH 41812)
- Bug in constructing a Series from a list and a PandasDtype (GH 39357)
- Bug in creating a Series from a range object that does not fit in the bounds of int64 dtype (GH 30173)
- Bug in creating a Series from a dict with all-tuple keys and an Index that requires reindexing (GH 41707)
- Bug in infer_dtype() not recognizing Series, Index, or array with a Period dtype (GH 23553)
- Bug in infer_dtype() raising an error for general ExtensionArray objects. It will now return "unknown-array" instead of raising (GH 37367)
- Bug in DataFrame.convert_dtypes() incorrectly raised a ValueError when called on an empty DataFrame (GH 40393)
Strings#
- Bug in the conversion from pyarrow.ChunkedArray to StringArray when the original had zero chunks (GH 41040)
- Bug in Series.replace() and DataFrame.replace() ignoring replacements with regex=True for StringDType data (GH 41333, GH 35977)
- Bug in Series.str.extract() with StringArray returning object dtype for an empty DataFrame (GH 41441)
- Bug in Series.str.replace() where the case argument was ignored when regex=False (GH 41602)
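As a small illustration of the Series.str.replace() entry (GH 41602), the sketch below combines case=False with regex=False. The sample strings are made up; the pattern contains a dot to show that it is still matched literally. It assumes pandas 1.3 or later.

```python
import pandas as pd

# Hypothetical sample data; "axb" would match "a.b" only if the dot
# were treated as a regex wildcard.
s = pd.Series(["A.B", "a.b", "axb"])

# Literal (non-regex) replacement that ignores case.
result = s.str.replace("a.b", "-", case=False, regex=False)

print(result.tolist())
```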
Interval#
- Bug in IntervalIndex.intersection() and IntervalIndex.symmetric_difference() always returning object-dtype when operating with CategoricalIndex (GH 38653, GH 38741)
- Bug in IntervalIndex.intersection() returning duplicates when at least one of the Index objects have duplicates which are present in the other (GH 38743)
- IntervalIndex.union(), IntervalIndex.intersection(), IntervalIndex.difference(), and IntervalIndex.symmetric_difference() now cast to the appropriate dtype instead of raising a TypeError when operating with another IntervalIndex with incompatible dtype (GH 39267)
- PeriodIndex.union(), PeriodIndex.intersection(), PeriodIndex.symmetric_difference(), PeriodIndex.difference() now cast to object dtype instead of raising IncompatibleFrequency when operating with another PeriodIndex with incompatible dtype (GH 39306)
- Bug in IntervalIndex.is_monotonic(), IntervalIndex.get_loc(), IntervalIndex.get_indexer_for(), and IntervalIndex.__contains__() when NA values are present (GH 41831)
Indexing#
- Bug in Index.union() and MultiIndex.union() dropping duplicate Index values when Index was not monotonic or sort was set to False (GH 36289, GH 31326, GH 40862)
- Bug in CategoricalIndex.get_indexer() failing to raise InvalidIndexError when non-unique (GH 38372)
- Bug in IntervalIndex.get_indexer() when target has CategoricalDtype and both the index and the target contain NA values (GH 41934)
- Bug in Series.loc() raising a ValueError when input was filtered with a Boolean list and values to set were a list with lower dimension (GH 20438)
- Bug in inserting many new columns into a DataFrame causing incorrect subsequent indexing behavior (GH 38380)
- Bug in DataFrame.__setitem__() raising a ValueError when setting multiple values to duplicate columns (GH 15695)
- Bug in DataFrame.loc(), Series.loc(), DataFrame.__getitem__() and Series.__getitem__() returning incorrect elements for non-monotonic DatetimeIndex for string slices (GH 33146)
- Bug in DataFrame.reindex() and Series.reindex() with timezone aware indexes raising a TypeError for method="ffill" and method="bfill" and specified tolerance (GH 38566)
- Bug in DataFrame.reindex() with datetime64[ns] or timedelta64[ns] incorrectly casting to integers when the fill_value requires casting to object dtype (GH 39755)
- Bug in DataFrame.__setitem__() raising a ValueError when setting on an empty DataFrame using specified columns and a nonempty DataFrame value (GH 38831)
- Bug in DataFrame.loc.__setitem__() raising a ValueError when operating on a unique column when the DataFrame has duplicate columns (GH 38521)
- Bug in DataFrame.iloc.__setitem__() and DataFrame.loc.__setitem__() with mixed dtypes when setting with a dictionary value (GH 38335)
- Bug in Series.loc.__setitem__() and DataFrame.loc.__setitem__() raising KeyError when provided a Boolean generator (GH 39614)
- Bug in Series.iloc() and DataFrame.iloc() raising a KeyError when provided a generator (GH 39614)
- Bug in DataFrame.__setitem__() not raising a ValueError when the right hand side is a DataFrame with wrong number of columns (GH 38604)
- Bug in Series.__setitem__() raising a ValueError when setting a Series with a scalar indexer (GH 38303)
- Bug in DataFrame.loc() dropping levels of a MultiIndex when the DataFrame used as input has only one row (GH 10521)
- Bug in DataFrame.__getitem__() and Series.__getitem__() always raising KeyError when slicing with existing strings where the Index has milliseconds (GH 33589)
- Bug in setting timedelta64 or datetime64 values into numeric Series failing to cast to object dtype (GH 39086, GH 39619)
- Bug in setting Interval values into a Series or DataFrame with mismatched IntervalDtype incorrectly casting the new values to the existing dtype (GH 39120)
- Bug in setting datetime64 values into a Series with integer-dtype incorrectly casting the datetime64 values to integers (GH 39266)
- Bug in setting np.datetime64("NaT") into a Series with Datetime64TZDtype incorrectly treating the timezone-naive value as timezone-aware (GH 39769)
- Bug in Index.get_loc() not raising KeyError when key=NaN and method is specified but NaN is not in the Index (GH 39382)
- Bug in DatetimeIndex.insert() when inserting np.datetime64("NaT") into a timezone-aware index incorrectly treating the timezone-naive value as timezone-aware (GH 39769)
- Bug in incorrectly raising in Index.insert(), when setting a new column that cannot be held in the existing frame.columns, or in Series.reset_index() or DataFrame.reset_index() instead of casting to a compatible dtype (GH 39068)
- Bug in RangeIndex.append() where a single object of length 1 was concatenated incorrectly (GH 39401)
- Bug in RangeIndex.astype() where when converting to CategoricalIndex, the categories became an Int64Index instead of a RangeIndex (GH 41263)
- Bug in setting numpy.timedelta64 values into an object-dtype Series using a Boolean indexer (GH 39488)
- Bug in setting numeric values into a boolean-dtype Series using at or iat failing to cast to object-dtype (GH 39582)
- Bug in DataFrame.__setitem__() and DataFrame.iloc.__setitem__() raising ValueError when trying to index with a row-slice and setting a list as values (GH 40440)
- Bug in DataFrame.loc() not raising KeyError when the key was not found in MultiIndex and the levels were not fully specified (GH 41170)
- Bug in DataFrame.loc.__setitem__() when setting-with-expansion incorrectly raising when the index in the expanding axis contained duplicates (GH 40096)
- Bug in DataFrame.loc.__getitem__() with MultiIndex casting to float when at least one index column has float dtype and we retrieve a scalar (GH 41369)
- Bug in DataFrame.loc() incorrectly matching non-Boolean index elements (GH 20432)
- Bug in indexing with np.nan on a Series or DataFrame with a CategoricalIndex incorrectly raising KeyError when np.nan keys are present (GH 41933)
- Bug in Series.__delitem__() with ExtensionDtype incorrectly casting to ndarray (GH 40386)
- Bug in DataFrame.at() with a CategoricalIndex returning incorrect results when passed integer keys (GH 41846)
- Bug in DataFrame.loc() returning a MultiIndex in the wrong order if an indexer has duplicates (GH 40978)
- Bug in DataFrame.__setitem__() raising a TypeError when using a str subclass as the column name with a DatetimeIndex (GH 37366)
- Bug in PeriodIndex.get_loc() failing to raise a KeyError when given a Period with a mismatched freq (GH 41670)
- Bug in .loc.__getitem__ with a UInt64Index and negative-integer keys raising OverflowError instead of KeyError in some cases, wrapping around to positive integers in others (GH 41777)
- Bug in Index.get_indexer() failing to raise ValueError in some cases with invalid method, limit, or tolerance arguments (GH 41918)
- Bug when slicing a Series or DataFrame with a TimedeltaIndex when passing an invalid string raising ValueError instead of a TypeError (GH 41821)
- Bug in Index constructor sometimes silently ignoring a specified dtype (GH 38879)
- Index.where() behavior now mirrors Index.putmask() behavior, i.e. index.where(mask, other) matches index.putmask(~mask, other) (GH 39412)
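The Index.where() / Index.putmask() equivalence stated in the last entry (GH 39412) can be checked directly. The index values and mask below are illustrative only.

```python
import numpy as np
import pandas as pd

index = pd.Index([10, 20, 30, 40])
mask = np.array([True, False, True, False])

# where() keeps values where mask is True and fills the rest with `other`.
kept = index.where(mask, 0)

# putmask() overwrites values where its mask is True, so the mask is inverted.
replaced = index.putmask(~mask, 0)

print(kept.tolist(), replaced.tolist())
```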
Missing#
- Bug in Grouper did not correctly propagate the dropna argument; DataFrameGroupBy.transform() now correctly handles missing values for dropna=True (GH 35612)
- Bug in isna(), Series.isna(), Index.isna(), DataFrame.isna(), and the corresponding notna functions not recognizing Decimal("NaN") objects (GH 39409)
- Bug in DataFrame.fillna() not accepting a dictionary for the downcast keyword (GH 40809)
- Bug in isna() not returning a copy of the mask for nullable types, causing any subsequent mask modification to change the original array (GH 40935)
- Bug in DataFrame construction with float data containing NaN and an integer dtype casting instead of retaining the NaN (GH 26919)
- Bug in Series.isin() and MultiIndex.isin() didn’t treat all nans as equivalent if they were in tuples (GH 41836)
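A short sketch of the Decimal("NaN") entry above (GH 39409), assuming pandas 1.3 or later. The Series contents are made up for illustration.

```python
from decimal import Decimal

import pandas as pd

# Object-dtype Series mixing a regular Decimal, a Decimal NaN, and None.
values = pd.Series([Decimal("1.5"), Decimal("NaN"), None], dtype=object)

mask = values.isna()
scalar_check = pd.isna(Decimal("NaN"))

print(mask.tolist(), scalar_check)
```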
MultiIndex#
- Bug in DataFrame.drop() raising a TypeError when the MultiIndex is non-unique and level is not provided (GH 36293)
- Bug in MultiIndex.intersection() duplicating NaN in the result (GH 38623)
- Bug in MultiIndex.equals() incorrectly returning True when the MultiIndex contained NaN even when they are differently ordered (GH 38439)
- Bug in MultiIndex.intersection() always returning an empty result when intersecting with CategoricalIndex (GH 38653)
- Bug in MultiIndex.difference() incorrectly raising TypeError when indexes contain non-sortable entries (GH 41915)
- Bug in MultiIndex.reindex() raising a ValueError when used on an empty MultiIndex and indexing only a specific level (GH 41170)
- Bug in MultiIndex.reindex() raising TypeError when reindexing against a flat Index (GH 41707)
I/O#
- Bug in Index.__repr__() when display.max_seq_items=1 (GH 38415)
- Bug in read_csv() not recognizing scientific notation if the argument decimal is set and engine="python" (GH 31920)
- Bug in read_csv() interpreting NA value as comment, when NA does contain the comment string fixed for engine="python" (GH 34002)
- Bug in read_csv() raising an IndexError with multiple header columns and index_col is specified when the file has no data rows (GH 38292)
- Bug in read_csv() not accepting usecols with a different length than names for engine="python" (GH 16469)
- Bug in read_csv() returning object dtype when delimiter="," with usecols and parse_dates specified for engine="python" (GH 35873)
- Bug in read_csv() raising a TypeError when names and parse_dates is specified for engine="c" (GH 33699)
- Bug in read_clipboard() and DataFrame.to_clipboard() not working in WSL (GH 38527)
- Allow custom error values for the parse_dates argument of read_sql(), read_sql_query() and read_sql_table() (GH 35185)
- Bug in DataFrame.to_hdf() and Series.to_hdf() raising a KeyError when trying to apply for subclasses of DataFrame or Series (GH 33748)
- Bug in HDFStore.put() raising a wrong TypeError when saving a DataFrame with non-string dtype (GH 34274)
- Bug in json_normalize() resulting in the first element of a generator object not being included in the returned DataFrame (GH 35923)
- Bug in read_csv() applying the thousands separator to date columns when the column should be parsed for dates and usecols is specified for engine="python" (GH 39365)
- Bug in read_excel() forward filling MultiIndex names when multiple header and index columns are specified (GH 34673)
- Bug in read_excel() not respecting set_option() (GH 34252)
- Bug in read_csv() not switching true_values and false_values for nullable Boolean dtype (GH 34655)
- Bug in read_json() when orient="split" not maintaining a numeric string index (GH 28556)
- read_sql() returned an empty generator if chunksize was non-zero and the query returned no results. Now returns a generator with a single empty DataFrame (GH 34411)
- Bug in read_hdf() returning unexpected records when filtering on categorical string columns using the where parameter (GH 39189)
- Bug in read_sas() raising a ValueError when datetimes were null (GH 39725)
- Bug in read_excel() dropping empty values from single-column spreadsheets (GH 39808)
- Bug in read_excel() loading trailing empty rows/columns for some filetypes (GH 41167)
- Bug in read_excel() raising an AttributeError when the excel file had a MultiIndex header followed by two empty rows and no index (GH 40442)
- Bug in read_excel(), read_csv(), read_table(), read_fwf(), and read_clipboard() where one blank row after a MultiIndex header with no index would be dropped (GH 40442)
- Bug in DataFrame.to_string() misplacing the truncation column when index=False (GH 40904)
- Bug in DataFrame.to_string() adding an extra dot and misaligning the truncation row when index=False (GH 40904)
- Bug in read_orc() always raising an AttributeError (GH 40918)
- Bug in read_csv() and read_table() silently ignoring prefix if names and prefix are defined, now raising a ValueError (GH 39123)
- Bug in read_csv() and read_excel() not respecting the dtype for a duplicated column name when mangle_dupe_cols is set to True (GH 35211)
- Bug in read_csv() silently ignoring sep if delimiter and sep are defined, now raising a ValueError (GH 39823)
- Bug in read_csv() and read_table() misinterpreting arguments when sys.setprofile had been previously called (GH 41069)
- Bug in the conversion from PyArrow to pandas (e.g. for reading Parquet) with nullable dtypes and a PyArrow array whose data buffer size is not a multiple of the dtype size (GH 40896)
- Bug in read_excel() would raise an error when pandas could not determine the file type even though the user specified the engine argument (GH 41225)
- Bug in read_clipboard() copying from an excel file shifts values into the wrong column if there are null values in first column (GH 41108)
- Bug in DataFrame.to_hdf() and Series.to_hdf() raising a TypeError when trying to append a string column to an incompatible column (GH 41897)
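The read_sql() entry above (GH 34411) can be sketched with the standard-library sqlite3 module. The in-memory database, the table name t, and its columns are assumptions made for the example; the sketch assumes pandas 1.3 or later.

```python
import sqlite3

import pandas as pd

# Hypothetical empty table in an in-memory SQLite database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (a INTEGER, b TEXT)")
conn.commit()

# With chunksize set, read_sql returns a generator of DataFrames.
chunks = list(pd.read_sql("SELECT a, b FROM t", conn, chunksize=10))
conn.close()

# Per the note above, an empty result yields one empty DataFrame
# that still carries the column names.
print(len(chunks), list(chunks[0].columns))
```

Getting one empty frame instead of an empty generator means downstream code that concatenates chunks or inspects columns does not need a special case for queries with no rows.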
Period#
- Comparisons of Period objects or Index, Series, or DataFrame with mismatched PeriodDtype now behave like other mismatched-type comparisons, returning False for equals, True for not-equal, and raising TypeError for inequality checks (GH 39274)
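A minimal sketch of the comparison behavior described above, using two PeriodIndex objects with different frequencies. The date ranges are illustrative. The note says inequality checks raise TypeError; the sketch also catches ValueError as a defensive assumption, in case the frequency-mismatch error derives from it in a given pandas version.

```python
import pandas as pd

# Same length, different frequencies (monthly vs. daily).
monthly = pd.period_range("2021-01", periods=3, freq="M")
daily = pd.period_range("2021-01-01", periods=3, freq="D")

equal = monthly == daily       # element-wise, all False
not_equal = monthly != daily   # element-wise, all True

try:
    monthly < daily
except (TypeError, ValueError) as exc:
    ordering_error = exc
else:
    ordering_error = None

print(equal.tolist(), not_equal.tolist(), type(ordering_error).__name__)
```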
Plotting#
- Bug in plotting.scatter_matrix() raising when 2d ax argument passed (GH 16253)
- Prevent warnings when Matplotlib’s constrained_layout is enabled (GH 25261)
- Bug in DataFrame.plot() was showing the wrong colors in the legend if the function was called repeatedly and some calls used yerr while others didn’t (GH 39522)
- Bug in DataFrame.plot() was showing the wrong colors in the legend if the function was called repeatedly and some calls used secondary_y and others use legend=False (GH 40044)
- Bug in DataFrame.plot.box() when dark_background theme was selected, caps or min/max markers for the plot were not visible (GH 40769)
Groupby/resample/rolling#
- Bug in DataFrameGroupBy.agg() and SeriesGroupBy.agg() with PeriodDtype columns incorrectly casting results too aggressively (GH 38254)
- Bug in SeriesGroupBy.value_counts() where unobserved categories in a grouped categorical Series were not tallied (GH 38672)
- Bug in SeriesGroupBy.value_counts() where an error was raised on an empty Series (GH 39172)
- Bug in GroupBy.indices() would contain non-existent indices when null values were present in the groupby keys (GH 9304)
- Fixed bug in DataFrameGroupBy.sum() and SeriesGroupBy.sum() causing a loss of precision; these now use Kahan summation (GH 38778)
- Fixed bug in DataFrameGroupBy.cumsum(), SeriesGroupBy.cumsum(), DataFrameGroupBy.mean(), and SeriesGroupBy.mean() causing a loss of precision; these now use Kahan summation (GH 38934)
- Bug in Resampler.aggregate() and DataFrame.transform() raising a TypeError instead of SpecificationError when missing keys had mixed dtypes (GH 39025)
- Bug in DataFrameGroupBy.idxmin() and DataFrameGroupBy.idxmax() with ExtensionDtype columns (GH 38733)
- Bug in Series.resample() would raise when the index was a PeriodIndex consisting of NaT (GH 39227)
- Bug in RollingGroupby.corr() and ExpandingGroupby.corr() where the groupby column would return 0 instead of np.nan when providing other that was longer than each group (GH 39591)
- Bug in ExpandingGroupby.corr() and ExpandingGroupby.cov() where 1 would be returned instead of np.nan when providing other that was longer than each group (GH 39591)
- Bug in DataFrameGroupBy.mean(), SeriesGroupBy.mean(), DataFrameGroupBy.median(), SeriesGroupBy.median(), and DataFrame.pivot_table() not propagating metadata (GH 28283)
- Bug in Series.rolling() and DataFrame.rolling() not calculating window bounds correctly when window is an offset and dates are in descending order (GH 40002)
- Bug in Series.groupby() and DataFrame.groupby() on an empty Series or DataFrame would lose index, columns, and/or data types when directly using the methods idxmax, idxmin, mad, min, max, sum, prod, and skew or using them through apply, aggregate, or resample (GH 26411)
- Bug in DataFrameGroupBy.apply() and SeriesGroupBy.apply() where a MultiIndex would be created instead of an Index when used on a RollingGroupby object (GH 39732)
- Bug in DataFrameGroupBy.sample() where an error was raised when weights was specified and the index was an Int64Index (GH 39927)
- Bug in DataFrameGroupBy.aggregate() and Resampler.aggregate() would sometimes raise a SpecificationError when passed a dictionary and columns were missing; will now always raise a KeyError instead (GH 40004)
- Bug in DataFrameGroupBy.sample() where column selection was not applied before computing the result (GH 39928)
- Bug in ExponentialMovingWindow when calling __getitem__ would incorrectly raise a ValueError when providing times (GH 40164)
- Bug in ExponentialMovingWindow when calling __getitem__ would not retain com, span, alpha or halflife attributes (GH 40164)
- ExponentialMovingWindow now raises a NotImplementedError when specifying times with adjust=False due to an incorrect calculation (GH 40098)
- Bug in ExponentialMovingWindowGroupby.mean() where the times argument was ignored when engine='numba' (GH 40951)
- Bug in ExponentialMovingWindowGroupby.mean() where the wrong times were used in the case of multiple groups (GH 40951)
- Bug in ExponentialMovingWindowGroupby where the times vector and values became out of sync for non-trivial groups (GH 40951)
- Bug in Series.asfreq() and DataFrame.asfreq() dropping rows when the index was not sorted (GH 39805)
- Bug in aggregation functions for DataFrame not respecting numeric_only argument when level keyword was given (GH 40660)
- Bug in SeriesGroupBy.aggregate() where using a user-defined function to aggregate a Series with an object-typed Index causes an incorrect Index shape (GH 40014)
- Bug in RollingGroupby where as_index=False argument in groupby was ignored (GH 39433)
- Bug in DataFrameGroupBy.any(), SeriesGroupBy.any(), DataFrameGroupBy.all() and SeriesGroupBy.all() raising a ValueError when using with nullable type columns holding NA even with skipna=True (GH 40585)
- Bug in DataFrameGroupBy.cummin(), SeriesGroupBy.cummin(), DataFrameGroupBy.cummax() and SeriesGroupBy.cummax() incorrectly rounding integer values near the int64 implementation bounds (GH 40767)
- Bug in DataFrameGroupBy.rank() and SeriesGroupBy.rank() with nullable dtypes incorrectly raising a TypeError (GH 41010)
- Bug in DataFrameGroupBy.cummin(), SeriesGroupBy.cummin(), DataFrameGroupBy.cummax() and SeriesGroupBy.cummax() computing wrong result with nullable data types too large to roundtrip when casting to float (GH 37493)
- Bug in DataFrame.rolling() returning mean zero for all NaN window with min_periods=0 if calculation is not numerically stable (GH 41053)
- Bug in DataFrame.rolling() returning sum not zero for all NaN window with min_periods=0 if calculation is not numerically stable (GH 41053)
- Bug in SeriesGroupBy.agg() failing to retain ordered CategoricalDtype on order-preserving aggregations (GH 41147)
- Bug in DataFrameGroupBy.min(), SeriesGroupBy.min(), DataFrameGroupBy.max() and SeriesGroupBy.max() with multiple object-dtype columns and numeric_only=False incorrectly raising a ValueError (GH 41111)
- Bug in DataFrameGroupBy.rank() with the GroupBy object’s axis=0 and the rank method’s keyword axis=1 (GH 41320)
- Bug in DataFrameGroupBy.__getitem__() with non-unique columns incorrectly returning a malformed SeriesGroupBy instead of DataFrameGroupBy (GH 41427)
- Bug in DataFrameGroupBy.transform() with non-unique columns incorrectly raising an AttributeError (GH 41427)
- Bug in Resampler.apply() with non-unique columns incorrectly dropping duplicated columns (GH 41445)
- Bug in Series.groupby() aggregations incorrectly returning empty Series instead of raising TypeError on aggregations that are invalid for its dtype, e.g. .prod with datetime64[ns] dtype (GH 41342)
- Bug in DataFrameGroupBy aggregations incorrectly failing to drop columns with invalid dtypes for that aggregation when there are no valid columns (GH 41291)
- Bug in DataFrame.rolling.__iter__() where on was not assigned to the index of the resulting objects (GH 40373)
- Bug in DataFrameGroupBy.transform() and DataFrameGroupBy.agg() with engine="numba" where *args were being cached with the user passed function (GH 41647)
- Bug in DataFrameGroupBy methods agg, transform, sum, bfill, ffill, pad, pct_change, shift, ohlc dropping .columns.names (GH 41497)
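Several entries above mention Kahan summation (GH 38778, GH 38934). The following is a plain-Python sketch of the general technique, not the pandas implementation; the function names kahan_sum and naive_sum and the input values are hypothetical and chosen so that a simple running total loses every small addend.

```python
import math


def naive_sum(values):
    """Plain left-to-right accumulation with no error correction."""
    total = 0.0
    for value in values:
        total += value
    return total


def kahan_sum(values):
    """Sum floats while carrying a compensation term for lost low-order bits."""
    total = 0.0
    compensation = 0.0
    for value in values:
        adjusted = value - compensation
        new_total = total + adjusted
        # (new_total - total) is what was actually added; subtracting
        # `adjusted` leaves the rounding error made in this step.
        compensation = (new_total - total) - adjusted
        total = new_total
    return total


# Each 1e-16 is below half the float spacing at 1.0, so naive addition drops it.
values = [1.0] + [1e-16] * 10
naive = naive_sum(values)
compensated = kahan_sum(values)
reference = math.fsum(values)  # correctly rounded sum from the standard library

print(naive, compensated, reference)
```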
Reshaping#
- Bug in merge() raising error when performing an inner join with partial index and right_index=True when there was no overlap between indices (GH 33814)
- Bug in DataFrame.unstack() with missing levels led to incorrect index names (GH 37510)
- Bug in merge_asof() propagating the right Index with left_index=True and right_on specification instead of left Index (GH 33463)
- Bug in DataFrame.join() on a DataFrame with a MultiIndex returned the wrong result when one of both indexes had only one level (GH 36909)
- merge_asof() now raises a ValueError instead of a cryptic TypeError in case of non-numerical merge columns (GH 29130)
- Bug in DataFrame.join() not assigning values correctly when the DataFrame had a MultiIndex where at least one dimension had dtype Categorical with non-alphabetically sorted categories (GH 38502)
- Series.value_counts() and Series.mode() now return consistent keys in original order (GH 12679, GH 11227 and GH 39007)
- Bug in DataFrame.stack() not handling NaN in MultiIndex columns correctly (GH 39481)
- Bug in DataFrame.apply() would give incorrect results when the argument func was a string, axis=1, and the axis argument was not supported; now raises a ValueError instead (GH 39211)
- Bug in DataFrame.sort_values() not reshaping the index correctly after sorting on columns when ignore_index=True (GH 39464)
- Bug in DataFrame.append() returning incorrect dtypes with combinations of ExtensionDtype dtypes (GH 39454)
- Bug in DataFrame.append() returning incorrect dtypes when used with combinations of datetime64 and timedelta64 dtypes (GH 39574)
- Bug in DataFrame.append() with a DataFrame with a MultiIndex and appending a Series whose Index is not a MultiIndex (GH 41707)
- Bug in DataFrame.pivot_table() returning a MultiIndex for a single value when operating on an empty DataFrame (GH 13483)
- Index can now be passed to the numpy.all() function (GH 40180)
- Bug in DataFrame.stack() not preserving CategoricalDtype in a MultiIndex (GH 36991)
- Bug in to_datetime() raising an error when the input sequence contained unhashable items (GH 39756)
- Bug in Series.explode() preserving the index when ignore_index was True and values were scalars (GH 40487)
- Bug in to_datetime() raising a ValueError when Series contains None and NaT and has more than 50 elements (GH 39882)
- Bug in Series.unstack() and DataFrame.unstack() with object-dtype values containing timezone-aware datetime objects incorrectly raising TypeError (GH 41875)
- Bug in DataFrame.melt() raising InvalidIndexError when DataFrame has duplicate columns used as value_vars (GH 41951)
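The Series.value_counts() entry above says keys are returned in original order. One way to see this, assuming pandas 1.3 or later, is to disable sorting by count so that only the order of first appearance determines the result. The data are made up for the example.

```python
import pandas as pd

s = pd.Series([3, 1, 3, 2, 1, 3])

# sort=False skips the sort by frequency, leaving keys in the order
# in which each distinct value first appears in the Series.
counts = s.value_counts(sort=False)

print(counts.index.tolist(), counts.tolist())
```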
Sparse#
- Bug in DataFrame.sparse.to_coo() raising a KeyError with columns that are a numeric Index without a 0 (GH 18414)
- Bug in SparseArray.astype() with copy=False producing incorrect results when going from integer dtype to floating dtype (GH 34456)
- Bug in SparseArray.max() and SparseArray.min() would always return an empty result (GH 40921)
ExtensionArray#
- Bug in DataFrame.where() when other is a Series with an ExtensionDtype (GH 38729)
- Fixed bug where Series.idxmax(), Series.idxmin(), Series.argmax(), and Series.argmin() would fail when the underlying data is an ExtensionArray (GH 32749, GH 33719, GH 36566)
- Fixed bug where some properties of subclasses of PandasExtensionDtype were improperly cached (GH 40329)
- Bug in DataFrame.mask() where masking a DataFrame with an ExtensionDtype raises a ValueError (GH 40941)
Styler#
- Bug in Styler where the subset argument in methods raised an error for some valid MultiIndex slices (GH 33562)
- Styler rendered HTML output has seen minor alterations to support w3 good code standards (GH 39626)
- Bug in Styler where rendered HTML was missing a column class identifier for certain header cells (GH 39716)
- Bug in Styler.background_gradient() where text-color was not determined correctly (GH 39888)
- Bug in Styler.set_table_styles() where multiple elements in CSS-selectors of the table_styles argument were not correctly added (GH 34061)
- Bug in Styler where copying from Jupyter dropped the top left cell and misaligned headers (GH 12147)
- Bug in Styler.where where kwargs were not passed to the applicable callable (GH 40845)
- Bug in Styler causing CSS to duplicate on multiple renders (GH 39395, GH 40334)
Other#
- inspect.getmembers(Series) no longer raises an AbstractMethodError (GH 38782)
- Bug in Series.where() with numeric dtype and other=None not casting to nan (GH 39761)
- Bug in assert_series_equal(), assert_frame_equal(), assert_index_equal() and assert_extension_array_equal() incorrectly raising when an attribute has an unrecognized NA type (GH 39461)
- Bug in assert_index_equal() with exact=True not raising when comparing CategoricalIndex instances with Int64Index and RangeIndex categories (GH 41263)
- Bug in DataFrame.equals(), Series.equals(), and Index.equals() with object-dtype containing np.datetime64("NaT") or np.timedelta64("NaT") (GH 39650)
- Bug in show_versions() where console JSON output was not proper JSON (GH 39701)
- pandas can now compile on z/OS when using xlc (GH 35826)
- Bug in pandas.util.hash_pandas_object() not recognizing hash_key, encoding and categorize when the input object type is a DataFrame (GH 41404)
Contributors#
A total of 251 people contributed patches to this release. People with a “+” by their names contributed a patch for the first time.
- Abhishek R +
- Ada Draginda
- Adam J. Stewart
- Adam Turner +
- Aidan Feldman +
- Ajitesh Singh +
- Akshat Jain +
- Albert Villanova del Moral
- Alexandre Prince-Levasseur +
- Andrew Hawyrluk +
- Andrew Wieteska
- AnglinaBhambra +
- Ankush Dua +
- Anna Daglis
- Ashlan Parker +
- Ashwani +
- Avinash Pancham
- Ayushman Kumar +
- BeanNan
- Benoît Vinot
- Bharat Raghunathan
- Bijay Regmi +
- Bobin Mathew +
- Bogdan Pilyavets +
- Brian Hulette +
- Brian Sun +
- Brock +
- Bryan Cutler
- Caleb +
- Calvin Ho +
- Chathura Widanage +
- Chinmay Rane +
- Chris Lynch
- Chris Withers
- Christos Petropoulos
- Corentin Girard +
- DaPy15 +
- Damodara Puddu +
- Daniel Hrisca
- Daniel Saxton
- DanielFEvans
- Dare Adewumi +
- Dave Willmer
- David Schlachter +
- David-dmh +
- Deepang Raval +
- Doris Lee +
- Dr. Jan-Philip Gehrcke +
- DriesS +
- Dylan Percy
- Erfan Nariman
- Eric Leung
- EricLeer +
- Eve
- Fangchen Li
- Felix Divo
- Florian Jetter
- Fred Reiss
- GFJ138 +
- Gaurav Sheni +
- Geoffrey B. Eisenbarth +
- Gesa Stupperich +
- Griffin Ansel +
- Gustavo C. Maciel +
- Heidi +
- Henry +
- Hung-Yi Wu +
- Ian Ozsvald +
- Irv Lustig
- Isaac Chung +
- Isaac Virshup
- JHM Darbyshire (MBP) +
- JHM Darbyshire (iMac) +
- Jack Liu +
- James Lamb +
- Jeet Parekh
- Jeff Reback
- Jiezheng2018 +
- Jody Klymak
- Johan Kåhrström +
- John McGuigan
- Joris Van den Bossche
- Jose
- JoseNavy
- Josh Dimarsky
- Josh Friedlander
- Joshua Klein +
- Julia Signell
- Julian Schnitzler +
- Kaiqi Dong
- Kasim Panjri +
- Katie Smith +
- Kelly +
- Kenil +
- Keppler, Kyle +
- Kevin Sheppard
- Khor Chean Wei +
- Kiley Hewitt +
- Larry Wong +
- Lightyears +
- Lucas Holtz +
- Lucas Rodés-Guirao
- Lucky Sivagurunathan +
- Luis Pinto
- Maciej Kos +
- Marc Garcia
- Marco Edward Gorelli +
- Marco Gorelli
- MarcoGorelli +
- Mark Graham
- Martin Dengler +
- Martin Grigorov +
- Marty Rudolf +
- Matt Roeschke
- Matthew Roeschke
- Matthew Zeitlin
- Max Bolingbroke
- Maxim Ivanov
- Maxim Kupfer +
- Mayur +
- MeeseeksMachine
- Micael Jarniac
- Michael Hsieh +
- Michel de Ruiter +
- Mike Roberts +
- Miroslav Šedivý
- Mohammad Jafar Mashhadi
- Morisa Manzella +
- Mortada Mehyar
- Muktan +
- Naveen Agrawal +
- Noah
- Nofar Mishraki +
- Oleh Kozynets
- Olga Matoula +
- Oli +
- Omar Afifi
- Omer Ozarslan +
- Owen Lamont +
- Ozan Öğreden +
- Pandas Development Team
- Paolo Lammens
- Parfait Gasana +
- Patrick Hoefler
- Paul McCarthy +
- Paulo S. Costa +
- Pav A
- Peter
- Pradyumna Rahul +
- Punitvara +
- QP Hou +
- Rahul Chauhan
- Rahul Sathanapalli
- Richard Shadrach
- Robert Bradshaw
- Robin to Roxel
- Rohit Gupta
- Sam Purkis +
- Samuel GIFFARD +
- Sean M. Law +
- Shahar Naveh +
- ShaharNaveh +
- Shiv Gupta +
- Shrey Dixit +
- Shudong Yang +
- Simon Boehm +
- Simon Hawkins
- Sioned Baker +
- Stefan Mejlgaard +
- Steven Pitman +
- Steven Schaerer +
- Stéphane Guillou +
- TLouf +
- Tegar D Pratama +
- Terji Petersen
- Theodoros Nikolaou +
- Thomas Dickson
- Thomas Li
- Thomas Smith
- Thomas Yu +
- ThomasBlauthQC +
- Tim Hoffmann
- Tom Augspurger
- Torsten Wörtwein
- Tyler Reddy
- UrielMaD
- Uwe L. Korn
- Venaturum +
- VirosaLi
- Vladimir Podolskiy
- Vyom Pathak +
- WANG Aiyong
- Waltteri Koskinen +
- Wenjun Si +
- William Ayd
- Yeshwanth N +
- Yuanhao Geng
- Zito Relova +
- aflah02 +
- arredond +
- attack68
- cdknox +
- chinggg +
- fathomer +
- ftrihardjo +
- github-actions[bot] +
- gunjan-solanki +
- guru kiran
- hasan-yaman
- i-aki-y +
- jbrockmendel
- jmholzer +
- jordi-crespo +
- jotasi +
- jreback
- juliansmidek +
- kylekeppler
- lrepiton +
- lucasrodes
- maroth96 +
- mikeronayne +
- mlondschien
- moink +
- morrme
- mschmookler +
- mzeitlin11
- na2 +
- nofarmishraki +
- partev
- patrick
- ptype
- realead
- rhshadrach
- rlukevie +
- rosagold +
- saucoide +
- sdementen +
- shawnbrown
- sstiijn +
- stphnlyd +
- sukriti1 +
- taytzehao
- theOehrly +
- theodorju +
- thordisstella +
- tonyyyyip +
- tsinggggg +
- tushushu +
- vangorade +
- vladu +
- wertha +