What’s new in 1.1.0 (July 28, 2020) — pandas 2.2.3 documentation
These are the changes in pandas 1.1.0. See Release notes for a full changelog including other versions of pandas.
Enhancements#
KeyErrors raised by loc specify missing labels#
Previously, if labels were missing for a .loc call, a KeyError was raised stating that this was no longer supported.
Now the error message also includes a list of the missing labels (max 10 items, display width 80 characters). See GH 34272.
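As a minimal sketch of what this looks like in practice, the snippet below triggers the error and inspects its message. The Series contents and the labels "qq" and "zz" are made up for illustration, and the exact wording of the message may differ between pandas versions.

```python
import pandas as pd

# Hypothetical data: only the labels "a", "b" and "c" exist
ser = pd.Series([10, 20, 30], index=["a", "b", "c"])

message = None
try:
    # "qq" and "zz" are not in the index, so .loc raises a KeyError
    ser.loc[["a", "qq", "zz"]]
except KeyError as err:
    message = str(err)

# The message names the labels that could not be found
print(message)
```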
All dtypes can now be converted to StringDtype#
Previously, declaring or converting to StringDtype was in general only possible if the data was already only str or nan-like (GH 31204). StringDtype now works in all situations where astype(str) or dtype=str work:
For example, the below now works:
In [1]: ser = pd.Series([1, "abc", np.nan], dtype="string")
In [2]: ser
Out[2]:
0       1
1     abc
2    <NA>
Length: 3, dtype: string
In [3]: ser[0]
Out[3]: '1'
In [4]: pd.Series([1, 2, np.nan], dtype="Int64").astype("string")
Out[4]:
0       1
1       2
2    <NA>
Length: 3, dtype: string
Non-monotonic PeriodIndex partial string slicing#
PeriodIndex now supports partial string slicing for non-monotonic indexes, mirroring DatetimeIndex behavior (GH 31096)
For example:
In [5]: dti = pd.date_range("2014-01-01", periods=30, freq="30D")
In [6]: pi = dti.to_period("D")
In [7]: ser_monotonic = pd.Series(np.arange(30), index=pi)
In [8]: shuffler = list(range(0, 30, 2)) + list(range(1, 31, 2))
In [9]: ser = ser_monotonic.iloc[shuffler]
In [10]: ser
Out[10]:
2014-01-01     0
2014-03-02     2
2014-05-01     4
2014-06-30     6
2014-08-29     8
              ..
2015-09-23    21
2015-11-22    23
2016-01-21    25
2016-03-21    27
2016-05-20    29
Freq: D, Length: 30, dtype: int64
In [11]: ser["2014"]
Out[11]:
2014-01-01     0
2014-03-02     2
2014-05-01     4
2014-06-30     6
2014-08-29     8
2014-10-28    10
2014-12-27    12
2014-01-31     1
2014-04-01     3
2014-05-31     5
2014-07-30     7
2014-09-28     9
2014-11-27    11
Freq: D, Length: 13, dtype: int64
In [12]: ser.loc["May 2015"]
Out[12]:
2015-05-26    17
Freq: D, Length: 1, dtype: int64
Comparing two DataFrame or two Series and summarizing the differences#
We’ve added DataFrame.compare() and Series.compare() for comparing two DataFrame or two Series (GH 30429)
In [13]: df = pd.DataFrame(
   ....:     {
   ....:         "col1": ["a", "a", "b", "b", "a"],
   ....:         "col2": [1.0, 2.0, 3.0, np.nan, 5.0],
   ....:         "col3": [1.0, 2.0, 3.0, 4.0, 5.0]
   ....:     },
   ....:     columns=["col1", "col2", "col3"],
   ....: )
   ....:
In [14]: df
Out[14]:
  col1  col2  col3
0    a   1.0   1.0
1    a   2.0   2.0
2    b   3.0   3.0
3    b   NaN   4.0
4    a   5.0   5.0
[5 rows x 3 columns]
In [15]: df2 = df.copy()
In [16]: df2.loc[0, 'col1'] = 'c'
In [17]: df2.loc[2, 'col3'] = 4.0
In [18]: df2
Out[18]:
  col1  col2  col3
0    c   1.0   1.0
1    a   2.0   2.0
2    b   3.0   4.0
3    b   NaN   4.0
4    a   5.0   5.0
[5 rows x 3 columns]
In [19]: df.compare(df2)
Out[19]:
col1 col3
self other self other
0 a c NaN NaN
2 NaN NaN 3.0 4.0
[2 rows x 4 columns]
See User Guide for more details.
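The same method exists on Series. The sketch below uses two small, made-up Series to show that only the positions that differ are reported, with the values labelled self and other.

```python
import pandas as pd

# Two hypothetical Series that differ at positions 1 and 3
s1 = pd.Series(["a", "b", "c", "d"])
s2 = pd.Series(["a", "x", "c", "y"])

# compare() keeps only the rows where the two Series differ
diff = s1.compare(s2)
print(diff)
```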
Allow NA in groupby key#
With groupby, we’ve added a dropna keyword to DataFrame.groupby() and Series.groupby() in order to allow NA values in group keys. Users can set dropna to False if they want to include NA values in groupby keys. The default for dropna is True to keep backwards compatibility (GH 3729)
In [20]: df_list = [[1, 2, 3], [1, None, 4], [2, 1, 3], [1, 2, 2]]
In [21]: df_dropna = pd.DataFrame(df_list, columns=["a", "b", "c"])
In [22]: df_dropna
Out[22]:
   a    b  c
0  1  2.0  3
1  1  NaN  4
2  2  1.0  3
3  1  2.0  2
[4 rows x 3 columns]
By default dropna is set to True, which excludes NaNs in keys:
In [23]: df_dropna.groupby(by=["b"], dropna=True).sum()
Out[23]:
a c
b
1.0 2 3
2.0 2 5
[2 rows x 2 columns]
In order to allow NaN in keys, set dropna to False:
In [24]: df_dropna.groupby(by=["b"], dropna=False).sum()
Out[24]:
a c
b
1.0 2 3
2.0 2 5
NaN 1 4
[3 rows x 2 columns]
The default value of the dropna argument is True, which means NA values are not included in group keys.
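Series.groupby() accepts the same keyword. The following is a small sketch with made-up values and keys, comparing the default with dropna=False.

```python
import numpy as np
import pandas as pd

# Hypothetical values and grouping keys; two of the keys are missing
values = pd.Series([1, 2, 3, 4])
keys = pd.Series(["x", np.nan, "x", np.nan])

# Default (dropna=True): rows whose key is NaN are left out
dropped = values.groupby(keys).sum()

# dropna=False: NaN becomes a group key of its own
kept = values.groupby(keys, dropna=False).sum()

print(dropped)
print(kept)
```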
Sorting with keys#
We’ve added a key argument to the DataFrame and Series sorting methods, including DataFrame.sort_values(), DataFrame.sort_index(), Series.sort_values(), and Series.sort_index(). The key can be any callable; it is applied to each column used for sorting before the sort is performed (GH 27237). See sort_values with keys and sort_index with keys for more information.
In [25]: s = pd.Series(['C', 'a', 'B'])
In [26]: s
Out[26]:
0    C
1    a
2    B
Length: 3, dtype: object
In [27]: s.sort_values()
Out[27]:
2    B
0    C
1    a
Length: 3, dtype: object
Note how this is sorted with capital letters first. If we apply the Series.str.lower() method, we get
In [28]: s.sort_values(key=lambda x: x.str.lower())
Out[28]:
1    a
2    B
0    C
Length: 3, dtype: object
When applied to a DataFrame, the key is applied per-column to all columns, or to a subset if by is specified, e.g.
In [29]: df = pd.DataFrame({'a': ['C', 'C', 'a', 'a', 'B', 'B'],
   ....:                    'b': [1, 2, 3, 4, 5, 6]})
   ....:
In [30]: df
Out[30]:
   a  b
0  C  1
1  C  2
2  a  3
3  a  4
4  B  5
5  B  6
[6 rows x 2 columns]
In [31]: df.sort_values(by=['a'], key=lambda col: col.str.lower())
Out[31]:
   a  b
2  a  3
3  a  4
4  B  5
5  B  6
0  C  1
1  C  2
[6 rows x 2 columns]
For more details, see examples and documentation in DataFrame.sort_values(), Series.sort_values(), and sort_index().
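The key argument works the same way for sort_index(), where the callable receives the index rather than a column. A short sketch, with a made-up Series:

```python
import pandas as pd

# Hypothetical Series with a mixed-case string index
s = pd.Series([1, 2, 3], index=["b", "C", "a"])

# Default ordering puts capital letters first
default_order = s.sort_index()

# The key receives the Index; lower-casing it gives a case-insensitive sort
case_insensitive = s.sort_index(key=lambda idx: idx.str.lower())

print(default_order.index.tolist())
print(case_insensitive.index.tolist())
```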
Fold argument support in Timestamp constructor#
Timestamp now supports the keyword-only fold argument according to PEP 495, similar to its parent datetime.datetime class. It supports both accepting fold as an initialization argument and inferring fold from other constructor arguments (GH 25057, GH 31338). Support is limited to dateutil timezones as pytz doesn’t support fold.
For example:
In [32]: ts = pd.Timestamp("2019-10-27 01:30:00+00:00")
In [33]: ts.fold Out[33]: 0
In [34]: ts = pd.Timestamp(year=2019, month=10, day=27, hour=1, minute=30,
   ....:                   tz="dateutil/Europe/London", fold=1)
   ....:
In [35]: ts
Out[35]: Timestamp('2019-10-27 01:30:00+0000', tz='dateutil//usr/share/zoneinfo/Europe/London')
For more on working with fold, see Fold subsection in the user guide.
Parsing timezone-aware format with different timezones in to_datetime#
to_datetime() now supports parsing formats containing timezone names (%Z) and UTC offsets (%z) from different timezones and then converting them to UTC by setting utc=True. This returns a DatetimeIndex with timezone at UTC, as opposed to an Index with object dtype if utc=True is not set (GH 32792).
For example:
In [36]: tz_strs = ["2010-01-01 12:00:00 +0100", "2010-01-01 12:00:00 -0100",
   ....:            "2010-01-01 12:00:00 +0300", "2010-01-01 12:00:00 +0400"]
   ....:
In [37]: pd.to_datetime(tz_strs, format='%Y-%m-%d %H:%M:%S %z', utc=True)
Out[37]:
DatetimeIndex(['2010-01-01 11:00:00+00:00', '2010-01-01 13:00:00+00:00',
               '2010-01-01 09:00:00+00:00', '2010-01-01 08:00:00+00:00'],
              dtype='datetime64[ns, UTC]', freq=None)
In [37]: pd.to_datetime(tz_strs, format='%Y-%m-%d %H:%M:%S %z')
Out[37]:
Index([2010-01-01 12:00:00+01:00, 2010-01-01 12:00:00-01:00,
       2010-01-01 12:00:00+03:00, 2010-01-01 12:00:00+04:00],
      dtype='object')
Grouper and resample now support the arguments origin and offset#
Grouper and DataFrame.resample() now support the arguments origin and offset. They let the user control the timestamp on which to adjust the grouping. (GH 31809)
The bins of the grouping are adjusted based on the beginning of the day of the time series starting point. This works well with frequencies that are multiples of a day (like 30D) or that divide a day (like 90s or 1min). But it can create inconsistencies with some frequencies that do not meet this criterion. To change this behavior you can now specify a fixed timestamp with the argument origin.
Two arguments are now deprecated (more information in the documentation of DataFrame.resample()):
- base should be replaced by offset.
- loffset should be replaced by directly adding an offset to the index of the DataFrame after it has been resampled.
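The two replacements can be sketched as follows. The 2-minute and 8-minute shifts are arbitrary values chosen for illustration; the time series is the same one used in the example further down.

```python
import numpy as np
import pandas as pd

rng = pd.date_range("2000-10-01 23:30:00", "2000-10-02 00:30:00", freq="7min")
ts = pd.Series(np.arange(len(rng)) * 3, index=rng)

# Instead of base=2: shift the bin edges by two minutes with offset
shifted_bins = ts.resample("17min", offset="2min").sum()

# Instead of loffset="8min": resample first, then move the labels of
# the result by adding a Timedelta to its index
relabeled = ts.resample("17min").sum()
relabeled.index = relabeled.index + pd.Timedelta("8min")

print(shifted_bins.index[0])
print(relabeled.index[0])
```

Note that offset changes which rows fall into which bin, while shifting the index afterwards only changes the labels.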
Small example of the use of origin:
In [38]: start, end = '2000-10-01 23:30:00', '2000-10-02 00:30:00'
In [39]: middle = '2000-10-02 00:00:00'
In [40]: rng = pd.date_range(start, end, freq='7min')
In [41]: ts = pd.Series(np.arange(len(rng)) * 3, index=rng)
In [42]: ts
Out[42]:
2000-10-01 23:30:00     0
2000-10-01 23:37:00     3
2000-10-01 23:44:00     6
2000-10-01 23:51:00     9
2000-10-01 23:58:00    12
2000-10-02 00:05:00    15
2000-10-02 00:12:00    18
2000-10-02 00:19:00    21
2000-10-02 00:26:00    24
Freq: 7min, Length: 9, dtype: int64
Resample with the default behavior 'start_day' (origin is 2000-10-01 00:00:00):
In [43]: ts.resample('17min').sum()
Out[43]:
2000-10-01 23:14:00     0
2000-10-01 23:31:00     9
2000-10-01 23:48:00    21
2000-10-02 00:05:00    54
2000-10-02 00:22:00    24
Freq: 17min, Length: 5, dtype: int64
In [44]: ts.resample('17min', origin='start_day').sum()
Out[44]:
2000-10-01 23:14:00     0
2000-10-01 23:31:00     9
2000-10-01 23:48:00    21
2000-10-02 00:05:00    54
2000-10-02 00:22:00    24
Freq: 17min, Length: 5, dtype: int64
Resample using a fixed origin:
In [45]: ts.resample('17min', origin='epoch').sum()
Out[45]:
2000-10-01 23:18:00     0
2000-10-01 23:35:00    18
2000-10-01 23:52:00    27
2000-10-02 00:09:00    39
2000-10-02 00:26:00    24
Freq: 17min, Length: 5, dtype: int64
In [46]: ts.resample('17min', origin='2000-01-01').sum()
Out[46]:
2000-10-01 23:24:00     3
2000-10-01 23:41:00    15
2000-10-01 23:58:00    45
2000-10-02 00:15:00    45
Freq: 17min, Length: 4, dtype: int64
If needed you can adjust the bins with the argument offset (a Timedelta) that would be added to the default origin.
For a full example, see: Use origin or offset to adjust the start of the bins.
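As a sketch of the two remaining pieces, the snippet below passes origin to a Grouper and uses offset with resample on the series from the example above. The offset of 23 hours 30 minutes is chosen so that the first bin starts at the first observation.

```python
import numpy as np
import pandas as pd

rng = pd.date_range("2000-10-01 23:30:00", "2000-10-02 00:30:00", freq="7min")
ts = pd.Series(np.arange(len(rng)) * 3, index=rng)

# Grouper accepts the same origin argument as resample
by_grouper = ts.groupby(pd.Grouper(freq="17min", origin="epoch")).sum()
by_resample = ts.resample("17min", origin="epoch").sum()

# offset is added to the default origin (the start of the first day)
with_offset = ts.resample("17min", offset="23h30min").sum()

print(by_grouper)
print(with_offset)
```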
fsspec now used for filesystem handling#
For reading and writing to filesystems other than local and reading from HTTP(S), the optional dependency fsspec will be used to dispatch operations (GH 33452). This will give unchanged functionality for S3 and GCS storage, which were already supported, but also add support for several other storage implementations such as Azure Datalake and Blob, SSH, FTP, Dropbox and GitHub. For docs and capabilities, see the fsspec docs.
The existing capability to interface with S3 and GCS will be unaffected by this change, as fsspec will still bring in the same packages as before.
Other enhancements#
- Compatibility with matplotlib 3.3.0 (GH 34850)
- IntegerArray.astype() now supports datetime64 dtype (GH 32538)
- IntegerArray now implements the sum operation (GH 33172)
- Added pandas.errors.InvalidIndexError (GH 34570).
- Added DataFrame.value_counts() (GH 5377)
- Added a pandas.api.indexers.FixedForwardWindowIndexer() class to support forward-looking windows during rolling operations.
- Added a pandas.api.indexers.VariableOffsetWindowIndexer() class to support rolling operations with non-fixed offsets (GH 34994)
- describe() now includes a datetime_is_numeric keyword to control how datetime columns are summarized (GH 30164, GH 34798)
- Styler may now render CSS more efficiently where multiple cells have the same styling (GH 30876)
- highlight_null() now accepts subset argument (GH 31345)
- When writing directly to a sqlite connection DataFrame.to_sql() now supports the multi method (GH 29921)
- pandas.errors.OptionError is now exposed in pandas.errors (GH 27553)
- Added api.extensions.ExtensionArray.argmax() and api.extensions.ExtensionArray.argmin() (GH 24382)
- timedelta_range() will now infer a frequency when passed start, stop, and periods (GH 32377)
- Positional slicing on an IntervalIndex now supports slices with step > 1 (GH 31658)
- Series.str now has a fullmatch method that matches a regular expression against the entire string in each row of the Series, similar to re.fullmatch (GH 32806).
- DataFrame.sample() will now also allow array-like and BitGenerator objects to be passed to random_state as seeds (GH 32503)
- Index.union() will now raise RuntimeWarning for MultiIndex objects if the objects inside are unsortable. Pass sort=False to suppress this warning (GH 33015)
- Added Series.dt.isocalendar() and DatetimeIndex.isocalendar() that return a DataFrame with year, week, and day calculated according to the ISO 8601 calendar (GH 33206, GH 34392).
- The DataFrame.to_feather() method now supports additional keyword arguments (e.g. to set the compression) that are added in pyarrow 0.17 (GH 33422).
- cut() will now accept the parameter ordered with default ordered=True. If ordered=False and no labels are provided, an error will be raised (GH 33141)
- DataFrame.to_csv(), DataFrame.to_pickle(), and DataFrame.to_json() now support passing a dict of compression arguments when using the gzip and bz2 protocols. This can be used to set a custom compression level, e.g., df.to_csv(path, compression={'method': 'gzip', 'compresslevel': 1}) (GH 33196)
- melt() has gained an ignore_index (default True) argument that, if set to False, prevents the method from dropping the index (GH 17440).
- Series.update() now accepts objects that can be coerced to a Series, such as dict and list, mirroring the behavior of DataFrame.update() (GH 33215)
- DataFrameGroupBy.transform() and DataFrameGroupBy.aggregate() have gained engine and engine_kwargs arguments that support executing functions with Numba (GH 32854, GH 33388)
- Resampler.interpolate() now supports SciPy interpolation method scipy.interpolate.CubicSpline as method cubicspline (GH 33670)
- DataFrameGroupBy and SeriesGroupBy now implement the sample method for doing random sampling within groups (GH 31775)
- DataFrame.to_numpy() now supports the na_value keyword to control the NA sentinel in the output array (GH 33820)
- Added api.extensions.ExtensionArray.equals to the extension array interface, similar to Series.equals() (GH 27081)
- The minimum supported dta version has increased to 105 in read_stata() and StataReader (GH 26667).
- to_stata() supports compression using the compression keyword argument. Compression can either be inferred or explicitly set using a string or a dictionary containing both the method and any additional arguments that are passed to the compression library. Compression was also added to the low-level Stata-file writers StataWriter, StataWriter117, and StataWriterUTF8 (GH 26599).
- HDFStore.put() now accepts a track_times parameter. This parameter is passed to the create_table method of PyTables (GH 32682).
- Series.plot() and DataFrame.plot() now accept xlabel and ylabel parameters to present labels on the x and y axes (GH 9093).
- Made Rolling and Expanding iterable (GH 11704)
- Made option_context a contextlib.ContextDecorator, which allows it to be used as a decorator over an entire function (GH 34253).
- DataFrame.to_csv() and Series.to_csv() now accept an errors argument (GH 22610)
- DataFrameGroupBy.groupby.transform() now allows func to be pad, backfill and cumcount (GH 31269).
- read_json() now accepts an nrows parameter. (GH 33916).
- DataFrame.hist(), Series.hist(), core.groupby.DataFrameGroupBy.hist(), and core.groupby.SeriesGroupBy.hist() have gained the legend argument. Set to True to show a legend in the histogram. (GH 6279)
- concat() and append() now preserve extension dtypes, for example combining a nullable integer column with a numpy integer column will no longer result in object dtype but preserve the integer dtype (GH 33607, GH 34339, GH 34095).
- read_gbq() now allows disabling the progress bar (GH 33360).
- read_gbq() now supports the max_results kwarg from pandas-gbq (GH 34639).
- DataFrame.cov() and Series.cov() now support a new parameter ddof to support delta degrees of freedom as in the corresponding numpy methods (GH 34611).
- DataFrame.to_html() and DataFrame.to_string()’s col_space parameter now accepts a list or dict to change only some specific columns’ width (GH 28917).
- DataFrame.to_excel() can now also write OpenOffice spreadsheet (.ods) files (GH 27222)
- explode() now accepts ignore_index to reset the index, similar to pd.concat() or DataFrame.sort_values() (GH 34932).
- DataFrame.to_markdown() and Series.to_markdown() now accept index argument as an alias for tabulate’s showindex (GH 32667)
- read_csv() now accepts string values like “0”, “0.0”, “1”, “1.0” as convertible to the nullable Boolean dtype (GH 34859)
- ExponentialMovingWindow now supports a times argument that allows mean to be calculated with observations spaced by the timestamps in times (GH 34839)
- DataFrame.agg() and Series.agg() now accept named aggregation for renaming the output columns/indexes. (GH 26513)
- compute.use_numba now exists as a configuration option that utilizes the numba engine when available (GH 33966, GH 35374)
- Series.plot() now supports asymmetric error bars. Previously, if Series.plot() received a “2xN” array with error values for yerr and/or xerr, the left/lower values (first row) were mirrored, while the right/upper values (second row) were ignored. Now, the first row represents the left/lower error values and the second row the right/upper error values. (GH 9536)
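A few of the additions listed above can be tried out in a handful of lines. The sketch below uses small made-up inputs and covers DataFrame.value_counts(), Series.str.fullmatch(), Series.dt.isocalendar() and FixedForwardWindowIndexer; it is meant as an illustration rather than a full tour.

```python
import pandas as pd
from pandas.api.indexers import FixedForwardWindowIndexer

# DataFrame.value_counts(): count unique rows
df = pd.DataFrame({"a": [1, 1, 2], "b": ["x", "x", "y"]})
row_counts = df.value_counts()

# Series.str.fullmatch(): the pattern must match the entire string
codes = pd.Series(["abc", "abcd"])
full = codes.str.fullmatch("abc")

# Series.dt.isocalendar(): ISO 8601 year, week and day as a DataFrame
dates = pd.Series(pd.to_datetime(["2020-07-28"]))
iso = dates.dt.isocalendar()

# FixedForwardWindowIndexer: each window covers the current row and
# the one after it, instead of looking backwards
s = pd.Series([1, 2, 3, 4])
indexer = FixedForwardWindowIndexer(window_size=2)
forward_sum = s.rolling(indexer, min_periods=1).sum()

print(row_counts)
print(full.tolist())
print(iso)
print(forward_sum.tolist())
```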
Notable bug fixes#
These are bug fixes that might have notable behavior changes.
MultiIndex.get_indexer interprets method argument correctly#
This restores the behavior of MultiIndex.get_indexer() with method='backfill' or method='pad' to the behavior before pandas 0.23.0. In particular, MultiIndexes are treated as a list of tuples and padding or backfilling is done with respect to the ordering of these lists of tuples (GH 29896).
As an example of this, given:
In [47]: df = pd.DataFrame({
   ....:     'a': [0, 0, 0, 0],
   ....:     'b': [0, 2, 3, 4],
   ....:     'c': ['A', 'B', 'C', 'D'],
   ....: }).set_index(['a', 'b'])
   ....:
In [48]: mi_2 = pd.MultiIndex.from_product([[0], [-1, 0, 1, 3, 4, 5]])
The differences in reindexing df with mi_2 and using method='backfill' can be seen here:
pandas >= 0.23, < 1.1.0:
In [1]: df.reindex(mi_2, method='backfill')
Out[1]:
      c
0 -1  A
   0  A
   1  D
   3  A
   4  A
   5  C
pandas < 0.23, >= 1.1.0:
In [49]: df.reindex(mi_2, method='backfill')
Out[49]:
        c
0 -1    A
   0    A
   1    B
   3    C
   4    D
   5  NaN
[6 rows x 1 columns]
And the differences in reindexing df with mi_2 and using method='pad' can be seen here:
pandas >= 0.23, < 1.1.0:
In [1]: df.reindex(mi_2, method='pad')
Out[1]:
        c
0 -1  NaN
   0  NaN
   1    D
   3  NaN
   4    A
   5    C
pandas < 0.23, >= 1.1.0:
In [50]: df.reindex(mi_2, method='pad')
Out[50]:
        c
0 -1  NaN
   0    A
   1    A
   3    C
   4    D
   5    D
[6 rows x 1 columns]
Failed label-based lookups always raise KeyError#
Label lookups series[key], series.loc[key] and frame.loc[key] used to raise either KeyError or TypeError depending on the type of key and type of Index. These now consistently raise KeyError (GH 31867)
In [51]: ser1 = pd.Series(range(3), index=[0, 1, 2])
In [52]: ser2 = pd.Series(range(3), index=pd.date_range("2020-02-01", periods=3))
Previous behavior:
In [3]: ser1[1.5]
...
TypeError: cannot do label indexing on Int64Index with these indexers [1.5] of type float
In [4]: ser1["foo"]
...
KeyError: 'foo'
In [5]: ser1.loc[1.5]
...
TypeError: cannot do label indexing on Int64Index with these indexers [1.5] of type float
In [6]: ser1.loc["foo"]
...
KeyError: 'foo'
In [7]: ser2.loc[1]
...
TypeError: cannot do label indexing on DatetimeIndex with these indexers [1] of type int
In [8]: ser2.loc[pd.Timestamp(0)]
...
KeyError: Timestamp('1970-01-01 00:00:00')
New behavior:
In [3]: ser1[1.5]
...
KeyError: 1.5
In [4]: ser1["foo"]
...
KeyError: 'foo'
In [5]: ser1.loc[1.5]
...
KeyError: 1.5
In [6]: ser1.loc["foo"]
...
KeyError: 'foo'
In [7]: ser2.loc[1]
...
KeyError: 1
In [8]: ser2.loc[pd.Timestamp(0)]
...
KeyError: Timestamp('1970-01-01 00:00:00')
Similarly, DataFrame.at() and Series.at() will raise a TypeError instead of a ValueError if an incompatible key is passed, and KeyError if a missing key is passed, matching the behavior of .loc[] (GH 31722)
Failed Integer Lookups on MultiIndex Raise KeyError#
Indexing with integers with a MultiIndex that has an integer-dtype first level incorrectly failed to raise KeyError when one or more of those integer keys is not present in the first level of the index (GH 33539)
In [53]: idx = pd.Index(range(4))
In [54]: dti = pd.date_range("2000-01-03", periods=3)
In [55]: mi = pd.MultiIndex.from_product([idx, dti])
In [56]: ser = pd.Series(range(len(mi)), index=mi)
Previous behavior:
In [5]: ser[[5]] Out[5]: Series([], dtype: int64)
New behavior:
In [5]: ser[[5]] ... KeyError: '[5] not in index'
DataFrame.merge() preserves right frame’s row order#
DataFrame.merge() now preserves the right frame’s row order when executing a right merge (GH 27453)
In [57]: left_df = pd.DataFrame({'animal': ['dog', 'pig'],
   ....:                         'max_speed': [40, 11]})
   ....:
In [58]: right_df = pd.DataFrame({'animal': ['quetzal', 'pig'],
   ....:                          'max_speed': [80, 11]})
   ....:
In [59]: left_df Out[59]: animal max_speed 0 dog 40 1 pig 11
[2 rows x 2 columns]
In [60]: right_df Out[60]: animal max_speed 0 quetzal 80 1 pig 11
[2 rows x 2 columns]
Previous behavior:
left_df.merge(right_df, on=['animal', 'max_speed'], how="right") animal max_speed 0 pig 11 1 quetzal 80
New behavior:
In [61]: left_df.merge(right_df, on=['animal', 'max_speed'], how="right") Out[61]: animal max_speed 0 quetzal 80 1 pig 11
[2 rows x 2 columns]
Assignment to multiple columns of a DataFrame when some columns do not exist#
Assignment to multiple columns of a DataFrame when some of the columns do not exist would previously assign the values to the last column. Now, new columns will be constructed with the right values. (GH 13658)
In [62]: df = pd.DataFrame({'a': [0, 1, 2], 'b': [3, 4, 5]})
In [63]: df Out[63]: a b 0 0 3 1 1 4 2 2 5
[3 rows x 2 columns]
Previous behavior:
In [3]: df[['a', 'c']] = 1
In [4]: df
Out[4]:
   a  b
0  1  1
1  1  1
2  1  1
New behavior:
In [64]: df[['a', 'c']] = 1
In [65]: df Out[65]: a b c 0 1 3 1 1 1 4 1 2 1 5 1
[3 rows x 3 columns]
Consistency across groupby reductions#
Using DataFrame.groupby() with as_index=True and the aggregation nunique would include the grouping column(s) in the columns of the result. Now the grouping column(s) only appear in the index, consistent with other reductions. (GH 32579)
In [66]: df = pd.DataFrame({"a": ["x", "x", "y", "y"], "b": [1, 1, 2, 3]})
In [67]: df Out[67]: a b 0 x 1 1 x 1 2 y 2 3 y 3
[4 rows x 2 columns]
Previous behavior:
In [3]: df.groupby("a", as_index=True).nunique() Out[4]: a b a x 1 1 y 1 2
New behavior:
In [68]: df.groupby("a", as_index=True).nunique()
Out[68]:
b
a
x 1
y 2
[2 rows x 1 columns]
Using DataFrame.groupby() with as_index=False and the function idxmax, idxmin, mad, nunique, sem, skew, or std would modify the grouping column. Now the grouping column remains unchanged, consistent with other reductions. (GH 21090, GH 10355)
Previous behavior:
In [3]: df.groupby("a", as_index=False).nunique() Out[4]: a b 0 1 1 1 1 2
New behavior:
In [69]: df.groupby("a", as_index=False).nunique() Out[69]: a b 0 x 1 1 y 2
[2 rows x 2 columns]
The method DataFrameGroupBy.size() would previously ignore as_index=False. Now the grouping columns are returned as columns, making the result a DataFrame instead of a Series. (GH 32599)
Previous behavior:
In [3]: df.groupby("a", as_index=False).size() Out[4]: a x 2 y 2 dtype: int64
New behavior:
In [70]: df.groupby("a", as_index=False).size() Out[70]: a size 0 x 2 1 y 2
[2 rows x 2 columns]
DataFrameGroupby.agg() lost results with as_index=False when relabeling columns#
Previously DataFrameGroupby.agg() lost the result columns when the as_index option was set to False and the result columns were relabeled. In this case the result values were replaced with the previous index (GH 32240).
In [71]: df = pd.DataFrame({"key": ["x", "y", "z", "x", "y", "z"],
   ....:                    "val": [1.0, 0.8, 2.0, 3.0, 3.6, 0.75]})
   ....:
In [72]: df Out[72]: key val 0 x 1.00 1 y 0.80 2 z 2.00 3 x 3.00 4 y 3.60 5 z 0.75
[6 rows x 2 columns]
Previous behavior:
In [2]: grouped = df.groupby("key", as_index=False)
In [3]: result = grouped.agg(min_val=pd.NamedAgg(column="val", aggfunc="min"))
In [4]: result
Out[4]:
  min_val
0       x
1       y
2       z
New behavior:
In [73]: grouped = df.groupby("key", as_index=False)
In [74]: result = grouped.agg(min_val=pd.NamedAgg(column="val", aggfunc="min"))
In [75]: result Out[75]: key min_val 0 x 1.00 1 y 0.80 2 z 0.75
[3 rows x 2 columns]
apply and applymap on DataFrame evaluates first row/column only once#
In [76]: df = pd.DataFrame({'a': [1, 2], 'b': [3, 6]})
In [77]: def func(row):
   ....:     print(row)
   ....:     return row
   ....:
Previous behavior:
In [4]: df.apply(func, axis=1)
a    1
b    3
Name: 0, dtype: int64
a    1
b    3
Name: 0, dtype: int64
a    2
b    6
Name: 1, dtype: int64
Out[4]:
   a  b
0  1  3
1  2  6
New behavior:
In [78]: df.apply(func, axis=1)
a    1
b    3
Name: 0, Length: 2, dtype: int64
a    2
b    6
Name: 1, Length: 2, dtype: int64
Out[78]:
   a  b
0  1  3
1  2  6
[2 rows x 2 columns]
Backwards incompatible API changes#
Added check_freq argument to testing.assert_frame_equal and testing.assert_series_equal#
The check_freq argument was added to testing.assert_frame_equal() and testing.assert_series_equal() in pandas 1.1.0 and defaults to True. testing.assert_frame_equal() and testing.assert_series_equal() now raise AssertionError if the indexes do not have the same frequency. Before pandas 1.1.0, the index frequency was not checked.
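A minimal sketch of the new check, assuming two Series whose DatetimeIndex values are identical but where only one index carries a frequency. The variable names are illustrative.

```python
import pandas as pd

# Same timestamps, but only the first index has freq set
idx_with_freq = pd.date_range("2020-01-01", periods=3, freq="D")
idx_no_freq = pd.DatetimeIndex(list(idx_with_freq))  # freq is None here

left = pd.Series([1, 2, 3], index=idx_with_freq)
right = pd.Series([1, 2, 3], index=idx_no_freq)

# With the default check_freq=True the differing freq is an error
try:
    pd.testing.assert_series_equal(left, right)
    raised = False
except AssertionError:
    raised = True

# Opting out restores the pre-1.1.0 behavior; this call passes
pd.testing.assert_series_equal(left, right, check_freq=False)

print(raised)
```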
Increased minimum versions for dependencies#
Some minimum supported versions of dependencies were updated (GH 33718, GH 29766, GH 29723, pytables >= 3.4.3). If installed, we now require:
Package | Minimum Version | Required | Changed |
---|---|---|---|
numpy | 1.15.4 | X | X |
pytz | 2015.4 | X | |
python-dateutil | 2.7.3 | X | X |
bottleneck | 1.2.1 | | |
numexpr | 2.6.2 | | |
pytest (dev) | 4.0.2 | | |
For optional libraries the general recommendation is to use the latest version. The following table lists the lowest version per library that is currently being tested throughout the development of pandas. Optional libraries below the lowest tested version may still work, but are not considered supported.
Package | Minimum Version | Changed |
---|---|---|
beautifulsoup4 | 4.6.0 | |
fastparquet | 0.3.2 | |
fsspec | 0.7.4 | |
gcsfs | 0.6.0 | X |
lxml | 3.8.0 | |
matplotlib | 2.2.2 | |
numba | 0.46.0 | |
openpyxl | 2.5.7 | |
pyarrow | 0.13.0 | |
pymysql | 0.7.1 | |
pytables | 3.4.3 | X |
s3fs | 0.4.0 | X |
scipy | 1.2.0 | X |
sqlalchemy | 1.1.4 | |
xarray | 0.8.2 | |
xlrd | 1.1.0 | |
xlsxwriter | 0.9.8 | |
xlwt | 1.2.0 | |
pandas-gbq | 1.2.0 | X |
See Dependencies and Optional dependencies for more.
Development changes#
- The minimum version of Cython is now the most recent bug-fix version (0.29.16) (GH 33334).
Deprecations#
- Lookups on a Series with a single-item list containing a slice (e.g. ser[[slice(0, 4)]]) are deprecated and will raise in a future version. Either convert the list to a tuple, or pass the slice directly instead (GH 31333)
- DataFrame.mean() and DataFrame.median() with numeric_only=None will include datetime64 and datetime64tz columns in a future version (GH 29941)
- Setting values with .loc using a positional slice is deprecated and will raise in a future version. Use .loc with labels or .iloc with positions instead (GH 31840)
- DataFrame.to_dict() has deprecated accepting short names for orient and will raise in a future version (GH 32515)
- Categorical.to_dense() is deprecated and will be removed in a future version, use np.asarray(cat) instead (GH 32639)
- The fastpath keyword in the SingleBlockManager constructor is deprecated and will be removed in a future version (GH 33092)
- Providing suffixes as a set in pandas.merge() is deprecated. Provide a tuple instead (GH 33740, GH 34741).
- Indexing a Series with a multi-dimensional indexer like [:, None] to return an ndarray now raises a FutureWarning. Convert to a NumPy array before indexing instead (GH 27837)
- Index.is_mixed() is deprecated and will be removed in a future version, check index.inferred_type directly instead (GH 32922)
- Passing any arguments but the first one to read_html() as positional arguments is deprecated. All other arguments should be given as keyword arguments (GH 27573).
- Passing any arguments but path_or_buf (the first one) to read_json() as positional arguments is deprecated. All other arguments should be given as keyword arguments (GH 27573).
- Passing any arguments but the first two to read_excel() as positional arguments is deprecated. All other arguments should be given as keyword arguments (GH 27573).
- pandas.api.types.is_categorical() is deprecated and will be removed in a future version; use pandas.api.types.is_categorical_dtype() instead (GH 33385)
- Index.get_value() is deprecated and will be removed in a future version (GH 19728)
- Series.dt.week and Series.dt.weekofyear are deprecated and will be removed in a future version, use Series.dt.isocalendar().week instead (GH 33595)
- DatetimeIndex.week and DatetimeIndex.weekofyear are deprecated and will be removed in a future version, use DatetimeIndex.isocalendar().week instead (GH 33595)
- DatetimeArray.week and DatetimeArray.weekofyear are deprecated and will be removed in a future version, use DatetimeArray.isocalendar().week instead (GH 33595)
- DateOffset.__call__() is deprecated and will be removed in a future version, use offset + other instead (GH 34171)
- apply_index() is deprecated and will be removed in a future version. Use offset + other instead (GH 34580)
- DataFrame.tshift() and Series.tshift() are deprecated and will be removed in a future version, use DataFrame.shift() and Series.shift() instead (GH 11631)
- Indexing an Index object with a float key is deprecated, and will raise an IndexError in the future. You can manually convert to an integer key instead (GH 34191).
- The squeeze keyword in groupby() is deprecated and will be removed in a future version (GH 32380)
- The tz keyword in Period.to_timestamp() is deprecated and will be removed in a future version; use per.to_timestamp(...).tz_localize(tz) instead (GH 34522)
- DatetimeIndex.to_perioddelta() is deprecated and will be removed in a future version. Use index - index.to_period(freq).to_timestamp() instead (GH 34853)
- DataFrame.melt() accepting a value_name that already exists is deprecated, and will be removed in a future version (GH 34731)
- The center keyword in the DataFrame.expanding() function is deprecated and will be removed in a future version (GH 20647)
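Several of the replacements named above are one-liners. The sketch below shows four of them on small made-up inputs; the deprecated spelling is given in a comment next to each replacement.

```python
import numpy as np
import pandas as pd

# Series.dt.week / weekofyear -> Series.dt.isocalendar().week
dates = pd.Series(pd.date_range("2020-01-01", periods=3, freq="D"))
weeks = dates.dt.isocalendar().week

# Categorical.to_dense() -> np.asarray(cat)
cat = pd.Categorical(["a", "b", "a"])
dense = np.asarray(cat)

# Period.to_timestamp(tz=...) -> to_timestamp(...).tz_localize(tz)
per = pd.Period("2020-01", freq="M")
stamp = per.to_timestamp().tz_localize("UTC")

# Series.tshift() -> Series.shift(freq=...): moves the index, not the data
ser = pd.Series([1, 2, 3], index=pd.date_range("2020-01-01", periods=3, freq="D"))
shifted = ser.shift(freq="D")

print(weeks.tolist())
print(dense.tolist())
print(stamp)
print(shifted)
```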
Performance improvements#
- Performance improvement in Timedelta constructor (GH 30543)
- Performance improvement in Timestamp constructor (GH 30543)
- Performance improvement in flex arithmetic ops between DataFrame and Series with axis=0 (GH 31296)
- Performance improvement in arithmetic ops between DataFrame and Series with axis=1 (GH 33600)
- The internal index method _shallow_copy() now copies cached attributes over to the new index, avoiding creating these again on the new index. This can speed up many operations that depend on creating copies of existing indexes (GH 28584, GH 32640, GH 32669)
- Significant performance improvement when creating a DataFrame with sparse values from scipy.sparse matrices using the DataFrame.sparse.from_spmatrix() constructor (GH 32821, GH 32825, GH 32826, GH 32856, GH 32858).
- Performance improvement for groupby methods Groupby.first() and Groupby.last() (GH 34178)
- Performance improvement in factorize() for nullable (integer and Boolean) dtypes (GH 33064).
- Performance improvement when constructing Categorical objects (GH 33921)
- Fixed performance regression in pandas.qcut() and pandas.cut() (GH 33921)
- Performance improvement in reductions (sum, prod, min, max) for nullable (integer and Boolean) dtypes (GH 30982, GH 33261, GH 33442).
- Performance improvement in arithmetic operations between two DataFrame objects (GH 32779)
- Performance improvement in RollingGroupby (GH 34052)
- Performance improvement in arithmetic operations (sub, add, mul, div) for MultiIndex (GH 34297)
- Performance improvement in DataFrame[bool_indexer] when bool_indexer is a list (GH 33924)
- Significant performance improvement of io.formats.style.Styler.render() with styles added in various ways such as io.formats.style.Styler.apply(), io.formats.style.Styler.applymap() or io.formats.style.Styler.bar() (GH 19917)
Bug fixes#
Categorical#
- Passing an invalid fill_value to Categorical.take() raises a ValueError instead of TypeError (GH 33660)
- Combining a Categorical that has integer categories and contains missing values with a float dtype column in operations such as concat() or append() will now result in a float column instead of an object dtype column (GH 33607)
- Bug where merge() was unable to join on non-unique categorical indices (GH 28189)
- Bug when passing categorical data to Index constructor along with dtype=object incorrectly returning a CategoricalIndex instead of object-dtype Index (GH 32167)
- Bug where Categorical comparison operator __ne__ would incorrectly evaluate to False when either element was missing (GH 32276)
- Categorical.fillna() now accepts Categorical other argument (GH 32420)
- Repr of Categorical was not distinguishing between int and str (GH 33676)
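As a small illustration of the __ne__ fix listed above, the sketch below compares two Categorical objects that share the same categories and contain missing values. The variable names are made up for the example; the point is only that positions where either side is missing compare as "not equal".

```python
import pandas as pd

# Two categoricals with identical categories; None marks a missing value.
left = pd.Categorical(["a", "b", None], categories=["a", "b"])
right = pd.Categorical(["a", None, None], categories=["a", "b"])

# Element-wise comparisons return boolean arrays.
not_equal = left != right
equal = left == right

print(not_equal)  # missing positions are reported as not equal
print(equal)      # and never as equal
```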
Datetimelike#
- Passing an integer dtype other than int64 to np.array(period_index, dtype=...) will now raise TypeError instead of incorrectly using int64 (GH 32255)
- Series.to_timestamp() now raises a TypeError if the axis is not a PeriodIndex. Previously an AttributeError was raised (GH 33327)
- Series.to_period() now raises a TypeError if the axis is not a DatetimeIndex. Previously an AttributeError was raised (GH 33327)
- Period no longer accepts tuples for the freq argument (GH 34658)
- Bug in Timestamp where constructing a Timestamp from ambiguous epoch time and calling constructor again changed the Timestamp.value() property (GH 24329)
- DatetimeArray.searchsorted(), TimedeltaArray.searchsorted(), PeriodArray.searchsorted() not recognizing non-pandas scalars and incorrectly raising ValueError instead of TypeError (GH 30950)
- Bug in Timestamp where constructing Timestamp with dateutil timezone less than 128 nanoseconds before daylight saving time switch from winter to summer would result in nonexistent time (GH 31043)
- Bug in Period.to_timestamp(), Period.start_time() with microsecond frequency returning a timestamp one nanosecond earlier than the correct time (GH 31475)
- Timestamp raised a confusing error message when year, month or day is missing (GH 31200)
- Bug in DatetimeIndex constructor incorrectly accepting bool-dtype inputs (GH 32668)
- Bug in DatetimeIndex.searchsorted() not accepting a list or Series as its argument (GH 32762)
- Bug where PeriodIndex() raised when passed a Series of strings (GH 26109)
- Bug in Timestamp arithmetic when adding or subtracting an np.ndarray with timedelta64 dtype (GH 33296)
- Bug in DatetimeIndex.to_period() not inferring the frequency when called with no arguments (GH 33358)
- Bug in DatetimeIndex.tz_localize() incorrectly retaining freq in some cases where the original freq is no longer valid (GH 30511)
- Bug in DatetimeIndex.intersection() losing freq and timezone in some cases (GH 33604)
- Bug in DatetimeIndex.get_indexer() where incorrect output would be returned for mixed datetime-like targets (GH 33741)
- Bug in DatetimeIndex addition and subtraction with some types of DateOffset objects incorrectly retaining an invalid freq attribute (GH 33779)
- Bug in DatetimeIndex where setting the freq attribute on an index could silently change the freq attribute on another index viewing the same data (GH 33552)
- DataFrame.min() and DataFrame.max() were not returning consistent results with Series.min() and Series.max() when called on objects initialized with empty pd.to_datetime()
- Bug in DatetimeIndex.intersection() and TimedeltaIndex.intersection() with results not having the correct name attribute (GH 33904)
- Bug in DatetimeArray.__setitem__(), TimedeltaArray.__setitem__(), PeriodArray.__setitem__() incorrectly allowing values with int64 dtype to be silently cast (GH 33717)
- Bug in subtracting TimedeltaIndex from Period incorrectly raising TypeError in some cases where it should succeed and IncompatibleFrequency in some cases where it should raise TypeError (GH 33883)
- Bug in constructing a Series or Index from a read-only NumPy array with non-ns resolution which converted to object dtype instead of coercing to datetime64[ns] dtype when within the timestamp bounds (GH 34843).
- The freq keyword in Period, date_range(), period_range(), pd.tseries.frequencies.to_offset() no longer allows tuples, pass as string instead (GH 34703)
- Bug in DataFrame.append() when appending a Series containing a scalar tz-aware Timestamp to an empty DataFrame resulted in an object column instead of datetime64[ns, tz] dtype (GH 35038)
- OutOfBoundsDatetime issues an improved error message when timestamp is out of implementation bounds. (GH 32967)
- Bug in AbstractHolidayCalendar.holidays() when no rules were defined (GH 31415)
- Bug in Tick comparisons raising TypeError when comparing against timedelta-like objects (GH 34088)
- Bug in Tick multiplication raising TypeError when multiplying by a float (GH 34486)
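The Series.to_timestamp() and Series.to_period() entries above can be checked with a short sketch. It assumes a Series with a default integer index (neither a PeriodIndex nor a DatetimeIndex) and records which exception type each call raises; the names ser and errors are illustrative only.

```python
import pandas as pd

# A Series with a plain RangeIndex, so neither conversion is applicable.
ser = pd.Series([1, 2, 3])

errors = {}
for name in ("to_timestamp", "to_period"):
    try:
        getattr(ser, name)()
    except TypeError as exc:
        # Since pandas 1.1.0 a TypeError is expected here.
        errors[name] = type(exc).__name__

print(errors)
```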
Timedelta#
- Bug in constructing a Timedelta with a high precision integer that would round the Timedelta components (GH 31354)
- Bug in dividing np.nan or None by Timedelta incorrectly returning NaT (GH 31869)
- Timedelta now understands µs as an identifier for microsecond (GH 32899)
- Timedelta string representation now includes nanoseconds, when nanoseconds are non-zero (GH 9309)
- Bug in comparing a Timedelta object against an np.ndarray with timedelta64 dtype incorrectly viewing all entries as unequal (GH 33441)
- Bug in timedelta_range() that produced an extra point on an edge case (GH 30353, GH 33498)
- Bug in DataFrame.resample() that produced an extra point on an edge case (GH 30353, GH 13022, GH 33498)
- Bug in DataFrame.resample() that ignored the loffset argument when dealing with timedelta (GH 7687, GH 33498)
- Bug in Timedelta and pandas.to_timedelta() that ignored the unit argument for string input (GH 12136)
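Two of the Timedelta entries above lend themselves to a quick check: the string representation with a non-zero nanosecond component, and element-wise comparison against a NumPy timedelta64 array. This is a minimal sketch with made-up variable names, not a test taken from the pandas suite.

```python
import numpy as np
import pandas as pd

# A Timedelta with a non-zero nanosecond component.
td = pd.Timedelta(nanoseconds=1)
text = str(td)  # the nanosecond digits appear in the string form

# Comparing a Timedelta against a timedelta64 array is element-wise.
one_second = pd.Timedelta("1s")
arr = np.array([1, 2], dtype="timedelta64[s]")
matches = one_second == arr

print(text)
print(matches)
```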
Timezones#
- Bug in to_datetime() with infer_datetime_format=True where timezone names (e.g. UTC) would not be parsed correctly (GH 33133)
Numeric#
- Bug in DataFrame.floordiv() with axis=0 not treating division-by-zero like Series.floordiv() (GH 31271)
- Bug in to_numeric() with string argument "uint64" and errors="coerce" silently fails (GH 32394)
- Bug in to_numeric() with downcast="unsigned" fails for empty data (GH 32493)
- Bug in DataFrame.mean() with numeric_only=False and either datetime64 dtype or PeriodDtype column incorrectly raising TypeError (GH 32426)
- Bug in DataFrame.count() with level="foo" and index level "foo" containing NaNs causes segmentation fault (GH 21824)
- Bug in DataFrame.diff() with axis=1 returning incorrect results with mixed dtypes (GH 32995)
- Bug in DataFrame.corr() and DataFrame.cov() raising when handling nullable integer columns with pandas.NA (GH 33803)
- Bug in arithmetic operations between DataFrame objects with non-overlapping columns with duplicate labels causing an infinite loop (GH 35194)
- Bug in DataFrame and Series addition and subtraction between object-dtype objects and datetime64 dtype objects (GH 33824)
- Bug in Index.difference() giving incorrect results when comparing a Float64Index and object Index (GH 35217)
- Bug in DataFrame reductions (e.g. df.min(), df.max()) with ExtensionArray dtypes (GH 34520, GH 32651)
- Series.interpolate() and DataFrame.interpolate() now raise a ValueError if limit_direction is 'forward' or 'both' and method is 'backfill' or 'bfill' or limit_direction is 'backward' or 'both' and method is 'pad' or 'ffill' (GH 34746)
Conversion#
- Bug in Series construction from NumPy array with big-endian datetime64 dtype (GH 29684)
- Bug in Timedelta construction with large nanoseconds keyword value (GH 32402)
- Bug in DataFrame construction where sets would be duplicated rather than raising (GH 32582)
- The DataFrame constructor no longer accepts a list of DataFrame objects. Because of changes to NumPy, DataFrame objects are now consistently treated as 2D objects, so a list of DataFrame objects is considered 3D, and no longer acceptable for the DataFrame constructor (GH 32289).
- Bug in DataFrame when initializing a frame with lists and assigning columns with a nested list for MultiIndex (GH 32173)
- Improved error message for invalid construction of list when creating a new index (GH 35190)
Strings#
- Bug in the astype() method when converting “string” dtype data to nullable integer dtype (GH 32450).
- Fixed issue where taking min or max of a StringArray or Series with StringDtype type would raise. (GH 31746)
- Bug in Series.str.cat() returning NaN output when other had Index type (GH 33425)
- pandas.api.dtypes.is_string_dtype() no longer incorrectly identifies categorical series as string.
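A short sketch of two of the fixes above: concatenating a Series with an Index through Series.str.cat(), and taking min and max of a Series with StringDtype. The data and variable names are invented for illustration.

```python
import pandas as pd

# str.cat with an Index as `others` concatenates position by position.
ser = pd.Series(["a", "b"])
joined = ser.str.cat(pd.Index(["x", "y"]))

# min/max on a Series backed by the "string" extension dtype.
strings = pd.Series(["pear", "apple", "fig"], dtype="string")
smallest = strings.min()
largest = strings.max()

print(joined.tolist())
print(smallest, largest)
```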
Interval#
- Bug in IntervalArray incorrectly allowing the underlying data to be changed when setting values (GH 32782)
Indexing#
- DataFrame.xs() now raises a TypeError if a level keyword is supplied and the axis is not a MultiIndex. Previously an AttributeError was raised (GH 33610)
- Bug in slicing on a DatetimeIndex with a partial-timestamp dropping high-resolution indices near the end of a year, quarter, or month (GH 31064)
- Bug in PeriodIndex.get_loc() treating higher-resolution strings differently from PeriodIndex.get_value() (GH 31172)
- Bug in Series.at() and DataFrame.at() not matching .loc behavior when looking up an integer in a Float64Index (GH 31329)
- Bug in PeriodIndex.is_monotonic() incorrectly returning True when containing leading NaT entries (GH 31437)
- Bug in DatetimeIndex.get_loc() raising KeyError with converted-integer key instead of the user-passed key (GH 31425)
- Bug in Series.xs() incorrectly returning Timestamp instead of datetime64 in some object-dtype cases (GH 31630)
- Bug in DataFrame.iat() incorrectly returning Timestamp instead of datetime in some object-dtype cases (GH 32809)
- Bug in DataFrame.at() when either columns or index is non-unique (GH 33041)
- Bug in Series.loc() and DataFrame.loc() when indexing with an integer key on an object-dtype Index that is not all-integers (GH 31905)
- Bug in DataFrame.iloc.__setitem__() on a DataFrame with duplicate columns incorrectly setting values for all matching columns (GH 15686, GH 22036)
- Bug in DataFrame.loc() and Series.loc() with a DatetimeIndex, TimedeltaIndex, or PeriodIndex incorrectly allowing lookups of non-matching datetime-like dtypes (GH 32650)
- Bug in Series.__getitem__() indexing with non-standard scalars, e.g. np.dtype (GH 32684)
- Bug in Index constructor where an unhelpful error message was raised for NumPy scalars (GH 33017)
- Bug in DataFrame.lookup() incorrectly raising an AttributeError when frame.index or frame.columns is not unique; this will now raise a ValueError with a helpful error message (GH 33041)
- Bug in Interval where a Timedelta could not be added or subtracted from a Timestamp interval (GH 32023)
- Bug in DataFrame.copy() not invalidating _item_cache after copy caused post-copy value updates to not be reflected (GH 31784)
- Fixed regression in DataFrame.loc() and Series.loc() throwing an error when a datetime64[ns, tz] value is provided (GH 32395)
- Bug in Series.__getitem__() with an integer key and a MultiIndex with leading integer level failing to raise KeyError if the key is not present in the first level (GH 33355)
- Bug in DataFrame.iloc() when slicing a single column DataFrame with ExtensionDtype (e.g. df.iloc[:, :1]) returning an invalid result (GH 32957)
- Bug in DatetimeIndex.insert() and TimedeltaIndex.insert() causing index freq to be lost when setting an element into an empty Series (GH 33573)
- Bug in Series.__setitem__() with an IntervalIndex and a list-like key of integers (GH 33473)
- Bug in Series.__getitem__() allowing missing labels with np.ndarray, Index, Series indexers but not list, these now all raise KeyError (GH 33646)
- Bug in DataFrame.truncate() and Series.truncate() where index was assumed to be monotone increasing (GH 33756)
- Indexing with a list of strings representing datetimes failed on DatetimeIndex or PeriodIndex (GH 11278)
- Bug in Series.at() when used with a MultiIndex would raise an exception on valid inputs (GH 26989)
- Bug in DataFrame.loc() with dictionary of values changes columns with dtype of int to float (GH 34573)
- Bug in Series.loc() when used with a MultiIndex would raise an IndexingError when accessing a None value (GH 34318)
- Bug in DataFrame.reset_index() and Series.reset_index() would not preserve data types on an empty DataFrame or Series with a MultiIndex (GH 19602)
- Bug in Series and DataFrame indexing with a time key on a DatetimeIndex with NaT entries (GH 35114)
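Two of the indexing changes above are easy to reproduce: DataFrame.xs() with a level keyword on a non-MultiIndex axis, and Series.__getitem__() with list-like indexers that contain a missing label. The sketch below uses invented data and simply records which exceptions are raised.

```python
import numpy as np
import pandas as pd

# xs() with `level` on a flat (non-MultiIndex) index.
df = pd.DataFrame({"x": [1, 2]}, index=["a", "b"])
try:
    df.xs("a", level=0)
    xs_error = None
except TypeError as exc:
    xs_error = exc

# Missing label "z" in a list, an ndarray and an Index indexer.
ser = pd.Series([10, 20], index=["a", "b"])
raised = []
for indexer in (["a", "z"], np.array(["a", "z"]), pd.Index(["a", "z"])):
    try:
        ser[indexer]
    except KeyError:
        raised.append(type(indexer).__name__)

print(type(xs_error).__name__)
print(raised)
```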
Missing#
- Calling fillna() on an empty Series now correctly returns a shallow copied object. The behaviour is now consistent with Index, DataFrame and a non-empty Series (GH 32543).
- Bug in Series.replace() when argument to_replace is of type dict/list and is used on a Series containing <NA> was raising a TypeError. The method now handles this by ignoring <NA> values when doing the comparison for the replacement (GH 32621)
- Bug in any() and all() incorrectly returning <NA> for all False or all True values using the nullable Boolean dtype and with skipna=False (GH 33253)
- Clarified documentation on interpolate with method=akima. The der parameter must be scalar or None (GH 33426)
- DataFrame.interpolate() uses the correct axis convention now. Previously interpolating along columns led to interpolation along indices and vice versa. Furthermore, interpolating with methods pad, ffill, bfill and backfill is identical to using these methods with DataFrame.fillna() (GH 12918, GH 29146)
- Bug in DataFrame.interpolate() when called on a DataFrame with column names of string type was throwing a ValueError. The method is now independent of the type of the column names (GH 33956)
- Passing NA into a format string using format specs will now work. For example "{:.1f}".format(pd.NA) would previously raise a ValueError, but will now return the string "<NA>" (GH 34740)
- Bug in Series.map() not raising on invalid na_action (GH 32815)
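The format-string entry and the any()/all() entry above can be sketched together. The example assumes nullable Boolean arrays without missing values, which is the case the any()/all() fix describes; variable names are illustrative.

```python
import pandas as pd

# A float format spec applied to pd.NA falls back to its repr.
formatted = "{:.1f}".format(pd.NA)

# With no missing values present, skipna=False should not produce <NA>.
all_true = pd.array([True, True], dtype="boolean").all(skipna=False)
any_false = pd.array([False, False], dtype="boolean").any(skipna=False)

print(formatted)
print(all_true, any_false)
```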
MultiIndex#
- DataFrame.swaplevel() now raises a TypeError if the axis is not a MultiIndex. Previously an AttributeError was raised (GH 31126)
- Bug in DataFrame.loc() when used with a MultiIndex. The returned values were not in the same order as the given inputs (GH 22797)
In [79]: df = pd.DataFrame(np.arange(4),
   ....:                   index=[["a", "a", "b", "b"], [1, 2, 1, 2]])
   ....:
# Rows are now ordered as the requested keys
In [80]: df.loc[(['b', 'a'], [2, 1]), :]
Out[80]:
     0
b 2  3
  1  2
a 2  1
  1  0

[4 rows x 1 columns]
- Bug in MultiIndex.intersection() was not guaranteed to preserve order when sort=False. (GH 31325)
In [81]: left = pd.MultiIndex.from_arrays([["b", "a"], [2, 1]])
In [82]: right = pd.MultiIndex.from_arrays([["a", "b", "c"], [1, 2, 3]])
# Common elements are now guaranteed to be ordered by the left side
In [83]: left.intersection(right, sort=False)
Out[83]:
MultiIndex([('b', 2),
            ('a', 1)],
           )
- Bug in DataFrame.truncate() was dropping MultiIndex names. (GH 34564)
- Bug when joining two MultiIndex without specifying level with different columns. Return-indexers parameter was ignored. (GH 34074)
IO#
- Passing a set as names argument to pandas.read_csv(), pandas.read_table(), or pandas.read_fwf() will raise ValueError: Names should be an ordered collection. (GH 34946)
- Bug in print-out when display.precision is zero. (GH 20359)
- Bug in read_json() where integer overflow was occurring when json contains big number strings. (GH 30320)
- read_csv() will now raise a ValueError when the arguments header and prefix both are not None. (GH 27394)
- Bug in DataFrame.to_json() was raising NotFoundError when path_or_buf was an S3 URI (GH 28375)
- Bug in DataFrame.to_parquet() overwriting pyarrow’s default for coerce_timestamps; following pyarrow’s default allows writing nanosecond timestamps with version="2.0" (GH 31652).
- Bug in read_csv() was raising TypeError when sep=None was used in combination with comment keyword (GH 31396)
- Bug in HDFStore that caused it to set to int64 the dtype of a datetime64 column when reading a DataFrame in Python 3 from fixed format written in Python 2 (GH 31750)
- read_sas() now handles dates and datetimes larger than Timestamp.max returning them as datetime.datetime objects (GH 20927)
- Bug in DataFrame.to_json() where Timedelta objects would not be serialized correctly with date_format="iso" (GH 28256)
- read_csv() will raise a ValueError when the column names passed in parse_dates are missing in the DataFrame (GH 31251)
- Bug in read_excel() where a UTF-8 string with a high surrogate would cause a segmentation violation (GH 23809)
- Bug in read_csv() was causing a file descriptor leak on an empty file (GH 31488)
- Bug in read_csv() was causing a segfault when there were blank lines between the header and data rows (GH 28071)
- Bug in read_csv() was raising a misleading exception on a permissions issue (GH 23784)
- Bug in read_csv() was raising an IndexError when header=None and there were two extra data columns
- Bug in read_sas() was raising an AttributeError when reading files from Google Cloud Storage (GH 33069)
- Bug in DataFrame.to_sql() where an AttributeError was raised when saving an out of bounds date (GH 26761)
- Bug in read_excel() did not correctly handle multiple embedded spaces in OpenDocument text cells. (GH 32207)
- Bug in read_json() was raising TypeError when reading a list of Booleans into a Series. (GH 31464)
- Bug in pandas.io.json.json_normalize() where location specified by record_path doesn’t point to an array. (GH 26284)
- pandas.read_hdf() has a more explicit error message when loading an unsupported HDF file (GH 9539)
- Bug in read_feather() was raising an ArrowIOError when reading an s3 or http file path (GH 29055)
- Bug in to_excel() could not handle the column name render and was raising a KeyError (GH 34331)
- Bug in execute() was raising a ProgrammingError for some DB-API drivers when the SQL statement contained the % character and no parameters were present (GH 34211)
- Bug in StataReader() which resulted in categorical variables with different dtypes when reading data using an iterator. (GH 31544)
- HDFStore.keys() now has an optional include parameter that allows the retrieval of all native HDF5 table names (GH 29916)
- TypeError exceptions raised by read_csv() and read_table() were showing as parser_f when an unexpected keyword argument was passed (GH 25648)
- Bug in read_excel() for ODS files removes 0.0 values (GH 27222)
- Bug in ujson.encode() was raising an OverflowError with numbers larger than sys.maxsize (GH 34395)
- Bug in HDFStore.append_to_multiple() was raising a ValueError when the min_itemsize parameter is set (GH 11238)
- Bug in create_table() now raises an error when column argument was not specified in data_columns on input (GH 28156)
- read_json() can now read a line-delimited JSON file from a file URL when lines and chunksize are set.
- Bug in DataFrame.to_sql() when reading DataFrames with -np.inf entries with MySQL now has a more explicit ValueError (GH 34431)
- Bug where capitalised file extensions were not decompressed by read_* functions (GH 35164)
- Bug in read_excel() that was raising a TypeError when header=None and index_col is given as a list (GH 31783)
- Bug in read_excel() where datetime values are used in the header in a MultiIndex (GH 34748)
- read_excel() no longer takes **kwds arguments. This means that passing in the keyword argument chunksize now raises a TypeError (previously raised a NotImplementedError), while passing in the keyword argument encoding now raises a TypeError (GH 34464)
- Bug in DataFrame.to_records() was incorrectly losing timezone information in timezone-aware datetime64 columns (GH 32535)
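The first IO entry above (a set passed as names) can be illustrated with an in-memory CSV. The sketch below is only an example with made-up data; it captures the error message for a set and then reads the same text with an ordered list of names.

```python
import io

import pandas as pd

csv_text = "1,2\n3,4\n"

# A set has no defined order, so read_csv rejects it as `names`.
try:
    pd.read_csv(io.StringIO(csv_text), names={"a", "b"})
    message = None
except ValueError as exc:
    message = str(exc)

# An ordered collection such as a list is accepted.
df = pd.read_csv(io.StringIO(csv_text), names=["a", "b"])

print(message)
print(df)
```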
Plotting#
- DataFrame.plot() for line/bar now accepts color by dictionary (GH 8193).
- Bug in DataFrame.plot.hist() where weights are not working for multiple columns (GH 33173)
- Bug in DataFrame.boxplot() and DataFrame.plot.boxplot() lost color attributes of medianprops, whiskerprops, capprops and boxprops (GH 30346)
- Bug in DataFrame.hist() where the order of column argument was ignored (GH 29235)
- Bug in DataFrame.plot.scatter() that when adding multiple plots with different cmap, colorbars always use the first cmap (GH 33389)
- Bug in DataFrame.plot.scatter() was adding a colorbar to the plot even if the argument c was assigned to a column containing color names (GH 34316)
- Bug in pandas.plotting.bootstrap_plot() was causing cluttered axes and overlapping labels (GH 34905)
- Bug in DataFrame.plot.scatter() caused an error when plotting variable marker sizes (GH 32904)
GroupBy/resample/rolling#
- Using a pandas.api.indexers.BaseIndexer with count, min, max, median, skew, cov, corr will now return correct results for any monotonic pandas.api.indexers.BaseIndexer descendant (GH 32865)
- DataFrameGroupby.mean() and SeriesGroupby.mean() (and similarly for median(), std() and var()) now raise a TypeError if a non-accepted keyword argument is passed into it. Previously an UnsupportedFunctionCall was raised (AssertionError if min_count passed into median()) (GH 31485)
- Bug in DataFrameGroupBy.apply() and SeriesGroupBy.apply() raising ValueError when the by axis is not sorted, has duplicates, and the applied func does not mutate passed in objects (GH 30667)
- Bug in DataFrameGroupBy.transform() produces an incorrect result with transformation functions (GH 30918)
- Bug in DataFrameGroupBy.transform() and SeriesGroupBy.transform() were returning the wrong result when grouping by multiple keys of which some were categorical and others not (GH 32494)
- Bug in DataFrameGroupBy.count() and SeriesGroupBy.count() causing segmentation fault when grouped-by columns contain NaNs (GH 32841)
- Bug in DataFrame.groupby() and Series.groupby() produces inconsistent type when aggregating Boolean Series (GH 32894)
- Bug in DataFrameGroupBy.sum() and SeriesGroupBy.sum() where a large negative number would be returned when the number of non-null values was below min_count for nullable integer dtypes (GH 32861)
- Bug in SeriesGroupBy.quantile() was raising on nullable integers (GH 33136)
- Bug in DataFrame.resample() where an AmbiguousTimeError would be raised when the resulting timezone aware DatetimeIndex had a DST transition at midnight (GH 25758)
- Bug in DataFrame.groupby() where a ValueError would be raised when grouping by a categorical column with read-only categories and sort=False (GH 33410)
- Bug in DataFrameGroupBy.agg(), SeriesGroupBy.agg(), DataFrameGroupBy.transform(), SeriesGroupBy.transform(), DataFrameGroupBy.resample(), and SeriesGroupBy.resample() where subclasses are not preserved (GH 28330)
- Bug in SeriesGroupBy.agg() where any column name was accepted in the named aggregation of SeriesGroupBy previously. The behaviour now allows only str and callables; anything else raises a TypeError. (GH 34422)
- Bug in DataFrame.groupby() lost the name of the Index when one of the agg keys referenced an empty list (GH 32580)
- Bug in Rolling.apply() where center=True was ignored when engine='numba' was specified (GH 34784)
- Bug in DataFrame.ewm.cov() was throwing AssertionError for MultiIndex inputs (GH 34440)
- Bug in core.groupby.DataFrameGroupBy.quantile() raised TypeError for non-numeric types rather than dropping the columns (GH 27892)
- Bug in core.groupby.DataFrameGroupBy.transform() when func='nunique' and columns are of type datetime64, the result would also be of type datetime64 instead of int64 (GH 35109)
- Bug in DataFrame.groupby() raising an AttributeError when selecting a column and aggregating with as_index=False (GH 35246).
- Bug in DataFrameGroupBy.first() and DataFrameGroupBy.last() that would raise an unnecessary ValueError when grouping on multiple Categoricals (GH 34951)
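The transform() entry with func='nunique' above can be sketched on a tiny frame with a datetime64 column. The column names key and when are invented for the example; the expectation is simply that the per-group counts come back as integers rather than datetimes.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "key": ["x", "x", "y"],
        "when": pd.to_datetime(["2020-01-01", "2020-01-02", "2020-01-01"]),
    }
)

# Broadcast the number of distinct timestamps per group back to each row.
counts = df.groupby("key").transform("nunique")["when"]

print(counts)
print(counts.dtype)
```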
Reshaping#
- Bug affecting all numeric and Boolean reduction methods not returning subclassed data type. (GH 25596)
- Bug in DataFrame.pivot_table() when only MultiIndexed columns is set (GH 17038)
- Bug in DataFrame.unstack() and Series.unstack() can take tuple names in MultiIndexed data (GH 19966)
- Bug in DataFrame.pivot_table() when margin is True and only column is defined (GH 31016)
- Fixed incorrect error message in DataFrame.pivot() when columns is set to None. (GH 30924)
- Bug in crosstab() when inputs are two Series and have tuple names, the output will keep a dummy MultiIndex as columns. (GH 18321)
- DataFrame.pivot() can now take lists for index and columns arguments (GH 21425)
- Bug in concat() where the resulting indices are not copied when copy=True (GH 29879)
- Bug in SeriesGroupBy.aggregate() was resulting in aggregations being overwritten when they shared the same name (GH 30880)
- Bug where Index.astype() would lose the name attribute when converting from Float64Index to Int64Index, or when casting to an ExtensionArray dtype (GH 32013)
- Series.append() will now raise a TypeError when passed a DataFrame or a sequence containing DataFrame (GH 31413)
- DataFrame.replace() and Series.replace() will raise a TypeError if to_replace is not an expected type. Previously the replace would fail silently (GH 18634)
- Bug on inplace operation of a Series that was adding a column to the DataFrame from where it was originally dropped from (using inplace=True) (GH 30484)
- Bug in DataFrame.apply() where callback was called with Series parameter even though raw=True requested. (GH 32423)
- Bug in DataFrame.pivot_table() losing timezone information when creating a MultiIndex level from a column with timezone-aware dtype (GH 32558)
- Bug in concat() where passing a non-dict mapping as objs would raise a TypeError (GH 32863)
- DataFrame.agg() now provides more descriptive SpecificationError message when attempting to aggregate a non-existent column (GH 32755)
- Bug in DataFrame.unstack() when MultiIndex columns and MultiIndex rows were used (GH 32624, GH 24729 and GH 28306)
- Appending a dictionary to a DataFrame without passing ignore_index=True will raise TypeError: Can only append a dict if ignore_index=True instead of TypeError: Can only append a Series if ignore_index=True or if the Series has a name (GH 30871)
- Bug in DataFrame.corrwith(), DataFrame.memory_usage(), DataFrame.dot(), DataFrame.idxmin(), DataFrame.idxmax(), DataFrame.duplicated(), DataFrame.isin(), DataFrame.count(), Series.explode(), Series.asof() and DataFrame.asof() not returning subclassed types. (GH 31331)
- Bug in concat() was not allowing for concatenation of DataFrame and Series with duplicate keys (GH 33654)
- Bug in cut() raised an error when the argument labels contains duplicates (GH 33141)
- Ensure only named functions can be used in eval() (GH 32460)
- Bug in DataFrame.aggregate() and Series.aggregate() was causing a recursive loop in some cases (GH 34224)
- Fixed bug in melt() where melting MultiIndex columns with col_level > 0 would raise a KeyError on id_vars (GH 34129)
- Bug in Series.where() with an empty Series and empty cond having non-bool dtype (GH 34592)
- Fixed regression where DataFrame.apply() would raise ValueError for elements with S dtype (GH 34529)
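The DataFrame.pivot() entry above (lists accepted for index and columns) can be illustrated as follows. The frame and its column names are invented for the example; passing a list to index yields a MultiIndex on the rows of the result.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "lev1": [1, 1, 2, 2],
        "lev2": [1, 2, 1, 2],
        "col": ["x", "y", "x", "y"],
        "val": [10, 20, 30, 40],
    }
)

# A list for `index` produces a two-level row index in the wide result.
wide = df.pivot(index=["lev1", "lev2"], columns="col", values="val")

print(wide)
```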
Sparse#
- Creating a SparseArray from timezone-aware dtype will issue a warning before dropping timezone information, instead of doing so silently (GH 32501)
- Bug in arrays.SparseArray.from_spmatrix() wrongly read scipy sparse matrix (GH 31991)
- Bug in Series.sum() with SparseArray raised a TypeError (GH 25777)
- Bug where DataFrame containing an all-sparse SparseArray filled with NaN when indexed by a list-like (GH 27781, GH 29563)
- The repr of SparseDtype now includes the repr of its fill_value attribute. Previously it used fill_value’s string representation (GH 34352)
- Bug where empty DataFrame could not be cast to SparseDtype (GH 33113)
- Bug in arrays.SparseArray() was returning the incorrect type when indexing a sparse dataframe with an iterable (GH 34526, GH 34540)
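A brief sketch of two of the Sparse entries above: the SparseDtype repr with a string fill_value, where repr and str of the fill value differ visibly, and Series.sum() on sparse data. The object subtype and the fill value "a" are assumptions chosen only to make the quoting visible.

```python
import pandas as pd

# With a string fill_value, repr() of the fill value includes the quotes.
dtype = pd.SparseDtype(object, fill_value="a")
dtype_repr = repr(dtype)

# Summing a Series backed by a SparseArray.
total = pd.Series(pd.arrays.SparseArray([1, 0, 2])).sum()

print(dtype_repr)
print(total)
```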
ExtensionArray#
- Fixed bug where Series.value_counts() would raise on empty input of Int64 dtype (GH 33317)
- Fixed bug in concat() when concatenating DataFrame objects with non-overlapping columns resulting in object-dtype columns rather than preserving the extension dtype (GH 27692, GH 33027)
- Fixed bug where StringArray.isna() would return False for NA values when pandas.options.mode.use_inf_as_na was set to True (GH 33655)
- Fixed bug in Series construction with EA dtype and index but no data or scalar data fails (GH 26469)
- Fixed bug that caused Series.__repr__() to crash for extension types whose elements are multidimensional arrays (GH 33770).
- Fixed bug where Series.update() would raise a ValueError for ExtensionArray dtypes with missing values (GH 33980)
- Fixed bug where StringArray.memory_usage() was not implemented (GH 33963)
- Fixed bug where DataFrameGroupBy() would ignore the min_count argument for aggregations on nullable Boolean dtypes (GH 34051)
- Fixed bug where the constructor of DataFrame with dtype='string' would fail (GH 27953, GH 33623)
- Bug where DataFrame column set to scalar extension type was considered an object type rather than the extension type (GH 34832)
- Fixed bug in IntegerArray.astype() to correctly copy the mask as well (GH 34931).
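Two of the ExtensionArray fixes above can be exercised in a few lines: constructing a DataFrame with dtype='string', and calling value_counts() on an empty Int64 Series. The data is invented, and the sketch only shows that both calls complete.

```python
import pandas as pd

# DataFrame constructor with the "string" extension dtype.
df = pd.DataFrame({"a": ["x", "y"]}, dtype="string")

# value_counts() on an empty nullable-integer Series.
empty_counts = pd.Series([], dtype="Int64").value_counts()

print(df.dtypes)
print(len(empty_counts))
```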
Other#
- Set operations on an object-dtype Index now always return object-dtype results (GH 31401)
- Fixed pandas.testing.assert_series_equal() to correctly raise if the left argument is a different subclass with check_series_type=True (GH 32670).
- Getting a missing attribute in a DataFrame.query() or DataFrame.eval() string raises the correct AttributeError (GH 32408)
- Fixed bug in pandas.testing.assert_series_equal() where dtypes were checked for Interval and ExtensionArray operands when check_dtype was False (GH 32747)
- Bug in DataFrame.__dir__() caused a segfault when using unicode surrogates in a column name (GH 25509)
- Bug in DataFrame.equals() and Series.equals() in allowing subclasses to be equal (GH 34402).
Contributors#
A total of 368 people contributed patches to this release. People with a “+” by their names contributed a patch for the first time.
- 3vts +
- A Brooks +
- Abbie Popa +
- Achmad Syarif Hidayatullah +
- Adam W Bagaskarta +
- Adrian Mastronardi +
- Aidan Montare +
- Akbar Septriyan +
- Akos Furton +
- Alejandro Hall +
- Alex Hall +
- Alex Itkes +
- Alex Kirko
- Ali McMaster +
- Alvaro Aleman +
- Amy Graham +
- Andrew Schonfeld +
- Andrew Shumanskiy +
- Andrew Wieteska +
- Angela Ambroz
- Anjali Singh +
- Anna Daglis
- Anthony Milbourne +
- Antony Lee +
- Ari Sosnovsky +
- Arkadeep Adhikari +
- Arunim Samudra +
- Ashkan +
- Ashwin Prakash Nalwade +
- Ashwin Srinath +
- Atsushi Nukariya +
- Ayappan +
- Ayla Khan +
- Bart +
- Bart Broere +
- Benjamin Beier Liu +
- Benjamin Fischer +
- Bharat Raghunathan
- Bradley Dice +
- Brendan Sullivan +
- Brian Strand +
- Carsten van Weelden +
- Chamoun Saoma +
- ChrisRobo +
- Christian Chwala
- Christopher Whelan
- Christos Petropoulos +
- Chuanzhu Xu
- CloseChoice +
- Clément Robert +
- CuylenE +
- DanBasson +
- Daniel Saxton
- Danilo Horta +
- DavaIlhamHaeruzaman +
- Dave Hirschfeld
- Dave Hughes
- David Rouquet +
- David S +
- Deepyaman Datta
- Dennis Bakhuis +
- Derek McCammond +
- Devjeet Roy +
- Diane Trout
- Dina +
- Dom +
- Drew Seibert +
- EdAbati
- Emiliano Jordan +
- Erfan Nariman +
- Eric Groszman +
- Erik Hasse +
- Erkam Uyanik +
- Evan D +
- Evan Kanter +
- Fangchen Li +
- Farhan Reynaldo +
- Farhan Reynaldo Hutabarat +
- Florian Jetter +
- Fred Reiss +
- GYHHAHA +
- Gabriel Moreira +
- Gabriel Tutui +
- Galuh Sahid
- Gaurav Chauhan +
- George Hartzell +
- Gim Seng +
- Giovanni Lanzani +
- Gordon Chen +
- Graham Wetzler +
- Guillaume Lemaitre
- Guillem Sánchez +
- HH-MWB +
- Harshavardhan Bachina
- How Si Wei
- Ian Eaves
- Iqrar Agalosi Nureyza +
- Irv Lustig
- Iva Laginja +
- JDkuba
- Jack Greisman +
- Jacob Austin +
- Jacob Deppen +
- Jacob Peacock +
- Jake Tae +
- Jake Vanderplas +
- James Cobon-Kerr
- Jan Červenka +
- Jan Škoda
- Jane Chen +
- Jean-Francois Zinque +
- Jeanderson Barros Candido +
- Jeff Reback
- Jered Dominguez-Trujillo +
- Jeremy Schendel
- Jesse Farnham
- Jiaxiang
- Jihwan Song +
- Joaquim L. Viegas +
- Joel Nothman
- John Bodley +
- John Paton +
- Jon Thielen +
- Joris Van den Bossche
- Jose Manuel Martí +
- Joseph Gulian +
- Josh Dimarsky
- Joy Bhalla +
- João Veiga +
- Julian de Ruiter +
- Justin Essert +
- Justin Zheng
- KD-dev-lab +
- Kaiqi Dong
- Karthik Mathur +
- Kaushal Rohit +
- Kee Chong Tan
- Ken Mankoff +
- Kendall Masse
- Kenny Huynh +
- Ketan +
- Kevin Anderson +
- Kevin Bowey +
- Kevin Sheppard
- Kilian Lieret +
- Koki Nishihara +
- Krishna Chivukula +
- KrishnaSai2020 +
- Lesley +
- Lewis Cowles +
- Linda Chen +
- Linxiao Wu +
- Lucca Delchiaro Costabile +
- MBrouns +
- Mabel Villalba
- Mabroor Ahmed +
- Madhuri Palanivelu +
- Mak Sze Chun
- Malcolm +
- Marc Garcia
- Marco Gorelli
- Marian Denes +
- Martin Bjeldbak Madsen +
- Martin Durant +
- Martin Fleischmann +
- Martin Jones +
- Martin Winkel
- Martina Oefelein +
- Marvzinc +
- María Marino +
- Matheus Cardoso +
- Mathis Felardos +
- Matt Roeschke
- Matteo Felici +
- Matteo Santamaria +
- Matthew Roeschke
- Matthias Bussonnier
- Max Chen
- Max Halford +
- Mayank Bisht +
- Megan Thong +
- Michael Marino +
- Miguel Marques +
- Mike Kutzma
- Mohammad Hasnain Mohsin Rajan +
- Mohammad Jafar Mashhadi +
- MomIsBestFriend
- Monica +
- Natalie Jann
- Nate Armstrong +
- Nathanael +
- Nick Newman +
- Nico Schlömer +
- Niklas Weber +
- ObliviousParadigm +
- Olga Lyashevska +
- OlivierLuG +
- Pandas Development Team
- Parallels +
- Patrick +
- Patrick Cando +
- Paul Lilley +
- Paul Sanders +
- Pearcekieser +
- Pedro Larroy +
- Pedro Reys
- Peter Bull +
- Peter Steinbach +
- Phan Duc Nhat Minh +
- Phil Kirlin +
- Pierre-Yves Bourguignon +
- Piotr Kasprzyk +
- Piotr Niełacny +
- Prakhar Pandey
- Prashant Anand +
- Puneetha Pai +
- Quang Nguyễn +
- Rafael Jaimes III +
- Rafif +
- RaisaDZ +
- Rakshit Naidu +
- Ram Rachum +
- Red +
- Ricardo Alanis +
- Richard Shadrach +
- Rik-de-Kort
- Robert de Vries
- Robin to Roxel +
- Roger Erens +
- Rohith295 +
- Roman Yurchak
- Ror +
- Rushabh Vasani
- Ryan
- Ryan Nazareth
- SAI SRAVAN MEDICHERLA +
- SHUBH CHATTERJEE +
- Sam Cohan
- Samira-g-js +
- Sandu Ursu +
- Sang Agung +
- SanthoshBala18 +
- Sasidhar Kasturi +
- SatheeshKumar Mohan +
- Saul Shanabrook
- Scott Gigante +
- Sebastian Berg +
- Sebastián Vanrell
- Sergei Chipiga +
- Sergey +
- ShilpaSugan +
- Simon Gibbons
- Simon Hawkins
- Simon Legner +
- Soham Tiwari +
- Song Wenhao +
- Souvik Mandal
- Spencer Clark
- Steffen Rehberg +
- Steffen Schmitz +
- Stijn Van Hoey
- Stéphan Taljaard
- SultanOrazbayev +
- Sumanau Sareen
- SurajH1 +
- Suvayu Ali +
- Terji Petersen
- Thomas J Fan +
- Thomas Li
- Thomas Smith +
- Tim Swast
- Tobias Pitters +
- Tom +
- Tom Augspurger
- Uwe L. Korn
- Valentin Iovene +
- Vandana Iyer +
- Venkatesh Datta +
- Vijay Sai Mutyala +
- Vikas Pandey
- Vipul Rai +
- Vishwam Pandya +
- Vladimir Berkutov +
- Will Ayd
- Will Holmgren
- William +
- William Ayd
- Yago González +
- Yosuke KOBAYASHI +
- Zachary Lawrence +
- Zaky Bilfagih +
- Zeb Nicholls +
- alimcmaster1
- alm +
- andhikayusup +
- andresmcneill +
- avinashpancham +
- benabel +
- bernie gray +
- biddwan09 +
- brock +
- chris-b1
- cleconte987 +
- dan1261 +
- david-cortes +
- davidwales +
- dequadras +
- dhuettenmoser +
- dilex42 +
- elmonsomiat +
- epizzigoni +
- fjetter
- gabrielvf1 +
- gdex1 +
- gfyoung
- guru kiran +
- h-vishal
- iamshwin
- jamin-aws-ospo +
- jbrockmendel
- jfcorbett +
- jnecus +
- kernc
- kota matsuoka +
- kylekeppler +
- leandermaben +
- link2xt +
- manoj_koneni +
- marydmit +
- masterpiga +
- maxime.song +
- mglasder +
- moaraccounts +
- mproszewska
- neilkg
- nrebena
- ossdev07 +
- paihu
- pan Jacek +
- partev +
- patrick +
- pedrooa +
- pizzathief +
- proost
- pvanhauw +
- rbenes
- rebecca-palmer
- rhshadrach +
- rjfs +
- s-scherrer +
- sage +
- sagungrp +
- salem3358 +
- saloni30 +
- smartswdeveloper +
- smartvinnetou +
- themien +
- timhunderwood +
- tolhassianipar +
- tonywu1999
- tsvikas
- tv3141
- venkateshdatta1993 +
- vivikelapoutre +
- willbowditch +
- willpeppo +
- za +
- zaki-indra +