What’s new in 1.1.0 (July 28, 2020)

These are the changes in pandas 1.1.0. See Release notes for a full changelog including other versions of pandas.

Enhancements#

KeyErrors raised by loc specify missing labels#

Previously, if labels were missing for a .loc call, a KeyError was raised stating only that lookups with missing labels were no longer supported.

Now the error message also includes a list of the missing labels (max 10 items, display width 80 characters). See GH 34272.
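For example, a minimal sketch of the new message (the exact wording is illustrative and may differ between versions):

In [1]: ser = pd.Series(range(3), index=["a", "b", "c"])

In [2]: ser.loc[["a", "d", "e"]]
...
KeyError: "['d', 'e'] not in index"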

All dtypes can now be converted to StringDtype#

Previously, declaring or converting to StringDtype was in general only possible if the data was already only str or nan-like (GH 31204). StringDtype now works in all situations where astype(str) or dtype=str work.

For example, the following now works:

In [1]: ser = pd.Series([1, "abc", np.nan], dtype="string")

In [2]: ser
Out[2]:
0       1
1     abc
2    <NA>
Length: 3, dtype: string

In [3]: ser[0]
Out[3]: '1'

In [4]: pd.Series([1, 2, np.nan], dtype="Int64").astype("string")
Out[4]:
0       1
1       2
2    <NA>
Length: 3, dtype: string

Non-monotonic PeriodIndex partial string slicing#

PeriodIndex now supports partial string slicing for non-monotonic indexes, mirroring DatetimeIndex behavior (GH 31096)

For example:

In [5]: dti = pd.date_range("2014-01-01", periods=30, freq="30D")

In [6]: pi = dti.to_period("D")

In [7]: ser_monotonic = pd.Series(np.arange(30), index=pi)

In [8]: shuffler = list(range(0, 30, 2)) + list(range(1, 31, 2))

In [9]: ser = ser_monotonic.iloc[shuffler]

In [10]: ser
Out[10]:
2014-01-01     0
2014-03-02     2
2014-05-01     4
2014-06-30     6
2014-08-29     8
              ..
2015-09-23    21
2015-11-22    23
2016-01-21    25
2016-03-21    27
2016-05-20    29
Freq: D, Length: 30, dtype: int64

In [11]: ser["2014"]
Out[11]:
2014-01-01     0
2014-03-02     2
2014-05-01     4
2014-06-30     6
2014-08-29     8
2014-10-28    10
2014-12-27    12
2014-01-31     1
2014-04-01     3
2014-05-31     5
2014-07-30     7
2014-09-28     9
2014-11-27    11
Freq: D, Length: 13, dtype: int64

In [12]: ser.loc["May 2015"]
Out[12]:
2015-05-26    17
Freq: D, Length: 1, dtype: int64

Comparing two DataFrame or two Series and summarizing the differences#

We’ve added DataFrame.compare() and Series.compare() for comparing two DataFrame or two Series (GH 30429)

In [13]: df = pd.DataFrame(
   ....:     {
   ....:         "col1": ["a", "a", "b", "b", "a"],
   ....:         "col2": [1.0, 2.0, 3.0, np.nan, 5.0],
   ....:         "col3": [1.0, 2.0, 3.0, 4.0, 5.0]
   ....:     },
   ....:     columns=["col1", "col2", "col3"],
   ....: )
   ....:

In [14]: df
Out[14]:
  col1  col2  col3
0    a   1.0   1.0
1    a   2.0   2.0
2    b   3.0   3.0
3    b   NaN   4.0
4    a   5.0   5.0

[5 rows x 3 columns]

In [15]: df2 = df.copy()

In [16]: df2.loc[0, 'col1'] = 'c'

In [17]: df2.loc[2, 'col3'] = 4.0

In [18]: df2
Out[18]:
  col1  col2  col3
0    c   1.0   1.0
1    a   2.0   2.0
2    b   3.0   4.0
3    b   NaN   4.0
4    a   5.0   5.0

[5 rows x 3 columns]

In [19]: df.compare(df2)
Out[19]:
  col1       col3      
  self other self other
0    a     c  NaN   NaN
2  NaN   NaN  3.0   4.0

[2 rows x 4 columns]

See User Guide for more details.

Allow NA in groupby key#

We’ve added a dropna keyword to DataFrame.groupby() and Series.groupby() to allow NA values in group keys. Users can set dropna to False if they want to include NA values in the groupby keys. The default of dropna is set to True to keep backwards compatibility (GH 3729)

In [20]: df_list = [[1, 2, 3], [1, None, 4], [2, 1, 3], [1, 2, 2]]

In [21]: df_dropna = pd.DataFrame(df_list, columns=["a", "b", "c"])

In [22]: df_dropna
Out[22]:
   a    b  c
0  1  2.0  3
1  1  NaN  4
2  2  1.0  3
3  1  2.0  2

[4 rows x 3 columns]

By default, dropna is set to True, which excludes NaN values from the group keys:

In [23]: df_dropna.groupby(by=["b"], dropna=True).sum()
Out[23]:
     a  c
b        
1.0  2  3
2.0  2  5

[2 rows x 2 columns]

To allow NaN values in the group keys, set dropna to False:

In [24]: df_dropna.groupby(by=["b"], dropna=False).sum()
Out[24]:
     a  c
b        
1.0  2  3
2.0  2  5
NaN  1  4

[3 rows x 2 columns]

The default setting of the dropna argument is True, which means NA values are not included in group keys.

Sorting with keys#

We’ve added a key argument to the DataFrame and Series sorting methods, including DataFrame.sort_values(), DataFrame.sort_index(), Series.sort_values(), and Series.sort_index(). The key can be any callable function which is applied column-by-column to each column used for sorting, before sorting is performed (GH 27237). See sort_values with keys and sort_index with keys for more information.

In [25]: s = pd.Series(['C', 'a', 'B'])

In [26]: s
Out[26]:
0    C
1    a
2    B
Length: 3, dtype: object

In [27]: s.sort_values()
Out[27]:
2    B
0    C
1    a
Length: 3, dtype: object

Note how this is sorted with capital letters first. If we apply the Series.str.lower() method, we get

In [28]: s.sort_values(key=lambda x: x.str.lower())
Out[28]:
1    a
2    B
0    C
Length: 3, dtype: object

When applied to a DataFrame, the key is applied per column, to all columns or to a subset if by is specified, e.g.

In [29]: df = pd.DataFrame({'a': ['C', 'C', 'a', 'a', 'B', 'B'],
   ....:                    'b': [1, 2, 3, 4, 5, 6]})
   ....:

In [30]: df
Out[30]:
   a  b
0  C  1
1  C  2
2  a  3
3  a  4
4  B  5
5  B  6

[6 rows x 2 columns]

In [31]: df.sort_values(by=['a'], key=lambda col: col.str.lower())
Out[31]:
   a  b
2  a  3
3  a  4
4  B  5
5  B  6
0  C  1
1  C  2

[6 rows x 2 columns]

For more details, see examples and documentation in DataFrame.sort_values(), Series.sort_values(), and sort_index().

Fold argument support in Timestamp constructor#

Timestamp now supports the keyword-only fold argument according to PEP 495, similar to the parent datetime.datetime class. It supports both accepting fold as an initialization argument and inferring fold from other constructor arguments (GH 25057, GH 31338). Support is limited to dateutil timezones as pytz doesn’t support fold.

For example:

In [32]: ts = pd.Timestamp("2019-10-27 01:30:00+00:00")

In [33]: ts.fold
Out[33]: 0

In [34]: ts = pd.Timestamp(year=2019, month=10, day=27, hour=1, minute=30,
   ....:                   tz="dateutil/Europe/London", fold=1)
   ....:

In [35]: ts
Out[35]: Timestamp('2019-10-27 01:30:00+0000', tz='dateutil//usr/share/zoneinfo/Europe/London')

For more on working with fold, see Fold subsection in the user guide.

Parsing timezone-aware format with different timezones in to_datetime#

to_datetime() now supports parsing formats containing timezone names (%Z) and UTC offsets (%z) from different timezones and then converting them to UTC by setting utc=True. This returns a DatetimeIndex localized to UTC, as opposed to an Index with object dtype if utc=True is not set (GH 32792).

For example:

In [36]: tz_strs = ["2010-01-01 12:00:00 +0100", "2010-01-01 12:00:00 -0100",
   ....:            "2010-01-01 12:00:00 +0300", "2010-01-01 12:00:00 +0400"]
   ....:

In [37]: pd.to_datetime(tz_strs, format='%Y-%m-%d %H:%M:%S %z', utc=True)
Out[37]:
DatetimeIndex(['2010-01-01 11:00:00+00:00', '2010-01-01 13:00:00+00:00',
               '2010-01-01 09:00:00+00:00', '2010-01-01 08:00:00+00:00'],
              dtype='datetime64[ns, UTC]', freq=None)

In [37]: pd.to_datetime(tz_strs, format='%Y-%m-%d %H:%M:%S %z')
Out[37]:
Index([2010-01-01 12:00:00+01:00, 2010-01-01 12:00:00-01:00,
       2010-01-01 12:00:00+03:00, 2010-01-01 12:00:00+04:00],
      dtype='object')

Grouper and resample now support the arguments origin and offset#

Grouper and DataFrame.resample() now support the arguments origin and offset. These let the user control the timestamp on which to adjust the grouping (GH 31809).

The bins of the grouping are adjusted based on the beginning of the day of the time series starting point. This works well with frequencies that are multiples of a day (like 30D) or that divide a day (like 90s or 1min). But it can create inconsistencies with frequencies that do not meet this criterion. To change this behavior you can now specify a fixed timestamp with the argument origin.

Two arguments, base and loffset, are now deprecated in favor of origin and offset (more information in the documentation of DataFrame.resample()).

Small example of the use of origin:

In [38]: start, end = '2000-10-01 23:30:00', '2000-10-02 00:30:00'

In [39]: middle = '2000-10-02 00:00:00'

In [40]: rng = pd.date_range(start, end, freq='7min')

In [41]: ts = pd.Series(np.arange(len(rng)) * 3, index=rng)

In [42]: ts
Out[42]:
2000-10-01 23:30:00     0
2000-10-01 23:37:00     3
2000-10-01 23:44:00     6
2000-10-01 23:51:00     9
2000-10-01 23:58:00    12
2000-10-02 00:05:00    15
2000-10-02 00:12:00    18
2000-10-02 00:19:00    21
2000-10-02 00:26:00    24
Freq: 7min, Length: 9, dtype: int64

Resample with the default behavior 'start_day' (origin is 2000-10-01 00:00:00):

In [43]: ts.resample('17min').sum()
Out[43]:
2000-10-01 23:14:00     0
2000-10-01 23:31:00     9
2000-10-01 23:48:00    21
2000-10-02 00:05:00    54
2000-10-02 00:22:00    24
Freq: 17min, Length: 5, dtype: int64

In [44]: ts.resample('17min', origin='start_day').sum()
Out[44]:
2000-10-01 23:14:00     0
2000-10-01 23:31:00     9
2000-10-01 23:48:00    21
2000-10-02 00:05:00    54
2000-10-02 00:22:00    24
Freq: 17min, Length: 5, dtype: int64

Resample using a fixed origin:

In [45]: ts.resample('17min', origin='epoch').sum()
Out[45]:
2000-10-01 23:18:00     0
2000-10-01 23:35:00    18
2000-10-01 23:52:00    27
2000-10-02 00:09:00    39
2000-10-02 00:26:00    24
Freq: 17min, Length: 5, dtype: int64

In [46]: ts.resample('17min', origin='2000-01-01').sum()
Out[46]:
2000-10-01 23:24:00     3
2000-10-01 23:41:00    15
2000-10-01 23:58:00    45
2000-10-02 00:15:00    45
Freq: 17min, Length: 4, dtype: int64

If needed you can adjust the bins with the argument offset (a Timedelta) that is added to the default origin.
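As a small sketch using the series from above (the bin sums follow from ts; treat the exact output as illustrative), an offset of '23h30min' shifts the default 'start_day' origin so that the first bin edge lands on the first observation at 23:30:00:

In [1]: ts.resample('17min', offset='23h30min').sum()
Out[1]:
2000-10-01 23:30:00     9
2000-10-01 23:47:00    21
2000-10-02 00:04:00    54
2000-10-02 00:21:00    24
Freq: 17min, Length: 4, dtype: int64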

For a full example, see: Use origin or offset to adjust the start of the bins.

fsspec now used for filesystem handling#

For reading and writing to filesystems other than local and reading from HTTP(S), the optional dependency fsspec will be used to dispatch operations (GH 33452). This gives unchanged functionality for S3 and GCS storage, which were already supported, but also adds support for several other storage implementations such as Azure Datalake and Blob, SSH, FTP, Dropbox and GitHub. For docs and capabilities, see the fsspec docs.

The existing capability to interface with S3 and GCS will be unaffected by this change, as fsspec will still bring in the same packages as before.
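For example, assuming the relevant fsspec backend is installed (here s3fs; the bucket name below is hypothetical), remote paths can be passed directly to the usual IO functions:

# fsspec dispatches on the URL scheme; "s3://" paths require the s3fs package.
df = pd.read_csv("s3://some-bucket/data.csv")

# Writing works the same way.
df.to_parquet("s3://some-bucket/data.parquet")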

Other enhancements#

Notable bug fixes#

These are bug fixes that might have notable behavior changes.

MultiIndex.get_indexer interprets method argument correctly#

This restores the behavior of MultiIndex.get_indexer() with method='backfill' or method='pad' to the behavior before pandas 0.23.0. In particular, MultiIndexes are treated as a list of tuples and padding or backfilling is done with respect to the ordering of these lists of tuples (GH 29896).

As an example of this, given:

In [47]: df = pd.DataFrame({
   ....:     'a': [0, 0, 0, 0],
   ....:     'b': [0, 2, 3, 4],
   ....:     'c': ['A', 'B', 'C', 'D'],
   ....: }).set_index(['a', 'b'])
   ....:

In [48]: mi_2 = pd.MultiIndex.from_product([[0], [-1, 0, 1, 3, 4, 5]])

The differences in reindexing df with mi_2 and using method='backfill' can be seen here:

pandas >= 0.23, < 1.1.0:

In [1]: df.reindex(mi_2, method='backfill')
Out[1]:
      c
0 -1  A
   0  A
   1  D
   3  A
   4  A
   5  C

pandas < 0.23, >= 1.1.0:

In [49]: df.reindex(mi_2, method='backfill')
Out[49]:
        c
0 -1    A
   0    A
   1    B
   3    C
   4    D
   5  NaN

[6 rows x 1 columns]

And the differences in reindexing df with mi_2 and using method='pad' can be seen here:

pandas >= 0.23, < 1.1.0:

In [1]: df.reindex(mi_2, method='pad')
Out[1]:
        c
0 -1  NaN
   0  NaN
   1    D
   3  NaN
   4    A
   5    C

pandas < 0.23, >= 1.1.0:

In [50]: df.reindex(mi_2, method='pad')
Out[50]:
        c
0 -1  NaN
   0    A
   1    A
   3    C
   4    D
   5    D

[6 rows x 1 columns]

Failed label-based lookups always raise KeyError#

Label lookups series[key], series.loc[key] and frame.loc[key] used to raise either KeyError or TypeError depending on the type of key and type of Index. These now consistently raise KeyError (GH 31867)

In [51]: ser1 = pd.Series(range(3), index=[0, 1, 2])

In [52]: ser2 = pd.Series(range(3), index=pd.date_range("2020-02-01", periods=3))

Previous behavior:

In [3]: ser1[1.5]
...
TypeError: cannot do label indexing on Int64Index with these indexers [1.5] of type float

In [4]: ser1["foo"]
...
KeyError: 'foo'

In [5]: ser1.loc[1.5]
...
TypeError: cannot do label indexing on Int64Index with these indexers [1.5] of type float

In [6]: ser1.loc["foo"]
...
KeyError: 'foo'

In [7]: ser2.loc[1]
...
TypeError: cannot do label indexing on DatetimeIndex with these indexers [1] of type int

In [8]: ser2.loc[pd.Timestamp(0)]
...
KeyError: Timestamp('1970-01-01 00:00:00')

New behavior:

In [3]: ser1[1.5]
...
KeyError: 1.5

In [4]: ser1["foo"]
...
KeyError: 'foo'

In [5]: ser1.loc[1.5]
...
KeyError: 1.5

In [6]: ser1.loc["foo"]
...
KeyError: 'foo'

In [7]: ser2.loc[1]
...
KeyError: 1

In [8]: ser2.loc[pd.Timestamp(0)]
...
KeyError: Timestamp('1970-01-01 00:00:00')

Similarly, DataFrame.at() and Series.at() will raise a TypeError instead of a ValueError if an incompatible key is passed, and KeyError if a missing key is passed, matching the behavior of .loc[] (GH 31722)
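A minimal sketch of the missing-key case, reusing ser1 from above (the exact error text is illustrative):

In [9]: ser1.at["foo"]
...
KeyError: 'foo'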

Failed Integer Lookups on MultiIndex Raise KeyError#

Indexing with integers into a MultiIndex with an integer-dtype first level incorrectly failed to raise KeyError when one or more of those integer keys was not present in the first level of the index (GH 33539)

In [53]: idx = pd.Index(range(4))

In [54]: dti = pd.date_range("2000-01-03", periods=3)

In [55]: mi = pd.MultiIndex.from_product([idx, dti])

In [56]: ser = pd.Series(range(len(mi)), index=mi)

Previous behavior:

In [5]: ser[[5]]
Out[5]: Series([], dtype: int64)

New behavior:

In [5]: ser[[5]]
...
KeyError: '[5] not in index'

DataFrame.merge() preserves right frame’s row order#

DataFrame.merge() now preserves the right frame’s row order when executing a right merge (GH 27453)

In [57]: left_df = pd.DataFrame({'animal': ['dog', 'pig'],
   ....:                         'max_speed': [40, 11]})
   ....:

In [58]: right_df = pd.DataFrame({'animal': ['quetzal', 'pig'],
   ....:                          'max_speed': [80, 11]})
   ....:

In [59]: left_df
Out[59]:
  animal  max_speed
0    dog         40
1    pig         11

[2 rows x 2 columns]

In [60]: right_df
Out[60]:
    animal  max_speed
0  quetzal         80
1      pig         11

[2 rows x 2 columns]

Previous behavior:

left_df.merge(right_df, on=['animal', 'max_speed'], how="right")

    animal  max_speed
0      pig         11
1  quetzal         80

New behavior:

In [61]: left_df.merge(right_df, on=['animal', 'max_speed'], how="right")
Out[61]:
    animal  max_speed
0  quetzal         80
1      pig         11

[2 rows x 2 columns]

Assignment to multiple columns of a DataFrame when some columns do not exist#

Assignment to multiple columns of a DataFrame when some of the columns do not exist would previously assign the values to the last column. Now, new columns will be constructed with the right values. (GH 13658)

In [62]: df = pd.DataFrame({'a': [0, 1, 2], 'b': [3, 4, 5]})

In [63]: df
Out[63]:
   a  b
0  0  3
1  1  4
2  2  5

[3 rows x 2 columns]

Previous behavior:

In [3]: df[['a', 'c']] = 1

In [4]: df
Out[4]:
   a  b
0  1  1
1  1  1
2  1  1

New behavior:

In [64]: df[['a', 'c']] = 1

In [65]: df
Out[65]:
   a  b  c
0  1  3  1
1  1  4  1
2  1  5  1

[3 rows x 3 columns]

Consistency across groupby reductions#

Using DataFrame.groupby() with as_index=True and the aggregation nunique would include the grouping column(s) in the columns of the result. Now the grouping column(s) only appear in the index, consistent with other reductions. (GH 32579)

In [66]: df = pd.DataFrame({"a": ["x", "x", "y", "y"], "b": [1, 1, 2, 3]})

In [67]: df
Out[67]:
   a  b
0  x  1
1  x  1
2  y  2
3  y  3

[4 rows x 2 columns]

Previous behavior:

In [3]: df.groupby("a", as_index=True).nunique() Out[4]: a b a x 1 1 y 1 2

New behavior:

In [68]: df.groupby("a", as_index=True).nunique()
Out[68]:
   b
a   
x  1
y  2

[2 rows x 1 columns]

Using DataFrame.groupby() with as_index=False and the function idxmax, idxmin, mad, nunique, sem, skew, or std would modify the grouping column. Now the grouping column remains unchanged, consistent with other reductions. (GH 21090, GH 10355)

Previous behavior:

In [3]: df.groupby("a", as_index=False).nunique() Out[4]: a b 0 1 1 1 1 2

New behavior:

In [69]: df.groupby("a", as_index=False).nunique()
Out[69]:
   a  b
0  x  1
1  y  2

[2 rows x 2 columns]

The method DataFrameGroupBy.size() would previously ignore as_index=False. Now the grouping columns are returned as columns, making the result a DataFrame instead of a Series. (GH 32599)

Previous behavior:

In [3]: df.groupby("a", as_index=False).size() Out[4]: a x 2 y 2 dtype: int64

New behavior:

In [70]: df.groupby("a", as_index=False).size()
Out[70]:
   a  size
0  x     2
1  y     2

[2 rows x 2 columns]

DataFrameGroupBy.agg() lost results with as_index=False when relabeling columns#

Previously DataFrameGroupBy.agg() lost the result columns when the as_index option was set to False and the result columns were relabeled. In this case the result values were replaced with the previous index (GH 32240).

In [71]: df = pd.DataFrame({"key": ["x", "y", "z", "x", "y", "z"],
   ....:                    "val": [1.0, 0.8, 2.0, 3.0, 3.6, 0.75]})
   ....:

In [72]: df
Out[72]:
  key   val
0   x  1.00
1   y  0.80
2   z  2.00
3   x  3.00
4   y  3.60
5   z  0.75

[6 rows x 2 columns]

Previous behavior:

In [2]: grouped = df.groupby("key", as_index=False)

In [3]: result = grouped.agg(min_val=pd.NamedAgg(column="val", aggfunc="min"))

In [4]: result
Out[4]:
  min_val
0       x
1       y
2       z

New behavior:

In [73]: grouped = df.groupby("key", as_index=False)

In [74]: result = grouped.agg(min_val=pd.NamedAgg(column="val", aggfunc="min"))

In [75]: result
Out[75]:
  key  min_val
0   x     1.00
1   y     0.80
2   z     0.75

[3 rows x 2 columns]

apply and applymap on DataFrame evaluate the first row/column only once#

Previously, the function passed to DataFrame.apply() or DataFrame.applymap() could be called twice on the first row or column in order to infer the result type. It is now evaluated only once, as the example below shows.

In [76]: df = pd.DataFrame({'a': [1, 2], 'b': [3, 6]})

In [77]: def func(row):
   ....:     print(row)
   ....:     return row
   ....:

Previous behavior:

In [4]: df.apply(func, axis=1)
a    1
b    3
Name: 0, dtype: int64
a    1
b    3
Name: 0, dtype: int64
a    2
b    6
Name: 1, dtype: int64
Out[4]:
   a  b
0  1  3
1  2  6

New behavior:

In [78]: df.apply(func, axis=1)
a    1
b    3
Name: 0, Length: 2, dtype: int64
a    2
b    6
Name: 1, Length: 2, dtype: int64
Out[78]:
   a  b
0  1  3
1  2  6

[2 rows x 2 columns]

Backwards incompatible API changes#

Added check_freq argument to testing.assert_frame_equal and testing.assert_series_equal#

The check_freq argument was added to testing.assert_frame_equal() and testing.assert_series_equal() in pandas 1.1.0 and defaults to True. These functions now raise AssertionError if the indexes do not have the same frequency; before pandas 1.1.0 the index frequency was not checked.
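A minimal sketch of the new check (the series names are illustrative): two otherwise-equal series whose DatetimeIndexes differ only in freq (D versus None) now fail the default comparison, while check_freq=False restores the pre-1.1.0 behavior:

In [1]: s1 = pd.Series([1, 2, 3], index=pd.date_range("2020-01-01", periods=3, freq="D"))

In [2]: s2 = pd.Series([1, 2, 3], index=pd.DatetimeIndex(["2020-01-01", "2020-01-02", "2020-01-03"]))

In [3]: pd.testing.assert_series_equal(s1, s2)  # now raises AssertionError

In [4]: pd.testing.assert_series_equal(s1, s2, check_freq=False)  # passes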

Increased minimum versions for dependencies#

Some minimum supported versions of dependencies were updated (GH 33718, GH 29766, GH 29723, pytables >= 3.4.3). If installed, we now require:

Package          Minimum Version  Required  Changed
numpy            1.15.4           X         X
pytz             2015.4           X
python-dateutil  2.7.3            X         X
bottleneck       1.2.1
numexpr          2.6.2
pytest (dev)     4.0.2

For optional libraries the general recommendation is to use the latest version. The following table lists the lowest version per library that is currently being tested throughout the development of pandas. Optional libraries below the lowest tested version may still work, but are not considered supported.

Package         Minimum Version  Changed
beautifulsoup4  4.6.0
fastparquet     0.3.2
fsspec          0.7.4
gcsfs           0.6.0            X
lxml            3.8.0
matplotlib      2.2.2
numba           0.46.0
openpyxl        2.5.7
pyarrow         0.13.0
pymysql         0.7.1
pytables        3.4.3            X
s3fs            0.4.0            X
scipy           1.2.0            X
sqlalchemy      1.1.4
xarray          0.8.2
xlrd            1.1.0
xlsxwriter      0.9.8
xlwt            1.2.0
pandas-gbq      1.2.0            X

See Dependencies and Optional dependencies for more.

Development changes#

Deprecations#

Performance improvements#

Bug fixes#

Categorical#

Datetimelike#

Timedelta#

Timezones#

Numeric#

Conversion#

Strings#

Interval#

Indexing#

Missing#

MultiIndex#

In [79]: df = pd.DataFrame(np.arange(4),
   ....:                   index=[["a", "a", "b", "b"], [1, 2, 1, 2]])
   ....:

Rows are now ordered as the requested keys

In [80]: df.loc[(['b', 'a'], [2, 1]), :]
Out[80]:
     0
b 2  3
  1  2
a 2  1
  1  0

[4 rows x 1 columns]

In [81]: left = pd.MultiIndex.from_arrays([["b", "a"], [2, 1]])

In [82]: right = pd.MultiIndex.from_arrays([["a", "b", "c"], [1, 2, 3]])

Common elements are now guaranteed to be ordered by the left side

In [83]: left.intersection(right, sort=False)
Out[83]:
MultiIndex([('b', 2),
            ('a', 1)],
           )

IO#

Plotting#

GroupBy/resample/rolling#

Reshaping#

Sparse#

ExtensionArray#

Other#

Contributors#

A total of 368 people contributed patches to this release. People with a “+” by their names contributed a patch for the first time.