What’s new in 0.25.0 (July 18, 2019)

Warning

Starting with the 0.25.x series of releases, pandas only supports Python 3.5.3 and higher. See Dropping Python 2.7 for more details.

Warning

The minimum supported Python version will be bumped to 3.6 in a future release.

Warning

Panel has been fully removed. For N-D labeled data structures, please use xarray

Warning

read_pickle() and read_msgpack() are only guaranteed backwards compatible back to pandas version 0.20.3 (GH 27082)

These are the changes in pandas 0.25.0. See Release notes for a full changelog including other versions of pandas.

Enhancements#

GroupBy aggregation with relabeling#

pandas has added special groupby behavior, known as “named aggregation”, for naming the output columns when applying multiple aggregation functions to specific columns (GH 18366, GH 26512).

In [1]: animals = pd.DataFrame({'kind': ['cat', 'dog', 'cat', 'dog'],
   ...:                         'height': [9.1, 6.0, 9.5, 34.0],
   ...:                         'weight': [7.9, 7.5, 9.9, 198.0]})

In [2]: animals
Out[2]: 
  kind  height  weight
0  cat     9.1     7.9
1  dog     6.0     7.5
2  cat     9.5     9.9
3  dog    34.0   198.0

[4 rows x 3 columns]

In [3]: animals.groupby("kind").agg(
   ...:     min_height=pd.NamedAgg(column='height', aggfunc='min'),
   ...:     max_height=pd.NamedAgg(column='height', aggfunc='max'),
   ...:     average_weight=pd.NamedAgg(column='weight', aggfunc="mean"),
   ...: )
Out[3]: 
      min_height  max_height  average_weight
kind                                        
cat          9.1         9.5            8.90
dog          6.0        34.0          102.75

[2 rows x 3 columns]

Pass the desired column names as the **kwargs to .agg. The values of **kwargs should be tuples where the first element is the column selection, and the second element is the aggregation function to apply. pandas provides the pandas.NamedAgg namedtuple to make it clearer what the arguments to the function are, but plain tuples are accepted as well.

In [4]: animals.groupby("kind").agg(
   ...:     min_height=('height', 'min'),
   ...:     max_height=('height', 'max'),
   ...:     average_weight=('weight', 'mean'),
   ...: )
Out[4]: 
      min_height  max_height  average_weight
kind                                        
cat          9.1         9.5            8.90
dog          6.0        34.0          102.75

[2 rows x 3 columns]

Named aggregation is the recommended replacement for the deprecated “dict-of-dicts” approach to naming the output of column-specific aggregations (Deprecate groupby.agg() with a dictionary when renaming).

A similar approach is now available for Series groupby objects as well. Because there’s no need for column selection, the values can just be the functions to apply:

In [5]: animals.groupby("kind").height.agg(
   ...:     min_height="min",
   ...:     max_height="max",
   ...: )
Out[5]: 
      min_height  max_height
kind                        
cat          9.1         9.5
dog          6.0        34.0

[2 rows x 2 columns]

This type of aggregation is the recommended alternative to the deprecated behavior when passing a dict to a Series groupby aggregation (Deprecate groupby.agg() with a dictionary when renaming).

See Named aggregation for more.

GroupBy aggregation with multiple lambdas#

You can now provide multiple lambda functions to a list-like aggregation in GroupBy.agg (GH 26430).

In [6]: animals.groupby('kind').height.agg([
   ...:     lambda x: x.iloc[0], lambda x: x.iloc[-1]
   ...: ])
Out[6]: 
      <lambda_0>  <lambda_1>
kind                        
cat          9.1         9.5
dog          6.0        34.0

[2 rows x 2 columns]

In [7]: animals.groupby('kind').agg([
   ...:     lambda x: x.iloc[0] - x.iloc[1],
   ...:     lambda x: x.iloc[0] + x.iloc[1]
   ...: ])
Out[7]: 
          height                weight           
      <lambda_0> <lambda_1> <lambda_0> <lambda_1>
kind                                             
cat         -0.4       18.6       -2.0       17.8
dog        -28.0       40.0     -190.5      205.5

[2 rows x 4 columns]

Previously, these raised a SpecificationError.

Better repr for MultiIndex#

Printing of MultiIndex instances now shows tuples of each row and ensures that the tuple items are vertically aligned, so it’s now easier to understand the structure of the MultiIndex. (GH 13480):

The repr now looks like this:

In [8]: pd.MultiIndex.from_product([['a', 'abc'], range(500)])
Out[8]: 
MultiIndex([(  'a',   0),
            (  'a',   1),
            (  'a',   2),
            (  'a',   3),
            (  'a',   4),
            (  'a',   5),
            (  'a',   6),
            (  'a',   7),
            (  'a',   8),
            (  'a',   9),
            ...
            ('abc', 490),
            ('abc', 491),
            ('abc', 492),
            ('abc', 493),
            ('abc', 494),
            ('abc', 495),
            ('abc', 496),
            ('abc', 497),
            ('abc', 498),
            ('abc', 499)],
           length=1000)

Previously, outputting a MultiIndex printed all the levels and codes of the MultiIndex, which was visually unappealing and made the output more difficult to navigate. For example (limiting the range to 5):

In [1]: pd.MultiIndex.from_product([['a', 'abc'], range(5)])
Out[1]: MultiIndex(levels=[['a', 'abc'], [0, 1, 2, 3]],
                   codes=[[0, 0, 0, 0, 1, 1, 1, 1], [0, 1, 2, 3, 0, 1, 2, 3]])

In the new repr, all values will be shown if the number of rows is smaller than options.display.max_seq_items (default: 100 items). Horizontally, the output will truncate if it is wider than options.display.width (default: 80 characters).
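
If needed, these thresholds can be adjusted with the usual display options; a minimal sketch (the particular values chosen here are illustrative):

import pandas as pd

# Show up to 50 tuples before the repr truncates vertically,
# and allow a wider repr before it truncates horizontally.
pd.set_option("display.max_seq_items", 50)
pd.set_option("display.width", 120)
pd.MultiIndex.from_product([['a', 'abc'], range(500)])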

Shorter truncated repr for Series and DataFrame#

Currently, the default display options of pandas ensure that when a Series or DataFrame has more than 60 rows, its repr gets truncated to this maximum of 60 rows (the display.max_rows option). However, this still gives a repr that takes up a large part of the vertical screen real estate. Therefore, a new option display.min_rows is introduced with a default of 10, which determines the number of rows shown in the truncated repr:

This dual option makes it possible to still see the full content of relatively small objects (e.g. df.head(20) shows all 20 rows), while giving a brief repr for large objects.

To restore the previous behaviour of a single threshold, set pd.options.display.min_rows = None.
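
A minimal sketch of how the two thresholds interact (the frame below is illustrative):

import pandas as pd

pd.set_option("display.max_rows", 60)   # truncation threshold (default)
pd.set_option("display.min_rows", 10)   # rows shown once truncation kicks in

big = pd.DataFrame({"x": range(1000)})
print(big)           # more than 60 rows, so only about 10 rows are displayed
print(big.head(20))  # 20 <= max_rows, so all 20 rows are shown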

JSON normalize with max_level param support#

json_normalize() normalizes the provided input dict to all nested levels. The new max_level parameter provides more control over the level at which to end normalization (GH 23843):

For example:

from pandas.io.json import json_normalize

data = [{
    'CreatedBy': {'Name': 'User001'},
    'Lookup': {'TextField': 'Some text',
               'UserField': {'Id': 'ID001', 'Name': 'Name001'}},
    'Image': {'a': 'b'}
}]
json_normalize(data, max_level=1)
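
For comparison, a sketch of stopping normalization at the top level, continuing from the data defined above (the behaviour is paraphrased in the comments, not verbatim output):

# max_level=0 keeps top-level values as-is, so nested dicts stay dicts;
# max_level=1 (as above) flattens one level, e.g. 'CreatedBy.Name', while
# deeper dicts such as the value of 'Lookup.UserField' remain dict values.
json_normalize(data, max_level=0)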

Series.explode to split list-like values to rows#

Series and DataFrame have gained the Series.explode() and DataFrame.explode() methods to transform list-likes into individual rows. See the section on Exploding a list-like column in the docs for more information (GH 16538, GH 10511).

Here is a typical use case: you have a comma-separated string in a column.

In [9]: df = pd.DataFrame([{'var1': 'a,b,c', 'var2': 1},
   ...:                    {'var1': 'd,e,f', 'var2': 2}])

In [10]: df
Out[10]: 
    var1  var2
0  a,b,c     1
1  d,e,f     2

[2 rows x 2 columns]

Creating a long-form DataFrame is now straightforward using chained operations:

In [11]: df.assign(var1=df.var1.str.split(',')).explode('var1')
Out[11]: 
  var1  var2
0    a     1
0    b     1
0    c     1
1    d     2
1    e     2
1    f     2

[6 rows x 2 columns]

Other enhancements#

Backwards incompatible API changes#

Indexing with date strings with UTC offsets#

Indexing a DataFrame or Series with a DatetimeIndex with a date string with a UTC offset would previously ignore the UTC offset. Now, the UTC offset is respected in indexing. (GH 24076, GH 16785)

In [12]: df = pd.DataFrame([0], index=pd.DatetimeIndex(['2019-01-01'], tz='US/Pacific'))

In [13]: df
Out[13]: 
                           0
2019-01-01 00:00:00-08:00  0

[1 rows x 1 columns]

Previous behavior:

In [3]: df['2019-01-01 00:00:00+04:00':'2019-01-01 01:00:00+04:00']
Out[3]: 
                           0
2019-01-01 00:00:00-08:00  0

New behavior:

In [14]: df['2019-01-01 12:00:00+04:00':'2019-01-01 13:00:00+04:00']
Out[14]: 
                           0
2019-01-01 00:00:00-08:00  0

[1 rows x 1 columns]

MultiIndex constructed from levels and codes#

Constructing a MultiIndex with NaN levels or with codes values < -1 was allowed previously. Now, construction with codes values < -1 is not allowed, and codes corresponding to NaN levels are reassigned to -1. (GH 19387)

Previous behavior:

In [1]: pd.MultiIndex(levels=[[np.nan, None, pd.NaT, 128, 2]],
   ...:               codes=[[0, -1, 1, 2, 3, 4]])
Out[1]: MultiIndex(levels=[[nan, None, NaT, 128, 2]],
                   codes=[[0, -1, 1, 2, 3, 4]])

In [2]: pd.MultiIndex(levels=[[1, 2]], codes=[[0, -2]])
Out[2]: MultiIndex(levels=[[1, 2]], codes=[[0, -2]])

New behavior:

In [15]: pd.MultiIndex(levels=[[np.nan, None, pd.NaT, 128, 2]],
   ....:               codes=[[0, -1, 1, 2, 3, 4]])
Out[15]: 
MultiIndex([(nan,),
            (nan,),
            (nan,),
            (nan,),
            (128,),
            (  2,)],
           )

In [16]: pd.MultiIndex(levels=[[1, 2]], codes=[[0, -2]])

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
Cell In[16], line 1
----> 1 pd.MultiIndex(levels=[[1, 2]], codes=[[0, -2]])

File ~/work/pandas/pandas/pandas/core/indexes/multi.py:364, in MultiIndex.__new__(cls, levels, codes, sortorder, names, dtype, copy, name, verify_integrity)
    361 result.sortorder = sortorder
    363 if verify_integrity:
--> 364     new_codes = result._verify_integrity()
    365     result._codes = new_codes
    367 result._reset_identity()

File ~/work/pandas/pandas/pandas/core/indexes/multi.py:451, in MultiIndex._verify_integrity(self, codes, levels, levels_to_verify)
    445     raise ValueError(
    446         f"On level {i}, code max ({level_codes.max()}) >= length of "
    447         f"level ({len(level)}). NOTE: this index is in an "
    448         "inconsistent state"
    449     )
    450 if len(level_codes) and level_codes.min() < -1:
--> 451     raise ValueError(f"On level {i}, code value ({level_codes.min()}) < -1")
    452 if not level.is_unique:
    453     raise ValueError(
    454         f"Level values must be unique: {list(level)} on level {i}"
    455     )

ValueError: On level 0, code value (-2) < -1

GroupBy.apply on DataFrame evaluates first group only once#

The implementation of DataFrameGroupBy.apply() previously evaluated the supplied function consistently twice on the first group to infer whether it was safe to use a fast code path. Particularly for functions with side effects, this was undesired behavior and may have led to surprises. (GH 2936, GH 2656, GH 7739, GH 10519, GH 12155, GH 20084, GH 21417)

Now every group is evaluated only a single time.

In [17]: df = pd.DataFrame({"a": ["x", "y"], "b": [1, 2]})

In [18]: df
Out[18]: 
   a  b
0  x  1
1  y  2

[2 rows x 2 columns]

In [19]: def func(group):
   ....:     print(group.name)
   ....:     return group
   ....: 

Previous behavior:

In [3]: df.groupby('a').apply(func)
x
x
y
Out[3]: 
   a  b
0  x  1
1  y  2

New behavior:

In [3]: df.groupby('a').apply(func)
x
y
Out[3]: 
   a  b
0  x  1
1  y  2

Concatenating sparse values#

When passed DataFrames whose values are sparse, concat() will now return a Series or DataFrame with sparse values, rather than a SparseDataFrame (GH 25702).

In [20]: df = pd.DataFrame({"A": pd.arrays.SparseArray([0, 1])})

Previous behavior:

In [2]: type(pd.concat([df, df]))
pandas.core.sparse.frame.SparseDataFrame

New behavior:

In [21]: type(pd.concat([df, df]))
Out[21]: pandas.core.frame.DataFrame

This now matches the existing behavior of concat on Series with sparse values. concat() will continue to return a SparseDataFrame when all the values are instances of SparseDataFrame.

This change also affects routines using concat() internally, like get_dummies(), which now returns a DataFrame in all cases (previously a SparseDataFrame was returned if all the columns were dummy encoded, and a DataFrame otherwise).

Providing any SparseSeries or SparseDataFrame to concat() will cause a SparseSeries or SparseDataFrame to be returned, as before.
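
As a rough illustration of the get_dummies() part of this change (a sketch; the exact sparse dtype noted in the comment is indicative):

import pandas as pd

df = pd.DataFrame({"A": ["a", "b", "a"]})
res = pd.get_dummies(df, sparse=True)

type(res)    # now always pandas.core.frame.DataFrame
res.dtypes   # sparse columns, e.g. Sparse[uint8, 0]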

The .str-accessor performs stricter type checks#

Due to the lack of more fine-grained dtypes, Series.str so far only checked whether the data was of object dtype. Series.str will now infer the dtype of the data within the Series; in particular, 'bytes'-only data will raise an exception (except for Series.str.decode(), Series.str.get(), Series.str.len(), Series.str.slice()), see GH 23163, GH 23011, GH 23551.

Previous behavior:

In [1]: s = pd.Series(np.array(['a', 'ba', 'cba'], 'S'), dtype=object)

In [2]: s
Out[2]: 
0      b'a'
1     b'ba'
2    b'cba'
dtype: object

In [3]: s.str.startswith(b'a')
Out[3]: 
0     True
1    False
2    False
dtype: bool

New behavior:

In [22]: s = pd.Series(np.array(['a', 'ba', 'cba'], 'S'), dtype=object)

In [23]: s
Out[23]: 
0      b'a'
1     b'ba'
2    b'cba'
Length: 3, dtype: object

In [24]: s.str.startswith(b'a')

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
Cell In[24], line 1
----> 1 s.str.startswith(b'a')

File ~/work/pandas/pandas/pandas/core/strings/accessor.py:136, in forbid_nonstring_types.<locals>._forbid_nonstring_types.<locals>.wrapper(self, *args, **kwargs)
    131 if self._inferred_dtype not in allowed_types:
    132     msg = (
    133         f"Cannot use .str.{func_name} with values of "
    134         f"inferred dtype '{self._inferred_dtype}'."
    135     )
--> 136     raise TypeError(msg)
    137 return func(self, *args, **kwargs)

TypeError: Cannot use .str.startswith with values of inferred dtype 'bytes'.

Categorical dtypes are preserved during GroupBy#

Previously, columns that were categorical, but not the groupby key(s) would be converted to object dtype during groupby operations. pandas now will preserve these dtypes. (GH 18502)

In [25]: cat = pd.Categorical(["foo", "bar", "bar", "qux"], ordered=True)

In [26]: df = pd.DataFrame({'payload': [-1, -2, -1, -2], 'col': cat})

In [27]: df
Out[27]: 
   payload  col
0       -1  foo
1       -2  bar
2       -1  bar
3       -2  qux

[4 rows x 2 columns]

In [28]: df.dtypes
Out[28]: 
payload       int64
col        category
Length: 2, dtype: object

Previous Behavior:

In [5]: df.groupby('payload').first().col.dtype
Out[5]: dtype('O')

New Behavior:

In [29]: df.groupby('payload').first().col.dtype
Out[29]: CategoricalDtype(categories=['bar', 'foo', 'qux'], ordered=True, categories_dtype=object)

Incompatible Index type unions#

When performing Index.union() operations between objects of incompatible dtypes, the result will be a base Index of dtype object. This behavior holds true for unions between Index objects that previously would have been prohibited. The dtype of empty Index objects will now be evaluated before performing union operations rather than simply returning the other Index object. Index.union() can now be considered commutative, such that A.union(B) == B.union(A) (GH 23525).

Previous behavior:

In [1]: pd.period_range('19910905', periods=2).union(pd.Int64Index([1, 2, 3]))
...
ValueError: can only call with other PeriodIndex-ed objects

In [2]: pd.Index([], dtype=object).union(pd.Index([1, 2, 3]))
Out[2]: Int64Index([1, 2, 3], dtype='int64')

New behavior:

In [3]: pd.period_range('19910905', periods=2).union(pd.Int64Index([1, 2, 3]))
Out[3]: Index([1991-09-05, 1991-09-06, 1, 2, 3], dtype='object')

In [4]: pd.Index([], dtype=object).union(pd.Index([1, 2, 3]))
Out[4]: Index([1, 2, 3], dtype='object')

Note that integer- and floating-dtype indexes are considered “compatible”. The integer values are coerced to floating point, which may result in loss of precision. See Set operations on Index objects for more.
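
A minimal sketch of the integer/float compatibility mentioned above (the result dtype noted in the comment is indicative):

import pandas as pd

# Integer and floating indexes are considered compatible: the union is
# computed on floating-point values, which may lose precision for large ints.
pd.Index([1, 2, 3]).union(pd.Index([2.5, 3.5]))
# expected: Float64Index([1.0, 2.0, 2.5, 3.0, 3.5], dtype='float64')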

DataFrame GroupBy ffill/bfill no longer return group labels#

The methods ffill, bfill, pad and backfill of DataFrameGroupBy previously included the group labels in the return value, which was inconsistent with other groupby transforms. Now only the filled values are returned. (GH 21521)

In [30]: df = pd.DataFrame({"a": ["x", "y"], "b": [1, 2]})

In [31]: df
Out[31]: 
   a  b
0  x  1
1  y  2

[2 rows x 2 columns]

Previous behavior:

In [3]: df.groupby("a").ffill()
Out[3]: 
   a  b
0  x  1
1  y  2

New behavior:

In [32]: df.groupby("a").ffill()
Out[32]: 
   b
0  1
1  2

[2 rows x 1 columns]

DataFrame describe on an empty Categorical / object column will return top and freq#

When calling DataFrame.describe() with an empty categorical / object column, the ‘top’ and ‘freq’ columns were previously omitted, which was inconsistent with the output for non-empty columns. Now the ‘top’ and ‘freq’ columns will always be included, with numpy.nan in the case of an empty DataFrame (GH 26397)

In [33]: df = pd.DataFrame({"empty_col": pd.Categorical([])})

In [34]: df
Out[34]: 
Empty DataFrame
Columns: [empty_col]
Index: []

[0 rows x 1 columns]

Previous behavior:

In [3]: df.describe()
Out[3]: 
       empty_col
count          0
unique         0

New behavior:

In [35]: df.describe()
Out[35]: 
       empty_col
count          0
unique         0
top          NaN
freq         NaN

[4 rows x 1 columns]

__str__ methods now call __repr__ rather than vice versa#

pandas has until now mostly defined string representations in a pandas object’s __str__/__unicode__/__bytes__ methods, and called __str__ from the __repr__ method, if a specific __repr__ method was not found. This is not needed for Python 3. In pandas 0.25, the string representations of pandas objects are now generally defined in __repr__, and calls to __str__ in general now pass the call on to __repr__, if a specific __str__ method doesn’t exist, as is standard for Python. This change is backward compatible for direct usage of pandas, but if you subclass pandas objects and give your subclasses specific __str__/__repr__ methods, you may have to adjust your __str__/__repr__ methods (GH 26495).
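
If you subclass pandas objects, a minimal sketch of the adjustment described above (the subclass and its repr string are hypothetical):

import pandas as pd

class MySeries(pd.Series):
    # Define the representation in __repr__; __str__ now falls back to it,
    # so a separate __str__ is only needed if the two should differ.
    def __repr__(self):
        return "MySeries(n={})".format(len(self))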

Indexing an IntervalIndex with Interval objects#

Indexing methods for IntervalIndex have been modified to require exact matches only for Interval queries. IntervalIndex methods previously matched on any overlapping Interval. Behavior with scalar points, e.g. querying with an integer, is unchanged (GH 16316).

In [36]: ii = pd.IntervalIndex.from_tuples([(0, 4), (1, 5), (5, 8)])

In [37]: ii
Out[37]: IntervalIndex([(0, 4], (1, 5], (5, 8]], dtype='interval[int64, right]')

The in operator (__contains__) now only returns True for exact matches to Intervals in the IntervalIndex, whereas this would previously return True for any Interval overlapping an Interval in the IntervalIndex.

Previous behavior:

In [4]: pd.Interval(1, 2, closed='neither') in ii
Out[4]: True

In [5]: pd.Interval(-10, 10, closed='both') in ii
Out[5]: True

New behavior:

In [38]: pd.Interval(1, 2, closed='neither') in ii
Out[38]: False

In [39]: pd.Interval(-10, 10, closed='both') in ii
Out[39]: False

The get_loc() method now only returns locations for exact matches to Interval queries, as opposed to the previous behavior of returning locations for overlapping matches. A KeyError will be raised if an exact match is not found.

Previous behavior:

In [6]: ii.get_loc(pd.Interval(1, 5))
Out[6]: array([0, 1])

In [7]: ii.get_loc(pd.Interval(2, 6))
Out[7]: array([0, 1, 2])

New behavior:

In [6]: ii.get_loc(pd.Interval(1, 5))
Out[6]: 1

In [7]: ii.get_loc(pd.Interval(2, 6))

KeyError: Interval(2, 6, closed='right')

Likewise, get_indexer() and get_indexer_non_unique() will also only return locations for exact matches to Interval queries, with -1 denoting that an exact match was not found.
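
Continuing the example, a sketch of the exact-match behaviour of get_indexer() (the expected result is noted in the comment):

ii.get_indexer([pd.Interval(1, 5), pd.Interval(2, 6)])
# expected: array([ 1, -1]); (1, 5] matches exactly at position 1,
# while (2, 6] has no exact match and therefore maps to -1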

These indexing changes extend to querying a Series or DataFrame with an IntervalIndex index.

In [40]: s = pd.Series(list('abc'), index=ii)

In [41]: s
Out[41]: 
(0, 4]    a
(1, 5]    b
(5, 8]    c
Length: 3, dtype: object

Selecting from a Series or DataFrame using [] (__getitem__) or loc now only returns exact matches for Interval queries.

Previous behavior:

In [8]: s[pd.Interval(1, 5)]
Out[8]: 
(0, 4]    a
(1, 5]    b
dtype: object

In [9]: s.loc[pd.Interval(1, 5)]
Out[9]: 
(0, 4]    a
(1, 5]    b
dtype: object

New behavior:

In [42]: s[pd.Interval(1, 5)]
Out[42]: 'b'

In [43]: s.loc[pd.Interval(1, 5)]
Out[43]: 'b'

Similarly, a KeyError will be raised for non-exact matches instead of returning overlapping matches.

Previous behavior:

In [9]: s[pd.Interval(2, 3)]
Out[9]: 
(0, 4]    a
(1, 5]    b
dtype: object

In [10]: s.loc[pd.Interval(2, 3)]
Out[10]: 
(0, 4]    a
(1, 5]    b
dtype: object

New behavior:

In [6]: s[pd.Interval(2, 3)]

KeyError: Interval(2, 3, closed='right')

In [7]: s.loc[pd.Interval(2, 3)]

KeyError: Interval(2, 3, closed='right')

The overlaps() method can be used to create a boolean indexer that replicates the previous behavior of returning overlapping matches.

New behavior:

In [44]: idxr = s.index.overlaps(pd.Interval(2, 3))

In [45]: idxr
Out[45]: array([ True,  True, False])

In [46]: s[idxr]
Out[46]: 
(0, 4]    a
(1, 5]    b
Length: 2, dtype: object

In [47]: s.loc[idxr]
Out[47]: 
(0, 4]    a
(1, 5]    b
Length: 2, dtype: object

Binary ufuncs on Series now align#

Applying a binary ufunc like numpy.power() now aligns the inputs when both are Series (GH 23293).

In [48]: s1 = pd.Series([1, 2, 3], index=['a', 'b', 'c'])

In [49]: s2 = pd.Series([3, 4, 5], index=['d', 'c', 'b'])

In [50]: s1
Out[50]: 
a    1
b    2
c    3
Length: 3, dtype: int64

In [51]: s2
Out[51]: 
d    3
c    4
b    5
Length: 3, dtype: int64

Previous behavior

In [5]: np.power(s1, s2)
Out[5]: 
a      1
b     16
c    243
dtype: int64

New behavior

In [52]: np.power(s1, s2)
Out[52]: 
a     1.0
b    32.0
c    81.0
d     NaN
Length: 4, dtype: float64

This matches the behavior of other binary operations in pandas, like Series.add(). To retain the previous behavior, convert the other Series to an array before applying the ufunc.

In [53]: np.power(s1, s2.array)
Out[53]: 
a      1
b     16
c    243
Length: 3, dtype: int64

Categorical.argsort now places missing values at the end#

Categorical.argsort() now places missing values at the end of the array, making it consistent with NumPy and the rest of pandas (GH 21801).

In [54]: cat = pd.Categorical(['b', None, 'a'], categories=['a', 'b'], ordered=True)

Previous behavior

In [2]: cat = pd.Categorical(['b', None, 'a'], categories=['a', 'b'], ordered=True)

In [3]: cat.argsort()
Out[3]: array([1, 2, 0])

In [4]: cat[cat.argsort()]
Out[4]: 
[NaN, a, b]
Categories (2, object): [a < b]

New behavior

In [55]: cat.argsort()
Out[55]: array([2, 0, 1])

In [56]: cat[cat.argsort()]
Out[56]: 
['a', 'b', NaN]
Categories (2, object): ['a' < 'b']

Column order is preserved when passing a list of dicts to DataFrame#

Starting with Python 3.7, the key order of a dict is guaranteed. In practice, this has been true since Python 3.6. The DataFrame constructor now treats a list of dicts in the same way as it does a list of OrderedDict, i.e. preserving the order of the dicts. This change applies only when pandas is running on Python >= 3.6 (GH 27309).

In [57]: data = [
   ....:     {'name': 'Joe', 'state': 'NY', 'age': 18},
   ....:     {'name': 'Jane', 'state': 'KY', 'age': 19, 'hobby': 'Minecraft'},
   ....:     {'name': 'Jean', 'state': 'OK', 'age': 20, 'finances': 'good'}
   ....: ]

Previous Behavior:

Previously, the columns were lexicographically sorted:

In [1]: pd.DataFrame(data)
Out[1]: 
   age finances      hobby  name state
0   18      NaN        NaN   Joe    NY
1   19      NaN  Minecraft  Jane    KY
2   20     good        NaN  Jean    OK

New Behavior:

The column order now matches the insertion-order of the keys in the dict, considering all the records from top to bottom. As a consequence, the column order of the resulting DataFrame has changed compared to previous pandas versions.

In [58]: pd.DataFrame(data)
Out[58]: 
   name state  age      hobby finances
0   Joe    NY   18        NaN      NaN
1  Jane    KY   19  Minecraft      NaN
2  Jean    OK   20        NaN     good

[3 rows x 5 columns]

Increased minimum versions for dependencies#

Due to dropping support for Python 2.7, a number of optional dependencies have updated minimum versions (GH 25725, GH 24942, GH 25752). Independently, some minimum supported versions of dependencies were updated (GH 23519, GH 25554). If installed, we now require:

Package            Minimum Version   Required
numpy              1.13.3            X
pytz               2015.4            X
python-dateutil    2.6.1             X
bottleneck         1.2.1
numexpr            2.6.2
pytest (dev)       4.0.2

For optional libraries the general recommendation is to use the latest version. The following table lists the lowest version per library that is currently being tested throughout the development of pandas. Optional libraries below the lowest tested version may still work, but are not considered supported.

Package Minimum Version
beautifulsoup4 4.6.0
fastparquet 0.2.1
gcsfs 0.2.2
lxml 3.8.0
matplotlib 2.2.2
openpyxl 2.4.8
pyarrow 0.9.0
pymysql 0.7.1
pytables 3.4.2
scipy 0.19.0
sqlalchemy 1.1.4
xarray 0.8.2
xlrd 1.1.0
xlsxwriter 0.9.8
xlwt 1.2.0

See Dependencies and Optional dependencies for more.

Other API changes#

Deprecations#

Sparse subclasses#

The SparseSeries and SparseDataFrame subclasses are deprecated. Their functionality is better provided by a Series or DataFrame with sparse values.

Previous way

df = pd.SparseDataFrame({"A": [0, 0, 1, 2]})
df.dtypes

New way

In [59]: df = pd.DataFrame({"A": pd.arrays.SparseArray([0, 0, 1, 2])})

In [60]: df.dtypes
Out[60]: 
A    Sparse[int64, 0]
Length: 1, dtype: object

The memory usage of the two approaches is identical (GH 19239).

msgpack format#

The msgpack format is deprecated as of 0.25 and will be removed in a future version. It is recommended to use pyarrow for on-the-wire transmission of pandas objects. (GH 27084)
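
A minimal sketch of one pyarrow-based alternative, using Arrow’s IPC stream format (this assumes a recent pyarrow; the exact function names may differ in older releases):

import pandas as pd
import pyarrow as pa

df = pd.DataFrame({"a": [1, 2, 3]})
table = pa.Table.from_pandas(df)

# Write the frame to an in-memory Arrow IPC stream ...
sink = pa.BufferOutputStream()
with pa.ipc.new_stream(sink, table.schema) as writer:
    writer.write_table(table)
buf = sink.getvalue()

# ... and read it back into a DataFrame on the receiving side.
restored = pa.ipc.open_stream(buf).read_all().to_pandas()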

Other deprecations#

Removal of prior version deprecations/changes#

Performance improvements#

Bug fixes#

Categorical#

Datetimelike#

Timedelta#

Timezones#

Numeric#

Conversion#

Strings#

Interval#

Indexing#

Missing#

MultiIndex#

IO#

Plotting#

GroupBy/resample/rolling#

Reshaping#

Sparse#

Build changes#

ExtensionArray#

Other#

Contributors#

A total of 231 people contributed patches to this release. People with a “+” by their names contributed a patch for the first time.