What’s new in 2.0.0 (April 3, 2023)

These are the changes in pandas 2.0.0. See Release notes for a full changelog including other versions of pandas.

Enhancements#

Index can now hold numpy numeric dtypes#

It is now possible to use any numpy numeric dtype in an Index (GH 42717).

Previously it was only possible to use int64, uint64 & float64 dtypes:

In [1]: pd.Index([1, 2, 3], dtype=np.int8)
Out[1]: Int64Index([1, 2, 3], dtype="int64")

In [2]: pd.Index([1, 2, 3], dtype=np.uint16)
Out[2]: UInt64Index([1, 2, 3], dtype="uint64")

In [3]: pd.Index([1, 2, 3], dtype=np.float32)
Out[3]: Float64Index([1.0, 2.0, 3.0], dtype="float64")

Int64Index, UInt64Index & Float64Index were deprecated in pandas version 1.4 and have now been removed. Instead Index should be used directly, and it can now take all numpy numeric dtypes, i.e. int8/int16/int32/int64/uint8/uint16/uint32/uint64/float32/float64 dtypes:

In [1]: pd.Index([1, 2, 3], dtype=np.int8)
Out[1]: Index([1, 2, 3], dtype='int8')

In [2]: pd.Index([1, 2, 3], dtype=np.uint16)
Out[2]: Index([1, 2, 3], dtype='uint16')

In [3]: pd.Index([1, 2, 3], dtype=np.float32)
Out[3]: Index([1.0, 2.0, 3.0], dtype='float32')

The ability for Index to hold the numpy numeric dtypes has meant some changes in pandas functionality. In particular, operations that previously were forced to create 64-bit indexes can now create indexes with lower bit sizes, e.g. 32-bit indexes.

Below is a possibly non-exhaustive list of changes:

  1. Instantiating using a numpy numeric array now follows the dtype of the numpy array. Previously, all indexes created from numpy numeric arrays were forced to 64-bit. Now, for example, Index(np.array([1, 2, 3])) will be int32 on 32-bit systems, where it previously would have been int64 even on 32-bit systems. Instantiating Index using a list of numbers will still return 64-bit dtypes, e.g. Index([1, 2, 3]) will have an int64 dtype, which is the same as previously.
  2. The various numeric datetime attributes of DatetimeIndex (day, month, year, etc.) were previously of dtype int64, while they were int32 for arrays.DatetimeArray. They are now int32 on DatetimeIndex also:
    In [4]: idx = pd.date_range(start='1/1/2018', periods=3, freq='ME')
    In [5]: idx.array.year
    Out[5]: array([2018, 2018, 2018], dtype=int32)
    In [6]: idx.year
    Out[6]: Index([2018, 2018, 2018], dtype='int32')
  3. Level dtypes on Indexes from Series.sparse.from_coo() are now of dtype int32, the same as they are on the rows/cols on a scipy sparse matrix. Previously they were of dtype int64.
    In [7]: from scipy import sparse
    In [8]: A = sparse.coo_matrix(
    ...: ([3.0, 1.0, 2.0], ([1, 0, 0], [0, 2, 3])), shape=(3, 4)
    ...: )
    ...:
    In [9]: ser = pd.Series.sparse.from_coo(A)
    In [10]: ser.index.dtypes
    Out[10]:
    level_0 int32
    level_1 int32
    dtype: object
  4. Index cannot be instantiated using a float16 dtype. Previously instantiating an Index using dtype float16 resulted in a Float64Index with a float64 dtype. It now raises a NotImplementedError:
    In [11]: pd.Index([1, 2, 3], dtype=np.float16)

NotImplementedError Traceback (most recent call last)
Cell In[11], line 1
----> 1 pd.Index([1, 2, 3], dtype=np.float16)
File ~/work/pandas/pandas/pandas/core/indexes/base.py:576, in Index.__new__(cls, data, dtype, copy, name, tupleize_cols)
572 arr = ensure_wrapped_if_datetimelike(arr)
574 klass = cls._dtype_to_subclass(arr.dtype)
--> 576 arr = klass._ensure_array(arr, arr.dtype, copy=False)
577 result = klass._simple_new(arr, name, refs=refs)
578 if dtype is None and is_pandas_object and data_dtype == np.object_:
File ~/work/pandas/pandas/pandas/core/indexes/base.py:601, in Index._ensure_array(cls, data, dtype, copy)
598 raise ValueError("Index data must be 1-dimensional")
599 elif dtype == np.float16:
600 # float16 not supported (no indexing engine)
--> 601 raise NotImplementedError("float16 indexes are not supported")
603 if copy:
604 # asarray_tuplesafe does not always copy underlying data,
605 # so need to make sure that this happens
606 data = data.copy()
NotImplementedError: float16 indexes are not supported
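The points in the list above can be checked with a short script. This is a minimal sketch, assuming pandas 2.0 or later and numpy are installed; the variable names are illustrative and not part of the pandas API.

```python
import numpy as np
import pandas as pd

# An explicitly requested small dtype is kept instead of being upcast to 64 bits.
idx_int8 = pd.Index([1, 2, 3], dtype=np.int8)

# An Index built from a numpy array follows the dtype of that array.
idx_f32 = pd.Index(np.array([1.0, 2.0, 3.0], dtype=np.float32))

# A plain list of Python integers still produces int64.
idx_list = pd.Index([1, 2, 3])

# Datetime fields on a DatetimeIndex are int32, matching DatetimeArray.
years = pd.date_range("2018-01-01", periods=3, freq="D").year

# float16 is rejected with NotImplementedError.
try:
    pd.Index([1, 2, 3], dtype=np.float16)
    float16_rejected = False
except NotImplementedError:
    float16_rejected = True

print(idx_int8.dtype, idx_f32.dtype, idx_list.dtype, years.dtype, float16_rejected)
```

Code that relied on every numeric index being 64-bit (for example, when passing index values to compiled extensions) may need an explicit astype call after upgrading.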

Argument dtype_backend, to return pyarrow-backed or numpy-backed nullable dtypes#

The following functions gained a new keyword dtype_backend (GH 36712)

When this option is set to "numpy_nullable" it will return a DataFrame that is backed by nullable dtypes.

When this keyword is set to "pyarrow", then these functions will return pyarrow-backed nullable ArrowDtype DataFrames (GH 48957, GH 49997):

In [12]: import io

In [13]: data = io.StringIO("""a,b,c,d,e,f,g,h,i
   ....:     1,2.5,True,a,,,,,
   ....:     3,4.5,False,b,6,7.5,True,a,
   ....: """)
   ....:

In [14]: df = pd.read_csv(data, dtype_backend="pyarrow")

In [15]: df.dtypes
Out[15]:
a     int64[pyarrow]
b    double[pyarrow]
c      bool[pyarrow]
d    string[pyarrow]
e     int64[pyarrow]
f    double[pyarrow]
g      bool[pyarrow]
h    string[pyarrow]
i      null[pyarrow]
dtype: object

In [16]: data.seek(0)
Out[16]: 0

In [17]: df_pyarrow = pd.read_csv(data, dtype_backend="pyarrow", engine="pyarrow")

In [18]: df_pyarrow.dtypes
Out[18]:
a     int64[pyarrow]
b    double[pyarrow]
c      bool[pyarrow]
d    string[pyarrow]
e     int64[pyarrow]
f    double[pyarrow]
g      bool[pyarrow]
h    string[pyarrow]
i      null[pyarrow]
dtype: object
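The transcript above covers the "pyarrow" backend. For comparison, here is a minimal sketch of the "numpy_nullable" backend, which does not need pyarrow installed. It assumes pandas 2.0 or later; the CSV content and variable names are made up for illustration.

```python
import io

import pandas as pd

# A small CSV with one missing value in column "b".
csv_text = "a,b,c\n1,2.5,True\n3,,False\n"

# dtype_backend="numpy_nullable" requests pandas' nullable extension dtypes.
df = pd.read_csv(io.StringIO(csv_text), dtype_backend="numpy_nullable")

# Collect the dtype names so they are easy to inspect.
dtype_names = {col: str(dtype) for col, dtype in df.dtypes.items()}
print(dtype_names)

# The empty field is stored as pd.NA in a nullable float column.
missing_in_b = int(df["b"].isna().sum())
print(missing_in_b)
```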

Copy-on-Write improvements#

Copy-on-Write can be enabled through one of:

pd.set_option("mode.copy_on_write", True)

pd.options.mode.copy_on_write = True

Alternatively, Copy-on-Write can be enabled locally through:

with pd.option_context("mode.copy_on_write", True):
    ...
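To make the effect of the option concrete, the following sketch modifies a column selected from a DataFrame while Copy-on-Write is enabled. It assumes pandas 2.0 or later; the frame and variable names are illustrative only.

```python
import pandas as pd

with pd.option_context("mode.copy_on_write", True):
    df = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]})

    # Selecting a column returns an object that shares data with df lazily.
    col = df["a"]

    # Writing to the child triggers a copy, so the parent is left untouched.
    col.iloc[0] = 100

    parent_value = int(df.loc[0, "a"])
    child_value = int(col.iloc[0])

print(parent_value, child_value)
```

Under Copy-on-Write, a change made through a derived object should never propagate back to the object it was derived from, which is why the parent keeps its original value here.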

Other enhancements#

Notable bug fixes#

These are bug fixes that might have notable behavior changes.

DataFrameGroupBy.cumsum() and DataFrameGroupBy.cumprod() overflow instead of lossy casting to float#

In previous versions we cast to float when applying cumsum and cumprod, which led to incorrect results even if the result could be held by the int64 dtype. Additionally, the aggregation now overflows consistently with numpy and the regular DataFrame.cumprod() and DataFrame.cumsum() methods when the limit of int64 is reached (GH 37493).

Old Behavior

In [1]: df = pd.DataFrame({"key": ["b"] * 7, "value": 625})
In [2]: df.groupby("key")["value"].cumprod()[5]
Out[2]: 5.960464477539062e+16

We return incorrect results with the 6th value.

New Behavior

In [19]: df = pd.DataFrame({"key": ["b"] * 7, "value": 625})

In [20]: df.groupby("key")["value"].cumprod()
Out[20]:
0                   625
1                390625
2             244140625
3          152587890625
4        95367431640625
5     59604644775390625
6    359414837200037393
Name: value, dtype: int64

We overflow with the 7th value, but the 6th value is still correct.
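The wrapped 7th value can be reproduced with exact Python integer arithmetic. The sketch below assumes pandas 2.0 or later and reuses the DataFrame from the example above; the helper variable names are illustrative.

```python
import pandas as pd

df = pd.DataFrame({"key": ["b"] * 7, "value": 625})
result = df.groupby("key")["value"].cumprod()

# The 6th cumulative product still fits into int64 and is exact.
sixth = int(result.iloc[5])

# The 7th product exceeds the int64 range and wraps around modulo 2**64.
# For this input the wrapped value is below 2**63, so it stays positive.
seventh = int(result.iloc[6])
expected_wrapped = (625**7) % (2**64)

print(sixth == 625**6, seventh == expected_wrapped)
```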

DataFrameGroupBy.nth() and SeriesGroupBy.nth() now behave as filtrations#

In previous versions of pandas, DataFrameGroupBy.nth() and SeriesGroupBy.nth() acted as if they were aggregations. However, for most inputs n, they may return either zero or multiple rows per group. This means that they are filtrations, similar to e.g. DataFrameGroupBy.head(). pandas now treats them as filtrations (GH 13666).

In [21]: df = pd.DataFrame({"a": [1, 1, 2, 1, 2], "b": [np.nan, 2.0, 3.0, 4.0, 5.0]})

In [22]: gb = df.groupby("a")

Old Behavior

In [5]: gb.nth(n=1)
Out[5]:
   A    B
1  1  2.0
4  2  5.0

New Behavior

In [23]: gb.nth(n=1)
Out[23]:
   a    b
1  1  2.0
4  2  5.0

In particular, the index of the result is derived from the input by selecting the appropriate rows. Also, when n is larger than the group, no rows are returned instead of NaN.

Old Behavior

In [5]: gb.nth(n=3, dropna="any")
Out[5]:
    B
A
1 NaN
2 NaN

New Behavior

In [24]: gb.nth(n=3, dropna="any")
Out[24]:
Empty DataFrame
Columns: [a, b]
Index: []
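The filtration semantics can be verified programmatically. This is a minimal sketch, assuming pandas 2.0 or later, that reuses the DataFrame defined above and omits the dropna argument for brevity.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"a": [1, 1, 2, 1, 2], "b": [np.nan, 2.0, 3.0, 4.0, 5.0]})
gb = df.groupby("a")

# nth(1) picks the second row of each group and keeps the original row labels.
second_rows = gb.nth(1)

# No group has a fourth row, so the result is empty rather than filled with NaN.
out_of_range = gb.nth(3)

print(list(second_rows.index), list(second_rows.columns), out_of_range.empty)
```

Code that expected the group keys as the index of the result can restore that layout explicitly, for example with set_index on the grouping column.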

Backwards incompatible API changes#

Construction with datetime64 or timedelta64 dtype with unsupported resolution#

In past versions, when constructing a Series or DataFrame and passing a “datetime64” or “timedelta64” dtype with unsupported resolution (i.e. anything other than “ns”), pandas would silently replace the given dtype with its nanosecond analogue:

Previous behavior:

In [5]: pd.Series(["2016-01-01"], dtype="datetime64[s]")
Out[5]:
0   2016-01-01
dtype: datetime64[ns]

In [6]: pd.Series(["2016-01-01"], dtype="datetime64[D]")
Out[6]:
0   2016-01-01
dtype: datetime64[ns]

In pandas 2.0 we support resolutions “s”, “ms”, “us”, and “ns”. When passing a supported dtype (e.g. “datetime64[s]”), the result now has exactly the requested dtype:

New behavior:

In [25]: pd.Series(["2016-01-01"], dtype="datetime64[s]")
Out[25]:
0   2016-01-01
dtype: datetime64[s]

With an unsupported dtype, pandas now raises instead of silently swapping in a supported dtype:

New behavior:

In [26]: pd.Series(["2016-01-01"], dtype="datetime64[D]")

TypeError Traceback (most recent call last)
Cell In[26], line 1
----> 1 pd.Series(["2016-01-01"], dtype="datetime64[D]")

File ~/work/pandas/pandas/pandas/core/series.py:584, in Series.__init__(self, data, index, dtype, name, copy, fastpath)
582 data = data.copy()
583 else:
--> 584 data = sanitize_array(data, index, dtype, copy)
586 manager = _get_option("mode.data_manager", silent=True)
587 if manager == "block":

File ~/work/pandas/pandas/pandas/core/construction.py:651, in sanitize_array(data, index, dtype, copy, allow_2d)
648 subarr = np.array([], dtype=np.float64)
650 elif dtype is not None:
--> 651 subarr = _try_cast(data, dtype, copy)
653 else:
654 subarr = maybe_convert_platform(data)

File ~/work/pandas/pandas/pandas/core/construction.py:811, in _try_cast(arr, dtype, copy)
806 return lib.ensure_string_array(arr, convert_na_value=False, copy=copy).reshape(
807 shape
808 )
810 elif dtype.kind in "mM":
--> 811 return maybe_cast_to_datetime(arr, dtype)
813 # GH#15832: Check if we are requesting a numeric dtype and
814 # that we can convert the data to the requested dtype.
815 elif dtype.kind in "iu":
816 # this will raise if we have e.g. floats

File ~/work/pandas/pandas/pandas/core/dtypes/cast.py:1218, in maybe_cast_to_datetime(value, dtype)
1214 raise TypeError("value must be listlike")
1216 # TODO: _from_sequence would raise ValueError in cases where
1217 # _ensure_nanosecond_dtype raises TypeError
-> 1218 _ensure_nanosecond_dtype(dtype)
1220 if lib.is_np_dtype(dtype, "m"):
1221 res = TimedeltaArray._from_sequence(value, dtype=dtype)

File ~/work/pandas/pandas/pandas/core/dtypes/cast.py:1275, in _ensure_nanosecond_dtype(dtype)
1272 raise ValueError(msg)
1273 # TODO: ValueError or TypeError? existing test
1274 # test_constructor_generic_timestamp_bad_frequency expects TypeError
-> 1275 raise TypeError(
1276 f"dtype={dtype} is not supported. Supported resolutions are 's', "
1277 "'ms', 'us', and 'ns'"
1278 )

TypeError: dtype=datetime64[D] is not supported. Supported resolutions are 's', 'ms', 'us', and 'ns'
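Code that has to handle both outcomes can guard the constructor call. The following is a minimal sketch, assuming pandas 2.0 or later; the variable names are illustrative.

```python
import pandas as pd

# A supported resolution is preserved exactly.
ser_seconds = pd.Series(["2016-01-01"], dtype="datetime64[s]")

# An unsupported resolution raises TypeError instead of falling back to "ns".
try:
    pd.Series(["2016-01-01"], dtype="datetime64[D]")
    day_unit_error = None
except TypeError as exc:
    day_unit_error = str(exc)

print(ser_seconds.dtype)
print(day_unit_error)
```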

Value counts sets the resulting name to count#

In past versions, when running Series.value_counts(), the result would inherit the original object’s name, and the result index would be nameless. This would cause confusion when resetting the index, and the column names would not correspond with the column values. Now, the result name will be 'count' (or 'proportion' if normalize=True was passed), and the index will be named after the original object (GH 49497).

Previous behavior:

In [8]: pd.Series(['quetzal', 'quetzal', 'elk'], name='animal').value_counts()
Out[8]:
quetzal    2
elk        1
Name: animal, dtype: int64

New behavior:

In [27]: pd.Series(['quetzal', 'quetzal', 'elk'], name='animal').value_counts()
Out[27]:
animal
quetzal    2
elk        1
Name: count, dtype: int64

Likewise for other value_counts methods (for example, DataFrame.value_counts()).
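The practical benefit shows up when the index is reset, since the resulting column names now describe their contents. A minimal sketch, assuming pandas 2.0 or later; the variable names are illustrative.

```python
import pandas as pd

animals = pd.Series(["quetzal", "quetzal", "elk"], name="animal")

# The result is named "count" and its index carries the original name.
counts = animals.value_counts()

# With normalize=True the result is named "proportion" instead.
shares = animals.value_counts(normalize=True)

# Resetting the index yields self-describing column names.
table = counts.reset_index()

print(counts.name, counts.index.name, shares.name, list(table.columns))
```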

Disallow astype conversion to non-supported datetime64/timedelta64 dtypes#

In previous versions, converting a Series or DataFrame from datetime64[ns] to a different datetime64[X] dtype would return with datetime64[ns] dtype instead of the requested dtype. In pandas 2.0, support is added for “datetime64[s]”, “datetime64[ms]”, and “datetime64[us]” dtypes, so converting to those dtypes gives exactly the requested dtype:

In [28]: idx = pd.date_range("2016-01-01", periods=3)

In [29]: ser = pd.Series(idx)

Previous behavior:

In [4]: ser.astype("datetime64[s]")
Out[4]:
0   2016-01-01
1   2016-01-02
2   2016-01-03
dtype: datetime64[ns]

With the new behavior, we get exactly the requested dtype:

New behavior:

In [30]: ser.astype("datetime64[s]")
Out[30]:
0   2016-01-01
1   2016-01-02
2   2016-01-03
dtype: datetime64[s]

For non-supported resolutions e.g. “datetime64[D]”, we raise instead of silently ignoring the requested dtype:

New behavior:

In [31]: ser.astype("datetime64[D]")

TypeError Traceback (most recent call last)
Cell In[31], line 1
----> 1 ser.astype("datetime64[D]")

File ~/work/pandas/pandas/pandas/core/generic.py:6643, in NDFrame.astype(self, dtype, copy, errors)
6637 results = [
6638 ser.astype(dtype, copy=copy, errors=errors) for _, ser in self.items()
6639 ]
6641 else:
6642 # else, only a single dtype is given
-> 6643 new_data = self._mgr.astype(dtype=dtype, copy=copy, errors=errors)
6644 res = self._constructor_from_mgr(new_data, axes=new_data.axes)
6645 return res.__finalize__(self, method="astype")

File ~/work/pandas/pandas/pandas/core/internals/managers.py:430, in BaseBlockManager.astype(self, dtype, copy, errors)
427 elif using_copy_on_write():
428 copy = False
--> 430 return self.apply(
431 "astype",
432 dtype=dtype,
433 copy=copy,
434 errors=errors,
435 using_cow=using_copy_on_write(),
436 )

File ~/work/pandas/pandas/pandas/core/internals/managers.py:363, in BaseBlockManager.apply(self, f, align_keys, **kwargs)
361 applied = b.apply(f, **kwargs)
362 else:
--> 363 applied = getattr(b, f)(**kwargs)
364 result_blocks = extend_blocks(applied, result_blocks)
366 out = type(self).from_blocks(result_blocks, self.axes)

File ~/work/pandas/pandas/pandas/core/internals/blocks.py:758, in Block.astype(self, dtype, copy, errors, using_cow, squeeze)
755 raise ValueError("Can not squeeze with more than one column.")
756 values = values[0, :] # type: ignore[call-overload]
--> 758 new_values = astype_array_safe(values, dtype, copy=copy, errors=errors)
760 new_values = maybe_coerce_values(new_values)
762 refs = None

File ~/work/pandas/pandas/pandas/core/dtypes/astype.py:237, in astype_array_safe(values, dtype, copy, errors)
234 dtype = dtype.numpy_dtype
236 try:
--> 237 new_values = astype_array(values, dtype, copy=copy)
238 except (ValueError, TypeError):
239 # e.g. _astype_nansafe can fail on object-dtype of strings
240 # trying to convert to float
241 if errors == "ignore":

File ~/work/pandas/pandas/pandas/core/dtypes/astype.py:179, in astype_array(values, dtype, copy)
175 return values
177 if not isinstance(values, np.ndarray):
178 # i.e. ExtensionArray
--> 179 values = values.astype(dtype, copy=copy)
181 else:
182 values = _astype_nansafe(values, dtype, copy=copy)

File ~/work/pandas/pandas/pandas/core/arrays/datetimes.py:739, in DatetimeArray.astype(self, dtype, copy)
737 elif isinstance(dtype, PeriodDtype):
738 return self.to_period(freq=dtype.freq)
--> 739 return dtl.DatetimeLikeArrayMixin.astype(self, dtype, copy)

File ~/work/pandas/pandas/pandas/core/arrays/datetimelike.py:494, in DatetimeLikeArrayMixin.astype(self, dtype, copy)
490 elif (dtype.kind in "mM" and self.dtype != dtype) or dtype.kind == "f":
491 # disallow conversion between datetime/timedelta,
492 # and conversions for any datetimelike to float
493 msg = f"Cannot cast {type(self).__name__} to dtype {dtype}"
--> 494 raise TypeError(msg)
495 else:
496 return np.asarray(self, dtype=dtype)

TypeError: Cannot cast DatetimeArray to dtype datetime64[D]

For conversion from timedelta64[ns] dtypes, the old behavior converted to a floating point format.

In [32]: idx = pd.timedelta_range("1 Day", periods=3)

In [33]: ser = pd.Series(idx)

Previous behavior:

In [7]: ser.astype("timedelta64[s]")
Out[7]:
0     86400.0
1    172800.0
2    259200.0
dtype: float64

In [8]: ser.astype("timedelta64[D]")
Out[8]:
0    1.0
1    2.0
2    3.0
dtype: float64

The new behavior, as for datetime64, either gives exactly the requested dtype or raises:

New behavior:

In [34]: ser.astype("timedelta64[s]")
Out[34]:
0   1 days
1   2 days
2   3 days
dtype: timedelta64[s]

In [35]: ser.astype("timedelta64[D]")

ValueError Traceback (most recent call last)
Cell In[35], line 1
----> 1 ser.astype("timedelta64[D]")

File ~/work/pandas/pandas/pandas/core/generic.py:6643, in NDFrame.astype(self, dtype, copy, errors)
6637 results = [
6638 ser.astype(dtype, copy=copy, errors=errors) for _, ser in self.items()
6639 ]
6641 else:
6642 # else, only a single dtype is given
-> 6643 new_data = self._mgr.astype(dtype=dtype, copy=copy, errors=errors)
6644 res = self._constructor_from_mgr(new_data, axes=new_data.axes)
6645 return res.__finalize__(self, method="astype")

File ~/work/pandas/pandas/pandas/core/internals/managers.py:430, in BaseBlockManager.astype(self, dtype, copy, errors)
427 elif using_copy_on_write():
428 copy = False
--> 430 return self.apply(
431 "astype",
432 dtype=dtype,
433 copy=copy,
434 errors=errors,
435 using_cow=using_copy_on_write(),
436 )

File ~/work/pandas/pandas/pandas/core/internals/managers.py:363, in BaseBlockManager.apply(self, f, align_keys, **kwargs)
361 applied = b.apply(f, **kwargs)
362 else:
--> 363 applied = getattr(b, f)(**kwargs)
364 result_blocks = extend_blocks(applied, result_blocks)
366 out = type(self).from_blocks(result_blocks, self.axes)

File ~/work/pandas/pandas/pandas/core/internals/blocks.py:758, in Block.astype(self, dtype, copy, errors, using_cow, squeeze)
755 raise ValueError("Can not squeeze with more than one column.")
756 values = values[0, :] # type: ignore[call-overload]
--> 758 new_values = astype_array_safe(values, dtype, copy=copy, errors=errors)
760 new_values = maybe_coerce_values(new_values)
762 refs = None

File ~/work/pandas/pandas/pandas/core/dtypes/astype.py:237, in astype_array_safe(values, dtype, copy, errors)
234 dtype = dtype.numpy_dtype
236 try:
--> 237 new_values = astype_array(values, dtype, copy=copy)
238 except (ValueError, TypeError):
239 # e.g. _astype_nansafe can fail on object-dtype of strings
240 # trying to convert to float
241 if errors == "ignore":

File ~/work/pandas/pandas/pandas/core/dtypes/astype.py:179, in astype_array(values, dtype, copy)
175 return values
177 if not isinstance(values, np.ndarray):
178 # i.e. ExtensionArray
--> 179 values = values.astype(dtype, copy=copy)
181 else:
182 values = _astype_nansafe(values, dtype, copy=copy)

File ~/work/pandas/pandas/pandas/core/arrays/timedeltas.py:358, in TimedeltaArray.astype(self, dtype, copy)
354 return type(self)._simple_new(
355 res_values, dtype=res_values.dtype, freq=self.freq
356 )
357 else:
--> 358 raise ValueError(
359 f"Cannot convert from {self.dtype} to {dtype}. "
360 "Supported resolutions are 's', 'ms', 'us', 'ns'"
361 )
363 return dtl.DatetimeLikeArrayMixin.astype(self, dtype, copy=copy)

ValueError: Cannot convert from timedelta64[ns] to timedelta64[D]. Supported resolutions are 's', 'ms', 'us', 'ns'
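The astype behavior described in this section can be summarized in one script. This is a minimal sketch, assuming pandas 2.0 or later. The helper try_astype is hypothetical, and dividing by pd.Timedelta is shown here as one possible way to obtain float values, not as an officially prescribed migration path.

```python
import pandas as pd

dt_ser = pd.Series(pd.date_range("2016-01-01", periods=3))
td_ser = pd.Series(pd.timedelta_range("1 Day", periods=3))

# Supported resolutions give exactly the requested dtype.
dt_seconds = dt_ser.astype("datetime64[s]")
td_seconds = td_ser.astype("timedelta64[s]")


def try_astype(ser, dtype):
    """Hypothetical helper: name of the exception raised by astype, or None."""
    try:
        ser.astype(dtype)
    except (TypeError, ValueError) as exc:
        return type(exc).__name__
    return None


# Unsupported resolutions raise; the tracebacks above show a TypeError for
# datetime64[D] and a ValueError for timedelta64[D].
dt_day_error = try_astype(dt_ser, "datetime64[D]")
td_day_error = try_astype(td_ser, "timedelta64[D]")

# Dividing by a Timedelta yields float64, similar to the old float result.
days_as_float = td_ser / pd.Timedelta(days=1)

print(dt_seconds.dtype, td_seconds.dtype, dt_day_error, td_day_error)
print(days_as_float.tolist())
```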

UTC and fixed-offset timezones default to standard-library tzinfo objects#

In previous versions, the default tzinfo object used to represent UTC was pytz.UTC. In pandas 2.0, we default to datetime.timezone.utc instead. Similarly, for timezones that represent fixed UTC offsets, we use datetime.timezone objects instead of pytz.FixedOffset objects (GH 34916).

Previous behavior:

In [2]: ts = pd.Timestamp("2016-01-01", tz="UTC")
In [3]: type(ts.tzinfo)
Out[3]: pytz.UTC

In [4]: ts2 = pd.Timestamp("2016-01-01 04:05:06-07:00")
In [5]: type(ts2.tzinfo)
Out[5]: pytz._FixedOffset

New behavior:

In [36]: ts = pd.Timestamp("2016-01-01", tz="UTC")

In [37]: type(ts.tzinfo)
Out[37]: datetime.timezone

In [38]: ts2 = pd.Timestamp("2016-01-01 04:05:06-07:00")

In [39]: type(ts2.tzinfo)
Out[39]: datetime.timezone

For timezones that are neither UTC nor fixed offsets, e.g. “US/Pacific”, we continue to default to pytz objects.
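Code that previously compared tzinfo against pytz objects may need to switch to the standard library. A minimal sketch of such checks, assuming pandas 2.0 or later; the variable names are illustrative.

```python
import datetime

import pandas as pd

ts_utc = pd.Timestamp("2016-01-01", tz="UTC")
ts_offset = pd.Timestamp("2016-01-01 04:05:06-07:00")

# Both tzinfo objects are standard-library datetime.timezone instances.
utc_is_stdlib = isinstance(ts_utc.tzinfo, datetime.timezone)
offset_is_stdlib = isinstance(ts_offset.tzinfo, datetime.timezone)

# The fixed offset parsed from the string is preserved.
offset = ts_offset.utcoffset()

print(utc_is_stdlib, offset_is_stdlib, offset)
```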

Empty DataFrames/Series will now default to have a RangeIndex#

Before, constructing an empty (where data is None or an empty list-like argument) Series or DataFrame without specifying the axes (index=None, columns=None) would return the axes as empty Index with object dtype.

Now, the axes return an empty RangeIndex (GH 49572).

Previous behavior:

In [8]: pd.Series().index
Out[8]: Index([], dtype='object')

In [9]: pd.DataFrame().axes
Out[9]: [Index([], dtype='object'), Index([], dtype='object')]

New behavior:

In [40]: pd.Series().index
Out[40]: RangeIndex(start=0, stop=0, step=1)

In [41]: pd.DataFrame().axes
Out[41]: [RangeIndex(start=0, stop=0, step=1), RangeIndex(start=0, stop=0, step=1)]
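Tests that asserted an object-dtype index on empty containers may need updating. A minimal sketch of the new state, assuming pandas 2.0 or later; an explicit dtype is passed to the Series here only to keep the example independent of the default dtype for empty data.

```python
import pandas as pd

empty_df = pd.DataFrame()
empty_ser = pd.Series(dtype="float64")

# Both axes of an empty DataFrame, and the index of an empty Series,
# are RangeIndex objects of length zero.
axes_are_range = all(isinstance(ax, pd.RangeIndex) for ax in empty_df.axes)
series_index_is_range = isinstance(empty_ser.index, pd.RangeIndex)

print(axes_are_range, series_index_is_range, len(empty_ser.index))
```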

DataFrame to LaTeX has a new render engine#

The existing DataFrame.to_latex() has been restructured to utilise the extended implementation previously available under Styler.to_latex(). The arguments signature is similar, albeit col_space has been removed since it is ignored by LaTeX engines. This render engine also requires jinja2 as a dependency which needs to be installed, since rendering is based upon jinja2 templates.

The pandas latex options below are no longer used and have been removed. The generic max rows and columns arguments remain but for this functionality should be replaced by the Styler equivalents. The alternative options giving similar functionality are indicated below:

Note that due to this change some defaults have also changed:

Note that the behaviour of _repr_latex_ has also changed. Previously, setting display.latex.repr would generate LaTeX only when using nbconvert for a Jupyter Notebook, and not when the user is running the notebook. Now the styler.render.repr option allows control of the specific output within Jupyter Notebooks for operations (not just on nbconvert). See GH 39911.

Increased minimum versions for dependencies#

Some minimum supported versions of dependencies were updated. If installed, we now require:

Package             Minimum Version   Required   Changed
mypy (dev)          1.0                          X
pytest (dev)        7.0.0                        X
pytest-xdist (dev)  2.2.0                        X
hypothesis (dev)    6.34.2                       X
python-dateutil     2.8.2             X          X
tzdata              2022.1            X          X

For optional libraries the general recommendation is to use the latest version. The following table lists the lowest version per library that is currently being tested throughout the development of pandas. Optional libraries below the lowest tested version may still work, but are not considered supported.

Package       Minimum Version   Changed
pyarrow       7.0.0             X
matplotlib    3.6.1             X
fastparquet   0.6.3             X
xarray        0.21.0            X

See Dependencies and Optional dependencies for more.

Datetimes are now parsed with a consistent format#

In the past, to_datetime() guessed the format for each element independently. This was appropriate for some cases where elements had mixed date formats - however, it would regularly cause problems when users expected a consistent format but the function would switch formats between elements. As of version 2.0.0, parsing will use a consistent format, determined by the first non-NA value (unless the user specifies a format, in which case that is used).

Old behavior:

In [1]: ser = pd.Series(['13-01-2000', '12-01-2000'])
In [2]: pd.to_datetime(ser)
Out[2]:
0   2000-01-13
1   2000-12-01
dtype: datetime64[ns]

New behavior:

In [42]: ser = pd.Series(['13-01-2000', '12-01-2000'])

In [43]: pd.to_datetime(ser)
Out[43]:
0   2000-01-13
1   2000-01-12
dtype: datetime64[ns]

Note that this affects read_csv() as well.

If you still need to parse dates with inconsistent formats, you can use format='mixed' (possibly alongside dayfirst):

ser = pd.Series(['13-01-2000', '12 January 2000'])
pd.to_datetime(ser, format='mixed', dayfirst=True)

or, if your formats are all ISO8601 (but possibly not identically-formatted):

ser = pd.Series(['2020-01-01', '2020-01-01 03:00'])
pd.to_datetime(ser, format='ISO8601')
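The three parsing modes discussed in this section can be compared side by side. This is a minimal sketch, assuming pandas 2.0 or later. Unlike the transcript above, the first call passes dayfirst=True explicitly, which is an addition made here to state the intended day-first reading up front.

```python
import pandas as pd

# Consistent format: the format inferred from the first element is
# applied to every element.
same_format = pd.Series(["13-01-2000", "12-01-2000"])
consistent = pd.to_datetime(same_format, dayfirst=True)

# Mixed formats: each element is parsed individually.
mixed_input = pd.Series(["13-01-2000", "12 January 2000"])
mixed = pd.to_datetime(mixed_input, format="mixed", dayfirst=True)

# ISO 8601 strings that are not identically formatted.
iso_input = pd.Series(["2020-01-01", "2020-01-01 03:00"])
iso = pd.to_datetime(iso_input, format="ISO8601")

print(consistent.tolist())
print(mixed.tolist())
print(iso.tolist())
```

Passing an explicit format string remains the most predictable option when the layout of the input is known in advance.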

Other API changes#

Note

A current PDEP proposes the deprecation and removal of the keywords inplace and copy for all but a small subset of methods from the pandas API. The current discussion takes place here. The keywords won’t be necessary anymore in the context of Copy-on-Write. If this proposal is accepted, both keywords would be deprecated in the next release of pandas and removed in pandas 3.0.

Deprecations#

Removal of prior version deprecations/changes#

Performance improvements#

Bug fixes#

Categorical#

Datetimelike#

Timedelta#

Timezones#

Numeric#

Conversion#

Strings#

Interval#

Indexing#

Missing#

MultiIndex#

I/O#

Period#

Plotting#

Groupby/resample/rolling#

Reshaping#

Sparse#

ExtensionArray#

Styler#

Metadata#

Other#

Contributors#

A total of 260 people contributed patches to this release. People with a “+” by their names contributed a patch for the first time.