What’s new in 1.0.0 (January 29, 2020)

These are the changes in pandas 1.0.0. See Release notes for a full changelog including other versions of pandas.

Note

The pandas 1.0 release removed a lot of functionality that was deprecated in previous releases (see below for an overview). It is recommended to first upgrade to pandas 0.25 and to ensure your code is working without warnings, before upgrading to pandas 1.0.

New deprecation policy#

Starting with pandas 1.0.0, pandas will adopt a variant of SemVer to version releases. Briefly,

  * Deprecations will be introduced in minor releases (e.g. 1.1.0, 1.2.0, …)
  * API-breaking changes will only be released in major releases (e.g. 2.0.0, 3.0.0, …)

See Version policy for more.

Enhancements#

Using Numba in rolling.apply and expanding.apply#

We’ve added an engine keyword to Rolling.apply() and Expanding.apply() that allows the user to execute the routine using Numba instead of Cython. Using the Numba engine can yield significant performance gains if the apply function can operate on numpy arrays and the data set is larger (1 million rows or greater). For more details, see the rolling apply documentation (GH 28987, GH 30936)
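
A minimal sketch of how this might look, assuming Numba is installed (the function name and window size here are illustrative):

import numpy as np
import pandas as pd

# The applied function must operate on a raw NumPy array, so pass raw=True;
# the Numba engine compiles it on first use and reuses the compiled version after.
def window_mean(values):
    return np.mean(values)

s = pd.Series(np.arange(1_000_000, dtype="float64"))
result = s.rolling(10).apply(window_mean, raw=True, engine="numba")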

Defining custom windows for rolling operations#

We’ve added a pandas.api.indexers.BaseIndexer() class that allows users to define how window bounds are created during rolling operations. Users can define their own get_window_bounds method on a pandas.api.indexers.BaseIndexer() subclass that will generate the start and end indices used for each window during the rolling aggregation. For more details and example usage, see the custom window rolling documentation.
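
For illustration, a minimal sketch of a custom indexer; the class name is hypothetical, and the get_window_bounds signature shown is the one documented for pandas 1.0:

import numpy as np
import pandas as pd
from pandas.api.indexers import BaseIndexer

class TrailingWindowIndexer(BaseIndexer):
    # Build expanding-style windows capped at self.window_size values.
    def get_window_bounds(self, num_values=0, min_periods=None, center=None, closed=None):
        end = np.arange(1, num_values + 1, dtype=np.int64)
        start = np.maximum(end - self.window_size, 0)
        return start, end

s = pd.Series(range(6))
# Windows equivalent to a trailing window of up to 3 values.
s.rolling(TrailingWindowIndexer(window_size=3)).sum()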

Converting to markdown#

We’ve added to_markdown() for creating a markdown table (GH 11052)

In [1]: df = pd.DataFrame({"A": [1, 2, 3], "B": [1, 2, 3]}, index=['a', 'a', 'b'])

In [2]: print(df.to_markdown())
|    |   A |   B |
|:---|----:|----:|
| a  |   1 |   1 |
| a  |   2 |   2 |
| b  |   3 |   3 |

Experimental new features#

Experimental NA scalar to denote missing values#

A new pd.NA value (singleton) is introduced to represent scalar missing values. Up to now, pandas used several values to represent missing data: np.nan for float data, np.nan or None for object-dtype data, and pd.NaT for datetime-like data. The goal of pd.NA is to provide a “missing” indicator that can be used consistently across data types. pd.NA is currently used by the nullable integer and boolean data types and the new string data type (GH 28095).

Warning

Experimental: the behaviour of pd.NA can still change without warning.

For example, creating a Series using the nullable integer dtype:

In [3]: s = pd.Series([1, 2, None], dtype="Int64")

In [4]: s
Out[4]: 
0       1
1       2
2    <NA>
Length: 3, dtype: Int64

In [5]: s[2]
Out[5]: <NA>

Compared to np.nan, pd.NA behaves differently in certain operations. In addition to arithmetic operations, pd.NA also propagates as “missing” or “unknown” in comparison operations:

In [6]: np.nan > 1
Out[6]: False

In [7]: pd.NA > 1
Out[7]: <NA>

For logical operations, pd.NA follows the rules of three-valued logic (or Kleene logic). For example:

In [8]: pd.NA | True
Out[8]: True
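
A few more cases, as a sketch of the same Kleene rules (the result is only missing when it actually depends on the unknown operand):

pd.NA | False  # <NA>  -- "or" with False depends on the missing value
pd.NA & False  # False -- "and" with False is False regardless
pd.NA & True   # <NA>  -- "and" with True depends on the missing value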

For more, see the NA section in the user guide on missing data.

Dedicated string data type#

We’ve added StringDtype, an extension type dedicated to string data. Previously, strings were typically stored in object-dtype NumPy arrays. (GH 29975)

Warning

StringDtype is currently considered experimental. The implementation and parts of the API may change without warning.

The 'string' extension type solves several issues with object-dtype NumPy arrays:

  1. You can accidentally store a mixture of strings and non-strings in an object dtype array. A StringArray can only store strings.
  2. object dtype breaks dtype-specific operations like DataFrame.select_dtypes(). There isn’t a clear way to select just text while excluding non-text but still object-dtype columns (see the sketch after this list).
  3. When reading code, the contents of an object dtype array are less clear than 'string'.
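
To illustrate the second point, a small sketch (the column names are made up): with the dedicated dtype, text columns can be selected unambiguously.

import pandas as pd

df = pd.DataFrame({
    "text": pd.array(["a", "b"], dtype="string"),
    "mixed": ["a", 1],          # object dtype holding a string and an int
    "num": [1.0, 2.0],
})

df.select_dtypes(include="string")  # selects only the 'text' column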

In [9]: pd.Series(['abc', None, 'def'], dtype=pd.StringDtype())
Out[9]: 
0     abc
1    <NA>
2     def
Length: 3, dtype: string

You can use the alias "string" as well.

In [10]: s = pd.Series(['abc', None, 'def'], dtype="string")

In [11]: s
Out[11]: 
0     abc
1    <NA>
2     def
Length: 3, dtype: string

The usual string accessor methods work. Where appropriate, the return type of the Series or columns of a DataFrame will also have string dtype.

In [12]: s.str.upper()
Out[12]: 
0     ABC
1    <NA>
2     DEF
Length: 3, dtype: string

In [13]: s.str.split('b', expand=True).dtypes
Out[13]: 
0    string[python]
1    string[python]
Length: 2, dtype: object

String accessor methods returning integers will return a value with Int64Dtype

In [14]: s.str.count("a") Out[14]: 0 1 1 2 0 Length: 3, dtype: Int64

We recommend explicitly using the string data type when working with strings. See Text data types for more.

Boolean data type with missing values support#

We’ve added BooleanDtype / BooleanArray, an extension type dedicated to boolean data that can hold missing values. The default bool data type, based on a bool-dtype NumPy array, can only hold True or False, not missing values. The new BooleanArray can store missing values as well, by keeping track of them in a separate mask. (GH 29555, GH 30095, GH 31131)

In [15]: pd.Series([True, False, None], dtype=pd.BooleanDtype())
Out[15]: 
0     True
1    False
2     <NA>
Length: 3, dtype: boolean

You can use the alias "boolean" as well.

In [16]: s = pd.Series([True, False, None], dtype="boolean")

In [17]: s
Out[17]: 
0     True
1    False
2     <NA>
Length: 3, dtype: boolean
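
Elementwise logical operations on the new dtype also follow the Kleene logic described above; a sketch:

s = pd.Series([True, False, None], dtype="boolean")

s & True   # [True, False, <NA>] -- missing where the result is unknown
s | True   # [True, True, True]  -- True regardless of the missing value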

Method convert_dtypes to ease use of supported extension dtypes#

In order to encourage use of the extension dtypes StringDtype, BooleanDtype, Int64Dtype, Int32Dtype, etc., that support pd.NA, the methods DataFrame.convert_dtypes() and Series.convert_dtypes() have been introduced. (GH 29752) (GH 30929)

Example:

In [18]: df = pd.DataFrame({'x': ['abc', None, 'def'],
   ....:                    'y': [1, 2, np.nan],
   ....:                    'z': [True, False, True]})
   ....: 

In [19]: df
Out[19]: 
      x    y      z
0   abc  1.0   True
1  None  2.0  False
2   def  NaN   True

[3 rows x 3 columns]

In [20]: df.dtypes
Out[20]: 
x     object
y    float64
z       bool
Length: 3, dtype: object

In [21]: converted = df.convert_dtypes()

In [22]: converted
Out[22]: 
      x     y      z
0   abc     1   True
1  <NA>     2  False
2   def  <NA>   True

[3 rows x 3 columns]

In [23]: converted.dtypes
Out[23]: 
x    string[python]
y             Int64
z           boolean
Length: 3, dtype: object

This is especially useful after reading in data using readers such as read_csv() and read_excel(). See here for a description.
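
A minimal sketch of that workflow (the CSV payload here is made up):

import io
import pandas as pd

csv = io.StringIO("x,y,z\nabc,1,True\n,2,False\ndef,,True\n")

df = pd.read_csv(csv)
df.dtypes                    # object / float64 / bool, as inferred by the parser
df.convert_dtypes().dtypes   # string / Int64 / boolean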

Other enhancements#

Backwards incompatible API changes#

Avoid using names from MultiIndex.levels#

As part of a larger refactor to MultiIndex the level names are now stored separately from the levels (GH 27242). We recommend using MultiIndex.names to access the names, and Index.set_names() to update the names.

For backwards compatibility, you can still access the names via the levels.

In [24]: mi = pd.MultiIndex.from_product([[1, 2], ['a', 'b']], names=['x', 'y'])

In [25]: mi.levels[0].name
Out[25]: 'x'

However, it is no longer possible to update the names of the MultiIndex via the level.

In [26]: mi.levels[0].name = "new name"

---------------------------------------------------------------------------
RuntimeError                              Traceback (most recent call last)
Cell In[26], line 1
----> 1 mi.levels[0].name = "new name"

File ~/work/pandas/pandas/pandas/core/indexes/base.py:1690, in Index.name(self, value)
   1686 @name.setter
   1687 def name(self, value: Hashable) -> None:
   1688     if self._no_setting_name:
   1689         # Used in MultiIndex.levels to avoid silently ignoring name updates.
-> 1690         raise RuntimeError(
   1691             "Cannot set name on a level of a MultiIndex. Use "
   1692             "'MultiIndex.set_names' instead."
   1693         )
   1694     maybe_extract_name(value, None, type(self))
   1695     self._name = value

RuntimeError: Cannot set name on a level of a MultiIndex. Use 'MultiIndex.set_names' instead.

In [27]: mi.names
Out[27]: FrozenList(['x', 'y'])

To update, use MultiIndex.set_names, which returns a new MultiIndex.

In [28]: mi2 = mi.set_names("new name", level=0)

In [29]: mi2.names
Out[29]: FrozenList(['new name', 'y'])

New repr for IntervalArray#

pandas.arrays.IntervalArray adopts a new __repr__ in accordance with other array classes (GH 25022)

pandas 0.25.x

In [1]: pd.arrays.IntervalArray.from_tuples([(0, 1), (2, 3)])
Out[2]:
IntervalArray([(0, 1], (2, 3]],
              closed='right',
              dtype='interval[int64]')

pandas 1.0.0

In [30]: pd.arrays.IntervalArray.from_tuples([(0, 1), (2, 3)])
Out[30]: 
<IntervalArray>
[(0, 1], (2, 3]]
Length: 2, dtype: interval[int64, right]

DataFrame.rename now only accepts one positional argument#

DataFrame.rename() would previously accept positional arguments that would lead to ambiguous or undefined behavior. From pandas 1.0, only the very first argument, which maps labels to their new names along the default axis, is allowed to be passed by position (GH 29136).

pandas 0.25.x

In [1]: df = pd.DataFrame([[1]])

In [2]: df.rename({0: 1}, {0: 2})
Out[2]: 
FutureWarning: ...Use named arguments to resolve ambiguity...
   2
1  1

pandas 1.0.0

In [3]: df.rename({0: 1}, {0: 2})
Traceback (most recent call last):
    ...
TypeError: rename() takes from 1 to 2 positional arguments but 3 were given

Note that errors will now be raised when conflicting or potentially ambiguous arguments are provided.

pandas 0.25.x

In [4]: df.rename({0: 1}, index={0: 2})
Out[4]: 
   0
1  1

In [5]: df.rename(mapper={0: 1}, index={0: 2})
Out[5]: 
   0
2  1

pandas 1.0.0

In [6]: df.rename({0: 1}, index={0: 2})
Traceback (most recent call last):
    ...
TypeError: Cannot specify both 'mapper' and any of 'index' or 'columns'

In [7]: df.rename(mapper={0: 1}, index={0: 2})
Traceback (most recent call last):
    ...
TypeError: Cannot specify both 'mapper' and any of 'index' or 'columns'

You can still change the axis along which the first positional argument is applied by supplying the axis keyword argument.

In [31]: df.rename({0: 1})
Out[31]: 
   0
1  1

[1 rows x 1 columns]

In [32]: df.rename({0: 1}, axis=1)
Out[32]: 
   1
0  1

[1 rows x 1 columns]

If you would like to update both the index and column labels, be sure to use the respective keywords.

In [33]: df.rename(index={0: 1}, columns={0: 2})
Out[33]: 
   2
1  1

[1 rows x 1 columns]

Extended verbose info output for DataFrame#

DataFrame.info() now shows line numbers for the columns summary (GH 17304)

pandas 0.25.x

In [1]: df = pd.DataFrame({"int_col": [1, 2, 3], ... "text_col": ["a", "b", "c"], ... "float_col": [0.0, 0.1, 0.2]}) In [2]: df.info(verbose=True) <class 'pandas.core.frame.DataFrame'> RangeIndex: 3 entries, 0 to 2 Data columns (total 3 columns): int_col 3 non-null int64 text_col 3 non-null object float_col 3 non-null float64 dtypes: float64(1), int64(1), object(1) memory usage: 152.0+ bytes

pandas 1.0.0

In [34]: df = pd.DataFrame({"int_col": [1, 2, 3], ....: "text_col": ["a", "b", "c"], ....: "float_col": [0.0, 0.1, 0.2]}) ....:

In [35]: df.info(verbose=True)
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3 entries, 0 to 2
Data columns (total 3 columns):
 #   Column     Non-Null Count  Dtype  
---  ------     --------------  -----  
 0   int_col    3 non-null      int64  
 1   text_col   3 non-null      object 
 2   float_col  3 non-null      float64
dtypes: float64(1), int64(1), object(1)
memory usage: 200.0+ bytes

pandas.array() inference changes#

pandas.array() now infers pandas’ new extension types in several cases (GH 29791):

  1. String data (including missing values) now returns an arrays.StringArray.
  2. Integer data (including missing values) now returns an arrays.IntegerArray.
  3. Boolean data (including missing values) now returns the new arrays.BooleanArray.

pandas 0.25.x

In [1]: pd.array(["a", None]) Out[1]: ['a', None] Length: 2, dtype: object

In [2]: pd.array([1, None]) Out[2]: [1, None] Length: 2, dtype: object

pandas 1.0.0

In [36]: pd.array(["a", None]) Out[36]: ['a', ] Length: 2, dtype: string

In [37]: pd.array([1, None]) Out[37]: [1, ] Length: 2, dtype: Int64

As a reminder, you can specify the dtype to disable all inference.
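
For example, a sketch:

pd.array(["a", None], dtype="object")  # keeps object dtype; no StringArray inference
pd.array([1, None], dtype="Int64")     # explicitly request the nullable integer type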

arrays.IntegerArray now uses pandas.NA#

arrays.IntegerArray now uses pandas.NA rather than numpy.nan as its missing value marker (GH 29964).

pandas 0.25.x

In [1]: a = pd.array([1, 2, None], dtype="Int64")

In [2]: a
Out[2]: 
<IntegerArray>
[1, 2, NaN]
Length: 3, dtype: Int64

In [3]: a[2]
Out[3]: nan

pandas 1.0.0

In [38]: a = pd.array([1, 2, None], dtype="Int64")

In [39]: a
Out[39]: 
<IntegerArray>
[1, 2, <NA>]
Length: 3, dtype: Int64

In [40]: a[2]
Out[40]: <NA>

This has a few API-breaking consequences.

Converting to a NumPy ndarray

When converting to a NumPy array, missing values will be pd.NA, which cannot be converted to a float. So calling np.asarray(integer_array, dtype="float") will now raise.

pandas 0.25.x

In [1]: np.asarray(a, dtype="float")
Out[1]: array([ 1.,  2., nan])

pandas 1.0.0

In [41]: np.asarray(a, dtype="float")
Out[41]: array([ 1.,  2., nan])

Use arrays.IntegerArray.to_numpy() with an explicit na_value instead.

In [42]: a.to_numpy(dtype="float", na_value=np.nan)
Out[42]: array([ 1.,  2., nan])

Reductions can return pd.NA

When performing a reduction such as a sum with skipna=False, the result will now be pd.NA instead of np.nan in presence of missing values (GH 30958).

pandas 0.25.x

In [1]: pd.Series(a).sum(skipna=False)
Out[1]: nan

pandas 1.0.0

In [43]: pd.Series(a).sum(skipna=False)
Out[43]: <NA>

value_counts returns a nullable integer dtype

Series.value_counts() with a nullable integer dtype now returns a nullable integer dtype for the values.

pandas 0.25.x

In [1]: pd.Series([2, 1, 1, None], dtype="Int64").value_counts().dtype
Out[1]: dtype('int64')

pandas 1.0.0

In [44]: pd.Series([2, 1, 1, None], dtype="Int64").value_counts().dtype
Out[44]: Int64Dtype()

See NA semantics for more on the differences between pandas.NA and numpy.nan.

arrays.IntegerArray comparisons return arrays.BooleanArray#

Comparison operations on an arrays.IntegerArray now return an arrays.BooleanArray rather than a NumPy array (GH 29964).

pandas 0.25.x

In [1]: a = pd.array([1, 2, None], dtype="Int64")

In [2]: a
Out[2]: 
<IntegerArray>
[1, 2, NaN]
Length: 3, dtype: Int64

In [3]: a > 1
Out[3]: 
array([False,  True, False])

pandas 1.0.0

In [45]: a = pd.array([1, 2, None], dtype="Int64")

In [46]: a > 1
Out[46]: 
<BooleanArray>
[False, True, <NA>]
Length: 3, dtype: boolean

Note that missing values now propagate, rather than always comparing unequal like numpy.nan. See NA semantics for more.

By default Categorical.min() now returns the minimum instead of np.nan#

When Categorical contains np.nan, Categorical.min() no longer returns np.nan by default (skipna=True) (GH 25303)

pandas 0.25.x

In [1]: pd.Categorical([1, 2, np.nan], ordered=True).min()
Out[1]: nan

pandas 1.0.0

In [47]: pd.Categorical([1, 2, np.nan], ordered=True).min()
Out[47]: 1

Default dtype of empty pandas.Series#

Initialising an empty pandas.Series without specifying a dtype now raises a DeprecationWarning (GH 17261). The default dtype will change from float64 to object in future releases so that it is consistent with the behaviour of DataFrame and Index.

pandas 1.0.0

In [1]: pd.Series()
Out[2]: 
DeprecationWarning: The default dtype for empty Series will be 'object' instead of 'float64' in a future version. Specify a dtype explicitly to silence this warning.
Series([], dtype: float64)
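
To silence the warning, pass the dtype explicitly; a sketch:

pd.Series(dtype="float64")  # keep the current default explicitly
pd.Series(dtype="object")   # opt in to the future default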

Result dtype inference changes for resample operations#

The rules for the result dtype in DataFrame.resample() aggregations have changed for extension types (GH 31359). Previously, pandas would attempt to convert the result back to the original dtype, falling back to the usual inference rules if that was not possible. Now, pandas will only return a result of the original dtype if the scalar values in the result are instances of the extension dtype’s scalar type.

In [48]: df = pd.DataFrame({"A": ['a', 'b']}, dtype='category', ....: index=pd.date_range('2000', periods=2)) ....:

In [49]: df
Out[49]: 
            A
2000-01-01  a
2000-01-02  b

[2 rows x 1 columns]

pandas 0.25.x

In [1]: df.resample("2D").agg(lambda x: 'a').A.dtype
Out[1]: 
CategoricalDtype(categories=['a', 'b'], ordered=False)

pandas 1.0.0

In [50]: df.resample("2D").agg(lambda x: 'a').A.dtype
Out[50]: CategoricalDtype(categories=['a', 'b'], ordered=False, categories_dtype=object)

This fixes an inconsistency between resample and groupby. This also fixes a potential bug, where the values of the result might change depending on how the results are cast back to the original dtype.

pandas 0.25.x

In [1]: df.resample("2D").agg(lambda x: 'c')
Out[1]: 
     A
0  NaN

pandas 1.0.0

In [51]: df.resample("2D").agg(lambda x: 'c')
Out[51]: 
            A
2000-01-01  c

[1 rows x 1 columns]

Increased minimum version for Python#

pandas 1.0.0 supports Python 3.6.1 and higher (GH 29212).

Increased minimum versions for dependencies#

Some minimum supported versions of dependencies were updated (GH 29766, GH 29723). If installed, we now require:

| Package         | Minimum Version | Required | Changed |
|-----------------|-----------------|----------|---------|
| numpy           | 1.13.3          | X        |         |
| pytz            | 2015.4          | X        |         |
| python-dateutil | 2.6.1           | X        |         |
| bottleneck      | 1.2.1           |          |         |
| numexpr         | 2.6.2           |          |         |
| pytest (dev)    | 4.0.2           |          |         |

For optional libraries the general recommendation is to use the latest version. The following table lists the lowest version per library that is currently being tested throughout the development of pandas. Optional libraries below the lowest tested version may still work, but are not considered supported.

| Package        | Minimum Version | Changed |
|----------------|-----------------|---------|
| beautifulsoup4 | 4.6.0           |         |
| fastparquet    | 0.3.2           | X       |
| gcsfs          | 0.2.2           |         |
| lxml           | 3.8.0           |         |
| matplotlib     | 2.2.2           |         |
| numba          | 0.46.0          | X       |
| openpyxl       | 2.5.7           | X       |
| pyarrow        | 0.13.0          | X       |
| pymysql        | 0.7.1           |         |
| pytables       | 3.4.2           |         |
| s3fs           | 0.3.0           | X       |
| scipy          | 0.19.0          |         |
| sqlalchemy     | 1.1.4           |         |
| xarray         | 0.8.2           |         |
| xlrd           | 1.1.0           |         |
| xlsxwriter     | 0.9.8           |         |
| xlwt           | 1.2.0           |         |

See Dependencies and Optional dependencies for more.

Build changes#

pandas has added a pyproject.toml file and will no longer include cythonized files in the source distribution uploaded to PyPI (GH 28341, GH 20775). If you’re installing a built distribution (wheel) or via conda, this shouldn’t have any effect on you. If you’re building pandas from source, you should no longer need to install Cython into your build environment before calling pip install pandas.

Other API changes#

Documentation improvements#

Deprecations#

Selecting Columns from a Grouped DataFrame

When selecting columns from a DataFrameGroupBy object, passing individual keys (or a tuple of keys) inside single brackets is deprecated; a list of items should be used instead (GH 23566). For example:

df = pd.DataFrame({ "A": ["foo", "bar", "foo", "bar", "foo", "bar", "foo", "foo"], "B": np.random.randn(8), "C": np.random.randn(8), }) g = df.groupby('A')

# single key, returns SeriesGroupBy
g['B']

# tuple of single key, returns SeriesGroupBy
g[('B',)]

# tuple of multiple keys, returns DataFrameGroupBy, raises FutureWarning
g[('B', 'C')]

# multiple keys passed directly, returns DataFrameGroupBy, raises FutureWarning
# (implicitly converts the passed strings into a single tuple)
g['B', 'C']

# proper way, returns DataFrameGroupBy
g[['B', 'C']]

Removal of prior version deprecations/changes#

Removed SparseSeries and SparseDataFrame

SparseSeries, SparseDataFrame and the DataFrame.to_sparse method have been removed (GH 28425). We recommend using a Series or DataFrame with sparse values instead.

Matplotlib unit registration

Previously, pandas would register converters with matplotlib as a side effect of importing pandas (GH 18720). This changed the output of plots made via matplotlib after pandas was imported, even if you were using matplotlib directly rather than plot().

To use pandas formatters with a matplotlib plot, specify:

In [1]: import pandas as pd

In [2]: pd.options.plotting.matplotlib.register_converters = True

Note that plots created by DataFrame.plot() and Series.plot() do register the converters automatically. The only behavior change is when plotting a date-like object via matplotlib.pyplot.plot or matplotlib.Axes.plot. See Custom formatters for timeseries plots for more.
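
As a sketch, the function-based opt-in pandas.plotting.register_matplotlib_converters() can be used the same way in a script:

import matplotlib.pyplot as plt
import pandas as pd
from pandas.plotting import register_matplotlib_converters

register_matplotlib_converters()  # opt in to pandas' date converters

s = pd.Series(range(3), index=pd.date_range("2000-01-01", periods=3))
fig, ax = plt.subplots()
ax.plot(s.index, s.to_numpy())  # the datetime axis now uses pandas' formatters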

Other removals

Performance improvements#

Bug fixes#

Categorical#

Datetimelike#

Timedelta#

Timezones#

Numeric#

Conversion#

Strings#

Interval#

Indexing#

Missing#

MultiIndex#

IO#

Plotting#

GroupBy/resample/rolling#

Reshaping#

Sparse#

ExtensionArray#

Other#

Contributors#

A total of 308 people contributed patches to this release. People with a “+” by their names contributed a patch for the first time.