Version 0.19.0 (October 2, 2016)

This is a major release from 0.18.1 and includes a number of API changes, several new features, enhancements, and performance improvements along with a large number of bug fixes. We recommend that all users upgrade to this version.

Highlights include the new merge_asof() function for asof-style time-series joining, time-series aware .rolling(), read_csv() support for parsing Categorical data directly, a new period dtype for PeriodIndex, enhanced int64 and bool support for sparse data structures, Series comparison operations that no longer ignore the index, and the introduction of a pandas development API (pandas.api).

Warning

pandas >= 0.19.0 will no longer silence numpy ufunc warnings upon import, see here.


New features#

Function merge_asof for asof-style time-series joining#

A long-requested feature has been added through the merge_asof() function, to support asof-style joining of time-series (GH 1870, GH 13695, GH 13709, GH 13902). Full documentation is here.

merge_asof() performs an asof merge, which is similar to a left join except that we match on the nearest key rather than on equal keys.

In [1]: left = pd.DataFrame({"a": [1, 5, 10], "left_val": ["a", "b", "c"]})

In [2]: right = pd.DataFrame({"a": [1, 2, 3, 6, 7], "right_val": [1, 2, 3, 6, 7]})

In [3]: left Out[3]: a left_val 0 1 a 1 5 b 2 10 c

[3 rows x 2 columns]

In [4]: right Out[4]: a right_val 0 1 1 1 2 2 2 3 3 3 6 6 4 7 7

[5 rows x 2 columns]

We typically want to match exactly when possible, and use the most recent value otherwise.

In [5]: pd.merge_asof(left, right, on="a") Out[5]: a left_val right_val 0 1 a 1 1 5 b 3 2 10 c 7

[3 rows x 3 columns]

We can also match rows ONLY with prior data, and not an exact match.

In [6]: pd.merge_asof(left, right, on="a", allow_exact_matches=False) Out[6]: a left_val right_val 0 1 a NaN 1 5 b 3.0 2 10 c 7.0

[3 rows x 3 columns]

In a typical time-series example, we have trades and quotes and we want to asof-join them. This also illustrates using the by parameter to group data before merging.

In [7]: trades = pd.DataFrame( ...: { ...: "time": pd.to_datetime( ...: [ ...: "20160525 13:30:00.023", ...: "20160525 13:30:00.038", ...: "20160525 13:30:00.048", ...: "20160525 13:30:00.048", ...: "20160525 13:30:00.048", ...: ] ...: ), ...: "ticker": ["MSFT", "MSFT", "GOOG", "GOOG", "AAPL"], ...: "price": [51.95, 51.95, 720.77, 720.92, 98.00], ...: "quantity": [75, 155, 100, 100, 100], ...: }, ...: columns=["time", "ticker", "price", "quantity"], ...: ) ...:

In [8]: quotes = pd.DataFrame( ...: { ...: "time": pd.to_datetime( ...: [ ...: "20160525 13:30:00.023", ...: "20160525 13:30:00.023", ...: "20160525 13:30:00.030", ...: "20160525 13:30:00.041", ...: "20160525 13:30:00.048", ...: "20160525 13:30:00.049", ...: "20160525 13:30:00.072", ...: "20160525 13:30:00.075", ...: ] ...: ), ...: "ticker": ["GOOG", "MSFT", "MSFT", "MSFT", "GOOG", "AAPL", "GOOG", "MSFT"], ...: "bid": [720.50, 51.95, 51.97, 51.99, 720.50, 97.99, 720.50, 52.01], ...: "ask": [720.93, 51.96, 51.98, 52.00, 720.93, 98.01, 720.88, 52.03], ...: }, ...: columns=["time", "ticker", "bid", "ask"], ...: ) ...:

In [9]: trades Out[9]: time ticker price quantity 0 2016-05-25 13:30:00.023 MSFT 51.95 75 1 2016-05-25 13:30:00.038 MSFT 51.95 155 2 2016-05-25 13:30:00.048 GOOG 720.77 100 3 2016-05-25 13:30:00.048 GOOG 720.92 100 4 2016-05-25 13:30:00.048 AAPL 98.00 100

[5 rows x 4 columns]

In [10]: quotes Out[10]: time ticker bid ask 0 2016-05-25 13:30:00.023 GOOG 720.50 720.93 1 2016-05-25 13:30:00.023 MSFT 51.95 51.96 2 2016-05-25 13:30:00.030 MSFT 51.97 51.98 3 2016-05-25 13:30:00.041 MSFT 51.99 52.00 4 2016-05-25 13:30:00.048 GOOG 720.50 720.93 5 2016-05-25 13:30:00.049 AAPL 97.99 98.01 6 2016-05-25 13:30:00.072 GOOG 720.50 720.88 7 2016-05-25 13:30:00.075 MSFT 52.01 52.03

[8 rows x 4 columns]

An asof merge joins on the on field, typically an ordered datetime-like field, and in this case we are using a grouper in the by field. This is like a left-outer join, except that forward filling happens automatically, taking the most recent non-NaN value.

In [11]: pd.merge_asof(trades, quotes, on="time", by="ticker") Out[11]: time ticker price quantity bid ask 0 2016-05-25 13:30:00.023 MSFT 51.95 75 51.95 51.96 1 2016-05-25 13:30:00.038 MSFT 51.95 155 51.97 51.98 2 2016-05-25 13:30:00.048 GOOG 720.77 100 720.50 720.93 3 2016-05-25 13:30:00.048 GOOG 720.92 100 720.50 720.93 4 2016-05-25 13:30:00.048 AAPL 98.00 100 NaN NaN

[5 rows x 6 columns]

This returns a merged DataFrame with the entries in the same order as the original left passed DataFrame (trades in this case), with the fields of the quotes merged.

Method .rolling() is now time-series aware#

.rolling() objects are now time-series aware and can accept a time-series offset (or convertible) for the window argument (GH 13327, GH 12995). See the full documentation here.

In [12]: dft = pd.DataFrame( ....: {"B": [0, 1, 2, np.nan, 4]}, ....: index=pd.date_range("20130101 09:00:00", periods=5, freq="s"), ....: ) ....:

In [13]: dft Out[13]: B 2013-01-01 09:00:00 0.0 2013-01-01 09:00:01 1.0 2013-01-01 09:00:02 2.0 2013-01-01 09:00:03 NaN 2013-01-01 09:00:04 4.0

[5 rows x 1 columns]

This is a regular frequency index. Using an integer window parameter works to roll along the window frequency.

In [14]: dft.rolling(2).sum() Out[14]: B 2013-01-01 09:00:00 NaN 2013-01-01 09:00:01 1.0 2013-01-01 09:00:02 3.0 2013-01-01 09:00:03 NaN 2013-01-01 09:00:04 NaN

[5 rows x 1 columns]

In [15]: dft.rolling(2, min_periods=1).sum() Out[15]: B 2013-01-01 09:00:00 0.0 2013-01-01 09:00:01 1.0 2013-01-01 09:00:02 3.0 2013-01-01 09:00:03 2.0 2013-01-01 09:00:04 4.0

[5 rows x 1 columns]

Specifying an offset allows a more intuitive specification of the rolling frequency.

In [16]: dft.rolling("2s").sum() Out[16]: B 2013-01-01 09:00:00 0.0 2013-01-01 09:00:01 1.0 2013-01-01 09:00:02 3.0 2013-01-01 09:00:03 2.0 2013-01-01 09:00:04 4.0

[5 rows x 1 columns]

Using a non-regular, but still monotonic index, rolling with an integer window does not impart any special calculation.

In [17]: dft = pd.DataFrame( ....: {"B": [0, 1, 2, np.nan, 4]}, ....: index=pd.Index( ....: [ ....: pd.Timestamp("20130101 09:00:00"), ....: pd.Timestamp("20130101 09:00:02"), ....: pd.Timestamp("20130101 09:00:03"), ....: pd.Timestamp("20130101 09:00:05"), ....: pd.Timestamp("20130101 09:00:06"), ....: ], ....: name="foo", ....: ), ....: ) ....:

In [18]: dft
Out[18]:
                       B
foo
2013-01-01 09:00:00  0.0
2013-01-01 09:00:02  1.0
2013-01-01 09:00:03  2.0
2013-01-01 09:00:05  NaN
2013-01-01 09:00:06  4.0

[5 rows x 1 columns]

In [19]: dft.rolling(2).sum()
Out[19]:
                       B
foo
2013-01-01 09:00:00  NaN
2013-01-01 09:00:02  1.0
2013-01-01 09:00:03  3.0
2013-01-01 09:00:05  NaN
2013-01-01 09:00:06  NaN

[5 rows x 1 columns]

Using the time-specification generates variable windows for this sparse data.

In [20]: dft.rolling("2s").sum()
Out[20]:
                       B
foo
2013-01-01 09:00:00  0.0
2013-01-01 09:00:02  1.0
2013-01-01 09:00:03  3.0
2013-01-01 09:00:05  NaN
2013-01-01 09:00:06  4.0

[5 rows x 1 columns]

Furthermore, we now allow an optional on parameter to specify a column (rather than the default of the index) in a DataFrame.

In [21]: dft = dft.reset_index()

In [22]: dft Out[22]: foo B 0 2013-01-01 09:00:00 0.0 1 2013-01-01 09:00:02 1.0 2 2013-01-01 09:00:03 2.0 3 2013-01-01 09:00:05 NaN 4 2013-01-01 09:00:06 4.0

[5 rows x 2 columns]

In [23]: dft.rolling("2s", on="foo").sum() Out[23]: foo B 0 2013-01-01 09:00:00 0.0 1 2013-01-01 09:00:02 1.0 2 2013-01-01 09:00:03 3.0 3 2013-01-01 09:00:05 NaN 4 2013-01-01 09:00:06 4.0

[5 rows x 2 columns]

Method read_csv has improved support for duplicate column names#

Duplicate column names are now supported in read_csv() whether they are in the file or passed in as the names parameter (GH 7160, GH 9424)

In [24]: data = "0,1,2\n3,4,5"

In [25]: names = ["a", "b", "a"]

Previous behavior:

In [2]: pd.read_csv(StringIO(data), names=names) Out[2]: a b a 0 2 1 2 1 5 4 5

The first a column contained the same data as the second a column, when it should have contained the values [0, 3].

New behavior:

In [26]: pd.read_csv(StringIO(data), names=names)

ValueError: Duplicate names are not allowed.

Method read_csv supports parsing Categorical directly#

The read_csv() function now supports parsing a Categorical column when specified as a dtype (GH 10153). Depending on the structure of the data, this can result in a faster parse time and lower memory usage compared to converting to Categorical after parsing. See the io docs here.

In [27]: data = """ ....: col1,col2,col3 ....: a,b,1 ....: a,b,2 ....: c,d,3 ....: """ ....:

In [28]: pd.read_csv(StringIO(data)) Out[28]: col1 col2 col3 0 a b 1 1 a b 2 2 c d 3

[3 rows x 3 columns]

In [29]: pd.read_csv(StringIO(data)).dtypes Out[29]: col1 object col2 object col3 int64 Length: 3, dtype: object

In [30]: pd.read_csv(StringIO(data), dtype="category").dtypes Out[30]: col1 category col2 category col3 category Length: 3, dtype: object

Individual columns can be parsed as a Categorical using a dict specification

In [31]: pd.read_csv(StringIO(data), dtype={"col1": "category"}).dtypes Out[31]: col1 category col2 object col3 int64 Length: 3, dtype: object

Note

The resulting categories will always be parsed as strings (object dtype). If the categories are numeric they can be converted using the to_numeric() function, or, as appropriate, another converter such as to_datetime().

In [32]: df = pd.read_csv(StringIO(data), dtype="category")

In [33]: df.dtypes Out[33]: col1 category col2 category col3 category Length: 3, dtype: object

In [34]: df["col3"] Out[34]: 0 1 1 2 2 3 Name: col3, Length: 3, dtype: category Categories (3, object): ['1', '2', '3']

In [35]: new_categories = pd.to_numeric(df["col3"].cat.categories)

In [36]: df["col3"] = df["col3"].cat.rename_categories(new_categories)

In [37]: df["col3"] Out[37]: 0 1 1 2 2 3 Name: col3, Length: 3, dtype: category Categories (3, int64): [1, 2, 3]

Categorical concatenation#
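The examples below refer to two Series, s1 and s2, whose construction was not preserved in this section. A minimal sketch consistent with the output shown (the exact values are an assumption): two category-dtype Series with overlapping but different categories.

import pandas as pd

# category dtype with different (overlapping) categories
s1 = pd.Series(["a", "b"], dtype="category")
s2 = pd.Series(["b", "c"], dtype="category")

Previously, concatenating these raised an error; concat now falls back to object dtype, as shown below.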

Previous behavior:

In [1]: pd.concat([s1, s2]) ValueError: incompatible categories in categorical concat

New behavior:

In [44]: pd.concat([s1, s2]) Out[44]: 0 a 1 b 0 b 1 c Length: 4, dtype: object

Semi-month offsets#

pandas has gained new frequency offsets, SemiMonthEnd (‘SM’) and SemiMonthBegin (‘SMS’). These provide date offsets anchored (by default) to the 15th and end of month, and to the 15th and 1st of month, respectively (GH 1543).

In [45]: from pandas.tseries.offsets import SemiMonthEnd, SemiMonthBegin

SemiMonthEnd:

In [46]: pd.Timestamp("2016-01-01") + SemiMonthEnd() Out[46]: Timestamp('2016-01-15 00:00:00')

In [47]: pd.date_range("2015-01-01", freq="SM", periods=4) Out[47]: DatetimeIndex(['2015-01-15', '2015-01-31', '2015-02-15', '2015-02-28'], dtype='datetime64[ns]', freq='SM-15')

SemiMonthBegin:

In [46]: pd.Timestamp("2016-01-01") + SemiMonthBegin() Out[46]: Timestamp('2016-01-15 00:00:00')

In [47]: pd.date_range("2015-01-01", freq="SMS", periods=4) Out[47]: DatetimeIndex(['2015-01-01', '2015-01-15', '2015-02-01', '2015-02-15'], dtype='datetime64[ns]', freq='SMS-15')

Using the anchoring suffix, you can also specify the day of month to use instead of the 15th.

In [50]: pd.date_range("2015-01-01", freq="SMS-16", periods=4) Out[50]: DatetimeIndex(['2015-01-01', '2015-01-16', '2015-02-01', '2015-02-16'], dtype='datetime64[ns]', freq='SMS-16')

In [51]: pd.date_range("2015-01-01", freq="SM-14", periods=4) Out[51]: DatetimeIndex(['2015-01-14', '2015-01-31', '2015-02-14', '2015-02-28'], dtype='datetime64[ns]', freq='SM-14')

New Index methods#

The following methods and options have been added to Index, to be more consistent with the Series and DataFrame API.

Index now supports the .where() function for same shape indexing (GH 13170)

In [48]: idx = pd.Index(["a", "b", "c"])

In [49]: idx.where([True, False, True]) Out[49]: Index(['a', None, 'c'], dtype='object')

Index now supports .dropna() to exclude missing values (GH 6194)

In [50]: idx = pd.Index([1, 2, np.nan, 4])

In [51]: idx.dropna() Out[51]: Index([1.0, 2.0, 4.0], dtype='float64')

For MultiIndex, values are dropped if any level is missing by default. Specifying how='all' only drops values where all levels are missing.

In [52]: midx = pd.MultiIndex.from_arrays([[1, 2, np.nan, 4], [1, 2, np.nan, np.nan]])

In [53]: midx Out[53]: MultiIndex([(1.0, 1.0), (2.0, 2.0), (nan, nan), (4.0, nan)], )

In [54]: midx.dropna() Out[54]: MultiIndex([(1, 1), (2, 2)], )

In [55]: midx.dropna(how="all") Out[55]: MultiIndex([(1, 1.0), (2, 2.0), (4, nan)], )

Index now supports .str.extractall() which returns a DataFrame, see the docs here (GH 10008, GH 13156)

In [56]: idx = pd.Index(["a1a2", "b1", "c1"])

In [57]: idx.str.extractall(r"[ab](?P<digit>\d)")
Out[57]:
        digit
  match
0 0         1
  1         2
1 0         1

[3 rows x 1 columns]

Index.astype() now accepts an optional boolean argument copy, which allows optional copying if the requirements on dtype are satisfied (GH 13209)
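A small, illustrative sketch of the new copy argument (not taken from the original document):

import pandas as pd

idx = pd.Index([1, 2, 3])

# copy=True always returns a new copy; with copy=False a cast that is already
# satisfied by the current dtype may return the data without copying.
copied = idx.astype("int64", copy=True)
maybe_same = idx.astype("int64", copy=False)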

Google BigQuery enhancements#

Fine-grained NumPy errstate#

Previous versions of pandas would permanently silence numpy’s ufunc error handling when pandas was imported. pandas did this in order to silence the warnings that would arise from using numpy ufuncs on missing data, which is usually represented as NaN. Unfortunately, this silenced legitimate warnings arising in non-pandas code in the application. Starting with 0.19.0, pandas will use the numpy.errstate context manager to silence these warnings in a more fine-grained manner, only around where these operations are actually used in the pandas code base (GH 13109, GH 13145).

After upgrading pandas, you may see new RuntimeWarnings being issued from your code. These are likely legitimate, and the underlying cause likely existed in the code when using previous versions of pandas that simply silenced the warning. Use numpy.errstate around the source of the RuntimeWarning to control how these conditions are handled.
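For example, a minimal sketch of handling such a warning source yourself with numpy.errstate (illustrative only):

import numpy as np

arr = np.array([1.0, np.nan, 3.0])

# Comparisons involving NaN may emit "invalid value" RuntimeWarnings on some
# NumPy versions; silence them only around this specific operation.
with np.errstate(invalid="ignore"):
    result = arr > 2.0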

Method get_dummies now returns integer dtypes#

The pd.get_dummies function now returns dummy-encoded columns as small integers, rather than floats (GH 8725). This should provide an improved memory footprint.

Previous behavior:

In [1]: pd.get_dummies(['a', 'b', 'a', 'c']).dtypes

Out[1]: a float64 b float64 c float64 dtype: object

New behavior:

In [58]: pd.get_dummies(["a", "b", "a", "c"]).dtypes Out[58]: a bool b bool c bool Length: 3, dtype: object

Downcast values to smallest possible dtype in to_numeric#

pd.to_numeric() now accepts a downcast parameter, which will downcast the data, if possible, to the smallest specified numerical dtype (GH 13352).

In [59]: s = ["1", 2, 3]

In [60]: pd.to_numeric(s, downcast="unsigned") Out[60]: array([1, 2, 3], dtype=uint8)

In [61]: pd.to_numeric(s, downcast="integer") Out[61]: array([1, 2, 3], dtype=int8)

pandas development API#

As part of making the pandas API more uniform and accessible in the future, we have created a standard sub-package of pandas, pandas.api, to hold public APIs. We are starting by exposing type introspection functions in pandas.api.types. More sub-packages and officially sanctioned APIs will be published in future versions of pandas (GH 13147, GH 13634).

The following are now part of this API:

In [62]: import pprint

In [63]: from pandas.api import types

In [64]: funcs = [f for f in dir(types) if not f.startswith("_")]

In [65]: pprint.pprint(funcs) ['CategoricalDtype', 'DatetimeTZDtype', 'IntervalDtype', 'PeriodDtype', 'infer_dtype', 'is_any_real_numeric_dtype', 'is_array_like', 'is_bool', 'is_bool_dtype', 'is_categorical_dtype', 'is_complex', 'is_complex_dtype', 'is_datetime64_any_dtype', 'is_datetime64_dtype', 'is_datetime64_ns_dtype', 'is_datetime64tz_dtype', 'is_dict_like', 'is_dtype_equal', 'is_extension_array_dtype', 'is_file_like', 'is_float', 'is_float_dtype', 'is_hashable', 'is_int64_dtype', 'is_integer', 'is_integer_dtype', 'is_interval', 'is_interval_dtype', 'is_iterator', 'is_list_like', 'is_named_tuple', 'is_number', 'is_numeric_dtype', 'is_object_dtype', 'is_period_dtype', 'is_re', 'is_re_compilable', 'is_scalar', 'is_signed_integer_dtype', 'is_sparse', 'is_string_dtype', 'is_timedelta64_dtype', 'is_timedelta64_ns_dtype', 'is_unsigned_integer_dtype', 'pandas_dtype', 'union_categoricals']

Note

Calling these functions from the internal module pandas.core.common will now show a DeprecationWarning (GH 13990)
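For example, a minimal sketch of using the public location for one of these helpers (the specific function is just an example):

import pandas as pd
from pandas.api.types import is_integer_dtype  # public API location

is_integer_dtype(pd.Series([1, 2, 3]))  # True
# Importing the same helpers from the internal pandas.core.common module now warns.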

Other enhancements#

The .resample() method now accepts an on= keyword to resample on a datetime-like column, and a level= keyword to resample on a datetime-like level of a MultiIndex. The frame below has a MultiIndex with levels named v and d, a datetime column date, and an integer column a:

                         date  a
v d
1 2015-01-04  2015-01-04  0
2 2015-01-11  2015-01-11  1
3 2015-01-18  2015-01-18  2
4 2015-01-25  2015-01-25  3
5 2015-02-01  2015-02-01  4

[5 rows x 2 columns]

In [74]: df.resample("M", on="date")[["a"]].sum()
Out[74]:
            a
date
2015-01-31  6
2015-02-28  4

[2 rows x 1 columns]

In [75]: df.resample("M", level="d")[["a"]].sum()
Out[75]:
            a
d
2015-01-31  6
2015-02-28  4

[2 rows x 1 columns]
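The construction of the frame used above was not preserved in this section; a sketch that builds a frame with the same layout (the construction itself is an assumption, only the displayed values are taken from the output above):

import numpy as np
import pandas as pd

dates = pd.date_range("2015-01-04", freq="W", periods=5)  # 2015-01-04 .. 2015-02-01

# datetime column "date", integer column "a", MultiIndex with levels "v" and "d"
df = pd.DataFrame(
    {"date": dates, "a": np.arange(5)},
    index=pd.MultiIndex.from_arrays([[1, 2, 3, 4, 5], dates], names=["v", "d"]),
)

df.resample("M", on="date")[["a"]].sum()   # resample on the datetime-like column
df.resample("M", level="d")[["a"]].sum()   # resample on the MultiIndex level "d"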

DataFrame has gained support for re-ordering its columns based on the values in a row, using df.sort_values(by=..., axis=1). The frame below has columns A, B and C:

      A  B  C
row1  2  3  4
row2  7  5  8

[2 rows x 3 columns]

In [72]: df.sort_values(by="row2", axis=1)
Out[72]:
      B  A  C
row1  3  2  4
row2  5  7  8

[2 rows x 3 columns]
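As above, the construction of this frame was not preserved; a sketch consistent with the values shown (the construction itself is an assumption):

import pandas as pd

df = pd.DataFrame({"A": [2, 7], "B": [3, 5], "C": [4, 8]}, index=["row1", "row2"])

# order the columns by the values in row "row2" (ascending: B=5, A=7, C=8)
df.sort_values(by="row2", axis=1)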

API changes#

Series.tolist() will now return Python types#

Series.tolist() will now return Python types in the output, mimicking NumPy .tolist() behavior (GH 10904)

In [73]: s = pd.Series([1, 2, 3])

Previous behavior:

In [7]: type(s.tolist()[0]) Out[7]: <class 'numpy.int64'>

New behavior:

In [74]: type(s.tolist()[0]) Out[74]: int

Series operators for different indexes#

The following Series operators have been changed to make all operators consistent, including with DataFrame (GH 1134, GH 4581, GH 13538).

Warning

Until 0.18.1, comparing Series with the same length would succeed even if their .index were different (the result ignored .index). As of 0.19.0, this raises a ValueError in order to be more strict. This section also describes how to keep the previous behavior or align on different indexes, using flexible comparison methods like .eq.

As a result, Series and DataFrame operators behave as below:

Arithmetic operators#

Arithmetic operators align both .index (no change).

In [75]: s1 = pd.Series([1, 2, 3], index=list("ABC"))

In [76]: s2 = pd.Series([2, 2, 2], index=list("ABD"))

In [77]: s1 + s2 Out[77]: A 3.0 B 4.0 C NaN D NaN Length: 4, dtype: float64

In [78]: df1 = pd.DataFrame([1, 2, 3], index=list("ABC"))

In [79]: df2 = pd.DataFrame([2, 2, 2], index=list("ABD"))

In [80]: df1 + df2 Out[80]: 0 A 3.0 B 4.0 C NaN D NaN

[4 rows x 1 columns]

Comparison operators#

Comparison operators raise ValueError when .index are different.

Previous behavior (Series):

Series compared values ignoring the .index as long as both had the same length:

In [1]: s1 == s2 Out[1]: A False B True C False dtype: bool

New behavior (Series):

In [2]: s1 == s2 Out[2]: ValueError: Can only compare identically-labeled Series objects

Note

To achieve the same result as previous versions (compare values based on locations ignoring .index), compare both .values.

In [81]: s1.values == s2.values Out[81]: array([False, True, False])

If you want to compare Series while aligning their .index, see the flexible comparison methods section below:

In [82]: s1.eq(s2) Out[82]: A False B True C False D False Length: 4, dtype: bool

Current behavior (DataFrame, no change):

In [3]: df1 == df2 Out[3]: ValueError: Can only compare identically-labeled DataFrame objects

Logical operators#

Logical operators align the .index of both the left and right hand sides.

Previous behavior (Series), only left hand side index was kept:

In [4]: s1 = pd.Series([True, False, True], index=list('ABC')) In [5]: s2 = pd.Series([True, True, True], index=list('ABD')) In [6]: s1 & s2 Out[6]: A True B False C False dtype: bool

New behavior (Series):

In [83]: s1 = pd.Series([True, False, True], index=list("ABC"))

In [84]: s2 = pd.Series([True, True, True], index=list("ABD"))

In [85]: s1 & s2 Out[85]: A True B False C False D False Length: 4, dtype: bool

Note

Series logical operators fill a NaN result with False.

Note

To achieve the same result as previous versions (compare values based on only left hand side index), you can use reindex_like:

In [86]: s1 & s2.reindex_like(s1) Out[86]: A True B False C False Length: 3, dtype: bool

Current behavior (DataFrame, no change):

In [87]: df1 = pd.DataFrame([True, False, True], index=list("ABC"))

In [88]: df2 = pd.DataFrame([True, True, True], index=list("ABD"))

In [89]: df1 & df2 Out[89]: 0 A True B False C False D False

[4 rows x 1 columns]

Flexible comparison methods#

Series flexible comparison methods like eq, ne, le, lt, ge and gt now align both indexes. Use these methods if you want to compare two Series which have different indexes.

In [90]: s1 = pd.Series([1, 2, 3], index=["a", "b", "c"])

In [91]: s2 = pd.Series([2, 2, 2], index=["b", "c", "d"])

In [92]: s1.eq(s2) Out[92]: a False b True c False d False Length: 4, dtype: bool

In [93]: s1.ge(s2) Out[93]: a False b True c True d False Length: 4, dtype: bool

Previously, this worked the same as comparison operators (see above).

Series type promotion on assignment#

A Series will now correctly promote its dtype when assigned values that are incompatible with its current dtype (GH 13234).
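The before/after assignments below assume a starting Series whose construction was not preserved here; a minimal sketch (an empty float64 Series, mirroring the Series() default of that era, is an assumption):

import pandas as pd

# assumed starting point for the assignments shown below
s = pd.Series(dtype="float64")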

Previous behavior:

In [2]: s["a"] = pd.Timestamp("2016-01-01")

In [3]: s["b"] = 3.0 TypeError: invalid type promotion

New behavior:

In [95]: s["a"] = pd.Timestamp("2016-01-01")

In [96]: s["b"] = 3.0

In [97]: s Out[97]: a 2016-01-01 00:00:00 b 3.0 Length: 2, dtype: object

In [98]: s.dtype Out[98]: dtype('O')

Function .to_datetime() changes#

Previously, if .to_datetime() encountered mixed integers/floats and strings, but no datetimes, with errors='coerce' it would convert everything to NaT.

Previous behavior:

In [2]: pd.to_datetime([1, 'foo'], errors='coerce') Out[2]: DatetimeIndex(['NaT', 'NaT'], dtype='datetime64[ns]', freq=None)

Current behavior:

This will now convert integers/floats with the default unit of ns.

In [99]: pd.to_datetime([1, "foo"], errors="coerce") Out[99]: DatetimeIndex(['1970-01-01 00:00:00.000000001', 'NaT'], dtype='datetime64[ns]', freq=None)

Bug fixes related to .to_datetime():

Merging changes#

Merging will now preserve the dtype of the join keys (GH 8596)

In [100]: df1 = pd.DataFrame({"key": [1], "v1": [10]})

In [101]: df1 Out[101]: key v1 0 1 10

[1 rows x 2 columns]

In [102]: df2 = pd.DataFrame({"key": [1, 2], "v1": [20, 30]})

In [103]: df2 Out[103]: key v1 0 1 20 1 2 30

[2 rows x 2 columns]

Previous behavior:

In [5]: pd.merge(df1, df2, how='outer') Out[5]: key v1 0 1.0 10.0 1 1.0 20.0 2 2.0 30.0

In [6]: pd.merge(df1, df2, how='outer').dtypes Out[6]: key float64 v1 float64 dtype: object

New behavior:

We are able to preserve the join keys

In [104]: pd.merge(df1, df2, how="outer") Out[104]: key v1 0 1 10 1 1 20 2 2 30

[3 rows x 2 columns]

In [105]: pd.merge(df1, df2, how="outer").dtypes Out[105]: key int64 v1 int64 Length: 2, dtype: object

Of course, if you have missing values that are introduced, then the resulting dtype will be upcast, which is unchanged from previous versions.

In [106]: pd.merge(df1, df2, how="outer", on="key") Out[106]: key v1_x v1_y 0 1 10.0 20 1 2 NaN 30

[2 rows x 3 columns]

In [107]: pd.merge(df1, df2, how="outer", on="key").dtypes Out[107]: key int64 v1_x float64 v1_y int64 Length: 3, dtype: object

Method .describe() changes#

Percentile identifiers in the index of a .describe() output will now be rounded to the least precision that keeps them distinct (GH 13104)

In [108]: s = pd.Series([0, 1, 2, 3, 4])

In [109]: df = pd.DataFrame([0, 1, 2, 3, 4])

Previous behavior:

The percentiles were rounded to at most one decimal place, which could raise a ValueError for a DataFrame if the rounded percentiles were duplicated.

In [3]: s.describe(percentiles=[0.0001, 0.0005, 0.001, 0.999, 0.9995, 0.9999]) Out[3]: count 5.000000 mean 2.000000 std 1.581139 min 0.000000 0.0% 0.000400 0.1% 0.002000 0.1% 0.004000 50% 2.000000 99.9% 3.996000 100.0% 3.998000 100.0% 3.999600 max 4.000000 dtype: float64

In [4]: df.describe(percentiles=[0.0001, 0.0005, 0.001, 0.999, 0.9995, 0.9999]) Out[4]: ... ValueError: cannot reindex from a duplicate axis

New behavior:

In [110]: s.describe(percentiles=[0.0001, 0.0005, 0.001, 0.999, 0.9995, 0.9999]) Out[110]: count 5.000000 mean 2.000000 std 1.581139 min 0.000000 0.01% 0.000400 0.05% 0.002000 0.1% 0.004000 50% 2.000000 99.9% 3.996000 99.95% 3.998000 99.99% 3.999600 max 4.000000 Length: 12, dtype: float64

In [111]: df.describe(percentiles=[0.0001, 0.0005, 0.001, 0.999, 0.9995, 0.9999]) Out[111]: 0 count 5.000000 mean 2.000000 std 1.581139 min 0.000000 0.01% 0.000400 0.05% 0.002000 0.1% 0.004000 50% 2.000000 99.9% 3.996000 99.95% 3.998000 99.99% 3.999600 max 4.000000

[12 rows x 1 columns]

Furthermore:

Period changes#

The PeriodIndex now has period dtype#

PeriodIndex now has its own period dtype. The period dtype is a pandas extension dtype like category or the timezone aware dtype (datetime64[ns, tz]) (GH 13941). As a consequence of this change, PeriodIndex no longer has an integer dtype:

Previous behavior:

In [1]: pi = pd.PeriodIndex(['2016-08-01'], freq='D')

In [2]: pi Out[2]: PeriodIndex(['2016-08-01'], dtype='int64', freq='D')

In [3]: pd.api.types.is_integer_dtype(pi) Out[3]: True

In [4]: pi.dtype Out[4]: dtype('int64')

New behavior:

In [112]: pi = pd.PeriodIndex(["2016-08-01"], freq="D")

In [113]: pi Out[113]: PeriodIndex(['2016-08-01'], dtype='period[D]')

In [114]: pd.api.types.is_integer_dtype(pi) Out[114]: False

In [115]: pd.api.types.is_period_dtype(pi) Out[115]: True

In [116]: pi.dtype Out[116]: period[D]

In [117]: type(pi.dtype) Out[117]: pandas.core.dtypes.dtypes.PeriodDtype

Period('NaT') now returns pd.NaT#

Previously, Period had its own Period('NaT') representation, distinct from pd.NaT. Period('NaT') has now been changed to return pd.NaT (GH 12759, GH 13582).

Previous behavior:

In [5]: pd.Period('NaT', freq='D') Out[5]: Period('NaT', 'D')

New behavior:

These now result in pd.NaT without needing to provide the freq option.

In [118]: pd.Period("NaT") Out[118]: NaT

In [119]: pd.Period(None) Out[119]: NaT

To be compatible with Period addition and subtraction, pd.NaT now supports addition and subtraction with an int. Previously it raised a ValueError.

Previous behavior:

In [5]: pd.NaT + 1 ... ValueError: Cannot add integral value to Timestamp without freq.

New behavior:

In [120]: pd.NaT + 1 Out[120]: NaT

In [121]: pd.NaT - 1 Out[121]: NaT

PeriodIndex.values now returns array of Period objects#

.values has been changed to return an array of Period objects, rather than an array of integers (GH 13988).

Previous behavior:

In [6]: pi = pd.PeriodIndex(['2011-01', '2011-02'], freq='M') In [7]: pi.values Out[7]: array([492, 493])

New behavior:

In [122]: pi = pd.PeriodIndex(["2011-01", "2011-02"], freq="M")

In [123]: pi.values Out[123]: array([Period('2011-01', 'M'), Period('2011-02', 'M')], dtype=object)

Index + / - no longer used for set operations#

Addition and subtraction of the base Index type and of DatetimeIndex (not the numeric index types) previously performed set operations (set union and difference). This behavior was already deprecated since 0.15.0 (in favor of using the specific .union() and .difference() methods), and is now disabled. When possible, + and - are now used for element-wise operations, for example for concatenating strings or subtracting datetimes (GH 8227, GH 14127).

Previous behavior:

In [1]: pd.Index(['a', 'b']) + pd.Index(['a', 'c']) FutureWarning: using '+' to provide set union with Indexes is deprecated, use '|' or .union() Out[1]: Index(['a', 'b', 'c'], dtype='object')

New behavior: the same operation will now perform element-wise addition:

In [124]: pd.Index(["a", "b"]) + pd.Index(["a", "c"]) Out[124]: Index(['aa', 'bc'], dtype='object')

Note that numeric Index objects already performed element-wise operations. For example, the behavior of adding two integer Indexes is unchanged. The base Index is now made consistent with this behavior.

In [125]: pd.Index([1, 2, 3]) + pd.Index([2, 3, 4]) Out[125]: Index([3, 5, 7], dtype='int64')

Further, because of this change, it is now possible to subtract two DatetimeIndex objects resulting in a TimedeltaIndex:

Previous behavior:

In [1]: (pd.DatetimeIndex(['2016-01-01', '2016-01-02']) ...: - pd.DatetimeIndex(['2016-01-02', '2016-01-03'])) FutureWarning: using '-' to provide set differences with datetimelike Indexes is deprecated, use .difference() Out[1]: DatetimeIndex(['2016-01-01'], dtype='datetime64[ns]', freq=None)

New behavior:

In [126]: ( .....: pd.DatetimeIndex(["2016-01-01", "2016-01-02"]) .....: - pd.DatetimeIndex(["2016-01-02", "2016-01-03"]) .....: ) .....: Out[126]: TimedeltaIndex(['-1 days', '-1 days'], dtype='timedelta64[ns]', freq=None)

Index.difference and .symmetric_difference changes#

Index.difference and Index.symmetric_difference will now, more consistently, treat NaN values like any other value (GH 13514).

In [127]: idx1 = pd.Index([1, 2, 3, np.nan])

In [128]: idx2 = pd.Index([0, 1, np.nan])

Previous behavior:

In [3]: idx1.difference(idx2) Out[3]: Float64Index([nan, 2.0, 3.0], dtype='float64')

In [4]: idx1.symmetric_difference(idx2) Out[4]: Float64Index([0.0, nan, 2.0, 3.0], dtype='float64')

New behavior:

In [129]: idx1.difference(idx2) Out[129]: Index([2.0, 3.0], dtype='float64')

In [130]: idx1.symmetric_difference(idx2) Out[130]: Index([0.0, 2.0, 3.0], dtype='float64')

Index.unique consistently returns Index#

Index.unique() now returns unique values as an Index of the appropriate dtype (GH 13395). Previously, most Index classes returned np.ndarray, while DatetimeIndex, TimedeltaIndex and PeriodIndex returned Index to keep metadata like the timezone.

Previous behavior:

In [1]: pd.Index([1, 2, 3]).unique() Out[1]: array([1, 2, 3])

In [2]: pd.DatetimeIndex(['2011-01-01', '2011-01-02', ...: '2011-01-03'], tz='Asia/Tokyo').unique() Out[2]: DatetimeIndex(['2011-01-01 00:00:00+09:00', '2011-01-02 00:00:00+09:00', '2011-01-03 00:00:00+09:00'], dtype='datetime64[ns, Asia/Tokyo]', freq=None)

New behavior:

In [131]: pd.Index([1, 2, 3]).unique() Out[131]: Index([1, 2, 3], dtype='int64')

In [132]: pd.DatetimeIndex( .....: ["2011-01-01", "2011-01-02", "2011-01-03"], tz="Asia/Tokyo" .....: ).unique() .....: Out[132]: DatetimeIndex(['2011-01-01 00:00:00+09:00', '2011-01-02 00:00:00+09:00', '2011-01-03 00:00:00+09:00'], dtype='datetime64[ns, Asia/Tokyo]', freq=None)

MultiIndex constructors, groupby and set_index preserve categorical dtypes#

MultiIndex.from_arrays and MultiIndex.from_product will now preserve categorical dtype in MultiIndex levels (GH 13743, GH 13854).

In [133]: cat = pd.Categorical(["a", "b"], categories=list("bac"))

In [134]: lvl1 = ["foo", "bar"]

In [135]: midx = pd.MultiIndex.from_arrays([cat, lvl1])

In [136]: midx Out[136]: MultiIndex([('a', 'foo'), ('b', 'bar')], )

Previous behavior:

In [4]: midx.levels[0] Out[4]: Index(['b', 'a', 'c'], dtype='object')

In [5]: midx.get_level_values(0) Out[5]: Index(['a', 'b'], dtype='object')

New behavior: the single level is now a CategoricalIndex:

In [137]: midx.levels[0] Out[137]: CategoricalIndex(['b', 'a', 'c'], categories=['b', 'a', 'c'], ordered=False, dtype='category')

In [138]: midx.get_level_values(0) Out[138]: CategoricalIndex(['a', 'b'], categories=['b', 'a', 'c'], ordered=False, dtype='category')

An analogous change has been made to MultiIndex.from_product. As a consequence, groupby and set_index also preserve categorical dtypes in indexes.
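A short, illustrative sketch of the analogous MultiIndex.from_product behavior (not taken from the original document):

import pandas as pd

cat = pd.Categorical(["a", "b"], categories=list("bac"))
midx = pd.MultiIndex.from_product([cat, ["foo", "bar"]])

midx.levels[0]            # CategoricalIndex with categories ['b', 'a', 'c']
midx.get_level_values(0)  # CategoricalIndex(['a', 'a', 'b', 'b'], ...)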

In [139]: df = pd.DataFrame({"A": [0, 1], "B": [10, 11], "C": cat})

In [140]: df_grouped = df.groupby(by=["A", "C"], observed=False).first()

In [141]: df_set_idx = df.set_index(["A", "C"])

Previous behavior:

In [11]: df_grouped.index.levels[1] Out[11]: Index(['b', 'a', 'c'], dtype='object', name='C') In [12]: df_grouped.reset_index().dtypes Out[12]: A int64 C object B float64 dtype: object

In [13]: df_set_idx.index.levels[1] Out[13]: Index(['b', 'a', 'c'], dtype='object', name='C') In [14]: df_set_idx.reset_index().dtypes Out[14]: A int64 C object B int64 dtype: object

New behavior:

In [142]: df_grouped.index.levels[1] Out[142]: CategoricalIndex(['b', 'a', 'c'], categories=['b', 'a', 'c'], ordered=False, dtype='category', name='C')

In [143]: df_grouped.reset_index().dtypes Out[143]: A int64 C category B float64 Length: 3, dtype: object

In [144]: df_set_idx.index.levels[1] Out[144]: CategoricalIndex(['b', 'a', 'c'], categories=['b', 'a', 'c'], ordered=False, dtype='category', name='C')

In [145]: df_set_idx.reset_index().dtypes Out[145]: A int64 C category B int64 Length: 3, dtype: object

Function read_csv will progressively enumerate chunks#

When read_csv() is called with chunksize=n and without specifying an index, each chunk used to have an independently generated index from 0 to n-1. They are now given a progressive index instead, starting at 0 for the first chunk, at n for the second, and so on, so that, when concatenated, they are identical to the result of calling read_csv() without the chunksize= argument (GH 12185).

In [146]: data = "A,B\n0,1\n2,3\n4,5\n6,7"

Previous behavior:

In [2]: pd.concat(pd.read_csv(StringIO(data), chunksize=2)) Out[2]: A B 0 0 1 1 2 3 0 4 5 1 6 7

New behavior:

In [147]: pd.concat(pd.read_csv(StringIO(data), chunksize=2)) Out[147]: A B 0 0 1 1 2 3 2 4 5 3 6 7

[4 rows x 2 columns]

Sparse changes#

These changes allow pandas to handle sparse data with more dtypes, and work toward a smoother experience with data handling.

Types int64 and bool support enhancements#

Sparse data structures have gained enhanced support for int64 and bool dtypes (GH 667, GH 13849).

Previously, sparse data were float64 dtype by default, even if all inputs were of int or bool dtype. You had to specify dtype explicitly to create sparse data with int64 dtype. Also, fill_value had to be specified explicitly because the default was np.nan which doesn’t appear in int64 or bool data.

In [1]: pd.SparseArray([1, 2, 0, 0]) Out[1]: [1.0, 2.0, 0.0, 0.0] Fill: nan IntIndex Indices: array([0, 1, 2, 3], dtype=int32)

# specifying int64 dtype, but all values are stored in sp_values because
# fill_value default is np.nan

In [2]: pd.SparseArray([1, 2, 0, 0], dtype=np.int64) Out[2]: [1, 2, 0, 0] Fill: nan IntIndex Indices: array([0, 1, 2, 3], dtype=int32)

In [3]: pd.SparseArray([1, 2, 0, 0], dtype=np.int64, fill_value=0) Out[3]: [1, 2, 0, 0] Fill: 0 IntIndex Indices: array([0, 1], dtype=int32)

As of v0.19.0, sparse data keeps the input dtype, and uses more appropriate fill_value defaults (0 for int64 dtype, False for bool dtype).

In [148]: pd.arrays.SparseArray([1, 2, 0, 0], dtype=np.int64) Out[148]: [1, 2, 0, 0] Fill: 0 IntIndex Indices: array([0, 1], dtype=int32)

In [149]: pd.arrays.SparseArray([True, False, False, False]) Out[149]: [True, False, False, False] Fill: False IntIndex Indices: array([0], dtype=int32)

See the docs for more details.

Operators now preserve dtypes#

Sparse data structures can now preserve their dtype after arithmetic operations:

s = pd.SparseSeries([0, 2, 0, 1], fill_value=0, dtype=np.int64)
s.dtype

s + 1

Sparse data structures now also support astype to convert the internal dtype:

s = pd.SparseSeries([1.0, 0.0, 2.0, 0.0], fill_value=0)
s
s.astype(np.int64)

astype fails if the data contains values which cannot be converted to the specified dtype. Note that this limitation also applies to fill_value, whose default is np.nan.

In [7]: pd.SparseSeries([1., np.nan, 2., np.nan], fill_value=np.nan).astype(np.int64) Out[7]: ValueError: unable to coerce current fill_value nan to int64 dtype

Other sparse fixes#

Indexer dtype changes#

Note

This change only affects 64-bit Python running on Windows, and only affects relatively advanced indexing operations.

Methods such as Index.get_indexer that return an indexer array coerce that array to a “platform int”, so that it can be directly used in 3rd-party library operations like numpy.take. Previously, a platform int was defined as np.int_, which corresponds to a C integer, but the correct type, and what is being used now, is np.intp, which corresponds to the C integer size that can hold a pointer (GH 3033, GH 13972).

These types are the same on many platforms, but for 64-bit Python on Windows, np.int_ is 32 bits, and np.intp is 64 bits. Changing this behavior improves performance for many operations on that platform.

Previous behavior:

In [1]: i = pd.Index(['a', 'b', 'c'])

In [2]: i.get_indexer(['b', 'b', 'c']).dtype Out[2]: dtype('int32')

New behavior:

In [1]: i = pd.Index(['a', 'b', 'c'])

In [2]: i.get_indexer(['b', 'b', 'c']).dtype Out[2]: dtype('int64')

Other API changes#

Deprecations#

Removal of prior version deprecations/changes#

Performance improvements#

Bug fixes#

Contributors#

A total of 117 people contributed patches to this release. People with a “+” by their names contributed a patch for the first time.