Version 0.17.0 (October 9, 2015)

This is a major release from 0.16.2 and includes a small number of API changes, several new features, enhancements, and performance improvements along with a large number of bug fixes. We recommend that all users upgrade to this version.

Warning

pandas >= 0.17.0 will no longer support compatibility with Python version 3.2 (GH 9118)

Warning

The pandas.io.data package is deprecated and will be replaced by the pandas-datareader package. This will allow the data modules to be updated independently of your pandas installation. The API for pandas-datareader v0.1.1 is exactly the same as in pandas v0.17.0 (GH 8961, GH 10861).

After installing pandas-datareader, you can easily change your imports:

from pandas.io import data, wb

becomes

from pandas_datareader import data, wb

Highlights include:

- Support for datetime64[ns] with timezones as a first-class dtype, see Datetime with TZ
- Releasing the GIL on certain cython operations, see Releasing the GIL
- Plotting methods are now available as attributes of the .plot accessor, see Plot submethods
- New Series.dt.strftime and Series.dt.total_seconds methods, see Additional methods for dt accessor
- Period and PeriodIndex can take a multiplied freq, see Period frequency enhancement
- Support for reading SAS XPORT files, see Support for SAS XPORT files
- Support for math functions in .eval(), see Support for math functions in .eval()
- Reading and writing Excel files with a MultiIndex, see Changes to Excel with MultiIndex
- Optional display alignment with Unicode East Asian width, see Display alignment with Unicode East Asian width
- A revamped sorting API, see Changes to sorting API

Check the API Changes and deprecations before updating.


New features#

Datetime with TZ#

We are adding an implementation that natively supports datetime with timezones. A Series or a DataFrame column previously could be assigned datetimes with timezones, which would be stored as an object dtype. This had performance issues with a large number of rows. See the docs for more details. (GH 8260, GH 10763, GH 11034).

The new implementation allows for having a single timezone across all rows, with operations performed in a performant manner.

In [1]: df = pd.DataFrame(
   ...:     {
   ...:         "A": pd.date_range("20130101", periods=3),
   ...:         "B": pd.date_range("20130101", periods=3, tz="US/Eastern"),
   ...:         "C": pd.date_range("20130101", periods=3, tz="CET"),
   ...:     }
   ...: )

In [2]: df
Out[2]:
           A                         B                         C
0 2013-01-01 2013-01-01 00:00:00-05:00 2013-01-01 00:00:00+01:00
1 2013-01-02 2013-01-02 00:00:00-05:00 2013-01-02 00:00:00+01:00
2 2013-01-03 2013-01-03 00:00:00-05:00 2013-01-03 00:00:00+01:00

[3 rows x 3 columns]

In [3]: df.dtypes
Out[3]:
A                datetime64[ns]
B    datetime64[ns, US/Eastern]
C           datetime64[ns, CET]
Length: 3, dtype: object

In [4]: df.B
Out[4]:
0   2013-01-01 00:00:00-05:00
1   2013-01-02 00:00:00-05:00
2   2013-01-03 00:00:00-05:00
Name: B, Length: 3, dtype: datetime64[ns, US/Eastern]

In [5]: df.B.dt.tz_localize(None)
Out[5]:
0   2013-01-01
1   2013-01-02
2   2013-01-03
Name: B, Length: 3, dtype: datetime64[ns]

This uses a new dtype representation as well, which is very similar in look-and-feel to its numpy cousin datetime64[ns]:

In [6]: df["B"].dtype Out[6]: datetime64[ns, US/Eastern]

In [7]: type(df["B"].dtype) Out[7]: pandas.core.dtypes.dtypes.DatetimeTZDtype
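Because the timezone is now part of the dtype, a whole column can be converted between zones efficiently. A minimal sketch (reusing a column like B above; the printed dtype is illustrative):

import pandas as pd

df = pd.DataFrame({"B": pd.date_range("20130101", periods=3, tz="US/Eastern")})

# Convert the tz-aware column to another timezone; the dtype tracks the change
converted = df["B"].dt.tz_convert("UTC")
print(converted.dtype)  # datetime64[ns, UTC]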

Note

There is a slightly different string repr for the underlying DatetimeIndex as a result of the dtype changes, but functionally these are the same.

Previous behavior:

In [1]: pd.date_range('20130101', periods=3, tz='US/Eastern')
Out[1]: DatetimeIndex(['2013-01-01 00:00:00-05:00', '2013-01-02 00:00:00-05:00',
                       '2013-01-03 00:00:00-05:00'],
                      dtype='datetime64[ns]', freq='D', tz='US/Eastern')

In [2]: pd.date_range('20130101', periods=3, tz='US/Eastern').dtype
Out[2]: dtype('<M8[ns]')

New behavior:

In [8]: pd.date_range("20130101", periods=3, tz="US/Eastern")
Out[8]: DatetimeIndex(['2013-01-01 00:00:00-05:00', '2013-01-02 00:00:00-05:00',
                       '2013-01-03 00:00:00-05:00'],
                      dtype='datetime64[ns, US/Eastern]', freq='D')

In [9]: pd.date_range("20130101", periods=3, tz="US/Eastern").dtype
Out[9]: datetime64[ns, US/Eastern]

Releasing the GIL#

We are releasing the global interpreter lock (GIL) on some cython operations. This allows other threads to run simultaneously during computation, potentially enabling performance improvements from multi-threading. Notably groupby, nsmallest, value_counts and some indexing operations benefit from this. (GH 8882)

For example, the groupby expression in the following code will have the GIL released during the factorization step, e.g. df.groupby('key'), as well as during the .sum() operation.

N = 1000000
ngroups = 10
df = DataFrame(
    {"key": np.random.randint(0, ngroups, size=N), "data": np.random.randn(N)}
)
df.groupby("key")["data"].sum()

Releasing the GIL can benefit an application that uses threads for user interactions (e.g. Qt), or that performs multi-threaded computations. A nice example of a library that can handle these types of computations in parallel is the dask library.
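As an illustration, here is a minimal sketch of issuing two groupby aggregations from separate threads. With the GIL released during factorization and summation, the two calls can make progress simultaneously (actual speedups depend on data size and build options):

from concurrent.futures import ThreadPoolExecutor

import numpy as np
import pandas as pd

N = 1000000
ngroups = 10
df = pd.DataFrame(
    {"key": np.random.randint(0, ngroups, size=N), "data": np.random.randn(N)}
)

def group_sum(frame):
    # groupby can release the GIL during factorization and the sum itself,
    # so two of these calls may overlap on multiple cores
    return frame.groupby("key")["data"].sum()

with ThreadPoolExecutor(max_workers=2) as pool:
    results = list(pool.map(group_sum, [df, df]))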

Plot submethods#

The Series and DataFrame .plot() method allows for customizing plot types by supplying the kind keyword argument. Unfortunately, many of these kinds of plots use different required and optional keyword arguments, which makes it difficult to discover what any given plot kind uses out of the dozens of possible arguments.

To alleviate this issue, we have added a new, optional plotting interface, which exposes each kind of plot as a method of the .plot attribute. Instead of writing series.plot(kind=<kind>, ...), you can now also use series.plot.<kind>(...):

In [10]: df = pd.DataFrame(np.random.rand(10, 2), columns=['a', 'b'])

In [11]: df.plot.bar()

../_images/whatsnew_plot_submethods.png

As a result of this change, these methods are now all discoverable via tab-completion:

In [12]: df.plot.<TAB>  # noqa: E225, E999
df.plot.area     df.plot.barh     df.plot.density  df.plot.hist     df.plot.line     df.plot.scatter
df.plot.bar      df.plot.box      df.plot.hexbin   df.plot.kde      df.plot.pie

Each method signature only includes relevant arguments. Currently, these are limited to required arguments, but in the future these will include optional arguments, as well. For an overview, see the new Plotting API documentation.
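For instance, df.plot.scatter surfaces its required x and y arguments directly in the method signature. A quick sketch, assuming matplotlib is available:

import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.rand(10, 2), columns=["a", "b"])

# x and y are required for a scatter plot and now appear in the signature
ax = df.plot.scatter(x="a", y="b")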

Additional methods for dt accessor#

Series.dt.strftime#

We are now supporting a Series.dt.strftime method for datetime-likes to generate a formatted string (GH 10110). Examples:

DatetimeIndex

In [13]: s = pd.Series(pd.date_range("20130101", periods=4))

In [14]: s
Out[14]:
0   2013-01-01
1   2013-01-02
2   2013-01-03
3   2013-01-04
Length: 4, dtype: datetime64[ns]

In [15]: s.dt.strftime("%Y/%m/%d")
Out[15]:
0    2013/01/01
1    2013/01/02
2    2013/01/03
3    2013/01/04
Length: 4, dtype: object

PeriodIndex

In [16]: s = pd.Series(pd.period_range("20130101", periods=4))

In [17]: s
Out[17]:
0    2013-01-01
1    2013-01-02
2    2013-01-03
3    2013-01-04
Length: 4, dtype: period[D]

In [18]: s.dt.strftime("%Y/%m/%d")
Out[18]:
0    2013/01/01
1    2013/01/02
2    2013/01/03
3    2013/01/04
Length: 4, dtype: object

The string format follows the Python standard library; details can be found in the Python documentation.

Series.dt.total_seconds#

pd.Series of type timedelta64 has a new method .dt.total_seconds() returning the duration of the timedelta in seconds (GH 10817)

TimedeltaIndex

In [19]: s = pd.Series(pd.timedelta_range("1 minutes", periods=4))

In [20]: s
Out[20]:
0   0 days 00:01:00
1   1 days 00:01:00
2   2 days 00:01:00
3   3 days 00:01:00
Length: 4, dtype: timedelta64[ns]

In [21]: s.dt.total_seconds()
Out[21]:
0        60.0
1     86460.0
2    172860.0
3    259260.0
Length: 4, dtype: float64
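As a sanity check, the result agrees with dividing the timedeltas by a one-second Timedelta. A quick sketch:

import pandas as pd

s = pd.Series(pd.timedelta_range("1 minutes", periods=4))

# Division by a 1-second Timedelta yields the same float values
assert (s.dt.total_seconds() == s / pd.Timedelta(seconds=1)).all()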

Period frequency enhancement#

Period, PeriodIndex and period_range can now accept a multiplied freq. Also, Period.freq and PeriodIndex.freq are now stored as DateOffset instances like DatetimeIndex, and not as str (GH 7811)

A multiplied freq represents a span of corresponding length. The example below creates a period of 3 days. Addition and subtraction will shift the period by its span.

In [22]: p = pd.Period("2015-08-01", freq="3D")

In [23]: p
Out[23]: Period('2015-08-01', '3D')

In [24]: p + 1
Out[24]: Period('2015-08-04', '3D')

In [25]: p - 2
Out[25]: Period('2015-07-26', '3D')

In [26]: p.to_timestamp()
Out[26]: Timestamp('2015-08-01 00:00:00')

In [27]: p.to_timestamp(how="E")
Out[27]: Timestamp('2015-08-03 23:59:59.999999999')

You can use the multiplied freq in PeriodIndex and period_range.

In [28]: idx = pd.period_range("2015-08-01", periods=4, freq="2D")

In [29]: idx
Out[29]: PeriodIndex(['2015-08-01', '2015-08-03', '2015-08-05', '2015-08-07'], dtype='period[2D]')

In [30]: idx + 1
Out[30]: PeriodIndex(['2015-08-03', '2015-08-05', '2015-08-07', '2015-08-09'], dtype='period[2D]')
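As noted above, the freq attribute is now a DateOffset instance rather than a string. A quick sketch (the printed repr may vary slightly between versions):

import pandas as pd

p = pd.Period("2015-08-01", freq="3D")

print(p.freq)        # <3 * Days>
print(type(p.freq))  # an offset class (e.g. pandas.tseries.offsets.Day), not str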

Support for SAS XPORT files#

read_sas() provides support for reading SAS XPORT format files. (GH 4052).

df = pd.read_sas("sas_xport.xpt")

It is also possible to obtain an iterator and read an XPORT file incrementally.

for df in pd.read_sas("sas_xport.xpt", chunksize=10000):
    do_something(df)

See the docs for more details.

Support for math functions in .eval()#

eval() now supports calling math functions (GH 4893)

df = pd.DataFrame({"a": np.random.randn(10)}) df.eval("b = sin(a)")

The supported math functions are sin, cos, exp, log, expm1, log1p, sqrt, sinh, cosh, tanh, arcsin, arccos, arctan, arccosh, arcsinh, arctanh, abs and arctan2.

These functions map to the intrinsics for the NumExpr engine. For the Python engine, they are mapped to NumPy calls.
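A minimal sketch comparing the two engines on the same expression (the default numexpr engine requires the numexpr package; results should agree up to floating-point rounding):

import numpy as np
import pandas as pd

df = pd.DataFrame({"a": np.random.randn(10)})

out_numexpr = df.eval("b = sin(a)", engine="numexpr")  # NumExpr intrinsic
out_python = df.eval("b = sin(a)", engine="python")    # mapped to np.sin

pd.testing.assert_frame_equal(out_numexpr, out_python)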

Changes to Excel with MultiIndex#

In version 0.16.2 a DataFrame with MultiIndex columns could not be written to Excel via to_excel. That functionality has been added (GH 10564), along with updating read_excel so that the data can be read back with no loss of information, by specifying which columns/rows make up the MultiIndex in the header and index_col parameters (GH 4679)

See the documentation for more details.

In [31]: df = pd.DataFrame(
   ....:     [[1, 2, 3, 4], [5, 6, 7, 8]],
   ....:     columns=pd.MultiIndex.from_product(
   ....:         [["foo", "bar"], ["a", "b"]], names=["col1", "col2"]
   ....:     ),
   ....:     index=pd.MultiIndex.from_product([["j"], ["l", "k"]], names=["i1", "i2"]),
   ....: )

In [32]: df
Out[32]:
col1  foo    bar
col2    a  b   a  b
i1 i2
j  l    1  2   3  4
   k    5  6   7  8

[2 rows x 4 columns]

In [33]: df.to_excel("test.xlsx")

In [34]: df = pd.read_excel("test.xlsx", header=[0, 1], index_col=[0, 1])

In [35]: df
Out[35]:
col1  foo    bar
col2    a  b   a  b
i1 i2
j  l    1  2   3  4
   k    5  6   7  8

[2 rows x 4 columns]

Previously, it was necessary to specify the has_index_names argument in read_excel if the serialized data had index names. For version 0.17.0 the output format of to_excel has been changed to make this keyword unnecessary - the change is shown below.

Old

../_images/old-excel-index.png

New

../_images/new-excel-index.png

Warning

Excel files saved in version 0.16.2 or prior that had index names will still be able to be read in, but the has_index_names argument must be specified as True.
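For such older files, the read call would look something like the sketch below (the file name is hypothetical; has_index_names applies only to files written by pandas 0.16.2 or earlier):

import pandas as pd

# Reading a MultiIndex Excel file written, with index names, by pandas <= 0.16.2
df = pd.read_excel(
    "old_format.xlsx",     # hypothetical file from an older pandas
    header=[0, 1],
    index_col=[0, 1],
    has_index_names=True,  # required for the old serialization format
)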

Google BigQuery enhancements#

Display alignment with Unicode East Asian width#

Warning

Enabling this option will affect the performance for printing of DataFrame and Series (about 2 times slower). Use only when it is actually required.

Some East Asian countries use Unicode characters whose width corresponds to two Latin characters. If a DataFrame or Series contains these characters, the default output cannot be aligned properly. The following options are added to enable precise handling of these characters.

In [36]: df = pd.DataFrame({u"国籍": ["UK", u"日本"], u"名前": ["Alice", u"しのぶ"]})

In [37]: df
Out[37]:
   国籍     名前
0  UK   Alice
1  日本  しのぶ

[2 rows x 2 columns]

In [38]: pd.set_option("display.unicode.east_asian_width", True)

In [39]: df
Out[39]:
   国籍    名前
0    UK   Alice
1  日本  しのぶ

[2 rows x 2 columns]

For further details, see the options documentation.

Other enhancements#

merge now accepts an indicator argument, which adds a column (named _merge by default) to the output indicating the source of each row:

In [40]: df1 = pd.DataFrame({"col1": [0, 1], "col_left": ["a", "b"]})

In [41]: df2 = pd.DataFrame({"col1": [1, 2, 2], "col_right": [2, 2, 2]})

In [42]: pd.merge(df1, df2, on="col1", how="outer", indicator=True)
Out[42]:
   col1 col_left  col_right      _merge
0     0        a        NaN   left_only
1     1        b        2.0        both
2     2      NaN        2.0  right_only
3     2      NaN        2.0  right_only

[4 rows x 4 columns]

For more, see the updated docs


DataFrame has gained a round method, which also accepts a per-column mapping of decimal places:

In [49]: df = pd.DataFrame(
   ....:     np.random.random([3, 3]),
   ....:     columns=["A", "B", "C"],
   ....:     index=["first", "second", "third"],
   ....: )

In [50]: df
Out[50]:
               A         B         C
first   0.126970  0.966718  0.260476
second  0.897237  0.376750  0.336222
third   0.451376  0.840255  0.123102

[3 rows x 3 columns]

In [51]: df.round(2)
Out[51]:
           A     B     C
first   0.13  0.97  0.26
second  0.90  0.38  0.34
third   0.45  0.84  0.12

[3 rows x 3 columns]

In [52]: df.round({"A": 0, "C": 2})
Out[52]:
          A         B     C
first   0.0  0.966718  0.26
second  1.0  0.376750  0.34
third   0.0  0.840255  0.12

[3 rows x 3 columns]

reindex now has a tolerance argument, which limits how far to match when filling with method='nearest' or other fill methods:

In [57]: df = pd.DataFrame({"x": range(5), "t": pd.date_range("2000-01-01", periods=5)})

In [58]: df.reindex([0.1, 1.9, 3.5], method="nearest", tolerance=0.2)
Out[58]:
       x          t
0.1  0.0 2000-01-01
1.9  2.0 2000-01-03
3.5  NaN        NaT

[3 rows x 2 columns]

When used on a DatetimeIndex, TimedeltaIndex or PeriodIndex, tolerance will be coerced into a Timedelta if possible. This allows you to specify tolerance with a string:

In [59]: df = df.set_index("t")

In [60]: df.reindex(pd.to_datetime(["1999-12-31"]), method="nearest", tolerance="1 day")
Out[60]:
            x
1999-12-31  0

[1 rows x 1 columns]

tolerance is also exposed by the lower-level Index.get_indexer and Index.get_loc methods.

Backwards incompatible API changes#

Changes to sorting API#

The sorting API has had some longtime inconsistencies. (GH 9816, GH 8239).

To address these issues, we have revamped the API. There are now two distinct and non-overlapping methods of sorting. A * marks items that will show a FutureWarning.

To sort by the values:

Previous                       Replacement
* Series.order()               Series.sort_values()
* Series.sort()                Series.sort_values(inplace=True)
* DataFrame.sort(columns=...)  DataFrame.sort_values(by=...)

To sort by the index:

Previous                        Replacement
Series.sort_index()             Series.sort_index()
Series.sortlevel(level=...)     Series.sort_index(level=...)
DataFrame.sort_index()          DataFrame.sort_index()
DataFrame.sortlevel(level=...)  DataFrame.sort_index(level=...)
* DataFrame.sort()              DataFrame.sort_index()

We have also deprecated and changed similar methods in two Series-like classes, Index and Categorical.

Previous               Replacement
* Index.order()        Index.sort_values()
* Categorical.order()  Categorical.sort_values()
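A quick sketch of the replacements in action (the starred calls above emit a FutureWarning in 0.17.0):

import pandas as pd

s = pd.Series([3, 1, 2], index=["c", "a", "b"])

s.sort_values()              # replaces s.order()
s.sort_values(inplace=True)  # replaces s.sort()
s.sort_index()               # sorting by the index keeps its name

df = pd.DataFrame({"A": [2, 1]}, index=["y", "x"])
df.sort_values(by="A")       # replaces df.sort(columns="A")
df.sort_index()              # replaces df.sort() with no arguments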

Changes to to_datetime and to_timedelta#

Error handling#

The default for pd.to_datetime error handling has changed to errors='raise'. In prior versions it was errors='ignore'. Furthermore, the coerce argument has been deprecated in favor of errors='coerce'. This means that invalid parsing will raise rather than return the original input as in previous versions. (GH 10636)

Previous behavior:

In [2]: pd.to_datetime(['2009-07-31', 'asd'])
Out[2]: array(['2009-07-31', 'asd'], dtype=object)

New behavior:

In [3]: pd.to_datetime(['2009-07-31', 'asd'])
ValueError: Unknown string format

Of course you can coerce this as well.

In [61]: pd.to_datetime(["2009-07-31", "asd"], errors="coerce")
Out[61]: DatetimeIndex(['2009-07-31', 'NaT'], dtype='datetime64[ns]', freq=None)

To keep the previous behavior, you can use errors='ignore':

In [4]: pd.to_datetime(["2009-07-31", "asd"], errors="ignore")
Out[4]: Index(['2009-07-31', 'asd'], dtype='object')

Furthermore, pd.to_timedelta has gained a similar API of errors='raise'|'ignore'|'coerce', and the coerce keyword has been deprecated in favor of errors='coerce'.
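A quick sketch of the corresponding to_timedelta behavior:

import pandas as pd

# Invalid input now raises by default (errors='raise')...
try:
    pd.to_timedelta(["1 days", "asd"])
except ValueError:
    pass

# ...while errors='coerce' converts unparsable values to NaT
pd.to_timedelta(["1 days", "asd"], errors="coerce")
# TimedeltaIndex(['1 days', NaT], dtype='timedelta64[ns]', freq=None)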

Consistent parsing#

The string parsing of to_datetime, Timestamp and DatetimeIndex has been made consistent. (GH 7599)

Prior to v0.17.0, Timestamp and to_datetime could parse a year-only datetime string incorrectly using today's date, whereas DatetimeIndex uses the beginning of the year. Timestamp and to_datetime could also raise ValueError on some types of datetime strings which DatetimeIndex can parse, such as quarterly strings.

Previous behavior:

In [1]: pd.Timestamp('2012Q2')
Traceback
   ...
ValueError: Unable to parse 2012Q2

Results in today's date.

In [2]: pd.Timestamp('2014')
Out[2]: 2014-08-12 00:00:00

v0.17.0 can parse them as below. This works on DatetimeIndex as well.

New behavior:

In [62]: pd.Timestamp("2012Q2")
Out[62]: Timestamp('2012-04-01 00:00:00')

In [63]: pd.Timestamp("2014")
Out[63]: Timestamp('2014-01-01 00:00:00')

In [64]: pd.DatetimeIndex(["2012Q2", "2014"])
Out[64]: DatetimeIndex(['2012-04-01', '2014-01-01'], dtype='datetime64[ns]', freq=None)

Note

If you want to perform calculations based on today’s date, use Timestamp.now() and pandas.tseries.offsets.

In [65]: import pandas.tseries.offsets as offsets

In [66]: pd.Timestamp.now()
Out[66]: Timestamp('2024-09-20 12:30:23.176994')

In [67]: pd.Timestamp.now() + offsets.DateOffset(years=1)
Out[67]: Timestamp('2025-09-20 12:30:23.177749')

Changes to Index comparisons#

The equality operator on Index now behaves similarly to Series (GH 9947, GH 10637)

Starting in v0.17.0, comparing Index objects of different lengths will raise a ValueError. This is to be consistent with the behavior of Series.

Previous behavior:

In [2]: pd.Index([1, 2, 3]) == pd.Index([1, 4, 5])
Out[2]: array([ True, False, False], dtype=bool)

In [3]: pd.Index([1, 2, 3]) == pd.Index([2])
Out[3]: array([False, True, False], dtype=bool)

In [4]: pd.Index([1, 2, 3]) == pd.Index([1, 2])
Out[4]: False

New behavior:

In [8]: pd.Index([1, 2, 3]) == pd.Index([1, 4, 5])
Out[8]: array([ True, False, False], dtype=bool)

In [9]: pd.Index([1, 2, 3]) == pd.Index([2])
ValueError: Lengths must match to compare

In [10]: pd.Index([1, 2, 3]) == pd.Index([1, 2])
ValueError: Lengths must match to compare

Note that this is different from the numpy behavior where a comparison can be broadcast:

In [68]: np.array([1, 2, 3]) == np.array([1])
Out[68]: array([ True, False, False])

or it can return False if broadcasting cannot be done:

In [11]: np.array([1, 2, 3]) == np.array([1, 2])
Out[11]: False

Changes to boolean comparisons vs. None#

Boolean comparisons of a Series vs None will now be equivalent to comparing with np.nan, rather than raise TypeError. (GH 1079).

In [69]: s = pd.Series(range(3), dtype="float")

In [70]: s.iloc[1] = None

In [71]: s
Out[71]:
0    0.0
1    NaN
2    2.0
Length: 3, dtype: float64

Previous behavior:

In [5]: s == None
TypeError: Could not compare <type 'NoneType'> type with Series

New behavior:

In [72]: s == None
Out[72]:
0    False
1    False
2    False
Length: 3, dtype: bool

Usually you simply want to know which values are null.

In [73]: s.isnull()
Out[73]:
0    False
1     True
2    False
Length: 3, dtype: bool

Warning

You generally will want to use isnull/notnull for these types of comparisons, as isnull/notnull tells you which elements are null. One has to be mindful that NaN values don't compare equal, but None values do. Note that pandas/numpy uses the fact that np.nan != np.nan, and treats None like np.nan.

In [74]: None == None
Out[74]: True

In [75]: np.nan == np.nan
Out[75]: False

HDFStore dropna behavior#

The default behavior for HDFStore write functions with format='table' is now to keep rows that are all missing. Previously, the behavior was to drop rows that were all missing except for the index. The previous behavior can be replicated using the dropna=True option. (GH 9382)

In [76]: df_with_missing = pd.DataFrame(
   ....:     {"col1": [0, np.nan, 2], "col2": [1, np.nan, np.nan]}
   ....: )

In [77]: df_with_missing
Out[77]:
   col1  col2
0   0.0   1.0
1   NaN   NaN
2   2.0   NaN

[3 rows x 2 columns]

Previous behavior:

In [27]: df_with_missing.to_hdf('file.h5', key='df_with_missing', format='table', mode='w')

In [28]: pd.read_hdf('file.h5', 'df_with_missing')
Out[28]:
   col1  col2
0     0     1
2     2   NaN

New behavior:

In [78]: df_with_missing.to_hdf("file.h5", key="df_with_missing", format="table", mode="w")

In [79]: pd.read_hdf("file.h5", "df_with_missing")
Out[79]:
   col1  col2
0   0.0   1.0
1   NaN   NaN
2   2.0   NaN

[3 rows x 2 columns]
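To replicate the previous behavior, pass dropna=True when writing; a quick sketch using the df_with_missing frame from above:

# With dropna=True, rows that are entirely missing are dropped on write,
# matching the pre-0.17.0 default
df_with_missing.to_hdf(
    "file.h5", key="df_with_missing", format="table", mode="w", dropna=True
)
pd.read_hdf("file.h5", "df_with_missing")
#    col1  col2
# 0   0.0   1.0
# 2   2.0   NaN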

See the docs for more details.

Changes to display.precision option#

The display.precision option has been clarified to refer to decimal places (GH 10451).

Earlier versions of pandas would format floating point numbers to have one less decimal place than the value in display.precision.

In [1]: pd.set_option('display.precision', 2)

In [2]: pd.DataFrame({'x': [123.456789]})
Out[2]:
       x
0  123.5

If interpreting precision as “significant figures” this did work for scientific notation but that same interpretation did not work for values with standard formatting. It was also out of step with how numpy handles formatting.

Going forward the value of display.precision will directly control the number of places after the decimal, for regular formatting as well as scientific notation, similar to how numpy’s precision print option works.

In [80]: pd.set_option("display.precision", 2)

In [81]: pd.DataFrame({"x": [123.456789]})
Out[81]:
        x
0  123.46

[1 rows x 1 columns]

To preserve output behavior with prior versions, the default value of display.precision has been reduced to 6 from 7.

Changes to Categorical.unique#

Categorical.unique now returns new Categoricals with categories and codes that are unique, rather than returning np.array (GH 10508)

In [82]: cat = pd.Categorical(["C", "A", "B", "C"], categories=["A", "B", "C"], ordered=True)

In [83]: cat
Out[83]:
['C', 'A', 'B', 'C']
Categories (3, object): ['A' < 'B' < 'C']

In [84]: cat.unique()
Out[84]:
['C', 'A', 'B']
Categories (3, object): ['A' < 'B' < 'C']

In [85]: cat = pd.Categorical(["C", "A", "B", "C"], categories=["A", "B", "C"])

In [86]: cat
Out[86]:
['C', 'A', 'B', 'C']
Categories (3, object): ['A', 'B', 'C']

In [87]: cat.unique()
Out[87]:
['C', 'A', 'B']
Categories (3, object): ['A', 'B', 'C']

Other API changes#

Deprecations#

Note

These indexing functions have been deprecated in the documentation since 0.11.0.

Removal of prior version deprecations/changes#

The automatic broadcasting of a TimeSeries along a DataFrame's index in arithmetic operations, deprecated since 0.8.0, has been removed; use an explicit operation such as df.add(df.A, axis="index") instead.

In [88]: np.random.seed(1234)

In [89]: df = pd.DataFrame(
   ....:     np.random.randn(5, 2),
   ....:     columns=["A", "B"],
   ....:     index=pd.date_range("2013-01-01", periods=5),
   ....: )

In [90]: df
Out[90]:
                   A         B
2013-01-01  0.471435 -1.190976
2013-01-02  1.432707 -0.312652
2013-01-03 -0.720589  0.887163
2013-01-04  0.859588 -0.636524
2013-01-05  0.015696 -2.242685

[5 rows x 2 columns]

Previously

In [3]: df + df.A
FutureWarning: TimeSeries broadcasting along DataFrame index by default is deprecated.
Please use DataFrame.<op> to explicitly broadcast arithmetic operations along the index

Out[3]:
                   A         B
2013-01-01  0.942870 -0.719541
2013-01-02  2.865414  1.120055
2013-01-03 -1.441177  0.166574
2013-01-04  1.719177  0.223065
2013-01-05  0.031393 -2.226989

Current

In [91]: df.add(df.A, axis="index")
Out[91]:
                   A         B
2013-01-01  0.942870 -0.719541
2013-01-02  2.865414  1.120055
2013-01-03 -1.441177  0.166574
2013-01-04  1.719177  0.223065
2013-01-05  0.031393 -2.226989

[5 rows x 2 columns]

Performance improvements#

Bug fixes#

Contributors#

A total of 112 people contributed patches to this release. People with a “+” by their names contributed a patch for the first time.