Version 0.18.0 (March 13, 2016)

This is a major release from 0.17.1 and includes a small number of API changes, several new features, enhancements, and performance improvements along with a large number of bug fixes. We recommend that all users upgrade to this version.

Warning

pandas >= 0.18.0 no longer supports compatibility with Python version 2.6 and 3.3 (GH 7718, GH 11273)

Warning

numexpr version 2.4.4 will now show a warning and not be used as a computation back-end for pandas because of some buggy behavior. This does not affect other versions (>= 2.1 and >= 2.4.6). (GH 12489)

Highlights include:

- Window functions are now methods on Series/DataFrame objects, rather than top-level functions
- A new RangeIndex, the memory-saving default index for NDFrame objects
- A more groupby-like API for .resample
- Removal of support for positional indexing with floats, deprecated since 0.14.0
- A new .to_xarray() method for conversion to xarray objects
- read_sas support for SAS7BDAT files
- Changes to .str.cat, .rank, and eval

Check the API Changes and deprecations before updating.


New features#

Window functions are now methods#

Window functions have been refactored to be methods on Series/DataFrame objects, rather than top-level functions, which are now deprecated. This gives these window-type functions an API similar to that of .groupby. See the full documentation here (GH 11603, GH 12373)

In [1]: np.random.seed(1234)

In [2]: df = pd.DataFrame({'A': range(10), 'B': np.random.randn(10)})

In [3]: df
Out[3]:
   A         B
0  0  0.471435
1  1 -1.190976
2  2  1.432707
3  3 -0.312652
4  4 -0.720589
5  5  0.887163
6  6  0.859588
7  7 -0.636524
8  8  0.015696
9  9 -2.242685

[10 rows x 2 columns]

Previous behavior:

In [8]: pd.rolling_mean(df, window=3)
FutureWarning: pd.rolling_mean is deprecated for DataFrame and will be removed in a future version,
               replace with DataFrame.rolling(window=3,center=False).mean()
Out[8]:
    A         B
0 NaN       NaN
1 NaN       NaN
2   1  0.237722
3   2 -0.023640
4   3  0.133155
5   4 -0.048693
6   5  0.342054
7   6  0.370076
8   7  0.079587
9   8 -0.954504

New behavior:

In [4]: r = df.rolling(window=3)

These show a descriptive repr

In [5]: r
Out[5]: Rolling [window=3,center=False,axis=0,method=single]

with tab-completion of available methods and properties.

In [9]: r.  # noqa E225, E999
r.A           r.agg         r.apply       r.count       r.exclusions  r.max         r.median      r.name        r.skew        r.sum
r.B           r.aggregate   r.corr        r.cov         r.kurt        r.mean        r.min         r.quantile    r.std         r.var

The methods operate on the Rolling object itself

In [6]: r.mean()
Out[6]:
     A         B
0  NaN       NaN
1  NaN       NaN
2  1.0  0.237722
3  2.0 -0.023640
4  3.0  0.133155
5  4.0 -0.048693
6  5.0  0.342054
7  6.0  0.370076
8  7.0  0.079587
9  8.0 -0.954504

[10 rows x 2 columns]

They provide getitem accessors

In [7]: r['A'].mean()
Out[7]:
0    NaN
1    NaN
2    1.0
3    2.0
4    3.0
5    4.0
6    5.0
7    6.0
8    7.0
9    8.0
Name: A, Length: 10, dtype: float64

And multiple aggregations

In [8]: r.agg({'A': ['mean', 'std'],
   ...:        'B': ['mean', 'std']})
   ...:
Out[8]:
     A              B
  mean  std      mean       std
0  NaN  NaN       NaN       NaN
1  NaN  NaN       NaN       NaN
2  1.0  1.0  0.237722  1.327364
3  2.0  1.0 -0.023640  1.335505
4  3.0  1.0  0.133155  1.143778
5  4.0  1.0 -0.048693  0.835747
6  5.0  1.0  0.342054  0.920379
7  6.0  1.0  0.370076  0.871850
8  7.0  1.0  0.079587  0.750099
9  8.0  1.0 -0.954504  1.162285

[10 rows x 4 columns]

Changes to rename#

Series.rename and NDFrame.rename_axis can now take a scalar or list-like argument for altering the Series or axis name, in addition to their old behaviors of altering labels. (GH 9494, GH 11965)

In [9]: s = pd.Series(np.random.randn(5))

In [10]: s.rename('newname')
Out[10]:
0    1.150036
1    0.991946
2    0.953324
3   -2.021255
4   -0.334077
Name: newname, Length: 5, dtype: float64

In [11]: df = pd.DataFrame(np.random.randn(5, 2))

In [12]: (df.rename_axis("indexname")
   ....:    .rename_axis("columns_name", axis="columns"))
   ....:
Out[12]:
columns_name         0         1
indexname
0             0.002118  0.405453
1             0.289092  1.321158
2            -1.546906 -0.202646
3            -0.655969  0.193421
4             0.553439  1.318152

[5 rows x 2 columns]

The new functionality works well in method chains. Previously these methods only accepted functions or dicts mapping a label to a new label. This continues to work as before for function or dict-like values.
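For example, a minimal sketch of the scalar form used inside a method chain (the names 'score' and 'row_id' are illustrative, not from the original release notes):

import numpy as np
import pandas as pd

s = pd.Series(np.random.randn(5))
out = (s.rename('score')            # scalar: sets the Series name
        .to_frame()                 # the name becomes the column label
        .rename_axis('row_id'))     # scalar: sets the index name
print(out.head())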

Range Index#

A RangeIndex has been added as a subclass of Int64Index to provide a memory-saving alternative for common use cases. It has a similar implementation to the python range object (xrange in python 2), in that it only stores the start, stop, and step values for the index. It will transparently interact with the user API, converting to Int64Index if needed.

This will now be the default constructed index for NDFrame objects, rather than an Int64Index as previously. (GH 939, GH 12070, GH 12071, GH 12109, GH 12888)

Previous behavior:

In [3]: s = pd.Series(range(1000))

In [4]: s.index
Out[4]:
Int64Index([  0,   1,   2,   3,   4,   5,   6,   7,   8,   9,
            ...
            990, 991, 992, 993, 994, 995, 996, 997, 998, 999], dtype='int64', length=1000)

In [6]: s.index.nbytes
Out[6]: 8000

New behavior:

In [13]: s = pd.Series(range(1000))

In [14]: s.index
Out[14]: RangeIndex(start=0, stop=1000, step=1)

In [15]: s.index.nbytes
Out[15]: 128
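A short sketch of the transparent conversion mentioned above: operations that break the regular start/stop/step spacing materialize the values (as an Int64Index in 0.18.0; a plain integer Index in recent pandas).

import pandas as pd

s = pd.Series(range(1000))
print(s.index)           # RangeIndex(start=0, stop=1000, step=1)

# Dropping a label breaks the regular spacing, so the index is
# materialized into a full integer index behind the scenes.
s2 = s.drop(3)
print(s2.index[:5])      # Index([0, 1, 2, 4, 5], ...)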

Changes to str.cat#

The method .str.cat() concatenates the members of a Series. Before, if NaN values were present in the Series, calling .str.cat() on it would return NaN, unlike the rest of the Series.str.* API. This behavior has been amended to ignore NaN values by default. (GH 11435).

A new, friendlier ValueError is added to protect against the mistake of supplying the sep as an arg, rather than as a kwarg. (GH 11334).

In [27]: pd.Series(['a', 'b', np.nan, 'c']).str.cat(sep=' ')
Out[27]: 'a b c'

In [28]: pd.Series(['a', 'b', np.nan, 'c']).str.cat(sep=' ', na_rep='?')
Out[28]: 'a b ? c'

In [2]: pd.Series(['a', 'b', np.nan, 'c']).str.cat(' ')
ValueError: Did you mean to supply a sep keyword?

Datetimelike rounding#

DatetimeIndex, Timestamp, TimedeltaIndex, Timedelta have gained the .round(), .floor() and .ceil() methods for datetimelike rounding, flooring and ceiling. (GH 4314, GH 11963)

Naive datetimes

In [29]: dr = pd.date_range('20130101 09:12:56.1234', periods=3)

In [30]: dr
Out[30]:
DatetimeIndex(['2013-01-01 09:12:56.123400', '2013-01-02 09:12:56.123400',
               '2013-01-03 09:12:56.123400'],
              dtype='datetime64[ns]', freq='D')

In [31]: dr.round('s')
Out[31]:
DatetimeIndex(['2013-01-01 09:12:56', '2013-01-02 09:12:56',
               '2013-01-03 09:12:56'],
              dtype='datetime64[ns]', freq=None)

Timestamp scalar

In [32]: dr[0]
Out[32]: Timestamp('2013-01-01 09:12:56.123400')

In [33]: dr[0].round('10s')
Out[33]: Timestamp('2013-01-01 09:13:00')

Tz-aware datetimes are rounded, floored and ceiled in local times

In [34]: dr = dr.tz_localize('US/Eastern')

In [35]: dr
Out[35]:
DatetimeIndex(['2013-01-01 09:12:56.123400-05:00',
               '2013-01-02 09:12:56.123400-05:00',
               '2013-01-03 09:12:56.123400-05:00'],
              dtype='datetime64[ns, US/Eastern]', freq=None)

In [36]: dr.round('s')
Out[36]:
DatetimeIndex(['2013-01-01 09:12:56-05:00', '2013-01-02 09:12:56-05:00',
               '2013-01-03 09:12:56-05:00'],
              dtype='datetime64[ns, US/Eastern]', freq=None)

Timedeltas

In [37]: t = pd.timedelta_range('1 days 2 hr 13 min 45 us', periods=3, freq='d')

In [38]: t
Out[38]:
TimedeltaIndex(['1 days 02:13:00.000045', '2 days 02:13:00.000045',
                '3 days 02:13:00.000045'],
               dtype='timedelta64[ns]', freq='D')

In [39]: t.round('10min')
Out[39]:
TimedeltaIndex(['1 days 02:10:00', '2 days 02:10:00', '3 days 02:10:00'],
               dtype='timedelta64[ns]', freq=None)

Timedelta scalar

In [40]: t[0]
Out[40]: Timedelta('1 days 02:13:00.000045')

In [41]: t[0].round('2h')
Out[41]: Timedelta('1 days 02:00:00')

In addition, .round(), .floor() and .ceil() will be available through the .dt accessor of Series.

In [42]: s = pd.Series(dr)

In [43]: s
Out[43]:
0   2013-01-01 09:12:56.123400-05:00
1   2013-01-02 09:12:56.123400-05:00
2   2013-01-03 09:12:56.123400-05:00
Length: 3, dtype: datetime64[ns, US/Eastern]

In [44]: s.dt.round('D')
Out[44]:
0   2013-01-01 00:00:00-05:00
1   2013-01-02 00:00:00-05:00
2   2013-01-03 00:00:00-05:00
Length: 3, dtype: datetime64[ns, US/Eastern]

Formatting of integers in FloatIndex#

Integers in FloatIndex, e.g. 1., are now formatted with a decimal point and a 0 digit, e.g. 1.0 (GH 11713). This change affects not only the display in the console, but also the output of IO methods like .to_csv or .to_html.

Previous behavior:

In [2]: s = pd.Series([1, 2, 3], index=np.arange(3.))

In [3]: s
Out[3]:
0    1
1    2
2    3
dtype: int64

In [4]: s.index
Out[4]: Float64Index([0.0, 1.0, 2.0], dtype='float64')

In [5]: print(s.to_csv(path=None))
0,1
1,2
2,3

New behavior:

In [45]: s = pd.Series([1, 2, 3], index=np.arange(3.))

In [46]: s
Out[46]:
0.0    1
1.0    2
2.0    3
Length: 3, dtype: int64

In [47]: s.index
Out[47]: Index([0.0, 1.0, 2.0], dtype='float64')

In [48]: print(s.to_csv(path_or_buf=None, header=False))
0.0,1
1.0,2
2.0,3

Changes to dtype assignment behaviors#

When a DataFrame’s slice is updated with a new slice of the same dtype, the dtype of the DataFrame will now remain the same. (GH 10503)

Previous behavior:

In [5]: df = pd.DataFrame({'a': [0, 1, 1],
                           'b': pd.Series([100, 200, 300], dtype='uint32')})

In [7]: df.dtypes
Out[7]:
a     int64
b    uint32
dtype: object

In [8]: ix = df['a'] == 1

In [9]: df.loc[ix, 'b'] = df.loc[ix, 'b']

In [11]: df.dtypes
Out[11]:
a    int64
b    int64
dtype: object

New behavior:

In [49]: df = pd.DataFrame({'a': [0, 1, 1],
   ....:                    'b': pd.Series([100, 200, 300], dtype='uint32')})
   ....:

In [50]: df.dtypes
Out[50]:
a     int64
b    uint32
Length: 2, dtype: object

In [51]: ix = df['a'] == 1

In [52]: df.loc[ix, 'b'] = df.loc[ix, 'b']

In [53]: df.dtypes
Out[53]:
a     int64
b    uint32
Length: 2, dtype: object

When a DataFrame’s integer slice is partially updated with a new slice of floats that could potentially be down-casted to integer without losing precision, the dtype of the slice will be set to float instead of integer.

Previous behavior:

In [4]: df = pd.DataFrame(np.array(range(1,10)).reshape(3,3),
                          columns=list('abc'),
                          index=[[4,4,8], [8,10,12]])

In [5]: df
Out[5]:
      a  b  c
4 8   1  2  3
  10  4  5  6
8 12  7  8  9

In [7]: df.ix[4, 'c'] = np.array([0., 1.])

In [8]: df
Out[8]:
      a  b  c
4 8   1  2  0
  10  4  5  1
8 12  7  8  9

New behavior:

In [54]: df = pd.DataFrame(np.array(range(1,10)).reshape(3,3),
   ....:                   columns=list('abc'),
   ....:                   index=[[4,4,8], [8,10,12]])
   ....:

In [55]: df
Out[55]:
      a  b  c
4 8   1  2  3
  10  4  5  6
8 12  7  8  9

[3 rows x 3 columns]

In [56]: df.loc[4, 'c'] = np.array([0., 1.])

In [57]: df
Out[57]:
      a  b  c
4 8   1  2  0
  10  4  5  1
8 12  7  8  9

[3 rows x 3 columns]

Method to_xarray#

In a future version of pandas, we will be deprecating Panel and other > 2 ndim objects. In order to provide continuity, all NDFrame objects have gained the .to_xarray() method to convert to xarray objects, which provide a pandas-like interface for > 2 ndim data. (GH 11972)

See the xarray full-documentation here.

In [1]: p = Panel(np.arange(2*3*4).reshape(2, 3, 4))

In [2]: p.to_xarray()
Out[2]:
<xarray.DataArray (items: 2, major_axis: 3, minor_axis: 4)>
array([[[ 0,  1,  2,  3],
        [ 4,  5,  6,  7],
        [ 8,  9, 10, 11]],

       [[12, 13, 14, 15],
        [16, 17, 18, 19],
        [20, 21, 22, 23]]])
Coordinates:
  * items       (items) int64 0 1
  * major_axis  (major_axis) int64 0 1 2
  * minor_axis  (minor_axis) int64 0 1 2 3
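Since Panel itself is long gone, here is a minimal sketch of the same conversion on a DataFrame; this assumes the optional xarray dependency is installed.

import numpy as np
import pandas as pd

df = pd.DataFrame({'x': np.arange(4.0), 'y': list('abcd')},
                  index=pd.Index(range(4), name='i'))
ds = df.to_xarray()   # an xarray.Dataset; the index 'i' becomes a coordinate
print(ds)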

Latex representation#

DataFrame has gained a ._repr_latex_() method in order to allow for conversion to LaTeX in an IPython/Jupyter notebook using nbconvert. (GH 11778)

Note that this must be activated by setting the option pd.display.latex.repr=True (GH 12182)

For example, if you have a Jupyter notebook you plan to convert to LaTeX using nbconvert, place the statement pd.display.latex.repr=True in the first cell to have the contained DataFrame output also stored as LaTeX.

The options display.latex.escape and display.latex.longtable have also been added to the configuration and are used automatically by the to_latex method. See the available options docs for more info.
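A minimal sketch of enabling the option and inspecting the generated LaTeX. This reflects the 0.18-era API; the display.latex.* options were later removed in pandas 2.0.

import pandas as pd

pd.set_option('display.latex.repr', True)   # 0.18-era option; removed in pandas 2.0
df = pd.DataFrame({'a': [1, 2], 'b': [3, 4]})
print(df._repr_latex_())   # the LaTeX source that nbconvert would embed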

pd.read_sas() changes#

read_sas has gained the ability to read SAS7BDAT files, including compressed files. The files can be read in their entirety, or incrementally. For full details see here. (GH 4052)
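A minimal sketch of both modes; 'example.sas7bdat' is a hypothetical file path.

import pandas as pd

# read the whole file at once
df = pd.read_sas('example.sas7bdat')

# or iterate incrementally in chunks of 10,000 rows
for chunk in pd.read_sas('example.sas7bdat', chunksize=10000):
    print(len(chunk))   # placeholder for per-chunk processing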

Other enhancements#

Backwards incompatible API changes#

NaT and Timedelta operations#

NaT and Timedelta have expanded arithmetic operations, which are extended to Series arithmetic where applicable. Operations defined for datetime64[ns] or timedelta64[ns] are now also defined for NaT (GH 11564).

NaT now supports arithmetic operations with integers and floats.

In [58]: pd.NaT * 1
Out[58]: NaT

In [59]: pd.NaT * 1.5
Out[59]: NaT

In [60]: pd.NaT / 2
Out[60]: NaT

In [61]: pd.NaT * np.nan
Out[61]: NaT

NaT defines more arithmetic operations with datetime64[ns] and timedelta64[ns].

In [62]: pd.NaT / pd.NaT
Out[62]: nan

In [63]: pd.Timedelta('1s') / pd.NaT
Out[63]: nan

NaT may represent either a datetime64[ns] null or a timedelta64[ns] null. Given the ambiguity, it is treated as a timedelta64[ns], which allows more operations to succeed.

In [64]: pd.NaT + pd.NaT
Out[64]: NaT

same as

In [65]: pd.Timedelta('1s') + pd.Timedelta('1s')
Out[65]: Timedelta('0 days 00:00:02')

as opposed to

In [3]: pd.Timestamp('19900315') + pd.Timestamp('19900315')
TypeError: unsupported operand type(s) for +: 'Timestamp' and 'Timestamp'

However, when wrapped in a Series whose dtype is datetime64[ns] or timedelta64[ns], the dtype information is respected.

In [1]: pd.Series([pd.NaT], dtype='<M8[ns]') + pd.Series([pd.NaT], dtype='<M8[ns]')
TypeError: can only operate on a datetimes for subtraction,
           but the operator [__add__] was passed

In [66]: pd.Series([pd.NaT], dtype='<m8[ns]') + pd.Series([pd.NaT], dtype='<m8[ns]')
Out[66]:
0   NaT
Length: 1, dtype: timedelta64[ns]

Timedelta division by floats now works.

In [67]: pd.Timedelta('1s') / 2.0
Out[67]: Timedelta('0 days 00:00:00.500000')

Subtracting a Series of Timedelta from a Timestamp works (GH 11925)

In [68]: ser = pd.Series(pd.timedelta_range('1 day', periods=3))

In [69]: ser
Out[69]:
0   1 days
1   2 days
2   3 days
Length: 3, dtype: timedelta64[ns]

In [70]: pd.Timestamp('2012-01-01') - ser
Out[70]:
0   2011-12-31
1   2011-12-30
2   2011-12-29
Length: 3, dtype: datetime64[ns]

NaT.isoformat() now returns 'NaT'. This change allows pd.Timestamp to rehydrate any timestamp-like object from its isoformat (GH 12300).
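A short sketch of the round-trip this enables:

import pandas as pd

ts = pd.Timestamp('2016-03-13 09:00:00')
assert pd.Timestamp(ts.isoformat()) == ts   # ordinary timestamps round-trip

print(pd.NaT.isoformat())                   # 'NaT'
print(pd.Timestamp(pd.NaT.isoformat()))     # NaT rehydrates back to NaT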

Changes to msgpack#

Forward incompatible changes in msgpack writing format were made over 0.17.0 and 0.18.0; older versions of pandas cannot read files packed by newer versions (GH 12129, GH 10527)

Bugs in to_msgpack and read_msgpack, introduced in 0.17.0 and fixed in 0.18.0, made files packed in Python 2 unreadable by Python 3 (GH 12142). The following table describes the backward and forward compatibility of msgpack files.

Warning

Packed with            Can be unpacked with
pre-0.17 / Python 2    any
pre-0.17 / Python 3    any
0.17 / Python 2        ==0.17 / Python 2; >=0.18 / any Python
0.17 / Python 3        >=0.18 / any Python
0.18                   >=0.18

0.18.0 is backward-compatible for reading files packed by older versions, except for files packed with 0.17 in Python 2, in which case they can only be unpacked in Python 2.
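For reference, a minimal sketch of the 0.18-era round-trip; note that to_msgpack/read_msgpack were deprecated in 0.25 and removed in pandas 1.0.

import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3]})
df.to_msgpack('frame.msg')            # write (0.18-era API)
same = pd.read_msgpack('frame.msg')   # read back with the same or a newer pandas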

Signature change for .rank#

Series.rank and DataFrame.rank now have the same signature (GH 11759)

Previous signature

In [3]: pd.Series([0,1]).rank(method='average', na_option='keep',
                              ascending=True, pct=False)
Out[3]:
0    1
1    2
dtype: float64

In [4]: pd.DataFrame([0,1]).rank(axis=0, numeric_only=None, method='average',
                                 na_option='keep', ascending=True, pct=False)
Out[4]:
   0
0  1
1  2

New signature

In [71]: pd.Series([0,1]).rank(axis=0, method='average', numeric_only=False,
   ....:                       na_option='keep', ascending=True, pct=False)
   ....:
Out[71]:
0    1.0
1    2.0
Length: 2, dtype: float64

In [72]: pd.DataFrame([0,1]).rank(axis=0, method='average', numeric_only=False,
   ....:                          na_option='keep', ascending=True, pct=False)
   ....:
Out[72]:
     0
0  1.0
1  2.0

[2 rows x 1 columns]

Bug in QuarterBegin with n=0#

In previous versions, the behavior of the QuarterBegin offset was inconsistent depending on the date when the n parameter was 0. (GH 11406)

The general semantics of anchored offsets for n=0 is to not move the date when it is an anchor point (e.g., a quarter start date), and otherwise roll forward to the next anchor point.

In [73]: d = pd.Timestamp('2014-02-01')

In [74]: d
Out[74]: Timestamp('2014-02-01 00:00:00')

In [75]: d + pd.offsets.QuarterBegin(n=0, startingMonth=2)
Out[75]: Timestamp('2014-02-01 00:00:00')

In [76]: d + pd.offsets.QuarterBegin(n=0, startingMonth=1)
Out[76]: Timestamp('2014-04-01 00:00:00')

For the QuarterBegin offset in previous versions, the date would be rolled backwards if the date was in the same month as the quarter start date.

In [3]: d = pd.Timestamp('2014-02-15')

In [4]: d + pd.offsets.QuarterBegin(n=0, startingMonth=2)
Out[4]: Timestamp('2014-02-01 00:00:00')

This behavior has been corrected in version 0.18.0, which is consistent with other anchored offsets like MonthBegin and YearBegin.

In [77]: d = pd.Timestamp('2014-02-15')

In [78]: d + pd.offsets.QuarterBegin(n=0, startingMonth=2)
Out[78]: Timestamp('2014-05-01 00:00:00')

Resample API#

Like the change in the window functions API above, .resample(...) is changing to have a more groupby-like API. (GH 11732, GH 12702, GH 12202, GH 12332, GH 12334, GH 12348, GH 12448).

In [79]: np.random.seed(1234)

In [80]: df = pd.DataFrame(np.random.rand(10,4),
   ....:                   columns=list('ABCD'),
   ....:                   index=pd.date_range('2010-01-01 09:00:00',
   ....:                                       periods=10, freq='s'))
   ....:

In [81]: df
Out[81]:
                            A         B         C         D
2010-01-01 09:00:00  0.191519  0.622109  0.437728  0.785359
2010-01-01 09:00:01  0.779976  0.272593  0.276464  0.801872
2010-01-01 09:00:02  0.958139  0.875933  0.357817  0.500995
2010-01-01 09:00:03  0.683463  0.712702  0.370251  0.561196
2010-01-01 09:00:04  0.503083  0.013768  0.772827  0.882641
2010-01-01 09:00:05  0.364886  0.615396  0.075381  0.368824
2010-01-01 09:00:06  0.933140  0.651378  0.397203  0.788730
2010-01-01 09:00:07  0.316836  0.568099  0.869127  0.436173
2010-01-01 09:00:08  0.802148  0.143767  0.704261  0.704581
2010-01-01 09:00:09  0.218792  0.924868  0.442141  0.909316

[10 rows x 4 columns]

Previous API:

You would write a resampling operation that immediately evaluates. If a how parameter was not provided, it would default to how='mean'.

In [6]: df.resample('2s')
Out[6]:
                            A         B         C         D
2010-01-01 09:00:00  0.485748  0.447351  0.357096  0.793615
2010-01-01 09:00:02  0.820801  0.794317  0.364034  0.531096
2010-01-01 09:00:04  0.433985  0.314582  0.424104  0.625733
2010-01-01 09:00:06  0.624988  0.609738  0.633165  0.612452
2010-01-01 09:00:08  0.510470  0.534317  0.573201  0.806949

You could also specify a how directly

In [7]: df.resample('2s', how='sum')
Out[7]:
                            A         B         C         D
2010-01-01 09:00:00  0.971495  0.894701  0.714192  1.587231
2010-01-01 09:00:02  1.641602  1.588635  0.728068  1.062191
2010-01-01 09:00:04  0.867969  0.629165  0.848208  1.251465
2010-01-01 09:00:06  1.249976  1.219477  1.266330  1.224904
2010-01-01 09:00:08  1.020940  1.068634  1.146402  1.613897

New API:

Now, you can write .resample(..) as a 2-stage operation like .groupby(...), which yields a Resampler.

In [82]: r = df.resample('2s')

In [83]: r
Out[83]: <pandas.core.resample.DatetimeIndexResampler object at 0x7fe88cbf8f10>

Downsampling#

You can then use this object to perform operations. These are downsampling operations (going from a higher frequency to a lower one).

In [84]: r.mean()
Out[84]:
                            A         B         C         D
2010-01-01 09:00:00  0.485748  0.447351  0.357096  0.793615
2010-01-01 09:00:02  0.820801  0.794317  0.364034  0.531096
2010-01-01 09:00:04  0.433985  0.314582  0.424104  0.625733
2010-01-01 09:00:06  0.624988  0.609738  0.633165  0.612452
2010-01-01 09:00:08  0.510470  0.534317  0.573201  0.806949

[5 rows x 4 columns]

In [85]: r.sum()
Out[85]:
                            A         B         C         D
2010-01-01 09:00:00  0.971495  0.894701  0.714192  1.587231
2010-01-01 09:00:02  1.641602  1.588635  0.728068  1.062191
2010-01-01 09:00:04  0.867969  0.629165  0.848208  1.251465
2010-01-01 09:00:06  1.249976  1.219477  1.266330  1.224904
2010-01-01 09:00:08  1.020940  1.068634  1.146402  1.613897

[5 rows x 4 columns]

Furthermore, resample now supports getitem operations to perform the resample on specific columns.

In [86]: r[['A','C']].mean()
Out[86]:
                            A         C
2010-01-01 09:00:00  0.485748  0.357096
2010-01-01 09:00:02  0.820801  0.364034
2010-01-01 09:00:04  0.433985  0.424104
2010-01-01 09:00:06  0.624988  0.633165
2010-01-01 09:00:08  0.510470  0.573201

[5 rows x 2 columns]

and .aggregate type operations.

In [87]: r.agg({'A' : 'mean', 'B' : 'sum'})
Out[87]:
                            A         B
2010-01-01 09:00:00  0.485748  0.894701
2010-01-01 09:00:02  0.820801  1.588635
2010-01-01 09:00:04  0.433985  0.629165
2010-01-01 09:00:06  0.624988  1.219477
2010-01-01 09:00:08  0.510470  1.068634

[5 rows x 2 columns]

These accessors can, of course, be combined

In [88]: r[['A','B']].agg(['mean','sum'])
Out[88]:
                            A                   B
                         mean       sum      mean       sum
2010-01-01 09:00:00  0.485748  0.971495  0.447351  0.894701
2010-01-01 09:00:02  0.820801  1.641602  0.794317  1.588635
2010-01-01 09:00:04  0.433985  0.867969  0.314582  0.629165
2010-01-01 09:00:06  0.624988  1.249976  0.609738  1.219477
2010-01-01 09:00:08  0.510470  1.020940  0.534317  1.068634

[5 rows x 4 columns]

Upsampling#

Upsampling operations take you from a lower frequency to a higher frequency. These are now performed with the Resampler objects with the backfill(), ffill(), fillna() and asfreq() methods.

In [89]: s = pd.Series(np.arange(5, dtype='int64'),
                       index=pd.date_range('2010-01-01', periods=5, freq='Q'))

In [90]: s
Out[90]:
2010-03-31    0
2010-06-30    1
2010-09-30    2
2010-12-31    3
2011-03-31    4
Freq: Q-DEC, Length: 5, dtype: int64

Previously

In [6]: s.resample('M', fill_method='ffill')
Out[6]:
2010-03-31    0
2010-04-30    0
2010-05-31    0
2010-06-30    1
2010-07-31    1
2010-08-31    1
2010-09-30    2
2010-10-31    2
2010-11-30    2
2010-12-31    3
2011-01-31    3
2011-02-28    3
2011-03-31    4
Freq: M, dtype: int64

New API

In [91]: s.resample('M').ffill()
Out[91]:
2010-03-31    0
2010-04-30    0
2010-05-31    0
2010-06-30    1
2010-07-31    1
2010-08-31    1
2010-09-30    2
2010-10-31    2
2010-11-30    2
2010-12-31    3
2011-01-31    3
2011-02-28    3
2011-03-31    4
Freq: M, Length: 13, dtype: int64

Note

In the new API, you can either downsample OR upsample. The prior implementation allowed you to pass an aggregator function (like mean) even though you were upsampling, which caused some confusion.
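For illustration, a minimal sketch of the two directions (illustrative data, not from the original notes):

import numpy as np
import pandas as pd

s = pd.Series(np.arange(4),
              index=pd.date_range('2010-01-01', periods=4, freq='s'))

s.resample('2s').mean()      # downsampling: aggregate to a lower frequency
s.resample('500ms').ffill()  # upsampling: fill to a higher frequency, no aggregator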

Previous API will work but with deprecations#

Warning

This new API for resample includes some internal changes that make the prior-to-0.18.0 API work with a deprecation warning in most cases, since the resample operation now returns a deferred object. We can intercept operations and just do what the (pre-0.18.0) API did (with a warning). Here is a typical use case:

In [4]: r = df.resample('2s')

In [6]: r*10
pandas/tseries/resample.py:80: FutureWarning: .resample() is now a deferred operation
use .resample(...).mean() instead of .resample(...)

Out[6]:
                            A         B         C         D
2010-01-01 09:00:00  4.857476  4.473507  3.570960  7.936154
2010-01-01 09:00:02  8.208011  7.943173  3.640340  5.310957
2010-01-01 09:00:04  4.339846  3.145823  4.241039  6.257326
2010-01-01 09:00:06  6.249881  6.097384  6.331650  6.124518
2010-01-01 09:00:08  5.104699  5.343172  5.732009  8.069486

However, getting and assignment operations directly on a Resampler will raise a ValueError:

In [7]: r.iloc[0] = 5
ValueError: .resample() is now a deferred operation
use .resample(...).mean() instead of .resample(...)

There are situations where the new API cannot perform all the operations of the original code. This code intended to resample every 2s, take the mean AND then take the min of those results.

In [4]: df.resample('2s').min()
Out[4]:
A    0.433985
B    0.314582
C    0.357096
D    0.531096
dtype: float64

The new API will:

In [89]: df.resample('2s').min()
Out[89]:
                            A         B         C         D
2010-01-01 09:00:00  0.191519  0.272593  0.276464  0.785359
2010-01-01 09:00:02  0.683463  0.712702  0.357817  0.500995
2010-01-01 09:00:04  0.364886  0.013768  0.075381  0.368824
2010-01-01 09:00:06  0.316836  0.568099  0.397203  0.436173
2010-01-01 09:00:08  0.218792  0.143767  0.442141  0.704581

[5 rows x 4 columns]

The good news is that the return dimensions differ between the new API and the old API, so this should loudly raise an exception.

To replicate the original operation

In [90]: df.resample('2s').mean().min()
Out[90]:
A    0.433985
B    0.314582
C    0.357096
D    0.531096
Length: 4, dtype: float64

Changes to eval#

In prior versions, new column assignments in an eval expression resulted in an inplace change to the DataFrame. (GH 9297, GH 8664, GH 10486)

In [91]: df = pd.DataFrame({'a': np.linspace(0, 10, 5), 'b': range(5)})

In [92]: df
Out[92]:
      a  b
0   0.0  0
1   2.5  1
2   5.0  2
3   7.5  3
4  10.0  4

[5 rows x 2 columns]

In [12]: df.eval('c = a + b')
FutureWarning: eval expressions containing an assignment currently default to operating inplace.
This will change in a future version of pandas, use inplace=True to avoid this warning.

In [13]: df
Out[13]:
      a  b     c
0   0.0  0   0.0
1   2.5  1   3.5
2   5.0  2   7.0
3   7.5  3  10.5
4  10.0  4  14.0

In version 0.18.0, a new inplace keyword was added to choose whether the assignment should be done inplace or return a copy.

In [93]: df
Out[93]:
      a  b     c
0   0.0  0   0.0
1   2.5  1   3.5
2   5.0  2   7.0
3   7.5  3  10.5
4  10.0  4  14.0

[5 rows x 3 columns]

In [94]: df.eval('d = c - b', inplace=False)
Out[94]:
      a  b     c     d
0   0.0  0   0.0   0.0
1   2.5  1   3.5   2.5
2   5.0  2   7.0   5.0
3   7.5  3  10.5   7.5
4  10.0  4  14.0  10.0

[5 rows x 4 columns]

In [95]: df
Out[95]:
      a  b     c
0   0.0  0   0.0
1   2.5  1   3.5
2   5.0  2   7.0
3   7.5  3  10.5
4  10.0  4  14.0

[5 rows x 3 columns]

In [96]: df.eval('d = c - b', inplace=True)

In [97]: df
Out[97]:
      a  b     c     d
0   0.0  0   0.0   0.0
1   2.5  1   3.5   2.5
2   5.0  2   7.0   5.0
3   7.5  3  10.5   7.5
4  10.0  4  14.0  10.0

[5 rows x 4 columns]

Warning

For backwards compatibility, inplace defaults to True if not specified. This will change in a future version of pandas. If your code depends on an inplace assignment, you should update to explicitly set inplace=True.

The inplace keyword parameter was also added to the query method.

In [98]: df.query('a > 5')
Out[98]:
      a  b     c     d
3   7.5  3  10.5   7.5
4  10.0  4  14.0  10.0

[2 rows x 4 columns]

In [99]: df.query('a > 5', inplace=True)

In [100]: df
Out[100]:
      a  b     c     d
3   7.5  3  10.5   7.5
4  10.0  4  14.0  10.0

[2 rows x 4 columns]

Warning

Note that the default value for inplace in query is False, which is consistent with prior versions.

eval has also been updated to allow multi-line expressions for multiple assignments. These expressions will be evaluated one at a time in order. Only assignments are valid for multi-line expressions.

In [101]: df
Out[101]:
      a  b     c     d
3   7.5  3  10.5   7.5
4  10.0  4  14.0  10.0

[2 rows x 4 columns]

In [102]: df.eval("""
   .....: e = d + a
   .....: f = e - 22
   .....: g = f / 2.0""", inplace=True)
   .....:

In [103]: df
Out[103]:
      a  b     c     d     e    f    g
3   7.5  3  10.5   7.5  15.0 -7.0 -3.5
4  10.0  4  14.0  10.0  20.0 -2.0 -1.0

[2 rows x 7 columns]

Other API changes#

Deprecations#

The pd.rolling_*, pd.expanding_* and pd.ewm* module-level window functions are deprecated in favor of the corresponding methods, for example:

In [1]: s = pd.Series(range(3))

In [2]: pd.rolling_mean(s, window=2, min_periods=1)
FutureWarning: pd.rolling_mean is deprecated for Series and
               will be removed in a future version, replace with
               Series.rolling(window=2,min_periods=1,center=False).mean()
Out[2]:
0    0.0
1    0.5
2    1.5
dtype: float64

In [3]: pd.rolling_cov(s, s, window=2)
FutureWarning: pd.rolling_cov is deprecated for Series and
               will be removed in a future version, replace with
               Series.rolling(window=2).cov(other=<Series>)
Out[3]:
0    NaN
1    0.5
2    0.5
dtype: float64

Removal of deprecated float indexers#

In GH 4892 indexing with floating point numbers on a non-Float64Index was deprecated (in version 0.14.0). In 0.18.0, this deprecation warning is removed and these will now raise a TypeError. (GH 12165, GH 12333)

In [104]: s = pd.Series([1, 2, 3], index=[4, 5, 6])

In [105]: s
Out[105]:
4    1
5    2
6    3
Length: 3, dtype: int64

In [106]: s2 = pd.Series([1, 2, 3], index=list('abc'))

In [107]: s2
Out[107]:
a    1
b    2
c    3
Length: 3, dtype: int64

Previous behavior:

this is label indexing

In [2]: s[5.0]
FutureWarning: scalar indexers for index type Int64Index should be integers and not floating point
Out[2]: 2

this is positional indexing

In [3]: s.iloc[1.0]
FutureWarning: scalar indexers for index type Int64Index should be integers and not floating point
Out[3]: 2

this is label indexing

In [4]: s.loc[5.0]
FutureWarning: scalar indexers for index type Int64Index should be integers and not floating point
Out[4]: 2

.ix would coerce 1.0 to the positional 1, and index

In [5]: s2.ix[1.0] = 10
FutureWarning: scalar indexers for index type Index should be integers and not floating point

In [6]: s2
Out[6]:
a     1
b    10
c     3
dtype: int64

New behavior:

For iloc, getting & setting via a float scalar will always raise.

In [3]: s.iloc[2.0]
TypeError: cannot do label indexing on <class 'pandas.indexes.numeric.Int64Index'> with these indexers [2.0] of <type 'float'>

Other indexers will coerce to a like integer for both getting and setting. The FutureWarning has been dropped for .loc, .ix and [].

In [108]: s[5.0]
Out[108]: 2

In [109]: s.loc[5.0]
Out[109]: 2

and setting

In [110]: s_copy = s.copy()

In [111]: s_copy[5.0] = 10

In [112]: s_copy
Out[112]:
4     1
5    10
6     3
Length: 3, dtype: int64

In [113]: s_copy = s.copy()

In [114]: s_copy.loc[5.0] = 10

In [115]: s_copy
Out[115]:
4     1
5    10
6     3
Length: 3, dtype: int64

Positional setting with .ix and a float indexer will ADD this value to the index, rather than setting the value by position as it previously did.

In [3]: s2.ix[1.0] = 10

In [4]: s2
Out[4]:
a       1
b       2
c       3
1.0    10
dtype: int64

Slicing will also coerce integer-like floats to integers for a non-Float64Index.

In [116]: s.loc[5.0:6]
Out[116]:
5    2
6    3
Length: 2, dtype: int64

Note that for floats that are NOT coercible to ints, the label based bounds will be excluded

In [117]: s.loc[5.1:6]
Out[117]:
6    3
Length: 1, dtype: int64

Float indexing on a Float64Index is unchanged.

In [118]: s = pd.Series([1, 2, 3], index=np.arange(3.))

In [119]: s[1.0]
Out[119]: 2

In [120]: s[1.0:2.5]
Out[120]:
1.0    2
2.0    3
Length: 2, dtype: int64

Removal of prior version deprecations/changes#

Performance improvements#

Bug fixes#

Contributors#

A total of 101 people contributed patches to this release. People with a “+” by their names contributed a patch for the first time.