Version 0.15.0 (October 18, 2014)

This is a major release from 0.14.1 and includes a small number of API changes, several new features, enhancements, and performance improvements along with a large number of bug fixes. We recommend that all users upgrade to this version.

Warning

pandas >= 0.15.0 will no longer support compatibility with NumPy versions < 1.7.0. If you want to use the latest versions of pandas, please upgrade to NumPy >= 1.7.0 (GH 7711)

Warning

In 0.15.0, Index has been internally refactored to no longer subclass ndarray, but instead subclass PandasObject, similar to the rest of the pandas objects. This change allows very easy sub-classing and creation of new index types. This should be a transparent change with only very limited API implications (see Internal refactoring).

Warning

The refactoring in Categorical changed the two-argument constructor from “codes/labels and levels” to “values and levels (now called ‘categories’)”. This can lead to subtle bugs. If you use Categorical directly, please audit your code before updating to this pandas version and change it to use the from_codes() constructor. See more on Categorical here.
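As a hedged illustration of the migration (hypothetical values, not from the original examples), a codes-based call maps onto the from_codes() constructor like this:

```python
import pandas as pd

# Hypothetical data: the codes 0/1 are positions into the categories list.
cat = pd.Categorical.from_codes(codes=[0, 1, 1, 0], categories=["low", "high"])

# Equivalent values-based construction with the new two-argument form:
cat2 = pd.Categorical(["low", "high", "high", "low"], categories=["low", "high"])

print(list(cat))  # ['low', 'high', 'high', 'low']
```

Both constructions produce the same categorical; only the input representation differs.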

New features#

Categoricals in Series/DataFrame#

Categorical can now be included in Series and DataFrames and has gained new methods to manipulate it. Thanks to Jan Schulz for much of this API/implementation. (GH 3943, GH 5313, GH 5314, GH 7444, GH 7839, GH 7848, GH 7864, GH 7914, GH 7768, GH 8006, GH 3678, GH 8075, GH 8076, GH 8143, GH 8453, GH 8518).

For full docs, see the categorical introduction and the API documentation.

In [1]: df = pd.DataFrame({"id": [1, 2, 3, 4, 5, 6],
   ...:                    "raw_grade": ['a', 'b', 'b', 'a', 'a', 'e']})
   ...:

In [2]: df["grade"] = df["raw_grade"].astype("category")

In [3]: df["grade"]
Out[3]:
0    a
1    b
2    b
3    a
4    a
5    e
Name: grade, Length: 6, dtype: category
Categories (3, object): ['a', 'b', 'e']

Rename the categories

In [4]: df["grade"] = df["grade"].cat.rename_categories(["very good", "good", "very bad"])

Reorder the categories and simultaneously add the missing categories

In [5]: df["grade"] = df["grade"].cat.set_categories(["very bad", "bad",
   ...:                                               "medium", "good", "very good"])
   ...:

In [6]: df["grade"]
Out[6]:
0    very good
1         good
2         good
3    very good
4    very good
5     very bad
Name: grade, Length: 6, dtype: category
Categories (5, object): ['very bad', 'bad', 'medium', 'good', 'very good']

In [7]: df.sort_values("grade")
Out[7]:
   id raw_grade      grade
5   6         e   very bad
1   2         b       good
2   3         b       good
0   1         a  very good
3   4         a  very good
4   5         a  very good

[6 rows x 3 columns]

In [8]: df.groupby("grade", observed=False).size()
Out[8]:
grade
very bad     1
bad          0
medium       0
good         2
very good    3
Length: 5, dtype: int64

TimedeltaIndex/scalar#

We introduce a new scalar type Timedelta, which is a subclass of datetime.timedelta and behaves in a similar manner, but allows compatibility with np.timedelta64 types as well as a host of custom representation, parsing, and attributes. This type is very similar to how Timestamp works for datetimes, providing a convenient API box for the type. See the docs. (GH 3009, GH 4533, GH 8209, GH 8187, GH 8190, GH 7869, GH 7661, GH 8345, GH 8471)

Warning

Timedelta scalars (and TimedeltaIndex) component fields are not the same as the component fields on a datetime.timedelta object. For example, .seconds on a datetime.timedelta object returns the total number of seconds combined between hours, minutes and seconds. In contrast, the pandas Timedelta breaks out hours, minutes, microseconds and nanoseconds separately.
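The datetime.timedelta side of this contrast can be checked directly (a minimal sketch; note the pandas field behavior described above changed again in 0.16.0):

```python
from datetime import timedelta

# .seconds on datetime.timedelta folds hours, minutes and seconds together:
td = timedelta(hours=1, minutes=3)
print(td.seconds)  # 3780 = 1 * 3600 + 3 * 60
```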

Timedelta accessor

In [9]: tds = pd.Timedelta('31 days 5 min 3 sec')

In [10]: tds.minutes
Out[10]: 5L

In [11]: tds.seconds
Out[11]: 3L

datetime.timedelta accessor

this is 5 minutes * 60 + 3 seconds

In [12]: tds.to_pytimedelta().seconds
Out[12]: 303

Note: this is no longer true starting from v0.16.0, where full compatibility with datetime.timedelta is introduced. See the 0.16.0 whatsnew entry.

Warning

Prior to 0.15.0 pd.to_timedelta would return a Series for list-like/Series input, and a np.timedelta64 for scalar input. It will now return a TimedeltaIndex for list-like input, Series for Series input, and Timedelta for scalar input.

The arguments to pd.to_timedelta are now (arg, unit='ns', box=True, coerce=False); previously they were (arg, box=True, unit='ns'), as the new ordering is more logical.
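A small sketch of the per-input return types described above (run on modern pandas, where the box argument has since been removed):

```python
import pandas as pd

scalar = pd.to_timedelta("1 days")                  # -> Timedelta
index = pd.to_timedelta(["1 days", "2 days"])       # -> TimedeltaIndex
series = pd.to_timedelta(pd.Series(["1 days"]))     # -> Series

print(type(scalar).__name__, type(index).__name__, type(series).__name__)
```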

Construct a scalar

In [9]: pd.Timedelta('1 days 06:05:01.00003')
Out[9]: Timedelta('1 days 06:05:01.000030')

In [10]: pd.Timedelta('15.5us')
Out[10]: Timedelta('0 days 00:00:00.000015500')

In [11]: pd.Timedelta('1 hour 15.5us')
Out[11]: Timedelta('0 days 01:00:00.000015500')

negative Timedeltas have this string repr

to be more consistent with datetime.timedelta conventions

In [12]: pd.Timedelta('-1us')
Out[12]: Timedelta('-1 days +23:59:59.999999')

a NaT

In [13]: pd.Timedelta('nan')
Out[13]: NaT

Access fields for a Timedelta

In [14]: td = pd.Timedelta('1 hour 3m 15.5us')

In [15]: td.seconds
Out[15]: 3780

In [16]: td.microseconds
Out[16]: 15

In [17]: td.nanoseconds
Out[17]: 500

Construct a TimedeltaIndex

In [18]: pd.TimedeltaIndex(['1 days', '1 days, 00:00:05',
   ....:                    np.timedelta64(2, 'D'),
   ....:                    datetime.timedelta(days=2, seconds=2)])
   ....:
Out[18]:
TimedeltaIndex(['1 days 00:00:00', '1 days 00:00:05', '2 days 00:00:00',
                '2 days 00:00:02'],
               dtype='timedelta64[ns]', freq=None)

Constructing a TimedeltaIndex with a regular range

In [19]: pd.timedelta_range('1 days', periods=5, freq='D')
Out[19]: TimedeltaIndex(['1 days', '2 days', '3 days', '4 days', '5 days'], dtype='timedelta64[ns]', freq='D')

In [20]: pd.timedelta_range(start='1 days', end='2 days', freq='30T')
Out[20]:
TimedeltaIndex(['1 days 00:00:00', '1 days 00:30:00', '1 days 01:00:00',
                '1 days 01:30:00', '1 days 02:00:00', '1 days 02:30:00',
                '1 days 03:00:00', '1 days 03:30:00', '1 days 04:00:00',
                '1 days 04:30:00', '1 days 05:00:00', '1 days 05:30:00',
                '1 days 06:00:00', '1 days 06:30:00', '1 days 07:00:00',
                '1 days 07:30:00', '1 days 08:00:00', '1 days 08:30:00',
                '1 days 09:00:00', '1 days 09:30:00', '1 days 10:00:00',
                '1 days 10:30:00', '1 days 11:00:00', '1 days 11:30:00',
                '1 days 12:00:00', '1 days 12:30:00', '1 days 13:00:00',
                '1 days 13:30:00', '1 days 14:00:00', '1 days 14:30:00',
                '1 days 15:00:00', '1 days 15:30:00', '1 days 16:00:00',
                '1 days 16:30:00', '1 days 17:00:00', '1 days 17:30:00',
                '1 days 18:00:00', '1 days 18:30:00', '1 days 19:00:00',
                '1 days 19:30:00', '1 days 20:00:00', '1 days 20:30:00',
                '1 days 21:00:00', '1 days 21:30:00', '1 days 22:00:00',
                '1 days 22:30:00', '1 days 23:00:00', '1 days 23:30:00',
                '2 days 00:00:00'],
               dtype='timedelta64[ns]', freq='30T')

You can now use a TimedeltaIndex as the index of a pandas object

In [20]: s = pd.Series(np.arange(5),
   ....:               index=pd.timedelta_range('1 days', periods=5, freq='s'))
   ....:

In [21]: s
Out[21]:
1 days 00:00:00    0
1 days 00:00:01    1
1 days 00:00:02    2
1 days 00:00:03    3
1 days 00:00:04    4
Freq: s, Length: 5, dtype: int64

You can select with partial string selections

In [22]: s['1 day 00:00:02']
Out[22]: 2

In [23]: s['1 day':'1 day 00:00:02']
Out[23]:
1 days 00:00:00    0
1 days 00:00:01    1
1 days 00:00:02    2
Freq: s, Length: 3, dtype: int64

Finally, the combination of TimedeltaIndex with DatetimeIndex allow certain combination operations that are NaT preserving:

In [24]: tdi = pd.TimedeltaIndex(['1 days', pd.NaT, '2 days'])

In [25]: tdi.tolist()
Out[25]: [Timedelta('1 days 00:00:00'), NaT, Timedelta('2 days 00:00:00')]

In [26]: dti = pd.date_range('20130101', periods=3)

In [27]: dti.tolist()
Out[27]: [Timestamp('2013-01-01 00:00:00'), Timestamp('2013-01-02 00:00:00'), Timestamp('2013-01-03 00:00:00')]

In [28]: (dti + tdi).tolist()
Out[28]: [Timestamp('2013-01-02 00:00:00'), NaT, Timestamp('2013-01-05 00:00:00')]

In [29]: (dti - tdi).tolist()
Out[29]: [Timestamp('2012-12-31 00:00:00'), NaT, Timestamp('2013-01-01 00:00:00')]

Memory usage#

Implemented methods to find memory usage of a DataFrame. See the FAQ for more. (GH 6852).

A new display option display.memory_usage (see Options and settings) sets the default behavior of the memory_usage argument in the df.info() method. By default display.memory_usage is True.
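A brief sketch of toggling this option (the option name is as documented above; saving and restoring the previous value avoids leaking state into later code):

```python
import pandas as pd

# Save the current setting, then disable the memory-usage line in df.info()
original = pd.get_option("display.memory_usage")
pd.set_option("display.memory_usage", False)
disabled = pd.get_option("display.memory_usage")

# Restore the previous setting (True by default)
pd.set_option("display.memory_usage", original)
print(disabled, pd.get_option("display.memory_usage"))
```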

In [30]: dtypes = ['int64', 'float64', 'datetime64[ns]', 'timedelta64[ns]',
   ....:           'complex128', 'object', 'bool']
   ....:

In [31]: n = 5000

In [32]: data = {t: np.random.randint(100, size=n).astype(t) for t in dtypes}

In [33]: df = pd.DataFrame(data)

In [34]: df['categorical'] = df['object'].astype('category')

In [35]: df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5000 entries, 0 to 4999
Data columns (total 8 columns):
 #   Column           Non-Null Count  Dtype
---  ------           --------------  -----
 0   int64            5000 non-null   int64
 1   float64          5000 non-null   float64
 2   datetime64[ns]   5000 non-null   datetime64[ns]
 3   timedelta64[ns]  5000 non-null   timedelta64[ns]
 4   complex128       5000 non-null   complex128
 5   object           5000 non-null   object
 6   bool             5000 non-null   bool
 7   categorical      5000 non-null   category
dtypes: bool(1), category(1), complex128(1), datetime64[ns](1), float64(1), int64(1), object(1), timedelta64[ns](1)
memory usage: 288.2+ KB

Additionally, memory_usage() is an available method for a DataFrame object which returns the memory usage of each column.

In [36]: df.memory_usage(index=True)
Out[36]:
Index                 128
int64               40000
float64             40000
datetime64[ns]      40000
timedelta64[ns]     40000
complex128          80000
object              40000
bool                 5000
categorical          9968
Length: 9, dtype: int64

Series.dt accessor#

Series has gained an accessor to succinctly return datetime-like properties for the values of the Series, if it is a datetime/period-like Series (GH 7207). This will return a Series, indexed like the existing Series. See the docs.

datetime

In [37]: s = pd.Series(pd.date_range('20130101 09:10:12', periods=4))

In [38]: s
Out[38]:
0   2013-01-01 09:10:12
1   2013-01-02 09:10:12
2   2013-01-03 09:10:12
3   2013-01-04 09:10:12
Length: 4, dtype: datetime64[ns]

In [39]: s.dt.hour
Out[39]:
0    9
1    9
2    9
3    9
Length: 4, dtype: int32

In [40]: s.dt.second
Out[40]:
0    12
1    12
2    12
3    12
Length: 4, dtype: int32

In [41]: s.dt.day
Out[41]:
0    1
1    2
2    3
3    4
Length: 4, dtype: int32

In [42]: s.dt.freq
Out[42]: 'D'

This enables nice expressions like this:

In [43]: s[s.dt.day == 2]
Out[43]:
1   2013-01-02 09:10:12
Length: 1, dtype: datetime64[ns]

You can easily produce tz aware transformations:

In [44]: stz = s.dt.tz_localize('US/Eastern')

In [45]: stz
Out[45]:
0   2013-01-01 09:10:12-05:00
1   2013-01-02 09:10:12-05:00
2   2013-01-03 09:10:12-05:00
3   2013-01-04 09:10:12-05:00
Length: 4, dtype: datetime64[ns, US/Eastern]

In [46]: stz.dt.tz
Out[46]: <DstTzInfo 'US/Eastern' LMT-1 day, 19:04:00 STD>

You can also chain these types of operations:

In [47]: s.dt.tz_localize('UTC').dt.tz_convert('US/Eastern')
Out[47]:
0   2013-01-01 04:10:12-05:00
1   2013-01-02 04:10:12-05:00
2   2013-01-03 04:10:12-05:00
3   2013-01-04 04:10:12-05:00
Length: 4, dtype: datetime64[ns, US/Eastern]

The .dt accessor works for period and timedelta dtypes.

period

In [48]: s = pd.Series(pd.period_range('20130101', periods=4, freq='D'))

In [49]: s
Out[49]:
0    2013-01-01
1    2013-01-02
2    2013-01-03
3    2013-01-04
Length: 4, dtype: period[D]

In [50]: s.dt.year
Out[50]:
0    2013
1    2013
2    2013
3    2013
Length: 4, dtype: int64

In [51]: s.dt.day
Out[51]:
0    1
1    2
2    3
3    4
Length: 4, dtype: int64

timedelta

In [52]: s = pd.Series(pd.timedelta_range('1 day 00:00:05', periods=4, freq='s'))

In [53]: s
Out[53]:
0   1 days 00:00:05
1   1 days 00:00:06
2   1 days 00:00:07
3   1 days 00:00:08
Length: 4, dtype: timedelta64[ns]

In [54]: s.dt.days
Out[54]:
0    1
1    1
2    1
3    1
Length: 4, dtype: int64

In [55]: s.dt.seconds
Out[55]:
0    5
1    6
2    7
3    8
Length: 4, dtype: int32

In [56]: s.dt.components
Out[56]:
   days  hours  minutes  seconds  milliseconds  microseconds  nanoseconds
0     1      0        0        5             0             0            0
1     1      0        0        6             0             0            0
2     1      0        0        7             0             0            0
3     1      0        0        8             0             0            0

[4 rows x 7 columns]
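Where a single combined duration is wanted instead of the split components, Timedelta and the .dt accessor also expose total_seconds() (a minimal sketch with its own data, not the series above):

```python
import pandas as pd

td = pd.Timedelta("1 day 00:00:05")
print(td.total_seconds())  # 86405.0

s2 = pd.Series(pd.timedelta_range("1 day 00:00:05", periods=2, freq="s"))
print(s2.dt.total_seconds().tolist())  # [86405.0, 86406.0]
```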

Timezone handling improvements#

tz_localize(None) for a tz-aware Timestamp or DatetimeIndex now removes the timezone while keeping the local time; previously this raised an exception:

In [63]: didx.tz_localize(None)
Out[63]:
DatetimeIndex(['2014-08-01 09:00:00', '2014-08-01 10:00:00',
               '2014-08-01 11:00:00', '2014-08-01 12:00:00',
               '2014-08-01 13:00:00', '2014-08-01 14:00:00',
               '2014-08-01 15:00:00', '2014-08-01 16:00:00',
               '2014-08-01 17:00:00', '2014-08-01 18:00:00'],
              dtype='datetime64[ns]', freq=None)
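A small sketch of the difference between dropping a zone in place and converting through UTC first (the timestamp here is illustrative, not from the example above):

```python
import pandas as pd

ts = pd.Timestamp("2014-08-01 09:00", tz="US/Eastern")

local = ts.tz_localize(None)  # drop the zone, keep the local wall time
utc = ts.tz_convert(None)     # convert to UTC first, then drop the zone

print(local)  # 2014-08-01 09:00:00
print(utc)    # 2014-08-01 13:00:00  (EDT is UTC-4 in August)
```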

Rolling/expanding moments improvements#

1    3
2    6
3    NaN
dtype: float64

New behavior (note the final value is 5 = sum([2, 3, NaN])):

In [7]: pd.rolling_sum(pd.Series(range(4)), window=3,
   ....:               min_periods=0, center=True)
   ....:
Out[7]:
0    1
1    3
2    6
3    5
dtype: float64

1    6.583333
2    6.883333
3    6.683333
4         NaN
dtype: float64

New behavior:

In [10]: pd.rolling_window(s, window=3, win_type='triang', center=True)
Out[10]:
0       NaN
1     9.875
2    10.325
3    10.025
4       NaN
dtype: float64

1         NaN
2    1.000000
3    1.000000
4    1.571429
5    2.189189
dtype: float64

New behavior (note that values start at index 4, the location of the 2nd non-NaN value, since min_periods=2):

In [2]: pd.ewma(s, com=3., min_periods=2)
Out[2]:
0         NaN
1         NaN
2         NaN
3         NaN
4    1.759644
5    2.383784
dtype: float64

1    0.500000
2    1.210526
3    4.089069
dtype: float64

In [15]: pd.ewmvar(s, com=2., bias=False) / pd.ewmvar(s, com=2., bias=True)
Out[15]:
0         NaN
1    2.083333
2    1.583333
3    1.425439
dtype: float64

See Exponentially weighted moment functions for details. (GH 7912)
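The module-level functions shown above (pd.ewma, pd.ewmvar, pd.rolling_sum, ...) were removed in later pandas releases; a hedged sketch of the equivalent method-based API on current versions:

```python
import pandas as pd

s = pd.Series([1.0, 2.0, 3.0, 4.0])

# Modern equivalent of the old pd.ewma(s, com=3., min_periods=2)
result = s.ewm(com=3.0, min_periods=2).mean()

# min_periods=2 masks the first value; the rest are populated
print(result.isna().tolist())  # [True, False, False, False]
```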

Improvements in the SQL IO module#

Backwards incompatible API changes#

Breaking changes#

API changes related to Categorical (see here for more details):

API changes related to the introduction of the Timedelta scalar (see above for more details):

For API changes related to the rolling and expanding functions, see detailed overview above.

Other notable API changes:

2  7 NaN
3 11 NaN

Furthermore, .loc will raise a KeyError if no values are found in a MultiIndex with a list-like indexer:

In [63]: s = pd.Series(np.arange(3, dtype='int64'),
   ....:               index=pd.MultiIndex.from_product([['A'],
   ....:                                                 ['foo', 'bar', 'baz']],
   ....:                                                names=['one', 'two'])
   ....:               ).sort_index()
   ....:

In [64]: s
Out[64]:
one  two
A    bar    1
     baz    2
     foo    0
Length: 3, dtype: int64

In [65]: try:
   ....:     s.loc[['D']]
   ....: except KeyError as e:
   ....:     print("KeyError: " + str(e))
   ....:
KeyError: "['D'] not in index"

2 c
Length: 3, dtype: object
To insert a NaN, you must explicitly use np.nan. See the docs.
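A minimal sketch of inserting a missing value explicitly with np.nan, as the text above describes (illustrative data, not the series from the excerpt):

```python
import numpy as np
import pandas as pd

s = pd.Series(["a", "b", "c"])
s.loc[1] = np.nan  # explicitly insert a missing value with np.nan

print(s.isna().tolist())  # [False, True, False]
```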

the original object

In [5]: s
Out[5]:
0 2.5
1 3.5
2 4.5
dtype: float64

a reference to the original object

In [7]: s2
Out[7]:
0 1
1 2
2 3
dtype: int64
This is now the correct behavior

the original object

In [75]: s
Out[75]:
0 2.5
1 3.5
2 4.5
Length: 3, dtype: float64

a reference to the original object

In [76]: s2
Out[76]:
0 2.5
1 3.5
2 4.5
Length: 3, dtype: float64

In [79]: df = pd.DataFrame({'a': i})
In [80]: df
Out[80]:
a
0 2011-01-01 00:00:00-05:00
1 2011-01-01 00:00:10-05:00
2 2011-01-01 00:00:20-05:00
[3 rows x 1 columns]
In [81]: df.dtypes
Out[81]:
a datetime64[ns, US/Eastern]
Length: 1, dtype: object
Previously this would have yielded a column of datetime64 dtype, but without timezone info.
The behaviour of assigning a column to an existing DataFrame as df['a'] = i remains unchanged (this already returned an object column with a timezone).
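A sketch of the constructor behavior described above, reconstructing a plausible `i` (its definition is not in the excerpt, so the range below is an assumption):

```python
import pandas as pd

# A plausible stand-in for the tz-aware index `i` used above
i = pd.date_range("2011-01-01", periods=3, freq="10s", tz="US/Eastern")

df = pd.DataFrame({"a": i})  # the timezone is preserved in the column dtype
print(df.dtypes["a"])  # datetime64[ns, US/Eastern]
```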

dtypes are now preserved

In [85]: df.loc[2] = df.loc[1]
In [86]: df
Out[86]:
female fitness
0 True 1
1 False 2
2 False 2
[3 rows x 2 columns]
In [87]: df.dtypes
Out[87]:
female bool
fitness int64
Length: 2, dtype: object

Internal refactoring#

In 0.15.0, Index has been internally refactored to no longer subclass ndarray, but instead subclass PandasObject, similar to the rest of the pandas objects. This change allows very easy sub-classing and creation of new index types. This should be a transparent change with only very limited API implications (GH 5080, GH 7439, GH 7796, GH 8024, GH 8367, GH 7997, GH 8522):

Deprecations#

+

pd.Index(['a', 'b', 'c']) + pd.Index(['b', 'c', 'd'])

should be replaced by

pd.Index(['a', 'b', 'c']).union(pd.Index(['b', 'c', 'd']))

-

pd.Index(['a', 'b', 'c']) - pd.Index(['b', 'c', 'd'])

should be replaced by

pd.Index(['a', 'b', 'c']).difference(pd.Index(['b', 'c', 'd']))
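The replacement set operations in practice (a short sketch using the same example indexes):

```python
import pandas as pd

left = pd.Index(['a', 'b', 'c'])
right = pd.Index(['b', 'c', 'd'])

print(left.union(right).tolist())       # ['a', 'b', 'c', 'd']
print(left.difference(right).tolist())  # ['a']
```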

Removal of prior version deprecations/changes#

Enhancements#

Enhancements in the importing/exporting of Stata files:

Enhancements in the plotting functions:

Other:

count    24  24
unique    2   4
top     foo   a
freq     16   6

[4 rows x 2 columns]
In [90]: df.describe(include=["number", "object"], exclude=["float"])
Out[90]:
catA catB numC
count 24 24 24.000000
unique 2 4 NaN
top foo a NaN
freq 16 6 NaN
mean NaN NaN 11.500000
std NaN NaN 7.071068
min NaN NaN 0.000000
25% NaN NaN 5.750000
50% NaN NaN 11.500000
75% NaN NaN 17.250000
max NaN NaN 23.000000
[11 rows x 3 columns]
Requesting all columns is possible with the shorthand ‘all’:
In [91]: df.describe(include='all')
Out[91]:
catA catB numC numD
count 24 24 24.000000 24.000000
unique 2 4 NaN NaN
top foo a NaN NaN
freq 16 6 NaN NaN
mean NaN NaN 11.500000 12.000000
std NaN NaN 7.071068 7.071068
min NaN NaN 0.000000 0.500000
25% NaN NaN 5.750000 6.250000
50% NaN NaN 11.500000 12.000000
75% NaN NaN 17.250000 17.750000
max NaN NaN 23.000000 23.500000
[11 rows x 4 columns]
Without those arguments, describe will behave as before, including only numerical columns or, if there are none, only categorical columns. See also the docs.
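A self-contained sketch of the include filter on a small hypothetical frame:

```python
import pandas as pd

df = pd.DataFrame({"cat": ["foo", "foo", "bar"], "num": [1.0, 2.0, 3.0]})

# Restrict describe() to the object-dtype columns only
obj_only = df.describe(include=["object"])
print(obj_only.columns.tolist())  # ['cat']
print(obj_only.index.tolist())    # ['count', 'unique', 'top', 'freq']
```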

get the first, 4th, and last date index for each month

In [96]: df.groupby([df.index.year, df.index.month]).nth([0, 3, -1])
Out[96]:
a b
2014-04-01 1 1
2014-04-04 1 1
2014-04-30 1 1
2014-05-01 1 1
2014-05-06 1 1
2014-05-30 1 1
2014-06-02 1 1
2014-06-05 1 1
2014-06-30 1 1
[9 rows x 2 columns]
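A small sketch of nth with a list argument on hypothetical data (note that in pandas >= 2.0 nth returns the selected rows with their original index):

```python
import pandas as pd

df2 = pd.DataFrame({"k": [1, 1, 1, 2, 2], "v": [10, 20, 30, 40, 50]})

# First and last row of each group
first_last = df2.groupby("k").nth([0, -1])
print(first_last["v"].tolist())  # [10, 30, 40, 50]
```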

In [106]: idx + pd.offsets.Hour(2)
Out[106]:
PeriodIndex(['2014-07-01 11:00', '2014-07-01 12:00', '2014-07-01 13:00',
'2014-07-01 14:00', '2014-07-01 15:00'],
dtype='period[H]')
In [107]: idx + pd.Timedelta('120m')
Out[107]:
PeriodIndex(['2014-07-01 11:00', '2014-07-01 12:00', '2014-07-01 13:00',
'2014-07-01 14:00', '2014-07-01 15:00'],
dtype='period[H]')
In [108]: idx = pd.period_range('2014-07', periods=5, freq='M')
In [109]: idx
Out[109]: PeriodIndex(['2014-07', '2014-08', '2014-09', '2014-10', '2014-11'], dtype='period[M]')
In [110]: idx + pd.offsets.MonthEnd(3)
Out[110]: PeriodIndex(['2014-10', '2014-11', '2014-12', '2015-01', '2015-02'], dtype='period[M]')
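The same Period arithmetic can be checked on current pandas (a minimal sketch mirroring the monthly example above):

```python
import pandas as pd

idx2 = pd.period_range("2014-07", periods=3, freq="M")

# Adding a compatible offset shifts each Period by n
shifted = idx2 + pd.offsets.MonthEnd(3)
print([str(p) for p in shifted])  # ['2014-10', '2014-11', '2014-12']
```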

In [99]: idx.set_names(['qux', 'corge'], level=[0, 1])
Out[99]:
MultiIndex([('a', 0, 'p'),
('a', 0, 'q'),
('a', 0, 'r'),
('a', 1, 'p'),
('a', 1, 'q'),
('a', 1, 'r'),
('a', 2, 'p'),
('a', 2, 'q'),
('a', 2, 'r')],
names=['qux', 'corge', 'baz'])
In [100]: idx.set_levels(['a', 'b', 'c'], level='bar')
Out[100]:
MultiIndex([('a', 'a', 'p'),
('a', 'a', 'q'),
('a', 'a', 'r'),
('a', 'b', 'p'),
('a', 'b', 'q'),
('a', 'b', 'r'),
('a', 'c', 'p'),
('a', 'c', 'q'),
('a', 'c', 'r')],
names=['foo', 'bar', 'baz'])
In [101]: idx.set_levels([['a', 'b', 'c'], [1, 2, 3]], level=[1, 2])
Out[101]:
MultiIndex([('a', 'a', 1),
('a', 'a', 2),
('a', 'a', 3),
('a', 'b', 1),
('a', 'b', 2),
('a', 'b', 3),
('a', 'c', 1),
('a', 'c', 2),
('a', 'c', 3)],
names=['foo', 'bar', 'baz'])
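A compact sketch of the level argument on a small hypothetical MultiIndex:

```python
import pandas as pd

mi = pd.MultiIndex.from_product([["a"], [1, 2]], names=["x", "y"])

# Rename only the second level, leaving the first untouched
renamed = mi.set_names("z", level=1)
print(list(renamed.names))  # ['x', 'z']
```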

Performance#

Bug fixes#

Contributors#

A total of 80 people contributed patches to this release. People with a “+” by their names contributed a patch for the first time.