Version 0.13.0 (January 3, 2014)

This is a major release from 0.12.0 and includes a number of API changes, several new features and enhancements along with a large number of bug fixes.

Highlights include:

Several experimental features are added, including:

There are several new or updated docs sections, including:

Warning

In 0.13.0 Series has internally been refactored to no longer subclass ndarray but instead subclass NDFrame, similar to the rest of the pandas containers. This should be a transparent change with only very limited API implications. See Internal refactoring.

API changes#

previously, you would have set levels or labels directly

index.levels = [[1, 2, 3, 4], [1, 2, 4, 4]]

now, you use the set_levels or set_labels methods

index = index.set_levels([[1, 2, 3, 4], [1, 2, 4, 4]])

similarly, for names, you can rename the object

but setting names directly is not deprecated

index = index.set_names(["bob", "cranberry"])

and all methods take an inplace kwarg, but then return None

index.set_names(["bob", "cranberry"], inplace=True)
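The immutable-setter pattern above can be sketched end-to-end on a small MultiIndex; the level values and names here are illustrative, not from the release notes:

```python
import pandas as pd

# Build a small MultiIndex; modify it only via the set_* methods.
index = pd.MultiIndex.from_product([["a", "b"], [1, 2]])

# set_levels returns a new index rather than mutating in place
index = index.set_levels([["x", "y"], [10, 20]])

# set_names likewise returns a new, renamed index
index = index.set_names(["letter", "number"])

print(list(index.names))  # ['letter', 'number']
```

Because each setter returns a new object, the calls chain naturally, e.g. `idx.set_levels(...).set_names(...)`.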

Prior version deprecations/changes#

These were changes announced in 0.12 or prior that are taking effect as of 0.13.0.

Deprecations#

Deprecated in 0.13.0

Indexing API changes#

Prior to 0.13, it was impossible to use a label indexer (.loc/.ix) to set a value that was not contained in the index of a particular axis. (GH 2578). See the docs

In the Series case this is effectively an appending operation

In [6]: s = pd.Series([1, 2, 3])

In [7]: s
Out[7]:
0    1
1    2
2    3
dtype: int64

In [8]: s[5] = 5.

In [9]: s
Out[9]:
0    1.0
1    2.0
2    3.0
5    5.0
dtype: float64

In [10]: dfi = pd.DataFrame(np.arange(6).reshape(3, 2),
   ....:                    columns=['A', 'B'])
   ....:

In [11]: dfi
Out[11]:
   A  B
0  0  1
1  2  3
2  4  5

This would previously raise a KeyError

In [12]: dfi.loc[:, 'C'] = dfi.loc[:, 'A']

In [13]: dfi
Out[13]:
   A  B  C
0  0  1  0
1  2  3  2
2  4  5  4

This is like an append operation.

In [14]: dfi.loc[3] = 5

In [15]: dfi
Out[15]:
   A  B  C
0  0  1  0
1  2  3  2
2  4  5  4
3  5  5  5
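The Series and DataFrame enlargement behavior above can be reproduced with .loc in one short, runnable sketch:

```python
import numpy as np
import pandas as pd

s = pd.Series([1, 2, 3])
s.loc[5] = 5.0          # label 5 is not in the index: the Series is enlarged

df = pd.DataFrame(np.arange(6).reshape(3, 2), columns=["A", "B"])
df.loc[:, "C"] = df.loc[:, "A"]   # new column, previously a KeyError
df.loc[3] = 5                     # new row, appended by label
```

Note that enlarging the integer Series with a float value upcasts it to float64, as shown in the Out[9] block above.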

A Panel setting operation on an arbitrary axis aligns the input to the Panel

In [20]: p = pd.Panel(np.arange(16).reshape(2, 4, 2),
   ....:              items=['Item1', 'Item2'],
   ....:              major_axis=pd.date_range('2001/1/12', periods=4),
   ....:              minor_axis=['A', 'B'], dtype='float64')
   ....:

In [21]: p
Out[21]:
<class 'pandas.core.panel.Panel'>
Dimensions: 2 (items) x 4 (major_axis) x 2 (minor_axis)
Items axis: Item1 to Item2
Major_axis axis: 2001-01-12 00:00:00 to 2001-01-15 00:00:00
Minor_axis axis: A to B

In [22]: p.loc[:, :, 'C'] = pd.Series([30, 32], index=p.items)

In [23]: p
Out[23]:
<class 'pandas.core.panel.Panel'>
Dimensions: 2 (items) x 4 (major_axis) x 3 (minor_axis)
Items axis: Item1 to Item2
Major_axis axis: 2001-01-12 00:00:00 to 2001-01-15 00:00:00
Minor_axis axis: A to C

In [24]: p.loc[:, :, 'C']
Out[24]:
            Item1  Item2
2001-01-12   30.0   32.0
2001-01-13   30.0   32.0
2001-01-14   30.0   32.0
2001-01-15   30.0   32.0

Float64Index API change#

HDFStore API changes#

2013-01-05 -0.424972 0.567020
2013-01-06 -0.673690 0.113648
2013-01-07 0.404705 0.577046
2013-01-08 -0.370647 -1.157892
2013-01-09 1.075770 -0.109050
2013-01-10 0.357021 -0.674600
Use an inline column reference
In [32]: pd.read_hdf(path, 'dfq',
....: where="A>0 or C>0")
....:
Out[32]:
A B C D
2013-01-01 0.469112 -0.282863 -1.509059 -1.135632
2013-01-02 1.212112 -0.173215 0.119209 -1.044236
2013-01-04 0.721555 -0.706771 -1.039575 0.271860
2013-01-05 -0.424972 0.567020 0.276232 -1.087401
2013-01-07 0.404705 0.577046 -1.715002 -1.039268
2013-01-09 1.075770 -0.109050 1.643563 -1.469388
2013-01-10 0.357021 -0.674600 -1.776904 -0.968914

DataFrame repr changes#

The HTML and plain text representations of DataFrame now show a truncated view of the table once it exceeds a certain size, rather than switching to the short info view (GH 4886, GH 5550). This makes the representation more consistent as small DataFrames get larger.

Truncated HTML representation of a DataFrame

To get the info view, call DataFrame.info(). If you prefer the info view as the repr for large DataFrames, you can set this by running set_option('display.large_repr', 'info').
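A minimal sketch of toggling the option and then restoring the default truncated view:

```python
import pandas as pd

# Prefer the info view as the repr for large DataFrames...
pd.set_option("display.large_repr", "info")
mode = pd.get_option("display.large_repr")

# ...and restore the default ("truncate") when done
pd.reset_option("display.large_repr")
```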

Enhancements#

previously, nan was erroneously counted as 2 here

now it is not counted at all

In [51]: pd.get_dummies([1, 2, np.nan])
Out[51]:
1.0 2.0
0 True False
1 False True
2 False False

unless requested

In [52]: pd.get_dummies([1, 2, np.nan], dummy_na=True)
Out[52]:
1.0 2.0 NaN
0 True False False
1 False True False
2 False False True
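The NaN handling shown above can be checked directly; a small runnable sketch:

```python
import numpy as np
import pandas as pd

# NaN rows get all-False indicators by default...
default = pd.get_dummies([1, 2, np.nan])

# ...and a dedicated NaN column only when dummy_na=True
with_na = pd.get_dummies([1, 2, np.nan], dummy_na=True)

print(default.shape[1], with_na.shape[1])  # 2 3
```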

In [57]: pd.to_timedelta(np.arange(5), unit='d')
Out[57]: TimedeltaIndex(['0 days', '1 days', '2 days', '3 days', '4 days'], dtype='timedelta64[ns]', freq=None)
A Series of dtype timedelta64[ns] can now be divided by another timedelta64[ns] object, or astyped, to yield a float64 dtyped Series. This is frequency conversion. See the docs for more details.
In [58]: import datetime
In [59]: td = pd.Series(pd.date_range('20130101', periods=4)) - pd.Series(
....: pd.date_range('20121201', periods=4))
....:
In [60]: td[2] += np.timedelta64(datetime.timedelta(minutes=5, seconds=3))
In [61]: td[3] = np.nan
In [62]: td
Out[62]:
0 31 days 00:00:00
1 31 days 00:00:00
2 31 days 00:05:03
3 NaT
dtype: timedelta64[ns]

to days

In [63]: td / np.timedelta64(1, 'D')
Out[63]:
0 31.000000
1 31.000000
2 31.003507
3 NaN
dtype: float64
In [64]: td.astype('timedelta64[D]')
Out[64]:
0 31.0
1 31.0
2 31.0
3 NaN
dtype: float64

to seconds

In [65]: td / np.timedelta64(1, 's')
Out[65]:
0 2678400.0
1 2678400.0
2 2678703.0
3 NaN
dtype: float64
In [66]: td.astype('timedelta64[s]')
Out[66]:
0 2678400.0
1 2678400.0
2 2678703.0
3 NaN
dtype: float64
Dividing or multiplying a timedelta64[ns] Series by an integer or integer Series
In [63]: td * -1
Out[63]:
0 -31 days +00:00:00
1 -31 days +00:00:00
2 -32 days +23:54:57
3 NaT
dtype: timedelta64[ns]
In [64]: td * pd.Series([1, 2, 3, 4])
Out[64]:
0 31 days 00:00:00
1 62 days 00:00:00
2 93 days 00:15:09
3 NaT
dtype: timedelta64[ns]
Absolute DateOffset objects can act equivalently to timedeltas
In [65]: from pandas import offsets
In [66]: td + offsets.Minute(5) + offsets.Milli(5)
Out[66]:
0 31 days 00:05:00.005000
1 31 days 00:05:00.005000
2 31 days 00:10:03.005000
3 NaT
dtype: timedelta64[ns]
Fillna is now supported for timedeltas
In [67]: td.fillna(pd.Timedelta(0))
Out[67]:
0 31 days 00:00:00
1 31 days 00:00:00
2 31 days 00:05:03
3 0 days 00:00:00
dtype: timedelta64[ns]
In [68]: td.fillna(datetime.timedelta(days=1, seconds=5))
Out[68]:
0 31 days 00:00:00
1 31 days 00:00:00
2 31 days 00:05:03
3 1 days 00:00:05
dtype: timedelta64[ns]
You can do numeric reduction operations on timedeltas.
In [69]: td.mean()
Out[69]: Timedelta('31 days 00:01:41')
In [70]: td.quantile(.1)
Out[70]: Timedelta('31 days 00:00:00')
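The conversions and reductions above combine into one runnable sketch; here the frequency conversion is done by dividing by a unit np.timedelta64 (the two-element Series is illustrative):

```python
import numpy as np
import pandas as pd

td = pd.Series(pd.to_timedelta(["31 days", "31 days 00:05:03"]))

# frequency conversion: divide by a unit timedelta to get float64
days = td / np.timedelta64(1, "D")
seconds = td / np.timedelta64(1, "s")

# numeric reductions work directly on the timedelta Series
mean = td.mean()
```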

1 b 2
2 NaN NaN
and optional groups can also be used.
In [74]: pd.Series(['a1', 'b2', '3']).str.extract(
....: '(?P<letter>[ab])?(?P<digit>\d)')
....:
Out[74]:
letter digit
0 a 1
1 b 2
2 NaN 3
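A runnable version of the optional-group extraction, using a raw string for the pattern:

```python
import pandas as pd

s = pd.Series(["a1", "b2", "3"])

# the (?P<letter>...) group is optional, so "3" still matches on digit
result = s.str.extract(r"(?P<letter>[ab])?(?P<digit>\d)")
print(result)
```

With named groups, extract uses the group names as column names, and a non-matching optional group yields NaN rather than dropping the row.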

or with frequency as offset
In [75]: pd.date_range('2013-01-01', periods=5, freq=pd.offsets.Nano(5))
Out[75]:
DatetimeIndex([ '2013-01-01 00:00:00',
'2013-01-01 00:00:00.000000005',
'2013-01-01 00:00:00.000000010',
'2013-01-01 00:00:00.000000015',
'2013-01-01 00:00:00.000000020'],
dtype='datetime64[ns]', freq='5ns')
Timestamps can be modified in the nanosecond range
In [76]: t = pd.Timestamp('20130101 09:01:02')
In [77]: t + pd.tseries.offsets.Nano(123)
Out[77]: Timestamp('2013-01-01 09:01:02.000000123')
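Both nanosecond features above fit in one short sketch:

```python
import pandas as pd

# nanosecond-resolution date_range via the Nano offset
idx = pd.date_range("2013-01-01", periods=5, freq=pd.offsets.Nano(5))

# shifting a Timestamp by nanoseconds
t = pd.Timestamp("20130101 09:01:02")
t2 = t + pd.tseries.offsets.Nano(123)  # Timestamp('2013-01-01 09:01:02.000000123')
```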

0 True False
1 False False
2 True True
3 False False
In [83]: dfi[mask.any(axis=1)]
Out[83]:
A B
0 1 a
2 3 f

note that pandas.rpy was deprecated in v0.16.0

import pandas.rpy.common as com
com.load_data('Titanic')

1 3.0
2 5.0
3 7.0
4 NaN
5 11.0
dtype: float64

1 b e 1.2 1.3 0.997345 1
2 c f 0.7 0.1 0.282978 2
In [92]: pd.wide_to_long(df, ["A", "B"], i="id", j="year")
Out[92]:
X A B
id year
0 1970 -1.085631 a 2.5
1 1970 0.997345 b 1.2
2 1970 0.282978 c 0.7
0 1980 -1.085631 d 3.2
1 1980 0.997345 e 1.3
2 1980 0.282978 f 0.1
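The reshape above can be reproduced end-to-end; this sketch rebuilds the wide frame implied by the output (the X values are taken from that output, and the id column is assumed to be the row number):

```python
import pandas as pd

df = pd.DataFrame({
    "A1970": ["a", "b", "c"],
    "A1980": ["d", "e", "f"],
    "B1970": [2.5, 1.2, 0.7],
    "B1980": [3.2, 1.3, 0.1],
    "X": [-1.085631, 0.997345, 0.282978],
})
df["id"] = df.index

# stub names "A"/"B"; the trailing year in each column name
# becomes the new "year" index level
long_df = pd.wide_to_long(df, ["A", "B"], i="id", j="year")
print(long_df)
```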

Experimental#

eval with NumExpr backend

In [95]: %timeit pd.eval('df1 + df2 + df3 + df4')
6.88 ms +- 275 us per loop (mean +- std. dev. of 7 runs, 100 loops each)

pure Python evaluation

In [96]: %timeit df1 + df2 + df3 + df4
6.5 ms +- 286 us per loop (mean +- std. dev. of 7 runs, 100 loops each)
For more details, see the docs
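A minimal sketch of eval on a few random frames; the timings above will vary by machine and by whether numexpr is installed (pd.eval falls back to plain Python evaluation without it):

```python
import numpy as np
import pandas as pd

# four illustrative frames of standard-normal data
rng = np.random.default_rng(0)
df1, df2, df3, df4 = (pd.DataFrame(rng.standard_normal((1000, 10)))
                      for _ in range(4))

# evaluated with the numexpr backend when available
result = pd.eval("df1 + df2 + df3 + df4")
```

The speedup from numexpr generally only appears for frames much larger than this.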

A query to select the average monthly temperatures in the year 2000 across the USA. The dataset, publicdata:samples.gsod, is available on all BigQuery accounts, and is based on NOAA gsod data.

query = """SELECT station_number as STATION,
month as MONTH, AVG(mean_temp) as MEAN_TEMP
FROM publicdata:samples.gsod
WHERE YEAR = 2000
GROUP BY STATION, MONTH
ORDER BY STATION, MONTH ASC"""

Fetch the result set for this query

Your Google BigQuery Project ID

To find this, see your dashboard:

https://console.developers.google.com/iam-admin/projects?authuser=0

projectid = 'xxxxxxxxx'
df = gbq.read_gbq(query, project_id=projectid)

Use pandas to process and reshape the dataset

df2 = df.pivot(index='STATION', columns='MONTH', values='MEAN_TEMP')
df3 = pd.concat([df2.min(), df2.mean(), df2.max()],
                axis=1, keys=["Min Temp", "Mean Temp", "Max Temp"])
The resulting DataFrame is:

df3
        Min Temp  Mean Temp    Max Temp
MONTH
1 -53.336667 39.827892 89.770968
2 -49.837500 43.685219 93.437932
3 -77.926087 48.708355 96.099998
4 -82.892858 55.070087 97.317240
5 -92.378261 61.428117 102.042856
6 -77.703334 65.858888 102.900000
7 -87.821428 68.169663 106.510714
8 -89.431999 68.614215 105.500000
9 -86.611112 63.436935 107.142856
10 -78.209677 56.880838 92.103333
11 -50.125000 48.861228 94.996428
12 -50.332258 42.286879 94.396774
Warning
To use this module, you will need a BigQuery account. See <https://cloud.google.com/products/big-query> for details.
As of 10/10/13, there is a bug in Google’s API preventing result sets from being larger than 100,000 rows. A patch is scheduled for the week of 10/14/13.

Internal refactoring#

In 0.13.0 there is a major refactor primarily to subclass Series from NDFrame, which is the base class currently for DataFrame and Panel, to unify methods and behaviors. Series formerly subclassed directly from ndarray. (GH 4080, GH 3862, GH 816)

Warning

There are two potential incompatibilities from < 0.13.0

Bug fixes#

Contributors#

A total of 77 people contributed patches to this release. People with a “+” by their names contributed a patch for the first time.