Version 0.10.0 (December 17, 2012)

This is a major release from 0.9.1 and includes many new features and enhancements along with a large number of bug fixes. There are also a number of important API changes that long-time pandas users should pay close attention to.

File parsing new features#

The delimited file parsing engine (the guts of read_csv and read_table) has been rewritten from the ground up. It now uses a fraction of the memory while parsing and is 40% or more faster in most use cases (in some cases much faster).

There are also many new features:

API changes#

Deprecated DataFrame BINOP TimeSeries special case behavior

The default behavior of binary operations between a DataFrame and a Series has always been to align on the DataFrame's columns and broadcast down the rows, except in the special case that the DataFrame contains time series. Since there is now a method for each binary operator that lets you specify how you want to broadcast, we are phasing out this special case (Zen of Python: Special cases aren't special enough to break the rules). Here's what I'm talking about:

In [1]: import pandas as pd

In [2]: df = pd.DataFrame(np.random.randn(6, 4), index=pd.date_range("1/1/2000", periods=6))

In [3]: df
Out[3]:
                   0         1         2         3
2000-01-01  0.469112 -0.282863 -1.509059 -1.135632
2000-01-02  1.212112 -0.173215  0.119209 -1.044236
2000-01-03 -0.861849 -2.104569 -0.494929  1.071804
2000-01-04  0.721555 -0.706771 -1.039575  0.271860
2000-01-05 -0.424972  0.567020  0.276232 -1.087401
2000-01-06 -0.673690  0.113648 -1.478427  0.524988

deprecated now

In [4]: df - df[0]
Out[4]:
             0   1  ...  2000-01-05 00:00:00  2000-01-06 00:00:00
2000-01-01 NaN NaN  ...                  NaN                  NaN
2000-01-02 NaN NaN  ...                  NaN                  NaN
2000-01-03 NaN NaN  ...                  NaN                  NaN
2000-01-04 NaN NaN  ...                  NaN                  NaN
2000-01-05 NaN NaN  ...                  NaN                  NaN
2000-01-06 NaN NaN  ...                  NaN                  NaN

[6 rows x 10 columns]

Change your code to

In [5]: df.sub(df[0], axis=0)  # align on axis 0 (rows)
Out[5]:
              0         1         2         3
2000-01-01  0.0 -0.751976 -1.978171 -1.604745
2000-01-02  0.0 -1.385327 -1.092903 -2.256348
2000-01-03  0.0 -1.242720  0.366920  1.933653
2000-01-04  0.0 -1.428326 -1.761130 -0.449695
2000-01-05  0.0  0.991993  0.701204 -0.662428
2000-01-06  0.0  0.787338 -0.804737  1.198677

You will get a deprecation warning in the 0.10.x series, and the deprecated functionality will be removed in 0.11 or later.
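The two broadcast directions can be compared side by side. The following is a minimal sketch, assuming a recent pandas and NumPy; the seed and the names by_rows, by_columns and first_row are illustrative and do not come from the release notes.

```python
import numpy as np
import pandas as pd

# Illustrative data: 6 daily rows, 4 integer-labelled columns.
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.standard_normal((6, 4)),
                  index=pd.date_range("2000-01-01", periods=6))

# Match the Series on the row index and broadcast across the columns.
by_rows = df.sub(df[0], axis=0)

# Match a Series on the column labels and broadcast down the rows,
# which is also what the plain `-` operator does.
first_row = df.iloc[0]
by_columns = df.sub(first_row, axis="columns")

print(by_rows[0].eq(0).all())          # column 0 minus itself
print(by_columns.iloc[0].eq(0).all())  # row 0 minus itself
```

Spelling out the axis keeps the intent visible at the call site, so the result does not depend on what kind of index the frame happens to have.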

Altered resample default behavior

The default time series resample binning behavior of daily (D) and higher frequencies has been changed to closed='left', label='left'. Lower frequencies are unaffected. The prior defaults were causing a great deal of confusion for users, especially when resampling data to daily frequency (which labeled the aggregated group with the end of the interval: the next day).

In [1]: dates = pd.date_range('1/1/2000', '1/5/2000', freq='4h')

In [2]: series = pd.Series(np.arange(len(dates)), index=dates)

In [3]: series
Out[3]:
2000-01-01 00:00:00     0
2000-01-01 04:00:00     1
2000-01-01 08:00:00     2
2000-01-01 12:00:00     3
2000-01-01 16:00:00     4
2000-01-01 20:00:00     5
2000-01-02 00:00:00     6
2000-01-02 04:00:00     7
2000-01-02 08:00:00     8
2000-01-02 12:00:00     9
2000-01-02 16:00:00    10
2000-01-02 20:00:00    11
2000-01-03 00:00:00    12
2000-01-03 04:00:00    13
2000-01-03 08:00:00    14
2000-01-03 12:00:00    15
2000-01-03 16:00:00    16
2000-01-03 20:00:00    17
2000-01-04 00:00:00    18
2000-01-04 04:00:00    19
2000-01-04 08:00:00    20
2000-01-04 12:00:00    21
2000-01-04 16:00:00    22
2000-01-04 20:00:00    23
2000-01-05 00:00:00    24
Freq: 4H, dtype: int64

In [4]: series.resample('D', how='sum')
Out[4]:
2000-01-01     15
2000-01-02     51
2000-01-03     87
2000-01-04    123
2000-01-05     24
Freq: D, dtype: int64

In [5]: # old behavior

In [6]: series.resample('D', how='sum', closed='right', label='right')
Out[6]:
2000-01-01      0
2000-01-02     21
2000-01-03     57
2000-01-04     93
2000-01-05    129
Freq: D, dtype: int64
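For readers on a recent pandas, the same comparison can be sketched with the aggregation written as a method call (.sum()) instead of the how= keyword used above. The names left and right are illustrative, and the index is built from a Timedelta so the example does not depend on a particular frequency alias spelling.

```python
import numpy as np
import pandas as pd

# Same data as above: 25 observations spaced four hours apart.
dates = pd.date_range("2000-01-01", periods=25, freq=pd.Timedelta(hours=4))
series = pd.Series(np.arange(len(dates)), index=dates)

# Daily bins closed on the left and labelled by their start.
left = series.resample("D", closed="left", label="left").sum()

# Daily bins closed on the right and labelled by their end (the old default).
right = series.resample("D", closed="right", label="right").sum()

print(left.tolist())
print(right.tolist())
```

With closed='right', the midnight observation belongs to the bin that ends at that midnight, which is why the first bin holds only the value 0.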

Infinity and negative infinity are not treated as NA by isnull unless the use_inf_as_null option is set:

In [6]: s = pd.Series([1.5, np.inf, 3.4, -np.inf])

In [7]: pd.isnull(s)
Out[7]:
0    False
1    False
2    False
3    False
Length: 4, dtype: bool

In [8]: s.fillna(0)
Out[8]:
0    1.500000
1         inf
2    3.400000
3        -inf
Length: 4, dtype: float64

In [9]: pd.set_option('use_inf_as_null', True)

In [10]: pd.isnull(s)
Out[10]:
0    False
1     True
2    False
3     True
Length: 4, dtype: bool

In [11]: s.fillna(0)
Out[11]:
0    1.5
1    0.0
2    3.4
3    0.0
Length: 4, dtype: float64

In [12]: pd.reset_option('use_inf_as_null')
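Where a global option is not wanted, infinities can be converted to NaN explicitly before filling. This is a sketch of one possible approach rather than anything the release notes prescribe; the name cleaned is illustrative, and option names have changed across pandas versions, so the explicit replacement may also be the more portable choice.

```python
import numpy as np
import pandas as pd

s = pd.Series([1.5, np.inf, 3.4, -np.inf])

# Infinite values are not reported as missing.
print(pd.isnull(s).tolist())

# Convert infinities to NaN only where they should count as missing.
cleaned = s.replace([np.inf, -np.inf], np.nan)
print(pd.isnull(cleaned).tolist())
print(cleaned.fillna(0).tolist())
```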

A file read with header=None gets the integers 0 through N - 1 as column names; passing prefix="X" produces the names X0, X1, ... instead:

In [6]: import io

In [7]: data = """
   ...: a,b,c
   ...: 1,Yes,2
   ...: 3,No,4
   ...: """
   ...:

In [8]: print(data)

a,b,c
1,Yes,2
3,No,4

In [9]: pd.read_csv(io.StringIO(data), header=None)
Out[9]:
   0    1  2
0  a    b  c
1  1  Yes  2
2  3   No  4

In [10]: pd.read_csv(io.StringIO(data), header=None, prefix="X")
Out[10]:
  X0   X1 X2
0  a    b  c
1  1  Yes  2
2  3   No  4
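The prefixed names can also be produced after reading, which may be useful if the prefix argument is not available in the pandas version at hand. This is a minimal sketch using DataFrame.add_prefix; the name prefixed is illustrative.

```python
import io

import pandas as pd

data = "a,b,c\n1,Yes,2\n3,No,4\n"

# With header=None every line is data and the columns are labelled 0..N-1.
df = pd.read_csv(io.StringIO(data), header=None)
print(list(df.columns))

# Rebuild X0, X1, ... style names after the fact.
prefixed = df.add_prefix("X")
print(list(prefixed.columns))
```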

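Values such as 'Yes' and 'No' are read as strings by default; the true_values and false_values arguments convert them to booleans:

In [4]: print(data)

a,b,c
1,Yes,2
3,No,4

In [5]: pd.read_csv(io.StringIO(data))
Out[5]:
   a    b  c
0  1  Yes  2
1  3   No  4

In [6]: pd.read_csv(io.StringIO(data), true_values=["Yes"], false_values=["No"])
Out[6]:
   a      b  c
0  1   True  2
1  3  False  4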
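To confirm that the conversion changes the column dtype and not just its display, the two reads can be compared directly. This is a small sketch with illustrative names (plain, as_bool).

```python
import io

import pandas as pd

data = "a,b,c\n1,Yes,2\n3,No,4\n"

plain = pd.read_csv(io.StringIO(data))
as_bool = pd.read_csv(io.StringIO(data), true_values=["Yes"], false_values=["No"])

print(plain["b"].tolist())    # the raw strings
print(as_bool["b"].dtype)     # a boolean dtype
print(as_bool["b"].tolist())
```

Only columns whose values all appear in the two lists are converted, so the integer columns a and c are left alone.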

fillna takes either a fill value or an interpolation method:

In [6]: s = pd.Series([np.nan, 1.0, 2.0, np.nan, 4])

In [7]: s
Out[7]:
0    NaN
1    1.0
2    2.0
3    NaN
4    4.0
dtype: float64

In [8]: s.fillna(0)
Out[8]:
0    0.0
1    1.0
2    2.0
3    0.0
4    4.0
dtype: float64

In [9]: s.fillna(method="pad")
Out[9]:
0    NaN
1    1.0
2    2.0
3    2.0
4    4.0
dtype: float64

Convenience methods ffill and bfill have been added:

In [10]: s.ffill()
Out[10]:
0    NaN
1    1.0
2    2.0
3    2.0
4    4.0
dtype: float64
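The session above shows only ffill. A short sketch of bfill, and of chaining the two so that a leading NaN is filled as well, might look like this (the names forward, backward and both are illustrative):

```python
import numpy as np
import pandas as pd

s = pd.Series([np.nan, 1.0, 2.0, np.nan, 4.0])

forward = s.ffill()   # propagate the last valid value forward
backward = s.bfill()  # pull the next valid value backward

# Forward fill leaves a leading NaN; a following backward fill closes it.
both = s.ffill().bfill()

print(backward.tolist())
print(both.tolist())
```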

0 0.340445 0.115903
1 0.984729 0.969691
2 0.919540 0.845555
3 0.037772 0.001427
4 0.861549 0.742267

New features#

Wide DataFrame printing#

Instead of printing the summary information, pandas now splits the string representation across multiple rows by default:

In [16]: wide_frame = pd.DataFrame(np.random.randn(5, 16))

In [17]: wide_frame
Out[17]:
         0         1         2   ...        13        14        15
0 -0.548702  1.467327 -1.015962  ...  1.669052  1.037882 -1.705775
1 -0.919854 -0.042379  1.247642  ...  1.956030  0.017587 -0.016692
2 -0.575247  0.254161 -1.143704  ...  1.211526  0.268520  0.024580
3 -1.577585  0.396823 -0.105381  ...  0.593616  0.884345  1.591431
4  0.141809  0.220390  0.435589  ... -0.392670  0.007207  1.928123

[5 rows x 16 columns]

The old behavior of printing out summary information can be achieved via the ‘expand_frame_repr’ print option:

In [18]: pd.set_option("expand_frame_repr", False)

In [19]: wide_frame
Out[19]:
          0         1         2         3         4         5         6         7         8         9        10        11        12        13        14        15
0 -0.548702  1.467327 -1.015962 -0.483075  1.637550 -1.217659 -0.291519 -1.745505 -0.263952  0.991460 -0.919069  0.266046 -0.709661  1.669052  1.037882 -1.705775
1 -0.919854 -0.042379  1.247642 -0.009920  0.290213  0.495767  0.362949  1.548106 -1.131345 -0.089329  0.337863 -0.945867 -0.932132  1.956030  0.017587 -0.016692
2 -0.575247  0.254161 -1.143704  0.215897  1.193555 -0.077118 -0.408530 -0.862495  1.346061  1.511763  1.627081 -0.990582 -0.441652  1.211526  0.268520  0.024580
3 -1.577585  0.396823 -0.105381 -0.532532  1.453749  1.208843 -0.080952 -0.264610 -0.727965 -0.589346  0.339969 -0.693205 -0.339355  0.593616  0.884345  1.591431
4  0.141809  0.220390  0.435589  0.192451 -0.096701  0.803351  1.715071 -0.708758 -1.202872 -1.814470  1.018601 -0.595447  1.395433 -0.392670  0.007207  1.928123

The width of each line can be changed via ‘line_width’ (80 by default):

pd.set_option("line_width", 40)

wide_frame
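The effect of the two options can be checked without changing global state by using pd.option_context. The sketch below assumes a recent pandas, where the options are addressed as display.expand_frame_repr and display.width (the latter standing in for the line_width name used above); display.max_columns is set to None only so that no columns are elided.

```python
import numpy as np
import pandas as pd

wide_frame = pd.DataFrame(np.random.default_rng(0).standard_normal((5, 16)))

# Unwrapped: one header line plus one line per row.
with pd.option_context("display.expand_frame_repr", False,
                       "display.max_columns", None):
    unwrapped = repr(wide_frame)

# Wrapped: the columns are split into several stacked blocks.
with pd.option_context("display.expand_frame_repr", True,
                       "display.width", 40,
                       "display.max_columns", None):
    wrapped = repr(wide_frame)

print(len(unwrapped.splitlines()), len(wrapped.splitlines()))
```

The options return to their previous values when each with block exits, which is convenient in notebooks and tests.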

Updated PyTables support#

The docs now cover the PyTables Table format, and the API has several enhancements. Here is a taste of what to expect.

In [41]: store = pd.HDFStore('store.h5')

In [42]: df = pd.DataFrame(np.random.randn(8, 3),
   ....:                   index=pd.date_range('1/1/2000', periods=8),
   ....:                   columns=['A', 'B', 'C'])

In [43]: df
Out[43]:
                   A         B         C
2000-01-01 -2.036047  0.000830 -0.955697
2000-01-02 -0.898872 -0.725411  0.059904
2000-01-03 -0.449644  1.082900 -1.221265
2000-01-04  0.361078  1.330704  0.855932
2000-01-05 -1.216718  1.488887  0.018993
2000-01-06 -0.877046  0.045976  0.437274
2000-01-07 -0.567182 -0.888657 -0.556383
2000-01-08  0.655457  1.117949 -2.782376

[8 rows x 3 columns]

appending data frames

In [44]: df1 = df[0:4]

In [45]: df2 = df[4:]

In [46]: store.append('df', df1)

In [47]: store.append('df', df2)

In [48]: store
Out[48]:
<class 'pandas.io.pytables.HDFStore'>
File path: store.h5
/df            frame_table  (typ->appendable,nrows->8,ncols->3,indexers->[index])

selecting the entire store

In [49]: store.select('df')
Out[49]:
                   A         B         C
2000-01-01 -2.036047  0.000830 -0.955697
2000-01-02 -0.898872 -0.725411  0.059904
2000-01-03 -0.449644  1.082900 -1.221265
2000-01-04  0.361078  1.330704  0.855932
2000-01-05 -1.216718  1.488887  0.018993
2000-01-06 -0.877046  0.045976  0.437274
2000-01-07 -0.567182 -0.888657 -0.556383
2000-01-08  0.655457  1.117949 -2.782376

[8 rows x 3 columns]

In [50]: wp = pd.Panel(np.random.randn(2, 5, 4), items=['Item1', 'Item2'],
   ....:               major_axis=pd.date_range('1/1/2000', periods=5),
   ....:               minor_axis=['A', 'B', 'C', 'D'])

In [51]: wp
Out[51]:
<class 'pandas.core.panel.Panel'>
Dimensions: 2 (items) x 5 (major_axis) x 4 (minor_axis)
Items axis: Item1 to Item2
Major_axis axis: 2000-01-01 00:00:00 to 2000-01-05 00:00:00
Minor_axis axis: A to D

storing a panel

In [52]: store.append('wp', wp)

selecting via A QUERY

In [53]: store.select('wp', [pd.Term('major_axis>20000102'),
   ....:                     pd.Term('minor_axis', '=', ['A', 'B'])])
   ....:
Out[53]:
<class 'pandas.core.panel.Panel'>
Dimensions: 2 (items) x 3 (major_axis) x 2 (minor_axis)
Items axis: Item1 to Item2
Major_axis axis: 2000-01-03 00:00:00 to 2000-01-05 00:00:00
Minor_axis axis: A to B

removing data from tables

In [54]: store.remove('wp', pd.Term('major_axis>20000103'))
Out[54]: 8

In [55]: store.select('wp')
Out[55]:
<class 'pandas.core.panel.Panel'>
Dimensions: 2 (items) x 3 (major_axis) x 4 (minor_axis)
Items axis: Item1 to Item2
Major_axis axis: 2000-01-01 00:00:00 to 2000-01-03 00:00:00
Minor_axis axis: A to D

deleting a store

In [56]: del store['df']

In [57]: store
Out[57]:
<class 'pandas.io.pytables.HDFStore'>
File path: store.h5
/wp            wide_table   (typ->appendable,nrows->12,ncols->2,indexers->[major_axis,minor_axis])

Enhancements

In [62]: store.remove('food')

In [63]: store
Out[63]:
<class 'pandas.io.pytables.HDFStore'>
File path: store.h5
/foo/bar/bah frame (shape->[8,3])
/wp wide_table (typ->appendable,nrows->12,ncols->2,indexers->[major_axis,minor_axis])

2000-01-01 -2.036047 0.000830 -0.955697 string 1
2000-01-02 -0.898872 -0.725411 0.059904 string 1
2000-01-03 -0.449644 1.082900 -1.221265 string 1
2000-01-04 0.361078 1.330704 0.855932 string 1
2000-01-05 -1.216718 1.488887 0.018993 string 1
2000-01-06 -0.877046 0.045976 0.437274 string 1
2000-01-07 -0.567182 -0.888657 -0.556383 string 1
2000-01-08 0.655457 1.117949 -2.782376 string 1

[8 rows x 5 columns]

In [69]: df1.get_dtype_counts()
Out[69]:
float64 3
int64 1
object 1
dtype: int64

Bug Fixes

Compatibility

Version 0.10 of HDFStore is backwards compatible for reading tables created in a prior version of pandas. However, query terms using the prior (undocumented) methodology are unsupported. You must read in the entire file and write it out using the new format to take advantage of the updates.

N dimensional panels (experimental)#

This release adds experimental support for Panel4D and factory functions to create n-dimensional named panels. Here is a taste of what to expect.

In [58]: p4d = Panel4D(np.random.randn(2, 2, 5, 4),
   ....:               labels=['Label1','Label2'],
   ....:               items=['Item1', 'Item2'],
   ....:               major_axis=date_range('1/1/2000', periods=5),
   ....:               minor_axis=['A', 'B', 'C', 'D'])
   ....:

In [59]: p4d
Out[59]:
<class 'pandas.core.panelnd.Panel4D'>
Dimensions: 2 (labels) x 2 (items) x 5 (major_axis) x 4 (minor_axis)
Labels axis: Label1 to Label2
Items axis: Item1 to Item2
Major_axis axis: 2000-01-01 00:00:00 to 2000-01-05 00:00:00
Minor_axis axis: A to D

See the full release notes or issue tracker on GitHub for a complete list.

Contributors#

A total of 26 people contributed patches to this release. People with a “+” by their names contributed a patch for the first time.