Version 0.14.0 (May 31, 2014)

This is a major release from 0.13.1 and includes a small number of API changes, several new features, enhancements, and performance improvements along with a large number of bug fixes. We recommend that all users upgrade to this version.

Warning

In 0.14.0 all NDFrame based containers have undergone significant internal refactoring. Previously, each block of homogeneous data had its own labels, and extra care was necessary to keep those in sync with the parent container’s labels. This should not have any visible user/API behavior changes (GH 6745)

API changes#
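
.iloc will now accept out-of-bounds slice indexers, excluding the out-of-range portion rather than raising, in line with Python/NumPy slice semantics; out-of-bounds scalar and list indexers still raise IndexError. The examples use a small frame dfl; a construction along these lines (a sketch consistent with the outputs below, minus the docs' random seed) produces it:

import numpy as np
import pandas as pd

# Five rows, two float columns 'A' and 'B'.
dfl = pd.DataFrame(np.random.randn(5, 2), columns=list('AB'))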

In [2]: dfl
Out[2]:
          A         B
0  0.469112 -0.282863
1 -1.509059 -1.135632
2  1.212112 -0.173215
3  0.119209 -1.044236
4 -0.861849 -2.104569

[5 rows x 2 columns]
In [3]: dfl.iloc[:, 2:3]
Out[3]:
Empty DataFrame
Columns: []
Index: [0, 1, 2, 3, 4]
[5 rows x 0 columns]
In [4]: dfl.iloc[:, 1:3]
Out[4]:
B
0 -0.282863
1 -1.135632
2 -0.173215
3 -1.044236
4 -2.104569
[5 rows x 1 columns]
In [5]: dfl.iloc[4:6]
Out[5]:
A B
4 -0.861849 -2.104569
[1 rows x 2 columns]
These are out-of-bounds selections

dfl.iloc[[4, 5, 6]]
IndexError: positional indexers are out-of-bounds
dfl.iloc[:, 4]
IndexError: single positional indexer is out-of-bounds
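
The next change concerns set_index() when passed a MultiIndex. A setup along these lines (a sketch consistent with the outputs below; the random values will differ without the docs' seed) yields df_multi, mi and tuple_ind:

import numpy as np
import pandas as pd

# The same four row keys spelled two ways: a flat Index of tuples and a
# proper 2-level MultiIndex.
tuple_ind = pd.Index([('a', 'c'), ('a', 'd'), ('b', 'c'), ('b', 'd')],
                     tupleize_cols=False)
mi = pd.MultiIndex.from_product([['a', 'b'], ['c', 'd']])
df_multi = pd.DataFrame(np.random.randn(4, 2), index=mi)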

Old behavior, cast the MultiIndex to an Index of tuples

In [10]: tuple_ind
Out[10]: Index([('a', 'c'), ('a', 'd'), ('b', 'c'), ('b', 'd')], dtype='object')
In [11]: df_multi.set_index(tuple_ind)
Out[11]:
0 1
(a, c) 0.471435 -1.190976
(a, d) 1.432707 -0.312652
(b, c) -0.720589 0.887163
(b, d) 0.859588 -0.636524
[4 rows x 2 columns]

New behavior

In [12]: mi
Out[12]:
MultiIndex([('a', 'c'),
            ('a', 'd'),
            ('b', 'c'),
            ('b', 'd')],
           )
In [13]: df_multi.set_index(mi)
Out[13]:
            0         1
a c  0.471435 -1.190976
  d  1.432707 -0.312652
b c -0.720589  0.887163
  d  0.859588 -0.636524
[4 rows x 2 columns]
This also applies when passing multiple indices to set_index:

Old output, 2-level MultiIndex of tuples

In [14]: df_multi.set_index([df_multi.index, df_multi.index])
Out[14]:
0 1
(a, c) (a, c) 0.471435 -1.190976
(a, d) (a, d) 1.432707 -0.312652
(b, c) (b, c) -0.720589 0.887163
(b, d) (b, d) 0.859588 -0.636524
[4 rows x 2 columns]

New output, 4-level MultiIndex

In [15]: df_multi.set_index([df_multi.index, df_multi.index])
Out[15]:
                0         1
a c a c  0.471435 -1.190976
  d a d  1.432707 -0.312652
b c b c -0.720589  0.887163
  d b d  0.859588 -0.636524
[4 rows x 2 columns]


Display changes#
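
The frame dfd printed below can be reproduced as follows (an assumption matching the printed values):

import numpy as np
import pandas as pd

# A 5x5 frame holding the integers 0..24.
dfd = pd.DataFrame(np.arange(25).reshape(5, 5))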

show dimensions since this is truncated

In [17]: with pd.option_context('display.max_rows', 2, 'display.max_columns', 2,
....: 'display.show_dimensions', 'truncate'):
....: print(dfd)
....:
0 ... 4
0 0 ... 4
.. .. ... ..
4 20 ... 24
[5 rows x 5 columns]

will not show dimensions since it is not truncated

In [18]: with pd.option_context('display.max_rows', 10, 'display.max_columns', 40,
....: 'display.show_dimensions', 'truncate'):
....: print(dfd)
....:
0 1 2 3 4
0 0 1 2 3 4
1 5 6 7 8 9
2 10 11 12 13 14
3 15 16 17 18 19
4 20 21 22 23 24

Text parsing API changes#

read_csv()/read_table() will now be noisier with respect to invalid options, rather than falling back to the PythonParser.
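
For instance, skipfooter is only supported by the Python engine, so requesting it together with engine='c' now raises (a minimal sketch):

import io
import pandas as pd

data = io.StringIO("a,b\n1,2\n3,4\n")

# An unsupported option combined with an explicit engine='c' raises a
# ValueError instead of silently switching to the Python parser.
pd.read_csv(data, engine='c', skipfooter=1)  # raises ValueError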

GroupBy API changes#

More consistent behavior for some groupby methods:
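
The examples below use g = df.groupby('A') on a small frame containing a NaN; this construction is consistent with the outputs shown:

import numpy as np
import pandas as pd

# Two groups under 'A' (1 and 5); the first row of group 1 has NaN in 'B'.
df = pd.DataFrame([[1, np.nan], [1, 4], [5, 6]], columns=['A', 'B'])
g = df.groupby('A')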

this is equivalent to g.first()

In [22]: g.nth(0, dropna='any')
Out[22]:
A B
1 1 4.0
2 5 6.0
[2 rows x 2 columns]

this is equivalent to g.last()

In [23]: g.nth(-1, dropna='any')
Out[23]:
A B
1 1 4.0
2 5 6.0
[2 rows x 2 columns]
Filtering
In [24]: gf = df.groupby('A', as_index=False)
In [25]: gf.nth(0)
Out[25]:
A B
0 1 NaN
2 5 6.0
[2 rows x 2 columns]
In [26]: gf.nth(0, dropna='any')
Out[26]:
A B
1 1 4.0
2 5 6.0
[2 rows x 2 columns]

SQL#

The SQL reading and writing functions now support more database flavors through SQLAlchemy (GH 2717, GH 4163, GH 5950, GH 6292). All databases supported by SQLAlchemy can be used, such as PostgreSQL, MySQL, Oracle and Microsoft SQL Server (see the SQLAlchemy documentation on included dialects).

The functionality of providing DBAPI connection objects will only be supported for sqlite3 in the future. The 'mysql' flavor is deprecated.

The new functions read_sql_query() and read_sql_table() are introduced. The function read_sql() is kept as a convenience wrapper around the other two and will delegate to the appropriate function depending on the provided input (a database table name or a SQL query).

In practice, you have to provide a SQLAlchemy engine to the sql functions. To connect with SQLAlchemy you use the create_engine() function to create an engine object from a database URI. You only need to create the engine once per database you are connecting to. For an in-memory sqlite database:

In [35]: from sqlalchemy import create_engine

Create your connection.

In [36]: engine = create_engine('sqlite:///:memory:')

This engine can then be used to write or read data to/from this database:

In [37]: df = pd.DataFrame({'A': [1, 2, 3], 'B': ['a', 'b', 'c']})

In [38]: df.to_sql(name='db_table', con=engine, index=False)
Out[38]: 3

You can read data from a database by specifying the table name:

In [39]: pd.read_sql_table('db_table', engine)
Out[39]:
   A  B
0  1  a
1  2  b
2  3  c

[3 rows x 2 columns]

or by specifying a sql query:

In [40]: pd.read_sql_query('SELECT * FROM db_table', engine)
Out[40]:
   A  B
0  1  a
1  2  b
2  3  c

[3 rows x 2 columns]
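
Both calls can also be spelled through the read_sql() convenience wrapper, which dispatches on its first argument:

# A table name delegates to read_sql_table();
# a SQL query delegates to read_sql_query().
pd.read_sql('db_table', engine)
pd.read_sql('SELECT * FROM db_table', engine)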

A number of other enhancements were also made to the sql functions.

Warning

Some of the existing functions or function aliases have been deprecated and will be removed in future versions. This includes: tquery, uquery, read_frame, frame_query, write_frame.

Warning

The support for the ‘mysql’ flavor when using DBAPI connection objects has been deprecated. MySQL will be further supported with SQLAlchemy engines (GH 6900).

Multi-indexing using slicers#

In 0.14.0 we added a new way to slice MultiIndexed objects. You can slice a MultiIndex by providing multiple indexers.

You can provide any of the selectors as if you are indexing by label, see Selection by Label, including slices, lists of labels, labels, and boolean indexers.

You can use slice(None) to select all the contents of that level. You do not need to specify all the deeper levels; they will be implied as slice(None).

As usual, both sides of the slicers are included as this is label indexing.

See the docs. See also issues (GH 6134, GH 4036, GH 3057, GH 2598, GH 5641, GH 7106).

Warning

You should specify all axes in the .loc specifier, meaning the indexer for the index and for the columns. There are some ambiguous cases where the passed indexer could be misinterpreted as indexing both axes, rather than into, say, the MultiIndex for the rows.

You should do this:

df.loc[(slice('A1', 'A3'), ...), :]

rather than this:

df.loc[(slice('A1', 'A3'), ...)]

Warning

You will need to make sure that the selection axes are fully lexsorted!
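
If an axis is not lexsorted, sort it before slicing, as the construction below does; a minimal sketch:

# Label slicing on a MultiIndex needs lexsorted levels on the sliced axis;
# sorting both axes up front is the simplest way to guarantee this.
df = df.sort_index().sort_index(axis=1)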

In [41]: def mklbl(prefix, n):
   ....:     return ["%s%s" % (prefix, i) for i in range(n)]
   ....:

In [42]: index = pd.MultiIndex.from_product([mklbl('A', 4),
   ....:                                     mklbl('B', 2),
   ....:                                     mklbl('C', 4),
   ....:                                     mklbl('D', 2)])
   ....:

In [43]: columns = pd.MultiIndex.from_tuples([('a', 'foo'), ('a', 'bar'),
   ....:                                      ('b', 'foo'), ('b', 'bah')],
   ....:                                     names=['lvl0', 'lvl1'])
   ....:

In [44]: df = pd.DataFrame(np.arange(len(index) * len(columns)).reshape((len(index),
   ....:                                                                 len(columns))),
   ....:                   index=index,
   ....:                   columns=columns).sort_index().sort_index(axis=1)
   ....:

In [45]: df
Out[45]:
lvl0           a         b
lvl1         bar  foo  bah  foo
A0 B0 C0 D0    1    0    3    2
         D1    5    4    7    6
      C1 D0    9    8   11   10
         D1   13   12   15   14
      C2 D0   17   16   19   18
...          ...  ...  ...  ...
A3 B1 C1 D1  237  236  239  238
      C2 D0  241  240  243  242
         D1  245  244  247  246
      C3 D0  249  248  251  250
         D1  253  252  255  254

[64 rows x 4 columns]

Basic MultiIndex slicing using slices, lists, and labels.

In [46]: df.loc[(slice('A1', 'A3'), slice(None), ['C1', 'C3']), :]
Out[46]:
lvl0           a         b
lvl1         bar  foo  bah  foo
A1 B0 C1 D0   73   72   75   74
         D1   77   76   79   78
      C3 D0   89   88   91   90
         D1   93   92   95   94
   B1 C1 D0  105  104  107  106
...          ...  ...  ...  ...
A3 B0 C3 D1  221  220  223  222
   B1 C1 D0  233  232  235  234
         D1  237  236  239  238
      C3 D0  249  248  251  250
         D1  253  252  255  254

[24 rows x 4 columns]

You can use a pd.IndexSlice to shortcut the creation of these slices:

In [47]: idx = pd.IndexSlice

In [48]: df.loc[idx[:, :, ['C1', 'C3']], idx[:, 'foo']]
Out[48]:
lvl0           a    b
lvl1         foo  foo
A0 B0 C1 D0    8   10
         D1   12   14
      C3 D0   24   26
         D1   28   30
   B1 C1 D0   40   42
...          ...  ...
A3 B0 C3 D1  220  222
   B1 C1 D0  232  234
         D1  236  238
      C3 D0  248  250
         D1  252  254

[32 rows x 2 columns]

It is possible to perform quite complicated selections using this method on multiple axes at the same time.

In [49]: df.loc['A1', (slice(None), 'foo')]
Out[49]:
lvl0        a    b
lvl1      foo  foo
B0 C0 D0   64   66
      D1   68   70
   C1 D0   72   74
      D1   76   78
   C2 D0   80   82
...       ...  ...
B1 C1 D1  108  110
   C2 D0  112  114
      D1  116  118
   C3 D0  120  122
      D1  124  126

[16 rows x 2 columns]

In [50]: df.loc[idx[:, :, ['C1', 'C3']], idx[:, 'foo']]
Out[50]:
lvl0           a    b
lvl1         foo  foo
A0 B0 C1 D0    8   10
         D1   12   14
      C3 D0   24   26
         D1   28   30
   B1 C1 D0   40   42
...          ...  ...
A3 B0 C3 D1  220  222
   B1 C1 D0  232  234
         D1  236  238
      C3 D0  248  250
         D1  252  254

[32 rows x 2 columns]

Using a boolean indexer, you can make selections based on the values.

In [51]: mask = df[('a', 'foo')] > 200

In [52]: df.loc[idx[mask, :, ['C1', 'C3']], idx[:, 'foo']]
Out[52]:
lvl0           a    b
lvl1         foo  foo
A3 B0 C1 D1  204  206
      C3 D0  216  218
         D1  220  222
   B1 C1 D0  232  234
         D1  236  238
      C3 D0  248  250
         D1  252  254

[7 rows x 2 columns]

You can also specify the axis argument to .loc to interpret the passed slicers on a single axis.

In [53]: df.loc(axis=0)[:, :, ['C1', 'C3']]
Out[53]:
lvl0           a         b
lvl1         bar  foo  bah  foo
A0 B0 C1 D0    9    8   11   10
         D1   13   12   15   14
      C3 D0   25   24   27   26
         D1   29   28   31   30
   B1 C1 D0   41   40   43   42
...          ...  ...  ...  ...
A3 B0 C3 D1  221  220  223  222
   B1 C1 D0  233  232  235  234
         D1  237  236  239  238
      C3 D0  249  248  251  250
         D1  253  252  255  254

[32 rows x 4 columns]

Furthermore, you can set values using these methods:

In [54]: df2 = df.copy()

In [55]: df2.loc(axis=0)[:, :, ['C1', 'C3']] = -10

In [56]: df2
Out[56]:
lvl0           a         b
lvl1         bar  foo  bah  foo
A0 B0 C0 D0    1    0    3    2
         D1    5    4    7    6
      C1 D0  -10  -10  -10  -10
         D1  -10  -10  -10  -10
      C2 D0   17   16   19   18
...          ...  ...  ...  ...
A3 B1 C1 D1  -10  -10  -10  -10
      C2 D0  241  240  243  242
         D1  245  244  247  246
      C3 D0  -10  -10  -10  -10
         D1  -10  -10  -10  -10

[64 rows x 4 columns]

You can use an alignable object as the right-hand side as well.

In [57]: df2 = df.copy()

In [58]: df2.loc[idx[:, :, ['C1', 'C3']], :] = df2 * 1000

In [59]: df2
Out[59]:
lvl0              a               b
lvl1            bar     foo     bah     foo
A0 B0 C0 D0       1       0       3       2
         D1       5       4       7       6
      C1 D0    9000    8000   11000   10000
         D1   13000   12000   15000   14000
      C2 D0      17      16      19      18
...             ...     ...     ...     ...
A3 B1 C1 D1  237000  236000  239000  238000
      C2 D0     241     240     243     242
         D1     245     244     247     246
      C3 D0  249000  248000  251000  250000
         D1  253000  252000  255000  254000

[64 rows x 4 columns]

Plotting#

Prior version deprecations/changes#

There are prior version deprecations that are taking effect as of 0.14.0.

Deprecations#

Non-floating point indexes can only be indexed by integers / labels; floating point indexers are deprecated and raise a FutureWarning:

In [1]: pd.Series(1, np.arange(5))[3.0]
pandas/core/index.py:469: FutureWarning: scalar indexers for index type Int64Index should be integers and not floating point
Out[1]: 1
In [2]: pd.Series(1, np.arange(5)).iloc[3.0]
pandas/core/index.py:469: FutureWarning: scalar indexers for index type Int64Index should be integers and not floating point
Out[2]: 1
In [3]: pd.Series(1, np.arange(5)).iloc[3.0:4]
pandas/core/index.py:527: FutureWarning: slice indexers when using iloc should be integers and not floating point
Out[3]:
3 1
dtype: int64

These are Float64Indexes, so integer or floating point indexing is acceptable:

In [4]: pd.Series(1, np.arange(5.))[3]
Out[4]: 1
In [5]: pd.Series(1, np.arange(5.))[3.0]
Out[5]: 1
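
On integer-valued indexes, pass real integers to stay off the deprecation path (a minimal sketch):

import numpy as np
import pandas as pd

s = pd.Series(1, index=np.arange(5))  # integer index
s[3]       # integer label: no warning
s.iloc[3]  # integer position: no warning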

Known issues#

Enhancements#


Joining a singly-indexed DataFrame with a MultiIndexed DataFrame is now supported:

              male    wealth
household_id
1                0  196087.3
2                1  316478.7
3                0  294750.0

[3 rows x 2 columns]
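
A construction along these lines (values read off the display above; the wealth column name is an assumption) reproduces household:

import pandas as pd

# One row per household, singly indexed by household_id.
household = pd.DataFrame({'household_id': [1, 2, 3],
                          'male': [0, 1, 0],
                          'wealth': [196087.3, 316478.7, 294750.0]}
                         ).set_index('household_id')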
In [64]: portfolio = pd.DataFrame({'household_id': [1, 2, 2, 3, 3, 3, 4],
....: 'asset_id': ["nl0000301109",
....: "nl0000289783",
....: "gb00b03mlx29",
....: "gb00b03mlx29",
....: "lu0197800237",
....: "nl0000289965",
....: np.nan],
....: 'name': ["ABN Amro",
....: "Robeco",
....: "Royal Dutch Shell",
....: "Royal Dutch Shell",
....: "AAB Eastern Europe Equity Fund",
....: "Postbank BioTech Fonds",
....: np.nan],
....: 'share': [1.0, 0.4, 0.6, 0.15, 0.6, 0.25, 1.0]
....: },
....: columns=['household_id', 'asset_id', 'name', 'share']
....: ).set_index(['household_id', 'asset_id'])
....:
In [65]: portfolio
Out[65]:
name share
household_id asset_id
1 nl0000301109 ABN Amro 1.00
2 nl0000289783 Robeco 0.40
gb00b03mlx29 Royal Dutch Shell 0.60
3 gb00b03mlx29 Royal Dutch Shell 0.15
lu0197800237 AAB Eastern Europe Equity Fund 0.60
nl0000289965 Postbank BioTech Fonds 0.25
4 NaN NaN 1.00
[7 rows x 2 columns]
In [66]: household.join(portfolio, how='inner')
Out[66]:
male ... share
household_id asset_id ...
1 nl0000301109 0 ... 1.00
2 nl0000289783 1 ... 0.40
gb00b03mlx29 1 ... 0.60
3 gb00b03mlx29 0 ... 0.15
lu0197800237 0 ... 0.60
nl0000289965 0 ... 0.25
[6 rows x 4 columns]

pivot_table() can now accept Grouper objects for the index and columns keywords:

  Branch Buyer  Quantity                Date              PayDay
1      A  Mark         3 2013-09-01 13:05:00 2013-10-15 13:05:00
2      A  Carl         5 2013-10-01 20:00:00 2013-09-05 20:00:00
3      A  Carl         1 2013-10-02 10:00:00 2013-11-02 10:00:00
4      A   Joe         8 2013-11-01 20:00:00 2013-10-07 20:00:00
5      B   Joe         1 2013-10-02 10:00:00 2013-09-05 10:00:00

[6 rows x 5 columns]
In [75]: df.pivot_table(values='Quantity',
....: index=pd.Grouper(freq='M', key='Date'),
....: columns=pd.Grouper(freq='M', key='PayDay'),
....: aggfunc="sum")
Out[75]:
PayDay 2013-09-30 2013-10-31 2013-11-30
Date
2013-09-30 NaN 3.0 NaN
2013-10-31 6.0 NaN 1.0
2013-11-30 NaN 9.0 NaN
[3 rows x 3 columns]
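
The same Grouper works with groupby; for instance, the monthly Date totals underlying the rows of the table above can be computed directly (a sketch, assuming the df built for this example):

# Monthly totals of Quantity, grouping the 'Date' column by month end.
df.groupby(pd.Grouper(freq='M', key='Date'))['Quantity'].sum()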

PeriodIndex now fully supports partial string indexing, in the same way as DatetimeIndex. The tail of ps, an hourly period series of length 100:

2013-01-05 08:00    0.285296
2013-01-05 09:00    0.484288
2013-01-05 10:00    1.363482
2013-01-05 11:00   -0.781105
2013-01-05 12:00   -0.468018
Freq: H, Length: 100, dtype: float64
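
A construction along these lines (the start period inferred from the tail above; random values unseeded) reproduces ps:

import numpy as np
import pandas as pd

# 100 hourly periods ending at 2013-01-05 12:00.
prng = pd.period_range('2013-01-01 09:00', periods=100, freq='H')
ps = pd.Series(np.random.randn(len(prng)), index=prng)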
In [79]: ps['2013-01-02']
Out[79]:
2013-01-02 00:00 0.553439
2013-01-02 01:00 1.318152
2013-01-02 02:00 -0.469305
2013-01-02 03:00 0.675554
2013-01-02 04:00 -1.817027
...
2013-01-02 19:00 0.036142
2013-01-02 20:00 -2.074978
2013-01-02 21:00 0.247792
2013-01-02 22:00 -0.897157
2013-01-02 23:00 -0.136795
Freq: H, Length: 24, dtype: float64

Performance#

Experimental#

There are no experimental changes in 0.14.0.

Bug fixes#

Contributors#

A total of 94 people contributed patches to this release. People with a “+” by their names contributed a patch for the first time.