Version 0.16.1 (May 11, 2015) - pandas 2.2.3 documentation

This is a minor bug-fix release from 0.16.0 and includes a large number of bug fixes along with several new features, enhancements, and performance improvements. We recommend that all users upgrade to this version.

Highlights include:

What’s new in v0.16.1

Warning

In pandas 0.17.0, the sub-package pandas.io.data will be removed in favor of a separately installable package (GH 8961).

Enhancements#

CategoricalIndex#

We introduce a CategoricalIndex, a new type of index object that is useful for supporting indexing with duplicates. This is a container around a Categorical (introduced in v0.15.0) and allows efficient indexing and storage of an index with a large number of duplicated elements. Prior to 0.16.1, setting the index of a DataFrame/Series with a category dtype would convert it to a regular object-based Index.

In [1]: df = pd.DataFrame({'A': np.arange(6),
   ...:                    'B': pd.Series(list('aabbca'))
   ...:                           .astype('category', categories=list('cab'))
   ...:                    })
   ...:

In [2]: df
Out[2]:
   A  B
0  0  a
1  1  a
2  2  b
3  3  b
4  4  c
5  5  a

In [3]: df.dtypes
Out[3]:
A       int64
B    category
dtype: object

In [4]: df.B.cat.categories
Out[4]: Index(['c', 'a', 'b'], dtype='object')

Setting the index will create a CategoricalIndex:

In [5]: df2 = df.set_index('B')

In [6]: df2.index
Out[6]: CategoricalIndex(['a', 'a', 'b', 'b', 'c', 'a'], categories=['c', 'a', 'b'], ordered=False, name='B', dtype='category')

Indexing with __getitem__/.iloc/.loc/.ix works similarly to an Index with duplicates. The indexers MUST be in the categories, or the operation will raise.

In [7]: df2.loc['a']
Out[7]:
   A
B
a  0
a  1
a  5

The result preserves the CategoricalIndex:

In [8]: df2.loc['a'].index
Out[8]: CategoricalIndex(['a', 'a', 'a'], categories=['c', 'a', 'b'], ordered=False, name='B', dtype='category')
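The lookup rules can be sketched in runnable form. This is a minimal sketch that assumes a recent pandas release, where categories are passed through pd.CategoricalDtype (the astype('category', categories=...) spelling in the session above dates from 0.16.1). The label "z" and the names cat_dtype, rows_a and missing_label_raised are illustrative only.

```python
import numpy as np
import pandas as pd

# Assumption: a recent pandas release, where categories are given via CategoricalDtype.
cat_dtype = pd.CategoricalDtype(categories=list("cab"))
df = pd.DataFrame({
    "A": np.arange(6),
    "B": pd.Series(list("aabbca")).astype(cat_dtype),
})
df2 = df.set_index("B")

# Selecting a label that is one of the categories returns every matching row.
rows_a = df2.loc["a"]

# A label that is not among the categories ("z" is hypothetical) raises KeyError.
try:
    df2.loc["z"]
    missing_label_raised = False
except KeyError:
    missing_label_raised = True
```

Catching KeyError explicitly, rather than testing membership first, keeps the sketch close to the behaviour the text describes: the lookup itself is what raises.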

Sorting orders the rows by the order of the categories:

In [9]: df2.sort_index()
Out[9]:
   A
B
c  4
a  0
a  1
a  5
b  2
b  3

Groupby operations on the index preserve its categorical nature as well:

In [10]: df2.groupby(level=0).sum()
Out[10]:
   A
B
c  4
a  6
b  5

In [11]: df2.groupby(level=0).sum().index
Out[11]: CategoricalIndex(['c', 'a', 'b'], categories=['c', 'a', 'b'], ordered=False, name='B', dtype='category')
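For readers on a current pandas release, the sorting and grouping behaviour can be reproduced with the sketch below. It builds the CategoricalIndex directly instead of going through set_index, and it passes observed=False explicitly, which is an assumption made here to avoid depending on the version-specific default of that parameter.

```python
import numpy as np
import pandas as pd

# Category order is c, a, b, matching the session above.
cat_dtype = pd.CategoricalDtype(categories=list("cab"))
df2 = pd.DataFrame(
    {"A": np.arange(6)},
    index=pd.CategoricalIndex(list("aabbca"), dtype=cat_dtype, name="B"),
)

# Sorting follows the category order (c, a, b), not alphabetical order.
sorted_labels = df2.sort_index().index.tolist()

# Grouping on the index level keeps a CategoricalIndex on the result.
sums = df2.groupby(level=0, observed=False).sum()
```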

Reindexing operations return a resulting index based on the type of the passed indexer: passing a list returns a plain-old Index, while indexing with a Categorical returns a CategoricalIndex, indexed according to the categories of the PASSED Categorical dtype. This allows one to arbitrarily index these even with values NOT in the categories, similarly to how you can reindex ANY pandas index.

In [12]: df2.reindex(['a', 'e'])
Out[12]:
     A
B
a  0.0
a  1.0
a  5.0
e  NaN

In [13]: df2.reindex(['a', 'e']).index
Out[13]: pd.Index(['a', 'a', 'a', 'e'], dtype='object', name='B')

In [14]: df2.reindex(pd.Categorical(['a', 'e'], categories=list('abcde')))
Out[14]:
     A
B
a  0.0
a  1.0
a  5.0
e  NaN

In [15]: df2.reindex(pd.Categorical(['a', 'e'], categories=list('abcde'))).index
Out[15]: pd.CategoricalIndex(['a', 'a', 'a', 'e'], categories=['a', 'b', 'c', 'd', 'e'], ordered=False, name='B', dtype='category')

See the documentation for more. (GH 7629, GH 10038, GH 10039)

Sample#

Series, DataFrames, and Panels now have a new method: sample(). The method accepts a specific number of rows or columns to return, or a fraction of the total number of rows or columns. It also has options for sampling with or without replacement, for passing in a column of weights for non-uniform sampling, and for setting seed values to facilitate replication. (GH 2419)

In [1]: example_series = pd.Series([0, 1, 2, 3, 4, 5])

When no arguments are passed, sample() returns one row:

In [2]: example_series.sample()
Out[2]:
3    3
Length: 1, dtype: int64

One may specify either a number of rows:

In [3]: example_series.sample(n=3)
Out[3]:
2    2
1    1
0    0
Length: 3, dtype: int64

Or a fraction of the rows:

In [4]: example_series.sample(frac=0.5)
Out[4]:
1    1
5    5
3    3
Length: 3, dtype: int64

Weights are accepted:

In [5]: example_weights = [0, 0, 0.2, 0.2, 0.2, 0.4]

In [6]: example_series.sample(n=3, weights=example_weights)
Out[6]:
2    2
4    4
3    3
Length: 3, dtype: int64

Weights will also be normalized if they do not sum to one, and missing values will be treated as zeros:

In [7]: example_weights2 = [0.5, 0, 0, 0, None, np.nan]

In [8]: example_series.sample(n=1, weights=example_weights2)
Out[8]:
0    0
Length: 1, dtype: int64

When applied to a DataFrame, one may pass the name of a column to specify sampling weights when sampling from rows.

In [9]: df = pd.DataFrame({"col1": [9, 8, 7, 6], "weight_column": [0.5, 0.4, 0.1, 0]})

In [10]: df.sample(n=3, weights="weight_column")
Out[10]:
   col1  weight_column
0     9            0.5
1     8            0.4
2     7            0.1

[3 rows x 2 columns]
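The sessions above draw different rows on every run. As a rough sketch of the seeding and replacement options mentioned in the introduction, the example below uses the random_state and replace parameters of sample(); the seed values and variable names are arbitrary choices for illustration.

```python
import pandas as pd

example_series = pd.Series([0, 1, 2, 3, 4, 5])

# random_state fixes the seed, so the same rows are drawn on every call.
first_draw = example_series.sample(n=3, random_state=42)
second_draw = example_series.sample(n=3, random_state=42)

# Rows with zero weight (here the values 0 and 1) can never be drawn.
weights = [0, 0, 0.2, 0.2, 0.2, 0.4]
weighted_draw = example_series.sample(n=3, weights=weights, random_state=0)

# Sampling with replacement allows more draws than there are rows.
oversampled = example_series.sample(n=10, replace=True, random_state=0)

# For a DataFrame, weights can name a column; the zero-weight row is excluded.
df = pd.DataFrame({"col1": [9, 8, 7, 6], "weight_column": [0.5, 0.4, 0.1, 0]})
weighted_rows = df.sample(n=3, weights="weight_column", random_state=0)
```

Because only three rows of df have a non-zero weight, drawing three rows without replacement always selects exactly those rows, in some order.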

String methods enhancements#

Continuing from v0.16.0, the following enhancements make string operations easier and more consistent with standard Python string operations.

a2    1
b1    2
b2    3
Length: 4, dtype: int64

In [16]: idx.str.startswith("a")
Out[16]: array([ True,  True, False, False])

In [17]: s[s.index.str.startswith("a")]
Out[17]:
a1    0
a2    1
Length: 2, dtype: int64
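The session above can be reproduced roughly as follows. The definitions of idx and s are assumptions inferred from the labels and values visible in the output, since the statements that created them are not shown here.

```python
import pandas as pd

# Assumption: idx and s mirror the index labels and values visible in the output above.
idx = pd.Index(["a1", "a2", "b1", "b2"])
s = pd.Series(range(4), index=idx)

# A boolean-returning string method on an Index gives a boolean array,
# which can be used directly as a mask on the Series.
mask = idx.str.startswith("a")
selected = s[mask]
```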

On a Series, split returns a Series of lists:

In [19]: s.str.split(",")
Out[19]:
0    [a, b]
1    [a, c]
2    [b, c]
Length: 3, dtype: object

With expand=True, it returns a DataFrame:

In [20]: s.str.split(",", expand=True)
Out[20]:
   0  1
0  a  b
1  a  c
2  b  c

[3 rows x 2 columns]

In [21]: idx = pd.Index(["a,b", "a,c", "b,c"])

On an Index, split returns an Index of lists:

In [22]: idx.str.split(",")
Out[22]: Index([['a', 'b'], ['a', 'c'], ['b', 'c']], dtype='object')

With expand=True, it returns a MultiIndex:

In [23]: idx.str.split(",", expand=True)
Out[23]:
MultiIndex([('a', 'b'),
            ('a', 'c'),
            ('b', 'c')],
           )
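A self-contained sketch of the expand keyword follows. The Series definition is an assumption inferred from the output of In [19], because the statement that created s is not shown; the other variable names are illustrative.

```python
import pandas as pd

# Assumption: s holds the same comma-separated strings as idx in the session above.
s = pd.Series(["a,b", "a,c", "b,c"])

# Without expand, each element becomes a list of pieces.
as_lists = s.str.split(",")

# With expand=True, the pieces are spread across DataFrame columns.
as_frame = s.str.split(",", expand=True)

# On an Index, expand=True produces a MultiIndex instead.
idx = pd.Index(["a,b", "a,c", "b,c"])
as_multi = idx.str.split(",", expand=True)
```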

Other enhancements#

0 -0.706771 -1.039575
1 -0.424972  0.567020
2 -1.087401 -0.673690

[3 rows x 2 columns]

API changes#

Deprecations#

Index representation#

The string representation of Index and its subclasses has now been unified. An index shows a single-line display if there are few values, a wrapped multi-line display if there are many values (but fewer than display.max_seq_items), and a truncated display (the head and tail of the data) if there are more than display.max_seq_items values. The formatting for MultiIndex is unchanged (a multi-line wrapped display). The display responds to the option display.max_seq_items, which defaults to 100. (GH 6482)

Previous behavior

In [2]: pd.Index(range(4), name='foo')
Out[2]: Int64Index([0, 1, 2, 3], dtype='int64')

In [3]: pd.Index(range(104), name='foo')
Out[3]: Int64Index([0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99, ...], dtype='int64')

In [4]: pd.date_range('20130101', periods=4, name='foo', tz='US/Eastern')
Out[4]:
<class 'pandas.tseries.index.DatetimeIndex'>
[2013-01-01 00:00:00-05:00, ..., 2013-01-04 00:00:00-05:00]
Length: 4, Freq: D, Timezone: US/Eastern

In [5]: pd.date_range('20130101', periods=104, name='foo', tz='US/Eastern')
Out[5]:
<class 'pandas.tseries.index.DatetimeIndex'>
[2013-01-01 00:00:00-05:00, ..., 2013-04-14 00:00:00-04:00]
Length: 104, Freq: D, Timezone: US/Eastern

New behavior

In [29]: pd.set_option("display.width", 80)

In [30]: pd.Index(range(4), name="foo")
Out[30]: RangeIndex(start=0, stop=4, step=1, name='foo')

In [31]: pd.Index(range(30), name="foo")
Out[31]: RangeIndex(start=0, stop=30, step=1, name='foo')

In [32]: pd.Index(range(104), name="foo")
Out[32]: RangeIndex(start=0, stop=104, step=1, name='foo')

In [33]: pd.CategoricalIndex(["a", "bb", "ccc", "dddd"], ordered=True, name="foobar")
Out[33]: CategoricalIndex(['a', 'bb', 'ccc', 'dddd'], categories=['a', 'bb', 'ccc', 'dddd'], ordered=True, dtype='category', name='foobar')

In [34]: pd.CategoricalIndex(["a", "bb", "ccc", "dddd"] * 10, ordered=True, name="foobar")
Out[34]:
CategoricalIndex(['a', 'bb', 'ccc', 'dddd', 'a', 'bb', 'ccc', 'dddd', 'a',
                  'bb', 'ccc', 'dddd', 'a', 'bb', 'ccc', 'dddd', 'a', 'bb',
                  'ccc', 'dddd', 'a', 'bb', 'ccc', 'dddd', 'a', 'bb', 'ccc',
                  'dddd', 'a', 'bb', 'ccc', 'dddd', 'a', 'bb', 'ccc', 'dddd',
                  'a', 'bb', 'ccc', 'dddd'],
                 categories=['a', 'bb', 'ccc', 'dddd'], ordered=True, dtype='category', name='foobar')

In [35]: pd.CategoricalIndex(["a", "bb", "ccc", "dddd"] * 100, ordered=True, name="foobar")
Out[35]:
CategoricalIndex(['a', 'bb', 'ccc', 'dddd', 'a', 'bb', 'ccc', 'dddd', 'a',
                  'bb',
                  ...
                  'ccc', 'dddd', 'a', 'bb', 'ccc', 'dddd', 'a', 'bb', 'ccc',
                  'dddd'],
                 categories=['a', 'bb', 'ccc', 'dddd'], ordered=True, dtype='category', name='foobar', length=400)

In [36]: pd.date_range("20130101", periods=4, name="foo", tz="US/Eastern")
Out[36]:
DatetimeIndex(['2013-01-01 00:00:00-05:00', '2013-01-02 00:00:00-05:00',
               '2013-01-03 00:00:00-05:00', '2013-01-04 00:00:00-05:00'],
              dtype='datetime64[ns, US/Eastern]', name='foo', freq='D')

In [37]: pd.date_range("20130101", periods=25, freq="D")
Out[37]:
DatetimeIndex(['2013-01-01', '2013-01-02', '2013-01-03', '2013-01-04',
               '2013-01-05', '2013-01-06', '2013-01-07', '2013-01-08',
               '2013-01-09', '2013-01-10', '2013-01-11', '2013-01-12',
               '2013-01-13', '2013-01-14', '2013-01-15', '2013-01-16',
               '2013-01-17', '2013-01-18', '2013-01-19', '2013-01-20',
               '2013-01-21', '2013-01-22', '2013-01-23', '2013-01-24',
               '2013-01-25'],
              dtype='datetime64[ns]', freq='D')

In [38]: pd.date_range("20130101", periods=104, name="foo", tz="US/Eastern")
Out[38]:
DatetimeIndex(['2013-01-01 00:00:00-05:00', '2013-01-02 00:00:00-05:00',
               '2013-01-03 00:00:00-05:00', '2013-01-04 00:00:00-05:00',
               '2013-01-05 00:00:00-05:00', '2013-01-06 00:00:00-05:00',
               '2013-01-07 00:00:00-05:00', '2013-01-08 00:00:00-05:00',
               '2013-01-09 00:00:00-05:00', '2013-01-10 00:00:00-05:00',
               ...
               '2013-04-05 00:00:00-04:00', '2013-04-06 00:00:00-04:00',
               '2013-04-07 00:00:00-04:00', '2013-04-08 00:00:00-04:00',
               '2013-04-09 00:00:00-04:00', '2013-04-10 00:00:00-04:00',
               '2013-04-11 00:00:00-04:00', '2013-04-12 00:00:00-04:00',
               '2013-04-13 00:00:00-04:00', '2013-04-14 00:00:00-04:00'],
              dtype='datetime64[ns, US/Eastern]', name='foo', length=104, freq='D')
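The effect of display.max_seq_items can be checked without building a very long index. The sketch below lowers the threshold temporarily with pd.option_context; the threshold of 10 and the item labels are arbitrary choices for demonstration, not values taken from the release notes.

```python
import pandas as pd

# Hypothetical labels, chosen only so the elements are easy to spot in the repr.
labels = pd.Index([f"item{i}" for i in range(50)])

# With the default threshold (100), all 50 elements are printed.
full_text = repr(labels)

# Assumption: a threshold of 10 is used purely to force truncation here.
with pd.option_context("display.max_seq_items", 10):
    truncated_text = repr(labels)
```

Using option_context rather than set_option restores the previous setting automatically when the block exits, so the change does not leak into later output.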

Performance improvements#

Bug fixes#

Contributors#

A total of 58 people contributed patches to this release. People with a “+” by their names contributed a patch for the first time.