Version 0.16.0 (March 22, 2015)

This is a major release from 0.15.2 and includes a small number of API changes, several new features, enhancements, and performance improvements along with a large number of bug fixes. We recommend that all users upgrade to this version.

Highlights include:

- A new DataFrame.assign() method (see below)
- Conversion to and from scipy.sparse via SparseSeries.to_coo() / from_coo() (see below)
- Backwards incompatible API changes to Timedelta accessors (see below)
- Indexing and Categorical API changes (see below)

Check the API Changes and deprecations before updating.


New features#

DataFrame assign#

Inspired by dplyr’s mutate verb, DataFrame has a new assign() method. The function signature for assign is simply **kwargs. The keys are the column names for the new fields, and the values are either a value to be inserted (for example, a Series or NumPy array), or a function of one argument to be called on the DataFrame. The new values are inserted, and the entire DataFrame (with all original and new columns) is returned.

In [1]: iris = pd.read_csv('data/iris.data')

In [2]: iris.head()
Out[2]:
   SepalLength  SepalWidth  PetalLength  PetalWidth         Name
0          5.1         3.5          1.4         0.2  Iris-setosa
1          4.9         3.0          1.4         0.2  Iris-setosa
2          4.7         3.2          1.3         0.2  Iris-setosa
3          4.6         3.1          1.5         0.2  Iris-setosa
4          5.0         3.6          1.4         0.2  Iris-setosa

[5 rows x 5 columns]

In [3]: iris.assign(sepal_ratio=iris['SepalWidth'] / iris['SepalLength']).head()
Out[3]:
   SepalLength  SepalWidth  PetalLength  PetalWidth         Name  sepal_ratio
0          5.1         3.5          1.4         0.2  Iris-setosa     0.686275
1          4.9         3.0          1.4         0.2  Iris-setosa     0.612245
2          4.7         3.2          1.3         0.2  Iris-setosa     0.680851
3          4.6         3.1          1.5         0.2  Iris-setosa     0.673913
4          5.0         3.6          1.4         0.2  Iris-setosa     0.720000

[5 rows x 6 columns]

Above was an example of inserting a precomputed value. We can also pass in a function to be evaluated.

In [4]: iris.assign(sepal_ratio=lambda x: (x['SepalWidth']
   ...:                                    / x['SepalLength'])).head()
   ...:
Out[4]:
   SepalLength  SepalWidth  PetalLength  PetalWidth         Name  sepal_ratio
0          5.1         3.5          1.4         0.2  Iris-setosa     0.686275
1          4.9         3.0          1.4         0.2  Iris-setosa     0.612245
2          4.7         3.2          1.3         0.2  Iris-setosa     0.680851
3          4.6         3.1          1.5         0.2  Iris-setosa     0.673913
4          5.0         3.6          1.4         0.2  Iris-setosa     0.720000

[5 rows x 6 columns]

The power of assign comes when used in chains of operations. For example, we can limit the DataFrame to rows with a SepalLength greater than 5, calculate the ratios, and plot:

In [5]: iris = pd.read_csv('data/iris.data')

In [6]: (iris.query('SepalLength > 5')
   ...:      .assign(SepalRatio=lambda x: x.SepalWidth / x.SepalLength,
   ...:              PetalRatio=lambda x: x.PetalWidth / x.PetalLength)
   ...:      .plot(kind='scatter', x='SepalRatio', y='PetalRatio'))
   ...:
Out[6]: <Axes: xlabel='SepalRatio', ylabel='PetalRatio'>

[Figure: scatter plot of PetalRatio vs. SepalRatio]

See the documentation for more. (GH 9229)
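As a quick check of the semantics described above (a minimal sketch using a toy DataFrame rather than the iris data), assign returns a new DataFrame and leaves the original untouched, and it accepts either precomputed values or callables:

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3]})

# Pass either a precomputed value or a one-argument callable.
out = df.assign(b=df["a"] * 2,           # precomputed Series
                c=lambda x: x["a"] + 1)  # callable, evaluated on the DataFrame

print(list(out.columns))  # the returned frame carries the new columns
print(list(df.columns))   # the original frame is unchanged
```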

Interaction with scipy.sparse#

Added SparseSeries.to_coo() and SparseSeries.from_coo() methods (GH 8048) for converting to and from scipy.sparse.coo_matrix instances (see here). For example, given a SparseSeries with MultiIndex we can convert to a scipy.sparse.coo_matrix by specifying the row and column labels as index levels:

s = pd.Series([3.0, np.nan, 1.0, 3.0, np.nan, np.nan])
s.index = pd.MultiIndex.from_tuples([(1, 2, 'a', 0),
                                     (1, 2, 'a', 1),
                                     (1, 1, 'b', 0),
                                     (1, 1, 'b', 1),
                                     (2, 1, 'b', 0),
                                     (2, 1, 'b', 1)],
                                    names=['A', 'B', 'C', 'D'])

s

# convert to a SparseSeries
ss = s.to_sparse()
ss

A, rows, columns = ss.to_coo(row_levels=['A', 'B'],
                             column_levels=['C', 'D'],
                             sort_labels=False)

A
A.todense()
rows
columns

The from_coo method is a convenience method for creating a SparseSeries from a scipy.sparse.coo_matrix:

from scipy import sparse
A = sparse.coo_matrix(([3.0, 1.0, 2.0], ([1, 0, 0], [0, 2, 3])),
                      shape=(3, 4))
A
A.todense()

ss = pd.SparseSeries.from_coo(A)
ss
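For orientation, a COO (coordinate) matrix is just three parallel sequences: row indices, column indices, and data values. The following NumPy-only sketch (a hypothetical helper, not a pandas or SciPy API) shows the densification that todense() performs on the triplets used above:

```python
import numpy as np

def coo_to_dense(data, rows, cols, shape):
    """Scatter (row, col, value) triplets into a dense 2-D array."""
    out = np.zeros(shape)
    for v, r, c in zip(data, rows, cols):
        out[r, c] += v  # duplicate coordinates are summed, matching COO semantics
    return out

# Same triplets as the from_coo example above.
dense = coo_to_dense([3.0, 1.0, 2.0], [1, 0, 0], [0, 2, 3], (3, 4))
print(dense)
```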

String methods enhancements#

Several string methods were enhanced in this release. For example, str.find() returns the lowest index of a matching substring (or -1 when there is no match):

In [9]: s.str.find('ab')
Out[9]:
0     0
1    -1
2    -1
Length: 3, dtype: int64

If no replacement string is passed, str.slice_replace() replaces the slice with an empty string:

In [14]: s.str.slice_replace(0, 1)
Out[14]:
0    BCD
1    FGH
2     JK
Length: 3, dtype: object

Other enhancements#

read_excel() can now read multiple sheets in one call by passing a list of sheet names or positions to sheetname; the result is a dictionary of DataFrames. The following returns the first and fourth sheets:

pd.read_excel('path_to_file.xls', sheetname=['Sheet1', 3])

Backwards incompatible API changes#

Changes in timedelta#

In v0.15.0 a new scalar type Timedelta, a sub-class of datetime.timedelta, was introduced. Mentioned there was a notice of an API change with respect to the .seconds accessor. The intent was to provide a user-friendly set of accessors that give the ‘natural’ value for each unit, e.g. if you had a Timedelta('1 day, 10:11:12'), then .seconds would return 12. However, this is at odds with the definition of datetime.timedelta, which defines .seconds as 10 * 3600 + 11 * 60 + 12 == 36672.

So in v0.16.0, we are restoring the API to match that of datetime.timedelta. This affects the .seconds and .microseconds accessors and removes the .hours, .minutes, and .milliseconds accessors; the individual component values remain available through the .components accessor. These changes affect TimedeltaIndex and the Series .dt accessor as well. (GH 9185, GH 9139)

Previous behavior

In [2]: t = pd.Timedelta('1 day, 10:11:12.100123')

In [3]: t.days
Out[3]: 1

In [4]: t.seconds
Out[4]: 12

In [5]: t.microseconds
Out[5]: 123

New behavior

In [17]: t = pd.Timedelta('1 day, 10:11:12.100123')

In [18]: t.days
Out[18]: 1

In [19]: t.seconds
Out[19]: 36672

In [20]: t.microseconds
Out[20]: 100123

Using .components allows the full component access

In [21]: t.components
Out[21]: Components(days=1, hours=10, minutes=11, seconds=12, milliseconds=100, microseconds=123, nanoseconds=0)

In [22]: t.components.seconds
Out[22]: 12
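The restored behavior matches the standard library exactly; a quick check against datetime.timedelta (stdlib only, plus pandas):

```python
import datetime
import pandas as pd

t = pd.Timedelta('1 day, 10:11:12.100123')
py = datetime.timedelta(days=1, hours=10, minutes=11,
                        seconds=12, microseconds=100123)

# .seconds is the whole-seconds-within-the-day component, as in the stdlib.
assert t.seconds == py.seconds == 10 * 3600 + 11 * 60 + 12  # 36672
assert t.microseconds == py.microseconds == 100123

# The 'natural' per-unit values moved to .components.
assert t.components.hours == 10
assert t.components.seconds == 12
```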

Indexing changes#

The behavior of a small sub-set of edge cases for using .loc has changed (GH 8613). Furthermore, we have improved the content of the error messages that are raised:

In [23]: df = pd.DataFrame(np.random.randn(5, 4),
   ....:                   index=pd.date_range('20130101', periods=5),
   ....:                   columns=list('ABCD'))
In [24]: df
Out[24]:
                   A         B         C         D
2013-01-01  0.469112 -0.282863 -1.509059 -1.135632
2013-01-02  1.212112 -0.173215  0.119209 -1.044236
2013-01-03 -0.861849 -2.104569 -0.494929  1.071804
2013-01-04  0.721555 -0.706771 -1.039575  0.271860
2013-01-05 -0.424972  0.567020  0.276232 -1.087401
[5 rows x 4 columns]
In [25]: s = pd.Series(range(5), [-2, -1, 1, 2, 3])
In [26]: s
Out[26]:
-2 0
-1 1
1 2
2 3
3 4
Length: 5, dtype: int64

Previous behavior

In [4]: df.loc['2013-01-02':'2013-01-10']
KeyError: 'stop bound [2013-01-10] is not in the [index]'
In [6]: s.loc[-10:3]
KeyError: 'start bound [-10] is not the [index]'

New behavior

In [27]: df.loc['2013-01-02':'2013-01-10']
Out[27]:
A B C D
2013-01-02 1.212112 -0.173215 0.119209 -1.044236
2013-01-03 -0.861849 -2.104569 -0.494929 1.071804
2013-01-04 0.721555 -0.706771 -1.039575 0.271860
2013-01-05 -0.424972 0.567020 0.276232 -1.087401
[4 rows x 4 columns]
In [28]: s.loc[-10:3]
Out[28]:
-2 0
-1 1
1 2
2 3
3 4
Length: 5, dtype: int64

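The relaxation applies to slicing on a monotonic index; a scalar lookup of a missing label (as opposed to a slice bound) still raises. A minimal sketch of both cases using the same Series as above:

```python
import pandas as pd

s = pd.Series(range(5), index=[-2, -1, 1, 2, 3])  # monotonic increasing

# Out-of-range slice bounds are tolerated and clipped to the index.
sliced = s.loc[-10:3]
assert list(sliced) == [0, 1, 2, 3, 4]

# A scalar lookup of a missing label is still a KeyError.
missing_raises = False
try:
    s.loc[-10]
except KeyError:
    missing_raises = True
assert missing_raises
```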

Categorical changes#

In prior versions, Categoricals that had an unspecified ordering (meaning no ordered keyword was passed) defaulted to ordered Categoricals. Going forward, the ordered keyword in the Categorical constructor defaults to False; ordering must now be explicit.

Furthermore, previously you could change the ordered attribute of a Categorical by just setting the attribute, e.g. cat.ordered = True; this is now deprecated and you should use cat.as_ordered() or cat.as_unordered(). By default these return a new object and do not modify the existing object. (GH 9347, GH 9190)

Previous behavior

In [3]: s = pd.Series([0, 1, 2], dtype='category')

In [4]: s
Out[4]:
0    0
1    1
2    2
dtype: category
Categories (3, int64): [0 < 1 < 2]

In [5]: s.cat.ordered
Out[5]: True

In [6]: s.cat.ordered = False

In [7]: s
Out[7]:
0    0
1    1
2    2
dtype: category
Categories (3, int64): [0, 1, 2]

New behavior

In [29]: s = pd.Series([0, 1, 2], dtype='category')

In [30]: s
Out[30]:
0    0
1    1
2    2
Length: 3, dtype: category
Categories (3, int64): [0, 1, 2]

In [31]: s.cat.ordered
Out[31]: False

In [32]: s = s.cat.as_ordered()

In [33]: s
Out[33]:
0    0
1    1
2    2
Length: 3, dtype: category
Categories (3, int64): [0 < 1 < 2]

In [34]: s.cat.ordered
Out[34]: True

Ordering can also be set in the Categorical constructor:

In [35]: s = pd.Series(pd.Categorical([0, 1, 2], ordered=True))

In [36]: s
Out[36]:
0    0
1    1
2    2
Length: 3, dtype: category
Categories (3, int64): [0 < 1 < 2]

In [37]: s.cat.ordered
Out[37]: True

For ease of creation of series of categorical data, we have added the ability to pass keywords when calling .astype(). These are passed directly to the constructor.

In [54]: s = pd.Series(["a", "b", "c", "a"]).astype('category', ordered=True)

In [55]: s
Out[55]:
0    a
1    b
2    c
3    a
dtype: category
Categories (3, object): [a < b < c]

In [56]: s = (pd.Series(["a", "b", "c", "a"])
   ....:        .astype('category', categories=list('abcdef'), ordered=False))

In [57]: s
Out[57]:
0    a
1    b
2    c
3    a
dtype: category
Categories (6, object): [a, b, c, d, e, f]
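Note for readers on later pandas versions: the extra .astype() keywords shown above were subsequently removed, and category ordering is instead specified through a CategoricalDtype object. A hedged sketch of the current spelling, also showing what an explicit ordering buys you (comparisons and min/max):

```python
import pandas as pd
from pandas.api.types import CategoricalDtype

# Current-pandas spelling: the ordering lives on the dtype object.
dtype = CategoricalDtype(categories=list("abcdef"), ordered=True)
s = pd.Series(["a", "b", "c", "a"]).astype(dtype)

assert s.cat.ordered
# Ordered categoricals support comparisons and min/max.
assert (s < "c").tolist() == [True, True, False, True]
assert s.min() == "a" and s.max() == "c"
```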

Other API changes#

Deprecations#

Removal of prior version deprecations/changes#

Performance improvements#

Bug fixes#


Contributors#

A total of 60 people contributed patches to this release. People with a “+” by their names contributed a patch for the first time.