Version 0.8.0 (June 29, 2012) — pandas 3.0.0.dev0+2104.ge637b4290d documentation (original) (raw)

This is a major release from 0.7.3 and includes extensive work on the time series handling and processing infrastructure as well as a great deal of new functionality throughout the library. It includes over 700 commits from more than 20 distinct authors. Most pandas 0.7.3 and earlier users should not experience any issues upgrading, but due to the migration to the NumPy datetime64 dtype, there may be a number of bugs and incompatibilities lurking. Lingering incompatibilities will be fixed ASAP in a 0.8.1 release if necessary. See the full release notes or issue tracker on GitHub for a complete list.

Support for non-unique indexes#

All objects can now work with non-unique indexes. Data alignment / join operations work according to SQL join semantics (including, if application, index duplication in many-to-many joins)

NumPy datetime64 dtype and 1.6 dependency#

Time series data are now represented using NumPy’s datetime64 dtype; thus, pandas 0.8.0 now requires at least NumPy 1.6. It has been tested and verified to work with the development version (1.7+) of NumPy as well which includes some significant user-facing API changes. NumPy 1.6 also has a number of bugs having to do with nanosecond resolution data, so I recommend that you steer clear of NumPy 1.6’s datetime64 API functions (though limited as they are) and only interact with this data using the interface that pandas provides.

See the end of the 0.8.0 section for a “porting” guide listing potential issues for users migrating legacy code bases from pandas 0.7 or earlier to 0.8.0.

Bug fixes to the 0.7.x series for legacy NumPy < 1.6 users will be provided as they arise. There will be no more further development in 0.7.x beyond bug fixes.

Time Series changes and improvements#

Note

With this release, legacy scikits.timeseries users should be able to port their code to use pandas.

Other new features#

New plotting methods#

import pandas as pd

fx = pd.read_pickle("data/fx_prices") import matplotlib.pyplot as plt

Series.plot now supports a secondary_y option:

plt.figure()

fx["FR"].plot(style="g")

fx["IT"].plot(style="k--", secondary_y=True)

Vytautas Jancauskas, the 2012 GSOC participant, has added many new plot types. For example, 'kde' is a new option:

s = pd.Series( np.concatenate((np.random.randn(1000), np.random.randn(1000) * 0.5 + 3)) ) plt.figure() s.hist(density=True, alpha=0.2) s.plot(kind="kde")

See the plotting page for much more.

Other API changes#

Potential porting issues for pandas <= 0.7.3 users#

The major change that may affect you in pandas 0.8.0 is that time series indexes use NumPy’s datetime64 data type instead of dtype=object arrays of Python’s built-in datetime.datetime objects. DateRange has been replaced by DatetimeIndex but otherwise behaved identically. But, if you have code that converts DateRange or Index objects that used to containdatetime.datetime values to plain NumPy arrays, you may have bugs lurking with code using scalar values because you are handing control over to NumPy:

In [1]: import datetime

In [2]: rng = pd.date_range("1/1/2000", periods=10)

In [3]: rng[5] Out[3]: Timestamp('2000-01-06 00:00:00')

In [4]: isinstance(rng[5], datetime.datetime) Out[4]: True

In [5]: rng_asarray = np.asarray(rng)

In [6]: scalar_val = rng_asarray[5]

In [7]: type(scalar_val) Out[7]: numpy.datetime64

pandas’s Timestamp object is a subclass of datetime.datetime that has nanosecond support (the nanosecond field store the nanosecond value between 0 and 999). It should substitute directly into any code that useddatetime.datetime values before. Thus, I recommend not castingDatetimeIndex to regular NumPy arrays.

If you have code that requires an array of datetime.datetime objects, you have a couple of options. First, the astype(object) method of DatetimeIndexproduces an array of Timestamp objects:

In [8]: stamp_array = rng.astype(object)

In [9]: stamp_array Out[9]: Index([2000-01-01 00:00:00, 2000-01-02 00:00:00, 2000-01-03 00:00:00, 2000-01-04 00:00:00, 2000-01-05 00:00:00, 2000-01-06 00:00:00, 2000-01-07 00:00:00, 2000-01-08 00:00:00, 2000-01-09 00:00:00, 2000-01-10 00:00:00], dtype='object')

In [10]: stamp_array[5] Out[10]: Timestamp('2000-01-06 00:00:00')

To get an array of proper datetime.datetime objects, use theto_pydatetime method:

In [11]: dt_array = rng.to_pydatetime()

In [12]: dt_array Out[12]: array([datetime.datetime(2000, 1, 1, 0, 0), datetime.datetime(2000, 1, 2, 0, 0), datetime.datetime(2000, 1, 3, 0, 0), datetime.datetime(2000, 1, 4, 0, 0), datetime.datetime(2000, 1, 5, 0, 0), datetime.datetime(2000, 1, 6, 0, 0), datetime.datetime(2000, 1, 7, 0, 0), datetime.datetime(2000, 1, 8, 0, 0), datetime.datetime(2000, 1, 9, 0, 0), datetime.datetime(2000, 1, 10, 0, 0)], dtype=object)

In [13]: dt_array[5] Out[13]: datetime.datetime(2000, 1, 6, 0, 0)

matplotlib knows how to handle datetime.datetime but not Timestamp objects. While I recommend that you plot time series using TimeSeries.plot, you can either use to_pydatetime or register a converter for the Timestamp type. See matplotlib documentation for more on this.

Warning

There are bugs in the user-facing API with the nanosecond datetime64 unit in NumPy 1.6. In particular, the string version of the array shows garbage values, and conversion to dtype=object is similarly broken.

In [14]: rng = pd.date_range("1/1/2000", periods=10)

In [15]: rng Out[15]: DatetimeIndex(['2000-01-01', '2000-01-02', '2000-01-03', '2000-01-04', '2000-01-05', '2000-01-06', '2000-01-07', '2000-01-08', '2000-01-09', '2000-01-10'], dtype='datetime64[ns]', freq='D')

In [16]: np.asarray(rng) Out[16]: array(['2000-01-01T00:00:00.000000000', '2000-01-02T00:00:00.000000000', '2000-01-03T00:00:00.000000000', '2000-01-04T00:00:00.000000000', '2000-01-05T00:00:00.000000000', '2000-01-06T00:00:00.000000000', '2000-01-07T00:00:00.000000000', '2000-01-08T00:00:00.000000000', '2000-01-09T00:00:00.000000000', '2000-01-10T00:00:00.000000000'], dtype='datetime64[ns]')

In [17]: converted = np.asarray(rng, dtype=object)

In [18]: converted[5] Out[18]: Timestamp('2000-01-06 00:00:00')

Trust me: don’t panic. If you are using NumPy 1.6 and restrict your interaction with datetime64 values to pandas’s API you will be just fine. There is nothing wrong with the data-type (a 64-bit integer internally); all of the important data processing happens in pandas and is heavily tested. I strongly recommend that you do not work directly with datetime64 arrays in NumPy 1.6 and only use the pandas API.

Support for non-unique indexes: In the latter case, you may have code inside a try:... catch: block that failed due to the index not being unique. In many cases it will no longer fail (some method like append still check for uniqueness unless disabled). However, all is not lost: you can inspect index.is_unique and raise an exception explicitly if it isFalse or go to a different code branch.

Contributors#

A total of 27 people contributed patches to this release. People with a “+” by their names contributed a patch for the first time.