API: add DatetimeBlockTZ #8260 by jreback · Pull Request #10477 · pandas-dev/pandas
ToDos:
- doc updates
- test with `Series.dt.*`
- test with csv/HDF5
- nat setting borked ATM
- HDF5 example from 0.16.2
- [ ] get_values/values - make consistent
- [ ] maybe move DatetimeTZBlock.shift mostly to DatetimeIndex.shift
Also
- This cleans up the internal blocks calling conventions a bit
- Fixes a bug in DatetimeIndex localizing when NaTs are present
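As a quick illustration of the NaT localization fix (a hypothetical check, not taken from the PR's test suite): localizing an index that contains NaT should keep the NaT intact rather than raising or mangling it.

```python
import pandas as pd

# An index with a missing value in the middle.
idx = pd.DatetimeIndex(["2013-01-01", pd.NaT, "2013-01-03"])

# Localizing should preserve the NaT and attach the tz to the rest.
localized = idx.tz_localize("US/Eastern")
print(localized)
```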
-------------------------------------------------------------------------------
Test name | head[ms] | base[ms] | ratio |
-------------------------------------------------------------------------------
timestamp_tz_ops_diff1 | 0.1804 | 159.4527 | 0.0011 | # note these are 10k element
timestamp_tz_ops_diff2 | 2.5350 | 156.4047 | 0.0162 | # note these are 10k elements
timeseries_timestamp_downsample_mean | 3.1467 | 3.3040 | 0.9524 |
timestamp_series_compare | 9.0797 | 9.1290 | 0.9946 |
timestamp_ops_diff2 | 19.7570 | 19.6819 | 1.0038 | # this is 1M elements
series_timestamp_compare | 9.3430 | 9.0226 | 1.0355 |
timestamp_ops_diff1 | 9.7457 | 9.0450 | 1.0775 | # this is 1M elements
-------------------------------------------------------------------------------
Ratio < 1.0 means the target commit is faster than the baseline.
Seed used: 1234
Target [8502474] : API: add Block.make_block
API: add DatetimeBlockWithTZ #8260
Base [16a44ad] : Merge pull request #10199 from jreback/gil
PERF: releasing the GIL, #8882
Demo
In [1]: df = DataFrame({'A' : date_range('20130101',periods=3),
...: 'B' : date_range('20130101',periods=3,tz='US/Eastern'),
...: 'C' : date_range('20130101',periods=3,tz='CET')})
In [2]: df
Out[2]:
A B C
0 2013-01-01 2013-01-01 05:00:00 2012-12-31 23:00:00
1 2013-01-02 2013-01-02 05:00:00 2013-01-01 23:00:00
2 2013-01-03 2013-01-03 05:00:00 2013-01-02 23:00:00
In [3]: df.dtypes
Out[3]:
A datetime64[ns]
B datetime64[ns, US/Eastern]
C datetime64[ns, CET]
dtype: object
In [4]: df.B
Out[4]:
0 2013-01-01 00:00:00-05:00
1 2013-01-02 00:00:00-05:00
2 2013-01-03 00:00:00-05:00
Name: B, dtype: datetime64[ns, US/Eastern]
In [5]: df.B.dt.tz_localize(None)
Out[5]:
0 2013-01-01
1 2013-01-02
2 2013-01-03
dtype: datetime64[ns]
Ooh, shiny! I'll try this out tonight.
@jreback Nice! I will try to go through it one of the coming days
@jreback one quirk I'm noticing is that if you construct a DataFrame directly from a DatetimeIndex, you lose the tz information:
In [1]: dates = date_range('2014-01-01', periods=10, tz='UTC')
In [2]: from_dict = DataFrame({'a': dates})
In [3]: from_dict.dtypes
Out[3]:
a datetime64[ns, UTC]
dtype: object
In [4]: from_index = DataFrame(dates)
In [5]: from_index.dtypes
Out[5]:
0 datetime64[ns]
dtype: object
Is this expected behavior?
Also surprising: if I take a series of dtype datetime64[ns, UTC] and call `.values` on it, I get a DatetimeIndex rather than an ndarray:
In [7]: from_dict['a'].values
Out[7]:
DatetimeIndex(['2014-01-01', '2014-01-02', '2014-01-03', '2014-01-04',
'2014-01-05', '2014-01-06', '2014-01-07', '2014-01-08',
'2014-01-09', '2014-01-10'],
dtype='datetime64[ns]', freq='D', tz='UTC')
The result I'd expect here is what you actually get from doing
In [21]: from_dict['a'].values.values
Out[21]:
array(['2013-12-31T19:00:00.000000000-0500',
'2014-01-01T19:00:00.000000000-0500',
'2014-01-02T19:00:00.000000000-0500',
'2014-01-03T19:00:00.000000000-0500',
'2014-01-04T19:00:00.000000000-0500',
'2014-01-05T19:00:00.000000000-0500',
'2014-01-06T19:00:00.000000000-0500',
'2014-01-07T19:00:00.000000000-0500',
'2014-01-08T19:00:00.000000000-0500',
'2014-01-09T19:00:00.000000000-0500'], dtype='datetime64[ns]')
do you have the current one?
this all works (it might not have in a prior one)
This is the latest jreback@19ec61d I think
In [3]: from_dict
Out[3]:
a
0 2014-01-01
1 2014-01-02
2 2014-01-03
3 2014-01-04
4 2014-01-05
5 2014-01-06
6 2014-01-07
7 2014-01-08
8 2014-01-09
9 2014-01-10
In [4]: from_dict.dtypes
Out[4]:
a datetime64[ns, UTC]
dtype: object
`Series.values` IS a DatetimeIndex; that is how it's implemented. It's very similar to how Sparse/Categorical are done. This preserves the tz info inside the object (it actually has freq too), rather than relying on an ndarray impl and passing it around. Much cleaner this way.
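To make the point concrete, here is a sketch runnable on a current pandas install (where the same idea is exposed as `Series.array`, a DatetimeArray, rather than the DatetimeIndex this PR used): the backing container carries the tz on the dtype itself, so nothing is lost when it is passed around internally.

```python
import pandas as pd

s = pd.Series(pd.date_range("20130101", periods=3, tz="US/Eastern"))

# The internal container keeps the tz on its dtype, unlike a plain
# datetime64 ndarray, which cannot carry tz metadata.
backing = s.array
print(type(backing).__name__, backing.dtype)
```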
I'm testing with a development install of this branch:
(pandas)[~/clones/pandas]@(tz:2422fe50fc)$ git log HEAD^..HEAD
commit 2422fe50fcc5aac541a0be8d67a4e86309b3e2ec
Author: Jeff Reback <jeff@reback.net>
Date: Sat Jun 27 17:55:29 2015 -0400
API: add DatetimeBlockTZ #8260
commit 757bbf92c926d4584d01bce419a576c7cb831fce
Author: Jeff Reback <jeff@reback.net>
Date: Wed Jul 1 19:38:16 2015 -0400
start on csv
commit 2422fe50fcc5aac541a0be8d67a4e86309b3e2ec
Author: Jeff Reback <jeff@reback.net>
Date: Sat Jun 27 17:55:29 2015 -0400
API: add DatetimeBlockTZ #8260
everything works except for csv round-tripping (not sure I can get it, as I can't repro the tz upon readback), but we'll see, and to/from hdf5 (but soon).
@ssanderson I have been amending that commit, so for sure update.
I pulled 2422fe50fcc5aac541a0be8d67a4e86309b3e2ec ~20 minutes ago. The commit hash would change if the content had changed.
did you run `make`? (it has a tad bit of cython code changes)
I did a `pip install -e .` in a fresh virtualenv. What in what I posted above is different from what you're seeing?
sorry, I didn't see what you meant above. DataFrame(new_index)
hang on a sec. Didnt have a test for that.
The from_dict case is working as expected modulo the unexpected type of `.values`, which sounds like it's actually as-designed. The one that I think is incorrect is direct construction from a DatetimeIndex.
very subtle path difference here....fixing
In [6]: DataFrame(date_range('20130101',periods=3,tz='US/Eastern')).dtypes
Out[6]:
0 datetime64[ns]
dtype: object
In [7]: DataFrame(date_range('20130101',periods=3,tz='US/Eastern',name='foo')).dtypes
Out[7]:
foo datetime64[ns, US/Eastern]
dtype: object
if you have a chance I'd like to see how you are actually using it; if you could post some small sample code that would be great. I can time things, but having a usecase is even better.
@jreback I think DataFrame indexing is broken with columns of tz-aware dtype:
In [2]: df = DataFrame({'a': date_range('2014-01-01', periods=10, tz='UTC')})
In [3]: df
Out[3]:
a
0 2014-01-01
1 2014-01-02
2 2014-01-03
3 2014-01-04
4 2014-01-05
5 2014-01-06
6 2014-01-07
7 2014-01-08
8 2014-01-09
9 2014-01-10
In [4]: df.iloc[5]
---------------------------------------------------------------------------
IndexError Traceback (most recent call last)
<ipython-input-4-1d4dcfe8a425> in <module>()
----> 1 df.iloc[5]
/home/ssanderson/clones/pandas/pandas/core/indexing.pyc in __getitem__(self, key)
1187 return self._getitem_tuple(key)
1188 else:
-> 1189 return self._getitem_axis(key, axis=0)
1190
1191 def _getitem_axis(self, key, axis=0):
/home/ssanderson/clones/pandas/pandas/core/indexing.pyc in _getitem_axis(self, key, axis)
1478 self._is_valid_integer(key, axis)
1479
-> 1480 return self._get_loc(key, axis=axis)
1481
1482 def _convert_to_indexer(self, obj, axis=0, is_setter=False):
/home/ssanderson/clones/pandas/pandas/core/indexing.pyc in _get_loc(self, key, axis)
87
88 def _get_loc(self, key, axis=0):
---> 89 return self.obj._ixs(key, axis=axis)
90
91 def _slice(self, obj, axis=0, kind=None):
/home/ssanderson/clones/pandas/pandas/core/frame.pyc in _ixs(self, i, axis)
1727 copy=True
1728 else:
-> 1729 new_values = self._data.fast_xs(i)
1730
1731 # if we are a copy, mark as such
/home/ssanderson/clones/pandas/pandas/core/internals.pyc in fast_xs(self, loc)
2899 """
2900 if len(self.blocks) == 1:
-> 2901 return self.blocks[0].values[:, loc]
2902
2903 items = self.items
/home/ssanderson/clones/pandas/pandas/tseries/base.pyc in __getitem__(self, key)
93 attribs['freq'] = freq
94
---> 95 result = getitem(key)
96 if result.ndim > 1:
97 return result
IndexError: too many indices for array
ok, I changed this. So now `.values` -> 'external values', and `._values` -> 'internal values'. These are currently the same for everything except DatetimeTZ. So this allows an internal implementation, and we can have an external `.values` that is different.
so:
In [1]: s = Series(date_range('20130101',periods=3,tz='US/Eastern'))
In [2]: s
Out[2]:
0 2013-01-01 00:00:00-05:00
1 2013-01-02 00:00:00-05:00
2 2013-01-03 00:00:00-05:00
dtype: datetime64[ns, US/Eastern]
In [3]: s.values
Out[3]:
array([Timestamp('2013-01-01 00:00:00-0500', tz='US/Eastern'),
Timestamp('2013-01-02 00:00:00-0500', tz='US/Eastern'),
Timestamp('2013-01-03 00:00:00-0500', tz='US/Eastern')], dtype=object)
In [4]: s._values
Out[4]: DatetimeIndex(['2013-01-01 00:00:00-05:00', '2013-01-02 00:00:00-05:00', '2013-01-03 00:00:00-05:00'], dtype='datetime64[ns, US/Eastern]', freq='D')
@jreback Thanks for this change!
Did some quick testing, and some more feedback:
What do you guys think the dtype of `.values` should be?
- object with Timestamp objects -> this is what it currently does in the PR, and is also backwards compatible (as this is how tz-aware datetime data are stored now in a frame, as objects)
- datetime64 -> I think this is the more useful return value, if the reason you access the raw `.values` numpy arrays is to do some performant operation on them. The tz can always be accessed separately (`s.dt.tz`) if you want to keep it along with the numpy array. There is also no easy way to get the datetime64 values from an object array of Timestamps, I think?
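On that last point, one workaround does exist (a sketch, not part of the PR, using hypothetical data and assuming all Timestamps share one tz): round-trip the object array through a DatetimeIndex, where `tz_convert(None)` shifts to UTC and drops the tz, leaving a plain datetime64 ndarray.

```python
import numpy as np
import pandas as pd

# An object ndarray of tz-aware Timestamps, like the .values return
# being discussed (illustrative data).
stamps = np.array(
    [pd.Timestamp("2013-01-01", tz="US/Eastern"),
     pd.Timestamp("2013-01-02", tz="US/Eastern"),
     pd.Timestamp("2013-01-03", tz="US/Eastern")],
    dtype=object,
)

# DatetimeIndex re-infers the tz; tz_convert(None) converts to UTC and
# removes it, so .values is a plain datetime64[ns] ndarray.
as_m8 = pd.DatetimeIndex(stamps).tz_convert(None).values
print(as_m8.dtype)
```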
Further, I noticed:
In [18]: df = DataFrame({'A' : date_range('20130101',periods=3),
'B' : date_range('20130101',periods=3,tz='US/Eastern'),
'C' : date_range('20130101',periods=3,tz='CET')})
In [19]: df
Out[19]:
A B C
0 2013-01-01 2013-01-01 00:00:00-05:00 2013-01-01 00:00:00+01:00
1 2013-01-02 2013-01-02 00:00:00-05:00 2013-01-02 00:00:00+01:00
2 2013-01-03 2013-01-03 00:00:00-05:00 2013-01-03 00:00:00+01:00
In [20]: s = df['B']
In [21]: s.astype('datetime64[ns]')
AttributeError: 'DatetimeIndex' object has no attribute 'to_dense'
fixed the bug:
In [2]: df.astype('datetime64[ns]')
Out[2]:
A B C
0 2013-01-01 2013-01-01 05:00:00 2012-12-31 23:00:00
1 2013-01-02 2013-01-02 05:00:00 2013-01-01 23:00:00
2 2013-01-03 2013-01-03 05:00:00 2013-01-02 23:00:00
In [3]: df.astype('datetime64[ns]').dtypes
Out[3]:
A datetime64[ns]
B datetime64[ns]
C datetime64[ns]
dtype: object
For the first, the values are converted to UTC and displayed as if it's `astype('datetime64[ns]')`:
In [18]: df['B'].astype('datetime64[ns]')
Out[18]:
0 2013-01-01 05:00:00
1 2013-01-02 05:00:00
2 2013-01-03 05:00:00
Name: B, dtype: datetime64[ns]
In [19]: df['B'].astype('datetime64[ns]').values
Out[19]:
array(['2013-01-01T00:00:00.000000000-0500',
'2013-01-02T00:00:00.000000000-0500',
'2013-01-03T00:00:00.000000000-0500'], dtype='datetime64[ns]')
I could be on board with that, except again you lose the fact that this has a meaningful tz; but if you are really accessing `.values` then you know what you are doing and want a numpy array anyhow.
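For readers on a current pandas, where tz-aware-to-naive `astype` has since been disallowed, the two ways of dropping a tz described in this thread map onto `tz_convert(None)` (shift to UTC, then drop the tz) and `tz_localize(None)` (drop the tz, keep local wall times). A sketch:

```python
import pandas as pd

s = pd.Series(pd.date_range("20130101", periods=3, tz="US/Eastern"))

# tz_convert(None): convert to UTC, then drop the tz; this matches the
# "converted to UTC" behavior described above.
utc_naive = s.dt.tz_convert(None)

# tz_localize(None): drop the tz but keep the local wall-clock times.
wall_naive = s.dt.tz_localize(None)

print(utc_naive.iloc[0])   # 2013-01-01 05:00:00
print(wall_naive.iloc[0])  # 2013-01-01 00:00:00
```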
ok making this change. it actually is exposing some bugs.....:)
ok, updated as I described above.
👍 from me on this proposal. If I'm accessing `.values`, it's almost always because I care about performance or because I'm doing something numpy-specific.
Also +1 on `.values` returning a numpy array of datetime64s (without tz).
ok, pls have a final look if desired.
Final comment: the series `.values` now gives you the datetime64 values, but when having multiple columns, these are still the Timestamp objects. This seems a bit inconsistent:
In [10]: df['B'].values
Out[10]:
array(['2013-01-01T06:00:00.000000000+0100',
'2013-01-02T06:00:00.000000000+0100',
'2013-01-03T06:00:00.000000000+0100'], dtype='datetime64[ns]')
In [11]: df.values
Out[11]:
array([[Timestamp('2013-01-01 00:00:00'),
Timestamp('2013-01-01 00:00:00-0500', tz='US/Eastern'),
Timestamp('2013-01-01 00:00:00+0100', tz='CET')],
[Timestamp('2013-01-02 00:00:00'),
Timestamp('2013-01-02 00:00:00-0500', tz='US/Eastern'),
Timestamp('2013-01-02 00:00:00+0100', tz='CET')],
[Timestamp('2013-01-03 00:00:00'),
Timestamp('2013-01-03 00:00:00-0500', tz='US/Eastern'),
Timestamp('2013-01-03 00:00:00+0100', tz='CET')]], dtype=object)
hmm, the interleaving on a DataFrame `.values` could go either way. IOW, if you had an object column mixed in then that would be correct (it ends up as an object array and nothing is cast), but I suppose if it's just mixed datetime-like then I can coerce.
Ah, yes, that is true. In this case, if it's not too difficult, I would say to coerce them all to datetime64, but leaving it as is is also not that strange then.
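If a caller does want the coerced result, a sketch of a workaround (not from the PR; assumes all datetime-like columns) is to normalize each tz-aware column to naive UTC before taking `.values`:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "A": pd.date_range("20130101", periods=3),
    "B": pd.date_range("20130101", periods=3, tz="US/Eastern"),
    "C": pd.date_range("20130101", periods=3, tz="CET"),
})

# Mixed naive/aware columns interleave to an object ndarray of Timestamps.
print(df.values.dtype)

# Converting each aware column to UTC and dropping its tz makes the
# frame homogeneous, so .values is a plain datetime64[ns] ndarray.
coerced = df.apply(
    lambda col: col.dt.tz_convert(None) if col.dt.tz is not None else col
)
print(coerced.values.dtype)
```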
fix scalar comparisons vs None generally
fix NaT formatting in Series
TST: skip postgresql test with tz's
update for msgpack
Conflicts: pandas/core/base.py pandas/core/categorical.py pandas/core/format.py pandas/tests/test_base.py pandas/util/testing.py
full interop for tz-aware Series & timedeltas pandas-dev#10763
I have left it as is. Then this is very consistent and not suddenly changed if you add, say, an 'object' field or whatever. We could always adjust this later.
ok, bombs away....
the initial impl was only a week or so.....2 months to make it work properly....:>
jreback added a commit that referenced this pull request
API: add DatetimeBlockTZ #8260