API: add DatetimeBlockTZ #8260 by jreback · Pull Request #10477 · pandas-dev/pandas
ToDos:
- doc updates
- test with `Series.dt.*`
- test with csv/HDF5
- nat setting borked ATM
- HDF5 example from 0.16.2
- [ ] get_values/values - make consistent
- [ ] maybe move DatetimeTZBlock.shift mostly to DatetimeIndex.shift
Also
- This cleans up the internal blocks calling conventions a bit
- Fixes a bug in DatetimeIndex localizing when NaTs are present
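As a quick illustration of the NaT localization fix (a hypothetical check, not taken from the PR's test suite): localizing an index that contains NaT should keep the NaT intact rather than raising or mangling it.

```python
import pandas as pd

# An index with a missing value in the middle.
idx = pd.DatetimeIndex(["2013-01-01", pd.NaT, "2013-01-03"])

# Localizing should preserve the NaT and attach the tz to the rest.
localized = idx.tz_localize("US/Eastern")
print(localized)
```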
-------------------------------------------------------------------------------
Test name | head[ms] | base[ms] | ratio |
-------------------------------------------------------------------------------
timestamp_tz_ops_diff1 | 0.1804 | 159.4527 | 0.0011 | # note these are 10k element
timestamp_tz_ops_diff2 | 2.5350 | 156.4047 | 0.0162 | # note these are 10k elements
timeseries_timestamp_downsample_mean | 3.1467 | 3.3040 | 0.9524 |
timestamp_series_compare | 9.0797 | 9.1290 | 0.9946 |
timestamp_ops_diff2 | 19.7570 | 19.6819 | 1.0038 | # this is 1M elements
series_timestamp_compare | 9.3430 | 9.0226 | 1.0355 |
timestamp_ops_diff1 | 9.7457 | 9.0450 | 1.0775 | # this is 1M elements
-------------------------------------------------------------------------------
Ratio < 1.0 means the target commit is faster than the baseline.
Seed used: 1234
Target [8502474] : API: add Block.make_block
API: add DatetimeBlockWithTZ #8260
Base [16a44ad] : Merge pull request #10199 from jreback/gil
PERF: releasing the GIL, #8882
Demo
In [1]: df = DataFrame({'A' : date_range('20130101',periods=3),
...: 'B' : date_range('20130101',periods=3,tz='US/Eastern'),
...: 'C' : date_range('20130101',periods=3,tz='CET')})
In [2]: df
Out[2]:
A B C
0 2013-01-01 2013-01-01 05:00:00 2012-12-31 23:00:00
1 2013-01-02 2013-01-02 05:00:00 2013-01-01 23:00:00
2 2013-01-03 2013-01-03 05:00:00 2013-01-02 23:00:00
In [3]: df.dtypes
Out[3]:
A datetime64[ns]
B datetime64[ns, US/Eastern]
C datetime64[ns, CET]
dtype: object
In [4]: df.B
Out[4]:
0 2013-01-01 00:00:00-05:00
1 2013-01-02 00:00:00-05:00
2 2013-01-03 00:00:00-05:00
Name: B, dtype: datetime64[ns, US/Eastern]
In [5]: df.B.dt.tz_localize(None)
Out[5]:
0 2013-01-01
1 2013-01-02
2 2013-01-03
dtype: datetime64[ns]
Ooh, shiny! I'll try this out tonight.
@jreback Nice! I will try to go through it one of the coming days
@jreback one quirk I'm noticing is that if you construct a DataFrame directly from a DatetimeIndex, you lose the tz information:
In [1]: dates = date_range('2014-01-01', periods=10, tz='UTC')
In [2]: from_dict = DataFrame({'a': dates})
In [3]: from_dict.dtypes
Out[3]:
a datetime64[ns, UTC]
dtype: object
In [4]: from_index = DataFrame(dates)
In [5]: from_index.dtypes
Out[5]:
0 datetime64[ns]
dtype: object
Is this expected behavior?
Also surprising: if I take a series of dtype datetime64[ns, UTC] and call `.values` on it, I get a DatetimeIndex rather than an ndarray:
In [7]: from_dict['a'].values
Out[7]:
DatetimeIndex(['2014-01-01', '2014-01-02', '2014-01-03', '2014-01-04',
'2014-01-05', '2014-01-06', '2014-01-07', '2014-01-08',
'2014-01-09', '2014-01-10'],
dtype='datetime64[ns]', freq='D', tz='UTC')
The result I'd expect here is what you actually get from doing
In [21]: from_dict['a'].values.values
Out[21]:
array(['2013-12-31T19:00:00.000000000-0500',
'2014-01-01T19:00:00.000000000-0500',
'2014-01-02T19:00:00.000000000-0500',
'2014-01-03T19:00:00.000000000-0500',
'2014-01-04T19:00:00.000000000-0500',
'2014-01-05T19:00:00.000000000-0500',
'2014-01-06T19:00:00.000000000-0500',
'2014-01-07T19:00:00.000000000-0500',
'2014-01-08T19:00:00.000000000-0500',
'2014-01-09T19:00:00.000000000-0500'], dtype='datetime64[ns]')
do you have the current one?
this all works (it might not have in a prior one)
This is the latest jreback@19ec61d I think
In [3]: from_dict
Out[3]:
a
0 2014-01-01
1 2014-01-02
2 2014-01-03
3 2014-01-04
4 2014-01-05
5 2014-01-06
6 2014-01-07
7 2014-01-08
8 2014-01-09
9 2014-01-10
In [4]: from_dict.dtypes
Out[4]:
a datetime64[ns, UTC]
dtype: object
`Series.values` IS a DatetimeIndex; that is how it's implemented. It's very similar to how Sparse/Categorical are done. This preserves the tz info inside the object (it actually has freq too), rather than relying on an ndarray impl and passing it around. Much cleaner this way.
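To make the point concrete, here is a sketch runnable on a current pandas install (where the same idea is exposed as `Series.array`, a DatetimeArray, rather than the DatetimeIndex this PR used): the backing container carries the tz on the dtype itself, so nothing is lost when it is passed around internally.

```python
import pandas as pd

s = pd.Series(pd.date_range("20130101", periods=3, tz="US/Eastern"))

# The internal container keeps the tz on its dtype, unlike a plain
# datetime64 ndarray, which cannot carry tz metadata.
backing = s.array
print(type(backing).__name__, backing.dtype)
```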
I'm testing with a development install of this branch:
(pandas)[~/clones/pandas]@(tz:2422fe50fc)$ git log HEAD^..HEAD
commit 2422fe50fcc5aac541a0be8d67a4e86309b3e2ec
Author: Jeff Reback <jeff@reback.net>
Date: Sat Jun 27 17:55:29 2015 -0400
API: add DatetimeBlockTZ #8260
commit 757bbf92c926d4584d01bce419a576c7cb831fce
Author: Jeff Reback <jeff@reback.net>
Date: Wed Jul 1 19:38:16 2015 -0400
start on csv
commit 2422fe50fcc5aac541a0be8d67a4e86309b3e2ec
Author: Jeff Reback <jeff@reback.net>
Date: Sat Jun 27 17:55:29 2015 -0400
API: add DatetimeBlockTZ #8260
everything works except for csv round-tripping (not sure I can get it, as I can't repro the tz upon readback), but we'll see, and to/from hdf5 (but soon).
@ssanderson I have been amending that commit, so for sure update.
I pulled 2422fe50fcc5aac541a0be8d67a4e86309b3e2ec ~20 minutes ago. The commit hash would change if the content had changed.
did you run `make`? (it has a tad bit of cython code changes)
I did a `pip install -e .` in a fresh virtualenv. What in what I posted above is different from what you're seeing?
sorry, I didn't see what you meant above. DataFrame(new_index)
hang on a sec. Didnt have a test for that.
The from_dict case is working as expected modulo the unexpected type of `.values`, which sounds like it's actually as-designed. The one that I think is incorrect is direct construction from a DatetimeIndex.
very subtle path difference here....fixing
In [6]: DataFrame(date_range('20130101',periods=3,tz='US/Eastern')).dtypes
Out[6]:
0 datetime64[ns]
dtype: object
In [7]: DataFrame(date_range('20130101',periods=3,tz='US/Eastern',name='foo')).dtypes
Out[7]:
foo datetime64[ns, US/Eastern]
dtype: object
if you have a chance I'd like to see how you are actually using it; if you could post some small sample code that would be great. I can time things, but having a usecase is even better.
@jreback I think DataFrame indexing is broken with columns of tz-aware dtype:
In [2]: df = DataFrame({'a': date_range('2014-01-01', periods=10, tz='UTC')})
In [3]: df
Out[3]:
a
0 2014-01-01
1 2014-01-02
2 2014-01-03
3 2014-01-04
4 2014-01-05
5 2014-01-06
6 2014-01-07
7 2014-01-08
8 2014-01-09
9 2014-01-10
In [4]: df.iloc[5]
---------------------------------------------------------------------------
IndexError Traceback (most recent call last)
<ipython-input-4-1d4dcfe8a425> in <module>()
----> 1 df.iloc[5]
/home/ssanderson/clones/pandas/pandas/core/indexing.pyc in __getitem__(self, key)
1187 return self._getitem_tuple(key)
1188 else:
-> 1189 return self._getitem_axis(key, axis=0)
1190
1191 def _getitem_axis(self, key, axis=0):
/home/ssanderson/clones/pandas/pandas/core/indexing.pyc in _getitem_axis(self, key, axis)
1478 self._is_valid_integer(key, axis)
1479
-> 1480 return self._get_loc(key, axis=axis)
1481
1482 def _convert_to_indexer(self, obj, axis=0, is_setter=False):
/home/ssanderson/clones/pandas/pandas/core/indexing.pyc in _get_loc(self, key, axis)
87
88 def _get_loc(self, key, axis=0):
---> 89 return self.obj._ixs(key, axis=axis)
90
91 def _slice(self, obj, axis=0, kind=None):
/home/ssanderson/clones/pandas/pandas/core/frame.pyc in _ixs(self, i, axis)
1727 copy=True
1728 else:
-> 1729 new_values = self._data.fast_xs(i)
1730
1731 # if we are a copy, mark as such
/home/ssanderson/clones/pandas/pandas/core/internals.pyc in fast_xs(self, loc)
2899 """
2900 if len(self.blocks) == 1:
-> 2901 return self.blocks[0].values[:, loc]
2902
2903 items = self.items
/home/ssanderson/clones/pandas/pandas/tseries/base.pyc in __getitem__(self, key)
93 attribs['freq'] = freq
94
---> 95 result = getitem(key)
96 if result.ndim > 1:
97 return result
IndexError: too many indices for array
ok, I changed this. So now `.values` -> 'external values', and `._values` -> 'internal values'. These are currently the same for everything except DatetimeTZ. So this allows an internal implementation, and we can have an external `.values` that is different.
so:
In [1]: s = Series(date_range('20130101',periods=3,tz='US/Eastern'))
In [2]: s
Out[2]:
0 2013-01-01 00:00:00-05:00
1 2013-01-02 00:00:00-05:00
2 2013-01-03 00:00:00-05:00
dtype: datetime64[ns, US/Eastern]
In [3]: s.values
Out[3]:
array([Timestamp('2013-01-01 00:00:00-0500', tz='US/Eastern'),
Timestamp('2013-01-02 00:00:00-0500', tz='US/Eastern'),
Timestamp('2013-01-03 00:00:00-0500', tz='US/Eastern')], dtype=object)
In [4]: s._values
Out[4]: DatetimeIndex(['2013-01-01 00:00:00-05:00', '2013-01-02 00:00:00-05:00', '2013-01-03 00:00:00-05:00'], dtype='datetime64[ns, US/Eastern]', freq='D')
@jreback Thanks for this change!
Did some quick testing, and some more feedback:
What do you guys think the dtype of `.values` should be?
- object with Timestamp objects -> this is what it currently does in the PR, and is also backwards compatible (as this is how tz-aware datetime data are stored now in a frame, as objects)
- datetime64 -> I think this is the more useful return value, if the reason you access the raw `.values` numpy arrays is to do some performant operation on them. The tz can always be accessed separately (`s.dt.tz`) if you want to keep it along with the numpy array. There is also no easy way to get the datetime64 values from an object array of Timestamps, I think?
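On that last point, one workaround does exist (a sketch, not part of the PR, using hypothetical data and assuming all Timestamps share one tz): round-trip the object array through a DatetimeIndex, where `tz_convert(None)` shifts to UTC and drops the tz, leaving a plain datetime64 ndarray.

```python
import numpy as np
import pandas as pd

# An object ndarray of tz-aware Timestamps, like the .values return
# being discussed (illustrative data).
stamps = np.array(
    [pd.Timestamp("2013-01-01", tz="US/Eastern"),
     pd.Timestamp("2013-01-02", tz="US/Eastern"),
     pd.Timestamp("2013-01-03", tz="US/Eastern")],
    dtype=object,
)

# DatetimeIndex re-infers the tz; tz_convert(None) converts to UTC and
# removes it, so .values is a plain datetime64[ns] ndarray.
as_m8 = pd.DatetimeIndex(stamps).tz_convert(None).values
print(as_m8.dtype)
```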
Further, I noticed:
In [18]: df = DataFrame({'A' : date_range('20130101',periods=3),
'B' : date_range('20130101',periods=3,tz='US/Eastern'),
'C' : date_range('20130101',periods=3,tz='CET')})
In [19]: df
Out[19]:
A B C
0 2013-01-01 2013-01-01 00:00:00-05:00 2013-01-01 00:00:00+01:00
1 2013-01-02 2013-01-02 00:00:00-05:00 2013-01-02 00:00:00+01:00
2 2013-01-03 2013-01-03 00:00:00-05:00 2013-01-03 00:00:00+01:00
In [20]: s = df['B']
In [21]: s.astype('datetime64[ns]')
AttributeError: 'DatetimeIndex' object has no attribute 'to_dense'
fixed the bug:
In [2]: df.astype('datetime64[ns]')
Out[2]:
A B C
0 2013-01-01 2013-01-01 05:00:00 2012-12-31 23:00:00
1 2013-01-02 2013-01-02 05:00:00 2013-01-01 23:00:00
2 2013-01-03 2013-01-03 05:00:00 2013-01-02 23:00:00
In [3]: df.astype('datetime64[ns]').dtypes
Out[3]:
A datetime64[ns]
B datetime64[ns]
C datetime64[ns]
dtype: object
For the first, the values are converted to UTC and displayed as if it's `astype('datetime64[ns]')`:
In [18]: df['B'].astype('datetime64[ns]')
Out[18]:
0 2013-01-01 05:00:00
1 2013-01-02 05:00:00
2 2013-01-03 05:00:00
Name: B, dtype: datetime64[ns]
In [19]: df['B'].astype('datetime64[ns]').values
Out[19]:
array(['2013-01-01T00:00:00.000000000-0500',
'2013-01-02T00:00:00.000000000-0500',
'2013-01-03T00:00:00.000000000-0500'], dtype='datetime64[ns]')
I could be on board with that, except again you lose the fact that this has a meaningful tz; but if you are really accessing `.values` then you know what you are doing and want a numpy array anyhow.
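For readers on a current pandas, where tz-aware-to-naive `astype` has since been disallowed, the two ways of dropping a tz described in this thread map onto `tz_convert(None)` (shift to UTC, then drop the tz) and `tz_localize(None)` (drop the tz, keep local wall times). A sketch:

```python
import pandas as pd

s = pd.Series(pd.date_range("20130101", periods=3, tz="US/Eastern"))

# tz_convert(None): convert to UTC, then drop the tz; this matches the
# "converted to UTC" behavior described above.
utc_naive = s.dt.tz_convert(None)

# tz_localize(None): drop the tz but keep the local wall-clock times.
wall_naive = s.dt.tz_localize(None)

print(utc_naive.iloc[0])   # 2013-01-01 05:00:00
print(wall_naive.iloc[0])  # 2013-01-01 00:00:00
```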
ok making this change. it actually is exposing some bugs.....:)
ok, updated as I described above.
👍 from me on this proposal. If I'm accessing `.values`, it's almost always because I care about performance or because I'm doing something numpy-specific.
Also +1 on `.values` returning a numpy array of datetime64s (without tz).
ok, pls have a final look if desired.
Final comment: the series `.values` now gives you the datetime64 values, but when having multiple columns, these are still the Timestamp objects. This seems a bit inconsistent:
In [10]: df['B'].values
Out[10]:
array(['2013-01-01T06:00:00.000000000+0100',
'2013-01-02T06:00:00.000000000+0100',
'2013-01-03T06:00:00.000000000+0100'], dtype='datetime64[ns]')
In [11]: df.values
Out[11]:
array([[Timestamp('2013-01-01 00:00:00'),
Timestamp('2013-01-01 00:00:00-0500', tz='US/Eastern'),
Timestamp('2013-01-01 00:00:00+0100', tz='CET')],
[Timestamp('2013-01-02 00:00:00'),
Timestamp('2013-01-02 00:00:00-0500', tz='US/Eastern'),
Timestamp('2013-01-02 00:00:00+0100', tz='CET')],
[Timestamp('2013-01-03 00:00:00'),
Timestamp('2013-01-03 00:00:00-0500', tz='US/Eastern'),
Timestamp('2013-01-03 00:00:00+0100', tz='CET')]], dtype=object)
hmm, the interleaving on a DataFrame `.values` could go either way. IOW, if you had an object column mixed in then that would be correct (it ends up as an object array and nothing is cast), but I suppose if it's just mixed datetime-like then I can coerce.
Ah, yes, that is true. In this case, if it's not too difficult, I would say to coerce them all to datetime64, but leaving it as is is also not that strange then.
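If a caller does want the coerced result, a sketch of a workaround (not from the PR; assumes all datetime-like columns) is to normalize each tz-aware column to naive UTC before taking `.values`:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "A": pd.date_range("20130101", periods=3),
    "B": pd.date_range("20130101", periods=3, tz="US/Eastern"),
    "C": pd.date_range("20130101", periods=3, tz="CET"),
})

# Mixed naive/aware columns interleave to an object ndarray of Timestamps.
print(df.values.dtype)

# Converting each aware column to UTC and dropping its tz makes the
# frame homogeneous, so .values is a plain datetime64[ns] ndarray.
coerced = df.apply(
    lambda col: col.dt.tz_convert(None) if col.dt.tz is not None else col
)
print(coerced.values.dtype)
```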
fix scalar comparisons vs None generally
fix NaT formatting in Series
TST: skip postgresql test with tz's
update for msgpack
Conflicts: pandas/core/base.py pandas/core/categorical.py pandas/core/format.py pandas/tests/test_base.py pandas/util/testing.py
full interop for tz-aware Series & timedeltas pandas-dev#10763
I have left it as is. Then this is very consistent and not suddenly changed if you add, say, an 'object' field or whatever. We could always adjust this later.
ok, bombs away....
the initial impl was only a week or so.....2 months to make it work properly....:>
jreback added a commit that referenced this pull request
API: add DatetimeBlockTZ #8260