CLN: Make ufunc works for Index by sinhrks · Pull Request #10638 · pandas-dev/pandas (original) (raw)

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service andprivacy statement. We’ll occasionally send you account related emails.

Already on GitHub?Sign in to your account

Conversation37 Commits1 Checks0 Files changed

Conversation

This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.Learn more about bidirectional Unicode characters

[ Show hidden characters]({{ revealButtonHref }})

sinhrks

closes #9966, #9974 (PR)

I understand these are never used because Index is no longer the subclass of np.array. Correct?

np.sin(pd.Categorical([1, 2, 3]))
array([ 0.84147098,  0.90929743,  0.14112001])

@jreback

yep I think you could take them out. as np.array(index) will coerce to a ndarray (and not an index sub-class). This is done by _shallow_copy in any event.

@shoyer

No, we need this. Otherwise np.sin(idx) returns an ndarray.

It's not documented, but __array_finalize__ works even on non-subclasses.

@jreback

ok let's add some tests and clean up the code
I suspect this should be in core/base then

eg to support the example above

@jreback

IIRC there is another issue where for example (#9966 with a PR #9974) :

In [22]: np.sin(Index(range(5)))
Out[22]: Int64Index([0.0, 0.841470984808, 0.909297426826, 0.14112000806, -0.756802495308], dtype='float64')

which is wrong, this needs to be created w/o slightly differently, e.g.

Index(func(...), **self._get_attributes_dict())

so that the attributes get propogates AND the correct type gets inferred (e.g. this would return a FloatI64Index in this case)

maybe you want to supersede #9974

@sinhrks

@shoyer

@sinhrks sinhrks changed the titleCLN: Remote Datetime-like Index.__array_finalize__ CLN: Remove Datetime-like Index.__array_finalize__

Jul 22, 2015

@sinhrks

Let me confirm 2 points:

  1. Invalid ops for datetime-like indexes.
    I understand numeric ufuncs (like np.sin) for datetime-like is almost meaningless. PeriodIndex is likely to return unexpected result because of current __array__ definition. One idea is to raise error when PeriodIndex get float (invalid) result.
import pandas as pd  
# OK  
np.sin(pd.DatetimeIndex(['2015-07', '2015-08']))  
# TypeError: ufunc 'sin' not supported for the input types, and the inputs could not be safely coerced to   any supported types according to the casting rule ''safe''  
# NG  
pidx = pd.PeriodIndex(['2015-07', '2015-08'], freq='M')  
np.sin(pidx)  
# Float64Index([-0.594884318359, 0.354966539136], dtype='float64')  
# ufunc is applied to i8 values  
pidx.__array__()  
# array([546, 547])  
  1. Some ufunc returns bool dtype. In such cases, it should return bool array as it is (like Index.duplicated)
np.isnan(pidx)  
# Index([False, False], dtype='object')  

@jreback

@sinhrks both of those points look right

@shoyer

jreback

@@ -281,6 +281,16 @@ def __contains__(self, key):
return False
return key.ordinal in self._engine
def __array_wrap__(self, result, context=None):

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

doesn't this need to go in tseries/base, e.g. ufuncs are not valid for any datetimelike?

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

timedelta64 and datetime64 are real numpy dtypes, so ufuncs will already choke on them:

In [1]: import pandas as pd

In [2]: i = pd.date_range('2000-01-01', periods=5)

In [3]: import numpy as np

In [4]: np.square(i)
TypeError: ufunc 'square' not supported for the input types, and the inputs could not be safely coerced to any supported types according to the casting rule ''safe''

In [5]: np.square(i.to_period())
Out[5]:
PeriodIndex(['330671-10-02', '330731-10-03', '330791-10-05', '330851-10-09',
             '330911-10-16'],
            dtype='int64', freq='D')

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree that tests for these other index types would be a good idea, though.

Also, I think we should probably prohibit all ufuncs on PeriodIndex, not just those that return float. For example, it's not valid to square periods, even though that returns integers.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Raising TypeError in __array_wrap__ affects to arithmetic using np.array. Currently, it works:

pd.PeriodIndex(['2011', '2012'], freq='A') + np.array([1, 2])
# PeriodIndex(['2012', '2014'], dtype='int64', freq='A-DEC')

So it can't be prohibited?

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

PeriodIndex + integer ndarray should not be expected to work -- the array
does not have the right type!

On Mon, Aug 17, 2015 at 3:35 PM, Sinhrks notifications@github.com wrote:

In pandas/tseries/period.py
#10638 (comment):

@@ -281,6 +281,16 @@ def contains(self, key):
return False
return key.ordinal in self._engine

  • def array_wrap(self, result, context=None):

Raising TypeError in array_wrap affects to arithmetic using np.array.
Currently, it works:

pd.PeriodIndex(['2011', '2012'], freq='A') + np.array([1, 2])

PeriodIndex(['2012', '2014'], dtype='int64', freq='A-DEC')

So it can't be prohibited?


Reply to this email directly or view it on GitHub
https://github.com/pydata/pandas/pull/10638/files#r37245206.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

actually I disagree
add integers to period works and makes sense as its a freq op

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, because this is a valid freq shift, np.array should work as the same.

pd.PeriodIndex(['2011', '2012'], freq='A') + 1
# PeriodIndex(['2012', '2013'], dtype='int64', freq='A-DEC')

And #10744 has been done expecting arithmetic using np.array works.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reconsidered this, and defining _add_ndarray for PeriodIndex may be an alternative. I think options are:

  1. Always raise TypeError in __array_wraps__. Support ndarray ops in another methods (PeriodIndex must be lhs).
  2. Raise TypeError if __array_wraps__ gets non-integers. Some ufunc which return int outputs meaningless results (like square)

NOTE: Maybe not intentional, PeriodIndex.shift works for ndarray. Thus we can use shift for ndarray ops.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@sinhrks remember that ATM PeriodIndex is an odd duck. Its a real int64 dtyped array, that happens to have boxing. So it allows any integer-like ops (as opposed to DatetimeIndex which prohibits a lot more because its M8[ns]).

So its needs handling to error on prohibited ops. So on on raising in __array_wraps__ for prohibited ops, but this might require some dispatching mechanism to the sub-class to determine what is allowed. E.g. we do it similarly like this in the _add_numeric_ops and like routines.

@sinhrks sinhrks changed the titleCLN: Remove Datetime-like Index.__array_finalize__ CLN: Make ufunc works for Index

Jul 23, 2015

@jreback

@jreback

sinhrks

elif isinstance(func, np.ufunc):
if 'M->M' not in func.types:
msg = "ufunc '{0}' not supported for the PeriodIndex"
raise ValueError(msg.format(func.__name__))

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should be TypeError, but cannot be raised from here because numpy catches.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@shoyer you know any better way to override the ufunc calling scheme? (e.g. as __numpy_ufunc__ doesn't exist yet). or is this best way?

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this sort of thing is the best we can do for now

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok let's add a note that eventually we could replace this with numpy_ufunc

@sinhrks

jreback

np.arctanh, np.deg2rad, np.rad2deg]:
if isinstance(idx, pd.tseries.base.DatetimeIndexOpsMixin):
# raise TypeError or ValueError (PeriodIndex)
# PeriodIndex behavior should be changed in future version

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why the different exception behavior?

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

TypeError raised from __array_wrap__ seems to be catchrd by numpy and return unintended result.

@jreback

this doesn't seeem to close #9966 ?

In [9]: df = pd.DataFrame({'i': np.linspace(0, 5, 6).astype(np.int64), 
               'j': np.linspace(2, 3, 6)}).set_index('i')

In [10]: df
Out[10]: 
     j
i     
0  2.0
1  2.2
2  2.4
3  2.6
4  2.8
5  3.0

In [11]: df.index
Out[11]: Int64Index([0, 1, 2, 3, 4, 5], dtype='int64', name=u'i')

In [12]: df.index = df.index / 10.

In [13]: df.index
Out[13]: Int64Index([0.0, 0.1, 0.2, 0.3, 0.4, 0.5], dtype='float64', name=u'i')

@sinhrks

@jreback Standard arithmetic is handled by _numeric_binop. Now fixed.

df.index / 2.
# Float64Index([0.0, 0.5, 1.0, 1.5, 2.0, 2.5], dtype='float64', name=u'i')

np.divide(df.index, 2.)
# Float64Index([0.0, 0.5, 1.0, 1.5, 2.0, 2.5], dtype='float64', name=u'i')

@sinhrks

Failed in Timedelta ops. On current master, mult and div doesn't affect to existing freq, but it should?

import pandas as pd
idx = pd.timedelta_range('2 days', periods=3, freq='2D')

idx * 2
# TimedeltaIndex(['4 days', '8 days', '12 days'], dtype='timedelta64[ns]', freq='2D')
# -> freq should be "4D"?

idx / 2
# TimedeltaIndex(['1 days', '2 days', '3 days'], dtype='timedelta64[ns]', freq='2D')
# -> freq should be "D"?

@jreback

@sinhrks hmm I think you need to set the freq=None so it get's reinferred (or maybe can et it to 'infer')

@sinhrks

jreback added a commit that referenced this pull request

Sep 6, 2015

@jreback

CLN: Make ufunc works for Index

@jreback