ENH: Support returning the same dtype as the caller for window ops (including extension dtypes) · Issue #11446 · pandas-dev/pandas

In the sample code below, rolling_apply takes an argument 'ix', which is a numpy array of dtype 'int64'. By the time this array reaches the get_type() function, its dtype has changed to 'float64'. I can make an explicit call inside get_type() to change it back (ix = ix.astype('int64')), but I was curious why it gets changed in the first place.

Example below. I'm on version '0.17.0':

import numpy as np
import pandas as pd


def get_type(ix, df, hours):
    # invoked by rolling_apply to illustrate the problem
    # of rolling_apply changing the dtype of 'ix' array from
    # int64 to float64

    print(ix.dtype)

    # need to convert index dtype back to int64
    #ix = ix.astype('int64')

    ixv = ix[ix > -1]
    print(ixv.dtype)

    # the data in ix must be int64 else following fails with 
    # IndexError: arrays used as indices must be of integer (or boolean) type
    h = hours[ixv] - hours[ixv[0]]
    df.iloc[ix[-1]] = h[0]
    return 0.0


# we start out with ix.dtype = int64 but rolling_apply changes this to float64
ix = np.arange(0, 10)
hours = np.random.randint(0, 10, len(ix))
df = pd.DataFrame(np.random.randn(10, 1), columns=['h'])

pd.rolling_apply(ix, window=3, func=get_type, args=(df, hours,))

I also stepped through the code and believe I've identified the source of the problem. I thought I'd report it and see whether others consider this an issue before trying to fix it. Doing an explicit type change inside the get_type function, as in this example, also works.
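For reference, here is a minimal sketch of the explicit-cast workaround. It is written against the newer `Series.rolling(...).apply(..., raw=True)` API rather than the `pd.rolling_apply` call from 0.17.0 used in this report, so it assumes a more recent pandas. The callback name `hour_span` and the deterministic `hours` array are made up for illustration.

```python
import numpy as np
import pandas as pd

seen_dtypes = []  # records the dtype each window arrives with


def hour_span(window, hours):
    # Hypothetical callback: `window` holds integer positions into `hours`,
    # but it is handed over as a float64 array.
    seen_dtypes.append(window.dtype)
    idx = window.astype("int64")  # cast back before using it as an index
    return float(hours[idx[-1]] - hours[idx[0]])


ix = np.arange(10, dtype="int64")
hours = np.arange(10, dtype="int64") * 2  # deterministic stand-in for the random hours

result = pd.Series(ix).rolling(window=3).apply(hour_span, raw=True, args=(hours,))
print(set(seen_dtypes))
print(result.tolist())
```

Without the `astype("int64")` line, indexing `hours` with the float window raises the IndexError quoted in the example above.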

The _process_data_structure() function turns this into a float.

Here's the logic that is explicitly changing the dtype to a float the first time. This can be omitted and the check updated to include 'float':

    if kill_inf and values.dtype == float:
        values = values.copy()
        values[np.isinf(values)] = np.NaN

However, the Cython code that I assume performs the rolling window also expects float64. In that case, one option may be to restore the original dtype after the call_cython function returns.
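A rough sketch of that idea, done from the outside rather than inside pandas: run the window op in float64, then cast the result back to the caller's integer dtype only when that is lossless. The helper `rolling_apply_keep_dtype` is hypothetical and not part of pandas; it uses the newer `Series.rolling(...).apply` API, so it assumes a more recent pandas than 0.17.0.

```python
import numpy as np
import pandas as pd


def rolling_apply_keep_dtype(values, window, func, min_periods=None):
    """Hypothetical helper (not a pandas function): compute in float64,
    then restore the input's integer dtype if no information is lost."""
    values = np.asarray(values)
    result = (
        pd.Series(values)
        .rolling(window, min_periods=min_periods)
        .apply(func, raw=True)
        .to_numpy()
    )
    if np.issubdtype(values.dtype, np.integer):
        # Cast back only when there are no NaNs and every value is whole.
        if not np.isnan(result).any() and np.array_equal(result, np.round(result)):
            return result.astype(values.dtype)
    return result


ix = np.arange(10, dtype="int64")
restored = rolling_apply_keep_dtype(ix, 3, np.max, min_periods=1)
padded = rolling_apply_keep_dtype(ix, 3, np.max)  # leading NaNs block the cast
print(restored.dtype, padded.dtype)
```

The second call shows the main obstacle: with the default min_periods the leading windows produce NaN, which an integer dtype cannot represent, so the result has to stay float64.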