ENH: fill_value argument for shift by ResidentMario · Pull Request #15527 · pandas-dev/pandas (original) (raw)

Conversation

This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.Learn more about bidirectional Unicode characters

[ Show hidden characters]({{ revealButtonHref }})

closes ENH: fill_value argument for shift #15486
tests added / passed
passes git diff upstream/master | flake8 --diff
whatsnew entry

@ResidentMario tip. write tests first! This validates that you are actually testing the right thing.

Will try to follow TDD when here going forward.

Right now, this looks like it works for non-datetimes, but with datetimes the fill value is treated as UNIX time. So e.g. pd.Series(pd.date_range('1/1/2015', '1/30/2015')).shift(1, fill_value=0) replaces the first instance with 1970-01-01, not 0.

What behavior should be followed in that case? Trying to read the fill_value input as a datetime maybe?

It seems to me that the behavior the emulate here w.r.t. df.shift(fill_value=0) is that of df.iloc[:, 0] = 0. Which does the following by column:

int and object columns get 0 (int) and retain their dtype.
float columns get 0.0 (float) and retain their dtype.
datetime columns get 1970-01-01 (datetime; UNIX timestamp) and retain their dtype.
period columns get 0 (int) and become object dtype. (!)
categorical columns fail out with ValueError: Cannot setitem on a Categorical with a new category, set the categories first.
sparse columns fail with a NotImplementedError.

no the fill value has to be a valid value for that dtype (this is checked already internally)

The tests for shift are kind of all over the place (?). For the moment, I'm keeping the one's I'm writing in tests/frame/test_timeseries::TestDataFrameTimeSeriesMethods.test_shift_fillna (for DataFrame.shift) and tests/series/test_timeseries::TestSeriesTimeSeriesMethods.test_shift_fillna (for Series.shift).

As you stated above, the implementation is from a surface level as simple as passing the fill_value argument down to the requisite shift methods (several exist, one for each "kind" of axis), where in most cases there's a _maybe_upcast function which accepts the new fill_value as a keyword argument.

A couple of things I'm seeing in testing, however. I'm finding that in this implementation non-valid values actually get converted to np.nan: such is the fate of ts.shift(1, fill_value="Hello World"), which results in a first value of NaT. I'd have expected a ValueError instead, however...is this the desired behavior?

The other thing is that Categorical and Sparse don't make a call to _maybe_upcast internally. Should something else be done in these cases instead? Maybe these column types should raise a NotImplemented when passed a fill_value instead?

yes .shift tests are a bit of a mixed bag (always glad to consolidate things though - generally separate PR for things like that is good).

So we certainly can raise on a not-None fill_value for certain types (categorical; sparse in theory could work though, but don't spend too much time on, you can simply open an issue)

and to your last part, we are prob not validating things as much as we should so help there would be great as well!

Would it be OK to make reworking the fill_value validation a separate issue? This seems like a broader issue, as other functions which have their own validation code paths. For example, with Series.add:

pd.Series([0, 1]).add(pd.Series([np.nan, np.nan]), fill_value='Hello')

Raises:

ValueError: invalid literal for int() with base 10: 'Hello'

On:

.../series.py in _binop(self, other, func, level, fill_value)
   1639             # one but not both
   1640             mask = this_mask ^ other_mask
-> 1641             this_vals[this_mask & mask] = fill_value
   1642             other_vals[other_mask & mask] = fill_value
   1643

Whereas this implementation results in an upcast to object type:

 >>> pd.Series([0, 1]).shift(1, fill_value="he")
 0    he
 1     0
 dtype: object

It would be good to write up something like a _maybe_fill method that would handle this check across all of the APIs that implement a fill_value and make this behavior consistent.

With this PR, I suggest:

Raise a NotImplemented error for categorical/sparse.
Write the tests where they are, for now.
Merge the current working implementation for non-categorical/non-sparse.
File an issue for consolidating shift tests.
File an issue for a _maybe_fill consolidator.
File an issue for extending fill_value to sparse.

@ResidentMario absolutely!

in general, 1 PR -> 1 'issue'. It makes it easier to reason about / review.

So for example I would leave this PR, and do another that does the validation as you suggest. You can then rebase this one on your new one (until it is merged), then rebase on master.

Very happy with independent (or even dependent PR's as I suggest above), just indicate if that is the case in the top of the PR.

and certainly filing issues is good as well; for bugs / features even if you plan to do them. It keeps things organized.

closing as stale but good idea. master has changed a bit since this PR, might be easier now.

Labels

Missing-data

np.nan, pd.NaT, pd.NA, dropna, isnull, interpolate

ENH: fill_value argument for shift by ResidentMario · Pull Request #15527 · pandas-dev/pandas (original) (raw)

Conversation

Labels

2 participants