ENH: fill_value argument for shift by ResidentMario · Pull Request #15527 · pandas-dev/pandas (original) (raw)
Conversation
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.Learn more about bidirectional Unicode characters
[ Show hidden characters]({{ revealButtonHref }})
- closes ENH: fill_value argument for shift #15486
- tests added / passed
- passes
git diff upstream/master | flake8 --diff - whatsnew entry
@ResidentMario tip. write tests first! This validates that you are actually testing the right thing.
Will try to follow TDD when here going forward.
Right now, this looks like it works for non-datetimes, but with datetimes the fill value is treated as UNIX time. So e.g. pd.Series(pd.date_range('1/1/2015', '1/30/2015')).shift(1, fill_value=0) replaces the first instance with 1970-01-01, not 0.
What behavior should be followed in that case? Trying to read the fill_value input as a datetime maybe?
It seems to me that the behavior the emulate here w.r.t. df.shift(fill_value=0) is that of df.iloc[:, 0] = 0. Which does the following by column:
intandobjectcolumns get0(int) and retain theirdtype.floatcolumns get0.0(float) and retain theirdtype.datetimecolumns get1970-01-01(datetime; UNIX timestamp) and retain theirdtype.periodcolumns get0(int) and becomeobjectdtype. (!)categoricalcolumns fail out withValueError: Cannot setitem on a Categorical with a new category, set the categories first.sparsecolumns fail with aNotImplementedError.
no the fill value has to be a valid value for that dtype (this is checked already internally)
The tests for shift are kind of all over the place (?). For the moment, I'm keeping the one's I'm writing in tests/frame/test_timeseries::TestDataFrameTimeSeriesMethods.test_shift_fillna (for DataFrame.shift) and tests/series/test_timeseries::TestSeriesTimeSeriesMethods.test_shift_fillna (for Series.shift).
As you stated above, the implementation is from a surface level as simple as passing the fill_value argument down to the requisite shift methods (several exist, one for each "kind" of axis), where in most cases there's a _maybe_upcast function which accepts the new fill_value as a keyword argument.
A couple of things I'm seeing in testing, however. I'm finding that in this implementation non-valid values actually get converted to np.nan: such is the fate of ts.shift(1, fill_value="Hello World"), which results in a first value of NaT. I'd have expected a ValueError instead, however...is this the desired behavior?
The other thing is that Categorical and Sparse don't make a call to _maybe_upcast internally. Should something else be done in these cases instead? Maybe these column types should raise a NotImplemented when passed a fill_value instead?
yes .shift tests are a bit of a mixed bag (always glad to consolidate things though - generally separate PR for things like that is good).
So we certainly can raise on a not-None fill_value for certain types (categorical; sparse in theory could work though, but don't spend too much time on, you can simply open an issue)
and to your last part, we are prob not validating things as much as we should so help there would be great as well!
Would it be OK to make reworking the fill_value validation a separate issue? This seems like a broader issue, as other functions which have their own validation code paths. For example, with Series.add:
pd.Series([0, 1]).add(pd.Series([np.nan, np.nan]), fill_value='Hello')
Raises:
ValueError: invalid literal for int() with base 10: 'Hello'
On:
.../series.py in _binop(self, other, func, level, fill_value)
1639 # one but not both
1640 mask = this_mask ^ other_mask
-> 1641 this_vals[this_mask & mask] = fill_value
1642 other_vals[other_mask & mask] = fill_value
1643
Whereas this implementation results in an upcast to object type:
>>> pd.Series([0, 1]).shift(1, fill_value="he")
0 he
1 0
dtype: object
It would be good to write up something like a _maybe_fill method that would handle this check across all of the APIs that implement a fill_value and make this behavior consistent.
With this PR, I suggest:
- Raise a
NotImplementederror for categorical/sparse. - Write the tests where they are, for now.
- Merge the current working implementation for non-categorical/non-sparse.
- File an issue for consolidating
shifttests. - File an issue for a
_maybe_fillconsolidator. - File an issue for extending
fill_valuetosparse.
@ResidentMario absolutely!
in general, 1 PR -> 1 'issue'. It makes it easier to reason about / review.
So for example I would leave this PR, and do another that does the validation as you suggest. You can then rebase this one on your new one (until it is merged), then rebase on master.
Very happy with independent (or even dependent PR's as I suggest above), just indicate if that is the case in the top of the PR.
and certainly filing issues is good as well; for bugs / features even if you plan to do them. It keeps things organized.
closing as stale but good idea. master has changed a bit since this PR, might be easier now.
Labels
np.nan, pd.NaT, pd.NA, dropna, isnull, interpolate