API: convert_objects, xref #11116, instead of warning, raise a ValueError by jreback · Pull Request #11133 · pandas-dev/pandas (original) (raw)
Conversation
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.Learn more about bidirectional Unicode characters
[ Show hidden characters]({{ revealButtonHref }})
closes #11116
This is a lot more noisy that the current impl. It pretty much raises on the old-style input if it had coerce in it. This is correct as the actual behavior of coerce has changed (in that it now it much more strict).
errror on old-style input
- raise a ValueError for df.convert_objects('coerce')
- raise a ValueError for df.convert_objects(convert_dates='coerce') (and convert_numeric,convert_timedelta)
In [1]: data = """foo,bar
...: 2015-09-14,True
...: 2015-09-15,
...: """
In [2]: df = pd.read_csv(StringIO(data),sep=',')
In [3]: df
Out[3]:
foo bar
0 2015-09-14 True
1 2015-09-15 NaN
In [4]: df.convert_objects('coerce')
ValueError: The use of 'coerce' as an input is deprecated. Instead set coerce=True.
In [5]: df.convert_objects(coerce=True)
ValueError: coerce=True was provided, with no options for conversions.excatly one of 'datetime', 'numeric' or 'timedelta' must be True when when coerce=True.
In [8]: df.convert_objects(convert_dates='coerce')
/Users/jreback/miniconda/bin/ipython:1: FutureWarning: the 'convert_dates' keyword is deprecated, use 'datetime' instead
ValueError: The use of 'coerce' as an input is deprecated. Instead set coerce=True.
…e a ValueError on old-style input
- raise a ValueError for df.convert_objects('coerce')
- raise a ValueError for df.convert_objects(convert_dates='coerce') (and convert_numeric,convert_timedelta)
I am not really in favor of this, as this really completely breaks people's code.
Another option would be to not do any mapping of the old keyword arguments to the new ones. So leave the old keywords with deprecation warnings but their original behaviour, and the new keywords with the new behaviour.
it's not the deprecation mappings that are a problem. it's that the coerce conversion by definition is forced. using it on s frame is just not a good idea at all.
in prior versions it worked because it was really never a forced conversion.
users code was simply wrong and should break
I don't see any way around this nor should their be
they were relying on broken behavior
this is a loud break showing that what you were doing was wrong
I suppose s better error message would help
Yes, the fact that previously the conversion was not really 'forced' (as if it only converted if at least one could be converted), and now it is 'forced', this is what it is making a breaking change. Peoples code was not wrong, they were just using the function how it worked.
Not doing the mapping of the old to the new keyword arguments would be a way to introduce the new behaviour (under the new kwargs) without breaking people's code (but of course, the old behaviour should still be supported in the code base in this case, which could make it more complex)
I agree that this PR is in fact (in some cases) better than current master, as it raises an error notifying the user he/she should look at it, instead of just changing the result and breaking possible further processing.
I don't see any reason to support old code like that
it just prolongs the change period
Actually, what you are doing here is partly the same as I am saying: not doing the mapping of old to new kwargs (for the 'coerce' case). But the question is then what to do with the old kwargs: raising an error (this PR), or keeping the old behaviour but deprecating (allowing a more smooth transition).
@jreback I don't say that is the better or easy approach, but the reason to do this is pretty clear: keep current user code working with 0.17.0 and giving them some time to adapt.
Further, not doing the mapping would also allow us to further clean up some inconsistencies in the current API (which would otherwise be more breaking changes):
numeric=Truealso impliescoerce=True(while this is not the case for datetime and timedelta), so converting numeric strings without coercing is not possible (actually, it is possible withastype, but not withconvert_objects)datetime=Truedoes not parse strings to datetimes, whiledatetime=True, coerce=Truedoes parse strings. Whilecoerceshould be about converting inconvertibles to NaN or nto.
as I said I think we could give a better error message (kind of like instructions on how his changed)
I'll put that up a bit later
but this is a pretty narrow case - breaking is sometimes necessary
I think that we should just change this whole approach - this is too error prone
let's deprecate this entire function - not adding anything (so I guess revert to the prior behavior) - or just outright raise an error
add pd.to_numeric(errors='raise'|'ignore'|'coerce)
eg same signature as pd.to_datetime/timedelta
that way the pattern is that the user needs to apply this to individual columns -
another option is to make a new method.convert with the new behavior
and leave .convert_objects alone (but deprecate it fully)
using it on s frame is just not a good idea at all.
I agree with this. Better practice is to use
df['a'] = df['a'].astype('float')
or
df['a'] = pd.to_datetime(df['a'])
so the the minimal conversions is used (or an error is raised).
let's deprecate this entire function - not adding anything (so I guess revert to the prior behavior) - or just outright raise an error
I think this is probably the right approach, plus adding to_numeric which will handle the numeric uses of convert_objects
@bashtage you want to take a crack at this?
- restore original
.convert_objects(though you prob want to use your new structure, maybe have acompatparm (into the_possibly_convert_objects) - deprecate
.convert_objects - change whatsnew
pd.to_numeric
@bashtage are you going to be able to address this?
+1 to adding a pd.to_numeric
I would not add a new convert method. If we want this functionality, we can do it within convert_objects (instead of deprecating the method in favor of a new, we can deprecate the keyword arguments in favor of new arguments, within the same method. In the end, the result can be the same, but we don't need a new method on DataFrame).
I can get to it this week.
IMO the old behavior of convert_objects is at best poor and in the long run it would be better to remove it. For example, it didn't even respect the coerce flag, and has wierd dependencies between types (since it tries one, moves on to the next,etc, and so it not robust to lots of reasonable refactorings).
I think there are two ways forward:
- add
to_numericto there is ato_datetimeandto_timestampwhich completes the set. Start deprecation ofconvert_objects - Move new version to
convertand start deprecation.
In principle adding to_numeric and convert aren't orthogonal, so both could be done.
@bashtage thanks!
- I think let's just add
pd.to_numeric - as far as the original
.convert_objects. I agree the original behavior was just misleading. I think it might simply be best to just use kind of what this PR does. E.g. map the original arguments, if nocoerceis specified, then it should work, if not, refuse to guess and just raise. Further I think let's simply deprecate the entire function (.convert_objects).
Yes this will break code, but it will loudly do it and I think only a very very small subset of users will experience this. We cannot cleanly do this as the fundamental behavior is changed. I suppose could revert to the original behavior, but simplest/easiest from this point is just raise.
Agree on pd.to_numeric, this is useful in any case!
On convert_objects, if we want to deprecate it, I would revert it to how it was originally (0.16.2). I don't really see a point in breaking people's code to make something more consistent in a feature we are going to deprecate anyway. So why not just deprecating it then and pointing them to pd.to_numeric/datetime. I would say it is either change how convert_objects (to make it better, in order to keep it) or either deprecate it.
ok, let's rollback the entire .convert_objects change, and just deprecate it entirely.
If you want to leave the logic you have internally for _possibly_convert_objects thats fine as well (though should be the same exact external API), so not sure what is easier.
and add pd.to_numeric (which is essentially your new logic).
@bashtage change the documentation to advertise to_numeric/to_datetime/to_timedelta, rather than .convert_objects.
I have been working on the following:
- Revert the public
convert_objectswith the small bug fixed. - Turn the current
convert_objectsto a private_convert. This function is used internally in a handful of places and works correctly. - Add
to_numeric
One question: where should to_numeric live? generic.py? The other to_ live in time series specific modules.
@bashtage awesome!
pd.to_numeric could live in pandas/tools/util.py (not much there) ATM (then import in __init__ into the .pd API).
prob should reorg this at some point, though that is purely internal.
bashtage pushed a commit to bashtage/pandas that referenced this pull request
Restores the v0.16 behavior of convert_objects and moves the new version of _convert Adds to_numeric for directly converting numeric data
closes pandas-dev#11116 closes pandas-dev#11133