ENH: add downcast to pd.to_numeric by gfyoung · Pull Request #13425 · pandas-dev/pandas
Conversation
Title is self-explanatory. Closes #13352.
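For context, a minimal sketch of what the new keyword does (the keyword name comes from the PR title; the example values are illustrative):

```python
import pandas as pd

# Without downcasting, integer-like strings come back as the platform default (int64).
pd.to_numeric(["1", "2", "3"])

# With downcast, the result is reduced to the smallest dtype that can hold the data.
pd.to_numeric(["1", "2", "3"], downcast="integer")  # -> int8
pd.to_numeric(["1.0", "2.5"], downcast="float")     # -> float32
```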
Current coverage is 84.31%

@@            master    #13425    diff @@
   Files         138       138
   Lines       51157     51176     +19
   Methods         0         0
   Messages        0         0
   Branches        0         0
 - Hits        43131     43151     +20
 - Misses       8026      8025      -1
   Partials        0         0

Powered by Codecov. Last updated by 5701c69...4758dcc
@@ -2097,3 +2097,35 @@ def _random_state(state=None):
    else:
        raise ValueError("random_state must be an integer, a numpy "
                         "RandomState, or None")
put in pandas.types.missing.py (new file); let's not add any more to core/common.py
Fair enough. Done.
@jreback: Refactored na_values and added the version tags. Travis is still happy, so ready to merge if there are no other concerns.
lgtm. @jorisvandenbossche
A small issue is whether the kwargs are the right names; further, use_unsigned depends on compact_ints, which is a little awkward.
Maybe: compact_ints, compact_ints_to_unsigned? (and decouple them). I know it's longer, but it's more explicit.
@jreback: I see what you're saying, but the decoupling would raise the issue of which one takes precedence, i.e. what happens if, for some reason or another, someone passes in both compact_ints=1 and compact_ints_to_unsigned=1? Also, I'm not sure I would say that use_unsigned totally depends on compact_ints; the way I look at it is that it further parameterizes the compacting.
@gfyoung that would be an error; both options cannot be True.
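A purely hypothetical sketch of that rule, using the keyword names from the proposal at this point in the thread (not the merged code, which ended up with a single downcast keyword):

```python
def to_numeric(arg, errors="raise", compact_ints=False, compact_ints_to_unsigned=False):
    # illustrative guard only: the two downcasting modes are mutually exclusive
    if compact_ints and compact_ints_to_unsigned:
        raise ValueError("cannot pass both 'compact_ints' and "
                         "'compact_ints_to_unsigned'")
    ...
```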
# compact int64 "NaN" value
int8_na = na_values[np.int8]
int64_na = na_values[np.int64]
s = [int64_na, 2, 3]
I know the "NaN"-like values are the current lib.downcast_int64 behaviour, but I am not sure we should put that in user-facing code, because it means you are actually changing the content (and those values are currently not used as NaNs).
At the moment, the implementation in this PR is not really a substitute for pd.read_csv(..., compact_ints=True), I think, as it will only do something when there is actually a conversion from object dtype to integer dtype. Of course, it could easily be changed so that the keyword is also used when the dtype is already numeric (the if com.is_numeric_dtype(values): pass at the moment).
But, on second thought, I am actually not sure that to_numeric is the appropriate place for this functionality. You would then feed integer data to to_numeric not to "convert it to numeric dtype" (the current purpose of the function) but to downcast it, which broadens the scope of the function. It feels like this should actually belong in a separate function.
@jorisvandenbossche I view to_numeric as basically .astype with better error handling. So in that vein, I don't think it has to be object to begin with, though maybe we should also add a dtype= option.
Alternatively, maybe add these instead to .astype.
And @gfyoung, certainly remove the non-int64 sentinels, though.
@jorisvandenbossche: that is not correct - downcasting occurs as long as the converted data is of integer dtype, regardless of its starting dtype, i.e. data that is already integral can also be downcast. This is true in read_csv, and in to_numeric as well (with my changes).
While I do see your point that this change could be viewed as expanding the functionality beyond what it was originally intended for, compact_ints and use_unsigned can also be viewed as further specifying the manner in which data is cast to numerical values. From this viewpoint, the added functionality remains within the bounds of what this function was originally designed to do.
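A small sketch of the behaviour described above, assuming the downcast keyword as merged: downcasting applies even when the input is already of integer dtype, not only for object-to-integer conversions.

```python
import numpy as np
import pandas as pd

s = pd.Series([1, 2, 3], dtype=np.int64)

# Already-numeric input is still downcast rather than passed through unchanged.
pd.to_numeric(s, downcast="integer")  # dtype: int8
```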
@gfyoung Looks good! Made a couple of doc comments
Last comment: I would personally just choose whichever one of 'integer' or 'signed' we think suits best, instead of allowing both for exactly the same thing.
Hmmm... seems reasonable enough. @jreback: should we choose one of integer or signed, OR just keep both of them to mean "smallest signed integer"?
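For reference, a sketch of the two spellings treated as synonyms for "smallest signed integer", with 'unsigned' as the separate spelling for unsigned downcasting (example values are illustrative):

```python
import pandas as pd

data = ["1", "2", "3"]

# 'integer' and 'signed' request the smallest signed integer dtype.
pd.to_numeric(data, downcast="integer")   # -> int8
pd.to_numeric(data, downcast="signed")    # -> int8

# 'unsigned' requests the smallest unsigned integer dtype instead.
pd.to_numeric(data, downcast="unsigned")  # -> uint8
```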
@@ -1788,6 +1788,46 @@ but occasionally has non-dates intermixed and you want to represent as missing.
In addition, :meth:`~DataFrame.convert_objects` will attempt the *soft* conversion of any *object* dtypes, meaning that if all the objects in a Series are of the same type, the Series will have that dtype.
:meth:`~pandas.to_numeric` is very similar to `pd.to_datetime`, except that it is for conversion to numerical data:
You can make pd.to_datetime a function reference as well.
Yep, done.
@jreback: with the massive Travis backlog currently going on, could you cancel these two old builds of this PR, here and here?
1) :meth:`~pandas.to_datetime` (conversion to datetime objects)
.. ipython:: python
Just use lists here - no need to use a numpy array. Also say these operate on 1-dim things (or scalars).
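A sketch of what the doc example might look like after that comment, using plain lists (illustrative values, not necessarily the exact ones in the final docs):

```python
import pandas as pd

# both functions accept scalars and 1-dimensional inputs such as lists
pd.to_datetime(["2005/11/23", "2010.12.31"])
pd.to_numeric(["1.1", 2, 3])
```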
Fair enough. Done.
@jreback : Travis is passing, so ready to merge if there are no other concerns.
thanks @gfyoung, another nice PR!
though we DO seem to be having long discussions :)
Give the docs a look once they're built. I edited slightly.
@jreback: well, I think the long discussion was quite beneficial. We caught a bug in the Travis build script and made the API slightly more compact (one arg instead of two). Will take a look at the doc.
gfyoung deleted the to-numeric-enhance branch
nateGeorge pushed a commit to nateGeorge/pandas that referenced this pull request
Title is self-explanatory. Closes pandas-dev#13352.
Author: gfyoung gfyoung17@gmail.com
Closes pandas-dev#13425 from gfyoung/to-numeric-enhance and squashes the following commits:
4758dcc [gfyoung] ENH: add 'downcast' to pd.to_numeric