API: Deprecate compact_ints and use_unsigned in read_csv by gfyoung · Pull Request #13323 · pandas-dev/pandas
Title is self-explanatory.
xref #12686 - I don't quite understand why these are marked (if at all) as internal to the C engine only, as the benefits of having these options accepted by the Python engine are quite clear based on the documentation I added as well.
The implementation simply calls the already-written function in pandas/parsers.pyx - since it isn't specific to the TextReader class, crossing over to grab this function from Cython (instead of duplicating it in pure Python) seems reasonable while still maintaining the separation between the C and Python engines.
FYI: the test_compact_ints test that I deleted was a duplicate of test_compact_ints_as_recarray and also did not correctly reflect the test data.
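(For readers following along: a minimal sketch of what these keywords do at the read_csv level. It assumes a pandas version from this era - the keywords were deprecated here and later removed - and the resulting dtypes are illustrative, not taken from the PR.)

import pandas as pd
from io import StringIO

data = "a,b\n1,9\n2,10\n3,11"

# With compact_ints=True, parsed integer columns are downcast to the
# smallest integer dtype that can hold the values; use_unsigned=True
# additionally allows unsigned dtypes when all values are non-negative.
df = pd.read_csv(StringIO(data), compact_ints=True, use_unsigned=True)
print(df.dtypes)  # e.g. uint8 for both columns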
Current coverage is 84.22%
@@ master #13323 diff @@
Files 138 138
Lines 50713 50671 -42
Methods 0 0
Messages 0 0
Branches 0 0
- Hits 42715 42676 -39
- Misses 7998 7995 -3
Partials 0 0
Powered by Codecov. Last updated by 99e78da...0d56c09
It is fine to document these. In reality the parser should not be doing this, and these options just add to an already huge API. Better to have this as an option to to_numeric() or something. Or the user can simply pass in dtypes as appropriate.
I get that this makes things more 'automatic'.
So I think that we should deprecate these for now.
@jreback : No issue with simplifying the API, though it did seem like a somewhat decent memory-conscious option. If it is deprecated, telling users to refer to dtype might make more sense, because I don't see where we would be going with adding these to to_numeric, though that would be more inconvenient if there were a large table of small integers.
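(A minimal sketch of the dtype alternative being referred to; the column names and integer widths here are illustrative assumptions, not from the PR.)

import numpy as np
import pandas as pd
from io import StringIO

data = "a,b\n1,9\n2,10\n3,11"

# The non-deprecated route: state the desired integer widths explicitly
# via the dtype keyword instead of relying on compact_ints/use_unsigned.
df = pd.read_csv(StringIO(data), dtype={"a": np.uint8, "b": np.int16})
print(df.dtypes)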
@gfyoung you could simply add them as options to to_numeric, which is basically a smart .astype (one that handles errors and such).
How would they be toggled from the parser API exactly?
@gfyoung they wouldn't, that's why we deprecate :>
Though you could very easily move these to a post-processing step and simply call something like

if compact_ints:
    x = pd.to_numeric(x, errors='ignore', compact_ints=compact_ints, use_unsigned=use_unsigned)

and simply move the downcast_int64 code to a call inside pd.to_numeric (and move the code to inference.pyx just for cleanliness).
This is independent of deprecation or not, though (e.g. you can fix this and separately deprecate).
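(For context, a rough pure-Python sketch of what the downcast_int64 step conceptually does - an illustration only, not the actual Cython implementation in pandas.)

import numpy as np

def downcast_int64_sketch(arr, use_unsigned=False):
    # Return arr in the smallest integer dtype that can represent its
    # values; fall back to the original array if nothing smaller fits.
    lo, hi = int(arr.min()), int(arr.max())
    if use_unsigned and lo >= 0:
        candidates = [np.uint8, np.uint16, np.uint32, np.uint64]
    else:
        candidates = [np.int8, np.int16, np.int32, np.int64]
    for dtype in candidates:
        info = np.iinfo(dtype)
        if info.min <= lo and hi <= info.max:
            return arr.astype(dtype)
    return arr

print(downcast_int64_sketch(np.array([1, 2, 3]), use_unsigned=True).dtype)  # uint8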
Ah, I see. Will do that then.
Actually, it's not as straightforward as it seems. parser.pyx creates its own list of na_values (see here) that it uses when casting inside _downcast_to_int64, and that list of values is used in other functions within the file. If I move it to inference.pyx to be called via to_numeric (which I'm no longer sure about, since that is a further API change), I would have to duplicate the code that creates na_values - how does the refactoring sound now?
you can just do this in maybe_convert_numeric or equivalent
@jreback : Refactored downcast_to_int64 as a function in inference.pyx, and Travis is happy. So just waiting to see what @jorisvandenbossche has to say about deprecation.
I don't think this works if the arr has any na_values in it (not tested either).
@jreback : I literally moved the function from one location to another. If that's an issue, then it would have appeared in the parser as well. Will add a test for that, though.
@gfyoung no, if it's by definition int64 coming in, then na_values are superfluous here. I know you copied it; that's not to say it was completely correct before.
@jreback : na_values is actually used in several places in the implementation, so it's not as superfluous as you make it sound (unless you would like to hard-code values). Also, the function does work when the na_value for np.int64 is included in the array. Added a test to make sure.
@gfyoung you are missing my point: int arrays don't support missing values (currently), so if you actually have a missing value then it must be converted to float, which is impossible in the current impl. So they ARE superfluous.
@jreback : That's not the point of this function at all. If it has missing values (i.e. nan), the array won't go through this path. That na_values parameter is hardly superfluous if you look at what it actually has stored in it.
@gfyoung then prove it by showing an example
@jreback : I think my test cases speak for themselves.
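(For readers following the thread, a rough illustration of the point being argued - not the actual test from the PR: a column with a missing entry is parsed as float64, so it never reaches the int64 downcasting path at all.)

import pandas as pd
from io import StringIO

# Column "a" contains a missing entry, so it comes back as float64
# (NaN is a float) and skips any int64 downcasting.
with_na = pd.read_csv(StringIO("a,b\n1,2\n,4"))
print(with_na["a"].dtype)  # float64

# Only a fully populated integer column is parsed as int64, which is
# the kind of input the downcasting code actually sees.
no_na = pd.read_csv(StringIO("a,b\n1,2\n3,4"))
print(no_na["a"].dtype)  # int64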
it's kind of like a fancy astype
Hmm... not sure why my test for the deprecation is failing on just one machine. What am I doing wrong here? It shouldn't be this difficult to add, should it?
@@ -76,6 +76,7 @@ Other enhancements
- The ``pd.read_csv()`` with ``engine='python'`` has gained support for the ``decimal`` option (:issue:`12933`)
- The ``pd.read_csv()`` with ``engine='python'`` has gained support for the ``na_filter`` option (:issue:`13321`)
- The ``pd.read_csv()`` with ``engine='python'`` has gained support for the ``compact_ints`` and ``use_unsigned`` options (:issue:`13323`)
It's a bit strange to announce this as a new feature while it is being deprecated at the same time ...
That is a fair point, but I did add support nonetheless. The exact direction of this PR is still somewhat in flux, and I'll adjust the whatsnew once things have settled down a bit. More importantly, I would like to figure out why Travis is unhappy with me at the moment, as the tests pass on my machine.
yes, certainly fine to change things here in the end (and not sure why Travis is failing, I also do not see anything obvious)
hmm, I agree here. Since we are deprecating this, I think there is no need to announce it as Python engine support.
Fair enough. Will change as soon as I placate Travis.
Regarding the deprecation, I would be fine with that (I don't think many people have even heard of these keywords - in any case, I hadn't - although it would be nice if we could check this somewhere), but it would be nice if there were then an alternative available (like the pd.to_numeric idea).
I am only wondering whether we need to add the documentation to the docstring when we are deprecating it at the same time (we don't actually want to encourage people to use it, and the docstring is already really long).
@jorisvandenbossche : yeah, me neither. I hadn't really heard of it until I started looking into what it did in the CParser. Surprising just how many options are undocumented for a function that works quite well for most everyday purposes such as my own. 😄
The pd.to_numeric idea will need some work to flesh out, but I don't think that it is appropriate for this PR, as @jreback has already mentioned. I think the doc-string is still useful because comprehensive documentation is always a good idea IMO, just so that there is no confusion about all of these options. Otherwise, how would you know that it was deprecated?
})
# default behaviour for 'use_unsigned'
out = self.read_csv(StringIO(data), compact_ints=True)
Every one of these needs a tm.assert_produces_warning block. The warning on Travis is because, below, when you DO try to catch the warning, it has already been triggered (and hence doesn't show). These are somewhat hard to isolate. Usually just keep running smaller and smaller blocks of code to figure out where the uncaught warning is triggered, then inspect the code.
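(A minimal sketch of the pattern being asked for, using pd.read_csv in place of the test mixin's self.read_csv helper; the data and keyword are illustrative, and compact_ints only exists in pandas versions from this era.)

import pandas as pd
import pandas.util.testing as tm
from io import StringIO

data = "a,b\n1,9\n2,10"

# Each call that passes a deprecated keyword gets wrapped so the
# FutureWarning is both asserted and caught, instead of leaking out
# and tripping the uncaught-warning check elsewhere in the suite.
with tm.assert_produces_warning(FutureWarning):
    out = pd.read_csv(StringIO(data), compact_ints=True)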
Ah! I did not know that. Fixed.
I am OK with doc-stringing this (always helpful), and deprecating is fine with me. Whenever we add this to pd.to_numeric (which I think is the best place), we can simply update the deprecation warnings to recommend the new usage (even if it's done later). Maybe make a placeholder issue for that enhancement.
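(As a rough sketch of the kind of to_numeric-based replacement being discussed here - the downcast keyword shown below was added to to_numeric in a later pandas release, not by this PR.)

import pandas as pd

s = pd.Series([1, 2, 3], dtype="int64")

# A post-parse downcasting step in user code plays the role that
# compact_ints/use_unsigned played inside the parser.
small = pd.to_numeric(s, downcast="unsigned")
print(small.dtype)  # uint8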
@jreback @jorisvandenbossche : Travis is happy now, and I made the requested doc changes. Ready to merge if there are no other concerns.
It still feels a bit strange to add (to the Python parser) and document a feature that we are deprecating directly :-), but OK to merge for me
@gfyoung ty, pls also create an issue to add to pd.to_numeric
gfyoung deleted the python-engine-compact-ints branch
gfyoung changed the title from "ENH: Add support for compact_ints and use_unsigned in Python engine" to "API: Deprecate compact_ints and use_unsigned in read_csv"
gfyoung added a commit to forking-repos/pandas that referenced this pull request
gfyoung added a commit to forking-repos/pandas that referenced this pull request
jreback pushed a commit that referenced this pull request
- CLN: Drop compact_ints/use_unsigned from read_csv
Deprecated in v0.19.0
xref gh-13323
- CLN: Remove downcast_int64 from inference.pyx
It was only being used for the compact_ints and use_unsigned parameters in read_csv.