BUG, ENH: Improve infinity parsing for read_csv by gfyoung · Pull Request #13274 · pandas-dev/pandas (original) (raw)
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service andprivacy statement. We’ll occasionally send you account related emails.
Already on GitHub?Sign in to your account
Conversation
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.Learn more about bidirectional Unicode characters
[ Show hidden characters]({{ revealButtonHref }})
- Allow mixed-case infinity strings for the Python engine
Bug was traced back via lib.maybe_convert_numeric
to the floatify
function in pandas/src/parse_helper.h
. In addition to correcting the bug and adding tests for it, this PR also moves the test_inf_parsing
test from c_parser_only.py
to common.py
in the pandas/io/tests/parser
dir.
- Interpret
+inf
as positive infinity for both engines
float('+inf')
in Python is interpreted as positive infinity, so we should allow it too in parsing.
@@ -252,3 +252,4 @@ Bug Fixes |
---|
- Bug in ``groupby`` where ``apply`` returns different result depending on whether first result is ``None`` or not (:issue:`12824`) |
- Bug in ``Categorical.remove_unused_categories()`` changes ``.codes`` dtype to platform int (:issue:`13261`) |
- Bug in ``pd.lib.maybe_convert_numeric`` in which infinities of mixed-case forms were not being interpreted properly (:issue:`13274`) |
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
give a more user friendly example (the user doesn't see this function as its internal)
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fair enough. I'll rewrite it to refer to read_csv
and the Python engine.
ok looks reasonable. pls run parser asv's just to be sure (I doubt we have something that specifically has infs in it; you can add if you want).
btw, not averse to signed inf (e.g. +inf
), if its in the IEEE spec (not sure)
- Not sure if it's necessary to add a benchmark for this. A couple of string checks shouldn't have much effect, so unless there is a strong objection, I'll leave it for now.
- I can't actually run benchmarks at the moment. There seems to be an issue with linux-64 channels in Anaconda for
tables
andlibpython
that prevents me from running the benchmarks. Can you confirm on a different architecture? - Can't confirm if
+inf
is in IEEE standard (Googled "IEEE infinity," and that didn't turn up anything). However,float('+inf')
does work in Python, so I guess we should accept it. However, it is not supported in the CParser either right now, so I'll need to see where infinity parsing is done there.
pls just run the current benchmarks - don't need a new one
you don't need libpython on Linux
I'm just running asv continuous master floatify-infinity-parsing -b io_bench
, and that's what it's complaining about, nothing special here, right?
gfyoung changed the title
Fix infinity parsing in parse_helper.floatify BUG, ENH: Improve infinity parsing for read_csv
@jreback : Added +inf
parsing to both parsers, and Travis is still giving the green light. While I cannot run the benchmarks because of that Anaconda issue, I wouldn't be surprised if they remain relatively the same since my changes shouldn't down anything too much. Ready to merge if there are no other concerns.
strcpy(inf_str, data); |
---|
for ( ; *inf_ptr; ++inf_ptr) *inf_ptr = tolower(*inf_ptr); |
if (0 == strcmp(inf_str, "-inf")) { |
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why aren't you using strcasecomp
here?
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good question! I just do include "headers/portable.h"
and then I can call it? That would indeed save me pain of this char
copying.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't do C very much :> but I think yes; though its included the same way as strcmp
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Haha, fair enough. Will give it a shot and see. Otherwise, I'll ask Google.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
also not sure if this will be more clear to make a if/else if for len 3 and another for len 4 (I 'c' lots of c code doing things like this)
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fair enough. Can break up for clarity.
- Python infinity parsing bug
Initially an attempt to fix a Python parsing bug of mixed-case infinity strings, the bug was traced back via lib.maybe_convert_numeric to the 'floatify' method in pandas/src/parse_helper.h.
In addition to correcting the bug and adding tests for it, this commit also moves the infinity-parsing test from CParser-only to common.
- Interpret '+inf' as positive infinity
This is consistent with the Python API, where float('+inf') is interpreted as positive infinity.
lgtm, ping on green. we don't have an associated issue right?
Nope. Discovered it when I was refactoring test_parsers
, and someone had inadvertently put read_csv
instead of self.read_csv
for that test I moved in this PR.
@jreback : Travis giving the green light . Ready to merge if there is nothing else.
nice PR @gfyoung
only comment is that don't put Bug Fixes at the very end, causes conflicts, stick them somewhat randomly in the spaces within the current list (that's why I leave the space)
gfyoung deleted the floatify-infinity-parsing branch
@jreback : Ah, fair point. Bad habit on my part. Thanks for the reminder!
2 participants