BUG, ENH: Improve infinity parsing for read_csv by gfyoung · Pull Request #13274 · pandas-dev/pandas (original) (raw)

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service andprivacy statement. We’ll occasionally send you account related emails.

Already on GitHub?Sign in to your account

Conversation

This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.Learn more about bidirectional Unicode characters

[ Show hidden characters]({{ revealButtonHref }})

gfyoung

  1. Allow mixed-case infinity strings for the Python engine

Bug was traced back via lib.maybe_convert_numeric to the floatify function in pandas/src/parse_helper.h. In addition to correcting the bug and adding tests for it, this PR also moves the test_inf_parsing test from c_parser_only.py to common.py in the pandas/io/tests/parser dir.

  1. Interpret +inf as positive infinity for both engines

float('+inf') in Python is interpreted as positive infinity, so we should allow it too in parsing.

jreback

@@ -252,3 +252,4 @@ Bug Fixes
- Bug in ``groupby`` where ``apply`` returns different result depending on whether first result is ``None`` or not (:issue:`12824`)
- Bug in ``Categorical.remove_unused_categories()`` changes ``.codes`` dtype to platform int (:issue:`13261`)
- Bug in ``pd.lib.maybe_convert_numeric`` in which infinities of mixed-case forms were not being interpreted properly (:issue:`13274`)

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

give a more user friendly example (the user doesn't see this function as its internal)

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fair enough. I'll rewrite it to refer to read_csv and the Python engine.

@jreback

ok looks reasonable. pls run parser asv's just to be sure (I doubt we have something that specifically has infs in it; you can add if you want).

@jreback

btw, not averse to signed inf (e.g. +inf), if its in the IEEE spec (not sure)

@gfyoung

  1. Not sure if it's necessary to add a benchmark for this. A couple of string checks shouldn't have much effect, so unless there is a strong objection, I'll leave it for now.
  2. I can't actually run benchmarks at the moment. There seems to be an issue with linux-64 channels in Anaconda for tables and libpython that prevents me from running the benchmarks. Can you confirm on a different architecture?
  3. Can't confirm if +inf is in IEEE standard (Googled "IEEE infinity," and that didn't turn up anything). However, float('+inf') does work in Python, so I guess we should accept it. However, it is not supported in the CParser either right now, so I'll need to see where infinity parsing is done there.

@jreback

pls just run the current benchmarks - don't need a new one

you don't need libpython on Linux

@gfyoung

I'm just running asv continuous master floatify-infinity-parsing -b io_bench, and that's what it's complaining about, nothing special here, right?

@gfyoung gfyoung changed the titleFix infinity parsing in parse_helper.floatify BUG, ENH: Improve infinity parsing for read_csv

May 25, 2016

@gfyoung

@jreback : Added +inf parsing to both parsers, and Travis is still giving the green light. While I cannot run the benchmarks because of that Anaconda issue, I wouldn't be surprised if they remain relatively the same since my changes shouldn't down anything too much. Ready to merge if there are no other concerns.

jreback

strcpy(inf_str, data);
for ( ; *inf_ptr; ++inf_ptr) *inf_ptr = tolower(*inf_ptr);
if (0 == strcmp(inf_str, "-inf")) {

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why aren't you using strcasecomp here?

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good question! I just do include "headers/portable.h" and then I can call it? That would indeed save me pain of this char copying.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't do C very much :> but I think yes; though its included the same way as strcmp

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Haha, fair enough. Will give it a shot and see. Otherwise, I'll ask Google.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

also not sure if this will be more clear to make a if/else if for len 3 and another for len 4 (I 'c' lots of c code doing things like this)

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fair enough. Can break up for clarity.

@gfyoung

  1. Python infinity parsing bug

Initially an attempt to fix a Python parsing bug of mixed-case infinity strings, the bug was traced back via lib.maybe_convert_numeric to the 'floatify' method in pandas/src/parse_helper.h.

In addition to correcting the bug and adding tests for it, this commit also moves the infinity-parsing test from CParser-only to common.

  1. Interpret '+inf' as positive infinity

This is consistent with the Python API, where float('+inf') is interpreted as positive infinity.

@jreback

lgtm, ping on green. we don't have an associated issue right?

@gfyoung

Nope. Discovered it when I was refactoring test_parsers, and someone had inadvertently put read_csv instead of self.read_csv for that test I moved in this PR.

@gfyoung

@jreback : Travis giving the green light . Ready to merge if there is nothing else.

@jreback

nice PR @gfyoung

only comment is that don't put Bug Fixes at the very end, causes conflicts, stick them somewhat randomly in the spaces within the current list (that's why I leave the space)

@gfyoung gfyoung deleted the floatify-infinity-parsing branch

May 25, 2016 19:26

@gfyoung

@jreback : Ah, fair point. Bad habit on my part. Thanks for the reminder!

2 participants

@gfyoung @jreback