BUG: Respect usecols even with empty data by gfyoung · Pull Request #12506 · pandas-dev/pandas (original) (raw)

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service andprivacy statement. We’ll occasionally send you account related emails.

Already on GitHub?Sign in to your account

Conversation161 Commits2 Checks0 Files changed

Conversation

This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.Learn more about bidirectional Unicode characters

[ Show hidden characters]({{ revealButtonHref }})

closes #12493

in which the usecols argument was not being respected for empty data. This is because no filtering was applied when the first (and only) chunk was being read.

Tests are passing. Should be good to merge.

jreback


def check_usecols(stringIO):
df = read_csv(stringIO, names=names, usecols=usecols)
assert_array_equal(df.columns, usecols)

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

NEVER use numpy.testing imports!!!!!

ALWAYS use pandas testing functions. These are much more comprehensive and fully test all behavior. It is also confusing to a reader.

Fully test vs an expected frame and use assert_frame_equal

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think I got it with the first exclamation point. ;) - fixed.

If some could cancel this build #18023 (it's an old build), that would be great. Thanks!

@jreback : Tests are passing once more. Should be good to merge AFAICT.

jreback

while self.line_pos <= hr:
line = self._next_line()

except StopIteration:

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why are you adding all of this code and making this very complicated?

the idea is to make things simpler. pls see if you can find a general soln.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How complicated is this code? It's essentially boilerplate and logic from the C engine code. I'm not writing anything new here. I don't understand why people would have a hard time understanding this code but yet the C engine code is considered comprehensible.

@gfyoung you had to change a whole bunch of things to get your code to work (e.g. changing long established behavior in excel/html parsing), the StopIteration can now also be a ValueError. That is just crazy. My point is to make minimal changes that get the job done, use general methods where possible, and don't duplicate code.

@jreback : Simplified things a little by changing the error from ValueError to StopIteration. I still kept some of the error messages from the C engine because I find them to be quite useful.

at a glance looks ok. still working on 0.18.0, so will get to next week.

👍 will ping on all of them next week then.

Sorry, slowly getting back on track :-) I will take a look now.

Just a quick question, my previous comment was about HTML and Excel parser. The whatsnew note now does not mention those anymore. There are no longer changes to to read_html and read_excel ?

@jorisvandenbossche : No worries. 😄 There are in fact changes to read_html and read_excel. If you look at the changed files, both changed the exception they are catching from StopIteration to EmptyDataError when there are empty tables. However, those changes are a result of the changes made to the parsers and are not direct changes to read_html and read_excel.

jorisvandenbossche

df = self.read_csv(StringIO(''), names=names, usecols=usecols)

def test_read_with_bad_header(self):
name = self.__class__.__name__

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is this used?

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wait...why did flake8 not catch that? No, I don't think so AFAICT.

@gfyoung But they only catch it under the hood? (I mean, it doesn't bubble up to the user?)

jorisvandenbossche


In [1]: df = pd.read_csv(StringIO(''), engine='c')
...
pandas.io.common.EmptyDataError: No columns to parse from file

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it actually shown with the full name? (just a question, didn't test it, didn't fetch the PR, but I would just show the same as in an actual console)

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Which is what I did. 😄 - FYI you can observe this full out name thing if you trigger any current CParserError.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK, perfect! (I just wondered if it was the case)

@jorisvandenbossche : Not AFAICT (it's an explicit try-except block for read_html and read_excel)

@jreback @gfyoung I didn't look at the code changes in detail, but the whatsnew is in any case very clear and sounds logical! So good to go for me

jreback

@@ -179,6 +179,43 @@ New Behavior:
# Output is a DataFrame
df.groupby(pd.TimeGrouper(key='date', freq='M')).apply(lambda x: x[['value']].sum())

Changes in ``read_csv`` errors

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this sub-section should be in API changes.

add a direction above for referncding this: .._whatsnew.......

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

say Change inread_csvexceptions

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

@gfyoung ok just a couple of minor doc comments. Pls also have a look thru io.rst and see if any exceptions are mentioned (and if so fix them), as well as doc-strings.

In Python, when reading an empty file, it used to throw a StopIteration error with no error message. This PR helps to differentiate the case when no columns are inferable, which now leads to an EmptyDataError for both the C and Python engines.

[ci skip]

jreback


In addition to this error change, several others have been made as well:

- ``CParserError`` is now a ``ValueError`` instead of just an ``Exception`` (:issue:`12551`)