BUG: Respect usecols even with empty data by gfyoung · Pull Request #12506 · pandas-dev/pandas (original) (raw)
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service andprivacy statement. We’ll occasionally send you account related emails.
Already on GitHub?Sign in to your account
Conversation161 Commits2 Checks0 Files changed
Conversation
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.Learn more about bidirectional Unicode characters
[ Show hidden characters]({{ revealButtonHref }})
closes #12493
in which the usecols
argument was not being respected for empty data. This is because no filtering was applied when the first (and only) chunk was being read.
Tests are passing. Should be good to merge.
def check_usecols(stringIO): |
df = read_csv(stringIO, names=names, usecols=usecols) |
assert_array_equal(df.columns, usecols) |
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
NEVER use numpy.testing
imports!!!!!
ALWAYS use pandas testing functions. These are much more comprehensive and fully test all behavior. It is also confusing to a reader.
Fully test vs an expected frame and use assert_frame_equal
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think I got it with the first exclamation point. ;) - fixed.
If some could cancel this build #18023 (it's an old build), that would be great. Thanks!
@jreback : Tests are passing once more. Should be good to merge AFAICT.
while self.line_pos <= hr: |
---|
line = self._next_line() |
except StopIteration: |
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why are you adding all of this code and making this very complicated?
the idea is to make things simpler. pls see if you can find a general soln.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How complicated is this code? It's essentially boilerplate and logic from the C engine code. I'm not writing anything new here. I don't understand why people would have a hard time understanding this code but yet the C engine code is considered comprehensible.
@gfyoung you had to change a whole bunch of things to get your code to work (e.g. changing long established behavior in excel/html parsing), the StopIteration
can now also be a ValueError
. That is just crazy. My point is to make minimal changes that get the job done, use general methods where possible, and don't duplicate code.
@jreback : Simplified things a little by changing the error from ValueError
to StopIteration
. I still kept some of the error messages from the C engine because I find them to be quite useful.
at a glance looks ok. still working on 0.18.0, so will get to next week.
👍 will ping on all of them next week then.
Sorry, slowly getting back on track :-) I will take a look now.
Just a quick question, my previous comment was about HTML and Excel parser. The whatsnew note now does not mention those anymore. There are no longer changes to to read_html
and read_excel
?
@jorisvandenbossche : No worries. 😄 There are in fact changes to read_html
and read_excel
. If you look at the changed files, both changed the exception they are catching from StopIteration
to EmptyDataError
when there are empty tables. However, those changes are a result of the changes made to the parsers and are not direct changes to read_html
and read_excel
.
df = self.read_csv(StringIO(''), names=names, usecols=usecols) |
---|
def test_read_with_bad_header(self): |
name = self.__class__.__name__ |
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is this used?
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Wait...why did flake8
not catch that? No, I don't think so AFAICT.
@gfyoung But they only catch it under the hood? (I mean, it doesn't bubble up to the user?)
In [1]: df = pd.read_csv(StringIO(''), engine='c') |
... |
pandas.io.common.EmptyDataError: No columns to parse from file |
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is it actually shown with the full name? (just a question, didn't test it, didn't fetch the PR, but I would just show the same as in an actual console)
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Which is what I did. 😄 - FYI you can observe this full out name thing if you trigger any current CParserError
.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK, perfect! (I just wondered if it was the case)
@jorisvandenbossche : Not AFAICT (it's an explicit try
-except
block for read_html
and read_excel
)
@jreback @gfyoung I didn't look at the code changes in detail, but the whatsnew is in any case very clear and sounds logical! So good to go for me
@@ -179,6 +179,43 @@ New Behavior: |
---|
# Output is a DataFrame |
df.groupby(pd.TimeGrouper(key='date', freq='M')).apply(lambda x: x[['value']].sum()) |
Changes in ``read_csv`` errors |
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this sub-section should be in API changes.
add a direction above for referncding this: .._whatsnew.......
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
say Change in
read_csvexceptions
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
@gfyoung ok just a couple of minor doc comments. Pls also have a look thru io.rst
and see if any exceptions are mentioned (and if so fix them), as well as doc-strings.
In Python, when reading an empty file, it used to throw a StopIteration error with no error message. This PR helps to differentiate the case when no columns are inferable, which now leads to an EmptyDataError for both the C and Python engines.
[ci skip]
In addition to this error change, several others have been made as well: |
- ``CParserError`` is now a ``ValueError`` instead of just an ``Exception`` (:issue:`12551`) |
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
was this whatsnew just not put in before? (the PR was already merged)
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, but I moved it into this section because it's related
@jreback : Sure thing! FYI, you can also cancel my build here.