DEPR: error_bad_lines and warn_bad_lines for read_csv by lithomas1 · Pull Request #40413 · pandas-dev/pandas (original) (raw)

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service andprivacy statement. We’ll occasionally send you account related emails.

Already on GitHub?Sign in to your account

Conversation37 Commits30 Checks0 Files changed

Conversation

This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.Learn more about bidirectional Unicode characters

[ Show hidden characters]({{ revealButtonHref }})

closes API/ENH: read_csv handling of bad lines (too many/few fields) #15122
tests added / passed
Ensure all linting tests pass, see here for how to run them
whatsnew entry

Summary of contents
- Adds new on_bad_lines parameter (I found on_bad_lines a more explicit name than bad_lines)
- This defaults to None at first in order to preserve compatibility, however it should be changed to error in 2.0 after
error_bad_lines and warn_bad_lines are removed.
- Cleanup of some C/Python Parser code ( add enum instead of using 2 variables for C and only use on_bad_lines in Python)

Hello @lithomas1! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:

There are currently no PEP 8 issues detected in this Pull Request. Cheers! 🍻

Comment last updated at 2021-05-28 00:23:35 UTC

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

cool, am +1 on this change, some comments

@@ -382,6 +402,7 @@
"memory_map": False,
"error_bad_lines": True,
"warn_bad_lines": True,
"on_bad_lines": None,

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah should remove error/warn_bad_lines from here

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

_get_options_with_defaults is really spaghetti-fied right now, so removing this would not the args not passed to the parser. I will try to clean up _get_options_with_defaults in a future PR if I have time.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we need to set the original options to None to avoid way more complexity.

else:
raise ValueError(f"Argument {on_bad_lines} is invalid for on_bad_lines")
else:
if kwds.get("error_bad_lines"):

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

these need a deprecation warning

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

_deprecated_defaults handles this for us.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe i am not seeing it. but it seems that the old options (error_bad_lines / warn_bad_lines) are still passed around. these should be handled in exactly 1 place, then immediately discarded (with a possible warning / error if multiple things are specified), and then only on_bad_lines should exist.

self.on_bad_lines = self.BadLineHandleMethod.SKIP
else:
raise ValueError(f"Argument {on_bad_lines} is invalid for on_bad_lines")
# Override on_bad_lines w/ deprecated args for backward compatibility

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

shouldn't all of these cases show a deprecation warning? L227-238

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks @lithomas1 this was a bear! thanks for sticking with it

Thanks for the reviews @jreback. Would it be possible to mention this in the deprecations tracker #30228?(I can't edit it myself.)

TLouf pushed a commit to TLouf/pandas that referenced this pull request

JulianWgs pushed a commit to JulianWgs/pandas that referenced this pull request

phofl mentioned this pull request

zhengruifeng pushed a commit to apache/spark that referenced this pull request

… from read_csv & enabling more tests

What changes were proposed in this pull request?

This PR proposes to remove squeeze parameter from read_csv to follow the behavior of latest pandas. See pandas-dev/pandas#40413 and pandas-dev/pandas#43427 for detail.

This PR also enables more tests for pandas 2.0.0 and above.

Why are the changes needed?

To follow the behavior of latest pandas, and increase the test coverage.

Does this PR introduce any user-facing change?

squeeze will be no longer available from read_csv. Otherwise, it's test-only.

How was this patch tested?

Enabling & updating the existing tests.

Closes #42551 from itholic/pandas_remaining_tests.

Authored-by: itholic haejoon.lee@databricks.com Signed-off-by: Ruifeng Zheng ruifengz@apache.org

valentinp17 pushed a commit to valentinp17/spark that referenced this pull request

… from read_csv & enabling more tests

What changes were proposed in this pull request?

This PR proposes to remove squeeze parameter from read_csv to follow the behavior of latest pandas. See pandas-dev/pandas#40413 and pandas-dev/pandas#43427 for detail.

This PR also enables more tests for pandas 2.0.0 and above.

Why are the changes needed?

To follow the behavior of latest pandas, and increase the test coverage.

Does this PR introduce any user-facing change?

squeeze will be no longer available from read_csv. Otherwise, it's test-only.

How was this patch tested?

Enabling & updating the existing tests.

Closes apache#42551 from itholic/pandas_remaining_tests.

Authored-by: itholic haejoon.lee@databricks.com Signed-off-by: Ruifeng Zheng ruifengz@apache.org

ragnarok56 pushed a commit to ragnarok56/spark that referenced this pull request

… from read_csv & enabling more tests

What changes were proposed in this pull request?

This PR proposes to remove squeeze parameter from read_csv to follow the behavior of latest pandas. See pandas-dev/pandas#40413 and pandas-dev/pandas#43427 for detail.

This PR also enables more tests for pandas 2.0.0 and above.

Why are the changes needed?

To follow the behavior of latest pandas, and increase the test coverage.

Does this PR introduce any user-facing change?

squeeze will be no longer available from read_csv. Otherwise, it's test-only.

How was this patch tested?

Enabling & updating the existing tests.

Closes apache#42551 from itholic/pandas_remaining_tests.

Authored-by: itholic haejoon.lee@databricks.com Signed-off-by: Ruifeng Zheng ruifengz@apache.org