ENH: inconsistent naming convention for read_excel column selection (#4988) by abarber4gh · Pull Request #16488 · pandas-dev/pandas (original) (raw)
Conversation
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.Learn more about bidirectional Unicode characters
[ Show hidden characters]({{ revealButtonHref }})
- update API to use ‘usecols’ instead of ‘parse_cols’. still functionally the same as ‘parse_col’,
- added test cases for ‘usecols’, added assert_produces_warning(FutureWarning) to other test
cases that use ‘parse_cols’. - refactor column use column parsing to only occur once per sheet.
- updated whats new with deprecated parse_col argument.
- closes ENH: inconsistent naming convention for read_csv and read_excel column selection #4988
- tests added / passed
- passes
git diff upstream/master --name-only -- '*.py' | flake8 --diff - whatsnew entry
- packers.* BENCHMARKS NOT SIGNIFICANTLY CHANGED.
FYI, just discussing a bug related to usecols #15316, might be easiest to work that change into your refactoring, or can be separate, cc @stanleyguan
| - ``RangeIndex.append`` now returns a ``RangeIndex`` object when possible (:issue:`16212`) |
|---|
| - :func:`to_pickle` has gained a protocol parameter (:issue:`16252`). By default, this parameter is set to `HIGHEST_PROTOCOL https://docs.python.org/3/library/pickle.html#data-stream-format`__ |
| - :func:`api.types.infer_dtype` now infers decimals. (:issue: `15690`) |
| - :func:`read_excel` now allows a column character list (E.G. ['A', 'C', 'D']) with the ``usecols`` parameter (:issue:`4988`). |
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we are listed this in deprecations so this is unecessary
| Deprecations |
|---|
| ~~~~~~~~~~~~ |
| - :func:`read_excel()` has deprecated ``sheetname`` in favor of ``sheet_name`` for consistency with to_excel() (:issue:`10559`). |
| - :func:`read_excel()` has deprecated ``parse_cols`` in favor of ``usecols`` for consistency with other read_ functions (:issue:`4988`). |
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
pd.read_* functions
| * If string then indicates comma separated list of column names and |
|---|
| column ranges (e.g. "A:E" or "A,C,E:F") |
| parse_cols : int or list, default None |
| .. deprecated:: 0.21.0 |
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
hmm, didn't know sphinx had this deprecated tag. @TomAugspurger @jorisvandenbossche should maybe change other deprecations to use this if it looks nice.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
appears as:
parse_cols : int or list, default None
Deprecated since version 0.21.0: Use usecols instead
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@TomAugspurger if we like this (and lgtm), let's make an issue to updated all DEPRECATED in codebase with this directive
| dfref = self.get_csv_refdf('test1') |
|---|
| dfref = dfref.reindex(columns=['A', 'B', 'C']) |
| df1 = self.get_exceldf('test1', 'Sheet1', index_col=0, usecols=3) |
| df2 = self.get_exceldf('test1', 'Sheet2', skiprows=[1], index_col=0, |
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
so you need to change ALL tests to use the new one (usecols), except for a single test to actually hit the deprecation.
| tm.assert_frame_equal(df1, dfref, check_names=False) |
| tm.assert_frame_equal(df2, dfref, check_names=False) # backward compat |
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
are these new tests?
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes, new tests for the new functionality.
| * If string then indicates comma separated list of column names and |
|---|
| column ranges (e.g. "A:E" or "A,C,E:F") |
| parse_cols : int or list, default None |
| .. deprecated:: 0.21.0 |
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@TomAugspurger if we like this (and lgtm), let's make an issue to updated all DEPRECATED in codebase with this directive
| """ |
|---|
| def _excel2num(x): |
| "Convert Excel column name like 'AB' to 0-based column index" |
| return reduce(lambda s, a: s * 26 + ord(a) - ord('A') + 1, |
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is there a reason you are changing this code? does this involve the deprecation somehow? if you are cleaning/fixing, pls do in another PR.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
code changed to allow passing list of strings, similar to other read_* functions, and keep the read_excel "superpower" to pass a string that is parsed as mentioned in the original issue (#4988, point 3).
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
and that's fine, but needs to be in a separate PR from the deprecation change.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ok, opened #16510 for this functionality. will re-submit with only the argument change (parse_cols -> usecols).
| tm.assert_frame_equal(df2, dfref, check_names=False) |
|---|
| tm.assert_frame_equal(df3, dfref, check_names=False) # backward compat |
| def test_parse_cols_str(self): |
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
leave the original tests structure (sure you can change the name to conform), but don't change the tests (in THIS PR).
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
tests are back to original but with changed function & kwarg names.
@abarber4gh can you rebase, now that the excel tests are running?
- removed usecols mention in Other Enhancments section, remains in Deprecations.
- removed test_parse_* test methods in favor of test_usecols_* methods.
- changed parse_cols to usecols in test_read_one_empty_col_* instead of catching warning.
…n selection (#4988)
update API to use ‘usecols’ instead of ‘parse_cols’. still functionally the same as ‘parse_col’, added test cases for ‘usecols’, added assert_produces_warning(FutureWarning) to other test cases that use ‘parse_cols’.
refactor column use column parsing to only occur once per sheet.
updated whats new with deprecated parse_col argument and other enhancements to usecols functionality. update documentation to show new usecols functionality in read_excel().
add check_stacklevel=False to test_excel_oldindex_format()
Replaces most uses of implicit global state from matplotlib in test_datetimelike.py. This was potentially causing random failures where a figure expected to be on a new, blank figure would instead plot on an existing axes (that's the guess at least).
Adding some more documentation on dataframe with regards to dtype
Making example for creating dataframe from np matrix easier
PERF: vectorize _interp_limit
CLN: remove old implementation
fixup! CLN: remove old implementation
BUG: Handle numpy strings in index names in HDF5 #13492
REF: refactor to _ensure_str
…ches (#16460)
gh-14671 Check if usecols with type string contains a subset of names, if not throws an error
tests added for gh-14671, expected behavior of simultaneous use of usecols and names unclear so these tests are commented out
Review comments
Splits extra information about the license and copyright holders to AUTHORS.md.
COMPAT: numpy 1.13 test compat
CI: fix doc build to 1.12
- removed usecols mention in Other Enhancments section, remains in Deprecations.
- removed test_parse_* test methods in favor of test_usecols_* methods.
- changed parse_cols to usecols in test_read_one_empty_col_* instead of catching warning.
…n selection (#4988)
update API to use ‘usecols’ instead of ‘parse_cols’. still functionally the same as ‘parse_col’, added test cases for ‘usecols’, added assert_produces_warning(FutureWarning) to other test cases that use ‘parse_cols’.
refactor column use column parsing to only occur once per sheet.
updated whats new with deprecated parse_col argument and other enhancements to usecols functionality. update documentation to show new usecols functionality in read_excel().
add check_stacklevel=False to test_excel_oldindex_format()
…o issue#4988
- 'issue#4988' of https://github.com/abarber4gh/pandas:
add
deprecate_kwargfrom_decoratorsaddcheck_stacklevel=Falsetotest_excel_oldindex_format()removed excess blank line. change parse_cols to usecols change tests keyword from parse_cols to usecol. no message ENH: inconsistent naming convention for read_csv and read_excel column selection (#4988) implement changes request in PR#16488 - removed usecols mention in Other Enhancments section, remains in Deprecations. - removed test_parse_* test methods in favor of test_usecols_* methods. - changed parse_cols to usecols in test_read_one_empty_col_* instead of catching warning. rebased
can you rebase / update according to comments
closing as stale. pls ping to reopen if you want to continue.