GH13347 ENH Allow true_values and false_values options in read_excel by WillAyd · Pull Request #14002 · pandas-dev/pandas (original) (raw)
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service andprivacy statement. We’ll occasionally send you account related emails.
Already on GitHub?Sign in to your account
Conversation20 Commits1 Checks0 Files changed
Conversation
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.Learn more about bidirectional Unicode characters
[ Show hidden characters]({{ revealButtonHref }})
- [ X] closes Allow true_values and false_values options in read_excel #13347
- [ X] tests added / passed
- [ X] passes
git diff upstream/master | flake8 --diff
- [ X] whatsnew entry
@@ -373,6 +388,12 @@ def _parse_cell(cell_contents, cell_typ): |
---|
val = int(cell_contents) |
if val == cell_contents: |
cell_contents = val |
elif true_values and cell_contents in true_values: |
# GH13347 |
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
you are going to need to make a set of these if they r not none
eg this will break with a scalar
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just so I'm clear you are asking to change it from:elif true_values and cell_contents in true_values:
toelif true_values and cell_contents in set(true_values):
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
you can try but that will fail as well
if true_values=5 or whatever
Current coverage is 85.24% (diff: 100%)
@@ master #14002 diff @@
Files 140 140
Lines 50563 50563
Methods 0 0
Messages 0 0
Branches 0 0
- Hits 43103 43102 -1
- Misses 7460 7461 +1
Partials 0 0
Powered by Codecov. Last update 54ab5be...15ec67f
@WillAyd ok, the issue is for some other things like this we accept scalar and lists. We need validation on this (in the parser code) to turn a scalar to a list (and a test for this).
Thanks Jeff. I did however notice that the keyword arguments in read_csv don't do anything to turn scalars into a list. For instance, if I had a line in test.csv that was simply "foo,bar" then running the below would not convert the record foo to a bool.
df = pd.read_csv('test.csv', true_values='foo')
I have to specify the argument as:
df = pd.read_csv('test.csv', true_values=['foo'])
to make it work. I am thinking of two options:
- Add the aforementioned validation to read_excel or
- Have read_excel call the same function for bool checking as read_csv (maybe_convert_bool from pandas.lib)
The former would obviously fix the problem at hand but make the two read functions behave slightly differently. The second one would keep them consistent but require more work or an additional change to add the validation you are describing. Let me know which route you prefer.
they should behave the same. You can for now mimic read_csv and we can come back in a later issue and fix this (as its a bug in read_csv). Or could do it all here.
Latest commit should have fixed this to add the appropriate keyword arguments of true_values and false_values to the TextParser object referenced in the excel.py file. This keeps it consistent with read_csv. Updated asv_bench screenshot attached
@@ -412,6 +412,7 @@ Other enhancements |
---|
- Added documentation to :ref:`I/O<io.dtypes>` regarding the perils of reading in columns with mixed dtypes and how to handle it (:issue:`13746`) |
- Raise ImportError for in the sql functions when sqlalchemy is not installed and a connection string is used (:issue:`11920`). |
- :func:`read_excel` now supports the true_values and false_values keyword arguments (:issue: `13347`) |
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
use double back ticks around true_values, false_values
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
no space (:issue:
13347)
Thanks for the callout on mixing strings / floats, etc... into the testing scenario. I realized my code isn't working when you mix in floats into the file and seems to be replacing them with np.nan. I'll dig further back in and try to see what's going on
closes pandas-dev#13347
Author: willayd william.ayd@gmail.com
Squashes the following commits:
242a6a6 ENH: Allow true_values and fals_values options in read_excel
Rebase should now be complete
@WillAyd, note that 'foo' is not a scalar, but a string, which is a list of characters. I think your code converts 'f' and 'o' to booleans
Just flagging this in case it's useful
@adrivsh you would have to show an example
strings are normally treated similarly to scalars
The example is
import pandas as pd
pd.DataFrame(["f","o","o"]).to_csv("foo.csv")
pd.read_csv("foo.csv", false_values="foo")
The result is:
Unnamed: 0 0
0 0 False
1 1 False
2 2 False
What i am saying is that python duck-types strings as list of characters, so that false_values= "foo" and false_values = ["f","o","o"] will produce the same result, unless some code prevents that
Interestingly,
import pandas as pd
pd.DataFrame(["f","o","o","b","a","r"]).to_csv("foo.csv")
pd.read_csv("foo.csv", false_values="foo")
does not transform "f"s and "o"s to False, so I actually have no idea what i am talking about.
@adrivsh I think that is possibly correct, but is a bit a pitfall (and is not documented very well). The string is regarded as a list of characters, if you pass it a single element list it works as expected:
In [15]: pd.read_csv("foo.csv", false_values=["foo"])
Out[15]:
Unnamed: 0 0
0 0 f
1 1 o
2 2 o
@adrivsh do you want to open a new issue for that?
yarikoptic added a commit to neurodebian/pandas that referenced this pull request
- commit 'v0.19.0rc1-25-ga7469cf': (471 commits) ENH: Add divmod to series and index. (pandas-dev#14208) Fix generator tests to run (pandas-dev#14245) BUG: GH13629 Binned groupby median function with empty bins (pandas-dev#14225) TST/TEMP: fix pyqt to 4.x for plotting tests (pandas-dev#14240) DOC: added example to Series.map showing use of na_action parameter (GH14231) DOC: split docstring into multiple lines in excel.py (pandas-dev#14073) MAINT: Use module in _DeprecatedModule. (pandas-dev#14181) ENH: Allow true_values and false_values options in read_excel (pandas-dev#14002) DOC: fix incorrect example in unstack docstring (GH14206) (pandas-dev#14211) BUG: iloc fails with non lex-sorted MultiIndex pandas-dev#13797 BUG: add check for infinity in call of EngFormatter In gbq.to_gbq allow the DataFrame column order to differ from schema BLD: require cython if tempita is needed DOC: add source links to api docs (pandas-dev#14200) BUG: compat with Stata ver 111 Fix: F999 dictionary key '2000q4' repeated with different values (pandas-dev#14198) BLD: Test for Python 3.5 with C locale BUG: DatetimeTZBlock can't assign values near dst boundary BUG: union_categorical with Series and cat idx BUG: fix str.contains for series containing only nan values ...
WillAyd deleted the true-false-excel-options branch