GH13347 ENH Allow true_values and false_values options in read_excel by WillAyd · Pull Request #14002 · pandas-dev/pandas (original) (raw)

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service andprivacy statement. We’ll occasionally send you account related emails.

Already on GitHub?Sign in to your account

Conversation20 Commits1 Checks0 Files changed

Conversation

This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.Learn more about bidirectional Unicode characters

[ Show hidden characters]({{ revealButtonHref }})

WillAyd

jreback

@@ -373,6 +388,12 @@ def _parse_cell(cell_contents, cell_typ):
val = int(cell_contents)
if val == cell_contents:
cell_contents = val
elif true_values and cell_contents in true_values:
# GH13347

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you are going to need to make a set of these if they r not none

eg this will break with a scalar

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just so I'm clear you are asking to change it from:
elif true_values and cell_contents in true_values:
to
elif true_values and cell_contents in set(true_values):

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you can try but that will fail as well

if true_values=5 or whatever

@codecov-io

Current coverage is 85.24% (diff: 100%)

Merging #14002 into master will decrease coverage by <.01%

@@ master #14002 diff @@

Files 140 140
Lines 50563 50563
Methods 0 0
Messages 0 0
Branches 0 0

Powered by Codecov. Last update 54ab5be...15ec67f

@jreback

@WillAyd ok, the issue is for some other things like this we accept scalar and lists. We need validation on this (in the parser code) to turn a scalar to a list (and a test for this).

@WillAyd

Thanks Jeff. I did however notice that the keyword arguments in read_csv don't do anything to turn scalars into a list. For instance, if I had a line in test.csv that was simply "foo,bar" then running the below would not convert the record foo to a bool.

df = pd.read_csv('test.csv', true_values='foo')

I have to specify the argument as:

df = pd.read_csv('test.csv', true_values=['foo'])

to make it work. I am thinking of two options:

  1. Add the aforementioned validation to read_excel or
  2. Have read_excel call the same function for bool checking as read_csv (maybe_convert_bool from pandas.lib)

The former would obviously fix the problem at hand but make the two read functions behave slightly differently. The second one would keep them consistent but require more work or an additional change to add the validation you are describing. Let me know which route you prefer.

@jreback

they should behave the same. You can for now mimic read_csv and we can come back in a later issue and fix this (as its a bug in read_csv). Or could do it all here.

@WillAyd

Latest commit should have fixed this to add the appropriate keyword arguments of true_values and false_values to the TextParser object referenced in the excel.py file. This keeps it consistent with read_csv. Updated asv_bench screenshot attached
asv_read_excel

jreback

@@ -412,6 +412,7 @@ Other enhancements
- Added documentation to :ref:`I/O<io.dtypes>` regarding the perils of reading in columns with mixed dtypes and how to handle it (:issue:`13746`)
- Raise ImportError for in the sql functions when sqlalchemy is not installed and a connection string is used (:issue:`11920`).
- :func:`read_excel` now supports the true_values and false_values keyword arguments (:issue: `13347`)

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

use double back ticks around true_values, false_values

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

no space (:issue:13347)

@WillAyd

Thanks for the callout on mixing strings / floats, etc... into the testing scenario. I realized my code isn't working when you mix in floats into the file and seems to be replacing them with np.nan. I'll dig further back in and try to see what's going on

@jreback

@WillAyd

closes pandas-dev#13347

Author: willayd william.ayd@gmail.com

Squashes the following commits:

242a6a6 ENH: Allow true_values and fals_values options in read_excel

@WillAyd

Rebase should now be complete

@jorisvandenbossche

@adrivsh

@WillAyd, note that 'foo' is not a scalar, but a string, which is a list of characters. I think your code converts 'f' and 'o' to booleans
Just flagging this in case it's useful

@jreback

@adrivsh you would have to show an example
strings are normally treated similarly to scalars

@adrivsh

The example is

import pandas as pd
pd.DataFrame(["f","o","o"]).to_csv("foo.csv")
pd.read_csv("foo.csv", false_values="foo")

The result is:

Unnamed: 0 0
0 0 False
1 1 False
2 2 False

What i am saying is that python duck-types strings as list of characters, so that false_values= "foo" and false_values = ["f","o","o"] will produce the same result, unless some code prevents that

@adrivsh

Interestingly,

import pandas as pd
pd.DataFrame(["f","o","o","b","a","r"]).to_csv("foo.csv")
pd.read_csv("foo.csv", false_values="foo")

does not transform "f"s and "o"s to False, so I actually have no idea what i am talking about.

@jorisvandenbossche

@adrivsh I think that is possibly correct, but is a bit a pitfall (and is not documented very well). The string is regarded as a list of characters, if you pass it a single element list it works as expected:

In [15]: pd.read_csv("foo.csv", false_values=["foo"])
Out[15]: 
   Unnamed: 0  0
0           0  f
1           1  o
2           2  o

@adrivsh do you want to open a new issue for that?

yarikoptic added a commit to neurodebian/pandas that referenced this pull request

Oct 13, 2016

@yarikoptic

@WillAyd WillAyd deleted the true-false-excel-options branch

December 12, 2017 15:39