API: Warn about dups in names for read_csv by gfyoung · Pull Request #17346 · pandas-dev/pandas (original) (raw)

Conversation

This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.Learn more about bidirectional Unicode characters

[ Show hidden characters]({{ revealButtonHref }})

gfyoung

Title is self-explanatory.

xref #17095.

@codecov

jreback

counts = {}
warn_dups = False
for name in names:

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

just use set intersection here

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How so? This is a fail-early method, which is why I chose it.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

simply check for len(names) !+ len(set(names)). much more idiomatic

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fair enough. Done.

counts[name] = True
if warn_dups:
msg = ("Duplicate names specified. This "

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

so are we deprecating this? then this should be a FutureWarning.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fair enough. Done.

jreback

@@ -406,6 +438,10 @@ def _read(filepath_or_buffer, kwds):
chunksize = _validate_integer('chunksize', kwds.get('chunksize', None), 1)
nrows = _validate_integer('nrows', kwds.get('nrows', None))
# Check for duplicates in names.
names = kwds.get("names", None)
_check_dup_names(names)

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

call this _validate_names and have it return names, so its a similar patter to the other validators

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

@gfyoung

@jreback : All comments addressed, and tests are green. PTAL

@gfyoung

jreback

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

typo in whatsnew, otherwise lgtm. make sure this is on the deprecation list as well

@@ -283,6 +283,7 @@ Other API Changes
- The Categorical constructor no longer accepts a scalar for the ``categories`` keyword. (:issue:`16022`)
- Accessing a non-existent attribute on a closed :class:`~pandas.HDFStore` will now
raise an ``AttributeError`` rather than a ``ClosedFileError`` (:issue:`16301`)
- :func:`read_csv` now issues a ``UserWarning`` if the ``names`` parameter contains duplicates (:issue:`17095`)

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should be FutureWarning

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Doh! My bad for not catching that. Fixed.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changing back to UserWarning in light of later discussion.

@gfyoung

Fixed typo and added to deprecation list. Will merge on green then unless told otherwise.

@jreback

@gfyoung wait for @jorisvandenbossche comment (as not sure if he commented here). IIRC a comment he made that having duplicate names is ok .

@gfyoung

@gfyoung

@gfyoung

@TomAugspurger

I think Joris is off on holiday. I believe he's back next week.

@gfyoung

@TomAugspurger : Ah! I had a feeling that that was the case (I remember seeing an email about that). I'll wait then until he gets back.

@gfyoung

@jorisvandenbossche

Can you remember me the rationale for deprecating this?
Is it because we cannot actually handle it well?

Because I actually previously had a usecase where this proved useful (I had a non-informative column every other column, gave it the same name in names and then dropped the single name. But this specific case can of course easily be solved differently, by giving names like 'dummy1', 'dummy2', .. and then removing all columns that start with 'dummy')

@gfyoung

@jorisvandenbossche : Here is what you said back in July here.

Essentially, we are deprecating this behavior because names is a user-specified parameter, and passing in duplicate names deliberately only encourages buggy behavior.

@gfyoung

@jorisvandenbossche

Sorry for the slow response.

So maybe a more general question: is it our intention to once fix mangle_dupe_cols=True ? (as currently actually only the default False works). If we do, I see no good reason to disallow duplicates in passed names vs names in the csv file.

@gfyoung

@jorisvandenbossche : No worries! I think you meant the other way around. mangle_dupe_cols=True is fully supported at this point. It is mangle_dupe_cols=False that we disabled.

The reason for us discouraging duplicates in names is because duplicates are generally more error-prone. Contrary to duplicates in a file, names is a deliberate choice.

@jreback : Thoughts?

@jorisvandenbossche

I think you meant the other way around

yes, of course ... :-)

Contrary to duplicates in a file, names is a deliberate choice.

That's true. But there are many other ways to deliberately make a dataframe with duplicate columns which we don't disallow anyway.

To be clear, in general I am all for a restricted scope of capabilities/possibilties. But in this case, limiting the abilities of name does not actually reduce code complexity, but increases it. As the duplicates are already perfectly handled by the code, so we are introducing a special case. Therefore I was wondering whether this is actually needed (user will see that the names are mangled anyhow).

@gfyoung

@jorisvandenbossche : True that we'll see them mangled anyhow, but why the need to add complexity to just handle them in the first place? I added the handling for duplicate names in an earlier PR because it was a bug, not to enhance support.

If the user really wants to have duplicate names, they can set it themselves and reading in the file, but I don't know if we want to actively encourage setting duplicate names to a read-in DataFrame.

@jorisvandenbossche

Ah, I assumed that the mangling of names or the names from the header of the file was done using the same code path? That's not the case ?
(to state it another way: once we remove the deprecation in this PR, we can actually remove more code than is added in this PR?)

@gfyoung

See #17095 : it's a not a ridiculous amount of new logic that I added, but new logic nonetheless 😄 Also, see @jreback comment in that PR here

@gfyoung

@jreback

so the only reason we have mangle_dupe_columns is to support duplicates in the first place. I think I stated we should deprecate that argument entirely, then I would allow duplicates in names but show a UserWarning if names contains duplicates.

So pretty much allow what is happening today but with a UserWarning and reducing the path complexity a bit (removing mangle_dupe_columes).

Its not an error to have duplicates in names but I guess can't disallow it entirely.

@gfyoung

@jreback : I don't recall you saying this before. In addition, I think there has been user interest and not mangling in cases when the CSV file itself contains dupe names. That being said, if we think making the warning less harsh is a good idea, I can do that.

@gfyoung

@jreback : I made it issue a UserWarning instead. PTAL.

@jreback

lgtm. you might want to note in the doc-string the same warning.

@gfyoung

lgtm. you might want to note in the doc-string the same warning.

Sounds good. I'll quickly add that.

@gfyoung

jreback

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm modulo small comment

Check if the `names` parameter contains duplicates.
Currently, this function issues a warning if that is the case. In the
future, we will raise an error.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

doc string needs updating

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch. Fixed.

@gfyoung

@gfyoung

@jreback

thanks @gfyoung I think fine for now, we can always revisit if needed.

alanbato pushed a commit to alanbato/pandas that referenced this pull request

Nov 10, 2017

@gfyoung @alanbato

No-Stream pushed a commit to No-Stream/pandas that referenced this pull request

Nov 28, 2017

@gfyoung @No-Stream