API: Warn about dups in names for read_csv by gfyoung · Pull Request #17346 · pandas-dev/pandas (original) (raw)
Conversation
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.Learn more about bidirectional Unicode characters
[ Show hidden characters]({{ revealButtonHref }})
Title is self-explanatory.
xref #17095.
counts = {} |
---|
warn_dups = False |
for name in names: |
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
just use set intersection here
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How so? This is a fail-early method, which is why I chose it.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
simply check for len(names) !+ len(set(names)).
much more idiomatic
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fair enough. Done.
counts[name] = True |
---|
if warn_dups: |
msg = ("Duplicate names specified. This " |
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
so are we deprecating this? then this should be a FutureWarning.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fair enough. Done.
@@ -406,6 +438,10 @@ def _read(filepath_or_buffer, kwds): |
---|
chunksize = _validate_integer('chunksize', kwds.get('chunksize', None), 1) |
nrows = _validate_integer('nrows', kwds.get('nrows', None)) |
# Check for duplicates in names. |
names = kwds.get("names", None) |
_check_dup_names(names) |
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
call this _validate_names
and have it return names, so its a similar patter to the other validators
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
@jreback : All comments addressed, and tests are green. PTAL
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
typo in whatsnew, otherwise lgtm. make sure this is on the deprecation list as well
@@ -283,6 +283,7 @@ Other API Changes |
---|
- The Categorical constructor no longer accepts a scalar for the ``categories`` keyword. (:issue:`16022`) |
- Accessing a non-existent attribute on a closed :class:`~pandas.HDFStore` will now |
raise an ``AttributeError`` rather than a ``ClosedFileError`` (:issue:`16301`) |
- :func:`read_csv` now issues a ``UserWarning`` if the ``names`` parameter contains duplicates (:issue:`17095`) |
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
should be FutureWarning
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Doh! My bad for not catching that. Fixed.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Changing back to UserWarning
in light of later discussion.
Fixed typo and added to deprecation list. Will merge on green then unless told otherwise.
@gfyoung wait for @jorisvandenbossche comment (as not sure if he commented here). IIRC a comment he made that having duplicate names is ok .
I think Joris is off on holiday. I believe he's back next week.
@TomAugspurger : Ah! I had a feeling that that was the case (I remember seeing an email about that). I'll wait then until he gets back.
Can you remember me the rationale for deprecating this?
Is it because we cannot actually handle it well?
Because I actually previously had a usecase where this proved useful (I had a non-informative column every other column, gave it the same name in names
and then dropped the single name. But this specific case can of course easily be solved differently, by giving names like 'dummy1', 'dummy2', .. and then removing all columns that start with 'dummy')
@jorisvandenbossche : Here is what you said back in July here.
Essentially, we are deprecating this behavior because names
is a user-specified parameter, and passing in duplicate names deliberately only encourages buggy behavior.
Sorry for the slow response.
So maybe a more general question: is it our intention to once fix mangle_dupe_cols=True
? (as currently actually only the default False works). If we do, I see no good reason to disallow duplicates in passed names
vs names in the csv file.
@jorisvandenbossche : No worries! I think you meant the other way around. mangle_dupe_cols=True
is fully supported at this point. It is mangle_dupe_cols=False
that we disabled.
The reason for us discouraging duplicates in names
is because duplicates are generally more error-prone. Contrary to duplicates in a file, names
is a deliberate choice.
@jreback : Thoughts?
I think you meant the other way around
yes, of course ... :-)
Contrary to duplicates in a file, names is a deliberate choice.
That's true. But there are many other ways to deliberately make a dataframe with duplicate columns which we don't disallow anyway.
To be clear, in general I am all for a restricted scope of capabilities/possibilties. But in this case, limiting the abilities of name
does not actually reduce code complexity, but increases it. As the duplicates are already perfectly handled by the code, so we are introducing a special case. Therefore I was wondering whether this is actually needed (user will see that the names are mangled anyhow).
@jorisvandenbossche : True that we'll see them mangled anyhow, but why the need to add complexity to just handle them in the first place? I added the handling for duplicate names
in an earlier PR because it was a bug, not to enhance support.
If the user really wants to have duplicate names, they can set it themselves and reading in the file, but I don't know if we want to actively encourage setting duplicate names to a read-in DataFrame
.
Ah, I assumed that the mangling of names
or the names from the header of the file was done using the same code path? That's not the case ?
(to state it another way: once we remove the deprecation in this PR, we can actually remove more code than is added in this PR?)
See #17095 : it's a not a ridiculous amount of new logic that I added, but new logic nonetheless 😄 Also, see @jreback comment in that PR here
so the only reason we have mangle_dupe_columns
is to support duplicates in the first place. I think I stated we should deprecate that argument entirely, then I would allow duplicates in names
but show a UserWarning
if names
contains duplicates.
So pretty much allow what is happening today but with a UserWarning
and reducing the path complexity a bit (removing mangle_dupe_columes
).
Its not an error to have duplicates in names
but I guess can't disallow it entirely.
@jreback : I don't recall you saying this before. In addition, I think there has been user interest and not mangling in cases when the CSV file itself contains dupe names. That being said, if we think making the warning less harsh is a good idea, I can do that.
@jreback : I made it issue a UserWarning
instead. PTAL.
lgtm. you might want to note in the doc-string the same warning.
lgtm. you might want to note in the doc-string the same warning.
Sounds good. I'll quickly add that.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
lgtm modulo small comment
Check if the `names` parameter contains duplicates. |
---|
Currently, this function issues a warning if that is the case. In the |
future, we will raise an error. |
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
doc string needs updating
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good catch. Fixed.
thanks @gfyoung I think fine for now, we can always revisit if needed.
alanbato pushed a commit to alanbato/pandas that referenced this pull request
No-Stream pushed a commit to No-Stream/pandas that referenced this pull request