Allow merging on object / non-object column by jorisvandenbossche · Pull Request #21681 · pandas-dev/pandas (original) (raw)
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service andprivacy statement. We’ll occasionally send you account related emails.
Already on GitHub?Sign in to your account
Conversation35 Commits10 Checks0 Files changed
Conversation
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.Learn more about bidirectional Unicode characters
[ Show hidden characters]({{ revealButtonHref }})
Follow-up on discussion in #21310 (cc @jreback )
What this does: allow merge
on an object and a non-object column (eg object and int, object and categorical, ..), by coercing the non-object column to object dtype.
Reasoning:
- object dtype can be anything, so IMO pandas should not try to be smart and think to know the merge is not valid
- all other, certainly invalid dtype combinations (eg int and datetime64) keep raising an error
The main disadvantage is that we no longer detect the case of merging object strings with eg integer column, but IMO we cannot avoid such cases as long as we don't have a proper string dtype but use object dtype for this (you can also mix string and ints in a column, we also don't raise for that for a similar reason).
Additional option would be to issue a warning in this case.
Closes #23733
-1 in this - if u want to merge an object column with a non object column then coerce to object
this will propagate objects columns very easily - against the principle of strongly dtyping things
what is the use case that is motivating this?
what is the use case that is motivating this?
The concrete case was a regression in the released version of geopandas because of the change in pandas to start raising errors on merging with unequal dtypes. This actual case was caused because we first merge an empty frame with another one, which creates object columns of NaN, and then merging the result yet with another, where the object column of NaN is merged on a float column (integers with NaNs), raising the error.
Yes, we can workaround this, by for example coercing both key columns to object as you mentioned (or coercing both to float). But that adds some complexity to the code, and given that I had it in geopandas could indicate that also other people might run into it.
But my main reason is what I said in the top post: object dtype can be anything => pandas should not make assumptions on what it thinks is valid/invalid for operating on such data. That's up to the user.
But given that merging by accident strings on integers (where the strings should have been coerced to ints before) silently gives a not useful result, we could raise a warning as well.
It also boils down a bit to that the original change to start erroring should ideally have been done with a deprecation warning instead of directly erroring, so it would not cause regression directly.
I disagree here. merging on non-alike dtypes is simply wrong. objects cannot merge with non-object, full stop. Sure we have a special case for boolean, but allowing object with integer is just a disaster. '
If you want to try a bit more inference I am ok with that. e.g. say you have an object column, then am ok with running infer_dtype
on it. and allowing that to merge with a similar type (e.g. integer). That is a trivial coercion.
objects cannot merge with non-object, full stop.
They can perfectly well, depending on what is inside the object dtype column. It's not about "cannot", it is about "do we allow it or not". And for the record, until one month ago pandas has always allowed it.
If you want to try a bit more inference I am ok with that.
I have thought about that as well, but inference can be costly, so I would rather not do it.
Also, it will not work in all cases, eg with an empty frame.
Are you OK with raising an informative warning instead? Where we clearly indicate to the user what is happening and this typically points to a problem with their types? (but at least it allow people to ignore the warning, or to fix it only later) (I thought you hinted on being OK with that in the previous PR, but not sure)
BTW, input from some other devs is certainly welcome, Jeff and I clearly disagree on this .. :-)
Are you OK with raising an informative warning instead? Where we clearly indicate to the user what is happening and this typically points to a problem with their types? (but at least it allow people to ignore the warning, or to fix it only later) (I thought you hinted on being OK with that in the previous PR, but not sure)
I guess that is ok. The original reason that this was changed is to avoid the all-NaN problem (e.g. you merge with something that has no matches because of the dtype mismatch). I don't want to go back to that as its worse from a user POV.
open to compromise but would rather be strict here. (and inference is not generally expensive compared to a borked merge :>)
I'm +1 on being strict. At the very least if the end user is explicit about types they stand to benefit performance- and resource-wise instead of pandas falling back to object
.
(and inference is not generally expensive compared to a borked merge :>)
yeah, that is true .. :)
Thinking this a bit more through, then the main problem seems to how to handle the result. For example then need to add logic that eg 'integer' and 'mixed-integer-float' might be OK, which is repeating the logic (and also need to special case the 'empty').
Of course, I could also do the full existing check based on the inferred types hierarchy (to avoid such duplicated logic), assuming that for non-object dtypes this inferring of the type is cheap.
The original reason that this was changed is to avoid the all-NaN problem .. I don't want to go back to that as its worse from a user POV.
I just realized that I actually think a warning can be better from a user POV. Instead of the error, you would then see the NaNs, together with the informative warning explaining why the NaNs are there and how to solve it. I think this could actually be better to get what is happening than only the error.
And this also alleviates the direct code breakages caused by the change in 0.23.0 to be more strict.
At the very least if the end user is explicit about types they stand to benefit performance- and resource-wise instead of pandas falling back to object.
Yes, but note that there can be good reasons to want or to end up with object data, and not all use cases of pandas are performance sensitive (eg small data).
@@ -946,11 +947,14 @@ def _maybe_coerce_merge_keys(self): |
---|
"you should use pd.concat".format(lk_dtype=lk.dtype, |
rk_dtype=rk.dtype)) |
coerce_to_object = False |
if is_object_dtype(lk) or is_object_dtype(rk): |
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is this actually necessary here? e.g. it would hit the else clause if this is true as well.
Then can just do the conversions in the else. (e.g. stuff on line 1011)
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Without this we hit
956 elif ((is_numeric_dtype(lk) and not is_bool_dtype(lk))
957 and not is_numeric_dtype(rk)):
--> 958 raise ValueError(msg)
and raise.
I've had colleages come and complain to me about this recently - especially since this used to work.
I would advocate to be strict, but based on the inferred type. The case where it's an all-integer index hiding behind an object dtype would be easily solved through that, and the other cases would continue to emit the warning.
Hello @jorisvandenbossche! Thanks for updating the PR.
Cheers ! There are no PEP8 issues in this Pull Request. 🍻
Comment last updated on December 27, 2018 at 12:56 Hours UTC
Fixed the merge conflicts. Haven't taken a look at the changes yet.
I pushed a commit which fixes this up in a reasonably clean way. if folks can look at the cases and see if anything glaring pops out. We have a few odd cases already, e.g. bools are inferred if possible and we actually allow stringlike non-unicode strings to infer, but these worked before.
cc @pandas-dev/pandas-core
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM.
May leave this open for a day in case @jorisvandenbossche is around to look tomorrow, but I think this is good to go.
@jreback So basically you reverted to allow again all combinations of an object dtype with a non-object dtype?
yes anything with object just forces conversion to object
we
could be more specific and disallow strings vs non strings
added a commit to be more restrictive.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
@jreback and thanks for finishing the PR!
One thing I note now, after your last comment to be a bit more strict on merging object and numeric columns, the whatsnew note is not fully correct any more.
I am also wondering a bit how costly this inferring of the type for object columns is.
One thing I note now, after your last comment to be a bit more strict on merging object and numeric columns, the whatsnew note is not fully correct any more.
yeah ok we are a bit more strict so whatsnew could be sligthly better.
inference is << cost of actually merging things, its a no-brainer and we do it all over the place.
Pingviinituutti pushed a commit to Pingviinituutti/pandas that referenced this pull request
Pingviinituutti pushed a commit to Pingviinituutti/pandas that referenced this pull request
I am running into this issue and found this thread. Was this change pushed to pandas or can we not merge object/non-object type.
Here is my code:
res_leads = pd.merge(df_temp,
df_market[['Tenant_Industry','Tenant','num_employees','size_calc']],
left_on = df_temp.index,
right_on = 'Tenant',
how='left')
Error traceback:
Traceback (most recent call last):
File "/Users/....../market.py", line 267, in render_table
result = top_leads(company, market)
File "/Users/...../CRE/Platform/Deploy/return_leads.py", line 77, in top_leads
how='left')
File "/anaconda3/lib/python3.7/site-packages/pandas/core/reshape/merge.py", line 61, in merge
validate=validate)
File "/anaconda3/lib/python3.7/site-packages/pandas/core/reshape/merge.py", line 555, in __init__
self._maybe_coerce_merge_keys()
File "/anaconda3/lib/python3.7/site-packages/pandas/core/reshape/merge.py", line 983, in _maybe_coerce_merge_keys
raise ValueError(msg)
ValueError: You are trying to merge on int64 and object columns. If you wish to proceed you should use pd.concat