Proposal to change behaviour with .loc and missing keys · Issue #15747 · pandas-dev/pandas

In [2]: pd.Series([1, 2, 3]).loc[[2, 3]]
Out[2]:
2    3.0
3    NaN
dtype: float64

In [3]: pd.Series([1, 2, 3]).loc[[3]]
[...]
KeyError: 'None of [[3]] are in the [index]'

Problem description

Although consistent (except for some unfortunate side effects, some of them listed below) with the docs, which say "At least 1 of the labels for which you ask, must be in the index or a KeyError will be raised!", the current behavior is, I claim, a terrible choice for both developers and users.

There are (at least) three ways to behave with missing labels:

1. you raise an error if at least one requested label is missing
2. you raise an error only if all requested labels are missing
   2a. ... while if at least one label is present, missing labels become NaN (current)
   2b. ... while if at least one label is present, missing labels are silently dropped
3. you never raise an error for missing labels
   3a. ... and they become NaN
   3b. ... and they are silently dropped
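
To make the options concrete, here is a minimal sketch that models them on a plain dict rather than on a pandas object. The helper `select` and its policy names are hypothetical, chosen for illustration only; this is not pandas API and not a description of how `.loc` is implemented.

```python
import math


def select(data, labels, policy):
    """Look up `labels` in the dict `data` under one of the options above.

    Hypothetical helper for illustration only; not pandas API.
    `policy` is one of "1", "2a", "2b", "3a", "3b".
    """
    missing = [lab for lab in labels if lab not in data]

    # Option 1: any missing label is an error.
    if policy == "1" and missing:
        raise KeyError(missing)

    # Option 2: an error only when every requested label is missing.
    # An empty request is treated as non-failing here (a modelling choice).
    if policy in ("2a", "2b") and labels and len(missing) == len(labels):
        raise KeyError(missing)

    # The "a" variants fill missing labels with NaN, the "b" variants drop them.
    fill = policy in ("2a", "3a")
    result = []
    for lab in labels:
        if lab in data:
            result.append((lab, data[lab]))
        elif fill:
            result.append((lab, math.nan))
    return result


s = {0: 1, 1: 2, 2: 3}
print(select(s, [2, 3], "2a"))  # [(2, 3), (3, nan)]
print(select(s, [2, 3], "3b"))  # [(2, 3)]
```

With this model, `select(s, [3], "2a")` raises a `KeyError`, mirroring the second example at the top of the issue, while `select(s, [3], "3b")` returns an empty list.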

For developers

Options 1. and 3. are both much easier to implement, because in both cases you can break the question "am I going to get an error?" into smaller pieces: e.g. when indexing a MultiIndex, you will get an error if you get an error on any of its levels. Option 2. is instead more complicated (and computationally expensive), because you first need to aggregate in some way across levels/axes, and only then can you decide whether to raise an error or not. Several inconsistencies came out as a consequence of this choice, some of them still unsolved, such as #15452, this, the fact that pd.Series(range(2)).loc[[]] does not raise, and the fact that pd.DataFrame.ix[[missing_label]] doesn't either.
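
The decomposition argument can be sketched on a toy two-level index stored as a list of tuples. This is a plain-Python model under one possible reading of option 2 (raise only when the combined selection is empty); it is not how pandas implements MultiIndex lookups, and `raises_option1` and `raises_option2` are hypothetical names.

```python
def raises_option1(index, requested):
    """Option 1: raise if any requested label is absent from its level.

    `index` is a list of tuples; `requested` holds one list of labels per level.
    Each level can be checked on its own, independently of the others.
    """
    for level, labels in enumerate(requested):
        present = {entry[level] for entry in index}
        if any(lab not in present for lab in labels):
            return True
    return False


def raises_option2(index, requested):
    """Option 2 (one possible reading): raise only if nothing is selected.

    The decision needs the combined selection across all levels, so it
    cannot be made level by level.
    """
    selected = [
        entry
        for entry in index
        if all(entry[level] in labels for level, labels in enumerate(requested))
    ]
    return not selected


index = [("a", 1), ("b", 2)]

# "z" is missing on level 0, but ("a", 1) is still selected.
print(raises_option1(index, [["a", "z"], [1]]))  # True
print(raises_option2(index, [["a", "z"], [1]]))  # False

# Every label exists on its own level, yet no entry matches the combination.
print(raises_option1(index, [["a"], [2]]))  # False
print(raises_option2(index, [["a"], [2]]))  # True
```

In the second request no level is missing a label when looked at in isolation, so only the combined check can tell that the selection is empty.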

Other consequences of 2.

Additionally, it was decided that the behavior with missing labels would be to introduce NaNs (rather than to drop them), and I think this was also not a good choice (and indeed partial indexing of a MultiIndex does not behave this way, and it couldn't). I think it is also undocumented.

And finally, since the above wouldn't always tell you what to do when there are missing labels in a MultiIndex, it was decided that .loc would rather behave as .reindex when there are missing and incomplete labels, which is totally unexpected and, I think, undocumented.

Notice that these further issues (and, more generally, the question "what to do when some labels are missing and you are not raising an error") would partially still hold with 3., but could be dealt with, I think, more elegantly.

For users

I think the current behavior is annoying to users not just because of those "Other consequences", but also because it is more complicated to describe in terms of set operations on labels/indices. For instance, with options 1. and 3.

pd.concat([chunk.loc[something] for chunk in chunks])

and

pd.concat(chunks).loc[something]

both return the same result (or raise). Instead with 2. it actually depends on how missing labels are distributed across chunks.
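
The dependence on how labels are spread across chunks can be illustrated with a small plain-Python model. It covers only the "drop" variants 2b and 3b, uses dicts in place of Series, and assumes labels are unique across chunks; all function names are hypothetical.

```python
def select_3b(data, labels):
    """Option 3b: never raise, silently drop missing labels."""
    return {lab: data[lab] for lab in labels if lab in data}


def select_2b(data, labels):
    """Option 2b: drop missing labels, but raise if none is present."""
    found = select_3b(data, labels)
    if labels and not found:
        raise KeyError(labels)
    return found


def per_chunk(select, chunks, labels):
    """Select from each chunk, then merge (the first pattern above)."""
    merged = {}
    for chunk in chunks:
        merged.update(select(chunk, labels))
    return merged


def on_concat(select, chunks, labels):
    """Merge all chunks, then select once (the second pattern above)."""
    combined = {}
    for chunk in chunks:
        combined.update(chunk)
    return select(combined, labels)


chunks = [{"a": 1}, {"b": 2}]
labels = ["a", "c"]

# 3b: both patterns agree, however the labels are spread over the chunks.
assert per_chunk(select_3b, chunks, labels) == on_concat(select_3b, chunks, labels)

# 2b: the second chunk holds none of the requested labels, so the per-chunk
# pattern raises, while selecting on the merged data succeeds.
print(on_concat(select_2b, chunks, labels))  # {'a': 1}
try:
    per_chunk(select_2b, chunks, labels)
except KeyError as err:
    print("per-chunk selection raised:", err)
```

If the same data arrives as a single chunk `{"a": 1, "b": 2}`, the per-chunk pattern under 2b succeeds as well, so the outcome depends on the chunking rather than on the data alone.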

(Why this?)

It is worth understanding why 2. was picked in the first place, and I think the answer is "to be consistent with intervals". But I think it's not worth the damage: after all, an iterable and an interval are different objects. Moreover, introducing NaNs for missing labels is in any case inconsistent with intervals.

Backward incompatibility

Option 1. is, I think, the best, because it is also coherent with numpy's behavior with out-of-bounds indices (e.g. np.array([[1,2], [3,4]])[0,[1,3]] raises an IndexError).

But while clearly both 1. and 3. could break some existing code, 3. would be better from this point of view, in the sense that it would break only code assuming that an error is raised. Although one might even claim that 1., by breaking code which looks for missing labels, can help discover bugs in user code (not a great argument, I know).

So overall I am not sure which we should pick between 1. and 3. But I really think we should move away from 2., and that the later this is done, the worse. @jreback, @jorisvandenbossche if you want to tell me your thoughts about this, I can elaborate on what we could do with the "Other consequences" in the desired option.

Then if you approve the change, I'm willing to help in implementing it.