DEPR: Deprecate convert parameter in take by gfyoung · Pull Request #17352 · pandas-dev/pandas (original) (raw)

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service andprivacy statement. We’ll occasionally send you account related emails.

Already on GitHub?Sign in to your account

Conversation66 Commits1 Checks0 Files changed

Conversation

This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.Learn more about bidirectional Unicode characters

[ Show hidden characters]({{ revealButtonHref }})

return self.data.take(self.sort_idx, axis=self.axis, convert=False)
with warnings.catch_warnings():
warnings.simplefilter("ignore")
return self.data.take(self.sort_idx, axis=self.axis, convert=False)

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jreback @jorisvandenbossche : This is a tricky one for me here. On the one hand, I want to remove convert=False, but because axis of the groupby object might be invalid when calling Series.take, this is not possible for me to do (when True, it re-indexes the value according to the length of the slice along the index).

On the other hand, this catch_warnings(), while an easy way to get things to pass for now, is not a solution I want to merge. Suggestions on how to handle this problem (or is it best that we just expose convert for all relevant implementations, though I'm not a fan).

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually, might have found the answer myself 😄

@@ -3354,7 +3354,7 @@ def dropna(self, axis=0, how='any', thresh=None, subset=None,
else:
raise TypeError('must specify how or thresh')

result = self.take(mask.nonzero()[0], axis=axis, convert=False)

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think these internal uses should be removed. We're guaranteed the indexer is inbounds, so it's an unnecessary performance hit to check it.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What do you mean by this? DataFrame.take doesn't even respect the convert parameter. Thus, this will have no effect on anything. 😄

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, this example doesn't matter b/c it's on a frame, but does with the usages of _data.take

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

And Series.take

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think there's a bounds check anymore in Series.take or _data.take because I removed all of them for that same reason. We have so many bounds checks all over the place that removing just one of them is okay now.

if kwargs:
nv.validate_take(tuple(), kwargs)

# check/convert indicies here
if convert:
indices = maybe_convert_indices(indices, len(self._get_axis(axis)))

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure this is safe to remove - with non numpy types, we call to an internal take impl that doesn't bounds check. Can you try pd.Series([1,2,3], dtype='category').take([-1, -2]) on your branch?

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Okay, so that breaks:

2 NaN 1 1.0 dtype: category Categories (3, int64): [1, 2, 3]

That being said, that call works without dtype='category', meaning we do handle negative indices correctly in some cases already. I might as well add that handling for category too.

So I originally added this because I wanted the ability to a) convert neg to positive, and b) bounds check; these are not the default because of perf (IOW internally we often know that we don't need either of these things, but from a public perspetive you do need this).

These shouldn't have been on the public NDFrame though (so good to remove). So the question is, do we need a ._take() to do these things? (and .take() would just be a wrapper)?

I'd be in favor of a _take with options to be unsafe. It's not like boundschecking is that expensive, but it's such a core operation I hate to give up any performance.

I suppose that in the name of performance, that's fair. I think I would just need to implement this solely for DataFrame and Series right?

Yes, that's correct. BlockManager and SingleBlockManager will need to keep their take implementation, but I would keep it unprefixed since they are already internal.

@chris-b1 : Sorry, somehow your answer only confused me more. Yes or no, do I need to implement _take for just Series and DataFrame ?

@chris-b1 : Alright, here goes. Let's see what CI has to say about my implementation.

@@ -2033,7 +2033,7 @@ def _ixs(self, i, axis=0):
return self.loc[:, lab_slice]
else:
if isinstance(label, Index):
return self.take(i, axis=1, convert=True)
return self.take(i, axis=1)

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

every reference to an internal take should be to ._take (this is pretty much all of them)

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this really needed?
Can't we keep take and only use _take where it is needed? IMO it just makes it more complex to understand the code.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agree with @jorisvandenbossche . I'm not sure I really understand what you mean by "reference to an internal take" since the original references are to a public function.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you have now 2 functions to do the same thing , a public and private version

it is much more clear to simply use _take internally in all cases

the public function is just an interface

otherwise a future reader will not know which to use - recipe for confusion

pls change all uses to _take

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There will be confusion anyhow, because then people will wonder what the difference is with pubkic take, and in majority of the cases, there is no difference. IMO that is more clear by using the public function.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is AFAIK only one class where negative indexing isn't handled properly, and that is Categorical. Makes we wonder whether we should revisit the option of adding supporting given this disagreement about design though (@chris-b1 thoughts?)

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let me boil it down to this (in reference to my previous comment): besides potential test failures, is there reason why we call take_1d in pandas.core.categorical instead of just calling self._codes.take(...) ?

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

pls change all references to _take

.take is ONLY a wrapper function and should not be used anywhere in the code base
this eliminates any possibility of confusion and is inline with the existing style
mixing usages is cause for trouble

your other references are simply not relevant here and are again purely internal uses (take_1d)
if u want to bring them up in another issue by all means

let's keep the scope to changing refs to the NDFrame .take changes

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If there were multiple places where I had to use ._take(), then I think I would be more inclined to agree with you @jreback . However, at this point, there is only one place where this had to happen, and if necessary, I can easily "clear up" any confusion by adding a comment.

IMO one usage isn't convincing enough for me to convert all of other invocations (both in NDFrame and by any other pandas class to use ._take()). I think I still have to side with @jorisvandenbossche on this one.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@gfyoung if this were a new public function, and you wrote a private impl _take as well. Then you certainly would use the private impl everywhere. I don't see a difference here. Pls make the change.

@@ -2058,6 +2058,25 @@ def __delitem__(self, key):
except KeyError:
pass

def _take(self, indices, axis=0, convert=True, is_copy=False):
"""
The internal version of `self.take` that will house the `convert`

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

more descriptive doc-string with parameters and such

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yep, done.

result._set_is_copy(self)

return result

def take(self, indices, axis=0, convert=True, is_copy=True, **kwargs):

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

also deprecate is_copy; that is another internal parameter; its only necessary for _take

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's a fair point. If you don't mind, I'll do this in a subsequent PR once the infrastructure for _take has been merged in.


See also
--------
numpy.ndarray.take

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why do we have this? now that it is generic is shouldn't be needed

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Have what exactly? You commented on a not-so clear part of the diff.

@@ -604,14 +604,38 @@ def sparse_reindex(self, new_index):

def take(self, indices, axis=0, convert=True, args, *kwargs):
"""
Sparse-compatible version of ndarray.take

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

use a shared doc string for all public .take()

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.


self._consolidate_inplace()
new_data = self._data.take(indices,
axis=self._get_block_manager_axis(axis),

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

pass through convert

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

so this still needs to change virtually all uses of .take() -> ._take(). We want to limit the exposure to user-facing .take() in any code that is not explicitly user facing.

if convert:
indices = maybe_convert_indices(indices, len(self._get_axis(axis)))

indices = _ensure_platform_int(indices)
new_index = self.index.take(indices)
new_values = self._values.take(indices)
return (self._constructor(new_values, index=new_index, fastpath=True)
.__finalize__(self))

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

any reason you can't just call the super method here?

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For ._constructor ? I was trying to preserve existing (working) code as much as possible when making these changes.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

no the entire routine
give it s try

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok, not really sure why it shoulnt' work. its almost the same code

xref pandas-devgh-16948. The parameter is not respected, nor is it a parameter in many 'take' implementations.

@jreback : I've done all of the converting (without breaking tests). PTAL.

ghost pushed a commit to reef-technologies/pandas that referenced this pull request

Oct 2, 2017

'master' of github.com:pandas-dev/pandas: (188 commits) Separate out _convert_datetime_to_tsobject (pandas-dev#17715) DOC: remove whatsnew note for xref pandas-dev#17131 BUG: Regression in .loc accepting a boolean Index as an indexer (pandas-dev#17738) DEPR: Deprecate cdate_range and merge into bdate_range (pandas-dev#17691) CLN: replace %s syntax with .format in pandas.core: categorical, common, config, config_init (pandas-dev#17735) Fixed the memory usage explanation of categorical in gotchas from O(nm) to O(n+m) (pandas-dev#17736) TST: add backward compat for offset testing for pickles (pandas-dev#17733) remove unused time conversion funcs (pandas-dev#17711) DEPR: Deprecate convert parameter in take (pandas-dev#17352) BUG:Time Grouper bug fix when applied for list groupers (pandas-dev#17587) BUG: Fix some PeriodIndex resampling issues (pandas-dev#16153) BUG: Fix unexpected sort in groupby (pandas-dev#17621) DOC: Fixed typo in documentation for 'pandas.DataFrame.replace' (pandas-dev#17731) BUG: Fix series rename called with str altering name rather index (GH17407) (pandas-dev#17654) DOC: Add examples for MultiIndex.get_locs + cleanups (pandas-dev#17675) Doc improvements for IntervalIndex and Interval (pandas-dev#17714) BUG: DataFrame sort_values and multiple "by" columns fails to order NaT correctly Last of the timezones funcs (pandas-dev#17669) Add missing file to _pyxfiles, delete commented-out (pandas-dev#17712) update imports of DateParseError, remove unused imports from tslib (pandas-dev#17713) ...

pandas/tests/sparse/test_series.py::TestSparseSeries::()::test_numpy_take
/home/travis/miniconda3/envs/pandas/lib/python3.5/site-packages/numpy/core/fromnumeric.py:123: FutureWarning: The 'convert' parameter is deprecated and will be removed in a future version.
return take(indices, axis, out, mode)

coming up on the 3.5 build of master

seems we r not suppressing somewhere (or prob needs to change to use _take maybe)

can u take care of @gfyoung

@jreback : The cause of the warning is that an older version of numpy passes in None to convert by default when calling np.take (see here).

Thus, this is a compatibility issue, as it disappears in later versions of numpy (I actually worked on fixing fromnumeric compatibility issues at some point).

I'll put a check in the test to expect that warning for older versions of numpy, as checking if not convert is pretty idiomatic, and I wouldn't want to change it just because an older version of numpy is issuing a warning when we call .take() in a non-recommended way.

gfyoung added a commit to forking-repos/pandas that referenced this pull request

Oct 3, 2017

gfyoung added a commit to forking-repos/pandas that referenced this pull request

Oct 3, 2017

jreback pushed a commit that referenced this pull request

Oct 3, 2017

ghost pushed a commit to reef-technologies/pandas that referenced this pull request

Oct 16, 2017

alanbato pushed a commit to alanbato/pandas that referenced this pull request

Nov 10, 2017

xref pandas-devgh-16948. The parameter is not respected, nor is it a parameter in many 'take' implementations.

alanbato pushed a commit to alanbato/pandas that referenced this pull request

Nov 10, 2017

No-Stream pushed a commit to No-Stream/pandas that referenced this pull request

Nov 28, 2017

xref pandas-devgh-16948. The parameter is not respected, nor is it a parameter in many 'take' implementations.

No-Stream pushed a commit to No-Stream/pandas that referenced this pull request

Nov 28, 2017