pandas (original) (raw)

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service andprivacy statement. We’ll occasionally send you account related emails.

Already on GitHub?Sign in to your account

Conversation393 Commits64 Checks0 Files changed

Conversation

This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.Learn more about bidirectional Unicode characters

[ Show hidden characters]({{ revealButtonHref }})

closes Add key to sorting functions #3942
tests added / passed
passes git diff upstream/master -u -- "*.py" | flake8 --diff
whatsnew entry

Added a key parameter to the DataFrame/Series.sort_values and sort_index functions matching Python sorted semantics for allowing custom sorting orders. Address open issue #3942.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice PR! Can you add support for Index to go with DataFrame / Series as well?

From typing import Callable at the top of the module then just Callable for the annotation

…

Sent from my iPhone

On Jul 6, 2019, at 10:41 AM, Jacob Austin ***@***.***> wrote: @ja3067 commented on this pull request. In pandas/core/frame.py: > @@ -4977,6 +4977,7 @@ def sort_values( inplace=False, kind="quicksort", na_position="last", + key=None Would you just like key : typing.Callable? — You are receiving this because you commented. Reply to this email directly, view it on GitHub, or mute the thread.

Hello @jacobaustin123! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:

There are currently no PEP 8 issues detected in this Pull Request. Cheers! 🍻

Comment last updated at 2020-04-27 02:12:02 UTC

Just pushed a new commit that adds support for Index.sort_values(key=...) and adds a key flag to nargsort(...) and lexsort_indexer(...). I've also added static type hints. I still use Index.map to apply the key function for sort_index functions, because it seems simpler and semantically better to apply the key to the values and not the codes.

@ja3067 is this still active?

I will make the requested changes this weekend. The only decision I think needs to be made is how sort_values(key=...) and sort_index(key=...) should behave for multiple levels or columns. Should the same function be applied to every level/column, or should the key function be expected to operate on a tuple? Likewise, how should the key behave for sort_remaining? Should it be applied to everything?

The only decision I think needs to be made is how sort_values(key=...) and sort_index(key=...) should behave for multiple levels or columns.

Without a key is just operates on the levels individually, no? If not mistaken something should be the same here

Without a key is just operates on the levels individually, no? If not mistaken something should be the same here

@WillAyd Agreed. In the case, say, of one numeric column and one string column, should we for instance support a dictionary of {column_name : Callable} so that different keys can be passed, or the key can be passed only to the primary column. Same question for sort_index.

I would leave that enhancement to a separate PR if requested

@ja3067 is this still active?

@WillAyd I'll make the necessary updates this weekend. Sorry for the delay.

@WillAyd Ok. The final PR is almost done. Last question is about the key for sortindex for a MultiIndex when level is specified. There are two options:

the key function is applied to the entire MultiIndex tuple like the map function does currently. You could do something like key=lambda x : x[0].lower()
the key function is applied to each level in the MultiIindex separately. The easiest way to do this would be to add a method to the MultiIndex called idx.maplevel(self, mapper, level=0) that applies a map function to a set of levels. However, this would probably require modifying the Index API. Would this be OK/worthwhile? This would be a little less flexible, but maybe more reasonable.

What do you think?

@WillAyd I've incorporated the review changes, and it passes tests now. Should be good.

Can you merge master and repush? Not sure what 37 error was may be resolved

@jorisvandenbossche the only big change is how keys are applied to MultiIindex objects. Keys are vectorized, and they're applied per column to DataFrames and now per level to MultiIndex. So if you do df.sort_index(key=...) the key is passed each level of the index separately. Otherwise things are the same.

Thanks, that sounds good!

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Few small nitpicks. For the rest looks good!

We've added a ``key`` argument to the DataFrame and Series sorting methods, including
:meth:`DataFrame.sort_values`, :meth:`DataFrame.sort_index`, :meth:`Series.sort_values`,
and :meth:`Series.sort_index`. The ``key`` can be any callable function which is applied
to the each column of a DataFrame before sorting is performed (:issue:`27237`). See

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

to the each column of a DataFrame before sorting is performed (:issue:`27237`). See
to each column of a DataFrame before sorting is performed (:issue:`27237`). See

Apart from the typo, I find "each column" a bit confusing, as it is of course only applied to those columns that are used for sorting?

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See if the new version is clearer.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

New version looks good!

@WillAyd @jorisvandenbossche @jreback This is badly timed since I know we want to finish this, but I made some changes today that add support for dictionary keys, i.e.

df = DataFrame({0: ["Hello", "goodbye"], 1: [0, 1]}) result = df.sort_values([0, 1], key={0: lambda col: col.str.lower()})

So you can specify a different key function for different levels/columns. I have a commit ready I can push now, but if people prefer, we can finalize this and have that as a separate commit, although as part of this addition I refactored the code significantly so everything is pulled into ensure_key_mapped. @jreback might prefer that, since it means there's no more key in lexsort_indexer or nargsort. I just didn't want to push it if people wanted to merge this and be done.

let’s not add anything more

_as = self.argsort()
if not ascending:
_as = _as[::-1]
sorted_index = self.take(_as)
return sorted_index, _as
else:

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jreback do you know what this second branch is here for? I'm tempted to make the argsort method the default since it's necessary for key sorting, unless it's an important optimization.

if axis == 0:
new_df = DataFrame._from_arrays(new_levels, df.columns, df.index)
else:
new_df = DataFrame._from_arrays(new_levels, df.index, df.columns).transpose()

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there a way to do this (construct a DataFrame along either row or column axis) that's better than transposing it like this?

@jreback Ok. I just reverted those changes but kept some of the refactoring so e.g. lexort_indexer no longer has a key function, and ensure_key_mapped handles all the logic for different datatypes. There are two questions I point out in a review above. Otherwise I'm happier with this version and I think it achieves your goal for having key mapping totally encapsulated in ensure_key_mapped.

Dict keys can be an easy PR on top.

@jreback gentle bump. would love to finish this up.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jacobaustin123 I would really like to merge this. but you keep changing an enormous amount of code. Please got back to when I make comments I think 10 or 12 days ago.

changed all of the ensure_key_mapped again, this is a mess. Please just go back to the much simpler version.

@jreback ok. I can force revert to the previous version, or just address your new comments here. Let me know what you prefer. The motivation for this was to do as you asked and move all logic to ensure_key_mapped.

@jreback ok. I can force revert to the previous version, or just address your new comments here. Let me know what you prefer. The motivation for this was to do as you asked and move all logic to ensure_key_mapped.

well I'ld lke to make the doc changes I have above, and not have 3 different ensure_key_mapped functions. you had 1 or maybe 2 before. Please simplify. I am happy to take changes as a followup, but this is too much now.

@jreback. Understood. I'll address doc comments and revert.

@jreback. Understood. I'll address doc comments and revert.

thanks.

this is just a very large PR and hard to grok what has changed.

@jreback ok I just reverted to the prior version and updated documentation. I had to force push, but I just reverted to the commit before the major changes and then added two commits on top.

@@ -2986,6 +3039,9 @@ def sort_values(
)

def _try_kind_sort(arr):
arr = ensure_key_mapped(arr, key)
arr = getattr(arr, "_values", arr)

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

in followon you can use extract_array here

@@ -204,15 +218,14 @@ def lexsort_indexer(keys, orders=None, na_position: str = "last"):
elif orders is None:
orders = [True] * len(keys)

for key, order in zip(keys, orders):

for k, order in zip(keys, orders):

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you do this

We've added a ``key`` argument to the DataFrame and Series sorting methods, including
:meth:`DataFrame.sort_values`, :meth:`DataFrame.sort_index`, :meth:`Series.sort_values`,
and :meth:`Series.sort_index`. The ``key`` can be any callable function which is applied
column-by-column to each column used for sorting, before sorting is performed (:issue:`27237`).

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you also add the orginal issue number here (or rather replace this issue number )

thanks @jacobaustin123

3 followon comments if you would. Also happy to take a refactoring PR in sorting generally. Targeted PRs are best.

@jreback awesome thank you! I'll address these and maybe do a small refactoring PR. And maybe a second PR to add dictionary key sorting.

rhshadrach pushed a commit to rhshadrach/pandas that referenced this pull request

May 10, 2020