PERF: RangeIndex.format performance by topper-123 · Pull Request #35712 · pandas-dev/pandas (original) (raw)

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service andprivacy statement. We’ll occasionally send you account related emails.

Already on GitHub?Sign in to your account

Conversation26 Commits22 Checks0 Files changed

Conversation

This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.Learn more about bidirectional Unicode characters

[ Show hidden characters]({{ revealButtonHref }})

#35440 dropped RangeIndex._format_with_header, which was functionally not needed, but needed to avoid creating an internal ndarray.

This rectifies that + gives some perf. improvements.

idx = pd.RangeIndex(1_000_000) %timeit idx.format() 4.6 s ± 102 ms per loop # pandas v1.1.0 1.67 s ± 19.6 ms per loop # master 595 ms ± 2.35 ms per loop # this PR

Also, now the _data attribute isn't called, so this PR gives a perf. & memory improvement in some use cases compared to master:

idx = pd.RangeIndex(1_000_000) idx.format() "_data" in idx._cache False # pandas v.1.1.0 True # master False # this PR

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm

Do we have benchmarks for this? If not do we want to add them?

@@ -187,6 +187,11 @@ def _format_data(self, name=None):
# we are formatting thru the attributes
return None

def _format_with_header(self, header, na_rep="NaN") -> List[str]:

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you type

looks good. this is worth a perf mention in 1.2 whatsnew. ping on green.

The TravisCI issue is just a timeout issue, but In CI / Web and docs, I've got a problem under Check ipython directive errors that I'm not able to fix:

Run ! grep -B1 "^<<<-------------------------------------------------------------------------$" sphinx.log
[0 rows x 1 columns]
<<<-------------------------------------------------------------------------
##[error]Process completed with exit code 1.

The source code for this in .github/workflows/ci.yml says:

# This can be removed when the ipython directive fails when there are errors,
# including the `tee sphinx.log` in te previous step (https://github.com/ipython/ipython/issues/11547)
- name: Check ipython directive errors
  run: "! grep -B1 \"^<<<-------------------------------------------------------------------------$\" sphinx.log"

I don't understand what this check is for or why my PR fails here. @TomAugspurger, you posted that mentioned ipython issue, do you mind looking at why my PR fails on this?

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm. I think this is ok for 1.1.2 as a perf regresion, can you move the note? ping on green.

topper-123 changed the title~~PERF: RangeIndex.format perf regression~~ PERF: RangeIndex.format performance

Aug 22, 2020

@@ -187,6 +187,15 @@ def _format_data(self, name=None):
# we are formatting thru the attributes
return None

def _format_with_header(self, header: List[str], na_rep: str = "NaN") -> List[str]:
if len(self._range) == 0:
return []

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should this be return header?

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes.

I added tests for this, and found that DateTimeIndex([]).format(name=True) returns ["None"], where it should return [""]. Same for PeriodIndex. I fixed that bug.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

minor comment, if you can address and merge on green.

@@ -187,6 +187,15 @@ def _format_data(self, name=None):
# we are formatting thru the attributes
return None

def _format_with_header(self, header: List[str], na_rep: str = "NaN") -> List[str]:
if len(self._range) == 0:

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if not len(self._range)

That merge brancg master has caused some strange error. anyone know how to fix it?

Thanks, @simonjayhawkins, were there any git oneliners you used for this?

I don't profess to know what was going on. resetting HEAD, merging master and then pulling from the branch seemed to fix except for the duplicate release note.

minor comment, if you can address and merge on green.

comment addressed

Owee, I'm MrMeeseeks, Look at me.

There seem to be a conflict, please backport manually. Here are approximate instructions:

Checkout backport branch and update it.

$ git checkout 1.1.x
$ git pull

Cherry pick the first parent branch of the this PR on top of the older branch:

$ git cherry-pick -m1 b7a31ebeb873f37939838b3b788dab0e4c41b8de

You will likely have some merge/cherry-pick conflict here, fix them and commit:

$ git commit -am 'Backport PR #35712: PERF: RangeIndex.format performance'

Push to a named branch :

git push YOURFORK 1.1.x:auto-backport-of-pr-35712-on-1.1.x

Create a PR against branch 1.1.x, I would have named this PR:

"Backport PR #35712 on branch 1.1.x"

And apply the correct labels and milestones.

Congratulation you did some good work ! Hopefully your backport PR will be tested by the continuous integration and merged soon!

If these instruction are inaccurate, feel free to suggest an improvement.

simonjayhawkins pushed a commit to simonjayhawkins/pandas that referenced this pull request

Aug 26, 2020

simonjayhawkins added a commit that referenced this pull request

Aug 26, 2020

Co-authored-by: Terji Petersen contribute@tensortable.com