PERF: RangeIndex.format performance by topper-123 · Pull Request #35712 · pandas-dev/pandas (original) (raw)
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service andprivacy statement. We’ll occasionally send you account related emails.
Already on GitHub?Sign in to your account
Conversation26 Commits22 Checks0 Files changed
Conversation
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.Learn more about bidirectional Unicode characters
[ Show hidden characters]({{ revealButtonHref }})
#35440 dropped RangeIndex._format_with_header
, which was functionally not needed, but needed to avoid creating an internal ndarray.
This rectifies that + gives some perf. improvements.
idx = pd.RangeIndex(1_000_000) %timeit idx.format() 4.6 s ± 102 ms per loop # pandas v1.1.0 1.67 s ± 19.6 ms per loop # master 595 ms ± 2.35 ms per loop # this PR
Also, now the _data
attribute isn't called, so this PR gives a perf. & memory improvement in some use cases compared to master:
idx = pd.RangeIndex(1_000_000) idx.format() "_data" in idx._cache False # pandas v.1.1.0 True # master False # this PR
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
lgtm
Do we have benchmarks for this? If not do we want to add them?
@@ -187,6 +187,11 @@ def _format_data(self, name=None): |
---|
# we are formatting thru the attributes |
return None |
def _format_with_header(self, header, na_rep="NaN") -> List[str]: |
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can you type
looks good. this is worth a perf mention in 1.2 whatsnew. ping on green.
The TravisCI issue is just a timeout issue, but In CI / Web and docs
, I've got a problem under Check ipython directive errors
that I'm not able to fix:
Run ! grep -B1 "^<<<-------------------------------------------------------------------------$" sphinx.log
[0 rows x 1 columns]
<<<-------------------------------------------------------------------------
##[error]Process completed with exit code 1.
The source code for this in .github/workflows/ci.yml
says:
# This can be removed when the ipython directive fails when there are errors,
# including the `tee sphinx.log` in te previous step (https://github.com/ipython/ipython/issues/11547)
- name: Check ipython directive errors
run: "! grep -B1 \"^<<<-------------------------------------------------------------------------$\" sphinx.log"
I don't understand what this check is for or why my PR fails here. @TomAugspurger, you posted that mentioned ipython issue, do you mind looking at why my PR fails on this?
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
lgtm. I think this is ok for 1.1.2 as a perf regresion, can you move the note? ping on green.
topper-123 changed the title
PERF: RangeIndex.format perf regression PERF: RangeIndex.format performance
@@ -187,6 +187,15 @@ def _format_data(self, name=None): |
---|
# we are formatting thru the attributes |
return None |
def _format_with_header(self, header: List[str], na_rep: str = "NaN") -> List[str]: |
if len(self._range) == 0: |
return [] |
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
should this be return header
?
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes.
I added tests for this, and found that DateTimeIndex([]).format(name=True)
returns ["None"], where it should return [""]. Same for PeriodIndex. I fixed that bug.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
minor comment, if you can address and merge on green.
@@ -187,6 +187,15 @@ def _format_data(self, name=None): |
---|
# we are formatting thru the attributes |
return None |
def _format_with_header(self, header: List[str], na_rep: str = "NaN") -> List[str]: |
if len(self._range) == 0: |
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if not len(self._range)
That merge brancg master has caused some strange error. anyone know how to fix it?
Thanks, @simonjayhawkins, were there any git oneliners you used for this?
I don't profess to know what was going on. resetting HEAD, merging master and then pulling from the branch seemed to fix except for the duplicate release note.
minor comment, if you can address and merge on green.
comment addressed
Owee, I'm MrMeeseeks, Look at me.
There seem to be a conflict, please backport manually. Here are approximate instructions:
- Checkout backport branch and update it.
$ git checkout 1.1.x
$ git pull
- Cherry pick the first parent branch of the this PR on top of the older branch:
$ git cherry-pick -m1 b7a31ebeb873f37939838b3b788dab0e4c41b8de
- You will likely have some merge/cherry-pick conflict here, fix them and commit:
$ git commit -am 'Backport PR #35712: PERF: RangeIndex.format performance'
- Push to a named branch :
git push YOURFORK 1.1.x:auto-backport-of-pr-35712-on-1.1.x
- Create a PR against branch 1.1.x, I would have named this PR:
"Backport PR #35712 on branch 1.1.x"
And apply the correct labels and milestones.
Congratulation you did some good work ! Hopefully your backport PR will be tested by the continuous integration and merged soon!
If these instruction are inaccurate, feel free to suggest an improvement.
simonjayhawkins pushed a commit to simonjayhawkins/pandas that referenced this pull request
simonjayhawkins added a commit that referenced this pull request
Co-authored-by: Terji Petersen contribute@tensortable.com