Add /benchmark github command to comparison benchmark between base and pr commit by gruuya · Pull Request #9461 · apache/datafusion



Conversation


gruuya

Which issue does this PR close?

Partially progresses #5504.

Rationale for this change

Enable automated comparative PR benchmarking, making it less likely that performance regressions are introduced into the code.

What changes are included in this PR?

Add two new workflows:

  1. A Benchmarks workflow, which is triggered by a `/benchmark` PR comment and runs the standard repo benchmarks against the base and head commits (a simplified sketch of this workflow follows the example output below).
  2. A generic PR Comment workflow, which in this case should post a message along the lines of

Benchmark results

Benchmarks comparing e5404a1 and 3bf178a

Comparing main-e5404a1 and ci-benches-3bf178a
--------------------
Benchmark tpch.json
--------------------
┏━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┓
┃ Query        ┃ main-e5404a1 ┃ ci-benches-3bf178a ┃        Change ┃
┡━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━┩
│ QQuery 1     │     445.21ms │          1446.96ms │  3.25x slower │
│ QQuery 2     │      60.82ms │          1062.72ms │ 17.47x slower │
│ QQuery 3     │     147.76ms │          1153.42ms │  7.81x slower │
│ QQuery 4     │      89.77ms │          1094.23ms │ 12.19x slower │
│ QQuery 5     │     208.05ms │          1215.64ms │  5.84x slower │
│ QQuery 6     │     109.18ms │           109.30ms │     no change │
│ QQuery 7     │     295.69ms │          1312.57ms │  4.44x slower │
│ QQuery 8     │     200.32ms │          1207.25ms │  6.03x slower │
│ QQuery 9     │     304.16ms │          1316.10ms │  4.33x slower │
│ QQuery 10    │     244.40ms │          1254.85ms │  5.13x slower │
│ QQuery 11    │      66.08ms │          1068.05ms │ 16.16x slower │
│ QQuery 12    │     127.32ms │          1132.51ms │  8.89x slower │
│ QQuery 13    │     179.09ms │          1191.68ms │  6.65x slower │
│ QQuery 14    │     129.67ms │           132.54ms │     no change │
│ QQuery 15    │     198.01ms │          1201.44ms │  6.07x slower │
│ QQuery 16    │      53.18ms │          1059.45ms │ 19.92x slower │
│ QQuery 17    │     323.30ms │           340.44ms │  1.05x slower │
│ QQuery 18    │     465.16ms │          1464.97ms │  3.15x slower │
│ QQuery 19    │     235.39ms │           235.56ms │     no change │
│ QQuery 20    │     193.32ms │          1214.60ms │  6.28x slower │
│ QQuery 21    │     458.16ms │          1349.42ms │  2.95x slower │
│ QQuery 22    │      54.80ms │          1059.73ms │ 19.34x slower │
└──────────────┴──────────────┴────────────────────┴───────────────┘
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━┓
┃ Benchmark Summary                 ┃            ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━┩
│ Total Time (main-e5404a1)         │  4588.85ms │
│ Total Time (ci-benches-3bf178a)   │ 22623.43ms │
│ Average Time (main-e5404a1)       │   208.58ms │
│ Average Time (ci-benches-3bf178a) │  1028.34ms │
│ Queries Faster                    │          0 │
│ Queries Slower                    │         19 │
│ Queries with No Change            │          3 │
└───────────────────────────────────┴────────────┘
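
Roughly, the comment-triggered workflow is structured along these lines. This is a simplified sketch only: the job layout, the `bench.sh` arguments, and the use of the `gh` CLI here are illustrative assumptions, not the exact contents of the workflow file in this PR.

```yaml
# Simplified sketch -- step names and bench.sh arguments are illustrative.
name: Benchmarks

on:
  issue_comment:
    types: [created]

jobs:
  benchmark:
    # React only to "/benchmark" comments left on pull requests
    if: github.event.issue.pull_request && startsWith(github.event.comment.body, '/benchmark')
    runs-on: ubuntu-latest
    env:
      GH_TOKEN: ${{ github.token }}
    steps:
      - uses: actions/checkout@v4
        with:
          fetch-depth: 0
      - name: Check out the PR head
        run: gh pr checkout ${{ github.event.issue.number }}
      - name: Benchmark the head and base commits and compare them
        run: |
          PR_BRANCH=$(git rev-parse --abbrev-ref HEAD)
          ./benchmarks/bench.sh data tpch
          ./benchmarks/bench.sh run tpch            # head commit
          git checkout main
          ./benchmarks/bench.sh run tpch            # base commit
          ./benchmarks/bench.sh compare main "$PR_BRANCH"
      - name: Upload the comparison for the commenting workflow
        uses: actions/upload-artifact@v4
        with:
          name: benchmark-results
          path: benchmarks/results
```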

Are these changes tested?

Sort of.

There is a big catch-22 at play here: the new workflows generally need to be on the default branch before GitHub will trigger them, so they can't be fully exercised from within this PR.

In the end I decided to go with an `issue_comment`-based benchmarking workflow (the event covers pull request comments too, though the workflow also needs to be on the main branch to run), and then have it trigger the PR commenting workflow via a `workflow_run` event. I figured that since there has to be a two-step motion to test this out anyway, we might as well try not to spam and add noise between steps (plus I envision the PR benchmarking being triggered by a comment anyway).
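
To illustrate the chaining, the follow-up commenting workflow could hook into the benchmark run roughly like this (again a sketch; the artifact name and the final posting step are assumptions, not the exact contents of this PR):

```yaml
# Simplified sketch -- the actual PR Comment workflow may differ.
name: PR Comment

on:
  workflow_run:
    # Runs from main once the comment-triggered Benchmarks workflow finishes
    workflows: [Benchmarks]
    types: [completed]

jobs:
  comment:
    if: github.event.workflow_run.conclusion == 'success'
    runs-on: ubuntu-latest
    permissions:
      pull-requests: write
    steps:
      # Fetch the comparison output uploaded by the benchmark run ...
      - uses: actions/download-artifact@v4
        with:
          name: benchmark-results
          run-id: ${{ github.event.workflow_run.id }}
          github-token: ${{ github.token }}
      # ... and post it back to the originating PR (PR number lookup omitted here)
      - name: Post the benchmark results as a PR comment
        run: echo "post the downloaded results via 'gh pr comment' or the REST API"
```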

Are there any user-facing changes?

None (there will be dev-facing changes with the follow-up).


gruuya marked this pull request as ready for review on March 7, 2024 at 18:22

alamb mentioned this pull request on Mar 8, 2024

alamb


Thank you @gruuya -- I think the basic pattern of this PR looks good to me. I wonder how we could test it... Maybe I could merge it to my fork and run it there 🤔

The basic idea looks great -- thank you. The major concern I have is that this PR seems to run the benchmarks on GitHub runners, as I understand it.

The downside of using GitHub runners is that I don't think they have stable baseline performance. They are shared, and they may also vary from one run to the next (e.g. a different processor). If the idea is to benchmark just the code change, I think keeping the runtime environment the same is important. For example, if the tests report that performance slows down on a PR but the problem was really that the GitHub runner was overloaded, that will be super confusing.

I think I had in my mind that we would have a dedicated benchmarking machine / VM running somewhere and run the benchmarks on that VM. I am pretty sure InfluxData would be willing to cover the cost of this VM (or we could take up a shared collection).

The benefit of this approach would be that the benchmark environment would be more controlled (the only change would be the software change), though it has a monetary cost and also won't scale up/down with our jobs.

@gruuya

The major concern I have is that this PR seems to run the benchmark on github runners, as I understand it

True, that is correct. My assumption was that the base performance would not vary greatly within a single run, since both benchmarks are executed in the same job over a relatively short time interval, but I guess this is not a given.

In addition, for this type of benchmarking (PR-vs-main) we're only interested in relative comparisons, so the longitudinal variance component, which is undoubtedly large, wouldn't come into play (unlike when tracking main's performance over time).

That said, I believe the present workflows should be easily extendable to use self-hosted runners once those become available. This would also bring additional benefits, such as shorter benchmark runs, both by persisting the test data (e.g. downloading hits.parquet takes about ~12 minutes in CI) and by speeding up the runs themselves (meaning we could use beefier scale factors, i.e. 10 or even 100).
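
For illustration, switching to a dedicated machine would (hypothetically) mostly be a matter of changing the runner labels on the benchmark job, assuming the machine is registered as a self-hosted runner with a `benchmarks` label:

```yaml
jobs:
  benchmark:
    # Hypothetical: target a dedicated, self-hosted benchmarking machine
    # instead of the shared GitHub-hosted runners.
    runs-on: [self-hosted, benchmarks]
```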

I wonder how we could test it... Maybe I could merge it to my fork and run it there 🤔

Oh that's a good idea, I believe I can test it out on our fork and provide the details here, thanks!

@gruuya

Ok I went ahead and tested it on our fork: splitgraph#1

Here it is catching a deliberate regression I introduced: splitgraph#1 (comment)

Here it is after removing the regression (and polishing the comment message a bit): splitgraph#1 (comment)

There are a couple of false positives in that last one, though it is running SF1, so with shorter run times it is more sensitive to any variations (though I believe such variations can be seen when running locally as well).

alamb changed the title from "Try running a basic comparison benchmark between base and pr commit" to "Add /benchmark github command to comparison benchmark between base and pr commit" on Mar 13, 2024

alamb


Thank you @gruuya -- I think this is a really neat step forward and we can refine this functionality over time.

I think we should also add a section in the benchmark documentation about this feature to make it easier to find.
https://github.com/apache/arrow-datafusion/tree/main/benchmarks#datafusion-benchmarks

We could do that as a follow on PR

Ok I went ahead and tested it on our fork: splitgraph#1

Here it is catching a deliberate regression I introduced: splitgraph#1 (comment)

Here it is after removing the regression (and polishing the comment message a bit): splitgraph#1 (comment)

Those results are very cool

There are a couple of false positives in that last one, though it is running SF1, so with shorter run times it is more sensitive to any variations (though I believe such variations can be seen when running locally as well).

Yes, you are correct, I have also seen similar variations locally.

alamb

@gruuya @alamb

Co-authored-by: Andrew Lamb andrew@nerdnetworks.org

alamb mentioned this pull request on Mar 13, 2024

@gruuya

I think we should also add a section in the benchmark documentation about this feature to make it easier to find.
https://github.com/apache/arrow-datafusion/tree/main/benchmarks#datafusion-benchmarks

We could do that as a follow on PR

Yup, makes sense! I'd also like to bump SF to 10 there to reduce the noise a bit, and perhaps add parquet/sorting benches as well (assuming those things don't take too long).

Eventually when we have a self-hosted runner we can add a selection of ClickBench queries too.

Thanks!

@alamb

Yup, makes sense! I'd also like to bump SF to 10 there to reduce the noise a bit, and perhaps add parquet/sorting benches as well (assuming those things don't take too long).

I think ClickBench would be a good choice too (it doesn't take too long and is an excellent benchmark for aggregation / filtering performance)

So how about we merge this PR once CI has passed and then file follow on tickets for the remaining tasks?

@gruuya

So how about we merge this PR once CI has passed and then file follow on tickets for the remaining tasks?

Sounds good to me!

Dandandan


This is awesome, thank you @gruuya !

alamb mentioned this pull request on Mar 29, 2024

@alamb

FYI this code was removed in #11165