Tweak the default PartialOrd::{lt,le,gt,ge}
by scottmcm · Pull Request #106065 · rust-lang/rust
Micro-optimize Ord::cmp for primitives
I originally started looking into this because in MIR, PartialOrd::cmp is huge, and even for trivial types like u32, which are theoretically a single statement to compare, the PartialOrd::cmp impl doesn't inline. A significant contributor to the size of the implementation is that it contains two comparisons, and that actually follows through to the final x86_64 codegen too, which is... strange. We don't need two cmp instructions to do a single Rust-level comparison. So I started tweaking the implementation and came up with the same thing as rust-lang#64082 (which I didn't know about at the time). I ran llvm-mca on it, per the issue linked in the code, to establish that it looked better, and submitted it for a benchmark run.
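For reference, here is a minimal sketch of the two shapes as free functions on u32 (cmp_chained and cmp_branchless are hypothetical names, not the actual trait impls): the old shape is a comparison chain, while the rust-lang#64082 shape derives the Ordering from two flag-producing comparisons and a subtraction.

```rust
use std::cmp::Ordering;

// The shape of the old default: a comparison chain that tends to lower
// to two `cmp` instructions plus branches/selects.
fn cmp_chained(a: u32, b: u32) -> Ordering {
    if a < b {
        Ordering::Less
    } else if a == b {
        Ordering::Equal
    } else {
        Ordering::Greater
    }
}

// The rust-lang#64082 shape: both comparisons become flag-setting
// instructions (setcc on x86_64), and the subtraction yields -1, 0, or 1,
// matching Ordering's #[repr(i8)] discriminants.
fn cmp_branchless(a: u32, b: u32) -> Ordering {
    match (a > b) as i8 - (a < b) as i8 {
        -1 => Ordering::Less,
        0 => Ordering::Equal,
        _ => Ordering::Greater,
    }
}
```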
The initial benchmark run regresses basically everything. By looking through the cachegrind diffs in the perf report, then perf annotate for regressed functions, I was able to identify one source of the regression: Ord::min and Ord::max no longer optimize well. Tweaking them to bypass Ord::cmp removed some regressions, but not much.
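To illustrate the min/max tweak, a sketch under the same assumptions (hypothetical free functions on u32, not the PR's diff): the cmp-based form asks LLVM to reconstruct a two-way choice from a three-way value, while the direct form is a single comparison.

```rust
use std::cmp::Ordering;

// Default-style max: routes through the three-way cmp and then collapses
// the Ordering back down to a two-way choice.
fn max_via_cmp(a: u32, b: u32) -> u32 {
    match a.cmp(&b) {
        Ordering::Less => b,
        _ => a,
    }
}

// Bypassing cmp: one comparison, which LLVM reliably lowers to a
// conditional move (or umax) on x86_64.
fn max_direct(a: u32, b: u32) -> u32 {
    if a < b { b } else { a }
}
```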
Diving back into the cachegrind diffs and disassembly, I found one huge, widespread issue: the codegen for Span's hash_stable regressed because span_data_to_lines_and_cols no longer inlined into it, because that function does a lot of Range<BytePos>::contains. The implementation of Range::contains uses PartialOrd multiple times, and we had massively regressed the codegen of Range::contains. The root problem here seems to be that PartialOrd is derived on BytePos, which is a simple wrapper around a u32. So for BytePos, PartialOrd::{le, lt, ge, gt} use the default impls, which go through PartialOrd::cmp, and LLVM fails to optimize these combinations of methods with the new Ord::cmp implementation. At a guess, the new implementation makes LLVM totally lose track of the fact that <Ord for u32>::cmp is an elaborate way to compare two integers.
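Roughly, the situation looks like this (a simplified stand-in for the rustc_span type; contains_sketch paraphrases what Range::contains does rather than quoting its source):

```rust
// Simplified stand-in for rustc_span's BytePos: a newtype over u32 with
// derived comparison impls.
#[derive(PartialEq, Eq, PartialOrd, Ord)]
struct BytePos(u32);

// Range<Idx>::contains is essentially `start <= item && item < end`.
// With the derived impls, each `<=`/`<` here is a default PartialOrd
// method that builds a full three-way Ordering (via the derived cmp on
// the inner u32) and then tests it, instead of emitting one u32 compare.
fn contains_sketch(start: &BytePos, end: &BytePos, item: &BytePos) -> bool {
    start <= item && item < end
}
```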
So I have low hopes for this overall, because my strategy (which is working) to recover the regressions is to avoid the "faster" implementation that this PR is based around. If we have to settle for an implementation of Ord::cmp which is sub-optimal on its own but is optimized better in combination with functions that use its return value in specific ways, so be it. However, one of the runs had an improvement in coercions. I don't know if that is jitter or relevant, but I'm still finding threads to pull here, so I'm going to keep at it.
For the moment I am hacking up the implementations on BytePos (sketched below) instead of modifying the code that derive(PartialOrd, Ord) expands to, because that would be hard, and it would also mean that we would just expand to more code, perhaps regressing compile time for that reason even if the generated assembly is more efficient.
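The hack has roughly this shape (a sketch, not the actual diff): drop the derived ordering impls and forward every method straight to the inner u32, so nothing round-trips through an Ordering.

```rust
use std::cmp::Ordering;

#[derive(PartialEq, Eq)]
struct BytePos(u32);

impl Ord for BytePos {
    #[inline]
    fn cmp(&self, other: &Self) -> Ordering {
        self.0.cmp(&other.0)
    }
}

// Hand-written instead of derived: every method delegates directly to the
// primitive operator, so none of them go through the three-way cmp.
impl PartialOrd for BytePos {
    #[inline]
    fn partial_cmp(&self, other: &Self) -> Option<Ordering> {
        Some(self.cmp(other))
    }
    #[inline]
    fn lt(&self, other: &Self) -> bool { self.0 < other.0 }
    #[inline]
    fn le(&self, other: &Self) -> bool { self.0 <= other.0 }
    #[inline]
    fn gt(&self, other: &Self) -> bool { self.0 > other.0 }
    #[inline]
    fn ge(&self, other: &Self) -> bool { self.0 >= other.0 }
}
```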
Hacking up the remainder of the PartialOrd/Ord methods on BytePos took us down to 3 regressions and 6 improvements, which is interesting. All the improvements are in coercions, so I'm sure this improved something, but whether it matters... hard to say. Based on the findings of @joboet, I'm going to cherry-pick rust-lang#106065 onto this branch, because that strategy seems to restore PartialOrd::lt and PartialOrd::ge to the original codegen, even when they are using our new Ord::cmp impl. If the remaining perf regressions are due to de-optimizing a PartialOrd::lt not on BytePos, this might be a further improvement.
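For context, the defaults that rust-lang#106065 touches have roughly this shape; this is a paraphrase as a standalone trait (PartialOrdSketch is a made-up name), not that PR's actual before/after, which lives in its diff.

```rust
use std::cmp::Ordering::{self, Equal, Greater, Less};

// Paraphrase of PartialOrd's default method shape. Everything funnels
// through partial_cmp, so the quality of `<` and `>=` codegen depends on
// LLVM seeing through however the Ordering gets constructed.
trait PartialOrdSketch<Rhs = Self> {
    fn partial_cmp(&self, other: &Rhs) -> Option<Ordering>;

    // `<` in terms of partial_cmp: true only for Some(Less).
    fn lt(&self, other: &Rhs) -> bool {
        matches!(self.partial_cmp(other), Some(Less))
    }

    // `>=` in terms of partial_cmp: true for Greater or Equal.
    fn ge(&self, other: &Rhs) -> bool {
        matches!(self.partial_cmp(other), Some(Greater | Equal))
    }
}
```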
Okay, that cherry-pick brought us down to 2 regressions, but those might be noise. We still have the same 6 improvements, all on coercions.
I think the next thing to try here is modifying the implementation of derive(PartialOrd) to automatically emit the modifications that I made to BytePos (directly implementing all the methods for newtypes). But even if that works, I think the effect of this change is so mixed that it's probably not worth merging with current LLVM. What I'm afraid of is that this change currently pessimizes matching on Ordering (sketched below), and that is the most natural thing to do with an enum. So I'm not closing this yet, but without a change from LLVM, I have other priorities at the moment.
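Concretely, the pattern I'm worried about is the everyday three-way match on cmp's result; a generic illustration (search_step is a made-up function, not code from the benchmarks):

```rust
use std::cmp::Ordering;

// The most natural use of cmp: drive a three-way decision, as in a binary
// search step. If the new cmp builds its Ordering in a way LLVM can't see
// through, every match like this pays for it.
fn search_step(key: u32, probe: u32) -> &'static str {
    match key.cmp(&probe) {
        Ordering::Less => "go left",
        Ordering::Greater => "go right",
        Ordering::Equal => "found",
    }
}
```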
r? @ghost