[llvm-dev] mischeduler (pre-RA) experiments
Florian Hahn via llvm-dev llvm-dev at lists.llvm.org
Mon Nov 27 07:57:16 PST 2017
- Previous message: [llvm-dev] mischeduler (pre-RA) experiments
- Next message: [llvm-dev] LLVM (Cool/Warm) DOT Printers for Profiling
- Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]
Hi,
On 23/11/2017 10:53, Jonas Paulsson via llvm-dev wrote:
Hi,
I have been experimenting for a while with the tryCandidate() method of the pre-RA mischeduler. I have by chance found some parameters that give quite good results on benchmarks on SystemZ (on average 1% improvement, some improvements of several percent and very few regressions). Basically, I add a "latency heuristic boost" just above the processor resources checking:

  tryCandidate() {
    ...
    // Avoid increasing the max pressure of the entire region.
    if (DAG->isTrackingPressure() &&
        tryPressure(TryCand.RPDelta.CurrentMax, Cand.RPDelta.CurrentMax,
                    TryCand, Cand, RegMax, TRI, DAG->MF))
      return;

    /// INSERTION POINT
    ...
  }

I had started to experiment with adding tryLatency() in various places, and found this to be the best spot for SystemZ/SPEC-2006. This gave noticeable improvements immediately that were too good to ignore, so I started figuring out things about the regressions that of course also showed up. Eventually, after many iterations, I have come up with a combined heuristic that reads:

  if (((TryCand.Latency >= 7 &&
        "Longest latency of any SU in DAG" < 15) ||
       "Number of SUnits in DAG" > 180) &&
      tryLatency(TryCand, Cand, *Zone))
    return;

In English: do tryLatency either if the latency of the candidate is >= 7 and the DAG has no really long latency SUs (lat > 15), or alternatively always if the DAG is really big (>180 SUnits).
Thanks for those experiments! I made similar observations when trying to tune the scheduling heuristics for AArch64/ARM cores. For example, I put this patch up for review, that makes scheduling for latency more aggressive https://reviews.llvm.org/D38279. It gave +0.74% on SPEC2017 score on Cortex-A57. But I never really pushed any further on this so far.
The thing I found is that when deciding to schedule for latency during bottom-up scheduling, we use CurrZone.getCurrCycle() to get the number of issued cycles, which is then added to the remaining latency. Unless I miss something, the cycle will get bumped by one after scheduling an instruction, regardless of the latency. It seems like CurrZone.getScheduledLatency() would more accurately represent the latency scheduled so far, but I am probably missing something.
The test case I was looking into on AArch64 was, where the long latency instruction SDIV was not scheduled as early as possible.
define hidden i32 @foo(i32 %a, i32 %b, i32 %c, i32* %d) local_unnamed_addr #0 {
entry:
  %xor = xor i32 %c, %b
  %ld = load i32, i32* %d
  %add = add nsw i32 %xor, %ld
  %div = sdiv i32 %a, %b
  %sub = sub i32 %div, %add
  ret i32 %sub
}
Cheers, Florian