
Hello,

I would like to submit two papers that use LLVM to the Related Publications section.

Both papers focus on code isolation applied to perform piecewise compiler optimizations.
The code isolation process is performed by CERE, an open source tool based on LLVM.

The second paper is an extended version of the first one.

1) Piecewise Holistic Autotuning of Compiler and Runtime Parameters

@inproceedings{popov2016piecewise,
title={Piecewise Holistic Autotuning of Compiler and Runtime Parameters},
author={Popov, Mihail and Akel, Chadi and Jalby, William and de Oliveira Castro, Pablo},
booktitle={European Conference on Parallel Processing},
pages={238--250},
year={2016},
organization={Springer}
}

2) Piecewise holistic autotuning of parallel programs with CERE

@article{popov2017piecewise,
title={Piecewise holistic autotuning of parallel programs with CERE},
author={Popov, Mihail and Akel, Chadi and Chatelain, Yohan and Jalby, William and de Oliveira Castro, Pablo},
journal={Concurrency and Computation: Practice and Experience},
volume={29},
number={15},
year={2017},
publisher={Wiley Online Library}
}

Do not hesitate to contact me if you have any questions or need any additional documents.

Thank you,
Mihail Popov

-----------------------------------------------------------------------------------

PAPERS SUMMARY:

Piecewise Holistic Autotuning of Compiler and Runtime Parameters

Abstract. Current architecture complexity requires fine tuning of compiler
and runtime parameters to achieve full potential performance. Autotuning
substantially improves default parameters in many scenarios
but it is a costly process requiring a long iterative evaluation.
We propose an automatic piecewise autotuner based on CERE (Codelet
Extractor and REplayer). CERE decomposes applications into small
pieces called codelets: each codelet maps to a loop or to an OpenMP
parallel region and can be replayed as a standalone program.
Codelet autotuning achieves better speedups at a lower tuning cost. By
grouping codelet invocations with the same performance behavior, CERE
reduces the number of loops or OpenMP regions to be evaluated. Moreover,
unlike whole-program tuning, CERE customizes the set of best
parameters for each specific OpenMP region or loop.
We demonstrate CERE tuning of compiler optimizations, number of
threads and thread affinity on a NUMA architecture. On average over the
NAS 3.0 benchmarks, we achieve a speedup of 1.08× after tuning. Tuning
a single codelet is 13× cheaper than whole-program evaluation and
estimates the tuning impact on the original region with a 94.7% accuracy.
On a Reverse Time Migration (RTM) proto-application we achieve
a 1.11× speedup with a 200× cheaper exploration.

Piecewise Holistic Autotuning of Parallel Programs with CERE

Current architecture complexity requires fine tuning of compiler
and runtime parameters to achieve best performance. Autotuning
substantially improves default parameters in many scenarios but it is a
costly process requiring long iterative evaluations.
We propose an automatic piecewise autotuner based on CERE (Codelet
Extractor and REplayer). CERE decomposes applications into small
pieces called codelets: each codelet maps to a loop or to an OpenMP
parallel region and can be replayed as a standalone program.
Codelet autotuning achieves better speedups at a lower tuning cost. By
grouping codelet invocations with the same performance behavior, CERE
reduces the number of loops or OpenMP regions to be evaluated. Moreover,
unlike whole-program tuning, CERE customizes the set of best parameters
for each specific OpenMP region or loop.
We demonstrate the CERE tuning of compiler optimizations, number
of threads, thread affinity, and scheduling policy on both NUMA and
heterogeneous architectures. Over the NAS benchmarks, we achieve an
average speedup of 1.08× after tuning. Tuning a codelet is 13× cheaper
than whole-program evaluation and predicts the tuning impact with a
94.7% accuracy. Similarly, exploring thread configurations and scheduling
policies for a Black-Scholes solver on a heterogeneous big.LITTLE
architecture is over 40× faster using CERE.
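
To illustrate the difference between whole-program and piecewise tuning described
in the abstracts above, here is a minimal Python sketch. It is not CERE's
implementation or interface: the codelet names, configuration labels, timings,
and helper functions are all hypothetical and invented for illustration only.
It assumes each codelet has already been timed under every configuration.

```python
# Hypothetical replay timings (seconds) for each codelet under each
# parameter configuration. These numbers are invented for illustration
# and are not taken from the papers.
timings = {
    "loop_a": {"O2_4threads": 4.0, "O3_4threads": 3.0, "O3_8threads": 3.5},
    "omp_region_b": {"O2_4threads": 6.0, "O3_4threads": 6.5, "O3_8threads": 4.0},
}


def best_global_config(timings):
    """Whole-program tuning: one configuration applied to every codelet."""
    configs = next(iter(timings.values())).keys()
    return min(configs, key=lambda c: sum(t[c] for t in timings.values()))


def best_piecewise_configs(timings):
    """Piecewise tuning: pick the fastest configuration per codelet."""
    return {name: min(t, key=t.get) for name, t in timings.items()}


def total_time(timings, choice):
    """Total time when each codelet runs with its chosen configuration."""
    return sum(timings[name][choice[name]] for name in timings)


global_cfg = best_global_config(timings)
piecewise = best_piecewise_configs(timings)

print("global:", global_cfg,
      total_time(timings, {name: global_cfg for name in timings}))
print("piecewise:", piecewise, total_time(timings, piecewise))
```

With these made-up timings, the piecewise choice is never slower than the single
global configuration, because each region is free to keep its own best parameters.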