[InstrProfiling] Lightweight Instrumentation

RFC: Lightweight Instrumentation

Hi all,

Our team at Facebook would like to propose a lightweight variant of IR instrumentation PGO for use in the mobile space. IRPGO is a proven technology in LLVM that can boost performance for server workloads. However, the larger binary resulting from instrumentation significantly limits its use for mobile applications. In this proposal, we introduce a few changes to IRPGO to reduce the instrumented binary size, making it suitable for PGO on mobile devices.

This proposal is driven by the same need behind the earlier MIP (machine IR profile) prototype. Unlike MIP, which diverges significantly from IRPGO, the proposed lightweight instrumentation fits into the existing IRPGO framework and needs only a few extensions to achieve a smaller instrumented binary.

We’d like to share the new design and results from our prototype and get feedback.

Best,
Ellis, Kyungwoo, and Wenlei

Motivation

In the mobile space, profile-guided optimization can have an outsized impact on performance, just as it does for server workloads. However, conventional instrumentation increases binary size and code size by as much as 50%, which limits its use for mobile applications for two reasons:

Overview

The size overhead from IRPGO mainly comes from two things: 1) metadata for mapping raw counts back to IR/CFG, which has to stay with the binary, and 2) the increased .text size due to insertion of instrumented code and less effective optimization after instrumentation. Two extensions are proposed, one to reduce the size overhead from each of the above:

Extractable Metadata

With today’s IRPGO, the instrumentation runtime dumps out a profraw profile at the end of training. The runtime creates a header and appends data from the __llvm_prf_data, __llvm_prf_cnts, and __llvm_prf_names sections to create a profraw profile. The __llvm_prf_data section contains references to each function’s profile data (in __llvm_prf_cnts) and name (in __llvm_prf_names), so these sections are needed to correlate profile data with the instrumented functions.

Some kind of metadata to correlate counts back to IR (specifically CFG blocks) is unavoidable. One way to reduce binary size is to make such metadata extractable so it doesn’t have to be shipped to mobile devices. We could make __llvm_prf_data and __llvm_prf_names extractable, but the cost would be non-trivial and it would be a breaking change. On the other hand, debug info is extractable from the binary and it already does a very good job of maintaining the mapping between addresses and source locations / symbols. Sample PGO depends entirely on debug info for profile correlation. So we picked debug info as the alternative for extractable metadata.

In our proposed instrumentation, we create a special global struct, e.g., __profc__Z3foov, to hold counters for a particular function. The __llvm_prf_cnts data section holds all of these structs and serves as a placeholder for raw profile counters. In our final instrumented binary, we only have probe instructions and raw profile data without any instrumentation metadata, i.e., there are no __llvm_prf_names or __llvm_prf_data sections but we still have a __llvm_prf_cnts section.

At runtime, we dump the __llvm_prf_cnts section to a file without any processing after profiling. To differentiate it from IRPGO, the output from the runtime is called proflite, and we can add another VARIANT_MASK_ flag to the Version field of the profile header.

At llvm-profdata post-processing time, we use debug info to correlate our raw profile data as follows. First we identify an instrumented function and look for its special global struct that holds counters (__profc__Z3foov) in the debug info. The debug info can tell us the address of that symbol in the binary, and we can compute its offset from the start of the __llvm_prf_cnts section. Then we can use that offset to read the function entry and block counters from the proflite file. Finally we populate profdata output for each function following the existing format.
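To make the correlation step concrete, here is a minimal Python sketch of the offset computation described above. It is not the prototype's code. It assumes the proflite payload is exactly the bytes of __llvm_prf_cnts (any header already stripped), that counters are little-endian 8-byte integers, and that the symbol addresses and counter counts have already been pulled out of the debug info into a plain dictionary. The helper name correlate_counters, the second symbol, and all addresses are hypothetical.

```python
import struct

COUNTER_SIZE = 8  # bytes per counter; assumes today's 8-byte counters


def correlate_counters(raw_counters, section_start, counter_symbols):
    """Map each counter symbol to the counter values stored for it.

    raw_counters    -- bytes of the dumped __llvm_prf_cnts section (assumed: no header)
    section_start   -- address of __llvm_prf_cnts in the instrumented binary
    counter_symbols -- {symbol: (address, num_counters)}, as if read from debug info
    """
    profile = {}
    for symbol, (address, num_counters) in counter_symbols.items():
        # Offset of this function's counter struct within the section.
        offset = address - section_start
        end = offset + num_counters * COUNTER_SIZE
        if offset < 0 or end > len(raw_counters):
            raise ValueError(f"{symbol} lies outside the counter section")
        # Assumed little-endian unsigned 64-bit counters.
        values = struct.unpack_from(f"<{num_counters}Q", raw_counters, offset)
        profile[symbol] = list(values)
    return profile


# Hypothetical example: two functions with 3 and 2 counters respectively.
raw = struct.pack("<5Q", 10, 7, 3, 4, 4)
symbols = {
    "__profc__Z3foov": (0x1000, 3),
    "__profc__Z3barv": (0x1018, 2),
}
print(correlate_counters(raw, 0x1000, symbols))
```

A real implementation would read the symbol table from DWARF and emit indexed profdata records. The sketch only shows that address minus section start is enough to locate a function's counters in the dump.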

Value profiling is not supported with extractable metadata for now, though we believe it can be added following a similar scheme.

To improve debug info quality for profile correlation, -fdebug-info-for-profiling from AutoFDO can be used. We could also use pseudo-probes from CSSPGO as an alternative form of metadata, which is also fully extractable.

We propose a new flag -fprofile-generate-correlate=[profdata|debug-info|pseudo-probe] to choose what metadata to use for profile correlation. The options are to correlate with today’s IRPGO metadata, kept in its own sections (__llvm_prf_data and __llvm_prf_names); with debug info; or with pseudo-probes.

Coarse-grained Instrumentation

In addition to reducing metadata size (__llvm_prf_names and __llvm_prf_data), we can also reduce .text size and __llvm_prf_cnts size. We do this by 1) instrumenting only function entries instead of each block and 2) lowering precision by tracking single-byte coverage data rather than 8-byte counters. This is a trade-off between profile quality and binary size.

Function profile vs. block profile and counting mode vs. coverage mode can be selected independently using our proposed flag -fprofile-generate-mode=[func-cov|block-cov|func-cnt|block-cnt], and they work with both extractable metadata and IRPGO’s existing correlation method. func-cov and block-cov use single-byte booleans for coverage data, while func-cnt and block-cnt use 8-byte counters. block-cnt corresponds to today’s IRPGO and is the default.
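As a rough illustration of how the four modes trade precision for size, the sketch below estimates the __llvm_prf_cnts payload for each mode. It is only a back-of-the-envelope model under stated assumptions: one probe per function entry or per block, no alignment or padding, and made-up function and block counts. The helper name counter_section_bytes is hypothetical.

```python
# (granularity, bytes per probe) for each proposed mode.
MODES = {
    "func-cov": ("function", 1),   # single-byte coverage flag per function
    "block-cov": ("block", 1),     # single-byte coverage flag per block
    "func-cnt": ("function", 8),   # 8-byte counter per function
    "block-cnt": ("block", 8),     # 8-byte counter per block (today's IRPGO)
}


def counter_section_bytes(mode, num_functions, num_blocks):
    """Estimate counter storage, assuming one probe per function or block."""
    granularity, bytes_per_probe = MODES[mode]
    probes = num_functions if granularity == "function" else num_blocks
    return probes * bytes_per_probe


# Hypothetical program with 1,000 functions and 12,000 basic blocks.
for mode in MODES:
    print(mode, counter_section_bytes(mode, 1_000, 12_000))
```

The model covers only counter storage. The .text savings from inserting fewer and smaller probes come on top of it and are not captured here.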

When using a profile generated from modes other than block-cnt, additional profile inference is needed before the counts can be consumed by optimizations. Such inference is done during profile loading and so it’s transparent to optimizations.

Workflow

Since these are extensions that share the same underlying PGO framework, the workflow for lightweight PGO is very similar to existing IRPGO.

The diagram below has the PGO workflow today (shown in red) in comparison with the workflow for lightweight instrumentation (shown in green). We first create an instrumentation build that produces a raw profile at runtime. Then we use the llvm-profdata tool to convert that raw profile to a profile that the compiler can consume in the PGO build. The main difference for lightweight instrumentation is that we create an instrumentation build with debug info and we use that debug info to create our final profile.

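[image.png: diagram comparing today’s PGO workflow (red) with the lightweight instrumentation workflow (green)]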

Prototype & Results

We have a proof of concept using DWARF as the extractable metadata and single-byte function coverage instrumentation. We measured code size by building Clang with and without instrumentation using -Oz and no value profiling. Our lightweight instrumented Clang binary is only 4 MB (3.48%) larger than a non-instrumented binary. By comparison, a Clang binary built with today’s PGO instrumentation is 54 MB (46.96%) larger. If we used debug info to correlate normal instrumentation (without value profiling) instead of just function coverage, we would expect to see an overhead of 43.2 MB (37.5%). We don’t have performance data from Clang experiments using the prototype since not all components are implemented. However, an earlier alternative implementation (similar to MIP) delivered a good performance boost for mobile applications.
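The post does not state the size of the non-instrumented Clang binary, but it can be back-derived from each reported pair of absolute and relative overhead. The short sketch below does that arithmetic as a consistency check. The baseline it prints is an inference from the figures above, not a measured number, and the helper name is hypothetical.

```python
def implied_baseline_mb(increase_mb, increase_pct):
    """Baseline size implied by an absolute and a relative size increase."""
    return increase_mb / (increase_pct / 100.0)


# (size increase in MB, size increase in percent), as reported above.
reported = {
    "lightweight (function coverage + debug info)": (4.0, 3.48),
    "today's IRPGO": (54.0, 46.96),
    "block counters + debug info (estimate)": (43.2, 37.5),
}

for label, (mb, pct) in reported.items():
    print(f"{label}: implied baseline {implied_baseline_mb(mb, pct):.1f} MB")
```

All three pairs point to a baseline of roughly 115 MB, so the reported figures are mutually consistent.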

table-large.jpg