[llvm-dev] RFC: Sanitizer-based Heap Profiler (original) (raw)
Kostya Serebryany via llvm-dev llvm-dev at lists.llvm.org
Thu Jul 9 13:37:38 PDT 2020
- Previous message: [llvm-dev] RFC: Sanitizer-based Heap Profiler
- Next message: [llvm-dev] RFC: Sanitizer-based Heap Profiler
- Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]
On Thu, Jul 9, 2020 at 12:36 PM Teresa Johnson <tejohnson at google.com> wrote:
On Wed, Jul 8, 2020 at 6:30 PM Kostya Serebryany <kcc at google.com> wrote:
On Wed, Jun 24, 2020 at 4:58 PM Teresa Johnson <tejohnson at google.com> wrote: Hi all, I've included an RFC for a heap profiler design I've been working on in conjunction with David Li. Please send any questions or feedback. For sanitizer folks, one area of feedback is on refactoring some of the *ASAN shadow setup code (see the Shadow Memory section). I have no strong opinion on the refactoring of the shadow set up code. Getting such code in one common place may or may not make the future maintenance cheaper. It was partially motivated by my looking to see what version of shadow setup to use, as they initially looked somewhat different, only to discover after carefully tracing through all the values that they were essentially identical but just structured differently (at least between asan and hwasan). This is somewhat philosophical, but to me it seems better to refactor when possible. One question about granularity: tcmalloc's allocation granularity is 8, so by using 64- or 32- byte granules you lose some of the precision. It depends what you mean by precision. For a single allocation, I end up accumulating the access counts across all corresponding shadows (i.e. if >64 bytes). Although I also added in a measure of utilization, which is at the less precise 64-byte granularity. I'm aligning everything to 32 bytes, which with the 32-byte header means there is no aliasing of shadows between different allocations. That being said...
I wonder if we lose something by using a malloc that is different from the production one (8 byte aligned / no header vs 32-byte aligned with 32-byte header). I mean, you will get correct results about individual allocations, but you will lose the information about the cases when up to 8 allocations share a cache line.
(It's probably outside of your goals anyway, so probably ignore this)
How important is that? Another option is to use 8- or 16-byte granularity and 4-byte (saturated?) counters. We've discussed using smaller granularities with saturating math, along with sampling to reduce the saturation. Lots of options to play with -- which is why my recommendation is to start with a callback-based instrumentation (just use tsan compiler pass for that), figure out exactly what you need and then either implement the one best instrumentation inline, or force the callbacks to be inlined with (Thin)LTO. I'm not convinced yet that using callbacks makes this difficult to play with, as it can be configured pretty easily from options, and the inline instrumentation part has been pretty trivial thus far. We'd also have to build in the counter update consolidation in either case (either with less counter updates or with fewer/aggregated callbacks).
I am probably just nitpicking about callbacks. It's not hard either way. It's just that I've been using the tsan instrumentation as the basis of multiple similar experiments and found it to be easier than hacking the compiler for N-th time.
OTOH, with smaller granularity you lose most of the "counter update consolidation" optimization. Some of it (at least where we would consolidate updates to different fields using alignment guarantees). But we'd still be able to aggregate across potentially aliasing calls, including out of loop bodies, where the updates are to the same field.
Eagerly looking forward to using this profiler! Thanks! Teresa --kcc
Thanks, Teresa
RFC: Sanitizer-based Heap Profiler Summary This document provides an overview of an LLVM Sanitizer-based heap profiler design. Motivation The objective of heap memory profiling is to collect critical runtime information associated with heap memory references and information on heap memory allocations. The profile information will be used first for tooling, and subsequently to guide the compiler optimizer and allocation runtime to layout heap objects with improved spatial locality. As a result, DTLB and cache utilization will be improved, and program IPC (performance) will be increased due to reduced TLB and cache misses. More details on the heap profile guided optimizations will be shared in the future. Overview The profiler is based on compiler inserted instrumentation of load and store accesses, and utilizes runtime support to monitor heap allocations and profile data. The target consumer of the heap memory profile information is initially tooling and ultimately automatic data layout optimizations performed by the compiler and/or allocation runtime (with the support of new allocation runtime APIs). Each memory address is mapped to Shadow Memory <https://en.wikipedia.org/wiki/Shadowmemory>, similar to the approach used by the Address Sanitizer <https://github.com/google/sanitizers/wiki/AddressSanitizer> (ASAN). Unlike ASAN, which maps each 8 bytes of memory to 1 byte of shadow, the heap profiler maps 64 bytes of memory to 8 bytes of shadow. The shadow location implements the profile counter (incremented on accesses to the corresponding memory). This granularity was chosen to help avoid counter overflow, but it may be possible to consider mapping 32-bytes to 4 bytes. To avoid aliasing of shadow memory for different allocations, we must choose a minimum alignment carefully. As discussed further below, we can attain a 32-byte minimum alignment, instead of a 64-byte alignment, by storing necessary heap information for each allocation in a 32-byte header block. The compiler instruments each load and store to increment the associated shadow memory counter, in order to determine hotness. The heap profiler runtime is responsible for tracking allocations and deallocations, including the stack at each allocation, and information such as the allocation size and other statistics. I have implemented a prototype built using a stripped down and modified version of ASAN, however this will be a separate library utilizing sanitizercommon components. Compiler A simple HeapProfiler instrumentation pass instruments interesting memory accesses (loads, stores, atomics), with a simple load, increment, store of the associated shadow memory location (computed via a mask and shift to do the mapping of 64 bytes to 8 byte shadow, and add of the shadow offset). The handling is very similar to and based off of the ASAN instrumentation pass, with slightly different instrumentation code. Various techniques can be used to reduce the overhead, by aggressively coalescing counter updates (e.g. given the 32-byte alignment, accesses known to be in the same 32-byte block, or across possible aliases since we don’t care about the dereferenced values). Additionally, the Clang driver needs to set up to link with the runtime library, much as it does with the sanitizers. A -fheapprof option is added to enable the instrumentation pass and runtime library linking. Similar to -fprofile-generate, -fheapprof will accept an argument specifying the directory in which to write the profile. Runtime The heap profiler runtime is responsible for tracking and reporting information about heap allocations and accesses, aggregated by allocation calling context. For example, the hotness, lifetime, and cpu affinity. A new heapprof library will be created within compiler-rt. It will leverage support within sanitizercommon, which already contains facilities like stack context tracking, needed by the heap profiler. Shadow Memory There are some basic facilities in sanitizercommon for mmap’ing the shadow memory, but most of the existing setup lives in the ASAN and HWASAN libraries. In the case of ASAN, there is support for both statically assigned shadow offsets (the default on most platforms), and for dynamically assigned shadow memory (implemented for Windows and currently also used for Android and iOS). According to kcc, recent experiments show that the performance with a dynamic shadow is close to that with a static mapping. In fact, that is the only approach currently used by HWASAN. Given the simplicity, the heap profiler will be implemented with a dynamic shadow as well. There are a number of functions in ASAN and HWASAN related to setup of the shadow that are duplicated but very nearly identical, at least for linux (which seems to be the only OS flavor currently supported for HWASAN). E.g. ReserveShadowMemoryRange, ProtectGap, and FindDynamicShadowStart (in ASAN there is another nearly identical copy in PremapShadow, used by Android, whereas in HW ASAN the premap handling is already commoned with the non-premap handling). Rather than make yet another copy of these mechanisms, I propose refactoring them into sanitizercommon versions. Like HWASAN, the initial version of the heap profiler will be supported for linux only, but other OSes can be added as needed similar to ASAN. StackTrace and StackDepot The sanitizer already contains support for obtaining and representing a stack trace in a StackTrace object, and storing it in the StackDepot which “efficiently stores huge amounts of stack traces”. This is in the sanitizercommon subdirectory and the support is shared by ASAN and ThreadSanitizer. The StackDepot is essentially an unbounded hash table, where each StackTrace is assigned a unique id. ASAN stores this id in the alloccontextid field in each ChunkHeader (in the redzone preceding each allocation). Additionally, there is support for symbolizing and printing StackTrace objects. ChunkHeader The heap profiler needs to track several pieces of information for each allocation. Given the mapping of 64-bytes to 8-bytes shadow, we can achieve a minimum of 32-byte alignment by holding this information in a 32-byte header block preceding each allocation. In ASAN, each allocation is preceded by a 16-byte ChunkHeader. It contains information about the current allocation state, user requested size, allocation and free thread ids, the allocation context id (representing the call stack at allocation, assigned by the StackDepot as described above), and misc other bookkeeping. For heap profiling, this will be converted to a 32-byte header block. Note that we could instead use the metadata section, similar to other sanitizers, which is stored in a separate location. However, as described above, storing the header block with each allocation enables 32-byte alignment without aliasing shadow counters for the same 64 bytes of memory. In the prototype heap profiler implementation, the header contains the following fields: // Should be 32 bytes struct ChunkHeader { // 1-st 4 bytes // Carry over from ASAN (available, allocated, quarantined). Will be // reduced to 1 bit (available or allocated). u32 chunkstate : 8; // Carry over from ASAN. Used to determine the start of user allocation. u32 frommemalign : 1; // 23 bits available // 2-nd 4 bytes // Carry over from ASAN (comment copied verbatim). // This field is used for small sizes. For large sizes it is equal to // SizeClassMap::kMaxSize and the actual size is stored in the // SecondaryAllocator's metadata. u32 userrequestedsize : 29; // 3-rd 4 bytes u32 cpuid; // Allocation cpu id // 4-th 4 bytes // Allocation timestamp in ms from a baseline timestamp computed at // the start of profiling (to keep this within 32 bits). u32 timestampms; // 5-th and 6-th 4 bytes // Carry over from ASAN. Used to identify allocation stack trace. u64 alloccontextid; // 7-th and 8-th 4 bytes // UNIMPLEMENTED in prototype - needs instrumentation and IR support. u64 datatypeid; // hash of type name }; As noted, the chunk state can be reduced to a single bit (no need for quarantined memory in the heap profiler). The header contains a placeholder for the data type hash, which is not yet implemented as it needs instrumentation and IR support. Heap Info Block (HIB) On a deallocation, information from the corresponding shadow block(s) and header are recorded in a Heap Info Block (HIB) object. The access count is computed from the shadow memory locations for the allocation, as well as the percentage of accessed 64-byte blocks (i.e. the percentage of non-zero 8-byte shadow locations for the whole allocation). Other information such as the deallocation timestamp (for lifetime computation) and deallocation cpu id (to determine migrations) are recorded along with the information in the chunk header recorded on allocation. The prototyped HIB object tracks the following: struct HeapInfoBlock { // Total allocations at this stack context u32 alloccount; // Access count computed from all allocated 64-byte blocks (track total // across all allocations, and the min and max). u64 totalaccesscount, minaccesscount, maxaccesscount; // Allocated size (track total across all allocations, and the min and max). u64 totalsize; u32 minsize, maxsize; // Lifetime (track total across all allocations, and the min and max). u64 totallifetime; u32 minlifetime, maxlifetime; // Percent utilization of allocated 64-byte blocks (track total // across all allocations, and the min and max). The utilization is // defined as the percentage of 8-byte shadow counters corresponding to // the full allocation that are non-zero. u64 totalpercentutilized; u32 minpercentutilized, maxpercentutilized; // Allocation and deallocation timestamps from the most recent merge into // the table with this stack context. u32 alloctimestamp, dealloctimestamp; // Allocation and deallocation cpu ids from the most recent merge into // the table with this stack context. u32 alloccpuid, dealloccpuid; // Count of allocations at this stack context that had a different // allocation and deallocation cpu id. u32 nummigratedcpu; // Number of times the lifetime of the entry being merged had its lifetime // overlap with the previous entry merged with this stack context (by // comparing the new alloc/dealloc timestamp with the one last recorded in // the entry in the table. u32 numlifetimeoverlaps; // Number of times the alloc/dealloc cpu of the entry being merged was the // same as that of the previous entry merged with this stack context u32 numsamealloccpu; u32 numsamedealloccpu; // Hash of type name (UNIMPLEMENTED). This needs instrumentation support and // possibly IR changes. u64 datatypeid; } HIB Table The Heap Info Block Table, which is a multi-way associative cache, holds HIB objects from deallocated objects. It is indexed by the stack allocation context id from the chunk header, and currently utilizes a simple mod with a prime number close to a power of two as the hash (because of the way the stack context ids are assigned, a mod of a power of two performs very poorly). Thus far, only 4-way associativity has been evaluated. HIB entries are added or merged into the HIB Table on each deallocation. If an entry with a matching stack alloc context id is found in the Table, the newly deallocated information is merged into the existing entry. Each HIB Table entry currently tracks the min, max and total value of the various fields for use in computing and reporting the min, max and average when the Table is ultimately dumped. If no entry with a matching stack alloc context id is found, a new entry is created. If this causes an eviction, the evicted entry is dumped immediately (by default to stderr, otherwise to a specified report file). Later post processing can merge dumped entries with the same stack alloc context id. Initialization _For ASAN, an asaninit function initializes the memory allocation tracking support, and the ASAN instrumentation pass in LLVM creates a global constructor to invoke it. The heap profiler prototype adds a new _heapprofinit function, which performs heap profile specific initialization, and the heap profile instrumentation pass calls this new init function instead by a generated global constructor. It currently _additionally invokes asaninit since we are leveraging a modified ASAN runtime. Eventually, this should be changed to initialize refactored common support. _Note that asan init is also placed in the .preinitarray when it is available, so it is invoked even earlier than global constructors. _Currently, it is not possible to do this for heapprofinit, as it calls timespecget in order to get a baseline timestamp (as described in the ChunkHeader comments the timestamps (ms) are actually offsets from the baseline timestamp, in order to fit into 32 bits), and system calls cannot be made that early (dlinit is not complete). Since the constructor priority is 1, it should be executed early enough that there are very few allocations before it runs, and likely the best solution is to simply ignore any allocations before initialization. Dumping For the prototype, the profile is dumped as text with a compact raw format to limit its size. Ultimately it should be dumped in a more compact binary format (i.e. into a different section of the raw instrumentation based profile, with llvm-profdata performing post-processing) which is TBD. HIB Dumping As noted earlier, HIB Table entries are created as memory is deallocated. At the end of the run (or whenever dumping is requested, discussed later), HIB entries need to be created for allocations that are still live. Conveniently, the sanitizer allocator already contains a mechanism to walk through all chunks of memory it is tracking ( ForEachChunk). The heap profiler simply looks for all chunks with a chunk state of allocated, and creates a HIB the same as would be done on deallocation, adding each to the table. A HIB Table mechanism for printing each entry is then invoked. By default, the dumping occurs: - on evictions - full table at exit (when the static Allocator object is destructed) For running in a load testing scenario, we will want to add a mechanism to provoke finalization (merging currently live allocations) and dumping of _the HIB Table before exit. This would be similar to the llvmprofiledump facility used for normal PGO counter dumping. Stack Trace Dumping There is existing support for dumping symbolized StackTrace objects. A wrapper to dump all StackTrace objects in the StackDepot will be added. This new interface is invoked just after the HIB Table is dumped (on exit or via dumping interface). Memory Map Dumping In cases where we may want to symbolize as a post processing step, we may need the memory map (from /proc/self/smaps). Specifically, this is needed to symbolize binaries using ASLR (Address Space Layout Randomization). There is already support for reading this file and dumping it to the specified report output file (DumpProcessMap()). This is invoked when the profile output file is initialized (HIB Table construction), so that the memory map is available at the top of the raw profile. Current Status and Next Steps As mentioned earlier, I have a working prototype based on a simplified stripped down version of ASAN. My current plan is to do the following: 1. Refactor out some of the shadow setup code common between ASAN and HWASAN into sanitizercommon. 2. Rework my prototype into a separate heapprof library in compiler-rt, using sanitizercommon support where possible, and send patches for review. 3. Send patches for the heap profiler instrumentation pass and related clang options. 4. Design/implement binary profile format -- Teresa Johnson | Software Engineer | tejohnson at google.com | -- Teresa Johnson | Software Engineer | tejohnson at google.com | -------------- next part -------------- An HTML attachment was scrubbed... URL: <http://lists.llvm.org/pipermail/llvm-dev/attachments/20200709/307c710b/attachment-0001.html>
- Previous message: [llvm-dev] RFC: Sanitizer-based Heap Profiler
- Next message: [llvm-dev] RFC: Sanitizer-based Heap Profiler
- Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]