[RFC] Add a new Structure Layout Optimization pass (original) (raw)

Background

An array of structures in memory is laid out in a way that is not very efficient when cache accesses are taken into account. In the following example, when the element ‘a’ of successive structures are accessed in a loop, the cache misses often after a few accesses.

What we are trying to add here is an optimization that would better utilize the cache and results in misses less often.

Structure Peeling

In the above image, we see that the elements of different structures are now rearranged so that all the different elements from all the structure in the original array are now laid out together. This would mean that whenever an element ‘a’ is accessed, the subsequent elements ‘a’ from the subsequent original array elements are loaded to the cache, resulting in misses that occur less often. This kind of layout rearrangement can also be beneficial in enabling loop optimizations like vectorization.

Proposed Implementation

There are multiple research papers with implementation details that have been adapted by a few compilers. But they are heavily reliant on type escape analyses which do not provide for a good basis when dealing with a language like C in which types can be freely casted to and from. Even if we were to attempt to implement such an analysis, it would be very limiting. Moreover, with LLVM adapting opaque pointers this has been made impossible to perform reliably as well. So we take an alternate approach of analysing allocations and determining if the contents can be rearranged.

Analysis

The objective of the analysis is to determine which allocations can be rearranged as described above. This analysis does not limit itself to structures and would work equally well in situations where the memory accesses follow the pattern of strided accesses within the boundaries of the allocation.

We start with the allocation and look at its users and their users recursively. We are particularly interested in all the GEPs, loads and stores - we use them to essentially reconstruct a view of the structure by collecting all the sizes present at different offsets of the allocation that are being accessed.

We check, for each user of the allocation, if it is a:

GEP - collect its constant offset and variable offset multiplier. The variable offset multiplier of the GEP following the first GEP should all be the same - this means that we are indexing into GEPs of the same structure
Load/Store - this user would be following a GEP, hence a constant offset would be present. Collect the size being accessed at this offset. This size should be the same for any further access (load/store) at the same offset.

If the conditions are met, this allocation is safe for rearrangement.

Transformation

The transformation is just a rewriting of GEPs so that the accesses now happen in a way that rearranges the elements in memory with every write. The uses of the GEPs are subsequently replaced with the new GEPs.

Implementation

Link to the code on my fork - Add basic Structure Layout Optimization · rahulana-quic/llvm-project@79e8dc9 · GitHub

This is a very limited implementation intended to showcase the kind of analysis and the transformation that needs to be performed - more complex cases will be handled shortly based on the feedback for the proposal.

Example

For the above C example, the diff with and without the proposed transformation:

Looks very interesting! From your initial implementation do you have any data/rough-estimates on speedups you might be seeing?

Can you point me to these papers and the compiler implementations? Many thanks!

pogo59 August 7, 2024, 11:13am 3

I agree this sort of thing can make a huge difference in performance. The biggest problem is that for ABI purposes languages typically nail down memory layout?

Hi @rahulana-quic,
Nice post, related to this, I was working on optimizing the layout of allocations in LLVM IR.
I have a few PRs here that use the Attributor framework.
I made a new AA (AAAllocationInfo) that does exactly what you described in the implementation. It uses AAPointerInfo to get the offsets and corresponding instructions that cause an access.
The implementation is more fine grained since we don’t look at types. We look at the allocation on a byte level.
There are 2 PRs open right now:
1.) For each unused byte, we remove it and concatenate the allocation. Load/store offsets are changed following this. [Related PR]

2.) For each byte we re-order it in the allocation based on access patterns, we use a greedy heuristic for now but it is open to further research. [Related PR]

Since we don’t look at types but directly at the access per byte, optimizing layout becomes more generalized and fine grained. Structure peeling might be interesting in this case to think about.

pinskia August 7, 2024, 3:57pm 5

wmatt August 7, 2024, 10:13pm 6

Thanks for interesting post!

A few questions:

How do you deal with the pointer from the allocation escaping? Is your analysis inter-procedural? If the pointer escapes, I am not certain that the conditions you listed are enough to guarantee safety, or am I missing something?

What do you mean exactly by type escape analysis? I think most passes of this nature use some form of alias analysis, or a combination of alias-analysis and escape analysis. But I am not sure what type escape analysis is?

What’s the reason for not using alias analysis? If you are going through all the uses of an allocation, do you already assume that the allocation has no aliasing? A GEP that you are changing could operate on a pointer that is aliased to a different allocation as well?

Do you create a new type for the new layout or update the DebugInfo? If you only change the GEPS, I think this can mess with the Debug info in some cases, which might be an issue.

As someone mentioned as well, the compiler is strictly speaking not allowed to change the layout of a struct AFAIK, at least in C and C++ due to things like binary serialization of structs. Not to say that this shouldn’t be experimented with and can’t lead to performance gains, just as an FYI. Some ABI questions/issues have to be resolved for this.

That’s a good point, the implementation that I have uses AAPointerInfo to check if a pointer escapes or doesn’t. If a pointer escapes, the pass gives up on modifying that allocation.
In the test posted , the pointer actually escapes and the allocation should not be modified.

AAAllocationInfo does the optimization per allocation and tracks each byte in the allocation. It also doesn’t modify an allocation if it escapes. This is better than changing the global type since with a global change in type, you have to adhere to that layout for all instantiations of that type.
With a per instantiation approach we can be much more fine-grained.

If you optimize at the byte level and not the type, that’s mitigated I think.

I think @wmatt is referring to the fact that the layout of a struct is part of the ABI so changing it might break code across compilation units. But I think we can limited this to internal linkage code?

ok, limiting to internal linkage make sense.

This pass is implemented to be interprocedural and should be enabled at LTO for more visibility (thus limiting what would otherwise be escapes). This can also be limited to internal linkage code as pointed out before, taking care of the ABI considerations.

I am not sure if this would be a good thing when we want to move around what is essentially a unit of a type inside an allocation. I still need to understand more about your approach though, thanks for pointing me there.

The approaches do an escape analysis and then reject entire types based on the escaping variables.

Rather, we would be able to track the places where the address is taken and decide whether or not to proceed with the transformation based on that. Most of this still needs to be implemented but it has to addressed for sure.

Can you please give me an example for such a case?

This is a great point. The debug info for the allocation will be messed up after the transformation but for the loads/stores, it should be easy to repair.

The global should have been internal, I’ll fix that.

Internal linkage is required for globals. For locals and dynamic values there is no linkage. As long as you can “see” all the accesses, you can modify the access patterns. For globals it’s the same, but anything other than internal/private linkage prevents us from “seeing” all accesses in the first place.

The pass, as in the one in the first post, is not interprocedural as far as I can tell. @vidsinghal’s work is, since it’s based on AAPointerInfo.

In C/C++ or LLVM-IR, allocations do not have (meaningful) types. I think I had this discussion when there was an RFC a while ago.

Use AAPointerInfo. Your pass doesn’t deal with many things, e.g., you don’t distinguish if the store uses the pointer as the pointer operand or the value operand. AAPointerInfo deals with the interprocedural part, with various uses, etc.

I will look into using this. Is there any information avaiable on AAPointerInfo other than the code that you can point me to?

Not much. Look at what @vidsinghal did. AAPointerInfo basically tracks a pointer and lists all accesses to it (interprocedurally, based on byte-wise bins). The debug info of the Attributor helps understand what it does.

+1.

The research papers employ a “type-based approach”: if a struct type is a candidate, then all instances of that type will be transformed. This works for the MCF app from the SPEC CPU benchmark suite, but the main criticism is that this is too coarse grain, the wrong abstraction, and not general. Instead, objects or allocations should be considered. This has been the feedback to earlier attempts to implement this in both the LLVM and GCC community. Also this approach relies on type information that is communicated from the front-end to the middle-end, and as you mentioned LLVM IR does not provide the right information to recover type information.

One way to think about the allocation based approach is to think about it as SROA on steroids but instead of working on stack objects it should also work on heap allocated objects. An allocation defines an allocation space, and we need to create bins in this space that correspond to the data/struct accesses. I think that if you try to recover this information from IR by looking at GEPs, loads/stores, etc., then that is the same problem as trying to recover type information on LLVM IR, which we know is not reliable possible. AAPointerInfo might be an alternative but has many open questions too regarding identification of the struct/field data accesses, as it e.g. doesn’t deal with loops. Therefore, my suggestion is that some hybrid approach would be best:

allocations should be considered as candidates,
but instead of recovering type information it should be communicated from clang to llvm with meta-data somehow to identify the types and fields of the allocation.

My concern by just looking at GEPs and loads/store is that we might be able to handle a few small cases only, but that it might not be very reliable and general. Are you looking at MCF, or other applications?

Once a candidate has been determined, the escape analysis question needs to be answered. I believe some of the “practical checks” that the type based research papers mention still need to employed, like pointer casts, and then there is the sizeof operator, and some clib functions are “safe” escapes, but my suspicion was that LTO’s internalisation (for globals) and LLVM’s CaptureTracking would be sufficient to determine whether a pointer escapes.

In summary, there are a lot big open questions here, the allocation based approach is unexplored territory (not described in literature) and I think we need different designs for:

recovering type information,
legality of the transformation (escape analysis in LLVM),
and as mentioned I think we need to have some good and bigger motivating examples to test these ideas.

AAallocation info does this per byte instead of per “type”. In the IR the allocation is just an array of bytes. This is a more general approach.

You can use AAallocationinfo that I have implemented.
It does not support loops yet which is what you really need but I was thinking of implementing it. Once AAPointerInfo can analyze loops, you can “split” the allocation per byte rather than per “type”.

I will be extending AAPointerInfo to analyze loops in the future.

wmatt August 8, 2024, 5:28pm 20

Thanks for elaborating on this, this is good to hear that this can be solved!

On another note, I still think it’s worth discussing how we can keep valid debug info around when changing the layout and access patterns. This might be a bit more challenging for peeling optimizations and I am not sure how to fix this, but in a simpler case when changing the layout of the struct, the debug info should reflect this.