Allocating different live ranges to same physicals (original) (raw)

April 23, 2024, 11:17pm 1

Hello @MatzeB , @qcolombet, @arsenm, @jayfoad

I’m currently tackling a problem that involves assigning the same set of physical registers to all instructions that belong to a specific category or partition. This categorization occurs at compile time rather than being predetermined. The main criterion for these instructions is that they share the same opcode. However, it’s not necessary for all instructions with the same opcode to be grouped into this category. This is a functional requirement, which stems from the target architecture and must be adhered to during or after the register allocation phase.

Example, say we have following code before regalloc
bb.1
OP1 %1 = …
OP2 %2 = …
OP3 %4 = %1, %2
OP4 …= %4

…
bb.2
OP5 %5 = …
OP6 %6 = …
OP3 %7 = %5, %6
OP7 …= %7…

Now, suppose we have physical regs r1-r4, the requirement is to generate following after regalloc
bb.1:
OP1 %r1 = …
OP2 %r2 = …
OP3 %r3 = %r1, %r2
OP4 …= %r3

…
bb.2:
OP5 %r1 = …
OP6 %r2 = …
OP3 %r3 = %r1, %r2
OP7 …= %r3

Different methods that I have considered so far to achieve this:

Create COPY instructions to assign the same virtual registers across categorized instructions, ensuring they map to the same physical registers
This would additionally require that the virtual doesn’t get coalesced or split or spilled.

However, this approach results in multiple disconnected components within the live virtual register range and verifier complains about that. It appears that typically, independent live ranges aren’t expected to share the same virtual register. Is this a perf specific requirement, or if we overlook this aspect (i.e. have a a target-specific check to override this), would it lead to complications during the register allocation process?
Also, it seems that not all reg alloc like fast regalloc ends up assigning same physicals to same virtuals.

Another method I tried was using tied operands. Unfortunately, this only works within the same instruction for tying a destination to a source operand, which doesn’t meet our needs for different instructions. This limitation, inherent from its original design for two-address instructions, raises the question of whether it should be extended to accommodate these new requirements.
Lastly, introduce pre-register allocation hints followed by a post-register allocation pass to address any mismatches. However, I’m unsure about the reliability of this approach given existing register allocation processes.

Have these issues been encountered before? I’m seeking any insights or recommendations on a suitable strategy to address this problem. Thanks!

Thanks
Divya

nvjle April 25, 2024, 1:58am 2

Depending on exactly when/where you are doing this, you’re likely violating SSA. Trying to create independent live ranges with the same virtual register would imply multiple definitions, so this scheme wouldn’t make much sense (if I understood what you wrote above). I suppose this could be like the TwoAddressInstructionPass and be run at just the right time (coming out of SSA, but before regalloc), but the rest of the process won’t likely get the expected result (see below).

Right-- there aren’t many assumptions you can make about which physicals are ultimately assigned to virtual registers. And even if you could, those assumptions would be entirely different between the different allocators (not to mention fragile).

IIRC, the existing hint mechanisms (SimpleHint/RegAllocHints) are all “soft” in that they may get satisfied (best effort). A new API could be introduced to insert “hard” constraints/requirements that must be satisfied. But that would require a bit of target-independent work in the allocator(s), and probably a few other areas like coalescing. Though I think this would be a useful feature beyond just this use case.

A post-regalloc register-renaming-like pass would be another way to address this. But there would be some complexities. For example, it is possible that all of the registers used in one instance of a group are not available where the second occurs. That would imply spilling or splitting those live ranges around the group, or maybe trying to scavenge the same set of available registers at all sites. Or renaming the conflicting ranges themselves.

Some generic questions (I couldn’t infer from your description):

Are these canned/fixed/predetermined sequences of instructions as opposed to arbitrary blocks of code?
Are these sequences surrounded by delimiters/brackets of some sort (or otherwise easy to identify as a group)?
Are the sequences relatively “short”, say 1 to a handful of instructions?
What is the relationship among the groups-- are they always in close proximity (e.g., adjacent) or can they appear arbitrarily in the function with respect to each other?

If I assume most of the above, then maybe the very simplest way to get something working without any target-independent changes (or additional passes) is as follows. Select a small number of physical registers and create singleton register classes for each of those (N classes for N distinct operands). These RCs would be used on the operands in the special instruction definitions to fully constrain the choice of registers to one. They’d also have to be given high allocation priority, given assumptions above.

For example, let us say hypothetical group kind A is a single instruction (extends easily to a few instructions, or a group with begin/end bracket, or a pseudo representing the group) OP3 %rX = %rY, %rZ. Presumably this instruction is general purpose and has a TD description allowing a full set of registers for each operand in an ordinary context. For the special context, we’d introduce a second TD opcode OP3_Fixed (say) for the instruction, but give its operands singleton register classes (RCX, RCY, RCZ) rather than a full general-purpose class. Then, during ISel (or whatever pass will generate these), use OP3_Fixed for the special cases.

This scheme should force the allocator to use exactly the single register for each operand for each instance of OP3_Fixed, so that all instances end up identical. One drawback being that since we’re using register classes in TD, the chosen registers are statically fixed at compiler build time, potentially constraining the overall allocation. But this would not be as bad as, say, just reserving registers outright which would not be available for general allocation at all.

Another approach might be to construct a simple pre-allocator which concerns itself only with these groups, without any particular constraint on the registers chosen (beyond the weird hardware constraints). This could result in a better overall allocation. Take a look at RegAllocBasic.cpp for a simple example of how to use LLVM’s toolkit of register allocation components.The remaining full-blown assignment would be left to Greedy or one of the other full allocators.

Hey!

If your sequence of instructions is pre-defined, you may be better off with using pseudo-instructions with hardcoded register requirements.
E.g.,

<yourHardCodedRegister1> = COPY <inputArgument1InVReg>
...
<yourHardCodedResults> = YOUR_PSEUDO <yourHardCodedRegisters>
<outputArgument1InVReg> = COPY <yourHardCodedResult1> 
...

The hardcoding would happen as part of your TableGen description and you wouldn’t need to worry about regalloc at all.

Now, if you really want to use regalloc for this problem, you’ll need to come up with your own pre allocator because, as you saw, there is no guarantee that even a particular vreg ends up in one physical-reg (because of splitting).

Thanks @nvjle and @qcolombet

I suppose this could be like the TwoAddressInstructionPass and be run at just the right time

Yes, this task should be performed at a much later stage, specifically after the SSA form is no longer needed.

Are these canned/fixed/predetermined sequences of instructions as opposed to arbitrary blocks of code?

No, they are not fixed and are determined at compile time based on some instruction properties.

Regarding other questions, the instruction can be anywhere, there are no delimiters/brackets, and the compiler cannot make any assumptions about their position in the code…

I have decided to implement a pre-allocator that will identify these groups of instructions and allocate physical registers to them. I anticipate some performance differences based on whether instructions from different groups are assigned the same or different registers. For now, I plan to assign the same registers, as their live ranges are guaranteed not to overlap and using this approach will make available more physical registers for any live range spanning multiple such instructions.