[RFC][LTO] Handling math libcalls with LTO
Overview
LLVM has been moving towards supporting all standard C math library calls as intrinsics. These are introduced either by use of builtin functions like __builtin_sin, or through builtin transformations when legal, such as when -fno-math-errno is enabled. These intrinsics are then handled by the backend, which ideally targets the most optimal implementation for the target.
The issue with these math intrinsics is that they are defined to lower to the corresponding libcall if the backend does not have a specialized lowering in place. Unlike other builtins, which can be selectively added with features like __has_builtin, the math builtins are always considered available. This is done with the assumption that there is a libc.so library that will resolve those libcalls after they have been lowered.
Problem
This causes a problem when doing LTO builds. For example, let’s say we have the symbol sin in a static library and try to link it into our module which had its call to sin transformed into llvm.sin.f64. Because llvm.sin.f64 does not link against the sin symbol, it is not extracted from the static library. If instead we used --whole-archive to force the library to extract, the sin function would be trivially internalized and then deleted by DCE before it reaches the backend, again leading to an undefined symbol.
The current ‘solution’ to this problem is to forcibly extract every single function that’s considered a valid libcall for the target. We then prevent these functions from being internalized by forcing them to be considered ‘used’ in the module. This ensures that the functions are both extracted and live long enough to be available for the code generation step.
This is obviously very inefficient. For targets like the GPU, the overhead of linking in potentially hundreds of math functions takes several seconds of link time for a trivial program. Furthermore, this prevents inlining because the functions will only be called directly after they are lowered to libcalls. For this reason, the state-of-the-art for the NVPTX and AMDGPU targets is to eagerly rename all math functions by force-including headers in clang. Calls to the builtins will just result in an unhandled intrinsic.
Possible solutions
There are three parts to this problem, I believe:
- We need to extract the appropriate symbol from the static libraries.
- The math function cannot be optimized out once extracted.
- No further builtin transformations should happen once linked.
I believe the simplest solution to this problem is to eagerly lower intrinsic calls to their libcall counterparts during LTO linking. In order for this to work properly, we would need to schedule this pass the moment the bitcode files are read and their symbols registered in ld.lld. We could create a custom pass manager off of the target machine before we begin processing the file.
This would then use the standard symbol resolution in the linker, correctly link the definitions, and make sure they’re still present because they’re actually used. The only remaining risk is that builtin transformations could be applied later (e.g. transforming separate sin and cos calls into sincos) and generate calls to functions that were not extracted. However, I believe we should verify these transformations against the rtlib and libcall interfaces to determine whether they are legal.
If eagerly processing these in the linker is not an option, the alternative would be to process them in the IRLinker phase, but that is less clean as we would need to teach ld.lld that llvm.sin.f64 extracts sin for example.
I’d like to hear from the LTO and linker experts what the best solution to this problem is. The current situation prevents us from heavily simplifying GPU math library handling, as well as from benefiting from libcall recognition (i.e. sin(0.0) → 0).
@jdoerfert @MaskRay @teresajohnson @JonChesterfield @AlexVlx @Artem-B
As you mentioned, this doesn’t address passes that (want to) introduce sin calls (late).
It’s odd to have a libcall sin, recognize it, optimize it, create it, until you don’t have it anymore.
How would this work if you do more than one LTO link?
What’s the real cost of keeping all GPU math functions around?
That was the “plan” the last two times this came up.
jhuber6 March 1, 2025, 1:09pm 3
We already gate libcall transformations behind whether or not they’re supported in the target machine. I think it’s reasonable to call it a bug to introduce new ones once we have a target. Alternatively, we could put nobuiltin on all the call sites, though there might be cases where new calls could be introduced.
The idea is that ld.lld preprocesses these when it first loads a bitcode file. So it should behave normally.
Very high. I had some initial experience with the current solution that extracts all libcalls when I started providing libm.a for the GPU. A trivial link job took three seconds when it would normally take a few milliseconds at most. Additionally, these definitions would stick around even in the final executable. (We could probably skirt around this with --gc-sections, but it’s still not ideal.)
What I’m saying is that with your “linker logic” approaches, the result of “supported libcalls” changes during the compilation. That is not something we have right now. Support for sin would depend on the position in the compilation, basically pre/post libm.a linking.
I mean, I have foo.o with math calls, then link it together with libm.a into foo.a with LTO. Should/will we replace llvm.sin/sin from foo.o or keep them around?
If we keep them around, we require the user to provide libm.a later again, right?
If we replace them, we can’t optimize the math stuff after we link foo.a into something else, right?
I believe there is an inherent tension if there are multiple links and we want to do this at link time.
Doing the replace whenever we can is the conservative and correct solution, but that might be too conservative.
Given that we are talking about special handling for a list of symbols (or symbols with some annotation), we can “easily” ensure this is not the case.
How much math code is in there? It is surprising to me that a simple llvm-link would cost that much on the scale (I imagine) libm is.
jhuber6 March 1, 2025, 10:35pm 5
The point is that all math functions must be extracted and kept alive until the backend runs. That means we’re generating assembly for 100+ math functions that likely aren’t ever used.
The LLVM intrinsics are emitted by clang I believe, if there is an LLVM transformation that introduces calls to them it should follow the TLI / RTlib interface and make that transformation illegal given that information.
I’m hoping to solve this with the least amount of friction, and I think the easiest solution is just to lower these before the linker does symbol resolution. That prevents us needing a lot of custom logic.
Read any of the old proposals and you’ll see we never meant for them to survive past a specific point in the pipeline, basically just before the backend. Hence my earlier statement:
Given that we are talking about special handling for a list of symbols (or symbols with some annotation), we can “easily” ensure this is not the case.
I do not disagree with this statement, but I failed to express my actual concerns. I’ll try again:
You are planning to make llvm.sin and sin available by having TLI return “available” for them. That is needed to get clang to generate llvm.sin and other passes to optimize them. All passes should honor that. Now the catch is that passes running after the replacement would either see “sin is supported” while it is not (anymore), or the TLI response changes at the replacement point. Neither is particularly great. The old proposals all kept sin around until we knew it wouldn’t be needed anymore, i.e. the beginning of the backend. There is no ambiguity there, since that point, unlike the linker invocation, is only visited a single time. Does that explain my issues better?