[libc][GSoC 2025] Profiling and testing of the LLVM libc GPU math

February 10, 2025, 9:53pm 1

Description: The LLVM C library provides implementations of math functions. We want to profile these against existing implementations, such as CUDA’s libdevice, ROCm’s device libs, and OpenCL’s libclc. Last year we worked on some interfaces to support these tests; now they need to be refined and filled out for the interesting functions.

Additionally, we want to verify the accuracy of these functions when run on the GPU via brute force testing. The goal is to verify that the implementations are correct and at least conformant to the error ranges in the OpenCL standard. This will require a set of unit tests written in the offload/ project, ideally using the new API that @callumfare is working on.

Expected Results:

Project Size: Small

Requirement: Basic C & C++ skills + access to a GPU, some math knowledge

Difficulty: Easy / Medium

Confirm Mentor: Joseph Huber, Tue Ly

Hi, I’m interested in this project. I’m a grad student experienced in developing high-performance operators and contributing to math libraries for heterogeneous computing platforms, but I currently only have access to NVIDIA’s GPU. Is that sufficient for this project?

jhuber6 February 13, 2025, 1:49pm 3

That’s fine, they should be roughly equivalent for this case and any changes made will be picked up by one of the AMDGPU build bots we have.

This project looks really exciting, and I’d love to contribute.
I have experience with C, C++, and LLVM, and I’ve already worked on an LLVM issue (currently tackling another one). I also have a strong background in performance optimization, data structures, and algorithmic problem-solving.

I’m particularly interested in GPU programming and would love to learn math function optimization and accuracy testing. Looking forward to learning more.

I’m very excited about this project because I’ve already done some deep work on high-performance computing projects (optimizing Linpack and deploying it on the RISC-V Vector platform). I’m also very familiar with C and Python, and have done some benchmark design (familiar with pytest, Perfetto, and other tools). I believe these experiences will help me participate in this project more effectively!

This project is very interesting!

Over the last few weeks, I’ve been working on a proof of concept, MPFR for Mojo, that shows how we can use the MPFR library as a “gold standard” to test the correctness of mathematical functions implemented in Mojo, the new programming language Chris Lattner and his team at Modular are creating.

The same machinery I’m implementing in this PoC can be used for both unit and exhaustive testing. I also implemented a routine to convert an MPFR value to lower-precision floating-point types (e.g., bfloat16) to avoid double-rounding errors (the naive path MPFR value → float32 → bfloat16 can be problematic).
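The double-rounding hazard mentioned above can be shown concretely. The following is a minimal Python sketch (my own illustration, not code from the PoC): it rounds a value first to float32 and then to bfloat16, and compares that against rounding the value directly to bfloat16. The specific value and the helper names are chosen for illustration only.

```python
import struct

def round_to_float32(x: float) -> float:
    """Round a Python float (binary64) to binary32, round-to-nearest-even."""
    return struct.unpack("f", struct.pack("f", x))[0]

def float32_to_bfloat16(x: float) -> float:
    """Round a binary32 value to bfloat16 (RNE), returned as a Python float."""
    (bits,) = struct.unpack("I", struct.pack("f", x))
    # bfloat16 keeps the top 16 bits of a float32; round away the low
    # 16 bits to nearest, ties to even.
    lower = bits & 0xFFFF
    bits >>= 16
    if lower > 0x8000 or (lower == 0x8000 and bits & 1):
        bits += 1
    return struct.unpack("f", struct.pack("I", bits << 16))[0]

# Z = 1 + 2^-8 + 2^-25 is exactly representable in binary64.
z = 1.0 + 2.0**-8 + 2.0**-25

# Two-step path: binary64 -> binary32 -> bfloat16.
# float32 rounds Z down to 1 + 2^-8 (the discarded part is below the
# midpoint), and 1 + 2^-8 is then an exact tie in bfloat16, which RNE
# resolves down to 1.0.
two_step = float32_to_bfloat16(round_to_float32(z))

# Direct path: Z lies strictly above the midpoint 1 + 2^-8 between the
# bfloat16 neighbours 1.0 and 1 + 2^-7, so correct rounding gives 1 + 2^-7.
direct = 1.0 + 2.0**-7

print(two_step, direct)  # 1.0 vs. 1.0078125: the two paths disagree
```

The two-step path yields 1.0 while direct correct rounding yields 1 + 2^(-7), which is exactly the failure mode a direct MPFR-to-bfloat16 conversion routine avoids.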

I haven’t included GPU support in this PoC’s roadmap yet mainly because GPU programming in Mojo is very new and is evolving extremely fast. But I’d love to work in this domain - not only testing mathematical functions on GPUs but also implementing them there. I’m curious if type promotion and lookup tables are also good strategies for GPU-specific implementations.

I have some questions:

jhuber6 March 2, 2025, 4:17am 7

Very cool. The LLVM libm which the GPU compilation uses is tested against MPFR as well and is expected to be correctly rounded. The GPU implementations prefer some loss of precision in exchange for performance (though this can be configured), but it lets us know the proper ULP if we have a reference. MPFR is very slow, however, so brute-force testing takes a long time. For that reason, I was actually assuming we could just use the LLVM libm as the gold standard, since it’s tested against MPFR and would perform orders of magnitude faster.

Basic C/C++ knowledge would definitely suffice; the other stuff you could probably get up to speed on in the months prior to the formal start if needed.

Tough to say, I listed it as small since it’s basically writing a test and benchmarking suite for some functions. I’ll just guesstimate and say eight hours.

I think it’d be appropriate to use the LLVM libm as the gold standard if we were interested in checking if the GPU implementations are correctly rounded. But, as you said, the GPU implementations prefer some loss of precision in exchange for performance, so we’d like to compute the ULP error and compare it against a reference. The question is: would our estimate of the true ULP error be (very) poor if we use the LLVM libm to compute the expected output (that would be rounded to the target or a higher-precision format) and then the error (subject to, for example, catastrophic cancellation)?

MPFR is indeed slow. But in my opinion, the slowness problem may have been made worse by the way MPFR is used in MPFRWrapper and related testing machinery. In particular, I think they fail to observe the following hint to get the best efficiency out of MPFR: “you should avoid allocating and clearing variables. Reuse variables whenever possible, allocate or clear outside of loops, pass temporary variables to subroutines instead of allocating them inside the subroutines” [link]. If I’m not mistaken, each iteration of an exhaustive test loop allocates and clears an MPFR variable.

jhuber6 March 2, 2025, 2:08pm 9

The LLVM libm implementations are expected to be correctly rounded save for a few exceptions. I.e. their ULP against MPFR is zero.

Most likely, it could probably be sped up with some caching but I haven’t really looked at that code much. Maybe @lntue would be interested.

Consider the following counterexample. It will show us two things: (i) being correctly rounded does not imply a ULP error of zero; and (ii) substituting the expected real value Z with its floating-point representation x in a given format can lead to a poor estimate of the ULP error.

Without loss of generality, assume that the working format is float (binary32). Let x = 1 and y = 1 + 2^(-23) be consecutive floating-point values in that format:

Consider the real number Z = 1 + 2^(-24), which is exactly the midpoint between x and y. By definition, Z is not representable in the working format. Under round-to-nearest-ties-to-even mode, x is the correctly rounded representation of Z. Still, the difference between Z and x is 0.5 ulp. Here I’m adopting the Goldberg definition of ulp(Z) extended to all reals Z.

Furthermore, note that the difference between Z and y is also 0.5 ulp. However, if we replace Z with x in that last difference, the result increases to 1.0 ulp.
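The counterexample can be checked numerically. Below is a small Python sketch of my own (not from the thread) that reproduces the three distances; it uses binary64 arithmetic, in which all the quantities involved are exact, and rounds through binary32 to confirm which way Z rounds.

```python
import struct

def round_to_float32(v: float) -> float:
    """Round a real value (held exactly in binary64) to binary32, RNE."""
    return struct.unpack("f", struct.pack("f", v))[0]

ulp = 2.0**-23       # ulp spacing for binary32 values in [1, 2)
x = 1.0              # the correctly rounded representation of Z
y = 1.0 + 2.0**-23   # the next representable float above x
z = 1.0 + 2.0**-24   # the exact midpoint between x and y

# RNE resolves the tie toward x, whose last mantissa bit is even.
assert round_to_float32(z) == x

print(abs(z - x) / ulp)  # 0.5: correctly rounded, yet the ulp error is not 0
print(abs(z - y) / ulp)  # 0.5: y is equally far from Z
print(abs(x - y) / ulp)  # 1.0: substituting x for Z doubles the estimate
```

All three values are computed exactly here, since 2^(-23) and 2^(-24) are representable in binary64 and no rounding occurs in the subtractions.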

lntue March 10, 2025, 2:55pm 11

Indeed, to show something is correctly rounded for RNE mode, we need either < 0.5 ulp, or = 0.5 ulp plus some extra check in the exact-midpoint case. On the other hand, for GPU targets with a non-correctly-rounded requirement, I’m thinking that testing < 2 ulp compared to the correctly rounded output of the same precision from the CPU implementation should be enough. And it will be way faster than MPFR.
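A criterion like this is often implemented by counting representable values between the two results. The sketch below is my own illustration of that idea (the helper names such as `check_within_ulps` are hypothetical, not from LLVM libc), simplified to finite, same-sign binary32 values:

```python
import struct

def f32_bits(v: float) -> int:
    """Raw binary32 bit pattern of v (v is rounded to binary32 first)."""
    return struct.unpack("I", struct.pack("f", v))[0]

def ulp_distance(a: float, b: float) -> int:
    """Number of representable binary32 steps between a and b.
    Simplified sketch: assumes both values are finite and same-sign,
    where consecutive floats have consecutive bit patterns."""
    return abs(f32_bits(a) - f32_bits(b))

def check_within_ulps(result: float, reference: float, bound: int = 2) -> bool:
    """Accept `result` if it is within `bound` ulps of the correctly
    rounded same-precision `reference` (the criterion proposed above)."""
    return ulp_distance(result, reference) < bound

# Toy usage: a reference value and a result exactly one ulp away.
ref = 1.0
off_by_one = struct.unpack("f", struct.pack("I", f32_bits(ref) + 1))[0]
print(check_within_ulps(off_by_one, ref))  # True: 1 ulp < 2
```

One caveat follows from the counterexample earlier in the thread: because the reference is itself rounded rather than the exact real value, a bit-distance bound against it can misstate the true error by up to 0.5 ulp, so the bound is a practical acceptance test rather than an exact error measurement.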

On GPUs, is round to nearest, ties to even mode the only rounding mode that matters? Or should we test math functions under other rounding modes?

jhuber6 March 12, 2025, 12:43pm 13

I would not worry about the other rounding modes right now. Technically AMDGPU can support a floating point environment since they have configuration registers to change the rounding mode. NVPTX ties the rounding mode to the instruction so we would need completely different functions, which honestly isn’t a bad way to handle it…

emangia March 15, 2025, 12:52pm 14

Hello, this is one of the 2 projects I’d be interested in contributing to. Here’s a bit about me…

I’m a first year M.Sc. student in Engineering Mathematics. I’ve had experience mainly with C, C++ and Python as programming languages and tools such as CUDA, OpenMP, MPI and a bit of OpenCL.

Recently, I’ve been busy with more maths focused courses, but I’ve had a Parallel Computing course and I’ll start a Compiler Construction course next week. I should mention it’d be the first time for me contributing to an open source project.

Currently, I’m also doing an internship in the machine learning field, but I’ll be quite free by June, ideally, with fresh knowledge about compilers. In the meantime, I’ll try to get up to speed about this project.

If you think there could be a match, feel free to write me a private message.

@jhuber6 @lntue

I’d like to clarify a few additional points before writing my proposal.

Regarding performance tests

clang: error: no library 'libomptarget-nvptx-sm_80.bc' found in the default clang lib directory or in LIBRARY_PATH

Regarding the exhaustive accuracy tests

ninja: error: unknown target 'libc.test.src.math.logf_test.__unit__', did you mean 'libc.test.src.math.logbf_test.__unit__'?

jhuber6 March 27, 2025, 1:35pm 16

There’s code that lives at llvm-project/libc/benchmarks/gpu at main · llvm/llvm-project · GitHub which tries to get very accurate cycle counts for functions. It probably needs to be updated a bit.

This is just a hand-wavy way to say that we don’t need to be too concerned with the performance of every single trivial function like fmin or floor. However, we probably do want some testing.

Of course, getting this to build is the first step. I’m unsure why you’re getting that message, as I deleted the arch specific device runtime in clang-20. Are you building off of the main branch? All development should be done at the top of the tree. Otherwise, it’s possible you’re mixing a system-level clang and the one you installed.

So, the offloading tests are a little unique because they have some weird dependencies. We will probably not be building them inside of libc but offload/. Exhaustive tests are only run very occasionally (due to the cost), so we don’t want it to be in the same directory. Ideally, this will also allow us to test the API @callumfare is trying to get set up in https://github.com/llvm/llvm-project/pull/122106 while we’re at it. Though you might find it a little difficult at first to work with an offloading API directly.

Looking forward to receiving your proposal, thanks.

Hello @jhuber6

My name is Jerry, and I’m an undergraduate computer engineering student at The University of British Columbia in Canada. I’m extremely excited about the opportunity to contribute to this project, as I have been passionate about HPC and heterogeneous programming, and contributing to LLVM has been a long-time goal of mine.

A bit about my background:

I’m currently diving into CUDA’s libdevice and ROCm’s device libraries. In addition to these, are there other specific libraries or interfaces (for example, aspects of OpenCL’s libclc) that you’d like me to focus on?
Regarding the new API that @callumfare is developing for the offload/ project, could you share any insights on its current stability or any particular areas that need more attention? Beyond achieving comparable performance to previous results, are there particular performance metrics or profiling strategies you’d recommend that align well with the GPU offload testing?
What would be your recommendation for first steps to approach this project?

I’m looking forward to learning more and contributing effectively to this project. I appreciate any guidance you can offer regarding these points.

jhuber6 April 6, 2025, 2:20pm 18

Just a reminder that the deadline for submissions is the 8th for anyone interested in this project.