[libc][GSoC 2024] Performance and testing in the GPU libc
February 17, 2024, 9:28pm 1
Description: The GPU port of the LLVM C library is an experimental target to provide standard system utilities to GPU programs. As this is a new target, it would be worthwhile to extend testing and get specific performance numbers for various C library functions when called from the GPU. This would include microbenchmarking individual functions, comparing them to existing alternatives where applicable, and porting the LLVM test suite to run on the GPU.
Expected Results:
- Writing benchmarking infrastructure in the LLVM C library. Measurements would include resource usage, wall time, and clock cycles.
- Performance results for various functions to be published online.
- Running the LLVM test suite directly on the GPU where applicable.
- If time permits, checking alternate methods for I/O via remote procedure calls.
Project Size: Small/Medium
Requirements: Moderate to advanced C++ knowledge, knowledge of GPU programming and architecture, profiling experience, GPU access
Difficulty: Easy/Medium
Confirmed Mentors: Joseph Huber, Johannes Doerfert
boo15869 February 19, 2024, 8:33pm 2
Hi, I had a couple of quick questions:
- IIUC, the LLVM test suite is expected to run (when applicable) on the GPU without modification. Would there be any case where a test only uses features supported by the GPU port but cannot be run successfully on GPUs?
- The post on the LLVM open projects page mentions implementation of malloc. Is a new implementation within the scope of this project? Or is this project intended to focus on mostly benchmarking/profiling?
Thanks!
jhuber6 February 19, 2024, 9:09pm 3
Sure, let me clarify some things.
The general idea would be to run the tests unmodified if possible. However, there are plenty of things we cannot currently run on the GPU, for example thread_local variables, varargs (which will be supported soon), or recursive global initializers. The hope is that we could simply detect a sufficiently large subset of the tests that fits the criteria.
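For a concrete picture of what such a filter would need to skip, here is a minimal illustration (mine, not taken from the test suite) of the first two constructs mentioned above; a test using either of them would currently have to be excluded on the GPU targets:

#include <stdarg.h>

thread_local int counter = 0; // thread-local storage is not yet supported on the GPU

int sum(int n, ...) { // varargs support is still in progress
  va_list ap;
  va_start(ap, n);
  int total = 0;
  for (int i = 0; i < n; ++i)
    total += va_arg(ap, int);
  va_end(ap);
  return total;
}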
Sorry for the confusion, it was worded poorly. The goal of the project is to create infrastructure that allows us to easily microbenchmark GPU code, maybe similar to Google Benchmark. Implementing a good general-purpose malloc that runs on GPU architectures is a comparatively much more difficult project. I’ve begun doing some basic work toward that end but don’t have anything finished yet. It was mentioned as a use case for said benchmarking code.
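To make the Google Benchmark comparison a bit more concrete, here is a purely hypothetical sketch of what the user-facing side of such infrastructure might look like; the State struct and BENCHMARK macro are invented names for illustration, nothing like this exists in-tree yet:

#include <stdint.h>
#include <string.h>

// Invented for illustration: the harness would choose the iteration count,
// record wall time, clock cycles, and resource usage around the body, and
// aggregate results across GPU threads.
struct State {
  uint64_t iterations;
};

void BM_Memcpy(State &state) {
  char src[64] = {0};
  char dst[64];
  for (uint64_t i = 0; i < state.iterations; ++i)
    memcpy(dst, src, sizeof(src)); // function under test; a real harness would
                                   // keep this from being optimized out
}
// BENCHMARK(BM_Memcpy); // hypothetical registration macro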
shiltian February 19, 2024, 9:34pm 4
- IIUC, the LLVM test suite is expected to run (when applicable) on the GPU without modification. Would there be any case where a test only uses features supported by the GPU port but cannot be run successfully on GPUs?
There is existing work that “successfully” got the LLVM test suite running on GPUs. The infrastructure and utilities introduced in the LLVM GPU libc project, combined with techniques from that existing work as well as another GSoC project, can get most of the LLVM test suite running on a GPU without any code changes.
boo15869 February 20, 2024, 2:06am 5
The paper was a really interesting read! Though I’m still a bit confused about the need for new testing infrastructure - why doesn’t running the test-suite with clang-gpu (like in the paper) work? From the bit I’ve read on benchmarking LLVM, it seems like most benchmarks are run with perf, including the libc math benchmarks linked in the root post.
jhuber6 February 20, 2024, 9:39pm 6
I believe it worked in the paper, but this was mostly a proof of concept. Basically, I would like a working configuration that runs the test suite on the GPU to be officially supported. I don’t think anything in the standard LLVM benchmarking will be relevant to the GPU case; however, we may make use of things like rocprof or nsys to measure certain events.
boo15869 February 21, 2024, 2:35am 7
Got it, I’ve just got a couple more clarifying questions.
- For GPU libc performance testing, were you thinking of something similar to the existing code used to test math function diffs, but extended to cover the rest of the libc functions on both CPU and GPU? Or did you have something else in mind?
- For the entire LLVM test-suite, would an officially supported configuration mean some combination of CMake options used to configure before running something like llvm-lit -v?
I’m still a little new, so please let me know if anything doesn’t sound right!
jhuber6 February 21, 2024, 2:01pm 8
Yes, we have a libc/benchmarks directory. The code written for the GPU will likely need to be highly specialized since it will need to use different tools.
I’ve done some cursory looks at this in the past. The LLVM test suite supports cross compiling and using “emulators”. This means we would compile all the code to an executable and then the infrastructure would use one of the GPU loaders to execute it. Compilation and running would look something like this, but done by llvm-lit:
clang test.c --target=amdgcn-amd-amdhsa -mcpu=native -flto
amdhsa-loader a.out
shiltian February 21, 2024, 2:34pm 9
Compilation is done by CMake, so CMake has to set up the target and architecture accordingly (or just use a clang-gpu wrapper around the actual command), and running is done via llvm-lit.
boo15869 February 23, 2024, 2:35am 10
I think I’m interested in this project, and although I have some understanding of GPU architecture/programming, I can’t say I have much experience with profiling in general. You mentioned rocprof and nsys earlier; are there any other tools or resources that you would recommend I take a look at?
From what I understand so far, the benchmarking portion of this project will mostly consist of working on C++ code in libc/benchmarks, adding specific tools that profile GPU performance and running benchmarks on individual libc functions, correct? Then the test-suite portion would mainly involve creating a way to get llvm-lit to use the new infrastructure to compile/run the test-suite?
jhuber6 February 23, 2024, 4:22am 11
Glad you’re interested. I’m not the foremost expert on profiling either, but I would recommend scanning through the user documentation from somewhere like User Guide — nsight-systems 2024.1 documentation or ROCProfilerV1 User Manual — rocprofiler 2.0.0 Documentation.
So, we will likely want some utilities that go through standard GPU profiling tools like the above, and ones that just do microbenchmarking directly on the GPU. I have an old, abandoned patch that attempted to do the latter through expert usage of inline assembly hacks: https://reviews.llvm.org/D158320. The idea there would be to get cycle-accurate counts, similar to what something like llvm-mca would spit out for the architecture. The function stimulus would likely want to vary between different levels of divergence. Traditional profiling tools will likely be better at picking up hardware events or resource usage.
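As a rough illustration of the direct-measurement side (not what the patch above does, and without the overhead compensation it attempts), a harness could bracket a call with clang’s __builtin_readcyclecounter() builtin, which the GPU backends should lower to a hardware cycle/timestamp counter; treat this as a sketch rather than a recipe:

#include <stdint.h>

// Illustrative only: a real harness would run many iterations, subtract the
// timer overhead, and use inline assembly or volatile accesses to keep the
// measured call from being optimized out or reordered.
template <typename F> uint64_t cycles_for_one_call(F &&f) {
  uint64_t start = __builtin_readcyclecounter();
  f();
  return __builtin_readcyclecounter() - start;
}

Varying the divergence could then be as simple as only letting some lanes of a wavefront/warp invoke the function under test.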
Do you have access to a GPU you can use for development? Unfortunately I cannot get AMD to provide server access to external students, so you will likely be on your own on that front. I believe the LLVM Foundation has some basic access to computing resources, and I can test things on your behalf in the worst case.
boo15869 February 24, 2024, 3:03am 12
I don’t have a GPU right now, but I can ask the professors at my school to see if any of the labs have something I could use. Would it make more sense to ask for an Nvidia or AMD GPU?
jhuber6 February 24, 2024, 3:44am 13
Both would be ideal if you could manage it, as the GPU libc targets AMD and NVPTX. I work at AMD, so of course I’d encourage you to try an AMD GPU, but just go with whatever you can get access to. It would be quite difficult to do this work without a reliable, modern-ish GPU, so hopefully you can find one.
boo15869 February 24, 2024, 3:19pm 14
Got it, I can make sure to ask for both.
While looking through some more websites, I read the OpenMP parallelism paper referenced in the first paper @shiltian linked earlier. From what I could tell, it looks like the paper tries to take an unmodified user program and effectively wrap it in an OpenMP target offload region for the GPU to run, instead of only offloading specific parts like you usually do with OpenMP. IIUC, there was a noticeable slowdown because of factors like RPC and the sequential initialization code running slower on GPU threads compared to CPU threads. If I read it correctly, what is the benefit of trying to run entire programs on the GPU over offloading? Is it to improve the developer experience at the expense of some performance?
jhuber6 February 24, 2024, 3:49pm 15
The main utility of running programs directly on the GPU is testing, both for applications and for the GPU backends. This is how the GPU libc runs its tests; here’s an example of the NVPTX builder running the tests on an NVIDIA card, printing and everything: Buildbot. Because this is run entirely on the GPU, it shares the same source with the CPU tests. Also, CPU applications don’t look a lot like GPU applications, and running these kinds of workloads has exposed a good number of backend bugs.
As for the purpose of the GPU C library, I think it’s mostly to improve the developer experience, allowing people to use standard C functions in CUDA, HIP, OpenMP, whatever. It will also help with building the C++ library at some point. Here’s a talk I gave at last year’s LLVM developers’ conference if you’re more interested in that part: https://www.youtube.com/watch?v=_LLGc48GYHc.
boo15869 February 24, 2024, 5:56pm 16
Ah, I totally didn’t think of that! It makes much more sense now.
boo15869 February 26, 2024, 5:41pm 17
My school has offered access to NVIDIA GPUs via a couple of HPC clusters that I can use by sending slurm jobs to them. I’m checking to see if I can build LLVM and find a good workflow, but would there potentially be any problems with this approach (i.e. maybe influencing benchmark scores)?
BTW, it looks like libc runs into a CMake error when building on redhat, which I think one of the clusters uses. Is there any specific reason for that?
jhuber6 February 26, 2024, 8:52pm 18
Not really; you should be able to launch an interactive job with slurm to make your life easier. Obviously timing will vary highly depending on the card, but as long as we have a way to generate consistent results it should be simple enough to run it on whatever machines we care about. Once we have something working I could test it out on more GPUs.
Unsure, could you share the error message? I haven’t built the CPU version of the C library in a while.
boo15869 February 27, 2024, 1:46am 19
Yeah, the message is Unsupported libc target operating system redhat, which seems to be getting triggered in LLVMLibCArchitectures.cmake, and then in the main libc CMake file after I tried to allow redhat as an option in LLVMLibCArchitectures.cmake. Possibly LIBC_TARGET_OS is getting set to redhat instead of linux? I think this may be a local issue, and I’ll keep looking.
jhuber6 February 27, 2024, 2:34am 20
There are already some hacks around the autodetection there. It probably expects the triple to look like x86_64-unknown-linux-gnu and it’s trying to parse out linux. Unsure what triple RedHat is using here.