[RFC] Proposal for Offload Execution Test Suite
January 6, 2025, 10:32pm 1
I would like to contribute a pet-project of mine to form the basis of a new test suite and infrastructure for testing the execution of programs on hardware accelerators (GPUs, NPUs, FPGAs, whatever…).
The current implementation is on my personal GitHub as offload-test-suite, and it uses several self-hosted GitHub Actions runners as its test farm. I would like to contribute all the code in that repo to form a new LLVM repository, and I’ll continue to operate the self-hosted runners to keep the tests running.
The code in the repository is under the LLVM license except for some third-party code under different licenses. I’ve separated the third-party code from the rest of the LLVM-licensed code for clarity.
Because this is a testing framework I do not expect we would ever distribute the binaries as part of any LLVM toolchain release, so the differently licensed code should have no impact on LLVM’s licensing. I will momentarily send an email to the LLVM Foundation Board for their review.
There is also a second repository that stores rendered images used for image comparison tests. I broke the images out into their own optional repository since I expect it will become a large volume of data which not all tests will require.
What does the existing framework do?
The existing project contains infrastructure built off LLVM to provide a lit-based testing workflow for executing GPU programs through three GPU programming APIs (DirectX 12, Vulkan, and Metal).
The current abstraction over the GPU API is designed to support other types of accelerator APIs, and (I believe) can be refactored/extended to support OpenCL, HSA, CUDA or any other similar programming model.
The test suite contains just a handful of test cases, which are written to run against both the current HLSL compiler and Clang. I have GitHub Actions runners that execute the tests on a matrix of test configurations under Windows and macOS.
This framework is lightly inspired by Google’s Amber, but with a huge influence of LLVM’s testing methodology and tools. One key difference between this framework and Amber is that I chose to use LLVM’s LIT with split-file and a declarative YAML representation instead of defining any unique grammar or parsed formats.
The project contains the source code for three tools: api-query, imgdiff, and offloader:
- api-query supports querying an execution API for feature support and is used to drive configuring LIT available features.
- imgdiff is an image comparison tool which compares two PNG images.
- offloader is the primary workhorse, which reads a pipeline description, initializes memory, and executes programs on the accelerator using the specified API.
What’s missing?
Currently the framework is extremely limited in what it can handle. It only works for simple compute kernels. It has no support for rendering workloads or workloads that depend on running multiple kernels. It has limited facilities for result verification, and will need to grow support for floating point comparisons with rounding tolerance.
The imgdiff tool currently implements the CIE76 color difference algorithm, which is better than nothing, but not the best (I chose it due to its simplicity). It can also compute the root mean square of the image differences, and a few other useful metrics for quantifying image differences. The only test of this is a fractal generation which exhibits pretty wide variation on different hardware due to cumulative rounding errors. It has some configurability via a YAML “rules” input file to adjust how sensitive the match needs to be. There is a lot more work we could do to improve this tool including better color distance, better difference quantification, more meaningful “difference” image generation, etc.
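For reference, CIE76 is just the Euclidean distance between two colors in CIELAB space:

\Delta E^{*}_{76} = \sqrt{(L^{*}_{2} - L^{*}_{1})^{2} + (a^{*}_{2} - a^{*}_{1})^{2} + (b^{*}_{2} - b^{*}_{1})^{2}}

A ΔE around 2.3 is commonly cited as a just-noticeable difference; more perceptually uniform metrics such as CIE94 or CIEDE2000 are the obvious candidates for the “better color distance” follow-up work mentioned above.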
The main hope in moving this project into LLVM early is to leverage cross-company collaboration to fill out the missing features and expand test coverage.
Why isn’t this just llvm-test-suite?
The llvm-test-suite solves a problem that is more complex in some areas and less complex in others, which drove it in a very different direction from what I believe accelerator testing needs.
Specifically, because the act of compiling C & C++ code varies so much across different platforms (header search paths, compiler flags, etc), the llvm-test-suite relies on meta build systems (CMake, autoconf) and additional scripting to help figure out how to build test programs.
At the same time, executing a compiled C++ program on your host CPU is quite simple, so the execution and results verification can be much simpler.
For accelerator programming models the inverse is the case. Programs are often much simpler to compile for a given API. On the other hand, executing a compiled program for an accelerator API requires a dedicated loader that configures the device, manages memory, and controls execution.
The testing matrix for accelerators is also different. As illustrated by the testing I’m running today, it is often useful to test the same program multiple times on different acceleration devices and APIs from the same host environment. This testing framework builds a single tool to act as the loader, which has separate backends for different APIs and can target multiple devices (as available) with each API.
Next Steps
Assuming community consensus, I’d like to move the existing code and test infrastructure under the LLVM organization and continue development there. I will continue hosting the GitHub Actions runners for the foreseeable future.
cc: @bogner, @jdoerfert, @arsenm, @jhuber6
jhuber6 January 10, 2025, 9:37pm 2
I’m always interested in more ways to test things on the GPU. As far as LLVM testing goes, there are basically a few different approaches we use.
- Just use llvm-lit with some run lines that compile and run the output. This works mainly for single-source languages (OpenMP, CUDA, SYCL), but you could probably make it work for anything if you abused tools enough. (A sketch of this follows after the list.)
- Treat the GPU as a direct target and execute a main kernel that does a simple unit test. Use an ‘emulator’ to then execute the image and see if it passes. This is what libc does; it works very well for making tests portable between all targets, but is very limited for GPUs. Mainly this is because a single kernel does not synchronize between blocks.
- Just compile a test file like normal and see if it returns zero; that’s the LLVM external test suite.
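To illustrate the first approach, the run lines in a single-source offload test might look roughly like this (a sketch only, assuming an OpenMP target and a %clang substitution provided by the lit configuration; the real OpenMP/offload suites define their own substitutions):

// RUN: %clang -fopenmp --offload-arch=native %s -o %t
// RUN: %t | FileCheck %s

The test source then prints results computed on the device and FileCheck verifies the output.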
From a brief scan, it seems like what you’re doing is basically like what we already do in the OpenMP test suite (run lines and source code), with some YAML configs to help auto-generate the tests.
Granted, I don’t really have much experience outside of GPU compute, but what infrastructure would need to be added beyond what we already do with lit testing?
Doesn’t the test suite contain CUDA programs?
GPU libc tests come with a loader per architecture; those and offload/test compile for different architectures by creating one executable per architecture.
I guess I don’t understand why we need a “new” approach.
arsenm January 13, 2025, 3:53am 4
I’ve never gotten those to successfully build since they had strange system setup expectations. I also think we need some kind of GPU unit testing beyond running some arbitrary program (e.g. more targeted unit tests for specific backend functionality or optimizations). We do need some kind of framework for dealing with the typical boilerplate of getting sample inputs and comparing against reference outputs.
The libc tests aren’t really a substitute for GPU testing. We’re not going to test all float inputs for a function single-threaded, for example.
shiltian January 13, 2025, 4:01am 5
FWIW, I did some research work on getting the entire llvm-test-suite to work/run on a GPU. It is similar to what @jhuber6 is doing in the GPU libc project.
It does contain HIP tests as external tests.
beanz January 13, 2025, 7:10pm 6
The big things that this test framework adds over what we’ve done before are that the loader is a single entry point for multiple runtime APIs supporting multiple GPU vendors, and that the loader also manages setting up GPU memory and reading back from the GPU to drive FileCheck tests on results.
There is also some tooling to support image outputs and to drive lit features based on runtime API capabilities. Those are all simple additions building off idiomatic LLVM design practices.
There is still a bunch of work to do in order to support rendering workloads which have their own mess of complexity. I haven’t started that work, but it is high on my wish-list.
I guess I’d start by saying this isn’t really a new approach. I think everything about this from an approach standpoint is in line with how we’ve done testing elsewhere.
There are some reasons I don’t think the test suite is ideal for testing offload programs written in the models used by HLSL, OpenCL and other shading languages. The biggest is that the test-suite is designed around using CMake to configure and build a bunch of executables and LIT to run them.
For the programming models we’re supporting, CMake is a mess of extra complexity: we’re really only building one CPU executable. What we need instead is a way to specify how to initialize the accelerator’s memory and kick off execution.
Most of the tests I have written so far are extremely simple and follow the flow:
- Compile a program
- Execute it with the loader
- Verify the results with FileCheck
The loader initializes GPU memory based on a YAML file, copies memory back to the CPU and re-emits the memory to YAML.
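To make that concrete, a compute test ends up looking roughly like the following (a sketch: the %dxc_target and %offloader substitutions are the suite’s own, but the HLSL kernel, the YAML layout, and the exact FileCheck wiring here are illustrative rather than the real schema):

# RUN: split-file %s %t
# RUN: %dxc_target -T cs_6_0 -Fo %t.o %t/source.hlsl
# RUN: %offloader %t/pipeline.yaml %t.o | FileCheck %s
#--- source.hlsl
RWStructuredBuffer<int> Out : register(u0);
[numthreads(4, 1, 1)]
void main(uint GI : SV_GroupIndex) { Out[GI] = GI * 2; }
#--- pipeline.yaml
# Hypothetical resource description: it would declare the buffer bound
# at u0, its initial contents, and the dispatch size. The loader runs
# the kernel and re-emits the final buffer contents as YAML, which the
# CHECK line below matches.
# CHECK: [ 0, 2, 4, 6 ]

The point is that the compile, execute, and verify steps all live in the test file itself, with no per-test build system glue.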
Conceptually this isn’t different from some of the other single-source LIT-based tests, but it integrates radically differently from what we do in the llvm-test-suite today. It would actually fit much better with the way we run the libcxx or libc tests in LLVM.
You might then ask why not just put this into LLVM proper?
I don’t think this warrants being included in the monorepo because it won’t be used by a lot of people, and IMO it is nice to have it separate due to the third-party code dependencies.
@arsenm’s comment about the libc tests not being a substitute for GPU testing is also a big deal for us. We need to be able to expose and test GPU programming features that just don’t have a clear analog in C/C++.
beanz March 1, 2025, 9:54pm 7
I want to bump this back up. We’re still interested in contributing this under the LLVM umbrella and using it for Clang HLSL execution testing.
I’m not sure if there is clear feedback that we need to incorporate into the design or integration plan to make it fit better.
(cc @tstellar, @petrhosek, & @dblaikie)
I guess I don’t understand what the actual problem here is. CMake for CUDA/OpenMP/… offload tests works fine (via CXXFlags); CMake for “direct GPU” programs should work fine as well, via explicit targets. What am I missing? The host code still needs to be built for the respective target arch of the test suite. We further likely want to define “offload targets” in the test suite configuration, which will determine what tests can run at all and how we compile them.
What I’m trying to say is, your loader + YAML approach is fine. We should just build and run as the rest of the TS for the host, and with an additional “offload target archs” option for the devices. Maybe I simply fail to see the issue here.
beanz March 2, 2025, 2:29am 9
CMake’s support for HLSL is not good. It either only works with the Visual Studio generator, or requires the Vulkan SDK. It also has no support for the Metal shader converter to target macOS. So we’d end up building a mess of our own CMake to drive it, which just seems like a headache with more downsides than upsides.
By not relying on CMake we can run the test suite for multiple compilers and target APIs without needing to generate new build directories.
As I’ve built the suite this is already possible without relying on CMake by using LIT features and generating multiple lit suite configurations for the test suite from the same build. As implemented it allows testing DirectX, Vulkan, and Metal with both DXC and Clang from a single CMake invocation with different top-level LIT targets.
I’ve got a few small improvements I’m planning to simplify our run line construction, but overall this is way simpler for us to write tests, and since some tests depend on specific command line options we have that control without a lot of custom CMake hackery.
I believe the HLSL and OpenCL programming models have much simpler compiler command interfaces than the other, more C++-aligned models.
Maybe the answer here is that we shouldn’t try to use a testing infrastructure like this for CUDA/HIP/OpenMP, but I don’t think the infrastructure that we have for those runtimes is really ideal for what we need.
OK, let’s assume for a second CMake can’t easily support what you want to do.
We still don’t want to end up with a test suite for which you need to run build command A + run command A if you want tests A, and build command B + run command B, if you want tests B. Does that make sense?
What if you wrap your compilation / run model into what we have?
So running build command A for subfolder Metal simply calls your build script, passes along whatever seems helpful from the CMake config the user picked, but otherwise executes custom build steps. Same for the run.
beanz March 2, 2025, 9:41pm 11
I must have explained something poorly. As currently set up, there is a poorly named check-hlsl target that runs all the tests across all hardware configurations that your device supports. There are also API- and compiler-specific targets that can run a subset (e.g. check-hlsl-d3d12, check-hlsl-mtl, and check-hlsl-vk). The subset targets are useful in particular for devices that support multiple possible runtimes, to make it easy to break out what works and what doesn’t.
We don’t have “subfolder Metal”, most of our tests should run on all of the 3 supported APIs, and we use LIT feature checks to denote when things are unsupported or behave differently on different APIs.
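Concretely, the per-API differences are handled with standard lit directives inside a single test file, for example (the Metal feature and the shader-converter line come from the Mandelbrot test shown later in this thread; the other feature names are illustrative):

# UNSUPPORTED: Vulkan
# XFAIL: DirectX
# RUN: %if Metal %{ metal-shaderconverter %t.dxil -o=%t.o %}

So one test file can run, be skipped, or be expected to fail per configuration, and api-query feeds the available features into lit when the suite is configured.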
We have almost nothing that needs to be passed from CMake down to the test suite, and nothing from CMake for actually building the HLSL. You can see the full options we pull from CMake here.
Since CMake doesn’t support HLSL out of the box, in order to do what you’re suggesting we would need to use a mess of custom commands and other nonsense to generate the shader binaries, then we would need to feed those into the LIT-driven test cases to execute the tests. This seems way more complicated than what we have, and I don’t understand what benefit it provides. Can you explain why you think this is a better approach? In my experience, maintaining CMake is significantly more difficult than maintaining LIT configurations.
Let’s start with cross-compiling the test suite. Are we saying that it is simply not an option for the driver and consequently the HLSL tests? Similarly, changing the input size, using “perf” (or similar) for accurate performance tracking, etc.? Basically all the “features” (TEST_SUITE_RUN_UNDER, TEST_SUITE_SUBDIRS, TEST_SUITE_COLLECT_STATS, …) the test suite exposes right now.
beanz March 2, 2025, 11:27pm 13
HLSL and OpenCL target virtual ISAs, and rely on driver-provided compilers for final ISA lowering. Cross-compiling the test suite doesn’t mean the same thing for those kinds of programming models as it might for programming models where you’re building for specific hardware.
You can absolutely cross-compile the test suite’s host tools, and just as with any other LLVM-based cross-compilation you can copy the lit configuration over to the cross-target and run it, but the host tools are not portable to GPUs, so the utility of cross-targeting is limited (although not none, because I have built the Windows test framework on macOS).
The more interesting stats to capture in this model are things like code size and compiler performance. The test-suite’s mechanism for measuring compiler performance (like under CTMark) leaves a lot to be desired, and I certainly wouldn’t advocate that it is the right approach. Execution performance of the generated GPU programs depends so much on the GPU driver’s provided compiler that, while not meaningless, it isn’t really the same kind of measurement you get from building the LLVM test suite for a GPU.
I don’t really foresee us having tests where we change input sizes since you generally can’t change the input size without also changing aspects of how the program is run or how the GPU pipeline data is initialized. If there were a use for that we would need a method to generate the input pipeline YAML data, and that could just as easily run under LIT as CMake.
Right now, changing the input size means picking up a different input file and reference output. That would not be any different: small.yaml, large.yaml, and you get different input sizes, no?
If I enable “performance profiling” for the TS I want to get numbers I can compare, put in lnt, etc. For the GPU tests that would not mean we run perf on the driver, but it would likely mean we ask the driver to provide us with a performance number. It could use some chrono timers around the kernel launch, use event timers in streams, run under/with vendor-prof, whatever, as long as it spits out a number that is meaningful and can be compared run-after-run. All that said, there is also compile-time profiling, stats, etc., which again might mean something different but should be “honored”.
I don’t understand this. What additional nonsense is needed? I am not suggesting it all has to go through CMake, but if it is connected we can reuse the same options, tools, etc.
You have a way/script to build your test X; let’s call it build_X. Now we simply register it as a CMake target X and say here is the build command, build_X, and the run command, run_X.
beanz March 3, 2025, 1:51am 15
None of what you’re suggesting is in conflict with the approach I’ve taken in implementing this suite. It just requires work to build, and the existing test-suite can’t do it either for these programs so we need to build it one way or another.
I think you may be over-estimating the complexity of the test programs. These are single-source programs. Neither SPIR-V nor DXIL have widely used separate compilation and linkage models, so even as we add more complicated programs they’ll almost exclusively be single translation-unit.
To put these builds into CMake we would need to build up custom command infrastructure to drive the compiler, we would need a bunch of CMake files to set all the command flags and targets. Then in order to use LIT to drive the execution (so that it can feed into LNT) we need to map the built shader paths into all the test files to run each of the tests, or we have to build a separate infrastructure like the test-suite does to generate lit config files.
If we want to support multiple compilers building the tests through CMake we either need to use multiple build directories or we add another dimension of complexity (the suite currently supports running tests built with DXC and Clang from the same configuration).
One big limitation of the approach used by the llvm-test-suite is that it does not work with the Visual Studio generator. Since most of the developers working on HLSL support work for Microsoft and on Windows as their primary development platform that is a significant problem.
Maybe we can make this conversation more concrete. For example, look at one of the tests with more complex verification, like the Mandelbrot test.
The run lines to build and run the test are pretty simple:
# RUN: split-file %s %t
# RUN: %dxc_target -T cs_6_0 -Fo %t.o %t/source.hlsl
# RUN: %if Metal %{ mv %t.o %t.dxil %}
# RUN: %if Metal %{ metal-shaderconverter %t.dxil -o=%t.o %}
# RUN: %offloader %t/pipeline.yaml %t.o -r Tex -o %t/output.png
# RUN: imgdiff %t/output.png %goldenimage_dir/hlsl/Basic/Mandelbrot.png -rules %t/rules.yaml
We could pull this out and capture it in CMake, but I think that makes it significantly harder to author and maintain tests. It also isn’t necessary to do any of the things you’ve listed. So what is the benefit?
The entire lit configuration for the test suite is under 100 lines:
And the CMake is under 200 lines:
An idea that I brought up in “RFC: move llvm-toolchain-integration-test-suite under the llvm umbrella” (#7 by petrhosek), and which is also relevant here, is whether it’d be possible to integrate this test suite into llvm/llvm-test-suite. I’m not sure if that’s the right direction and I’d be interested to hear your opinion. I included this topic on the agenda for tomorrow’s Infrastructure Area Team meeting; it’d be great if you could join.
beanz March 20, 2025, 7:24pm 17
Sorry my week has been insane and I missed this until just now.
Probably the biggest issue with integrating this under the llvm-test-suite is that our team relies on the Visual Studio generator extensively, which does not work with the llvm-test-suite. Resolving that would be a non-trivial amount of work in the llvm-test-suite.
If merging this under the existing llvm-test-suite is a requirement I suspect our business and workflow needs would necessitate us keeping this outside LLVM because I cannot justify funding the work to make the test suite work under Visual Studio, and our requirement is inflexible.
We discussed this topic in the Infrastructure Area Team meeting, and while consolidating all test suites under a single repository would have some advantages, namely fewer repositories and the potential for sharing infrastructure, we also acknowledged that there are some notable downsides and that llvm/llvm-test-suite would need a significant refactoring/cleanup first. Given that, the Infrastructure Area Team agrees with the creation of a new repository under the LLVM GitHub organization, unless there are any objections from the rest of the community.