Incoming backends: Vulkan, Kompute, SYCL · ggml-org/llama.cpp · Discussion #5138

ref:

There are 3 new backends that are about to be merged into llama.cpp. The tentative plan is to do this over the weekend. Due to the large amount of code that is about to be merged, I'm creating this discussion as a quick communication channel between the maintainers in case problems arise.

The main goal after merging the backends is to make the CI green, which would give some level of confidence that the existing functionality has not been broken. Even if the new backends don't function completely as expected, this would be acceptable, since the idea is to improve on them over time. However, we want the CPU, CUDA and Metal backends to remain stable.

I'm thinking of doing the merges all at once (in a batch) and syncing everything back to the ggml and whisper.cpp repos.

If you have any general comments / questions, we can discuss them here. I will keep a close eye on the discussion until we finalize the merges, and will also put a note in the readme for awareness. We can discuss code specifics in the respective PRs as usual and keep this discussion focused on high-priority stuff (if needed).


I've mentioned it elsewhere, and this is only tangentially related, but GPU offload with the OpenCL backend is pretty broken right now, ever since the backend rework. The Phi, Mixtral and Falcon model architectures all no longer work, as Slaren explained here: #2059 (comment)

I'm not sure which operations each of these new backends will support, but I would just like to +1 Slaren's suggestion that "weights not supported by a backend are kept on the CPU instead", which would hopefully allow graceful performance degradation instead of just segfaulting.
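For illustration, here is a minimal sketch of that fallback idea. The types and names below are hypothetical, made up purely for the example, and are not the actual ggml scheduler API:

```cpp
#include <cstdio>
#include <string>
#include <vector>

// Hypothetical description of a weight tensor and of a backend's capabilities.
// These types are illustrative only; they are not part of ggml.
struct Weight {
    std::string name;
    std::string op;      // the operation this weight feeds into, e.g. "MUL_MAT_ID"
};

struct Backend {
    std::string name;
    // In a real backend this would query device/shader support for the op.
    bool supports_op(const std::string &op) const {
        return op != "MUL_MAT_ID"; // pretend the MoE matmul is unsupported
    }
};

// Place each weight on the GPU backend only if it supports the op that
// consumes it; otherwise keep it on the CPU so the graph can still run.
static void assign_weights(const Backend &gpu, const std::vector<Weight> &weights) {
    for (const auto &w : weights) {
        const char *dev = gpu.supports_op(w.op) ? gpu.name.c_str() : "CPU";
        std::printf("%-28s -> %s\n", w.name.c_str(), dev);
    }
}

int main() {
    Backend gpu{"Vulkan0"};
    std::vector<Weight> weights = {
        {"blk.0.attn_q.weight",        "MUL_MAT"},
        {"blk.0.ffn_gate_exps.weight", "MUL_MAT_ID"}, // MoE expert weights
    };
    assign_weights(gpu, weights);
    return 0;
}
```

With a check like this, a model using unsupported ops would run more slowly instead of crashing.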


@ggerganov

OpenCL needs a complete overhaul as a ggml backend, similar to what has been done with the backends referenced here. The OpenCL matrix multiplication offloading was a poor man's hack that resulted in some performance gains and was nice to have at the start, but we cannot keep working around it. It has to either be reimplemented properly as a backend, or we will eventually drop support for OpenCL altogether, even more so now that we are about to add Vulkan.

Keeping the unsupported weights on the CPU would be nice to have, but as mentioned, it is low priority at the moment.

It would be interesting to see how the Vulkan backend works on Android.
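To make the "complete overhaul as a ggml backend" point above a bit more concrete, here is a heavily simplified, hypothetical sketch of the kind of interface a backend implements (its own device buffer management plus a graph-compute entry point). The names are illustrative only and do not match the actual ggml headers:

```cpp
#include <cstddef>
#include <cstdio>
#include <new>
#include <vector>

// Illustrative-only tensor type; a real backend works on ggml tensors.
struct Tensor {
    std::vector<float> data;
};

// A hypothetical, stripped-down backend interface: the backend owns its
// buffers and executes whole graphs, rather than hooking a single code path.
struct Backend {
    virtual ~Backend() = default;
    virtual const char *name() const = 0;
    virtual void *alloc_buffer(size_t size) = 0;                    // device memory
    virtual void  free_buffer(void *buf) = 0;
    virtual bool  compute_graph(std::vector<Tensor *> &graph) = 0;  // run all ops on device
};

// A trivial CPU reference implementation of the same interface.
struct CpuBackend : Backend {
    const char *name() const override { return "CPU"; }
    void *alloc_buffer(size_t size) override { return ::operator new(size); }
    void  free_buffer(void *buf) override { ::operator delete(buf); }
    bool  compute_graph(std::vector<Tensor *> &graph) override {
        std::printf("computing %zu tensors on %s\n", graph.size(), name());
        return true;
    }
};

int main() {
    CpuBackend cpu;
    Tensor t;
    std::vector<Tensor *> graph = { &t };
    cpu.compute_graph(graph);
    return 0;
}
```

The contrast with the old OpenCL code is that the old path only intercepted matrix multiplications, whereas a proper backend in this sense is responsible for its buffers and for executing the full graph.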

I wish the Vulkan API would gain support for NPUs in modern hardware chipsets, if it doesn't already (ahem, Samsung Galaxy S24 AI)...


@StuartIanNaylor

Isn't it a matter of vendors writing Vulkan drivers for the supported ops?
It's a shame there isn't a Linux equivalent of Android's NNAPI, but Vulkan (and likely dropping OpenCL) might be the way to go. Even Arm NN uses OpenCL, though, and it's a landscape with acne.

@qnixsynapse

@StuartIanNaylor I think vendors will write drivers once such "extensions" become part of the official Vulkan specification. A proposal for such an extension needs to be filed and reviewed; once it is accepted, vendors will eventually adopt it.

@sorasoras

Or better: vendors could directly accelerate some of the ops used by machine learning when using Vulkan.

@StuartIanNaylor

@mschwaig

I tried to run the latest commit (e76627b) on a Pixel 7a with GrapheneOS using Termux and it segfaulted.

~/llama.cpp $ ./llama-bench -m phi-2.Q4_K_M.gguf
ggml_vulkan: Using Mali-G710 | fp16: 1 | warp size: 16
| model | size | params | backend | ngl | test | t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | ---------- | ---------------: |
Segmentation fault

The GPU being detected seems like a good sign to me. I have not yet tried that exact same model with the Vulkan backend on my desktop system though.

Here's the vulkaninfo output if anyone's interested.

Really nice. I'm curious whether any of these Vulkan implementations will work with the Raspberry Pi 5. Would be nice to take advantage of its GPU. 🤔


@Nindaleth

@0cc4m

That's very cool. If someone can send me the vulkaninfo output of the RPi5 I can provide more information about whether it should/could work.

@VelvetyWhite

@StuartIanNaylor

Likely not worth it. I have an RK3588, and the Mali-G610 manages about 75% of the ML workload of all 4 cores.
And that is with the Mali-G610, which is only an MP4 but much faster than the VideoCore (the Raspberry Pi spec quotes the VideoCore's GFLOPS but doesn't show the effect of marshalling to and from CPU RAM space).
Running layers on the Mali-G610 is slower but could still help the CPU; if you're trying this on a Pi, don't blame Vulkan, as the GPU isn't in the same league as its A76 cores.

@0cc4m

It could work, I think the VC Vulkan driver has all the parts that it needs.

Can the SYCL backend be used with AMD cards? Also, can the SYCL backend let me use both the CPU and the GPU?


Some benchmarks of Vulkan and Kompute on 6750XT/5800X3D. (Model is SOLAR 10.7B Q4_1.)

| model | size | params | backend | threads/ngl | test | t/s |
| --- | ---: | ---: | --- | --: | --- | ---: |
| llama 34B Q4_1 | 6.27 GiB | 10.73 B | ROCm | 99 | pp 512 | 673.96 ± 0.77 |
| llama 34B Q4_1 | 6.27 GiB | 10.73 B | Vulkan | 99 | pp 512 | 209.35 ± 1.29 |
| llama 34B Q4_1 | 6.27 GiB | 10.73 B | Kompute | 8 | pp 512 | 72.77 ± 0.05 |
| llama 34B Q4_1 | 6.27 GiB | 10.73 B | OpenCL | 99 | pp 512 | 112.41 ± 0.79 |
| llama 34B Q4_1 | 6.27 GiB | 10.73 B | CPU | 8 | pp 512 | 15.72 ± 3.43 |
| llama 34B Q4_1 | 6.27 GiB | 10.73 B | ROCm | 99 | tg 128 | 40.30 ± 0.01 |
| llama 34B Q4_1 | 6.27 GiB | 10.73 B | Vulkan | 99 | tg 128 | 17.52 ± 0.03 |
| llama 34B Q4_1 | 6.27 GiB | 10.73 B | Kompute | 8 | tg 128 | 35.48 ± 0.04 |
| llama 34B Q4_1 | 6.27 GiB | 10.73 B | OpenCL | 99 | tg 128 | 14.86 ± 0.04 |
| llama 34B Q4_1 | 6.27 GiB | 10.73 B | CPU | 8 | tg 128 | 6.35 ± 0.00 |


@cebtenzzre

Is it SOLAR 10.7B or llama 34B? And are you using 8 layers for Kompute, or 99? Which llama.cpp commit(s) are you testing on?

@Artefact2

Model is SOLAR 10.7B Q4_1. Kompute mis-reports in llama-bench as CPU with 8 threads, but it's fully offloading to the GPU.

@0cc4m

I think my Vulkan backend doesn't like non-k-quants at the moment. I've seen some improvements to the CUDA quantized matrix-vector shaders added recently that could be ported to Vulkan.

@sorasoras

Your backend is quite a bit slower at Q2_K than at Q4_K_M, which is kind of interesting.

Hmm, not sure if this is already implemented or not, but running ML inference in compute shaders should improve performance.
Also, for Arm v8-A based chips, some SoCs support Neon intrinsics.


@sorasoras

I don't know what you mean, but the Vulkan backend is basically inference on compute shaders.
