Add GPU support to ggml · ggml-org/llama.cpp · Discussion #915

Intro

This issue is more suitable for the https://github.com/ggerganov/ggml repo, but adding it here for more visibility.

First, I don't see adding a GPU framework that is tightly integrated with ggml anytime soon because it usually comes with a lot of maintenance drawbacks, architecture changes and issues. However, there is an alternative approach that might be relatively easy to implement and I think would be a very cool way for new developers to join in and help.

Description

ggml produces computation graphs which are basically directed acyclic graphs (DAGs) that can be easily exported, iterated, etc. A graph contains the information about all necessary tensor operations and buffers needed to evaluate the model. The idea is to first add basic ggml functionality for exporting the graphs in some trivial text format that can be parsed as a second step by a separate ggml tool. Having the exported graphs, one can process them and construct hardware-specific code for evaluating them. This way, we keep implementing existing and new transformer models as we currently do - with a focus on CPU execution - but we gain the benefit of being able to export the computation graphs and translate them for GPU execution.
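To make the export step concrete, here is a minimal sketch of what writing a graph to a trivial text format could look like. Everything in it is an assumption made for illustration: `toy_node`, `toy_op` and `toy_graph_export` are hypothetical names, not the real ggml structs or API, and the one-line-per-node layout `node <id> <op> <ne0> <ne1> <src0> <src1>` is just one possible format.

```c
#include <stddef.h>
#include <stdio.h>

/* Hypothetical toy graph node: NOT the real ggml tensor layout. */
enum toy_op { TOY_OP_NONE, TOY_OP_ADD, TOY_OP_MUL_MAT };

struct toy_node {
    enum toy_op op; /* operation producing this tensor (NONE = leaf/input) */
    int ne[2];      /* number of elements along each of 2 dimensions */
    int src0;       /* index of the first source node, -1 if none */
    int src1;       /* index of the second source node, -1 if none */
};

static const char *toy_op_name(enum toy_op op) {
    switch (op) {
        case TOY_OP_ADD:     return "add";
        case TOY_OP_MUL_MAT: return "mul_mat";
        default:             return "none";
    }
}

/* Format the nodes, assumed to be stored in evaluation order, one per line:
 *   node <id> <op> <ne0> <ne1> <src0> <src1>
 * Returns the number of characters written (excluding the terminator),
 * or -1 if `dst` is too small. */
static int toy_graph_export(const struct toy_node *nodes, int n_nodes,
                            char *dst, size_t dst_size) {
    size_t used = 0;
    if (dst_size > 0) dst[0] = '\0';
    for (int i = 0; i < n_nodes; ++i) {
        const struct toy_node *n = &nodes[i];
        int w = snprintf(dst + used, dst_size - used, "node %d %s %d %d %d %d\n",
                         i, toy_op_name(n->op), n->ne[0], n->ne[1], n->src0, n->src1);
        if (w < 0 || (size_t) w >= dst_size - used) return -1;
        used += (size_t) w;
    }
    return (int) used;
}
```

The resulting buffer can be written to a file with `fputs` and handed to a separate tool. A real format would also need tensor types, strides and per-op parameters, which this sketch leaves out.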

For example, a ggml-cuda tool can parse the exported graph and construct the necessary CUDA kernels and GPU buffers to evaluate it on an NVIDIA GPU. Another tool, for example ggml-mps, can do similar stuff but for Metal Performance Shaders. Or maybe even a ggml-webgpu tool.

This approach preserves the cross-platform nature of ggml and allows custom hardware support, via compiler-like translation of the exported computation graphs.
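As a rough illustration of the "compiler-like translation", the sketch below parses one node of an exported graph and emits CUDA source text for it. The input format (`node <id> <op> <ne0> <ne1> <src0> <src1>`), the function name `toy_emit_cuda` and the generated kernel naming are all hypothetical, and only an element-wise `add` is handled. Note that the translator itself is plain C and does not need the CUDA SDK; it only produces text.

```c
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/* Parse one line of a hypothetical text format
 *   node <id> <op> <ne0> <ne1> <src0> <src1>
 * and, for an "add" node, write CUDA source for an element-wise kernel
 * into `dst`.
 * Returns the number of characters written (excluding the terminator),
 * 0 if the node needs no kernel (a leaf), and -1 on a parse error,
 * an unsupported op, or a too-small `dst`. */
static int toy_emit_cuda(const char *line, char *dst, size_t dst_size) {
    int id, ne0, ne1, src0, src1;
    char op[32];
    if (sscanf(line, "node %d %31s %d %d %d %d",
               &id, op, &ne0, &ne1, &src0, &src1) != 6) {
        return -1;
    }
    if (strcmp(op, "none") == 0) return 0;  /* leaf tensor: nothing to compute */
    if (strcmp(op, "add") != 0) return -1;  /* other ops are not sketched here */

    /* One thread per element; the element count is baked into the source. */
    int n = snprintf(dst, dst_size,
        "__global__ void k_node_%d(const float *t%d, const float *t%d, float *t%d) {\n"
        "    int i = blockIdx.x * blockDim.x + threadIdx.x;\n"
        "    if (i < %d) t%d[i] = t%d[i] + t%d[i];\n"
        "}\n",
        id, src0, src1, id, ne0 * ne1, id, src0, src1);
    if (n < 0 || (size_t) n >= dst_size) return -1;
    return n;
}
```

A full tool would loop over all lines, concatenate the kernels, and append the launch and buffer setup code. Baking shapes into the generated source is the kind of model-specific shortcut this approach makes cheap.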

Still, this does not remove the hardest part of the work: implementing the respective kernels for the targeted backend remains the biggest obstacle.

However, I think this decoupled approach to the implementation would make the development process much easier and can potentially allow for some interesting optimizations. My biggest fear about adding a tightly integrated GPU backend to ggml is that I don't know the important details for supporting the respective backend, which could lead to bad software design decisions that in turn can have negative side effects even on the core CPU implementation.

With the proposed approach in this issue, we eliminate this risk and allow multiple independent implementations to be provided without any negative side effects on the core ggml implementation.

Another cool thing about this idea is that there could be separate leading developers for each backend.
So if you have good knowledge and understanding of a certain hardware architecture, you are one step away from initiating the kernel "translation" process and making a very significant contribution to the project.

Guiding principles

I don't know all the specifics of a good GPU implementation, but I believe one could try to adopt the fundamental principles of ggml.

For example, there could be a single memory buffer allocated and all the tensors can be distributed within that memory buffer at certain offsets. Each graph operation will correspond to a kernel with source tensors as input and a destination tensor for output which will be all part of that single memory buffer allocated at the start of the execution.
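The single-buffer idea can be sketched on the CPU as follows. This is only an illustration under stated assumptions: `toy_arena`, `toy_arena_alloc` and `toy_op_add` are hypothetical names, the 32-byte alignment is an arbitrary choice, and `malloc` stands in for whatever allocation call the GPU backend provides.

```c
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical bump allocator: one buffer, tensors identified by byte offsets. */
struct toy_arena {
    unsigned char *base; /* start of the single allocation */
    size_t size;         /* total bytes available */
    size_t used;         /* bytes handed out so far */
};

#define TOY_ALIGN 32 /* assumed alignment, chosen only for illustration */

static int toy_arena_init(struct toy_arena *a, size_t size) {
    a->base = (unsigned char *) malloc(size);
    a->size = size;
    a->used = 0;
    return a->base != NULL ? 0 : -1;
}

/* Reserve `nbytes` at the next aligned offset.
 * Returns the offset, or (size_t) -1 if the tensor does not fit. */
static size_t toy_arena_alloc(struct toy_arena *a, size_t nbytes) {
    size_t off = (a->used + TOY_ALIGN - 1) / TOY_ALIGN * TOY_ALIGN;
    if (off > a->size || nbytes > a->size - off) return (size_t) -1;
    a->used = off + nbytes;
    return off;
}

/* One graph operation: sources and destination are offsets into the same buffer. */
static void toy_op_add(struct toy_arena *a, size_t src0, size_t src1,
                       size_t dst, size_t n) {
    const float *x = (const float *) (a->base + src0);
    const float *y = (const float *) (a->base + src1);
    float *z = (float *) (a->base + dst);
    for (size_t i = 0; i < n; ++i) z[i] = x[i] + y[i];
}
```

Because every tensor is just an offset, a translator can compute the whole memory layout ahead of time and emit it as constants, so nothing needs to be allocated during evaluation.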

Additionally, I think we don't need to explicitly add 3rd party dependencies (e.g. CUDA SDK, OpenCL, etc.) to ggml to achieve that. The new ggml translation tools will simply read a computation graph and generate code for a certain GPU backend, which will be up to the user to compile and run.

The existing CPU code for each tensor operation is your reference implementation. Ideally, you would always want to implement the same computation in the corresponding new kernel and after that, you can try to optimize it for the specifics of the hardware. This is especially true for the 4-bit kernels.
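Treating the CPU code as the reference suggests a simple check for every new kernel: run both on the same input and compare the results within a tolerance. A minimal sketch follows; the helper name `toy_outputs_match` is hypothetical, and it assumes the kernel output has already been copied back to host memory for testing purposes.

```c
#include <stddef.h>

/* Compare a candidate kernel's output against the CPU reference output.
 * Returns 1 if every element is within `tol` of the reference, 0 otherwise.
 * The negated comparison also rejects NaN values. */
static int toy_outputs_match(const float *ref, const float *out,
                             size_t n, float tol) {
    for (size_t i = 0; i < n; ++i) {
        float d = ref[i] - out[i];
        if (d < 0.0f) d = -d;      /* absolute difference */
        if (!(d <= tol)) return 0; /* too far off, or NaN */
    }
    return 1;
}
```

An appropriate tolerance depends on the operation and the data type; quantized kernels that reorder floating-point accumulation will generally not match the reference bit for bit.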

All computations and buffers remain on the GPU. Avoid back-and-forth copies of data to the CPU RAM at all costs.

Taking shortcuts and making custom hacks in favor of better performance is very welcome. "General-purpose" is "bad". For example, we can have a tool like ggml-cuda-llama which is a very custom ggml translator to the CUDA backend which works only with LLaMA graphs and nothing else, but does some very LLaMA-specific optimizations. This is fine.

Keep things minimalistic and don't over-engineer. For example, a CUDA translation tool will output a single C++ (or some other language) file with all the kernels and backend initialization code embedded in it. A simple C-style function for evaluation can be exported so that we can call this from other code bases. The actual translation tool should also be implemented as a single source file in a preferred language. (This guiding principle has to be defined a bit better, but we will figure it out as we go.)
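One possible shape for the exported C-style interface of such a generated file is sketched below. The names `toy_gpu_ctx`, `toy_gpu_init`, `toy_gpu_eval` and `toy_gpu_free` are hypothetical, and the body of the evaluation function is a placeholder that only zero-fills the output where the generated kernel launches would go.

```c
#include <stddef.h>
#include <stdlib.h>

#ifdef __cplusplus
extern "C" { /* keep C linkage if the generated file is compiled as C++/CUDA */
#endif

/* Hypothetical handle; a real generated file would keep its device buffer here. */
typedef struct toy_gpu_ctx {
    size_t n_in;  /* expected number of input floats */
    size_t n_out; /* number of output floats produced */
} toy_gpu_ctx;

/* Returns NULL if the allocation fails. */
toy_gpu_ctx *toy_gpu_init(size_t n_in, size_t n_out) {
    toy_gpu_ctx *ctx = (toy_gpu_ctx *) malloc(sizeof *ctx);
    if (ctx != NULL) {
        ctx->n_in = n_in;
        ctx->n_out = n_out;
    }
    return ctx;
}

/* Evaluate the graph once. Returns 0 on success, -1 on a size mismatch.
 * Placeholder body: the generated kernel launches would replace the loop. */
int toy_gpu_eval(toy_gpu_ctx *ctx, const float *in, size_t n_in,
                 float *out, size_t n_out) {
    if (ctx == NULL || n_in != ctx->n_in || n_out != ctx->n_out) return -1;
    (void) in; /* unused in this placeholder */
    for (size_t i = 0; i < n_out; ++i) out[i] = 0.0f;
    return 0;
}

void toy_gpu_free(toy_gpu_ctx *ctx) {
    free(ctx);
}

#ifdef __cplusplus
}
#endif
```

Three plain functions and one handle are enough for other code bases to call the generated backend without knowing anything about how it was produced.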

The GPU "translators" will likely remain second-class citizens from ggml's point of view and they will need to adapt to the core CPU implementation - not the other way around.

Why?

Currently, ggml is one of the few ML frameworks that provides efficient 4-bit quantization and demonstrates effective application for quantized transformer evaluation. The code is compact and easy to comprehend, with very little bloat. I think ggml has a slight leading edge in this regard compared to other general-purpose frameworks and if we utilize it now, it has the potential of becoming a very respectable machine learning framework in the future with a focus on on-device inference.

Note that there is a very large dose of "reinventing the wheel" in the outlined strategy. Therefore, if you want to get involved, it's very important to have the right mindset. Definitely do not approach this with: "this has already been done in another project", "we should do all those things that project X does" or "this is not going to scale well for all those reasons", etc.

I think the right mindset to approach this is: "let's try to hack something fast, small and cool and see where it goes"

Update 28 May 2023: