hybridgroup/yzma: Go with your own intelligence - Go applications that directly integrate llama.cpp for local inference using hardware acceleration.

yzma lets you write Go applications that directly integrate llama.cpp for fully local inference using hardware acceleration.

Run the latest Vision Language Models (VLM) and Large/Small/Tiny Language Models (LLM) on Linux, macOS, or Windows.
Use any available hardware acceleration such as CUDA, Metal, or Vulkan for maximum performance.
yzma uses the purego and ffi packages so CGo is not needed.
Works with the newest llama.cpp releases so you can use the latest features, performance improvements, and bug fixes.

This example uses the SmolLM2-135M-GGUF model:

package main

import (
	"fmt"
	"os"
	"path/filepath"

	"github.com/hybridgroup/yzma/pkg/download"
	"github.com/hybridgroup/yzma/pkg/llama"
)

var (
	modelFile            = "SmolLM2-135M.Q4_K_M.gguf"
	prompt               = "Are you ready to go?"
	libPath              = os.Getenv("YZMA_LIB") // location of the llama.cpp libraries
	responseLength int32 = 12                    // maximum number of tokens to generate
)

func main() {
	// Load the llama.cpp shared libraries and silence their logging.
	llama.Load(libPath)
	llama.LogSet(llama.LogSilent())

	llama.Init()

	// Errors are ignored here to keep the example short.
	model, _ := llama.ModelLoadFromFile(filepath.Join(download.DefaultModelsDir(), modelFile), llama.ModelDefaultParams())
	ctx, _ := llama.InitFromModel(model, llama.ContextDefaultParams())

	vocab := llama.ModelGetVocab(model)

	// Tokenize the prompt and wrap it in a batch for the first decode.
	tokens := llama.Tokenize(vocab, prompt, true, false)

	batch := llama.BatchGetOne(tokens)

	// Greedy sampling: always pick the most likely next token.
	sampler := llama.SamplerChainInit(llama.SamplerChainDefaultParams())
	llama.SamplerChainAdd(sampler, llama.SamplerInitGreedy())

	for pos := int32(0); pos < responseLength; pos += batch.NTokens {
		llama.Decode(ctx, batch)
		token := llama.SamplerSample(sampler, ctx, -1)

		// Stop when the model emits an end-of-generation token.
		if llama.VocabIsEOG(vocab, token) {
			fmt.Println()
			break
		}

		// Convert the token to text and print it.
		buf := make([]byte, 36)
		n := llama.TokenToPiece(vocab, token, buf, 0, true)

		fmt.Print(string(buf[:n]))

		// Feed the sampled token back in as the next batch.
		batch = llama.BatchGetOne([]llama.Token{token})
	}

	fmt.Println()
}
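
The example above reads the library location from the YZMA_LIB environment variable and discards errors for brevity. A minimal sketch of a pre-flight check that could run before loading anything is shown here. It uses only the Go standard library; the `checkInputs` helper and the `modelsDir` location are hypothetical names introduced for illustration and are not part of yzma.

```go
package main

import (
	"errors"
	"fmt"
	"os"
	"path/filepath"
)

// checkInputs verifies the two paths the hello example depends on.
// It is a hypothetical helper, independent of yzma itself.
func checkInputs(libPath, modelPath string) error {
	if libPath == "" {
		return errors.New("YZMA_LIB is not set; point it at the llama.cpp libraries")
	}
	if _, err := os.Stat(libPath); err != nil {
		return fmt.Errorf("library path %q: %w", libPath, err)
	}
	if _, err := os.Stat(modelPath); err != nil {
		return fmt.Errorf("model file %q: %w", modelPath, err)
	}
	return nil
}

func main() {
	// modelsDir is an assumed location for this sketch; the example above
	// obtains the real one from download.DefaultModelsDir().
	modelsDir := filepath.Join(os.TempDir(), "models")
	modelPath := filepath.Join(modelsDir, "SmolLM2-135M.Q4_K_M.gguf")

	if err := checkInputs(os.Getenv("YZMA_LIB"), modelPath); err != nil {
		fmt.Println("preflight failed:", err)
		return
	}
	fmt.Println("preflight ok")
}
```

Failing early with a readable message is usually easier to debug than a failure inside the first library call.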

Install yzma, then download the model using the yzma command line tool:

$ yzma model get -u https://huggingface.co/QuantFactory/SmolLM2-135M-GGUF/resolve/main/SmolLM2-135M.Q4_K_M.gguf

And run the Go program:

$ go run ./examples/hello/

"Yes, I'm ready to go."

Installation

You can use the convenient yzma command line tool to download the llama.cpp prebuilt libraries for your platform. You can also have your application self-download them automatically at installation time, including auto-detection for CUDA and ROCm.

See INSTALL.md for installation instructions for macOS, Linux, and Windows.

We also have specific information on running yzma on Raspberry Pi, NVIDIA Jetson Orin, and the Arduino UNO Q.

Examples

We have several examples of how you can use yzma in our examples directory.

Vision Language Model (VLM) Multimodal Example

This example uses the Qwen2.5-VL-3B-Instruct-Q8_0 VLM model to process both a text prompt and an image, then displays the result.

$ go run ./examples/vlm/ -model ~/models/Qwen2.5-VL-3B-Instruct-Q8_0.gguf -mmproj ~/models/mmproj-Qwen2.5-VL-3B-Instruct-Q8_0.gguf -image ./images/domestic_llama.jpg -p "What is in this picture?"

The image features a white llama standing in a fenced-in area, possibly a zoo or a farm. The llama is positioned in the center of the image, with its body facing the right side. The fenced area is surrounded by trees, creating a natural environment for the llama.

See the code in the examples/vlm directory.

Small Language Model (SLM) Interactive Chat Example

You can use yzma to do inference on text language models. This example uses the qwen2.5-0.5b-instruct-fp16.gguf model for an interactive chat session.

$ go run ./examples/chat/ -model ./models/qwen2.5-0.5b-instruct-fp16.gguf

Enter prompt: Are you ready to go?

Yes, I'm ready to go! What would you like to do?

Enter prompt: Let's go to the zoo

Great! Let's go to the zoo. What would you like to see?

Enter prompt: I want to feed the llama

Sure! Let's go to the zoo and feed the llama. What kind of llama are you interested in feeding?

See the code in the examples/chat directory.

Additional Examples

See the examples directory for more examples of how to use yzma.

yzma in action

Who is using yzma? Check out some of the tools, applications, examples, blog posts, and videos!

Models

yzma uses models in the GGUF format supported by llama.cpp. There are many models in GGUF format on Hugging Face (over 181k at last count):

https://huggingface.co/models?library=gguf&sort=trending

You can use the yzma command to download models for you!

For example, this downloads the gemma-3-1b-it-GGUF model:

$ yzma model get -u https://huggingface.co/ggml-org/gemma-3-1b-it-GGUF/resolve/main/gemma-3-1b-it-Q4_K_M.gguf

Check out the Model Usage page for more information.
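
If your program needs to know which file a download URL refers to, the name can be taken from the last path segment of the URL. The sketch below assumes the downloaded file keeps that name, which matches the hello example (its `modelFile` equals the last segment of the download URL) but is not confirmed here for every case. `ggufFileName` is a hypothetical helper, not a yzma function.

```go
package main

import (
	"fmt"
	"net/url"
	"path"
	"strings"
)

// ggufFileName returns the final path segment of a model URL and checks
// that it names a GGUF file. Query strings are ignored.
func ggufFileName(rawURL string) (string, error) {
	u, err := url.Parse(rawURL)
	if err != nil {
		return "", err
	}
	name := path.Base(u.Path)
	if !strings.HasSuffix(strings.ToLower(name), ".gguf") {
		return "", fmt.Errorf("%q does not look like a GGUF file", name)
	}
	return name, nil
}

func main() {
	const modelURL = "https://huggingface.co/ggml-org/gemma-3-1b-it-GGUF/resolve/main/gemma-3-1b-it-Q4_K_M.gguf"

	name, err := ggufFileName(modelURL)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(name) // prints "gemma-3-1b-it-Q4_K_M.gguf"
}
```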

Support

yzma currently has support for over 91% of llama.cpp functionality. See ROADMAP.md for the complete list.

You can use multimodal models (image/audio) and text language models with full hardware acceleration on Linux, macOS, and Windows.

OS	CPU	GPU
Linux	amd64, arm64	CUDA, Vulkan, HIP, ROCm, SYCL
macOS	arm64	Metal
Windows	amd64	CUDA, Vulkan, HIP, SYCL, OpenCL
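
The table above can be turned into a small lookup keyed by Go's `GOOS` and `GOARCH` values, for example to report at startup which GPU backends are listed for the current platform. This is a sketch under one assumption: the table lists GPU backends per OS, and the map below simply repeats them for every CPU architecture on that OS. `gpuBackends` and `backendsFor` are hypothetical names.

```go
package main

import (
	"fmt"
	"runtime"
)

// gpuBackends mirrors the support table, keyed by "GOOS/GOARCH".
// macOS is "darwin" in Go's naming.
var gpuBackends = map[string][]string{
	"linux/amd64":   {"CUDA", "Vulkan", "HIP", "ROCm", "SYCL"},
	"linux/arm64":   {"CUDA", "Vulkan", "HIP", "ROCm", "SYCL"},
	"darwin/arm64":  {"Metal"},
	"windows/amd64": {"CUDA", "Vulkan", "HIP", "SYCL", "OpenCL"},
}

// backendsFor returns the listed GPU backends for a platform, and false
// when the platform does not appear in the table.
func backendsFor(goos, goarch string) ([]string, bool) {
	b, ok := gpuBackends[goos+"/"+goarch]
	return b, ok
}

func main() {
	if b, ok := backendsFor(runtime.GOOS, runtime.GOARCH); ok {
		fmt.Printf("%s/%s: listed GPU backends %v\n", runtime.GOOS, runtime.GOARCH, b)
	} else {
		fmt.Printf("%s/%s is not in the support table\n", runtime.GOOS, runtime.GOARCH)
	}
}
```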

Whenever there is a new release of llama.cpp, the tests for yzma are run automatically. This helps us stay up to date with the latest code and models.

Required versions of `llama.cpp`

Sometimes there are breaking changes to llama.cpp that require an update to yzma. Here are some of the known compatible versions:

llama.cpp	yzma
? - b8864	v1.12.0
b8865 - b9179	v1.13.0
b9180 - b9459	v1.14.1
b9460 - b9540	v1.15.0
b9541 - b9548	v1.16.0
b9549 - b9561	v1.16.1
b9562 - b9611	v1.17.0
b9616+	v1.17.1
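
As an illustration, the table can be encoded as a lookup from a llama.cpp build tag to the yzma version listed for it. The sketch below copies the ranges exactly as they appear above. Two points are assumptions: the unknown lower bound ("?") is treated as build 0, and builds that fall between listed ranges (b9612 to b9615) are reported as unlisted rather than guessed. `yzmaFor` is a hypothetical helper.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// compatRange is one table row. Bounds are inclusive llama.cpp build
// numbers; last < 0 means the range is open-ended.
type compatRange struct {
	first, last int
	yzma        string
}

var compat = []compatRange{
	{0, 8864, "v1.12.0"}, // lower bound is "?" in the table; 0 is an assumption
	{8865, 9179, "v1.13.0"},
	{9180, 9459, "v1.14.1"},
	{9460, 9540, "v1.15.0"},
	{9541, 9548, "v1.16.0"},
	{9549, 9561, "v1.16.1"},
	{9562, 9611, "v1.17.0"},
	{9616, -1, "v1.17.1"},
}

// yzmaFor maps a tag such as "b9000" to the yzma version listed for it.
func yzmaFor(tag string) (string, error) {
	n, err := strconv.Atoi(strings.TrimPrefix(tag, "b"))
	if err != nil || n < 0 {
		return "", fmt.Errorf("invalid llama.cpp tag %q", tag)
	}
	for _, r := range compat {
		if n >= r.first && (r.last < 0 || n <= r.last) {
			return r.yzma, nil
		}
	}
	return "", fmt.Errorf("no yzma version listed for %s", tag)
}

func main() {
	for _, tag := range []string{"b9000", "b9613", "b9700"} {
		v, err := yzmaFor(tag)
		if err != nil {
			fmt.Println(tag, "->", err)
			continue
		}
		fmt.Println(tag, "->", v)
	}
}
```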

Benchmarks

yzma is fast because it calls llama.cpp in the same process. No external servers needed!

For example, here is the Qwen3-VL-2B-Instruct Vision Language Model (VLM) performing multimodal inference on an image and a text prompt, running on an Apple M4 Max with 128 GB RAM:

$ go test -run none -benchtime=10s -count=5 -bench BenchmarkMultimodalInference
goos: darwin
goarch: arm64
pkg: github.com/hybridgroup/yzma/pkg/mtmd
cpu: Apple M4 Max
BenchmarkMultimodalInference-16    10    1577948683 ns/op    788.9 tokens/s
BenchmarkMultimodalInference-16    12    1243692014 ns/op    910.8 tokens/s
BenchmarkMultimodalInference-16     7    1654741804 ns/op    737.2 tokens/s
BenchmarkMultimodalInference-16     7    1568106947 ns/op    771.9 tokens/s
BenchmarkMultimodalInference-16    10    1704669371 ns/op    706.1 tokens/s
PASS
ok      github.com/hybridgroup/yzma/pkg/mtmd    76.644s
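
To summarize the five runs above, a short sketch can pull the `tokens/s` values out of the benchmark output and average them. It relies only on the standard library and on the field layout shown in this output (a number immediately followed by the unit `tokens/s`). `tokensPerSecond` and `mean` are hypothetical helpers.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// tokensPerSecond returns every numeric field in the benchmark output that
// is immediately followed by the unit "tokens/s".
func tokensPerSecond(output string) []float64 {
	var rates []float64
	fields := strings.Fields(output)
	for i := 1; i < len(fields); i++ {
		if fields[i] != "tokens/s" {
			continue
		}
		if v, err := strconv.ParseFloat(fields[i-1], 64); err == nil {
			rates = append(rates, v)
		}
	}
	return rates
}

// mean returns the arithmetic mean, or 0 for an empty slice.
func mean(xs []float64) float64 {
	if len(xs) == 0 {
		return 0
	}
	sum := 0.0
	for _, x := range xs {
		sum += x
	}
	return sum / float64(len(xs))
}

func main() {
	// Result lines copied from the benchmark run above.
	const output = `
BenchmarkMultimodalInference-16 10 1577948683 ns/op 788.9 tokens/s
BenchmarkMultimodalInference-16 12 1243692014 ns/op 910.8 tokens/s
BenchmarkMultimodalInference-16 7 1654741804 ns/op 737.2 tokens/s
BenchmarkMultimodalInference-16 7 1568106947 ns/op 771.9 tokens/s
BenchmarkMultimodalInference-16 10 1704669371 ns/op 706.1 tokens/s
`
	rates := tokensPerSecond(output)
	fmt.Printf("runs: %d, mean: %.1f tokens/s\n", len(rates), mean(rates))
	// prints "runs: 5, mean: 783.0 tokens/s"
}
```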

Want to see more benchmarks? Take a look at the BENCHMARKS.md document.

More Info

yzma is now ready to be used to build complete applications that incorporate language models directly into your Golang code.

Here are some advantages of yzma with llama.cpp:

Compile Go programs that use yzma with the normal go build and go run commands. No C compiler needed!
Use the llama.cpp libraries with whatever hardware acceleration is available for your configuration, such as CUDA or Vulkan.
High performance from making function calls from within the same process. No external model servers!
Download llama.cpp precompiled libraries directly from GitHub, or include them with your application.
Update the llama.cpp libraries without recompiling your Go program, as long as llama.cpp does not make any breaking changes.

The idea is to make it easier for Go developers to use language models as part of "normal" applications, without having to use containers or do anything beyond setting the usual GOOS and GOARCH environment variables for cross-compilation.

yzma originally started with definitions from the https://github.com/dianlight/gollama.cpp package and has since modified them rather heavily. Thank you!