GitHub - unslothai/llama.cpp: LLM inference in C/C++ (original) (raw)

llama

License: MIT Release Server Docker Winget

Manifesto / ggml / ops

LLM inference in C/C++

Recent API changes

Hot topics


Quick start

Getting started with llama.cpp is straightforward. Here are several ways to install it on your machine:

Once installed, you'll need a model to work with. Head to the Obtaining and quantizing models section to learn more.

Example command:

Use a local model file

llama-cli -m my_model.gguf

Or download and run a model directly from Hugging Face

llama-cli -hf ggml-org/gemma-3-1b-it-GGUF

Launch OpenAI-compatible API server

llama-server -hf ggml-org/gemma-3-1b-it-GGUF

Description

The main goal of llama.cpp is to enable LLM inference with minimal setup and state-of-the-art performance on a wide range of hardware - locally and in the cloud.

The llama.cpp project is the main playground for developing new features for the ggml library.

Models

Typically finetunes of the base models below are supported as well.

Instructions for adding support for new models: HOWTO-add-model.md

Text-only

Multimodal

(to have a project listed here, it should clearly state that it depends on llama.cpp)

Supported backends

Backend Target devices
Metal Apple Silicon
BLAS All
BLIS All
SYCL Intel GPU
OpenVINO [In Progress] Intel CPUs, GPUs, and NPUs
MUSA Moore Threads GPU
CUDA Nvidia GPU
HIP AMD GPU
ZenDNN AMD CPU
Vulkan GPU
CANN Ascend NPU
OpenCL Adreno GPU
IBM zDNN IBM Z & LinuxONE
WebGPU All
RPC All
Hexagon [In Progress] Snapdragon
VirtGPU VirtGPU APIR

Obtaining and quantizing models

The Hugging Face platform hosts a number of LLMs compatible with llama.cpp:

You can either manually download the GGUF file or directly use any llama.cpp-compatible models from Hugging Face or other model hosting sites, by using this CLI argument: -hf <user>/<model>[:quant]. For example:

llama-cli -hf ggml-org/gemma-3-1b-it-GGUF

By default, the CLI would download from Hugging Face, you can switch to other options with the environment variable MODEL_ENDPOINT. The MODEL_ENDPOINT must point to a Hugging Face compatible API endpoint.

After downloading a model, use the CLI tools to run it locally - see below.

llama.cpp requires the model to be stored in the GGUF file format. Models in other data formats can be converted to GGUF using the convert_*.py Python scripts in this repo.

The Hugging Face platform provides a variety of online tools for converting, quantizing and hosting models with llama.cpp:

To learn more about model quantization, read this documentation

llama-cli

A CLI tool for accessing and experimenting with most of llama.cpp's functionality.

> hi, who are you?

Hi there! I'm your helpful assistant! I'm an AI-powered chatbot designed to assist and provide information to users like you. I'm here to help answer your questions, provide guidance, and offer support on a wide range of topics. I'm a friendly and knowledgeable AI, and I'm always happy to help with anything you need. What's on your mind, and how can I assist you today?

> what is 1+1?

Easy peasy! The answer to 1+1 is... 2!

use the "chatml" template (use -h to see the list of supported templates)

llama-cli -m model.gguf -cnv --chat-template chatml

use a custom template

llama-cli -m model.gguf -cnv --in-prefix 'User: ' --reverse-prompt 'User:'

{"appointmentTime": "8pm", "appointmentDetails": "schedule a a call"}

The grammars/ folder contains a handful of sample grammars. To write your own, check out the GBNF Guide.
For authoring more complex JSON grammars, check out https://grammar.intrinsiclabs.ai/

llama-server

A lightweight, OpenAI API compatible, HTTP server for serving LLMs.

Basic web UI can be accessed via browser: http://localhost:8080

Chat completion endpoint: http://localhost:8080/v1/chat/completions

up to 4 concurrent requests, each with 4096 max context

llama-server -m model.gguf -c 16384 -np 4

the draft.gguf model should be a small variant of the target model.gguf

llama-server -m model.gguf -md draft.gguf

use the /embedding endpoint

llama-server -m model.gguf --embedding --pooling cls -ub 8192

use the /reranking endpoint

llama-server -m model.gguf --reranking

custom grammar

llama-server -m model.gguf --grammar-file grammar.gbnf

JSON

llama-server -m model.gguf --grammar-file grammars/json.gbnf

llama-perplexity

A tool for measuring the perplexity 1 (and other quality metrics) of a model over a given text.

[1]15.2701,[2]5.4007,[3]5.3073,[4]6.2965,[5]5.8940,[6]5.6096,[7]5.7942,[8]4.9297, ...

Final estimate: PPL = 5.4007 +/- 0.67339

llama-bench

Benchmark the performance of the inference for various parameters.

Output:

| model | size | params | backend | threads | test | t/s |

| ------------------- | ---------: | ---------: | ---------- | ------: | ------------: | -------------------: |

| qwen2 1.5B Q4_0 | 885.97 MiB | 1.54 B | Metal,BLAS | 16 | pp512 | 5765.41 Β± 20.55 |

| qwen2 1.5B Q4_0 | 885.97 MiB | 1.54 B | Metal,BLAS | 16 | tg128 | 197.71 Β± 0.81 |

build: 3e0ba0e60 (4229)

llama-simple

A minimal example for implementing apps with llama.cpp. Useful for developers.

Hello my name is Kaitlyn and I am a 16 year old girl. I am a junior in high school and I am currently taking a class called "The Art of

Contributing

Other documentation

Development documentation

Seminal papers and background on the models

If your issue is with model generation quality, then please at least scan the following links and papers to understand the limitations of LLaMA models. This is especially important when choosing an appropriate model size and appreciating both the significant and subtle differences between LLaMA models and ChatGPT:

XCFramework

The XCFramework is a precompiled version of the library for iOS, visionOS, tvOS, and macOS. It can be used in Swift projects without the need to compile the library from source. For example:

// swift-tools-version: 5.10 // The swift-tools-version declares the minimum version of Swift required to build this package.

import PackageDescription

let package = Package( name: "MyLlamaPackage", targets: [ .executableTarget( name: "MyLlamaPackage", dependencies: [ "LlamaFramework" ]), .binaryTarget( name: "LlamaFramework", url: "https://github.com/ggml-org/llama.cpp/releases/download/b5046/llama-b5046-xcframework.zip", checksum: "c19be78b5f00d8d29a25da41042cb7afa094cbf6280a225abe614b03b20029ab" ) ] )

The above example is using an intermediate build b5046 of the library. This can be modified to use a different version by changing the URL and checksum.

Completions

Command-line completion is available for some environments.

Bash Completion

$ build/bin/llama-cli --completion-bash > ~/.llama-completion.bash $ source ~/.llama-completion.bash

Optionally this can be added to your .bashrc or .bash_profile to load it automatically. For example:

$ echo "source ~/.llama-completion.bash" >> ~/.bashrc

Dependencies

  1. https://huggingface.co/docs/transformers/perplexity ↩