llama.cpp - Qwen

Before starting, let’s first discuss what llama.cpp is, what you should expect, and why we say “use” llama.cpp, with “use” in quotes. llama.cpp is essentially a different ecosystem with a different design philosophy: it targets a lightweight footprint, minimal external dependencies, multi-platform operation, and extensive, flexible hardware support.

It’s like the Python frameworks torch+transformers or torch+vllm, but in C++. However, this difference is crucial.

To use llama.cpp means using the llama.cpp library in your own program, as the developers of Ollama, LM Studio, GPT4All, llamafile, etc. do. But that’s not what this guide intends to do, nor could it. Instead, here we introduce how to use the llama-cli example program, in the hope that you learn that llama.cpp does support Qwen3 models and how the llama.cpp ecosystem generally works.

In this guide, we will show how to “use” llama.cpp to run models on your local machine, in particular with the llama-cli and llama-server example programs, which come with the library.

The main steps are:

  1. Get the programs
  2. Get the Qwen3 models in GGUF[1] format
  3. Run the program with the model

Note

llama.cpp supports Qwen3 and Qwen3MoE from version b5092.

Getting the Program

You can get the programs in various ways. For optimal efficiency, we recommend compiling the programs locally, so you get the CPU optimizations for free. However, if you don’t have a C++ compiler locally, you can also install them using package managers or download pre-built binaries. These could be less efficient, but they are fine for non-production, example use.

Compile Locally

Here, we show the basic command to compile llama-cli locally on macOS or Linux. For Windows or GPU users, please refer to the guide from llama.cpp.

Installing Build Tools

To build locally, a C++ compiler and a build system tool are required. To see if they have been installed already, type cc --version or cmake --version in a terminal window.
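If either command is not found, install the tools first. As a rough sketch (assuming a Debian/Ubuntu-based Linux distribution or macOS; adjust for your system):

# Debian/Ubuntu-based Linux: compiler and CMake from apt
sudo apt install build-essential cmake

# macOS: Apple Clang from the Xcode Command Line Tools, CMake from Homebrew
xcode-select --install
brew install cmake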

Compiling the Program

For the first step, clone the repo and enter the directory:

git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp

Then, build llama.cpp using CMake:

cmake -B build
cmake --build build --config Release

The first command will check the local environment and determine which backends and features should be included. The second command will actually build the programs.
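If you have a supported GPU and want its backend included, that selection happens in the first, configure step. As an illustration only (the exact option depends on your backend; see the llama.cpp build guide), a CUDA-enabled build is typically configured like this:

# assumes the CUDA toolkit is installed; GGML_CUDA enables the CUDA backend
cmake -B build -DGGML_CUDA=ON
cmake --build build --config Release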

To shorten the time, you can also enable parallel compiling based on the CPU cores you have, for example:

cmake --build build --config Release -j 8

This will build the programs with 8 parallel compiling jobs.

The built programs will be in ./build/bin/.
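To quickly verify the build, you can print the version of one of the programs, which should report build b5092 or later for Qwen3 support:

# prints the llama.cpp build number and the compiler it was built with
./build/bin/llama-cli --version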

Package Managers

For macOS and Linux users, llama-cli and llama-server can be installed with package managers including Homebrew, Nix, and Flox.

Here, we show how to install llama-cli and llama-server with Homebrew. For other package managers, please check the instructions here.

Installing with Homebrew is very simple:

  1. Ensure that Homebrew is available on your operating system. If you don’t have Homebrew, you can install it following the instructions on its website.
  2. Install the pre-built binaries, llama-cli and llama-server included, with a single command, shown below:
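Assuming the standard Homebrew formula name, that command is:

# installs the llama.cpp binaries, including llama-cli and llama-server
brew install llama.cpp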

Note that the installed binaries might not be built with the optimal compile options for your hardware, which can lead to poor performance. They also don’t support GPU on Linux systems.

Binary Release

You can also download pre-built binaries from GitHub Releases. Please note that those pre-built binary files are architecture-, backend-, and OS-specific. If you are not sure what those mean, you probably don’t want to use them; running an incompatible version will most likely fail or lead to poor performance.

The file name is like llama-<version>-bin-<os>-<feature>-<arch>.zip.

There are three simple parts: <version>, <os>, and <arch>.

The <feature> part is somewhat complicated for Windows, as it encodes the CPU instruction set or GPU backend the build targets.

You don’t have much choice for macOS or Linux.

After downloading the .zip file, unzip it into a directory and open a terminal at that directory.
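For example, assuming a downloaded file matching the pattern above (substitute the actual file name):

# extract the archive into its own directory and work from there
unzip llama-<version>-bin-<os>-<feature>-<arch>.zip -d llama-cpp
cd llama-cpp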

Getting the GGUF

GGUF[1] is a file format for storing information needed to run a model, including but not limited to model weights, model hyperparameters, default generation configuration, and tokenizer.

You can use the official Qwen GGUFs from our Hugging Face Hub or prepare your own GGUF file.

Using the Official Qwen3 GGUFs

We provide a series of GGUF models in our Hugging Face organization; to find what you need, you can search for repo names ending with -GGUF.

Download the GGUF model that you want with huggingface-cli (you need to install it first with pip install huggingface_hub):

huggingface-cli download <repo_id> <gguf_file> --local-dir <local_dir>

For example:

huggingface-cli download Qwen/Qwen3-8B-GGUF qwen3-8b-q4_k_m.gguf --local-dir .

This will download the Qwen3-8B model in GGUF format quantized with the scheme Q4_K_M.

Preparing Your Own GGUF

Model files from Hugging Face Hub can be converted to GGUF using the convert_hf_to_gguf.py Python script. It does require a working Python environment with at least transformers installed.

Obtain the source code if you haven’t already:

git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp

Suppose you would like to use Qwen3-8B; you can make a GGUF file for the fp16 model as shown below:

python convert_hf_to_gguf.py Qwen/Qwen3-8B --outfile qwen3-8b-f16.gguf

The first argument to the script refers to the path to the HF model directory or the HF model name, and the second argument refers to the path of your output GGUF file. Remember to create the output directory before you run the command.
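If the script cannot resolve the Hugging Face model name directly in your environment, one workaround (a sketch, assuming huggingface-cli is installed as shown earlier) is to download the checkpoint to a local directory first and convert from there:

# download the original checkpoint, then convert the local directory
huggingface-cli download Qwen/Qwen3-8B --local-dir ./Qwen3-8B
python convert_hf_to_gguf.py ./Qwen3-8B --outfile qwen3-8b-f16.gguf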

The fp16 model could be a bit heavy to run locally, so you can quantize the model as needed. We introduce how to create and quantize GGUF files in this guide; you can refer to that document for more information.
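As a quick illustration (a sketch; refer to that guide for the full details), quantizing the fp16 GGUF to the Q4_K_M scheme with the llama-quantize program built earlier looks like this:

# re-quantize the fp16 GGUF to 4-bit Q4_K_M
./build/bin/llama-quantize qwen3-8b-f16.gguf qwen3-8b-q4_k_m.gguf Q4_K_M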

Run Qwen with llama.cpp

Note

Regarding switching between thinking and non-thinking modes, while the soft switch is always available, the hard switch implemented in the chat template is not exposed in llama.cpp. The quick workaround is to pass, via --chat-template-file, a custom chat template that is equivalent to always setting enable_thinking=False.

llama-cli

llama-cli is a console program which can be used to chat with LLMs. Simply run the following command from the directory where the llama.cpp programs are located:

./llama-cli -hf Qwen/Qwen3-8B-GGUF:Q8_0 --jinja --color -ngl 99 -fa -sm row --temp 0.6 --top-k 20 --top-p 0.95 --min-p 0 -c 40960 -n 32768 --no-context-shift

Here are some explanations of the above command:

llama-server

llama-server is a simple HTTP server, including a set of LLM REST APIs and a simple web front end to interact with LLMs using llama.cpp.

The core command is similar to that of llama-cli. In addition, it supports thinking content parsing and tool call parsing.

./llama-server -hf Qwen/Qwen3-8B-GGUF:Q8_0 --jinja --reasoning-format deepseek -ngl 99 -fa -sm row --temp 0.6 --top-k 20 --top-p 0.95 --min-p 0 -c 40960 -n 32768 --no-context-shift

By default, the server will listen at http://localhost:8080, which can be changed by passing --host and --port. The web front end can be accessed from a browser at http://localhost:8080/. The OpenAI-compatible API is at http://localhost:8080/v1/.
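For example, you can send a chat request to the OpenAI-compatible endpoint with curl (a minimal sketch; the sampling fields are optional):

# send one chat turn to the running llama-server instance
curl http://localhost:8080/v1/chat/completions -H "Content-Type: application/json" -d '{
  "messages": [
    {"role": "user", "content": "Give me a short introduction to large language models."}
  ],
  "temperature": 0.6,
  "top_p": 0.95
}'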

What’s More

If you still find it difficult to use llama.cpp, don’t worry, just check out other llama.cpp-based applications. For example, Qwen3 is already officially supported by Ollama and LM Studio, which are platforms for you to search for and run local LLMs.

Have fun!