Kimi K2.7 Code - How to Run Locally | Unsloth Documentation

🌘Kimi K2.7 Code - How to Run Locally

Step-by-step guide to running Kimi K2.7 Code on your own local device.

Kimi K2.7 Code is Moonshot AI’s agentic coding model, building on K2.6 to improve task completion while using ~30% fewer thinking tokens. The 1T-parameter (32B active) MoE model is thinking-only and supports vision and a 256K context. It delivers SOTA open-model performance across vision, coding, agentic, long-context, and chat tasks. Full precision requires 605GB of disk space; Unsloth Dynamic 2-bit requires 325GB (-48%). Run Kimi-K2.7-Code-GGUF via Unsloth Studio or llama.cpp.

Unsloth Dynamic quants upcast important layers to 8-bit, and the 1-bit quant needs a setup with 310GB+ of combined VRAM/RAM. For lossless Kimi K2.7 Code, use Q8 (UD-Q8_K_XL), which is only 10GB larger than Q4 (UD-Q4_K_XL). You can run Kimi K2.7 Code on a Mac Studio or DGX Station.

Table: Hardware requirements (units = total memory: RAM + VRAM, or unified memory)

Like Kimi-K2.6, Kimi K2.7 Code uses int4 for MoE weights and BF16 for everything else. UD-Q8_K_XL follows that layout, which is why it is 'truly lossless', and we use the same Dynamic methodology as for the Kimi-K2.6 conversion. UD-Q4_K_XL is similar except that the remaining tensors are Q8_0, so it is near full precision and requires 600GB RAM/VRAM.

Following jukofyork's finding, we use const float d = max / -7; instead of the default const float d = max / -8; during quantization, but only on the MoE layers. This bijection patch on INT4-native MoEs lets the Q4_0 quant type reduce absolute error from 1.8% to near 0% (epsilon). For example, below is the histogram for Kimi-K2.7-Code, where you can see that -8 is entirely unused:

Note that we must also keep the other layers in BF16 rather than quantizing them to Q4_0. We show below the error plots for both versus the BF16 baseline. UD-Q8_K_XL is truly "lossless", with only a machine-epsilon difference when converting Q4_0 to BF16. Q4_K_XL does have some quantization error because it uses Q8_0, whilst Q8_K_XL is nearly lossless except for BF16 rounding.

For Q4_K_XL, we also plot the per-tensor error of Q8_0 vs BF16. In general there is some error between Q8_K_XL (near lossless) and Q4_K_XL, but not much.
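To make the effect of the divisor concrete, here is a toy NumPy sketch of a Q4_0-style round trip on a single 32-value block. It assumes the block already sits on a symmetric int4 grid (codes -7 to 7, with -8 unused). The helper name q4_0_roundtrip, the scale value and the random block are illustrative choices, and the code only approximates llama.cpp's reference quantizer; it is not the Unsloth conversion itself.

```python
import numpy as np


def q4_0_roundtrip(block, divisor):
    """Quantize one block Q4_0-style, then dequantize it again."""
    # Signed value with the largest magnitude in the block.
    vmax = block[np.argmax(np.abs(block))]
    d = vmax / divisor  # divisor is -8 (default) or -7 (patched)
    inv_d = 1.0 / d if d != 0 else 0.0
    # Shift into the 4-bit code range, truncate, and clamp the top code at 15.
    q = np.minimum(15, np.trunc(block * inv_d + 8.5)).astype(np.int64)
    return (q - 8) * d


rng = np.random.default_rng(0)
scale = 0.0123                        # hypothetical per-block scale
codes = rng.integers(-7, 8, size=32)  # symmetric int4 codes, -8 never used
codes[0] = 7                          # make sure the block spans the full range
block = codes * scale


def rel_err(divisor):
    """Mean absolute round-trip error relative to the mean magnitude."""
    restored = q4_0_roundtrip(block, divisor)
    return np.abs(restored - block).mean() / np.abs(block).mean()


err_default = rel_err(-8.0)
err_patched = rel_err(-7.0)
print(f"d = max / -8: {err_default:.4%}")
print(f"d = max / -7: {err_patched:.4%}")
```

With the default divisor the step size becomes 7/8 of the source step, so the source levels no longer line up with the available codes. With -7 each source level maps to its own code and only floating-point rounding remains. The percentages printed for this toy block are not the 1.8% figure quoted above, which refers to the real tensors.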

Kimi K2.7 Code is thinking-only, with preserve_thinking always enabled. Instant mode is not supported.

If the model fits, you will get >100 tokens/s when using B200s. We recommend UD-Q2_K_XL (345GB) as a good size/quality balance. Best rule of thumb: RAM+VRAM ≈ the quant size; otherwise it’ll still work, just slower due to offloading.
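The rule of thumb can be written as a one-line check. This is a minimal sketch: the function name fits_in_memory and the example machine sizes are hypothetical, and it ignores context (KV cache) and operating-system overhead, so a True result is only a lower bound.

```python
def fits_in_memory(quant_gb: float, ram_gb: float, vram_gb: float = 0.0) -> bool:
    """Return True when combined RAM + VRAM covers the quant's size."""
    return ram_gb + vram_gb >= quant_gb


UD_Q2_K_XL_GB = 345  # size quoted in this guide

# Hypothetical machines: 256GB RAM with either 96GB or 48GB of VRAM.
print(fits_in_memory(UD_Q2_K_XL_GB, ram_gb=256, vram_gb=96))  # 352GB total
print(fits_in_memory(UD_Q2_K_XL_GB, ram_gb=256, vram_gb=48))  # 304GB total
```

When the check fails, the model can still run, but part of it is offloaded and generation is slower.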

Chat Template for Kimi K2.7-Code

Running tokenizer.apply_chat_template([{"role": "user", "content": "What is 1+1?"},]) gets:

If we also input tools as referenced in Tool Calling Guide, then we see the below:
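Separately from the real template, the single-message case is plain string concatenation around special tokens. The sketch below rebuilds the rendered prompt that this page lists for the "What is 1+1?" example. It is not the model's actual Jinja template: render_user_turns is a hypothetical helper, it handles user messages only, and its behaviour for more than one user turn is an extrapolation from that single example.

```python
def render_user_turns(messages):
    """Rebuild the prompt string for a user-only conversation."""
    parts = []
    for message in messages:
        if message["role"] != "user":
            # System, assistant and tool turns are out of scope for this sketch.
            raise ValueError(f"unsupported role: {message['role']}")
        parts.append(
            f"<|im_user|>user<|im_middle|>{message['content']}<|im_end|>"
        )
    # Generation prompt: open the assistant turn and start the thinking block.
    parts.append("<|im_assistant|>assistant<|im_middle|><think>")
    return "".join(parts)


prompt = render_user_turns([{"role": "user", "content": "What is 1+1?"}])
print(prompt)
```

The trailing <think> reflects that the model is thinking-only: the prompt always ends with an open thinking block. For real use, prefer tokenizer.apply_chat_template, which also covers tools and other roles.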

🦥 Run Kimi-K2.7-Code in Unsloth Studio

Kimi K2.7 Code can run in Unsloth Studio, an open-source web UI for local AI. Unsloth Studio automatically offloads to RAM and detects multiGPU setups. With Unsloth Studio, you can run models locally on MacOS, Windows and Linux.

Install and Launch Unsloth

To install, run in your terminal:

MacOS, Linux, WSL:

Windows PowerShell:

Launch Unsloth

MacOS, Linux, WSL and Windows:

Then open http://127.0.0.1:8888 (or your specific URL) in your browser.

Search and download Kimi K2.7-Code

On first launch you will need to create a password to secure your account; you will use it to sign in again later.

Then go to the Studio Chat tab, search for Kimi-K2.7 Code in the search bar, and download your desired model and quant. Ensure you have enough compute to run the model.

Run Kimi-K2.7-Code

Inference parameters should be auto-set when using Unsloth Studio; however, you can still change them manually. You can also edit the context length, chat template and other settings.

For more information, you can view our Unsloth Studio inference guide.

Example of Qwen3.6 running with tool-calling

🦙 Run Kimi K2.7 Code in llama.cpp

For this guide we'll run the UD-Q2_K_XL quant, which requires at least 345GB of RAM. Feel free to change the quantization type. GGUF: Kimi-K2.7-Code-GGUF

For these tutorials, we will be using llama.cpp for fast local inference, especially if you have a CPU.

Obtain the latest llama.cpp on GitHub here. You can follow the build instructions below as well. Change -DGGML_CUDA=ON to -DGGML_CUDA=OFF if you don't have a GPU or just want CPU inference. For Apple Mac / Metal devices, set -DGGML_CUDA=OFF then continue as usual - Metal support is on by default.
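The flag choice described above can be summarised in a small helper that prints the configure command for the current machine. This is only a sketch: cmake_configure_args is a hypothetical name, and treating the presence of nvidia-smi on the PATH as "has an NVIDIA GPU" is an assumption, not something llama.cpp itself does.

```python
import platform
import shutil


def cmake_configure_args(system: str, has_nvidia_gpu: bool) -> list[str]:
    """Build the cmake configure command with GGML_CUDA set for the platform."""
    # Apple devices use Metal (on by default), so CUDA stays off there.
    use_cuda = has_nvidia_gpu and system != "Darwin"
    return [
        "cmake", "llama.cpp", "-B", "llama.cpp/build",
        "-DBUILD_SHARED_LIBS=OFF",
        f"-DGGML_CUDA={'ON' if use_cuda else 'OFF'}",
    ]


# Heuristic GPU detection (assumption): nvidia-smi is available on the PATH.
detected_gpu = shutil.which("nvidia-smi") is not None
print(" ".join(cmake_configure_args(platform.system(), detected_gpu)))
```

Set has_nvidia_gpu to False explicitly if you want CPU-only inference on a machine that does have a GPU.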

You can now use llama.cpp directly to download and load models, just like ollama run. First, select the quantization type you want, such as Q2_K_XL. You can also use export LLAMA_CACHE="folder" to force llama.cpp to save to a specific location. Note that this download can be very slow, so it is probably best to use the manual download process in the next section.

If you want to download the model manually, we can download the model via the code below (after running pip install huggingface_hub). If downloads get stuck, see: Hugging Face Hub, XET debugging
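This page's download command uses the hf CLI. As an alternative sketch, the same file selection can be expressed from Python with huggingface_hub's snapshot_download. The helper names patterns_for and download_quant are hypothetical; the repository name, target folder and include patterns are copied from the CLI command on this page.

```python
def patterns_for(quant: str = "UD-Q2_K_XL") -> list[str]:
    """File patterns to fetch: the vision projector plus one quant's shards."""
    return ["*mmproj-F16*", f"*{quant}*"]


def download_quant(quant: str = "UD-Q2_K_XL",
                   repo_id: str = "unsloth/Kimi-K2.7-Code-GGUF") -> str:
    """Download the selected files and return the local folder path."""
    # Imported lazily so the helper above works without the package installed.
    from huggingface_hub import snapshot_download

    return snapshot_download(
        repo_id=repo_id,
        local_dir=repo_id,  # same folder layout as the CLI example
        allow_patterns=patterns_for(quant),
    )


print(patterns_for())
```

Call download_quant() to start the transfer, or download_quant("UD-Q8_K_XL") for the full-precision quant. Expect it to take a long time given the file sizes.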

Then run the model in conversation mode:

Then you will see the below:

Then use /image to load both images in and ask "What is this image":

and you will get something like below:

On the 2nd image of the sloth:

Which will get you:

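See further below for benchmarks in table format:

Rendered chat template outputs. The first line is the result for the single user message; the block beginning with <|im_system|>tool_declare is the result when tools are also passed: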

<|im_user|>user<|im_middle|>What is 1+1?<|im_end|><|im_assistant|>assistant<|im_middle|><think>
<|im_system|>tool_declare<|im_middle|># Tools

## functions
namespace functions {
// Add two numbers.
type add_number = (_: {
  // The first number.
  a: string,
  // The second number.
  b: string
}) => any;
// Multiply two numbers.
type multiply_number = (_: {
  // The first number.
  a: string,
  // The second number.
  b: string
}) => any;
// Subtract two numbers.
type subtract_number = (_: {
  // The first number.
  a: string,
  // The second number.
  b: string
}) => any;
// Writes a random story.
type write_a_story = (_: {}) => any;
// Perform operations from the terminal.
type terminal = (_: {
  // The command you wish to launch, e.g `ls`, `rm`, ...
  command: string
}) => any;
// Call a Python interpreter with some Python code that will be ran.
type python = (_: {
  // The Python code to run
  code: string
}) => any;
}
<|im_end|><|im_user|>user<|im_middle|>What is 1+1?<|im_end|><|im_assistant|>assistant<|im_middle|><think>

Install Unsloth (MacOS, Linux, WSL):
curl -fsSL https://unsloth.ai/install.sh | sh

Install Unsloth (Windows PowerShell):
irm https://unsloth.ai/install.ps1 | iex

Launch Unsloth Studio (MacOS, Linux, WSL and Windows):
unsloth studio -H 0.0.0.0 -p 8888

Build llama.cpp:
apt-get update
apt-get install pciutils build-essential cmake curl libcurl4-openssl-dev -y
git clone https://github.com/ggml-org/llama.cpp
cmake llama.cpp -B llama.cpp/build \
    -DBUILD_SHARED_LIBS=OFF -DGGML_CUDA=ON
cmake --build llama.cpp/build --config Release -j --clean-first --target llama-cli llama-mtmd-cli llama-server llama-gguf-split
cp llama.cpp/build/bin/llama-* llama.cpp

Download the example images:
wget https://raw.githubusercontent.com/unslothai/unsloth/refs/heads/main/images/unsloth%20made%20with%20love.png -O unsloth.png
wget https://files.worldwildlife.org/wwfcmsprod/images/Sloth_Sitting_iStock_3_12_2014/story_full_width/8l7pbjmj29_iStock_000011145477Large_mini__1_.jpg -O picture.png

Download and run directly with llama.cpp:
export LLAMA_CACHE="unsloth/Kimi-K2.7-Code-GGUF"
./llama.cpp/llama-cli \
    -hf unsloth/Kimi-K2.7-Code-GGUF:UD-Q2_K_XL \
    --temp 1.0 \
    --top-p 0.95

Download the model manually:
hf download unsloth/Kimi-K2.7-Code-GGUF \
    --local-dir unsloth/Kimi-K2.7-Code-GGUF \
    --include "*mmproj-F16*" \
    --include "*UD-Q2_K_XL*" # Use "*UD-Q8_K_XL*" for full precision

Run the model in conversation mode:
./llama.cpp/llama-cli \
    --model unsloth/Kimi-K2.7-Code-GGUF/UD-Q2_K_XL/Kimi-K2.7-Code-UD-Q2_K_XL-00001-of-00008.gguf \
    --mmproj unsloth/Kimi-K2.7-Code-GGUF/mmproj-F16.gguf \
    --temp 1.0 \
    --top-p 0.95