GitHub - kvcache-ai/Mooncake: Mooncake is the serving platform for Kimi, a leading LLM service provided by Moonshot AI.

Mooncake is the serving platform for Kimi, a leading LLM service provided by Moonshot AI. Now both the Transfer Engine and Mooncake Store are open-sourced! This repository also hosts its technical report and the open-sourced traces.

πŸ”„ Updates

πŸŽ‰ Overview

Mooncake features a KVCache-centric disaggregated architecture that separates the prefill and decoding clusters. It also leverages the underutilized CPU, DRAM, and SSD resources of the GPU cluster to implement a disaggregated KVCache pool.

architecture

The core of Mooncake is its KVCache-centric scheduler, which balances maximizing overall effective throughput against meeting latency-related Service Level Objectives (SLOs). Unlike traditional studies that assume all requests will be processed, Mooncake faces challenges in highly overloaded scenarios. To mitigate these, we developed a prediction-based early rejection policy. Experiments show that Mooncake excels in long-context scenarios. Compared to the baseline method, Mooncake can achieve up to a 525% increase in throughput in certain simulated scenarios while adhering to SLOs. Under real workloads, Mooncake's innovative architecture enables Kimi to handle 75% more requests.

🧩 Components

**Mooncake Core Component: Transfer Engine (TE)**

The core of Mooncake is the Transfer Engine (TE), which provides a unified interface for batched data transfer across various storage devices and network links. Supporting multiple protocols including TCP, RDMA, CXL/shared memory, and NVMe over Fabrics (NVMe-oF), TE is designed to enable fast and reliable data transfer for AI workloads. Compared to Gloo (used by PyTorch Distributed) and traditional TCP, TE achieves significantly lower I/O latency, making it a superior solution for efficient data transmission.

**P2P Store and Mooncake Store**

Both P2P Store and Mooncake Store are built on the Transfer Engine and provide key/value caching for different scenarios. P2P Store focuses on sharing temporary objects (e.g., checkpoint files) across nodes in a cluster, preventing bandwidth saturation on a single machine. Mooncake Store, on the other hand, supports a distributed pooled KVCache, specifically designed for XpYd disaggregation to enhance resource utilization and system performance.

**Mooncake Integration with Leading LLM Inference Systems**

Mooncake has been seamlessly integrated with several popular large language model (LLM) inference systems. Through collaboration with the vLLM and SGLang teams, Mooncake now officially supports prefill-decode disaggregation. By leveraging the high-efficiency communication capabilities of RDMA devices, Mooncake significantly improves inference efficiency in prefill-decode disaggregation scenarios, providing robust technical support for large-scale distributed inference tasks. In addition, Mooncake has been successfully integrated with SGLang's Hierarchical KV Caching, vLLM's prefill serving, and LMCache, augmenting KV cache management capabilities across large-scale inference scenarios.

**Elastic Expert Parallelism Support**

Mooncake adds elasticity and fault-tolerance support for MoE model inference, enabling inference systems to remain responsive and recoverable in the event of GPU failures or changes in resource configuration. This functionality includes automatic faulty-rank detection and can work with the EPLB module to dynamically route tokens to healthy ranks during inference.

**Tensor-Centric Ecosystem**

Mooncake establishes a full-stack, Tensor-oriented AI infrastructure where Tensors serve as the fundamental data carrier. The ecosystem spans from the Transfer Engine, which accelerates Tensor data movement across heterogeneous storage (DRAM/VRAM/NVMe), to the P2P Store and Mooncake Store for distributed management of Tensor objects (e.g., Checkpoints and KVCache), up to the Mooncake Backend enabling Tensor-based elastic distributed computing. This architecture is designed to maximize Tensor processing efficiency for large-scale model inference and training.

πŸ–₯️ Supported Hardware

Mooncake supports heterogeneous accelerators, NICs, and specialized transport paths. The summary below focuses on runtime and transport coverage that is already exposed through build options, documented protocols, or dedicated examples in this repository.

Accelerator runtimes

| Vendor / Platform | Hardware / Runtime | Current support in Mooncake | How it is exposed |
|---|---|---|---|
| Huawei Ascend | Ascend NPUs | Supported | `-DUSE_ASCEND=ON`, `-DUSE_ASCEND_DIRECT=ON`, `-DUSE_UBSHMEM=ON`, `-DUSE_ASCEND_HETEROGENEOUS=ON`; covers HCCL transport, Ascend Direct transport, UBShmem transport, and heterogeneous Ascend-GPU transport |
| Cambricon | MLU + Neuware | Supported | `-DUSE_MLU=ON`; MLU memory detection, topology discovery, and registration reuse the standard `rdma` data path |
| Moore Threads | MUSA GPUs | Supported | `-DUSE_MUSA=ON`; accelerator-aware data transfer with MUSA runtime integration |
| MetaX (Muxi) | MACA GPUs | Supported | `-DUSE_MACA=ON`; source build support through the MACA SDK |
| T-Head | PPU / Barex | Supported | T-Head PPU deployments are covered through Barex-based transport support |
| NVIDIA | CUDA GPUs / NVLink | Supported | `-DUSE_CUDA=ON`, `-DUSE_INTRA_NVLINK=ON`, `-DUSE_MNNVL=ON`; covers CUDA memory, GPUDirect RDMA, GPUDirect Storage, intra-node NVLink, and multi-node NVLink |
| AMD | ROCm / HIP GPUs | Supported | `-DUSE_HIP=ON`; HIP transport for AMD GPU communication |

Network and fabric support

| Vendor / Fabric | Hardware / Transport | Current support in Mooncake | How it is exposed |
|---|---|---|---|
| Alibaba Cloud | eRDMA NICs | Supported | `rdma` data path with eRDMA devices such as `erdma_0`; the build also enables `CONFIG_ERDMA` |
| Standard RDMA ecosystem | InfiniBand / RoCE NICs | Supported | Available through the standard `rdma` protocol path with topology-aware NIC selection |
| AWS | Elastic Fabric Adapter (EFA) | Supported | `-DUSE_EFA=ON`; EFA transport built on libfabric SRD |
| Storage disaggregation | NVMe-oF | Supported | Enabled with `-DUSE_NVMEOF=ON` |
| Memory pooling | CXL | Supported | Enabled with `-DUSE_CXL=ON` |
| Baseline networking | TCP/IP | Supported | `tcp` works in all environments |

Specialized transport paths

| Transport path | Current support in Mooncake | How it is exposed |
|---|---|---|
| Ascend HCCL transport | Supported | Enabled by `-DUSE_ASCEND=ON`; examples use `hccl` for Ascend NPU data movement |
| Ascend Direct transport | Supported | Enabled by `-DUSE_ASCEND_DIRECT=ON`; dedicated Ascend Direct examples and docs are included |
| UBShmem transport | Supported | Enabled by `-DUSE_UBSHMEM=ON`; Transfer Engine examples accept `--protocol=ubshmem` |
| Heterogeneous Ascend transport | Supported | Enabled by `-DUSE_ASCEND_HETEROGENEOUS=ON`; used for Ascend-GPU heterogeneous transfer |
| Barex transport | Supported | Enabled by `-DUSE_BAREX=ON`; documented as the `barex` advanced transport |
| Sunrise Transport | Supported | Listed as an additional specialized transport path |
| T-Head PPU / Barex | Supported | Barex-based transport coverage is available for T-Head PPU deployments |

πŸ”₯ Show Cases

Use Transfer Engine Standalone (Guide)

Transfer Engine is a high-performance data transfer framework. It provides a unified interface to transfer data from DRAM, VRAM, or NVMe, while hardware-related technical details are hidden. Transfer Engine supports multiple communication protocols including TCP, RDMA (InfiniBand/RoCEv2/eRDMA/NVIDIA GPUDirect), AWS EFA, NVMe over Fabrics (NVMe-oF), NVLink, HIP, Barex, CXL, and Ascend-family transports. When built with the corresponding runtime, Transfer Engine can also detect and route accelerator memory on CUDA, MUSA, HIP, MACA, Cambricon MLU, and Ascend-enabled environments. For a complete list of supported protocols and a configuration guide, see the Supported Protocols Documentation.

Highlights

Performance

With 40 GB of data (equivalent to the size of the KVCache generated by 128k tokens in the LLaMA3-70B model), Mooncake Transfer Engine delivers up to 87 GB/s and 190 GB/s of bandwidth in 4Γ—200 Gbps and 8Γ—400 Gbps RoCE networks respectively, which are about 2.4x and 4.6x faster than the TCP protocol.
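As a back-of-envelope check, the quoted figures can be reproduced with simple arithmetic. The bandwidths and speedup factors below are taken from the text; the implied TCP bandwidths are derived from the quoted 2.4x / 4.6x ratios, not measured here.

```python
# Back-of-envelope check of the bandwidth figures above: moving 40 GB
# of KVCache at the reported sustained bandwidths.
DATA_GB = 40

def transfer_time_s(bandwidth_gb_s: float) -> float:
    """Seconds to move DATA_GB at the given sustained bandwidth (GB/s)."""
    return DATA_GB / bandwidth_gb_s

# Reported Transfer Engine bandwidths on the two RoCE fabrics.
te_4x200 = transfer_time_s(87)     # 4x200 Gbps network
te_8x400 = transfer_time_s(190)    # 8x400 Gbps network

# Implied TCP transfer times, derived from the quoted speedup factors.
tcp_4x200 = transfer_time_s(87 / 2.4)
tcp_8x400 = transfer_time_s(190 / 4.6)

print(f"TE 4x200: {te_4x200:.2f} s vs TCP: {tcp_4x200:.2f} s")
print(f"TE 8x400: {te_8x400:.2f} s vs TCP: {tcp_8x400:.2f} s")
```

At 190 GB/s, the full 128k-token KVCache moves in well under half a second, which is what makes disaggregated prefill/decode transfers practical.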

P2P Store (Guide)

P2P Store is built on the Transfer Engine and supports sharing temporary objects between peer nodes in a cluster. P2P Store is ideal for scenarios like checkpoint transfer, where data needs to be rapidly and efficiently shared across a cluster. P2P Store has been used in the checkpoint transfer service of Moonshot AI.

Highlights

Mooncake Store (Guide)

Mooncake Store is a distributed KVCache storage engine specialized for LLM inference, built on Transfer Engine. It is the central component of the KVCache-centric disaggregated architecture. The goal of Mooncake Store is to store reusable KV caches across various locations in an inference cluster. Mooncake Store is supported in SGLang's Hierarchical KV Caching and vLLM's prefill serving, and is now integrated with LMCache to provide enhanced KVCache management capabilities.

Highlights

SGLang Integration (Guide)

SGLang officially supports Mooncake Store as a HiCache storage backend. This integration enables scalable KV cache retention and high-performance access for large-scale LLM serving scenarios.

Highlights

vLLM Integration (Guide v0.2)

To optimize LLM inference, the vLLM community is working on supporting disaggregated prefilling (PR 10502). This feature allows separating the prefill phase from the decode phase into different processes. vLLM uses NCCL and Gloo as the transport layer by default, but these currently cannot efficiently decouple the two phases across different machines.

We have implemented vLLM integration, which uses Transfer Engine as the network layer instead of NCCL and Gloo, to support inter-node KVCache transfer (PR 10884). Transfer Engine provides simpler interfaces and more efficient use of RDMA devices.

We will soon release the new vLLM integration based on Mooncake Store, which supports xPyD prefill/decode disaggregation.

Update [Dec 16, 2024]: Here is the latest vLLM Integration (Guide v0.2) that is based on vLLM's main branch.

Performance

By supporting Topology-Aware Path Selection and multi-card bandwidth aggregation, the mean TTFT of vLLM with Transfer Engine is up to 25% lower than with traditional TCP-based transports. In the future, we will further improve TTFT through GPUDirect RDMA and zero-copy.

| Backend/Setting | Output Token Throughput (tok/s) | Total Token Throughput (tok/s) | Mean TTFT (ms) | Median TTFT (ms) | P99 TTFT (ms) |
|---|---|---|---|---|---|
| Transfer Engine (RDMA) | 12.06 | 2042.74 | 1056.76 | 635.00 | 4006.59 |
| TCP | 12.05 | 2041.13 | 1414.05 | 766.23 | 6035.36 |
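The "up to 25% lower" claim follows directly from the Mean TTFT column of the benchmark table; a quick recomputation:

```python
# Recompute the Mean TTFT reduction from the benchmark table above.
te_mean_ttft = 1056.76   # ms, Transfer Engine (RDMA)
tcp_mean_ttft = 1414.05  # ms, TCP

reduction = (tcp_mean_ttft - te_mean_ttft) / tcp_mean_ttft
print(f"Mean TTFT reduction: {reduction:.1%}")  # ~25%
```

Note that throughput is nearly identical in both rows; the benefit of the RDMA path shows up in time-to-first-token, which is exactly the metric that KVCache transfer latency affects.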

More advanced features are coming soon, so stay tuned!

πŸš€ Quick Start

Before using Mooncake

Mooncake is designed and optimized for high-speed RDMA networks. Though Mooncake supports TCP-only data transfer, we strongly recommend evaluating the functionality and performance of Mooncake with RDMA network support.

The following need to be installed before running any component of Mooncake:

Use Python package

The simplest way to install the Mooncake Transfer Engine is via pip:

For CUDA-enabled systems:

pip install mooncake-transfer-engine

pip install mooncake-transfer-engine-cuda13

For non-CUDA systems:

pip install mooncake-transfer-engine-non-cuda
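Which wheel to install depends on whether the host is CUDA-enabled. The small helper below is hypothetical (not part of Mooncake): it uses the presence of `nvidia-smi` on `PATH` as a rough proxy for a CUDA-enabled system, and it does not attempt to distinguish the CUDA 13 variant, which depends on the installed CUDA major version.

```python
# Hypothetical helper (not part of Mooncake) that picks a wheel name
# based on whether a CUDA toolkit appears to be present on the host.
import shutil

def pick_wheel(cuda_available=None):
    if cuda_available is None:
        # `nvidia-smi` on PATH is a rough proxy for a CUDA-enabled system.
        cuda_available = shutil.which("nvidia-smi") is not None
    return ("mooncake-transfer-engine"
            if cuda_available
            else "mooncake-transfer-engine-non-cuda")

print(f"pip install {pick_wheel()}")
```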

Important

Use Docker image

Mooncake supports Docker-based deployment; see the Build Guide for details.

To produce an image that compiles Mooncake from source, builds the wheel via scripts/build_wheel.sh, and installs that wheel inside the container, use build-wheel.dockerfile:

docker build -f docker/mooncake.Dockerfile \
  --build-arg PYTHON_VERSION=3.10 \
  --build-arg EP_TORCH_VERSIONS="2.9.1" \
  -t mooncake:from-source .

The resulting image already has a virtual environment at /opt/venv with the freshly built wheel installed. Launch it with GPU/RDMA access as needed, for example:

python3 scripts/check_hicache_hugepage_requirements.py \
  --tp-size 4 \
  --hicache-size 64gb \
  --global-segment-size 8gb \
  --arena-pool-size 56gb \
  --available-hugetlb 512gb

sudo sysctl -w vm.nr_hugepages=262144
grep -E 'HugePages_Total|HugePages_Free|Hugepagesize' /proc/meminfo

docker run --gpus all \
  --network host \
  --ipc=host \
  --ulimit memlock=-1 \
  --shm-size=128g \
  -e MC_STORE_USE_HUGEPAGE=1 \
  -e MC_STORE_HUGEPAGE_SIZE=2MB \
  -e MOONCAKE_GLOBAL_SEGMENT_SIZE=8gb \
  -e MC_MMAP_ARENA_POOL_SIZE=56gb \
  -it mooncake:from-source /bin/bash

The 64gb / 56gb values above are tuned examples for large HiCache deployments, not allocator defaults. Key points about the mmap arena:

- The arena is off by default. Setting `MC_MMAP_ARENA_POOL_SIZE=...` explicitly both enables and sizes it; if you enable it via gflag instead, the default pool size is 8gb. On smaller hosts, start with 8gb or 16gb and size upward with the helper script.
- Set `MC_DISABLE_MMAP_ARENA=1` (also accepts `true`, `yes`, or `on`) when you want the baseline direct-`mmap()` path. Like the arena size itself, this must be set before the first Mooncake mmap-buffer allocation in the process.
- Arena bring-up is a one-shot lazy init, so after a failed first attempt you need to restart the process to retry with corrected env / hugepage settings.
- Without `MC_STORE_USE_HUGEPAGE=1`, the arena may opportunistically try hugepages and then retry on regular pages if HugeTLB is unavailable. With `MC_STORE_USE_HUGEPAGE=1`, Mooncake enforces the strict hugepage contract for both arena and direct-`mmap()` host-buffer allocation instead of silently downgrading to regular pages.
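The sizing in the example commands can be sanity-checked with simple arithmetic: `vm.nr_hugepages=262144` at the 2 MiB hugepage size yields exactly the 512gb passed to the helper script, and the hugepage-backed consumers must fit inside it. Whether sizes like `--hicache-size` are per rank or total is determined by the helper script itself; treat this as illustrative arithmetic only.

```python
# Sanity-check the example sizing above: 262144 x 2 MiB hugepages = 512 GiB,
# which must cover the hugepage-backed consumers configured for the container.
GIB = 1024**3

nr_hugepages = 262144
hugepage_size = 2 * 1024**2          # MC_STORE_HUGEPAGE_SIZE=2MB
hugetlb_bytes = nr_hugepages * hugepage_size

consumers = {
    "hicache": 64 * GIB,             # --hicache-size 64gb
    "global_segment": 8 * GIB,       # MOONCAKE_GLOBAL_SEGMENT_SIZE=8gb
    "arena_pool": 56 * GIB,          # MC_MMAP_ARENA_POOL_SIZE=56gb
}

assert hugetlb_bytes == 512 * GIB
assert sum(consumers.values()) <= hugetlb_bytes
print(f"hugetlb: {hugetlb_bytes // GIB} GiB, "
      f"reserved: {sum(consumers.values()) // GIB} GiB")
```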

Note

Make sure you build the image from the repository root so that Git metadata and submodules are available inside the build context.

Build and use binaries

The following are additional dependencies for building Mooncake:

The build and installation steps are as follows:

  1. Retrieve source code from GitHub repo
    git clone https://github.com/kvcache-ai/Mooncake.git
    cd Mooncake
  2. Install dependencies
  3. Compile Mooncake and examples
    mkdir build
    cd build
    cmake ..
    make -j
    sudo make install # optional, make it ready to be used by vLLM/SGLang

For Cambricon MLU builds, configure CMake with -DUSE_MLU=ON. For example:

mkdir build
cd build
cmake .. -DUSE_MLU=ON -DNEUWARE_ROOT=/usr/local/neuware
make -j

πŸ›£οΈ Incoming Milestones

πŸ“¦ Open Source Trace

{ "timestamp": 27482, "input_length": 6955, "output_length": 52, "hash_ids": [46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 2353, 2354] }
{ "timestamp": 30535, "input_length": 6472, "output_length": 26, "hash_ids": [46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 2366] }

The above presents two samples from our trace dataset. The trace includes the timing of request arrivals, the number of input tokens, the number of output tokens, and the remapped block hash. To protect our customers' privacy, we applied several mechanisms to remove user-related information while preserving the dataset's utility for simulated evaluation. More descriptions of the trace (e.g., up to 50% cache hit ratio) can be found in Section 4 of the technical report.
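A small sketch of how the trace can be consumed, assuming the prefix-matching interpretation of `hash_ids` (each ID names a KVCache block, so a shared leading run of IDs between requests indicates reusable cache):

```python
import json

# The two trace records shown above, as JSON-lines input.
trace = [
    '{"timestamp": 27482, "input_length": 6955, "output_length": 52,'
    ' "hash_ids": [46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 2353, 2354]}',
    '{"timestamp": 30535, "input_length": 6472, "output_length": 26,'
    ' "hash_ids": [46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 2366]}',
]
records = [json.loads(line) for line in trace]

# Length of the shared hash_id prefix between the two requests.
a, b = records[0]["hash_ids"], records[1]["hash_ids"]
shared = 0
for x, y in zip(a, b):
    if x != y:
        break
    shared += 1
print(f"shared prefix blocks: {shared} of {len(b)}")  # 12 of 13
```

In these two samples, 12 of the second request's 13 blocks match the first request's prefix, illustrating the kind of cross-request reuse that the reported up-to-50% cache hit ratio reflects.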

Update[Feb 21, 2025]: The updated traces used in our FAST'25 paper have been released! Please refer to the paper's appendix (found here) for more details.

πŸ“‘ Citation

Please kindly cite our paper if you find the paper or the traces useful:

@article{qin2025mooncake_tos,
  author    = {Qin Ruoyu and Li Zheming and He Weiran and Cui Jialei and Tang Heyi and Ren Feng and Ma Teng and Cai Shangming and Zhang Yineng and Zhang Mingxing and Wu Yongwei and Zheng Weimin and Xu Xinran},
  title     = {Mooncake: A KVCache-centric Disaggregated Architecture for LLM Serving},
  year      = {2025},
  publisher = {Association for Computing Machinery},
  address   = {New York, NY, USA},
  issn      = {1553-3077},
  url       = {https://doi.org/10.1145/3773772},
  doi       = {10.1145/3773772},
  journal   = {ACM Trans. Storage},
  month     = {nov},
  keywords  = {Machine learning system, LLM serving, KVCache},
}

@inproceedings{qin2025mooncake,
  author    = {Ruoyu Qin and Zheming Li and Weiran He and Jialei Cui and Feng Ren and Mingxing Zhang and Yongwei Wu and Weimin Zheng and Xinran Xu},
  title     = {Mooncake: Trading More Storage for Less Computation {\textemdash} A {KVCache-centric} Architecture for Serving {LLM} Chatbot},
  booktitle = {23rd USENIX Conference on File and Storage Technologies (FAST 25)},
  year      = {2025},
  isbn      = {978-1-939133-45-8},
  address   = {Santa Clara, CA},
  pages     = {155--170},
  url       = {https://www.usenix.org/conference/fast25/presentation/qin},
  publisher = {USENIX Association},
  month     = {feb},
}

@article{qin2024mooncake_arxiv,
  title  = {Mooncake: A KVCache-centric Disaggregated Architecture for LLM Serving},
  author = {Ruoyu Qin and Zheming Li and Weiran He and Mingxing Zhang and Yongwei Wu and Weimin Zheng and Xinran Xu},
  year   = {2024},
  url    = {https://arxiv.org/abs/2407.00079},
}