GitHub - kvcache-ai/ktransformers: A Flexible Framework for Experiencing Heterogeneous LLM Inference/Fine-tune Optimizations

KTransformers

A Flexible Framework for Experiencing Cutting-edge LLM Inference/Fine-tune Optimizations

🎯 Overview | πŸš€ Inference | πŸŽ“ SFT | πŸ”₯ Citation | πŸš€ Roadmap(2026Q2)

🎯 Overview

KTransformers is a research project focused on efficient inference and fine-tuning of large language models through CPU-GPU heterogeneous computing. The project now exposes two user-facing capabilities from the kt-kernel source tree: Inference and SFT.
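The core idea behind CPU-GPU heterogeneous MoE inference is to keep the small set of frequently activated ("hot") experts in GPU memory and serve the long tail of "cold" experts from CPU memory. The sketch below illustrates that placement policy in plain Python; it is a conceptual illustration only, and `place_experts` is a hypothetical name, not part of the kt-kernel API.

```python
# Hypothetical sketch (NOT the kt-kernel API): rank experts by how often
# they are activated and give the hottest ones the limited GPU slots;
# everything else stays on the CPU.

def place_experts(activation_counts, gpu_slots):
    """Assign the most frequently activated experts to the GPU.

    activation_counts: {expert_id: times_activated}
    gpu_slots: number of experts that fit in GPU memory
    Returns {expert_id: "gpu" | "cpu"}.
    """
    ranked = sorted(activation_counts, key=activation_counts.get, reverse=True)
    hot = set(ranked[:gpu_slots])
    return {e: ("gpu" if e in hot else "cpu") for e in activation_counts}

counts = {0: 950, 1: 12, 2: 430, 3: 7, 4: 610, 5: 3}
placement = place_experts(counts, gpu_slots=2)
# Experts 0 and 4 are activated most often, so they land on the GPU.
```

Real systems also have to account for expert sizes, PCIe transfer cost, and shifting activation patterns, but the hot/cold split is the essential intuition.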

πŸ”₯ Updates


πŸ“¦ Capabilities

πŸš€ Inference - High-Performance kt-kernel Serving

CPU-optimized kernel operations for heterogeneous LLM inference.


Key Features:

Quick Start:

```shell
cd kt-kernel
pip install .
```

Use Cases:

Performance Examples:

| Model | Hardware Configuration | Total Throughput | Output Throughput |
|---|---|---|---|
| DeepSeek-R1-0528 (FP8) | 8×L20 GPU + Xeon Gold 6454S | 227.85 tokens/s | 87.58 tokens/s (8-way concurrency) |
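As a back-of-envelope check on the table above (assuming the eight concurrent streams share the decode throughput evenly, which the table does not state), the aggregate output rate works out to roughly 10.9 tokens/s per stream:

```python
# Rough arithmetic on the reported numbers; actual per-stream speed
# varies with scheduling and sequence lengths.
output_tps = 87.58   # aggregate output tokens/s at 8-way concurrency
concurrency = 8
per_stream = output_tps / concurrency   # roughly 10.9 tokens/s per stream
```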

πŸ‘‰ Full Documentation β†’


πŸŽ“ SFT - Fine-Tuning with LLaMA-Factory

KTransformers Γ— LLaMA-Factory integration for ultra-large MoE model fine-tuning.

KTransformers SFT

Key Features:

| Model | GPU Memory | Training Speed | Hardware |
|---|---|---|---|
| DeepSeek-V3 | ~80GB total | 3.7 it/s | 4x RTX 4090 |
| DeepSeek-R1 | ~80GB total | 3.7 it/s | 4x RTX 4090 |
| Qwen3-30B-A3B | ~24GB total | 8+ it/s | 1x RTX 4090 |
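A quick consistency check on the table above (assuming the ~80GB total is sharded evenly across the four GPUs, which the table does not state): each RTX 4090 would hold about 20GB, inside its 24GB capacity.

```python
# Rough arithmetic on the reported numbers; real sharding is rarely
# perfectly even, so treat this as a feasibility check only.
total_gpu_mem_gb = 80
num_gpus = 4
per_gpu = total_gpu_mem_gb / num_gpus   # 20.0 GB per GPU
rtx4090_capacity_gb = 24
fits = per_gpu <= rtx4090_capacity_gb   # leaves ~4 GB of headroom per card
```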

Quick Start:

```shell
cd /path/to/LLaMA-Factory
pip install -e .
pip install -r requirements/ktransformers.txt
CUDA_VISIBLE_DEVICES=0,1,2,3 accelerate launch \
    --config_file examples/ktransformers/accelerate/fsdp2_kt_int8.yaml \
    src/train.py \
    examples/ktransformers/train_lora/qwen3_5moe_lora_sft_kt.yaml
```
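The launch command above runs LoRA fine-tuning. The idea behind LoRA, sketched generically below in pure Python (this is not LLaMA-Factory's implementation), is to freeze the base weight W and train only a low-rank update (alpha/r) · B @ A, which is why the GPU memory in the table stays far below full fine-tuning:

```python
# Generic LoRA sketch: y = x @ (W + (alpha/r) * B @ A),
# where W is frozen and only the small factors A and B are trained.

def matmul(a, b):
    """Plain list-of-lists matrix multiply."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def lora_forward(x, W, A, B, alpha, r):
    scale = alpha / r
    delta = [[scale * v for v in row] for row in matmul(B, A)]   # low-rank update
    W_eff = [[w + d for w, d in zip(wr, dr)] for wr, dr in zip(W, delta)]
    return matmul(x, W_eff)

x = [[1.0, 1.0]]
W = [[1.0, 0.0], [0.0, 1.0]]   # frozen base weight (2x2)
B = [[1.0], [0.0]]             # trainable, rank r = 1 (2x1)
A = [[0.0, 2.0]]               # trainable (1x2)
y = lora_forward(x, W, A, B, alpha=1, r=1)   # -> [[1.0, 3.0]]
```

With rank r much smaller than the weight dimensions, A and B hold only a tiny fraction of W's parameters, which is the source of the memory savings.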

πŸ‘‰ **Quick Start β†’**πŸ‘‰ Full Documentation β†’


πŸ”₯ Citation

If you use KTransformers in your research, please cite our paper:

```bibtex
@inproceedings{10.1145/3731569.3764843,
  title     = {KTransformers: Unleashing the Full Potential of CPU/GPU Hybrid Inference for MoE Models},
  author    = {Chen, Hongtao and Xie, Weiyu and Zhang, Boxin and Tang, Jingqi and Wang, Jiahao and Dong, Jianwei and Chen, Shaoyuan and Yuan, Ziwei and Lin, Chen and Qiu, Chengyu and Zhu, Yuening and Ou, Qingliang and Liao, Jiaqi and Chen, Xianglin and Ai, Zhiyuan and Wu, Yongwei and Zhang, Mingxing},
  booktitle = {Proceedings of the ACM SIGOPS 31st Symposium on Operating Systems Principles},
  year      = {2025}
}
```

πŸ‘₯ Contributors & Team

Developed and maintained by:

We welcome contributions! Please feel free to submit issues and pull requests.

πŸ’¬ Community & Support

πŸ“¦ KT original Code

The original integrated KTransformers framework has been archived to the archive/ directory for reference. The project is now organized around the two capabilities above, built from the kt-kernel source tree, for clearer documentation and maintenance.

For the original documentation with full quick-start guides and examples, see the archive/ directory.