GitHub - ModelTC/lightllm: LightLLM is a Python-based LLM (Large Language Model) inference and serving framework, notable for its lightweight design, easy scalability, and high-speed performance.
LightLLM is a Python-based LLM (Large Language Model) inference and serving framework, notable for its lightweight design, easy scalability, and high-speed performance. LightLLM harnesses the strengths of numerous well-regarded open-source implementations, including but not limited to FasterTransformer, TGI, vLLM, and FlashAttention.
English Docs | 中文文档 | Blogs
News
- [2025/05] LightLLM paper on constrained decoding accepted by ACL25 (Pre³: Enabling Deterministic Pushdown Automata for Faster Structured LLM Generation)
- [2025/04] LightLLM paper on request scheduler published in ASPLOS’25 (Past-Future Scheduler for LLM Serving under SLA Guarantees)
- [2025/02] 🔥 LightLLM v1.0.0 release, achieving the fastest DeepSeek-R1 serving performance on a single H200 machine.
Get started
Performance
Learn more in the release blogs: v1.0.0 blog.
FAQ
Please refer to the FAQ for more information.
Projects using LightLLM
We welcome any cooperation and contribution. If a project requires LightLLM's support, please contact us via email or create a pull request.
- LazyLLM: The easiest and laziest way to build multi-agent LLM applications.
Once you have installed lightllm and lazyllm, you can use the following code to build your own chatbot:
from lazyllm import TrainableModule, deploy, WebModule
# The model will be downloaded automatically if you have an internet connection
m = TrainableModule('internlm2-chat-7b').deploy_method(deploy.lightllm)
WebModule(m).start().wait()
Documents: https://lazyllm.readthedocs.io/
Projects based on LightLLM or referencing LightLLM components:
- LoongServe, Peking University
- OmniKV, Ant Group
- vLLM (uses some of LightLLM's kernels)
- SGLang (uses some of LightLLM's kernels)
- ParrotServe, Microsoft
- Aphrodite (uses some of LightLLM's kernels)
- S-LoRA
Also, LightLLM's pure-Python design and token-level KV Cache management make it easy to use as the basis for research projects.
Academic works based on or using parts of LightLLM:
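To give a rough idea of what token-level cache management means, here is a minimal sketch. It is not LightLLM's actual code or API: the class `TokenSlotAllocator` and its methods are hypothetical names made up for illustration. The assumption is simply that cache memory is split into slots holding one token each, so a request takes exactly as many slots as it has tokens and returns them when it finishes.

```python
from typing import Dict, List


class TokenSlotAllocator:
    """Toy token-granular KV cache bookkeeping (hypothetical, not LightLLM's API)."""

    def __init__(self, num_slots: int) -> None:
        # Assumption: every slot holds the key/value data of exactly one token.
        self.free_slots: List[int] = list(range(num_slots))
        self.slots_by_request: Dict[str, List[int]] = {}

    def allocate(self, request_id: str, num_tokens: int) -> List[int]:
        """Reserve one slot per token; the slots need not be contiguous."""
        if num_tokens > len(self.free_slots):
            raise MemoryError("not enough free KV cache slots")
        taken = [self.free_slots.pop() for _ in range(num_tokens)]
        self.slots_by_request.setdefault(request_id, []).extend(taken)
        return taken

    def release(self, request_id: str) -> None:
        """Return all slots of a finished request to the free pool."""
        self.free_slots.extend(self.slots_by_request.pop(request_id, []))


allocator = TokenSlotAllocator(num_slots=8)
allocator.allocate("req-a", 3)  # prompt of 3 tokens
allocator.allocate("req-b", 4)  # prompt of 4 tokens
allocator.allocate("req-a", 1)  # one newly decoded token for req-a
allocator.release("req-b")      # req-b finishes; its 4 slots become reusable
print(len(allocator.free_slots))
```

Because bookkeeping happens per token rather than per fixed-size block, a sketch like this never reserves more slots than a request actually uses, which is the property that makes such a design convenient to experiment with.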
- ParrotServe (OSDI’24)
- S-LoRA (MLSys’24)
- LoongServe (SOSP’24)
- ByteDance’s CXL (EuroSys’24)
- VTC (OSDI’24)
- OmniKV (ICLR’25)
- CaraServe, LoRATEE, FastSwitch ...
Community
For further information and discussion, join our Discord server. You are welcome to become a member, and we look forward to your contributions!
License
This repository is released under the Apache-2.0 license.
Acknowledgement
We learned a lot from the following projects when developing LightLLM.
- Faster Transformer
- Text Generation Inference
- vLLM
- SGLang
- flashinfer
- Flash Attention 1&2
- OpenAI Triton
Citation
We have published a number of papers on components and features of LightLLM. If you use LightLLM in your work, please consider citing the relevant paper.
Request scheduler: accepted by ASPLOS’25:
@inproceedings{gong2025past,
  title={Past-Future Scheduler for LLM Serving under SLA Guarantees},
  author={Gong, Ruihao and Bai, Shihao and Wu, Siyu and Fan, Yunqian and Wang, Zaijun and Li, Xiuhong and Yang, Hailong and Liu, Xianglong},
  booktitle={Proceedings of the 30th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 2},
  pages={798--813},
  year={2025}
}