LMDeploy


Latest News 🎉



Introduction

LMDeploy is a toolkit for compressing, deploying, and serving LLMs, developed by the MMRazor and MMDeploy teams. Its core features include efficient inference (persistent batching, blocked KV cache, tensor parallelism, and high-performance CUDA kernels), effective quantization (weight-only and KV-cache quantization), effortless distributed serving across multiple machines and cards, and an interactive inference mode.

Performance

[v0.1.0 benchmark chart]

Supported Models

LLMs: Llama (7B - 65B), Llama2 (7B - 70B), Llama3 (8B, 70B), Llama3.1 (8B, 70B), Llama3.2 (1B, 3B), InternLM (7B - 20B), InternLM2 (7B - 20B), InternLM3 (8B), InternLM2.5 (7B), Qwen (1.8B - 72B), Qwen1.5 (0.5B - 110B), Qwen1.5-MoE (0.5B - 72B), Qwen2 (0.5B - 72B), Qwen2-MoE (57BA14B), Qwen2.5 (0.5B - 32B), Qwen3, Qwen3-MoE, Baichuan (7B), Baichuan2 (7B - 13B), Code Llama (7B - 34B), ChatGLM2 (6B), GLM4 (9B), CodeGeeX4 (9B), Yi (6B - 34B), Mistral (7B), DeepSeek-MoE (16B), DeepSeek-V2 (16B, 236B), DeepSeek-V2.5 (236B), Mixtral (8x7B, 8x22B), Gemma (2B - 7B), StarCoder2 (3B - 15B), Phi-3-mini (3.8B), Phi-3.5-mini (3.8B), Phi-3.5-MoE (16x3.8B), Phi-4-mini (3.8B), MiniCPM3 (4B)

VLMs: LLaVA (1.5, 1.6) (7B - 34B), InternLM-XComposer2 (7B, 4khd-7B), InternLM-XComposer2.5 (7B), Qwen-VL (7B), Qwen2-VL (2B, 7B, 72B), Qwen2.5-VL (3B, 7B, 72B), DeepSeek-VL (7B), DeepSeek-VL2 (3B, 16B, 27B), InternVL-Chat (v1.1 - v1.5), InternVL2 (1B - 76B), InternVL2.5 (MPO) (1B - 78B), InternVL3 (1B - 78B), Mono-InternVL (2B), ChemVLM (8B - 26B), CogVLM-Chat (17B), CogVLM2-Chat (19B), MiniCPM-Llama3-V-2_5, MiniCPM-V-2_6, Phi-3-vision (4.2B), Phi-3.5-vision (4.2B), GLM-4V (9B), Llama3.2-vision (11B, 90B), Molmo (7B-D, 72B), Gemma3 (1B - 27B), Llama4 (Scout, Maverick)

LMDeploy has developed two inference engines - TurboMind and PyTorch, each with a different focus. The former strives for ultimate optimization of inference performance, while the latter, developed purely in Python, aims to lower the barrier to entry for developers.

They differ in the types of supported models and the inference data types. Please refer to this table for each engine's capabilities and choose the one that best fits your needs. The engine can also be selected explicitly when creating a pipeline, as shown in the sketch below.
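A minimal sketch, assuming the backend config classes exported at the top level of lmdeploy (TurbomindEngineConfig and PytorchEngineConfig); the model name and parameter values are illustrative only:

from lmdeploy import pipeline, TurbomindEngineConfig, PytorchEngineConfig

# TurboMind engine: tuned for inference performance.
pipe = pipeline(
    "internlm/internlm3-8b-instruct",
    backend_config=TurbomindEngineConfig(tp=1, session_len=8192),
)

# PyTorch engine: pure-Python backend, easier to extend and debug.
# pipe = pipeline(
#     "internlm/internlm3-8b-instruct",
#     backend_config=PytorchEngineConfig(tp=1, session_len=8192),
# )

print(pipe(["Hello, who are you?"]))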

Quick Start

Installation

It is recommended to install lmdeploy with pip in a conda environment (Python 3.8 - 3.12):

conda create -n lmdeploy python=3.8 -y
conda activate lmdeploy
pip install lmdeploy

Since v0.3.0, the default prebuilt package is compiled with CUDA 12. For instructions on installing on a CUDA 11+ platform, or on building from source, please refer to the installation guide.

Offline Batch Inference

import lmdeploy

with lmdeploy.pipeline("internlm/internlm3-8b-instruct") as pipe:
    response = pipe(["Hi, pls intro yourself", "Shanghai is"])
    print(response)
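Sampling behaviour can be tuned per request through a generation config. A short sketch, assuming lmdeploy's GenerationConfig with illustrative parameter values:

from lmdeploy import pipeline, GenerationConfig

pipe = pipeline("internlm/internlm3-8b-instruct")
gen_config = GenerationConfig(
    max_new_tokens=256,  # cap the length of each response
    temperature=0.8,     # sampling temperature
    top_p=0.95,          # nucleus sampling threshold
)
response = pipe(["Hi, pls intro yourself", "Shanghai is"], gen_config=gen_config)
print(response)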

Note

By default, LMDeploy downloads models from the HuggingFace Hub. If you would like to use models from ModelScope, install it with pip install modelscope and set the environment variable:

export LMDEPLOY_USE_MODELSCOPE=True

If you would like to use models from openMind Hub, install it with pip install openmind_hub and set the environment variable:

export LMDEPLOY_USE_OPENMIND_HUB=True
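The same switches can also be set from Python. A minimal sketch, assuming the variable is set before the pipeline is created and the model source is resolved:

import os

# Select ModelScope as the model source; use LMDEPLOY_USE_OPENMIND_HUB
# instead for openMind Hub. Set this before creating the pipeline.
os.environ["LMDEPLOY_USE_MODELSCOPE"] = "True"

import lmdeploy

pipe = lmdeploy.pipeline("internlm/internlm3-8b-instruct")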

For more information about the inference pipeline, please refer to the pipeline guide.

Tutorials

Please review the getting_started section for the basic usage of LMDeploy.

For detailed user guides and advanced guides, please refer to our tutorials.

Third-party projects

Contributing

We appreciate all contributions to LMDeploy. Please refer to CONTRIBUTING.md for the contributing guideline.

Acknowledgement

Citation

@misc{2023lmdeploy,
    title={LMDeploy: A Toolkit for Compressing, Deploying, and Serving LLM},
    author={LMDeploy Contributors},
    howpublished = {\url{https://github.com/InternLM/lmdeploy}},
    year={2023}
}

License

This project is released under the Apache 2.0 license.