# Welcome to vLLM
Easy, fast, and cheap LLM serving for everyone
vLLM is a fast and easy-to-use library for LLM inference and serving.
Originally developed in the Sky Computing Lab at UC Berkeley, vLLM has evolved into a community-driven project with contributions from both academia and industry.
vLLM is fast with:
- State-of-the-art serving throughput
- Efficient management of attention key and value memory with PagedAttention
- Continuous batching of incoming requests
- Fast model execution with CUDA/HIP graph
- Quantization: GPTQ, AWQ, INT4, INT8, and FP8
- Optimized CUDA kernels, including integration with FlashAttention and FlashInfer
- Speculative decoding
- Chunked prefill (see the configuration sketch after this list)
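Several of these performance features are enabled through engine arguments when constructing the engine. The snippet below is a minimal sketch, not a definitive configuration: the checkpoint name is a placeholder, and the exact flag names and defaults can vary between vLLM releases.

```python
from vllm import LLM

# Minimal sketch: the model name is a placeholder, and flag names/defaults
# may differ across vLLM versions.
llm = LLM(
    model="TheBloke/Llama-2-7B-AWQ",  # an AWQ-quantized checkpoint (assumed)
    quantization="awq",               # select the quantization scheme explicitly
    enable_chunked_prefill=True,      # split long prompt prefills across scheduler steps
    gpu_memory_utilization=0.9,       # fraction of GPU memory the engine may use (weights + KV cache)
)
```

PagedAttention, continuous batching, and CUDA/HIP graph capture require no configuration; they are part of the default execution path.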
vLLM is flexible and easy to use with:
- Seamless integration with popular HuggingFace models (see the usage sketch after this list)
- High-throughput serving with various decoding algorithms, including parallel sampling, beam search, and more
- Tensor parallelism and pipeline parallelism support for distributed inference
- Streaming outputs
- OpenAI-compatible API server
- Support for NVIDIA GPUs, AMD CPUs and GPUs, Intel CPUs, Gaudi® accelerators and GPUs, IBM Power CPUs, TPUs, and AWS Trainium and Inferentia accelerators
- Prefix caching support
- Multi-LoRA support
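To make the ease-of-use claim concrete, here is a minimal offline-inference sketch using the Python API; the model name and sampling settings are illustrative assumptions, not requirements.

```python
from vllm import LLM, SamplingParams

# Any Hugging Face causal LM repo id can be used; this small model is just an example.
llm = LLM(model="facebook/opt-125m")

sampling_params = SamplingParams(temperature=0.8, top_p=0.95, max_tokens=64)
outputs = llm.generate(["The capital of France is"], sampling_params)

for output in outputs:
    # Each request output carries the prompt and one or more generated completions.
    print(output.outputs[0].text)
```

For online serving, the OpenAI-compatible server can typically be launched with `vllm serve <model>` (or `python -m vllm.entrypoints.openai.api_server --model <model>` in older releases) and queried with any OpenAI client pointed at `http://localhost:8000/v1`.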
For more information, check out the following:
- vLLM announcing blog post (intro to PagedAttention)
- vLLM paper (SOSP 2023)
- How continuous batching enables 23x throughput in LLM inference while reducing p50 latency by Cade Daniel et al.
- vLLM Meetups
## Documentation
- Models
- Features
  - Quantization
  - LoRA Adapters
  - Tool Calling
  - Reasoning Outputs
  - Structured Outputs
  - Automatic Prefix Caching
  - Disaggregated Prefilling (experimental)
  - Speculative Decoding
  - Compatibility Matrix
- Deployment
  - Security Guide
  - Using Docker
  - Using Kubernetes
  - Using Nginx
  - Using other frameworks
  - External Integrations
- Design Documents
  - Architecture Overview
  - Integration with HuggingFace
  - vLLM’s Plugin System
  - vLLM Paged Attention
  - Multi-Modal Data Processing
  - Automatic Prefix Caching
  - Python Multiprocessing
- V1 Design Documents
- Developer Guide
  - Contributing to vLLM
  - Deprecation Policy
  - Profiling vLLM
  - Dockerfile
  - Adding a New Model
  - Vulnerability Management
- API Reference