vLLM Blog

Accelerating RLHF with vLLM, Best Practice from OpenRLHF

Transformers backend integration in vLLM

Llama 4 in vLLM

PTPC-FP8: Boosting vLLM Performance on AMD ROCm

Introducing AIBrix: A Scalable, Cost-Effective Control Plane for vLLM

Distributed Inference with vLLM

vLLM V1: A Major Upgrade to vLLM's Core Architecture

Introducing vLLM Inference Provider in Llama Stack

High Performance and Easy Deployment of vLLM in K8S with “vLLM production-stack”

Structured Decoding in vLLM: a gentle introduction

vLLM 2024 Retrospective and 2025 Vision

Installing and Developing vLLM with Ease

Serving LLMs on AMD MI300X: Best Practices

How Speculative Decoding Boosts vLLM Performance by up to 2.8x

vLLM v0.6.0: 2.7x Throughput Improvement and 5x Latency Reduction

vLLM’s Open Governance and Performance Roadmap

Announcing Llama 3.1 Support in vLLM

Notes on vLLM vs. DeepSpeed-FastGen

vLLM: Easy, Fast, and Cheap LLM Serving with PagedAttention