vLLM Blog

Accelerating RLHF with vLLM, Best Practice from OpenRLHF

Transformers backend integration in vLLM

Llama 4 in vLLM

PTPC-FP8: Boosting vLLM Performance on AMD ROCm

Introducing AIBrix: A Scalable, Cost-Effective Control Plane for vLLM

Distributed Inference with vLLM

vLLM V1: A Major Upgrade to vLLM's Core Architecture

Introducing vLLM Inference Provider in Llama Stack

High Performance and Easy Deployment of vLLM in K8S with “vLLM production-stack”

Structured Decoding in vLLM: a gentle introduction

vLLM 2024 Retrospective and 2025 Vision

Installing and Developing vLLM with Ease

Serving LLMs on AMD MI300X: Best Practices

How Speculative Decoding Boosts vLLM Performance by up to 2.8x

vLLM v0.6.0: 2.7x Throughput Improvement and 5x Latency Reduction

vLLM’s Open Governance and Performance Roadmap

Announcing Llama 3.1 Support in vLLM

Notes on vLLM vs. DeepSpeed-FastGen

vLLM: Easy, Fast, and Cheap LLM Serving with PagedAttention