NVIDIA/TensorRT-LLM: TensorRT-LLM provides users with an easy-to-use Python API to define Large Language Models (LLMs) and supports state-of-the-art optimizations to perform inference efficiently on NVIDIA GPUs. TensorRT-LLM also contains components to create Python and C++ runtimes that orchestrate the inference execution in a performant way.

Latest News

[Performance benchmark figure]

TensorRT-LLM Overview

TensorRT-LLM is an open-source library for optimizing Large Language Model (LLM) inference. It provides state-of-the-art optimizations, including custom attention kernels, inflight batching, paged KV caching, quantization (FP8, FP4, INT4 AWQ, INT8 SmoothQuant, ...), speculative decoding, and much more, to perform inference efficiently on NVIDIA GPUs.
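As a taste of how some of these features surface in practice, the sketch below enables FP8 quantization and paged KV-cache block reuse through the LLM API introduced further down. The `QuantConfig`, `QuantAlgo`, and `KvCacheConfig` helpers and their parameters are assumptions based on the `tensorrt_llm.llmapi` module and may differ between releases.

```python
# Hedged sketch: configuring FP8 quantization and paged KV-cache reuse via the
# LLM API. Class and parameter names are assumptions and may vary by version.
from tensorrt_llm import LLM
from tensorrt_llm.llmapi import KvCacheConfig, QuantAlgo, QuantConfig

llm = LLM(
    model="TinyLlama/TinyLlama-1.1B-Chat-v1.0",  # placeholder checkpoint
    quant_config=QuantConfig(quant_algo=QuantAlgo.FP8),
    kv_cache_config=KvCacheConfig(
        enable_block_reuse=True,        # reuse paged KV-cache blocks across requests
        free_gpu_memory_fraction=0.9,   # fraction of free GPU memory given to the KV cache
    ),
)
```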

Recently re-architected with a PyTorch backend, TensorRT-LLM now combines peak performance with a more flexible and developer-friendly workflow. The original TensorRT-based backend remains supported and continues to provide an ahead-of-time compilation path for building highly optimized "Engines" for deployment. The PyTorch backend complements this by enabling faster development iteration and rapid experimentation.

TensorRT-LLM provides a flexible LLM API to simplify model setup and inference across both PyTorch and TensorRT backends. It supports a wide range of inference use cases, from a single GPU to multiple nodes with multiple GPUs, using Tensor Parallelism and/or Pipeline Parallelism. It also includes a backend for integration with the NVIDIA Triton Inference Server.
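A minimal generation script with the LLM API might look like the sketch below. The checkpoint name is a placeholder, and the sampling parameters and output fields follow the quickstart pattern, so exact names may vary between versions.

```python
# Minimal sketch of the LLM API. The checkpoint name is a placeholder; any
# supported Hugging Face model ID or local path should work.
from tensorrt_llm import LLM, SamplingParams

def main():
    prompts = [
        "Hello, my name is",
        "The capital of France is",
    ]
    sampling_params = SamplingParams(temperature=0.8, top_p=0.95, max_tokens=64)

    # tensor_parallel_size / pipeline_parallel_size can be passed here to shard
    # the model across multiple GPUs or nodes.
    llm = LLM(model="TinyLlama/TinyLlama-1.1B-Chat-v1.0")

    for output in llm.generate(prompts, sampling_params):
        print(f"Prompt: {output.prompt!r} -> {output.outputs[0].text!r}")

if __name__ == "__main__":
    main()
```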

Several popular models are pre-defined and can be easily customized or extended using native PyTorch code (for the PyTorch backend) or a PyTorch-style Python API (for the TensorRT backend).

Getting Started

To get started with TensorRT-LLM, visit our documentation: