NVIDIA TensorRT-LLM

NVIDIA TensorRT-LLM provides users with an easy-to-use Python API to define Large Language Models (LLMs) and build NVIDIA TensorRT engines that contain state-of-the-art optimizations to perform inference efficiently on NVIDIA GPUs. TensorRT-LLM also contains components to create Python and C++ runtimes that execute those TensorRT engines.
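
To get a feel for that Python API, the snippet below is a minimal sketch using the high-level LLM class shipped in recent TensorRT-LLM releases. The Hugging Face model name and sampling settings are illustrative placeholders, not recommendations from this page.

```python
# Minimal sketch of the high-level Python LLM API (assumes a recent
# TensorRT-LLM release that includes the LLM class). The model name
# below is a placeholder; substitute any supported checkpoint.
from tensorrt_llm import LLM, SamplingParams

llm = LLM(model="TinyLlama/TinyLlama-1.1B-Chat-v1.0")

prompts = ["Hello, my name is", "The capital of France is"]
sampling_params = SamplingParams(temperature=0.8, top_p=0.95)

# generate() builds or loads the TensorRT engine and runs batched inference.
for output in llm.generate(prompts, sampling_params):
    print(output.outputs[0].text)
```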

This is the starting point for trying out TensorRT-LLM. Specifically, this Quick Start Guide enables you to quickly get set up and send HTTP requests using TensorRT-LLM; a hedged request sketch follows this entry.

Browse
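
As a taste of what the Quick Start covers, the sketch below sends a completion request to a locally running TensorRT-LLM server. It assumes an OpenAI-compatible /v1/completions endpoint (for example, one started with trtllm-serve) and uses a placeholder model name; adjust the URL, port, and payload to match your actual deployment.

```python
# Hedged sketch: send an HTTP completion request to a local TensorRT-LLM server.
# Assumes an OpenAI-compatible /v1/completions endpoint on localhost:8000;
# the model name and sampling settings are placeholders.
import requests

payload = {
    "model": "TinyLlama/TinyLlama-1.1B-Chat-v1.0",  # placeholder model name
    "prompt": "Where is New York?",
    "max_tokens": 32,
    "temperature": 0.0,
}

response = requests.post("http://localhost:8000/v1/completions", json=payload)
response.raise_for_status()
print(response.json()["choices"][0]["text"])
```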

This document provides step-by-step instructions on how to install TensorRT-LLM on Linux.

Browse

This document provides instructions for building TensorRT-LLM from the source code on Linux.

Browse

Clone the latest TensorRT-LLM branch, work with the code, participate in the development of the product, pull in the latest changes, and view the latest discussions.

Browse

This document provides an overview of TensorRT-LLM and how it accelerates and optimizes inference performance for the latest large language models (LLMs) on NVIDIA GPUs. Discover the major benefits that TensorRT-LLM provides and how it can help you.

Browse

This document provides the current status, software versions, fixed bugs, and known issues for TensorRT-LLM. All functionality published in the Release Notes has been fully tested and verified, and known limitations are documented.

Browse

This document lists the supported GPUs, models, and other hardware and software versions for the latest NVIDIA TensorRT-LLM release.

Browse

This document explains how TensorRT-LLM, as a toolkit, assembles optimized solutions to perform Large Language Model (LLM) inference.

Browse

This is the C++ API Runtime documentation for the TensorRT-LLM library.

Browse

This is the Python API Runtime documentation for the TensorRT-LLM library.

Browse

This is the Python API Layers documentation for the TensorRT-LLM library.

Browse

This is the Python API Functionals documentation for the TensorRT-LLM library.

Browse

This is the Python API Models documentation for the TensorRT-LLM library.

Browse

This is the Python API Plugin documentation for the TensorRT-LLM library.

Browse

This is the Python API Quantization documentation for the TensorRT-LLM library.

Browse

Learn how we used NVIDIA’s suite of solutions to optimize LLMs and deploy them in multi-GPU environments.

Browse

Learn about accelerated LLM alignment using the NeMo Framework, and about inference optimization and deployment through NVIDIA's TensorRT-LLM and Triton Inference Server.

Browse

Learn how we leverage TensorRT-LLM to implement key features of our model-serving product, highlighting useful TensorRT-LLM features such as token streaming, in-flight batching, paged attention, quantization, and more.

Browse

Find more news and tutorials.

Browse

Join the NVIDIA Developer Program.

Browse

Explore the TensorRT-LLM forums.

Browse

This document describes how to debug unit tests, execution errors, end-to-end (E2E) models, and installation issues.

Browse