TensorRT - Get Started
NVIDIA® TensorRT™ is an ecosystem of APIs for high-performance deep learning inference. The TensorRT inference library provides a general-purpose AI compiler and an inference runtime that deliver low latency and high throughput for production applications. TensorRT-LLM builds on top of TensorRT, adding an open-source Python API with large language model (LLM)-specific optimizations such as in-flight batching and custom attention. TensorRT Model Optimizer provides state-of-the-art techniques such as quantization and sparsity to reduce model complexity, enabling TensorRT, TensorRT-LLM, and other inference libraries to further optimize speed during deployment.
TensorRT 10.0 GA is a free download for members of the NVIDIA Developer Program.
Ways to Get Started With NVIDIA TensorRT
TensorRT and TensorRT-LLM are freely available for development on multiple platforms. Simplify the deployment of AI models across cloud, data center, and GPU-accelerated workstations with NVIDIA NIM for generative AI and NVIDIA Triton™ Inference Server for every workload, both part of NVIDIA AI Enterprise.
TensorRT
TensorRT is available to download for free as a binary for multiple platforms or as a container on NVIDIA NGC™. A minimal engine-build sketch follows the resource lists below.
Beginner
- Getting Started with NVIDIA TensorRT (video)
- Introductory blog
- Getting started notebooks (Jupyter Notebook)
- Quick-start guide
Intermediate
- Sample code (C++)
- BERT, EfficientDet inference using TensorRT (Jupyter Notebook)
- Serving a model with NVIDIA Triton™ (blog, docs)
Expert
- Using quantization-aware training (QAT) with TensorRT (blog)
- PyTorch-quantization toolkit (Python code)
- TensorFlow quantization toolkit (blog)
- Sparsity with TensorRT (blog)
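As a quick illustration of the workflow these resources cover, the sketch below parses an ONNX model and builds a serialized engine with the TensorRT Python API. It assumes a TensorRT 10.x install; the ONNX path, the FP16 flag, and the output filename are placeholder choices for illustration.

```python
# Minimal sketch: build a TensorRT engine from an ONNX model (TensorRT 10.x API).
# "model.onnx" and "model.engine" are placeholder paths.
import tensorrt as trt

logger = trt.Logger(trt.Logger.WARNING)
builder = trt.Builder(logger)
network = builder.create_network(0)  # explicit batch is the default in TensorRT 10

parser = trt.OnnxParser(network, logger)
if not parser.parse_from_file("model.onnx"):
    for i in range(parser.num_errors):
        print(parser.get_error(i))
    raise RuntimeError("failed to parse ONNX model")

config = builder.create_builder_config()
config.set_flag(trt.BuilderFlag.FP16)  # optional: allow FP16 kernels

# Serialize the optimized engine so it can be deserialized at deployment time.
engine_bytes = builder.build_serialized_network(network, config)
with open("model.engine", "wb") as f:
    f.write(engine_bytes)
```

At runtime, the serialized engine is deserialized with trt.Runtime and executed through an execution context; the quick-start guide above walks through that half of the workflow.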
TensorRT-LLM
TensorRT-LLM is available for free on GitHub. A minimal usage sketch follows the resources below.
Beginner
- Introduction to how TensorRT-LLM supercharges inference (blog)
- How to get started with TensorRT-LLM (blog)
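To make the getting-started path concrete, here is a minimal sketch of the high-level LLM API shipped in recent TensorRT-LLM releases; the Hugging Face model ID and sampling settings are placeholder assumptions, and the exact API surface can vary between versions.

```python
# Minimal sketch of the TensorRT-LLM high-level LLM API (recent releases).
# The model ID and sampling settings are placeholder assumptions.
from tensorrt_llm import LLM, SamplingParams

llm = LLM(model="TinyLlama/TinyLlama-1.1B-Chat-v1.0")
sampling = SamplingParams(temperature=0.8, top_p=0.95)

# generate() builds or loads the engine as needed and runs batched inference.
for output in llm.generate(["What is TensorRT?"], sampling):
    print(output.outputs[0].text)
```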
TensorRT Model Optimizer
TensorRT Model Optimizer is available for free on NVIDIA PyPI, with examples and recipes on GitHub. A minimal quantization sketch follows the resources below.
Beginner
- TensorRT Model Optimizer Quick-Start Guide
- Introduction to Model Optimizer (blog)
- Optimize Generative AI Inference With Quantization (video)
- Optimizing Diffusion models with 8-bit quantization (blog)
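As a rough sketch of the post-training quantization workflow described in these resources, the example below applies Model Optimizer's default INT8 recipe to a PyTorch module via modelopt.torch.quantization; the toy model and calibration loop are placeholder assumptions.

```python
# Minimal sketch: post-training INT8 quantization with TensorRT Model Optimizer.
# The toy model and random calibration data are placeholder assumptions.
import torch
import modelopt.torch.quantization as mtq

model = torch.nn.Sequential(
    torch.nn.Linear(128, 256), torch.nn.ReLU(), torch.nn.Linear(256, 10)
)
calib_data = [torch.randn(32, 128) for _ in range(8)]

def forward_loop(m):
    # Feed representative data through the model to collect calibration statistics.
    for batch in calib_data:
        m(batch)

# Insert quantizers and calibrate with the default INT8 configuration.
model = mtq.quantize(model, mtq.INT8_DEFAULT_CFG, forward_loop)
```

The quantized module can then be exported and deployed through TensorRT or TensorRT-LLM; see the quick-start guide above for the supported export paths.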
Ways to Get Started With NVIDIA TensorRT Frameworks
Torch-TensorRT and TensorFlow-TensorRT are available for free as containers on the NGC catalog, or you can purchase NVIDIA AI Enterprise for mission-critical AI inference with enterprise-grade security, stability, manageability, and support. Contact sales or apply for a 90-day NVIDIA AI Enterprise evaluation license to get started.
Torch-TensorRT
Torch-TensorRT is available in the PyTorch container from the NGC catalog. A minimal compile sketch follows the resource lists below.
Beginner
- Getting started with NVIDIA Torch-TensorRT (video)
- Accelerate inference up to 6X in PyTorch (blog)
- Object detection with SSD (Jupyter Notebook)
Intermediate
- Post-training quantization with Hugging Face BERT (Jupyter Notebook)
- Quantization-aware training (Jupyter Notebook)
- Serving a model with Triton (blog, docs)
- Using dynamic shapes (Jupyter Notebook)
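A minimal compile sketch, assuming a torchvision ResNet-50 and an illustrative input shape and precision; any PyTorch module the compiler can trace may be substituted.

```python
# Minimal sketch: compile a PyTorch model with Torch-TensorRT.
# torchvision's ResNet-50 and the 1x3x224x224 input are placeholder assumptions.
import torch
import torch_tensorrt
import torchvision.models as models

model = models.resnet50(weights=None).eval().cuda()
example_input = torch.randn(1, 3, 224, 224, device="cuda")

trt_model = torch_tensorrt.compile(
    model,
    inputs=[example_input],
    enabled_precisions={torch.half},  # allow FP16 kernels where beneficial
)

with torch.no_grad():
    output = trt_model(example_input)
```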
TensorFlow-TensorRT
TensorFlow-TensorRT is available in the TensorFlow container from the NGC catalog. A minimal conversion sketch follows the resources below.
Beginner
- Getting started with TensorFlow-TensorRT (video)
- Leverage TF-TRT Integration for Low-Latency Inference (blog)
- Image classification with TF-TRT (video)
- Quantization with TF-TRT (sample code)
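A minimal conversion sketch using TF-TRT's TrtGraphConverterV2; the SavedModel directories are placeholder paths, and the constructor arguments vary somewhat across TensorFlow versions.

```python
# Minimal sketch: convert a TensorFlow SavedModel with TF-TRT.
# "saved_model_dir" and "trt_saved_model_dir" are placeholder paths.
from tensorflow.python.compiler.tensorrt import trt_convert as trt

converter = trt.TrtGraphConverterV2(
    input_saved_model_dir="saved_model_dir",
    precision_mode=trt.TrtPrecisionMode.FP16,
)
converter.convert()                    # replace supported subgraphs with TRT ops
converter.save("trt_saved_model_dir")  # write the converted SavedModel
```

The converted SavedModel loads like any other with tf.saved_model.load; operations TF-TRT cannot convert fall back to native TensorFlow execution.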
Explore More TensorRT Resources
Conversational AI
- Real-Time NLP With BERT (blog)
- Optimizing T5 and GPT-2 (blog)
- Quantize BERT with PTQ and QAT for INT8 Inference (sample)
- ASR With TensorRT (Jupyter Notebook)
- How to Deploy Real-Time TTS (blog)
- NLU With BERT Notebook (Jupyter Notebook)
- Real-Time Text-to-Speech (sample)
- Building an RNN Network Layer by Layer (sample code)
Image and Vision
- Optimize Object Detection (Jupyter Notebook)
- Estimating Depth With ONNX Models and Custom Layers (blog)
- Speeding Up Inference Using TensorFlow, ONNX, and TensorRT (blog)
- Object Detection With EfficientDet, YOLOv3 Networks (Python code samples)
- Using NVIDIA Ampere Architecture and TensorRT (blog)
- Achieving FP32 Accuracy in INT8 using Quantization-Aware Training (blog)
Stay up to date on the latest inference news from NVIDIA.