Quantization

The PyTorch backend supports FP8 and NVFP4 quantization. You can pass pre-quantized models from the HF model hub that were generated by TensorRT Model Optimizer:

```python
from tensorrt_llm._torch import LLM

llm = LLM(model='nvidia/Llama-3.1-8B-Instruct-FP8')
llm.generate("Hello, my name is")
```
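
To inspect the generated text, you can iterate over the request outputs returned by `generate`. A minimal sketch, assuming the LLM API's `SamplingParams` and the `output.outputs[0].text` field on its request outputs:

```python
from tensorrt_llm import SamplingParams
from tensorrt_llm._torch import LLM

llm = LLM(model='nvidia/Llama-3.1-8B-Instruct-FP8')

# Sampling settings are an assumption for illustration; adjust as needed.
sampling_params = SamplingParams(max_tokens=32, temperature=0.8)

for output in llm.generate(["Hello, my name is"], sampling_params):
    print(output.outputs[0].text)
```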

Alternatively, you can produce a quantized model yourself with the following commands:

```bash
git clone https://github.com/NVIDIA/TensorRT-Model-Optimizer.git
cd TensorRT-Model-Optimizer/examples/llm_ptq
scripts/huggingface_example.sh --model <huggingface_model_card> --quant fp8 --export_fmt hf
```
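
For reference, below is a rough Python sketch of the post-training quantization flow that the script wraps. The API names (`mtq.quantize`, `mtq.FP8_DEFAULT_CFG`, `export_hf_checkpoint`), the model ID, and the calibration prompts are assumptions here; verify them against the TensorRT Model Optimizer version you have installed:

```python
# A hedged sketch of FP8 post-training quantization with TensorRT Model Optimizer,
# assuming the modelopt.torch.quantization and modelopt.torch.export APIs.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

import modelopt.torch.quantization as mtq
from modelopt.torch.export import export_hf_checkpoint

model_id = "meta-llama/Llama-3.1-8B-Instruct"  # hypothetical example model card
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="cuda"
)
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Calibration prompts are placeholders; use data representative of your workload.
calib_prompts = ["Hello, my name is", "The capital of France is"]

def forward_loop(model):
    # Run calibration data through the model so activation ranges
    # can be collected for the FP8 scaling factors.
    for prompt in calib_prompts:
        inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
        model(**inputs)

# Insert FP8 quantizers and calibrate them in place.
model = mtq.quantize(model, mtq.FP8_DEFAULT_CFG, forward_loop)

# Export a Hugging Face-format checkpoint (the script's --export_fmt hf).
export_hf_checkpoint(model, export_dir="llama-3.1-8b-instruct-fp8-hf")
```

The exported directory can then be passed to `LLM(model=...)` in the same way as the pre-quantized Hub checkpoint shown above.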