TensorRT-LLM — Model Optimizer 0.27.1

The ModelOpt toolkit supports automatic conversion of a ModelOpt-exported LLM to a TensorRT-LLM checkpoint and engine for accelerated inference.

This conversion is achieved by:

  1. Converting Hugging Face, NeMo, and ModelOpt exported checkpoints to the TensorRT-LLM checkpoint format.
  2. Building a TensorRT-LLM engine from the TensorRT-LLM checkpoint.

Export Quantized Model

After the model is quantized, it can be exported to the TensorRT-LLM checkpoint format, which is stored as:

  1. A single JSON file (config.json) recording the model structure and metadata.
  2. A group of safetensors files, one per GPU rank, each recording that rank's calibrated model shard (weights and scaling factors), as sketched below.
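
To sanity-check an export, the checkpoint directory can be inspected directly. Here is a minimal sketch, assuming the checkpoint was exported to exported_model; the config keys shown (architecture, quantization) are typical of TensorRT-LLM checkpoint configs but may vary across versions:

import json
from pathlib import Path

export_dir = Path("exported_model")  # hypothetical export directory

# config.json records the model structure and metadata.
config = json.loads((export_dir / "config.json").read_text())
print(config.get("architecture"), config.get("quantization"))

# One safetensors file per inference GPU rank, e.g. rank0.safetensors.
for shard in sorted(export_dir.glob("*.safetensors")):
    print(shard.name, shard.stat().st_size, "bytes")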

The export API (export_tensorrt_llm_checkpoint) can be used as follows:

import torch

from modelopt.torch.export import export_tensorrt_llm_checkpoint

with torch.inference_mode():
    export_tensorrt_llm_checkpoint(
        model,  # The quantized model.
        decoder_type,  # The type of the model as str, e.g. gpt, gptj, llama.
        dtype,  # The weights data type for exporting the unquantized layers.
        export_dir,  # The directory where the exported files will be stored.
        inference_tensor_parallel,  # The number of GPUs used for tensor parallelism at inference time.
        inference_pipeline_parallel,  # The number of GPUs used for pipeline parallelism at inference time.
    )

If the export_tensorrt_llm_checkpoint call succeeds, the TensorRT-LLM checkpoint is saved. Otherwise, e.g., if the decoder_type is not supported, a torch state_dict checkpoint is saved instead.
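
For context, the export call is typically preceded by a ModelOpt quantization step. Below is a minimal end-to-end sketch, assuming a Hugging Face Llama model, the modelopt.torch.quantization API with its FP8 default config, and a hypothetical calib_dataloader that yields calibration batches:

import torch
from transformers import AutoModelForCausalLM

import modelopt.torch.quantization as mtq
from modelopt.torch.export import export_tensorrt_llm_checkpoint

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf", torch_dtype=torch.float16
).cuda()

def forward_loop(model):
    # Run calibration batches through the model to collect scaling factors.
    for batch in calib_dataloader:  # hypothetical calibration dataloader
        model(batch.cuda())

# Quantize the model in place using the FP8 default config.
model = mtq.quantize(model, mtq.FP8_DEFAULT_CFG, forward_loop)

with torch.inference_mode():
    export_tensorrt_llm_checkpoint(
        model,
        decoder_type="llama",
        dtype=torch.float16,
        export_dir="exported_model",
        inference_tensor_parallel=1,
        inference_pipeline_parallel=1,
    )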

Model support matrix for the TensorRT-LLM checkpoint export

Model / Quantization  FP16 / BF16  FP8  INT8_SQ  INT4_AWQ
GPT2                  Yes          Yes  Yes      No
GPTJ                  Yes          Yes  Yes      Yes
LLAMA 2               Yes          Yes  Yes      Yes
LLAMA 3               Yes          Yes  No       Yes
Mistral               Yes          Yes  Yes      Yes
Mixtral 8x7B          Yes          Yes  No       Yes
Falcon 40B, 180B      Yes          Yes  Yes      Yes
Falcon 7B             Yes          Yes  Yes      No
MPT 7B, 30B           Yes          Yes  Yes      Yes
Baichuan 1, 2         Yes          Yes  Yes      Yes
ChatGLM2, 3 6B        Yes          No   No       Yes
Bloom                 Yes          Yes  Yes      Yes
Phi-1, 2, 3           Yes          Yes  Yes      Yes
Nemotron 8            Yes          Yes  No       Yes
Gemma 2B, 7B          Yes          Yes  No       Yes
Recurrent Gemma       Yes          Yes  Yes      Yes
StarCoder 2           Yes          Yes  Yes      Yes
Qwen-1, 1.5           Yes          Yes  Yes      Yes

Convert to TensorRT-LLM

Once the TensorRT-LLM checkpoint is available, follow the TensorRT-LLM build API to build and deploy the quantized LLM.
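
As a concrete starting point, the checkpoint can be compiled into an engine with the trtllm-build command line tool shipped with TensorRT-LLM. Here is a minimal sketch, assuming the checkpoint was exported to exported_model; exact flags may differ between TensorRT-LLM versions:

trtllm-build --checkpoint_dir exported_model \
             --output_dir trt_engines \
             --gemm_plugin auto

The engine files written to trt_engines can then be loaded by the TensorRT-LLM runtime for deployment.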