API Introduction — TensorRT-LLM

The LLM API is a high-level Python API designed for LLM workflows. This API is under active development and might have breaking changes in the future.

Supported Models

Model Preparation

The LLM class supports input from any of the following:

  1. Hugging Face Hub: Triggers a download from the Hugging Face model hub, such as TinyLlama/TinyLlama-1.1B-Chat-v1.0.
  2. Local Hugging Face models: Uses a locally stored Hugging Face model.
  3. Local TensorRT-LLM engine: Built with the trtllm-build tool or saved by the Python LLM API.

Any of these formats can be used interchangeably with the LLM(model=<any-model-path>) constructor.
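
For illustration, a minimal sketch of the three interchangeable forms; the local paths below are hypothetical placeholders:

from tensorrt_llm import LLM

# 1. Hugging Face Hub repo name (triggers a download)
llm = LLM(model="TinyLlama/TinyLlama-1.1B-Chat-v1.0")

# 2. Locally stored Hugging Face model (hypothetical path)
llm = LLM(model="/path/to/local/hf_model")

# 3. Local TensorRT-LLM engine directory (hypothetical path)
llm = LLM(model="/path/to/saved/engine")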

The following sections describe how to use these different formats for the LLM API.

Hugging Face Hub

Using the Hugging Face Hub is as simple as specifying the repo name in the LLM constructor:

from tensorrt_llm import LLM

llm = LLM(model="TinyLlama/TinyLlama-1.1B-Chat-v1.0")
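
A short end-to-end sketch of generating text with this model, following the usual generate/SamplingParams workflow; the sampling values are illustrative, not recommendations:

from tensorrt_llm import LLM, SamplingParams

llm = LLM(model="TinyLlama/TinyLlama-1.1B-Chat-v1.0")

# Illustrative sampling settings
sampling_params = SamplingParams(temperature=0.8, top_p=0.95)

outputs = llm.generate(["Hello, my name is"], sampling_params)
for output in outputs:
    print(output.outputs[0].text)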

You can also directly load TensorRT Model Optimizer’s quantized checkpoints on Hugging Face Hub in the same way.
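
For example, a quantized checkpoint can be passed by its repo name; the repo name below is illustrative, so substitute an actual Model Optimizer checkpoint:

# Illustrative repo name for a TensorRT Model Optimizer quantized checkpoint
llm = LLM(model="nvidia/Llama-3.1-8B-Instruct-FP8")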

Local Hugging Face Models

Given the popularity of the Hugging Face model hub, the API supports the Hugging Face format as one of its starting points. To use the API with Llama 3.1 models, download the model from the Meta Llama 3.1 8B model page with the following commands:

git lfs install
git clone https://huggingface.co/meta-llama/Meta-Llama-3.1-8B

After the model download is complete, you can load the model:

llm = LLM(model=<local-model-path>)

Using this model is subject to a particular license. Agree to the terms and authenticate with Hugging Face to begin the download.
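
For gated models such as this one, a minimal sketch of authenticating from Python, assuming the huggingface_hub package is installed:

from huggingface_hub import login

# Prompts for a Hugging Face access token that has been granted access to the gated model
login()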

Local TensorRT-LLM Engine

There are two ways to build a TensorRT-LLM engine:

  1. Build the TensorRT-LLM engine from the Hugging Face model directly with the trtllm-build tool and save it to disk for later use. Refer to the README in the examples/llama directory of the TensorRT-LLM repository on GitHub.
    After the engine build is finished, load the model:
    llm = LLM(model=<engine-path>)
  2. Use an LLM instance to create the engine and persist it to local disk:
    llm = LLM(model=<any-model-path>)

    # Save engine to local disk
    llm.save(<engine-dir>)

The engine can then be loaded through the model argument, as shown in the first approach; a complete round trip is sketched below.
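
Putting the second approach together, a minimal round-trip sketch; the engine directory below is a hypothetical path:

from tensorrt_llm import LLM

# Build an engine from a Hugging Face model, then persist it
llm = LLM(model="TinyLlama/TinyLlama-1.1B-Chat-v1.0")
llm.save("./tinyllama_engine")

# Later, load the saved engine directly through the same constructor
llm = LLM(model="./tinyllama_engine")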

Tips and Troubleshooting

The following tips typically assist new LLM API users who are familiar with other APIs that are part of TensorRT-LLM: