C++ GPT Runtime

TensorRT-LLM includes a C++ component to execute TensorRT engines built with the Python API as described in the TensorRT-LLM Architecture section. That component is called the C++ runtime.

The API of the C++ runtime is composed of the classes declared in cpp/include/tensorrt_llm/runtime and implemented in cpp/tensorrt_llm/runtime.

Even though the different components described in this document have GPT in their names, they are not restricted to that specific model. Those classes can be used to implement auto-regressive models such as BLOOM, GPT-J, GPT-NeoX, or LLaMA, for example.

Complete support for encoder-decoder models, like T5, will be added to TensorRT-LLM in a future release. An experimental version, currently Python-only, can be found in the examples/models/core/enc_dec folder.

Overview

Runtime models are described by an instance of the ModelConfig class and a pointer to the TensorRT engine that must be executed to perform the inference. The environment is configured through the WorldConfig (that name comes from MPI and its "famous" MPI_COMM_WORLD default communicator). The SamplingConfig class encapsulates parameters that control the generation of new tokens.
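In practice, the engine is typically handed to the runtime as a serialized blob read from disk. A minimal sketch, assuming an illustrative path and minimal error handling:

```cpp
#include <fstream>
#include <ios>
#include <string>
#include <vector>

// Load a serialized TensorRT engine from disk. The resulting buffer is what
// the runtime consumes together with a ModelConfig. Error handling is kept
// minimal for brevity.
std::vector<char> loadEngine(std::string const& enginePath)
{
    std::ifstream file{enginePath, std::ios::binary | std::ios::ate};
    auto const size = file.tellg(); // opened at the end, so this is the file size
    std::vector<char> engineBlob(static_cast<std::size_t>(size));
    file.seekg(0);
    file.read(engineBlob.data(), static_cast<std::streamsize>(engineBlob.size()));
    return engineBlob;
}
```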

Model Configuration

The model configuration is an instance of the ModelConfig class. That class encapsulates the parameters of the model; they are declared as private member variables and exposed through getters and setters.
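As a hedged illustration of that pattern, the sketch below constructs a ModelConfig and toggles a couple of options. The constructor arguments and method names are assumptions modeled on recent TensorRT-LLM releases and may differ in your version:

```cpp
#include "tensorrt_llm/runtime/modelConfig.h" // header name varies across versions

#include <NvInfer.h>

using tensorrt_llm::runtime::ModelConfig;

// Sketch only: the constructor signature and setter names are assumptions,
// not an authoritative API reference. Values are illustrative (GPT-2-like).
ModelConfig makeModelConfig()
{
    ModelConfig config{/*vocabSize=*/50257, /*nbLayers=*/12, /*nbHeads=*/12,
        /*hiddenSize=*/768, nvinfer1::DataType::kHALF};

    // The remaining parameters are private members exposed through setters.
    config.useGptAttentionPlugin(true); // engine was built with the attention plugin
    config.usePackedInput(true);        // engine expects packed (padding-free) inputs
    return config;
}
```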

World Configuration

Familiarity with MPI is not required to utilize the TensorRT-LLM C++ runtime. There are two main things you need to know:

The world configuration is an instance of the WorldConfig class, which encapsulates the parameters of the execution environment.
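For illustration, the world configuration is typically derived from the MPI environment. In the sketch below, the mpi() factory and the accessors are assumptions modeled on recent releases:

```cpp
#include "tensorrt_llm/runtime/worldConfig.h" // header name varies across versions

using tensorrt_llm::runtime::WorldConfig;

// Sketch only: rank and world size are read from the MPI environment, so a
// multi-GPU run is launched with mpirun (one process per GPU). The factory
// and accessor names are assumptions.
void setupWorld()
{
    auto const worldConfig = WorldConfig::mpi(/*gpusPerNode=*/8,
        /*tensorParallelism=*/4, /*pipelineParallelism=*/2);

    auto const rank = worldConfig.getRank();     // this process's rank
    auto const device = worldConfig.getDevice(); // GPU driven by this rank
    (void) rank;
    (void) device;
}
```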

Sampling Parameters

The SamplingConfig class encapsulates parameters that control the generation of new tokens. A comparison of the supported decoding methods is given in the table below (an X means the method is not supported yet). Except for the beamWidth parameter, all the fields are optional and the runtime will use a default value if none is provided by the user. For vector fields, the TensorRT-LLM runtime supports one value per sequence (that is, the vector contains batchSize values). If all the sequences use the same value for a given parameter, the vector can be limited to a single element (that is, size() == 1). The parameters fall into the groups listed below; a configuration sketch follows them.

General

Sampling

Beam-search
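As a minimal configuration sketch, assuming the layout of recent releases (a mandatory beamWidth plus optional vector-valued fields), a SamplingConfig for a batch of two sequences could be populated like this; the field names and element types are assumptions:

```cpp
#include "tensorrt_llm/runtime/samplingConfig.h" // header name varies across versions

#include <cstdint>
#include <vector>

using tensorrt_llm::runtime::SamplingConfig;

// Sketch only: the optional vector-valued members shown here are assumptions
// modeled on recent releases.
SamplingConfig makeSamplingConfig()
{
    SamplingConfig config{/*beamWidth=*/1}; // 1 disables beam search

    // A single value shared by all sequences (size() == 1)...
    config.temperature = std::vector<float>{0.8f};
    // ...or one value per sequence (size() == batchSize, here 2).
    config.topK = std::vector<std::int32_t>{40, 10};
    config.topP = std::vector<float>{0.95f, 0.5f};
    return config;
}
```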

Internal Components

The TllmRuntime class is in charge of executing the TensorRT engine. It is an internal component and you are not expected to use it directly. The GptDecoder generates tokens from the logits. It can be used directly to implement a custom generation loop for use cases that cannot be satisfied by the TRT-LLM implementation; a sketch of such a loop is shown below.
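To make that division of labor concrete, here is a hypothetical custom generation loop. runEngineStep() and decoderStep() are stand-ins for the engine execution handled by TllmRuntime and the token selection handled by GptDecoder; they are not part of the real API:

```cpp
#include <vector>

// Hypothetical stand-ins, not TensorRT-LLM API: runEngineStep() wraps the
// engine execution (TllmRuntime) and decoderStep() wraps the token selection
// (GptDecoder, driven by a SamplingConfig).
std::vector<float> runEngineStep(std::vector<int> const& tokens);
int decoderStep(std::vector<float> const& logits);

// Auto-regressive loop: run the engine, pick the next token, append it, and
// stop on the end-of-sequence token or when the length limit is reached.
std::vector<int> generate(std::vector<int> prompt, int maxNewTokens, int endId)
{
    auto tokens = std::move(prompt);
    for (int step = 0; step < maxNewTokens; ++step)
    {
        auto const logits = runEngineStep(tokens); // logits for the last position
        auto const next = decoderStep(logits);     // next token id
        tokens.push_back(next);
        if (next == endId)
        {
            break;
        }
    }
    return tokens;
}
```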