C++ GPT Runtime

TensorRT-LLM includes a C++ component to execute TensorRT engines built with the Python API as described in the TensorRT-LLM Architecture section. That component is called the C++ runtime.

The API of the C++ runtime is composed of the classes declared in cpp/include/tensorrt_llm/runtime and implemented in cpp/tensorrt_llm/runtime.

Even though the components described in this document have GPT in their names, they are not restricted to that specific model. Those classes can be used to implement auto-regressive models such as BLOOM, GPT-J, GPT-NeoX, or LLaMA.

Complete support for encoder-decoder models, such as T5, will be added to TensorRT-LLM in a future release. An experimental version, currently Python-only, can be found in the examples/enc_dec folder.

Overview#

Runtime models are described by an instance of the ModelConfig class and a pointer to the TensorRT engine that must be executed to perform the inference. The environment is configured through the WorldConfig (that name comes from MPI and its “famous” MPI_COMM_WORLD default communicator). The SamplingConfig class encapsulates parameters that control the generation of new tokens.

Model Configuration#

The model configuration is an instance of the ModelConfig class. That class encapsulates the following parameters (they are declared as private member variables and exposed through getters and setters):
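As an illustration, here is a minimal sketch of reading a few of those parameters through their getters. The getter names below (getVocabSize, getNbLayers, and so on) are assumptions based on the getter/setter convention described above; check the header for the exact API of your release.

```cpp
#include <iostream>

#include "tensorrt_llm/runtime/modelConfig.h" // header name assumed

namespace tlr = tensorrt_llm::runtime;

// Illustrative only: getter names are assumptions following the
// getter/setter convention described above.
void printModelSummary(tlr::ModelConfig const& modelConfig)
{
    std::cout << "vocab size : " << modelConfig.getVocabSize() << '\n'
              << "layers     : " << modelConfig.getNbLayers() << '\n'
              << "heads      : " << modelConfig.getNbHeads() << '\n'
              << "hidden size: " << modelConfig.getHiddenSize() << '\n';
}
```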

World Configuration#

Familiarity with MPI is not required to use the TensorRT-LLM C++ runtime. There are two main things you need to know:

The world configuration is an instance of the WorldConfig class, which encapsulates the following parameters:
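For example, a minimal sketch of creating a world configuration from the MPI environment. The mpi factory, its argument list (gpusPerNode, tensor/pipeline parallelism), and the getter names below are assumptions and may vary between releases.

```cpp
#include <iostream>

#include "tensorrt_llm/runtime/worldConfig.h" // header name assumed

namespace tlr = tensorrt_llm::runtime;

int main()
{
    // Assumption: a static factory that derives rank and world size from the
    // MPI environment (e.g. when the program is launched with mpirun).
    auto const worldConfig = tlr::WorldConfig::mpi(/*gpusPerNode=*/8);

    std::cout << "rank " << worldConfig.getRank() << " of "
              << worldConfig.getSize() << '\n';
    return 0;
}
```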

Sampling Parameters#

The SamplingConfig class encapsulates parameters that control the generation of new tokens. A comparison of the decoding methods is shown in the table below (X means not supported yet). Except for the beamWidth parameter, all the fields are optional and the runtime will use a default value if no value is provided by the user. For vector fields, the TensorRT-LLM runtime supports one value per sequence (that is, the vector contains batchSize values). If all the sequences use the same value for a given parameter, the vector can be limited to a single element (that is, size() == 1). The parameters fall into the groups below; a usage sketch follows the groups.

General

Sampling

Beam-search
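As an illustration, a minimal sketch of populating a sampling configuration, assuming the optional fields are public std::optional vectors holding one value per sequence. Field names and element types are assumptions and may differ between releases.

```cpp
#include <vector>

#include "tensorrt_llm/runtime/samplingConfig.h" // header name assumed

namespace tlr = tensorrt_llm::runtime;

tlr::SamplingConfig makeSamplingConfig()
{
    // beamWidth is the only mandatory parameter; 1 disables beam-search.
    tlr::SamplingConfig samplingConfig{/*beamWidth=*/1};

    // Assumption: optional fields hold one value per sequence; a
    // single-element vector applies the same value to all sequences.
    samplingConfig.temperature = std::vector{0.8f};
    samplingConfig.topK = std::vector<tlr::SizeType>{40};
    samplingConfig.topP = std::vector{0.9f};
    samplingConfig.repetitionPenalty = std::vector{1.1f};

    return samplingConfig;
}
```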

The Session#

The runtime session is deprecated in favor of the Executor API. It will be removed in a future release of TensorRT-LLM.

An example of how to use the GptSession to run a GPT-like auto-regressive model can be found in cpp/tests/runtime/gptSessionTest.cpp.
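A condensed outline of that pattern follows. The Config fields and the generate() signature are assumptions based on the session interface, and the two helpers that build the generation input and output are hypothetical placeholders; given the deprecation above, prefer the Executor API for new code.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

#include "tensorrt_llm/runtime/gptSession.h" // deprecated, see the note above

namespace tlr = tensorrt_llm::runtime;

// Hypothetical helpers: fill GenerationInput/GenerationOutput with the prompt
// token ids, input lengths and pre-allocated output buffers (omitted here).
tlr::GenerationInput makeGenerationInput();
tlr::GenerationOutput makeGenerationOutput();

void runGptSession(tlr::ModelConfig const& modelConfig,
    tlr::WorldConfig const& worldConfig, std::vector<std::uint8_t> const& engine)
{
    // Assumed Config fields; the session needs the engine limits up front.
    tlr::GptSession::Config sessionConfig{
        /*maxBatchSize=*/8, /*maxBeamWidth=*/1, /*maxSequenceLength=*/1024};

    tlr::GptSession session{
        sessionConfig, modelConfig, worldConfig, engine.data(), engine.size()};

    tlr::SamplingConfig samplingConfig{/*beamWidth=*/1};
    auto input = makeGenerationInput();
    auto output = makeGenerationOutput();

    // Runs the auto-regressive loop to completion (assumed signature).
    session.generate(output, input, samplingConfig);
}
```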

Internal Components#

The GptSession class encapsulates two main components. The TllmRuntime is in charge of executing the TensorRT engine. The GptDecoder generates the tokens from the logits. The TllmRuntime class is an internal component and you are not expected to use it directly. The GptDecoder can be used directly to implement a custom generation loop and for use cases that cannot be satisfied by the implementation in GptSession.
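The sketch below shows the shape such a custom generation loop can take. The forward() and sample() calls are hypothetical placeholders, not the actual TllmRuntime/GptDecoder interfaces; consult the headers in cpp/include/tensorrt_llm/runtime for the real API.

```cpp
#include <cstdint>
#include <vector>

// Hypothetical: Runtime::forward runs one engine step and returns logits,
// Decoder::sample turns logits into the next token id.
template <typename Runtime, typename Decoder>
std::vector<std::int32_t> customGenerationLoop(Runtime& runtime, Decoder& decoder,
    std::vector<std::int32_t> tokens, std::int32_t maxNewTokens, std::int32_t endId)
{
    for (std::int32_t step = 0; step < maxNewTokens; ++step)
    {
        auto const logits = runtime.forward(tokens); // one engine execution
        auto const next = decoder.sample(logits);    // logits -> next token
        tokens.push_back(next);
        if (next == endId) // stop once the end-of-sequence token is produced
        {
            break;
        }
    }
    return tokens;
}
```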

In-flight Batching Support#

In-flight batching is supported using separate decoders per request. The biggest difference compared to using a single decoder is in how the token generation from the logits is managed. A batch is split into batchSize individual requests and kernels are issued on separate CUDA streams. This behavior may be revisited in a future release to maintain the structure of the batch and improve efficiency.
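A conceptual illustration of that per-request scheme using the plain CUDA runtime API: launchDecoderStep is a hypothetical stand-in for the per-request decoder kernels issued by the runtime.

```cpp
#include <vector>

#include <cuda_runtime_api.h>

// Hypothetical stand-in for the decoder kernels issued for one request.
void launchDecoderStep(int requestId, cudaStream_t stream);

void decodeBatchStep(int batchSize)
{
    // One stream per request, so decoding work for different requests
    // can be issued independently and overlap on the GPU.
    std::vector<cudaStream_t> streams(batchSize);
    for (auto& stream : streams)
    {
        cudaStreamCreate(&stream);
    }

    for (int r = 0; r < batchSize; ++r)
    {
        launchDecoderStep(r, streams[r]);
    }

    for (auto& stream : streams)
    {
        cudaStreamSynchronize(stream); // wait for this request's work
        cudaStreamDestroy(stream);
    }
}
```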

Known Issues and Future Changes#