
Triton Stateful Backend

This repository contains the Stateful Backend for Triton Inference Server. You can learn more about backends in the backend repo. Ask questions or report problems in the main Triton issues page.

The backend code automatically manages the input and output states of a model. The states are associated with a sequence id and need to be tracked for inference requests associated with the sequence id. The backend currently handles ONNX models, however, it can be extended to other model types.

(Figure: an example stateful model with matching input and output state tensors)

The example stateful model above has matching input and output tensors representing the model states. An output state tensor for a sequence id is passed as the input state during the next inference execution for the same sequence id. Therefore, the state tensors do not need to be communicated between server and client, and they can be kept in GPU (or CPU) memory for GPU (or CPU) execution. The backend stores the state tensors for all active sequences in GPU (or CPU) memory and passes the stored state tensors as model input whenever the sequence id associated with them has an inference request.
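As a rough illustration, this bookkeeping can be sketched in a few lines of Python (illustrative only; the actual backend is implemented in C++ and keeps the state buffers in pre-allocated GPU or CPU memory, and the names below are hypothetical):

    # Illustrative sketch only: the real backend stores states in pre-allocated
    # device buffers; names here are hypothetical.
    import numpy as np

    class StateStore:
        def __init__(self, initial_state: np.ndarray):
            self.initial_state = initial_state   # initial state values, e.g. zeros
            self.states = {}                     # sequence id -> cached output state

        def input_state_for(self, seq_id: int, seq_start: bool) -> np.ndarray:
            # At the start of a sequence, any cached state is ignored.
            if seq_start or seq_id not in self.states:
                return self.initial_state.copy()
            return self.states[seq_id]

        def store_output_state(self, seq_id: int, out_state: np.ndarray, seq_end: bool):
            if seq_end:
                self.states.pop(seq_id, None)    # sequence finished, free the slot
            else:
                self.states[seq_id] = out_state  # fed back on the next request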

The state tensors are declared in the model configuration file via the state_pairs entry. For the example model in models/accumulate_fp32, the input and output state tensor pairs are specified in the parameters section as below:

   {
    key: "state_pairs"
    value: { string_value: "<<<Accumulate_In, Accumulate_Out>>>" }
   }

In general, each state pair must be surrounded by 3 pairs of angle brackets, and the state pairs must be separated by a space ' ', e.g. "<<<State_In_1, State_Out_1>>> <<<State_In_2, State_Out_2>>> ...".

During the model instance initialization, the stateful backend reserves GPU (or CPU) memory as large as max_candidate_sequences * sum_of_all_state_tensor_sizes to store model state tensors.
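For example (hypothetical numbers; the actual reservation depends on the model's state shapes and data types):

    # Hypothetical sizing example: two fp32 state tensors with 1024 and 256
    # elements per sequence, and room for 1024 candidate sequences.
    max_candidate_sequences = 1024
    state_tensor_elems = [1024, 256]              # elements per state tensor
    bytes_per_elem = 4                            # fp32
    sum_of_all_state_tensor_sizes = sum(state_tensor_elems) * bytes_per_elem
    reserved_bytes = max_candidate_sequences * sum_of_all_state_tensor_sizes
    print(reserved_bytes)                         # 5242880 bytes (5 MiB)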

Building the backend

Run:

The backend binary will be produced in the build/install/backends directory.

Alternatively, you can do the following steps to build manually:

  1. Build the custom docker image which we will use to build the backend:
NGC_VERSION=$(head -1 ./NGC_VERSION) # read the container version to use
docker build --tag triton-${NGC_VERSION}-backend -f docker/Dockerfile.backend --build-arg BASE_IMAGE_VERSION=${NGC_VERSION} .
  2. Create a container from the previously built image:
docker run --gpus all -it --rm --shm-size=1g --ulimit memlock=-1 --ulimit stack=67108864 -v${PWD}:/workspace/stateful triton-${NGC_VERSION}-backend
  3. Inside the container, run the following:
mkdir -p /workspace/stateful/build && cd /workspace/stateful/build
cmake -DCMAKE_INSTALL_PREFIX:PATH=`pwd`/install ..
make -j
make install

Using the backend with Triton

  1. Build the backend. Run the Triton server docker image, copy the backend files into Triton's backends folder, delete the existing onnxruntime backend, and set the LD_LIBRARY_PATH variable:
NGC_VERSION=$(head -1 ./NGC_VERSION) # read the container version to use
docker run --gpus all -it --rm --shm-size=1g --ulimit memlock=-1 --ulimit stack=67108864 -p8005:8005 -p8006:8006 -p8007:8007 -v${PWD}:/workspace/stateful_backend nvcr.io/nvidia/tritonserver:${NGC_VERSION}-py3
rm -rf /opt/tritonserver/backends/onnxruntime # Remove existing ORT backend to avoid having two copies
cp  -R /workspace/stateful_backend/build/install/backends/stateful ./backends/ # Copy the stateful backend
export LD_LIBRARY_PATH=/workspace/stateful_backend/build/custom-ort/lib/ # Add ORT to the LD_LIBRARY_PATH
tritonserver --grpc-port 8005 --model-repository /workspace/stateful_backend/models/ # Run the triton inference server
  2. Create an ONNX model that exposes input and output state tensors. The model should also have a mechanism to reset the initial values of state tensors at the beginning of a sequence. See the example model for a reference; a rough export sketch is shown after this list.
  3. Create a model config file that matches the ONNX model. The config file only needs to list the standard inputs and outputs, excluding the state tensors. The state pairs are declared in the parameters section. For the example ONNX model:
   {
    key: "state_pairs"
    value: { string_value: "<<<Accumulate_In, Accumulate_Out>>>" }
   }
  4. We also need a mapping from CONTROL_SEQUENCE_START to the ResetSequence boolean input tensor to reset the values of the state tensors. If this input tensor is set to true for an inference request, the input state values are ignored and the model uses the initial state values stored in the ONNX model file as constants. This mapping allows the stateful backend to reset the state tensor values at the start of a sequence.
        {
          name: "ResetSequence"
          control [
            {
              kind: CONTROL_SEQUENCE_START
              int32_false_true: [ 0, 1 ]
            }
          ]
        }
  5. Incorporate the model files in Triton's Model Repository:
        model_repository
        └── accumulate_fp32
            ├── 1
            │   └── accumulate_fp32.onnx
            └── config.pbtxt
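As a rough illustration of step 2, the sketch below exports a PyTorch module with an explicit state input/output pair and a reset input. Apart from the Accumulate_In and Accumulate_Out names taken from this document, the tensor names, shapes, and graph are assumptions for demonstration and may differ from the model shipped in models/accumulate_fp32:

    # Hypothetical export sketch; adjust names and shapes to your own model.
    import torch

    class Accumulate(torch.nn.Module):
        def __init__(self, state_size=1):
            super().__init__()
            # Initial state values baked into the ONNX file as a constant.
            self.register_buffer("initial_state", torch.zeros(1, state_size))

        def forward(self, input, accumulate_in, reset_sequence):
            # ResetSequence arrives as int32 0/1 via the int32_false_true control.
            use_initial = reset_sequence.to(torch.bool)
            state = torch.where(use_initial, self.initial_state, accumulate_in)
            accumulate_out = state + input.sum(dim=-1, keepdim=True)
            return accumulate_out.clone(), accumulate_out

    torch.onnx.export(
        Accumulate(),
        (torch.zeros(1, 8), torch.zeros(1, 1), torch.zeros(1, 1, dtype=torch.int32)),
        "accumulate_fp32.onnx",
        input_names=["Input", "Accumulate_In", "ResetSequence"],
        output_names=["Output", "Accumulate_Out"],
    )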

Testing the backend

Run:

It builds the backend, starts tritonserver with the backend, and runs a simple client with the accumulate_fp32 model.

Example Triton model

The models/accumulate_fp32 folder contains a simple Triton model with state tensors and a reset-state boolean input. The ONNX file contains a simple accumulation graph where the input tensor is summed over the last dimension and added to a running sum. The Stateful Backend keeps track of the running sum for all active sequences and provides the output state (the running sum) as input to the model whenever the corresponding sequence has an inference request.

The model configuration file maps the CONTROL_SEQUENCE_START signal to the ResetSequence model input to initialize the state values with the zero constants stored in the ONNX model. The files and folder structure can be used to serve similar stateful ONNX models.
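For instance, a gRPC client could drive a three-request sequence against this model as sketched below (the "Input"/"Output" tensor names and the 8-element input shape are assumptions and must match the actual config.pbtxt):

    # Hypothetical client sketch using the Triton Python gRPC client.
    import numpy as np
    import tritonclient.grpc as grpcclient

    client = grpcclient.InferenceServerClient("localhost:8005")
    seq_id = 1001
    data = np.ones((1, 8), dtype=np.float32)        # each request sums to 8.0

    for step in range(3):
        inp = grpcclient.InferInput("Input", list(data.shape), "FP32")
        inp.set_data_from_numpy(data)
        result = client.infer(
            model_name="accumulate_fp32",
            inputs=[inp],
            sequence_id=seq_id,
            sequence_start=(step == 0),
            sequence_end=(step == 2),
        )
        # Expected to grow across the sequence: 8.0, 16.0, 24.0 for this input.
        print(result.as_numpy("Output"))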

Additional features

{  
 key: "enable_cuda_graph"  
 # Enable cuda graph based execution  
 # WARNING: ALWAYS CHECK CORRECTNESS WHEN ENABLING THIS  
 #  Default: 0, disabled  
 value: { string_value: "1" }  
},  

NOTE: There are several restrictions on properly using CUDA Graph-based execution:

Limitations