Triton + TensorRT-LLM (Llama 3.1 8B) – Feasibility of Stateful Serving + KV Cache Reuse + Priority Caching

September 3, 2025, 4:38pm 1

triton_setup.txt (2.7 KB)

Hello everyone,

I’m working with Triton Inference Server + TensorRT-LLM backend serving the Llama-3.1-8B model.

Based on my current setup (attached as triton_setup.txt), my goals for this deployment are:

  1. Stateful serving – avoid sending long context with every request (true continuation across sequence_id).
  2. KV cache reuse across requests – leverage cached K/V tensors for efficiency (see the config sketch right after this list).
  3. Priority-based KV caching – allow eviction of low-priority sequences if needed.
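
For goal 2 specifically, block-level KV cache reuse in the TensorRT-LLM backend is normally gated by the enable_kv_cache_reuse parameter of the tensorrt_llm model. Here is a minimal sketch of that part of tensorrt_llm/config.pbtxt (values are illustrative placeholders, not my attached setup):

    # Illustrative excerpt from tensorrt_llm/config.pbtxt (placeholder values).
    # enable_kv_cache_reuse gates block reuse; kv_cache_free_gpu_mem_fraction bounds the KV pool size.
    parameters: {
      key: "enable_kv_cache_reuse"
      value: { string_value: "true" }
    }
    parameters: {
      key: "kv_cache_free_gpu_mem_fraction"
      value: { string_value: "0.9" }
    }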

My questions to the community

With this configuration:

  1. Is it feasible today to achieve true stateful continuation (i.e., send prefix once, then continue generation without resending it) with Triton + TensorRT-LLM?
  2. Or is KV cache reuse currently limited to prefix caching (must resend the same prefix for reuse)?
  3. Is it feasible to use priority-based caching and stateful behavior together?
  4. Are there example configs / references for enabling end-to-end stateful serving with LLMs in Triton?

schetlur September 5, 2025, 9:11pm 2

1 and 2: What you say in question 2 is correct - you need to re-send the prefix. If reuse is possible, it will be detected automatically by TRT-LLM.
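
As a concrete sketch of the re-send-the-prefix pattern, here is a minimal Python client against the inflight_batcher_llm ensemble. The model and tensor names ("ensemble", "text_input", "max_tokens", "text_output") and the [1, 1] shapes are assumptions taken from the repo templates, so verify them against your own config.pbtxt:

    import numpy as np
    import tritonclient.grpc as grpcclient

    client = grpcclient.InferenceServerClient(url="localhost:8001")

    def generate(prompt: str, max_tokens: int = 64) -> str:
        # Tensor names/shapes follow the ensemble template; adjust if your config differs.
        text_in = grpcclient.InferInput("text_input", [1, 1], "BYTES")
        text_in.set_data_from_numpy(np.array([[prompt.encode("utf-8")]], dtype=object))
        max_tok = grpcclient.InferInput("max_tokens", [1, 1], "INT32")
        max_tok.set_data_from_numpy(np.array([[max_tokens]], dtype=np.int32))
        result = client.infer("ensemble", [text_in, max_tok])
        return result.as_numpy("text_output").flatten()[0].decode("utf-8")

    shared_prefix = open("long_context.txt").read()  # the long context, resent with every request
    print(generate(shared_prefix + "\n\nQ1: Summarize the document."))
    # The second request repeats the same prefix text; with enable_kv_cache_reuse on,
    # TRT-LLM matches the cached prefix blocks and skips recomputing them.
    print(generate(shared_prefix + "\n\nQ2: List the key entities."))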

  3. Yes, you can specify priorities for the KV cache blocks corresponding to each token via per-request inputs in the model config here (a rough excerpt of the relevant inputs follows after this list): TensorRT-LLM/triton_backend/all_models/inflight_batcher_llm/tensorrt_llm/config.pbtxt at 25389c9fe216c8a2a4713d42d5de880801576754 · NVIDIA/TensorRT-LLM · GitHub
  4. This does not need special configs – the potential for reuse is detected and triggered automatically; there is no need for any intervention from the user of the TRT-LLM Triton backend.
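
For item 3, the priorities are passed as optional per-request inputs to the tensorrt_llm model. The excerpt below is a rough sketch of how those inputs are declared in the linked config.pbtxt; the exact names, dtypes, and dims can differ between versions, so check the pinned commit before relying on them:

    # Sketch of the optional per-request retention inputs (verify against the pinned commit).
    input [
      {
        name: "retention_token_range_starts"      # start offsets of the token ranges to tag
        data_type: TYPE_INT32
        dims: [ -1 ]
        optional: true
      },
      {
        name: "retention_token_range_ends"        # exclusive end offsets of those ranges
        data_type: TYPE_INT32
        dims: [ -1 ]
        optional: true
      },
      {
        name: "retention_token_range_priorities"  # eviction priority per range (higher = retain longer)
        data_type: TYPE_INT32
        dims: [ -1 ]
        optional: true
      }
    ]

Lower-priority blocks are evicted first when the KV cache pool fills up, which is how priority-based eviction can coexist with the prefix-reuse pattern shown earlier.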