Triton + TensorRT-LLM (Llama 3.1 8B) – Feasibility of Stateful Serving + KV Cache Reuse + Priority Caching
September 3, 2025, 4:38pm 1
triton_setup.txt (2.7 KB)
Hello everyone,
I’m working with Triton Inference Server + TensorRT-LLM backend serving the Llama-3.1-8B model.
Based on my current setup (attached), my goals for this deployment are:
- Stateful serving – avoid sending long context with every request (true continuation across sequence_id).
- KV cache reuse across requests – leverage cached K/V tensors for efficiency.
- Priority-based KV caching – allow eviction of low-priority sequences if needed.
My questions to the community
With this configuration:
- Is it feasible today to achieve true stateful continuation (i.e., send the prefix once, then continue generation without resending it) with Triton + TensorRT-LLM?
- Or is KV cache reuse currently limited to prefix caching (i.e., the same prefix must be resent for reuse)?
- Is it feasible to use priority-based caching and stateful behavior together?
- Are there example configs / references for enabling end-to-end stateful serving with LLMs in Triton?
schetlur September 5, 2025, 9:11pm 2
1 and 2: What you say in your second question is correct - you need to re-send the prefix. If reuse is possible, TRT-LLM will detect it automatically (see the reuse config sketch at the end of this reply).
- Yes, you can specify priorities for the KV cache blocks corresponding to each token via the optional request inputs defined here (see the client-side sketch at the end of this reply): TensorRT-LLM/triton_backend/all_models/inflight_batcher_llm/tensorrt_llm/config.pbtxt at 25389c9fe216c8a2a4713d42d5de880801576754 · NVIDIA/TensorRT-LLM · GitHub
- This does not need special configs - the potential for reuse should be detected and triggered automatically. There is no need for any intervention from the user of the TRT-LLM Triton backend.
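For reference, a minimal sketch of how KV cache block reuse is typically switched on in the tensorrt_llm model's config.pbtxt. This assumes the standard inflight_batcher_llm model repository (the enable_kv_cache_reuse and kv_cache_free_gpu_mem_fraction parameter names come from that template); the engine generally also needs to be built with paged context FMHA enabled for block reuse to take effect:

# tensorrt_llm/config.pbtxt (excerpt) - turn on KV cache block reuse
parameters: {
  key: "enable_kv_cache_reuse"
  value: {
    string_value: "true"
  }
}
# Optional: fraction of free GPU memory reserved for the paged KV cache
parameters: {
  key: "kv_cache_free_gpu_mem_fraction"
  value: {
    string_value: "0.9"
  }
}

# Engine build step (checkpoint and output paths are hypothetical):
# trtllm-build --checkpoint_dir ./llama-3.1-8b-ckpt --use_paged_context_fmha enable --output_dir ./engine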
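And a minimal client-side sketch of the "re-send the prefix, reuse is detected automatically" pattern. This assumes the standard inflight_batcher_llm ensemble model with its usual text_input / max_tokens / text_output tensors (the prompt text and server address are placeholders); per-token retention priorities would be passed as additional optional inputs, using the exact tensor names from the config.pbtxt linked above:

import numpy as np
import tritonclient.grpc as grpcclient

def infer(client, prompt, max_tokens=64):
    # Build the two required tensors for the ensemble model.
    text = np.array([[prompt]], dtype=object)
    tokens = np.array([[max_tokens]], dtype=np.int32)
    inputs = [
        grpcclient.InferInput("text_input", list(text.shape), "BYTES"),
        grpcclient.InferInput("max_tokens", list(tokens.shape), "INT32"),
    ]
    inputs[0].set_data_from_numpy(text)
    inputs[1].set_data_from_numpy(tokens)
    outputs = [grpcclient.InferRequestedOutput("text_output")]
    result = client.infer(model_name="ensemble", inputs=inputs, outputs=outputs)
    return result.as_numpy("text_output").flatten()[0].decode()

client = grpcclient.InferenceServerClient("localhost:8001")
prefix = "You are a helpful assistant. <long shared context> "  # hypothetical shared prefix

# First request populates the paged KV cache for the prefix tokens.
print(infer(client, prefix + "Question 1: ..."))

# Second request re-sends the same prefix; with enable_kv_cache_reuse set to true,
# TRT-LLM matches the cached blocks for the shared prefix and skips recomputing them.
print(infer(client, prefix + "Question 2: ..."))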