Triton + TensorRT-LLM (Llama 3.1 8B) – Feasibility of Stateful Serving + KV Cache Reuse + Priority Caching
September 3, 2025, 4:38pm 1
triton_setup.txt (2.7 KB)
Hello everyone,
I’m working with Triton Inference Server + TensorRT-LLM backend serving the Llama-3.1-8B model.
Based on my current setup (attached), my goals for this deployment are:
- Stateful serving – avoid sending long context with every request (true continuation across sequence_id).
- KV cache reuse across requests – leverage cached K/V tensors for efficiency.
- Priority-based KV caching – allow eviction of low-priority sequences if needed.
My questions to the community
With this configuration:
- Is it feasible today to achieve true stateful continuation (i.e., send the prefix once, then continue generation without resending it) with Triton + TensorRT-LLM?
- Or is KV cache reuse currently limited to prefix caching (i.e., the same prefix must be resent for reuse)?
- Is it feasible to use priority-based caching and stateful behavior together?
- Are there example configs / references for enabling end-to-end stateful serving with LLMs in Triton?
schetlur September 5, 2025, 9:11pm 2
1 and 2: What you say in your second question is correct - you need to re-send the prefix. If reuse is possible, TRT-LLM will detect it automatically (see the reuse config sketch at the end of this reply).
- Yes, you can specify priorities for the KV cache blocks corresponding to each token via the optional request inputs defined here (see the client-side sketch at the end of this reply): TensorRT-LLM/triton_backend/all_models/inflight_batcher_llm/tensorrt_llm/config.pbtxt at 25389c9fe216c8a2a4713d42d5de880801576754 · NVIDIA/TensorRT-LLM · GitHub
- This does not need special configs - the potential for reuse should be detected and triggered automatically. There is no need for any intervention from the user of the TRT-LLM Triton backend.
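For reference, a minimal sketch of how KV cache block reuse is typically switched on in the tensorrt_llm model's config.pbtxt. This assumes the standard inflight_batcher_llm model repository (the enable_kv_cache_reuse and kv_cache_free_gpu_mem_fraction parameter names come from that template); the engine generally also needs to be built with paged context FMHA enabled for block reuse to take effect:

# tensorrt_llm/config.pbtxt (excerpt) - turn on KV cache block reuse
parameters: {
  key: "enable_kv_cache_reuse"
  value: {
    string_value: "true"
  }
}
# Optional: fraction of free GPU memory reserved for the paged KV cache
parameters: {
  key: "kv_cache_free_gpu_mem_fraction"
  value: {
    string_value: "0.9"
  }
}

# Engine build step (checkpoint and output paths are hypothetical):
# trtllm-build --checkpoint_dir ./llama-3.1-8b-ckpt --use_paged_context_fmha enable --output_dir ./engine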
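And a minimal client-side sketch of the "re-send the prefix, reuse is detected automatically" pattern. This assumes the standard inflight_batcher_llm ensemble model with its usual text_input / max_tokens / text_output tensors (the prompt text and server address are placeholders); per-token retention priorities would be passed as additional optional inputs, using the exact tensor names from the config.pbtxt linked above:

import numpy as np
import tritonclient.grpc as grpcclient

def infer(client, prompt, max_tokens=64):
    # Build the two required tensors for the ensemble model.
    text = np.array([[prompt]], dtype=object)
    tokens = np.array([[max_tokens]], dtype=np.int32)
    inputs = [
        grpcclient.InferInput("text_input", list(text.shape), "BYTES"),
        grpcclient.InferInput("max_tokens", list(tokens.shape), "INT32"),
    ]
    inputs[0].set_data_from_numpy(text)
    inputs[1].set_data_from_numpy(tokens)
    outputs = [grpcclient.InferRequestedOutput("text_output")]
    result = client.infer(model_name="ensemble", inputs=inputs, outputs=outputs)
    return result.as_numpy("text_output").flatten()[0].decode()

client = grpcclient.InferenceServerClient("localhost:8001")
prefix = "You are a helpful assistant. <long shared context> "  # hypothetical shared prefix

# First request populates the paged KV cache for the prefix tokens.
print(infer(client, prefix + "Question 1: ..."))

# Second request re-sends the same prefix; with enable_kv_cache_reuse set to true,
# TRT-LLM matches the cached blocks for the shared prefix and skips recomputing them.
print(infer(client, prefix + "Question 2: ..."))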