KV Cache Manager

In Transformer-based models, the KV (Key-Value) Cache is a mechanism used to optimize decoding efficiency, particularly during autoregressive generation tasks. Since KV Cache requires memory to store, it is also an important resource. In TensorRT LLM, KV Cache is managed by the KVCacheManager.
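
To make the mechanism concrete, the sketch below shows single-head autoregressive decoding with a KV cache in plain PyTorch. It illustrates the general idea only and is not TensorRT LLM code: at each step, only the newest token's key and value are computed, while earlier ones are reused from the cache.

import torch

d_model = 64
w_q = torch.randn(d_model, d_model)
w_k = torch.randn(d_model, d_model)
w_v = torch.randn(d_model, d_model)

k_cache, v_cache = [], []  # grows by one entry per decoded token

def decode_step(x_t):
    # x_t: embedding of the newest token, shape (1, d_model)
    q = x_t @ w_q
    # Compute K/V only for the newest token; older K/V come from the cache.
    k_cache.append(x_t @ w_k)
    v_cache.append(x_t @ w_v)
    k = torch.cat(k_cache)  # (t, d_model)
    v = torch.cat(v_cache)  # (t, d_model)
    attn = torch.softmax(q @ k.T / d_model**0.5, dim=-1)
    return attn @ v

for _ in range(4):  # four decode steps
    out = decode_step(torch.randn(1, d_model))

Without the cache, step t would recompute K and V for all t previous tokens; with it, the per-step K/V cost is constant, at the price of memory that grows with sequence length. That memory is exactly what the KVCacheManager budgets and tracks.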

For details of the TensorRT LLM KVCacheManager implementation, see KV Cache Management.

KV Cache Manager Introduction

KVCacheManager is a resource manager: it inherits from BaseResourceManager and therefore implements the interfaces that the base class declares.

Note: As the project evolves, these interfaces may change.

Interfaces

The interfaces from BaseResourceManager include:

- prepare_resources: Called in PyExecutor before the model forward computation for the requests in the current batch. For KVCacheManager, this allocates KV Cache for the tokens about to be computed.
- update_resources: Called at the end of each scheduled batch for resources that need updating after the forward computation. The current KVCacheManager does nothing in this interface.
- free_resources: Called when a request finishes. For KVCacheManager, this frees the KV Cache held by the finished request.
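
As an illustration of this contract, a minimal block-based manager might look like the sketch below. This is a hand-written sketch, not the actual BaseResourceManager or KVCacheManager source; the request and batch attributes used here (all_requests, request_id, num_tokens) are assumed stand-ins.

class SimpleBlockManager:
    # Sketch of a resource manager following the BaseResourceManager-style
    # contract described above.
    def __init__(self, num_blocks, tokens_per_block):
        self.tokens_per_block = tokens_per_block
        self.free_blocks = list(range(num_blocks))
        self.blocks_by_request = {}  # request_id -> allocated block ids

    def prepare_resources(self, scheduled_batch):
        # Before the forward step: ensure each scheduled request has room
        # for the token it is about to produce.
        for request in scheduled_batch.all_requests:  # assumed attribute
            blocks = self.blocks_by_request.setdefault(request.request_id, [])
            if request.num_tokens >= len(blocks) * self.tokens_per_block:
                blocks.append(self.free_blocks.pop())

    def update_resources(self, scheduled_batch):
        # After the forward step: nothing to update in this sketch.
        pass

    def free_resources(self, request):
        # On request completion: return its blocks to the free pool.
        self.free_blocks.extend(
            self.blocks_by_request.pop(request.request_id, []))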

There are also two interfaces designed for CapacityScheduler:

- get_max_resource_count: Returns the maximum number of resource units (KV Cache blocks) the manager can hold.
- get_needed_resource_to_completion: Returns the number of resource units a request still needs to run to completion. CapacityScheduler sums this value over candidate requests to decide which of them can be scheduled.
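
For intuition, the admission check a capacity scheduler can build from these two calls might look like the following sketch (the policy shown is illustrative, not the actual CapacityScheduler logic):

def can_schedule(manager, admitted_requests, candidate):
    # Admit `candidate` only if, together with already-admitted requests,
    # it can still run to completion within the manager's total capacity.
    reserved = sum(manager.get_needed_resource_to_completion(r)
                   for r in admitted_requests)
    needed = manager.get_needed_resource_to_completion(candidate)
    return reserved + needed <= manager.get_max_resource_count()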

In addition to the BaseResourceManager interfaces, KVCacheManager has interfaces related to the ModelEngine in use. For PyTorchModelEngine, the common interfaces include:

- get_cache_indices: Returns the indices of all KV Cache blocks used by a request; PyTorchModelEngine uses them to assemble the model inputs.
- get_num_kv_blocks: Returns the number of KV Cache blocks needed to hold a given number of tokens.
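
For example, a block-count computation in the spirit of get_num_kv_blocks is a ceiling division of tokens by block size. The helper below is illustrative, with an assumed block size:

import math

def num_kv_blocks(num_tokens, tokens_per_block=64):
    # Number of fixed-size blocks needed to hold `num_tokens` tokens.
    return math.ceil(num_tokens / tokens_per_block)

assert num_kv_blocks(1) == 1    # one token still occupies a whole block
assert num_kv_blocks(65) == 2   # one token past a block boundary adds a block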

There are also interfaces for warming up PyTorchModelEngine, especially when using CUDA graphs:

- get_num_free_blocks: Returns the number of free KV Cache blocks currently available.
- add_dummy_requests: Creates dummy requests and allocates KV Cache for them, so warmup forward passes (for example, CUDA graph capture) exercise the same allocation path as real requests.
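
A warmup loop built on these interfaces could look roughly like the sketch below. The add_dummy_requests signature and the engine's capture entry point are assumptions for illustration, not verified API:

def warmup_cuda_graphs(model_engine, kv_cache_manager, batch_sizes):
    for batch_size in batch_sizes:
        # Allocate KV Cache for synthetic requests so graph capture runs
        # through the same allocation path as real decoding.
        dummy_requests = kv_cache_manager.add_dummy_requests(
            request_ids=list(range(batch_size)))  # assumed signature
        model_engine.capture_cuda_graph(dummy_requests)  # hypothetical hook
        for request in dummy_requests:
            kv_cache_manager.free_resources(request)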

Customize KV Cache Manager

To customize the KVCacheManager, implement all of the interfaces above, then integrate it into the PyExecutor. For the PyTorch backend, the relevant code is in pytorch_model_registry.py: inside the create_pytorch_model_based_executor function, the KVCacheManager is instantiated as follows:

kv_cache_manager = KVCacheManager(
    executor_config.kv_cache_config,
    tensorrt_llm.bindings.internal.batch_manager.CacheType.SELF,
    num_layers=model_engine.model.config.num_hidden_layers,
    num_kv_heads=model_engine.model.config.num_key_value_heads,
    head_dim=head_dim,
    tokens_per_block=tokens_per_block,
    max_seq_len=max_seq_len,
    max_batch_size=max_num_requests,
    mapping=mapping,
    dtype=kv_cache_dtype,
)

For local testing or a proof of concept, update these lines to use your implementation, then run the PyExecutor to confirm it works with your customized KVCacheManager.
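
To make this concrete, the sketch below subclasses KVCacheManager and swaps it into the instantiation shown above. MyKVCacheManager and its overridden method body are hypothetical; it keeps the same constructor arguments as the stock class.

class MyKVCacheManager(KVCacheManager):
    # Hypothetical subclass: override only what your experiment needs.
    def prepare_resources(self, scheduled_batch):
        # Insert a custom allocation policy here, then defer to the
        # stock implementation.
        return super().prepare_resources(scheduled_batch)

kv_cache_manager = MyKVCacheManager(
    executor_config.kv_cache_config,
    tensorrt_llm.bindings.internal.batch_manager.CacheType.SELF,
    num_layers=model_engine.model.config.num_hidden_layers,
    num_kv_heads=model_engine.model.config.num_key_value_heads,
    head_dim=head_dim,
    tokens_per_block=tokens_per_block,
    max_seq_len=max_seq_len,
    max_batch_size=max_num_requests,
    mapping=mapping,
    dtype=kv_cache_dtype,
)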