# KV Cache Reuse (a.k.a. prefix caching) — NVIDIA NIM for Large Language Models (LLMs)
## How to use
Enable this feature by setting the environment variable `NIM_ENABLE_KV_CACHE_REUSE` to `1`. See the configuration documentation for more information.
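For example, if you launch the NIM container with Docker, the variable can be passed on the command line. The image name, GPU flags, and port below are illustrative placeholders; substitute the values for your own deployment:

```bash
# Illustrative launch only -- replace the image name and other values with your own.
docker run --rm --gpus all \
  -e NGC_API_KEY \
  -e NIM_ENABLE_KV_CACHE_REUSE=1 \
  -p 8000:8000 \
  nvcr.io/nim/meta/llama3-8b-instruct:latest
```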
## When to use
In scenarios where more than 90% of the initial prompt is identical across multiple requests, differing only in the final tokens, reusing the KV cache can substantially improve inference speed. The computation for the shared prefix is done once and reused, so only the differing tokens at the end of each prompt need to be processed.
For example, when a user asks several questions about a large document, the document is repeated across requests while only the question at the end of the prompt changes. With this feature enabled, time-to-first-token (TTFT) typically improves by about 2x.
Example:
- Large table input followed by a question about the table
- Same large table input followed by a different question about the table
- Same large table input followed by a different question about the table
- and so forth…
KV cache reuse speeds up TTFT starting with the second request and every request after it.
You can use the following script to demonstrate the speedup:
```python
import time
import requests
import json

# Define your model endpoint URL
API_URL = "http://0.0.0.0:8000/v1/chat/completions"

# Function to send a request to the API and return the response time
def send_request(model, messages, max_tokens=15):
    data = {
        "model": model,
        "messages": messages,
        "max_tokens": max_tokens,
        "top_p": 1,
        "frequency_penalty": 1.0
    }
    headers = {
        "accept": "application/json",
        "Content-Type": "application/json"
    }
    start_time = time.time()
    response = requests.post(API_URL, headers=headers, data=json.dumps(data))
    end_time = time.time()
    output = response.json()
    print(f"Output: {output['choices'][0]['message']['content']}")
    print(f"Generation time: {end_time - start_time:.4f} seconds")
    return end_time - start_time

# Test function demonstrating caching with a long prompt
def test_prefix_caching():
    model = "your_model_name_here"

    # Long document to simulate complex input
    LONG_PROMPT = """# Table of People\n""" + \
        "| ID  | Name          | Age | Occupation    | Country       |\n" + \
        "|-----|---------------|-----|---------------|---------------|\n" + \
        "| 1   | John Doe      | 29  | Engineer      | USA           |\n" + \
        "| 2   | Jane Smith    | 34  | Doctor        | Canada        |\n" * 50  # Replicating rows to make the table long

    # First query (no caching)
    messages_1 = [{"role": "user", "content": LONG_PROMPT + "Question: What is the age of John Doe?"}]
    print("\nFirst query (no caching):")
    send_request(model, messages_1)

    # Second query (prefix caching enabled)
    messages_2 = [{"role": "user", "content": LONG_PROMPT + "Question: What is the occupation of Jane Smith?"}]
    print("\nSecond query (with prefix caching):")
    send_request(model, messages_2)

if __name__ == "__main__":
    test_prefix_caching()
```
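The script above measures total generation time. Because the benefit of prefix caching shows up primarily in TTFT, you may prefer to time the first token directly with a streaming request. The sketch below is a minimal variant that assumes the endpoint accepts the OpenAI-style `stream` parameter; the function name and details are illustrative, so adapt them to your deployment:

```python
import time
import requests

API_URL = "http://0.0.0.0:8000/v1/chat/completions"

def measure_ttft(model, messages, max_tokens=15):
    # Request a streaming response so the arrival of the first chunk can be timed.
    data = {
        "model": model,
        "messages": messages,
        "max_tokens": max_tokens,
        "stream": True,
    }
    headers = {"accept": "application/json", "Content-Type": "application/json"}
    start_time = time.time()
    ttft = None
    with requests.post(API_URL, headers=headers, json=data, stream=True) as response:
        for line in response.iter_lines():
            if not line:
                continue
            # The first non-empty server-sent event is treated as the first token.
            # (This is approximate: the first chunk may carry only role metadata.)
            ttft = time.time() - start_time
            break
    if ttft is not None:
        print(f"Time to first token: {ttft:.4f} seconds")
    return ttft
```

Calling this function twice with the same long table prefix, as in `test_prefix_caching` above, should show a noticeably lower TTFT on the second call when `NIM_ENABLE_KV_CACHE_REUSE=1` is set.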