Azure API Management policy reference - azure-openai-semantic-cache-lookup

APPLIES TO: All API Management tiers

Use the azure-openai-semantic-cache-lookup policy to perform cache lookup of responses to Azure OpenAI Chat Completion API requests from a configured external cache, based on vector proximity of the prompt to previous requests and a specified similarity score threshold. Response caching reduces bandwidth and processing requirements imposed on the backend Azure OpenAI API and lowers latency perceived by API consumers.
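The lookup can be pictured as: embed the incoming prompt, measure its vector distance to previously cached prompts, and return a cached response only when the distance falls within the configured threshold. The following Python sketch is a conceptual model of that flow, not APIM internals; the function names and the use of cosine distance are illustrative assumptions.

```python
# Conceptual sketch (not APIM internals): return a cached response when
# the prompt's embedding is within the similarity-score threshold of a
# previously cached prompt. Smaller distance = more semantically similar.
import math

def cosine_distance(a, b):
    """1 - cosine similarity; smaller values mean greater similarity."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norm

def semantic_lookup(prompt_embedding, cache, score_threshold):
    """Return the closest cached response within the threshold, else None."""
    best = None
    for cached_embedding, response in cache:
        d = cosine_distance(prompt_embedding, cached_embedding)
        if d <= score_threshold and (best is None or d < best[0]):
            best = (d, response)
    return best[1] if best else None  # None -> cache miss, call the backend
```

On a miss (`None`), the request proceeds to the Azure OpenAI backend, and the corresponding `azure-openai-semantic-cache-store` policy can cache the new response on the way out.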

Note

Supported Azure OpenAI Service models

The policy is used with APIs added to API Management from the Azure OpenAI Service of the following types:

API type	Supported models
Chat completion	gpt-3.5, gpt-4, gpt-4o, gpt-4o-mini, o1, o3
Embeddings	text-embedding-3-large, text-embedding-3-small, text-embedding-ada-002
Responses (preview)	gpt-4o (versions: 2024-11-20, 2024-08-06, 2024-05-13), gpt-4o-mini (version: 2024-07-18), gpt-4.1 (version: 2025-04-14), gpt-4.1-nano (version: 2025-04-14), gpt-4.1-mini (version: 2025-04-14), gpt-image-1 (version: 2025-04-15), o3 (version: 2025-04-16), o4-mini (version: 2025-04-16)

Note

The traditional Completions API is available only with legacy model versions, and support is limited.

For current information about the models and their capabilities, see Azure OpenAI Service models.

Policy statement

<azure-openai-semantic-cache-lookup
    score-threshold="similarity score threshold"
    embeddings-backend-id="backend entity ID for embeddings API"
    embeddings-backend-auth="system-assigned"
    ignore-system-messages="true | false"
    max-message-count="count">
    <vary-by>"expression to partition caching"</vary-by>
</azure-openai-semantic-cache-lookup>

Attributes

Attribute Description Required Default
score-threshold Similarity score threshold used to determine whether to return a cached response to a prompt. Value is a decimal between 0.0 and 1.0. Smaller values represent greater semantic similarity. Learn more. Yes N/A
embeddings-backend-id Backend ID for OpenAI embeddings API call. Yes N/A
embeddings-backend-auth Authentication used for Azure OpenAI embeddings API backend. Yes. Must be set to system-assigned. N/A
ignore-system-messages Boolean. When set to true (recommended), removes system messages from a GPT chat completion prompt before assessing cache similarity. No false
max-message-count If specified, number of remaining dialog messages after which caching is skipped. No N/A
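The `ignore-system-messages` and `max-message-count` attributes can be thought of as pre-checks applied before the embedding comparison. This sketch is a hedged approximation of that behavior; the message shape follows the Chat Completions request format, and the helper names are illustrative, not APIM internals.

```python
# Illustrative pre-checks (not APIM internals) for the two optional
# attributes described in the table above.

def should_attempt_lookup(messages, max_message_count=None):
    """Approximation of max-message-count: skip caching once the dialog
    grows beyond the configured number of messages."""
    return max_message_count is None or len(messages) <= max_message_count

def prompt_for_similarity(messages, ignore_system_messages=False):
    """Approximation of ignore-system-messages: optionally drop system
    messages before the prompt text is embedded for comparison."""
    if ignore_system_messages:
        messages = [m for m in messages if m["role"] != "system"]
    return "\n".join(m["content"] for m in messages)
```

Dropping system messages is recommended because boilerplate instructions shared by every request would otherwise inflate similarity scores between unrelated prompts.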

Elements

Name Description Required
vary-by A custom expression determined at runtime whose value partitions caching. If multiple vary-by elements are added, values are concatenated to create a unique combination. No
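Concatenating multiple `vary-by` values yields one partition key per unique combination, so cache entries are shared only among requests that produce the same combination (for example, the same subscription). A minimal sketch of that idea, with an illustrative separator:

```python
# Illustrative sketch: multiple vary-by values combine into a single
# cache-partition key; entries are shared only within the same combination.
def partition_key(vary_by_values):
    return "|".join(vary_by_values)  # separator choice is an assumption
```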

Usage

Usage notes

Examples

Example with corresponding azure-openai-semantic-cache-store policy

<policies>
    <inbound>
        <base />
        <azure-openai-semantic-cache-lookup
            score-threshold="0.05"
            embeddings-backend-id="azure-openai-backend"
            embeddings-backend-auth="system-assigned">
            <vary-by>@(context.Subscription.Id)</vary-by>
        </azure-openai-semantic-cache-lookup>
    </inbound>
    <outbound>
        <azure-openai-semantic-cache-store duration="60" />
        <base />
    </outbound>
</policies>

For more information about working with policies, see: