Triton Inference Server | liteLLM

Triton Inference Server

LiteLLM supports chat completion and embedding models served by NVIDIA Triton Inference Server.

Property	Details
Description	NVIDIA Triton Inference Server
Provider Route on LiteLLM	triton/
Supported Operations	/chat/completions, /completions, /embeddings
Supported Triton endpoints	/infer, /generate, /embeddings
Link to Provider Doc	Triton Inference Server ↗

Triton `/generate` - Chat Completion

SDK

Use the `triton/` prefix to route requests to your Triton server. Point `api_base` at the server's `/generate` endpoint.

from litellm import completion
response = completion(
    model="triton/llama-3-8b-instruct",
    messages=[{"role": "user", "content": "who are u?"}],
    max_tokens=10,
    api_base="http://localhost:8000/generate",
)
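
The call above hard-codes the server URL. As a minimal sketch, the same arguments can be built from an environment variable and the reply text read back from the response. `TRITON_API_BASE`, `triton_completion_kwargs`, and `ask_triton` are hypothetical names chosen for this example, not part of LiteLLM, and the sketch assumes the response follows the OpenAI-style shape LiteLLM normally returns.

```python
import os


def triton_completion_kwargs(prompt, model="llama-3-8b-instruct", max_tokens=10):
    """Build keyword arguments for litellm.completion against a Triton /generate endpoint."""
    # TRITON_API_BASE is a hypothetical variable for this sketch; LiteLLM does not read it itself.
    api_base = os.environ.get("TRITON_API_BASE", "http://localhost:8000/generate")
    return {
        "model": f"triton/{model}",  # the triton/ prefix selects the Triton route
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
        "api_base": api_base,
    }


def ask_triton(prompt):
    """Send one prompt and return the reply text (needs litellm and a running Triton server)."""
    from litellm import completion  # imported lazily so the helper above works without litellm

    response = completion(**triton_completion_kwargs(prompt))
    # Assumes an OpenAI-style response: the text sits on the first choice.
    return response.choices[0].message.content
```

Calling `ask_triton("who are u?")` would then send the same request as the example above, provided a server is reachable at the configured address.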

Triton `/infer` - Chat Completion

SDK

Use the `triton/` prefix to route requests to your Triton server. Point `api_base` at the server's `/infer` endpoint.

from litellm import completion
response = completion(
    model="triton/llama-3-8b-instruct",
    messages=[{"role": "user", "content": "who are u?"}],
    max_tokens=10,
    api_base="http://localhost:8000/infer",
)
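
The `/generate` and `/infer` examples differ only in the `api_base` they pass, so a typo in the URL is easy to miss. The sketch below checks which endpoint a URL targets before any request is made. The helper `triton_endpoint` and the last-path-segment rule are assumptions of this example, not a description of how LiteLLM itself inspects the URL.

```python
from urllib.parse import urlparse

# Endpoint names taken from the "Supported Triton endpoints" row of the table above.
SUPPORTED_ENDPOINTS = ("generate", "infer", "embeddings")


def triton_endpoint(api_base):
    """Return the Triton endpoint an api_base URL targets, judged by its last path segment."""
    path = urlparse(api_base).path.rstrip("/")
    last_segment = path.rsplit("/", 1)[-1]
    if last_segment not in SUPPORTED_ENDPOINTS:
        raise ValueError(
            f"api_base should end in one of {SUPPORTED_ENDPOINTS}, got {api_base!r}"
        )
    return last_segment
```

A check like `triton_endpoint(api_base)` at startup fails fast with a readable message instead of surfacing later as an HTTP error from the server.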

Triton `/embeddings` - Embedding

SDK

Use the `triton/` prefix to route requests to your Triton server. Point `api_base` at the server's `/embeddings` endpoint.

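import asyncio

import litellm


async def main():
    # aembedding is the async variant of litellm.embedding, so it must be awaited inside a coroutine
    response = await litellm.aembedding(
        model="triton/<your-triton-model>",
        api_base="https://your-triton-api-base/triton/embeddings", # /embeddings endpoint you want litellm to call on your server
        input=["good morning from litellm"],
    )
    print(response)


asyncio.run(main())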
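
Once the embedding call returns, the vectors usually need to be pulled out of the response. The sketch below assumes the response has been converted to a plain dict in the OpenAI embedding format (a `data` list whose items carry `index` and `embedding`); whether a given LiteLLM version exposes exactly this shape for the Triton route is an assumption to verify. `embedding_vectors` is a hypothetical helper, and the sample payload is hand-written for illustration, not real model output.

```python
def embedding_vectors(response_dict):
    """Return embedding vectors from an OpenAI-format payload, ordered by input index."""
    items = sorted(response_dict["data"], key=lambda item: item["index"])
    return [item["embedding"] for item in items]


# Hand-written sample payload for illustration only, not real model output.
sample = {
    "object": "list",
    "data": [
        {"object": "embedding", "index": 1, "embedding": [0.3, 0.4]},
        {"object": "embedding", "index": 0, "embedding": [0.1, 0.2]},
    ],
}

vectors = embedding_vectors(sample)
print(len(vectors), "vectors of dimension", len(vectors[0]))
```

Sorting by `index` keeps each vector aligned with the position of its text in the `input` list, even if the items arrive out of order.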
