Triton Inference Server | liteLLM
LiteLLM supports chat completion, completion, and embedding models served by NVIDIA Triton Inference Server.
| Property | Details |
|---|---|
| Description | NVIDIA Triton Inference Server |
| Provider Route on LiteLLM | triton/ |
| Supported Operations | /chat/completions, /completions, /embeddings |
| Supported Triton endpoints | /infer, /generate, /embeddings |
| Link to Provider Doc | Triton Inference Server ↗ |
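The provider route in the table is a prefix on the model string: everything before the first slash names the provider, and the rest is the model name deployed on the Triton server. As a rough illustration only, the split can be sketched as below. `split_provider` is a hypothetical helper written for this page, not LiteLLM's internal routing code.

```python
def split_provider(model: str) -> tuple[str, str]:
    """Split "<provider>/<model-name>" into (provider, model_name).

    Hypothetical helper for illustration; LiteLLM resolves the
    provider itself when you pass a prefixed model string.
    """
    provider, sep, name = model.partition("/")
    if not sep or not provider or not name:
        raise ValueError(f"expected '<provider>/<model>', got {model!r}")
    return provider, name


provider, name = split_provider("triton/llama-3-8b-instruct")
print(provider, name)
```

Only the first slash is treated as the separator here, so a model name that itself contains slashes stays intact.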
Triton /generate - Chat Completion
SDK: use the triton/ prefix on the model name to route the request to the Triton server.
from litellm import completion

response = completion(
    model="triton/llama-3-8b-instruct",
    messages=[{"role": "user", "content": "who are u?"}],
    max_tokens=10,
    api_base="http://localhost:8000/generate",
)
Triton /infer - Chat Completion
SDK: use the triton/ prefix on the model name to route the request to the Triton server.
from litellm import completion

response = completion(
    model="triton/llama-3-8b-instruct",
    messages=[{"role": "user", "content": "who are u?"}],
    max_tokens=10,
    api_base="http://localhost:8000/infer",
)
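The /generate and /infer examples differ only in the path at the end of api_base. A minimal sketch of that idea follows, assuming your server exposes the endpoints at the same paths used above. The helper names `triton_completion_kwargs` and `ask` are hypothetical and not part of LiteLLM; only `litellm.completion` and the response fields it returns come from the library.

```python
def triton_completion_kwargs(base_url: str, endpoint: str, model: str, prompt: str) -> dict:
    """Build keyword arguments for litellm.completion against a Triton server.

    Hypothetical helper: `endpoint` is "generate" or "infer", matching the
    two api_base paths shown in the examples above.
    """
    if endpoint not in ("generate", "infer"):
        raise ValueError(f"unsupported Triton endpoint: {endpoint!r}")
    return {
        "model": f"triton/{model}",
        "messages": [{"role": "user", "content": prompt}],
        "api_base": f"{base_url.rstrip('/')}/{endpoint}",
    }


def ask(base_url: str, endpoint: str, model: str, prompt: str) -> str:
    """Send one chat message and return the reply text (needs a running server)."""
    # Imported here so the kwargs helper can be used without litellm installed.
    from litellm import completion

    response = completion(**triton_completion_kwargs(base_url, endpoint, model, prompt))
    return response.choices[0].message.content


kwargs = triton_completion_kwargs(
    "http://localhost:8000", "infer", "llama-3-8b-instruct", "who are u?"
)
print(kwargs["api_base"])
```

Calling `ask("http://localhost:8000", "generate", "llama-3-8b-instruct", "who are u?")` would issue the same request as the /generate example, provided a Triton server is listening at that address.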
Triton /embeddings - Embedding
SDK: use the triton/ prefix on the model name to route the request to the Triton server.
import asyncio
import litellm

async def main():
    response = await litellm.aembedding(
        model="triton/<your-triton-model>",
        api_base="https://your-triton-api-base/triton/embeddings",  # /embeddings endpoint you want litellm to call on your server
        input=["good morning from litellm"],
    )
    print(response)

asyncio.run(main())
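If you do not need async, the synchronous `litellm.embedding` call takes the same arguments. The sketch below also pulls the raw vectors out of the result. It assumes the response follows the OpenAI embedding shape (a `data` list whose items carry an `embedding` field) and supports key access; `embedding_vectors` and `embed_texts` are hypothetical helper names, not LiteLLM functions.

```python
def embedding_vectors(response) -> list:
    """Return one vector per input text.

    Assumes an OpenAI-style embedding response: response["data"] is a
    list of items, each with an "embedding" list of floats.
    """
    return [item["embedding"] for item in response["data"]]


def embed_texts(texts: list, model: str, api_base: str) -> list:
    """Embed `texts` via a Triton /embeddings endpoint (needs a running server)."""
    # Imported here so embedding_vectors can be used without litellm installed.
    from litellm import embedding

    response = embedding(model=model, api_base=api_base, input=texts)
    return embedding_vectors(response)


# Offline demonstration with a hand-written response of the assumed shape.
sample = {"data": [{"index": 0, "embedding": [0.1, 0.2, 0.3]}]}
print(len(embedding_vectors(sample)[0]))
```

With a live server, `embed_texts(["good morning from litellm"], "triton/<your-triton-model>", "https://your-triton-api-base/triton/embeddings")` would return a list containing one vector.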