Quick Tour
The easiest way to get started is with the official Docker container. Install Docker by following their installation instructions.
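To run on an NVIDIA GPU, Docker also needs the NVIDIA Container Toolkit so containers can access the GPU. As a quick sanity check (a sketch, assuming the toolkit is already installed), nvidia-smi should run inside a plain container:

docker run --rm --gpus all ubuntu nvidia-smi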
Let's say you want to deploy the teknium/OpenHermes-2.5-Mistral-7B model with TGI on an NVIDIA GPU. Here is an example of how to do that:
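A minimal sketch of such a deployment, assuming the GPU setup above and the same image tag used in the --help example at the end of this section; the volume mount keeps the downloaded weights cached between runs:

model=teknium/OpenHermes-2.5-Mistral-7B
volume=$PWD/data  # share a volume with the container to avoid re-downloading weights every run

docker run --gpus all --shm-size 1g -p 8080:80 -v $volume:/data \
    ghcr.io/huggingface/text-generation-inference:3.3.3 \
    --model-id $model

Once the server prints that it is listening, it accepts requests on port 8080 of the host.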
If you want to serve gated or private models, please refer to this guide for detailed instructions.
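As a short illustration (the guide has the full details), the usual change is to pass your Hugging Face access token into the container, here assumed to be stored in the HF_TOKEN environment variable:

docker run --gpus all --shm-size 1g -e HF_TOKEN=$HF_TOKEN -p 8080:80 -v $volume:/data \
    ghcr.io/huggingface/text-generation-inference:3.3.3 \
    --model-id $model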
Once TGI is running, you can send requests to the generate endpoint or to the OpenAI Chat Completions API-compatible Messages API. To learn more about how to query the endpoints, check the Consuming TGI section, where we show examples with utility libraries and UIs. Below you can see a simple snippet to query the generate endpoint.
import requests

headers = {
    "Content-Type": "application/json",
}

data = {
    "inputs": "What is Deep Learning?",
    "parameters": {
        "max_new_tokens": 20,
    },
}

response = requests.post("http://127.0.0.1:8080/generate", headers=headers, json=data)
print(response.json())
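For the OpenAI-compatible Messages API mentioned above, the request goes to the /v1/chat/completions route instead. A sketch, assuming the same server and port mapping as above:

import requests

headers = {
    "Content-Type": "application/json",
}

data = {
    # "tgi" is a placeholder; the server answers with the model it was launched with
    "model": "tgi",
    "messages": [
        {"role": "user", "content": "What is Deep Learning?"},
    ],
    "max_tokens": 20,
}

response = requests.post("http://127.0.0.1:8080/v1/chat/completions", headers=headers, json=data)
print(response.json())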
To see all possible deployment flags and options, you can use the --help flag. It's possible to configure the number of shards, quantization, generation parameters, and more.
docker run ghcr.io/huggingface/text-generation-inference:3.3.3 --help
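For example, a sketch that shards the model across two GPUs and applies on-the-fly quantization; --num-shard and --quantize are among the launcher flags listed by --help, and the right values depend on your hardware:

docker run --gpus all --shm-size 1g -p 8080:80 -v $volume:/data \
    ghcr.io/huggingface/text-generation-inference:3.3.3 \
    --model-id $model \
    --num-shard 2 \
    --quantize bitsandbytes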