GitHub - WisdomShell/text-generation-inference: Large Language Model Text Generation Inference (original) (raw)

Making TGI deployment optimal

GitHub Repo stars Swagger API documentation

A Rust, Python and gRPC server for text generation inference. Used in production at HuggingFaceto power Hugging Chat, the Inference API and Inference Endpoint.

Table of contents

Text Generation Inference (TGI) is a toolkit for deploying and serving Large Language Models (LLMs). TGI enables high-performance text generation for the most popular open-source LLMs, including Llama, Falcon, StarCoder, BLOOM, GPT-NeoX, and more. TGI implements many features, such as:

Optimized architectures

Other architectures are supported on a best effort basis using:

AutoModelForCausalLM.from_pretrained(<model>, device_map="auto")

or

Get Started

Docker

For a detailed starting guide, please see the Quick Tour. The easiest way of getting started is using the official Docker container:

model=WisdomShell/CodeShell-7B volume=$PWD/data # share a volume with the Docker container to avoid downloading weights every run

docker run --gpus all --shm-size 1g -p 8080:80 -v volume:/datazzr0/text−generation−inference:shell−1.4.0−−model−idvolume:/data zzr0/text-generation-inference:shell-1.4.0 --model-id volume:/datazzr0/textgenerationinference:shell1.4.0modelidmodel

Note: To use GPUs, you need to install the NVIDIA Container Toolkit. We also recommend using NVIDIA drivers with CUDA version 11.8 or higher. For running the Docker container on a machine with no GPUs or CUDA support, it is enough to remove the --gpus all flag and add --disable-custom-kernels, please note CPU is not the intended platform for this project, so performance might be subpar.

To see all options to serve your models (in the code or in the cli):

text-generation-launcher --help

And then you can make requests like

curl 127.0.0.1:8080/generate
-X POST
-d '{"inputs":"What is Deep Learning?","parameters":{"max_new_tokens":20}}'
-H 'Content-Type: application/json'

Note: To use NVIDIA GPUs, you need to install the NVIDIA Container Toolkit. We also recommend using NVIDIA drivers with CUDA version 12.2 or higher. For running the Docker container on a machine with no GPUs or CUDA support, it is enough to remove the --gpus all flag and add --disable-custom-kernels, please note CPU is not the intended platform for this project, so performance might be subpar.

Note: TGI supports AMD Instinct MI210 and MI250 GPUs. Details can be found in the Supported Hardware documentation. To use AMD GPUs, please use docker run --device /dev/kfd --device /dev/dri --shm-size 1g -p 8080:80 -v <span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><mi>v</mi><mi>o</mi><mi>l</mi><mi>u</mi><mi>m</mi><mi>e</mi><mo>:</mo><mi mathvariant="normal">/</mi><mi>d</mi><mi>a</mi><mi>t</mi><mi>a</mi><mi>g</mi><mi>h</mi><mi>c</mi><mi>r</mi><mi mathvariant="normal">.</mi><mi>i</mi><mi>o</mi><mi mathvariant="normal">/</mi><mi>h</mi><mi>u</mi><mi>g</mi><mi>g</mi><mi>i</mi><mi>n</mi><mi>g</mi><mi>f</mi><mi>a</mi><mi>c</mi><mi>e</mi><mi mathvariant="normal">/</mi><mi>t</mi><mi>e</mi><mi>x</mi><mi>t</mi><mo>−</mo><mi>g</mi><mi>e</mi><mi>n</mi><mi>e</mi><mi>r</mi><mi>a</mi><mi>t</mi><mi>i</mi><mi>o</mi><mi>n</mi><mo>−</mo><mi>i</mi><mi>n</mi><mi>f</mi><mi>e</mi><mi>r</mi><mi>e</mi><mi>n</mi><mi>c</mi><mi>e</mi><mo>:</mo><mn>1.4</mn><mo>−</mo><mi>r</mi><mi>o</mi><mi>c</mi><mi>m</mi><mo>−</mo><mo>−</mo><mi>m</mi><mi>o</mi><mi>d</mi><mi>e</mi><mi>l</mi><mo>−</mo><mi>i</mi><mi>d</mi></mrow><annotation encoding="application/x-tex">volume:/data ghcr.io/huggingface/text-generation-inference:1.4-rocm --model-id </annotation></semantics></math></span><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height:0.6944em;"></span><span class="mord mathnormal" style="margin-right:0.03588em;">v</span><span class="mord mathnormal">o</span><span class="mord mathnormal" style="margin-right:0.01968em;">l</span><span class="mord mathnormal">u</span><span class="mord mathnormal">m</span><span class="mord mathnormal">e</span><span class="mspace" style="margin-right:0.2778em;"></span><span class="mrel">:</span><span class="mspace" style="margin-right:0.2778em;"></span></span><span class="base"><span class="strut" style="height:1em;vertical-align:-0.25em;"></span><span class="mord">/</span><span class="mord mathnormal">d</span><span class="mord mathnormal">a</span><span class="mord mathnormal">t</span><span class="mord mathnormal">a</span><span class="mord mathnormal" style="margin-right:0.03588em;">g</span><span class="mord mathnormal">h</span><span class="mord mathnormal" style="margin-right:0.02778em;">cr</span><span class="mord">.</span><span class="mord mathnormal">i</span><span class="mord mathnormal">o</span><span class="mord">/</span><span class="mord mathnormal">h</span><span class="mord mathnormal" style="margin-right:0.03588em;">ugg</span><span class="mord mathnormal">in</span><span class="mord mathnormal" style="margin-right:0.03588em;">g</span><span class="mord mathnormal" style="margin-right:0.10764em;">f</span><span class="mord mathnormal">a</span><span class="mord mathnormal">ce</span><span class="mord">/</span><span class="mord mathnormal">t</span><span class="mord mathnormal">e</span><span class="mord mathnormal">x</span><span class="mord mathnormal">t</span><span class="mspace" style="margin-right:0.2222em;"></span><span class="mbin">−</span><span class="mspace" style="margin-right:0.2222em;"></span></span><span class="base"><span class="strut" style="height:0.854em;vertical-align:-0.1944em;"></span><span class="mord mathnormal" style="margin-right:0.03588em;">g</span><span class="mord mathnormal">e</span><span class="mord mathnormal">n</span><span class="mord mathnormal" style="margin-right:0.02778em;">er</span><span class="mord mathnormal">a</span><span class="mord mathnormal">t</span><span class="mord mathnormal">i</span><span class="mord mathnormal">o</span><span class="mord mathnormal">n</span><span class="mspace" style="margin-right:0.2222em;"></span><span class="mbin">−</span><span class="mspace" style="margin-right:0.2222em;"></span></span><span class="base"><span class="strut" style="height:0.8889em;vertical-align:-0.1944em;"></span><span class="mord mathnormal">in</span><span class="mord mathnormal" style="margin-right:0.10764em;">f</span><span class="mord mathnormal">ere</span><span class="mord mathnormal">n</span><span class="mord mathnormal">ce</span><span class="mspace" style="margin-right:0.2778em;"></span><span class="mrel">:</span><span class="mspace" style="margin-right:0.2778em;"></span></span><span class="base"><span class="strut" style="height:0.7278em;vertical-align:-0.0833em;"></span><span class="mord">1.4</span><span class="mspace" style="margin-right:0.2222em;"></span><span class="mbin">−</span><span class="mspace" style="margin-right:0.2222em;"></span></span><span class="base"><span class="strut" style="height:0.6667em;vertical-align:-0.0833em;"></span><span class="mord mathnormal">roc</span><span class="mord mathnormal">m</span><span class="mspace" style="margin-right:0.2222em;"></span><span class="mbin">−</span><span class="mspace" style="margin-right:0.2222em;"></span></span><span class="base"><span class="strut" style="height:0.7778em;vertical-align:-0.0833em;"></span><span class="mord">−</span><span class="mord mathnormal">m</span><span class="mord mathnormal">o</span><span class="mord mathnormal">d</span><span class="mord mathnormal">e</span><span class="mord mathnormal" style="margin-right:0.01968em;">l</span><span class="mspace" style="margin-right:0.2222em;"></span><span class="mbin">−</span><span class="mspace" style="margin-right:0.2222em;"></span></span><span class="base"><span class="strut" style="height:0.6944em;"></span><span class="mord mathnormal">i</span><span class="mord mathnormal">d</span></span></span></span>model instead of the command above.

To see all options to serve your models (in the code or in the cli):

text-generation-launcher --help

API documentation

You can consult the OpenAPI documentation of the text-generation-inference REST API using the /docs route. The Swagger UI is also available at: https://huggingface.github.io/text-generation-inference.

Using a private or gated model

You have the option to utilize the HUGGING_FACE_HUB_TOKEN environment variable for configuring the token employed bytext-generation-inference. This allows you to gain access to protected resources.

For example, if you want to serve the gated Llama V2 model variants:

  1. Go to https://huggingface.co/settings/tokens
  2. Copy your cli READ token
  3. Export HUGGING_FACE_HUB_TOKEN=<your cli READ token>

or with Docker:

model=WisdomShell/CodeShell-7B volume=$PWD/data # share a volume with the Docker container to avoid downloading weights every run token=

docker run --gpus all --shm-size 1g -e HUGGING_FACE_HUB_TOKEN=$token -p 8080:80 -v volume:/datazzr0/text−generation−inference:shell−1.4.0−−model−idvolume:/data zzr0/text-generation-inference:shell-1.4.0 --model-id volume:/datazzr0/textgenerationinference:shell1.4.0modelidmodel

A note on Shared Memory (shm)

NCCL is a communication framework used byPyTorch to do distributed training/inference. text-generation-inference make use of NCCL to enable Tensor Parallelism to dramatically speed up inference for large language models.

In order to share data between the different devices of a NCCL group, NCCL might fall back to using the host memory if peer-to-peer using NVLink or PCI is not possible.

To allow the container to use 1G of Shared Memory and support SHM sharing, we add --shm-size 1g on the above command.

If you are running text-generation-inference inside Kubernetes. You can also add Shared Memory to the container by creating a volume with:

and mounting it to /dev/shm.

Finally, you can also disable SHM sharing by using the NCCL_SHM_DISABLE=1 environment variable. However, note that this will impact performance.

Distributed Tracing

text-generation-inference is instrumented with distributed tracing using OpenTelemetry. You can use this feature by setting the address to an OTLP collector with the --otlp-endpoint argument.

Architecture

TGI architecture

Local install

You can also opt to install text-generation-inference locally.

First install Rust and create a Python virtual environment with at least Python 3.9, e.g. using conda:

curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh

conda create -n text-generation-inference python=3.9 conda activate text-generation-inference

You may also need to install Protoc.

On Linux:

PROTOC_ZIP=protoc-21.12-linux-x86_64.zip curl -OL https://github.com/protocolbuffers/protobuf/releases/download/v21.12/$PROTOC_ZIP sudo unzip -o $PROTOC_ZIP -d /usr/local bin/protoc sudo unzip -o $PROTOC_ZIP -d /usr/local 'include/*' rm -f $PROTOC_ZIP

On MacOS, using Homebrew:

Then run:

BUILD_EXTENSIONS=True make install # Install repository and HF/transformer fork with CUDA kernels make run-codeshell

Note: on some machines, you may also need the OpenSSL libraries and gcc. On Linux machines, run:

sudo apt-get install libssl-dev gcc -y

CUDA Kernels

The custom CUDA kernels are only tested on NVIDIA A100, AMD MI210 and AMD MI250. If you have any installation or runtime issues, you can remove the kernels by using the DISABLE_CUSTOM_KERNELS=True environment variable.

Be aware that the official Docker image has them enabled by default.

Run CodeShell

Run

Quantization

You can also quantize the weights with bitsandbytes to reduce the VRAM requirement:

4bit quantization is available using the NF4 and FP4 data types from bitsandbytes. It can be enabled by providing --quantize bitsandbytes-nf4 or --quantize bitsandbytes-fp4 as a command line argument to text-generation-launcher.

Develop

make server-dev make router-dev

Testing

python

make python-server-tests make python-client-tests

or both server and client tests

make python-tests

rust cargo tests

make rust-tests

integration tests

make integration-tests