

DeepSpeed Model Implementations for Inference (MII)

Introducing MII, an open-source Python library designed by DeepSpeed to democratize powerful model inference with a focus on high-throughput, low latency, and cost-effectiveness.

Key Technologies

MII for High-Throughput Text Generation

MII provides accelerated text-generation inference through the use of four key technologies:

- Blocked KV Caching
- Continuous Batching
- Dynamic SplitFuse
- High Performance CUDA Kernels

For a deeper dive into these features, please refer to our blog, which also includes a detailed performance analysis.

MII Legacy

In the past, MII introduced several key performance optimizations for low-latency serving scenarios.

How does MII work?

Figure 1: MII architecture, showing how MII automatically optimizes OSS models using DS-Inference before deploying them. DeepSpeed-FastGen optimizations in the figure have been published in our blog post.

Under the hood, MII is powered by DeepSpeed-Inference. Based on the model architecture, model size, batch size, and available hardware resources, MII automatically applies the appropriate set of system optimizations to minimize latency and maximize throughput.

Supported Models

MII currently supports over 37,000 models across the popular model architectures listed below. We plan to add additional models in the near term; if there is a specific model architecture you would like supported, please file an issue and let us know. All current models leverage Hugging Face in our backend to provide both the model weights and the model's corresponding tokenizer. For our current release, we support the following model architectures:

| model family | size range | ~model count |
|---------------|-------------|---------------|
| Falcon | 7B - 180B | 600 |
| Llama | 7B - 65B | 57,000 |
| Llama-2 | 7B - 70B | 1,200 |
| Llama-3 | 8B - 405B | 1,600 |
| Mistral | 7B | 23,000 |
| Mixtral (MoE) | 8x7B | 2,900 |
| OPT | 0.1B - 66B | 2,200 |
| Phi-2 | 2.7B | 1,500 |
| Qwen | 7B - 72B | 500 |
| Qwen2 | 0.5B - 72B | 3,700 |

MII Legacy Model Support

MII Legacy APIs support over 50,000 different models, including BERT, RoBERTa, Stable Diffusion, and text-generation models such as Bloom and GPT-J. For a full list, please see our legacy supported models table.

Getting Started with MII

DeepSpeed-MII allows users to create non-persistent and persistent deployments for supported models in just a few lines of code.

Installation

The fastest way to get started is with our PyPI release of DeepSpeed-MII, which means you can get started within minutes via:

pip install deepspeed-mii

For ease of use, and to avoid the lengthy compile times that many projects in this space require, we distribute a pre-compiled Python wheel covering the majority of our custom kernels through a new library called DeepSpeed-Kernels. We have found this library to be very portable across environments with NVIDIA GPUs of compute capability 8.0+ (Ampere+), CUDA 11.6+, and Ubuntu 20+. In most cases you shouldn't even need to know this library exists, as it is a dependency of DeepSpeed-MII and will be installed with it. However, if for whatever reason you need to compile our kernels manually, please see our advanced installation docs.
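As a quick sanity check after installation, you can import the package directly; this minimal sketch assumes the library exposes a mii.__version__ attribute (present in recent releases):

```python
# Minimal post-install sanity check.
# Assumes mii.__version__ is exposed; if not, a successful import alone confirms the install.
import mii

print(mii.__version__)
```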

Non-Persistent Pipeline

A non-persistent pipeline is a great way to try DeepSpeed-MII; it exists only for the duration of the Python script you are running. The full example for running a non-persistent pipeline deployment is only 4 lines. Give it a try!

```python
import mii

pipe = mii.pipeline("mistralai/Mistral-7B-v0.1")
response = pipe(["DeepSpeed is", "Seattle is"], max_new_tokens=128)
print(response)
```

The returned response is a list of Response objects. We can access several details about the generation, e.g., response[0].prompt_length.

If you want to free device memory and destroy the pipeline, use the destroy method, as shown in the sketch below.
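The following sketch ties the pieces above together: it inspects the returned Response objects and then frees device memory. It assumes a generated_text field holding the output text (prompt_length appears above; other field names may differ from MII's actual Response object):

```python
import mii

pipe = mii.pipeline("mistralai/Mistral-7B-v0.1")
responses = pipe(["DeepSpeed is", "Seattle is"], max_new_tokens=64)

for r in responses:
    # prompt_length is referenced above; generated_text is assumed to hold the output text.
    print(f"prompt tokens: {r.prompt_length}")
    print(f"generated: {r.generated_text}")

# Free device memory and tear down the pipeline.
pipe.destroy()
```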

Tensor Parallelism

Taking advantage of multi-GPU systems for greater performance is easy with MII. When run with the deepspeed launcher, tensor parallelism is automatically controlled by the --num_gpus flag:

Run on a single GPU

deepspeed --num_gpus 1 mii-example.py

Run on multiple GPUs

deepspeed --num_gpus 2 mii-example.py
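For reference, mii-example.py here can simply be the non-persistent pipeline script shown above; the deepspeed launcher spawns one process per GPU and MII handles the tensor-parallel sharding. A minimal example file (the file name and prompts are illustrative) might look like:

```python
# mii-example.py -- launch with: deepspeed --num_gpus 2 mii-example.py
import mii

pipe = mii.pipeline("mistralai/Mistral-7B-v0.1")
response = pipe(["DeepSpeed is", "Seattle is"], max_new_tokens=128)
print(response)
```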

Pipeline Options

While only the model name or path is required to stand up a non-persistent pipeline deployment, mii.pipeline() accepts additional keyword arguments for customizing the deployment.

Users can also control the generation characteristics for individual prompts (i.e., when calling pipe()) by passing per-prompt generation options such as max_new_tokens; a brief sketch is shown below.
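A minimal sketch of per-prompt generation options follows. Only max_new_tokens appears elsewhere in this README; do_sample, top_p, and temperature are assumptions based on common text-generation APIs and may differ from MII's actual option names:

```python
import mii

pipe = mii.pipeline("mistralai/Mistral-7B-v0.1")

response = pipe(
    ["DeepSpeed is", "Seattle is"],
    max_new_tokens=64,    # documented above
    do_sample=True,       # assumed: sample instead of greedy decoding
    top_p=0.9,            # assumed: nucleus sampling threshold
    temperature=0.8,      # assumed: sampling temperature
)
print(response)
```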

Persistent Deployment

A persistent deployment is ideal for use with long-running and production applications. The persistent deployment uses a lightweight gRPC server that can be queried by multiple clients at once. The full example for running a persistent model is only 4 lines. Give it a try!

```python
import mii

client = mii.serve("mistralai/Mistral-7B-v0.1")
response = client.generate(["Deepspeed is", "Seattle is"], max_new_tokens=128)
print(response)
```

The returned response is a list of Response objects, just as with the non-persistent pipeline. We can access several details about the generation (e.g., response[0].prompt_length).

If we want to generate text from other processes, we can do that too:

```python
client = mii.client("mistralai/Mistral-7B-v0.1")
response = client.generate("Deepspeed is", max_new_tokens=128)
```

When we no longer need a persistent deployment, we can shut down the server from any client:

client.terminate_server()

Model Parallelism

Taking advantage of multi-GPU systems for better latency and throughput is also easy with persistent deployments. Model parallelism is controlled by the tensor_parallel input to mii.serve:

client = mii.serve("mistralai/Mistral-7B-v0.1", tensor_parallel=2)

The resulting deployment will split the model across 2 GPUs to deliver faster inference and higher throughput than a single GPU.

Model Replicas

We can also take advantage of multi-GPU (and multi-node) systems by setting up multiple model replicas and relying on the load balancing that DeepSpeed-MII provides:

client = mii.serve("mistralai/Mistral-7B-v0.1", replica_num=2)

The resulting deployment will load 2 model replicas (one per GPU) and load-balance incoming requests between the 2 model instances.

Model parallelism and replicas can also be combined to take advantage of systems with many more GPUs. In the example below, we run 2 model replicas, each split across 2 GPUs on a system with 4 GPUs:

client = mii.serve("mistralai/Mistral-7B-v0.1", tensor_parallel=2, replica_num=2)

The choice between model parallelism and model replicas for maximum performance will depend on the nature of the hardware, model, and workload. For example, with small models users may find that model replicas provide the lowest average latency for requests. Meanwhile, large models may achieve greater overall throughput when using only model parallelism.

RESTful API

MII makes it easy to set up and run model inference via RESTful APIs by setting enable_restful_api=True when creating a persistent MII deployment. The RESTful API can receive requests at http://{HOST}:{RESTFUL_API_PORT}/mii/{DEPLOYMENT_NAME}. A full example is provided below:

```python
client = mii.serve(
    "mistralai/Mistral-7B-v0.1",
    deployment_name="mistral-deployment",
    enable_restful_api=True,
    restful_api_port=28080,
)
```


📌 Note: While providing a deployment_name is not necessary (MII will autogenerate one for you), it is good practice to set one so that you can ensure you are interfacing with the correct RESTful API.


You can then send prompts to the RESTful gateway with any HTTP client, such as curl:

curl --header "Content-Type: application/json" --request POST -d '{"prompts": ["DeepSpeed is", "Seattle is"], "max_length": 128}' http://localhost:28080/mii/mistral-deployment

or python:

```python
import json

import requests

url = "http://localhost:28080/mii/mistral-deployment"
params = {"prompts": ["DeepSpeed is", "Seattle is"], "max_length": 128}
json_params = json.dumps(params)
output = requests.post(
    url, data=json_params, headers={"Content-Type": "application/json"}
)
```
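Continuing from the snippet above (reusing the output variable), the gateway's reply can then be inspected. This sketch assumes the response body is JSON and falls back to the raw text otherwise:

```python
# Assumes the RESTful gateway returns a JSON body; fall back to raw text otherwise.
if output.headers.get("Content-Type", "").startswith("application/json"):
    print(output.json())
else:
    print(output.text)
```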

Persistent Deployment Options

While only the model name or path is required to stand up a persistent deployment, both mii.serve() and mii.client() accept additional keyword arguments for customization.

Users can also control the generation characteristics for individual prompts (i.e., when calling client.generate()) with the same per-prompt generation options (e.g., max_new_tokens); a brief sketch is shown below.
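As a brief sketch, per-prompt options are passed directly to client.generate(). The snippet below reuses only calls shown earlier in this README; generated_text is an assumed field name, as noted in the pipeline sketch above:

```python
import mii

# Connect to the persistent deployment started earlier with mii.serve.
client = mii.client("mistralai/Mistral-7B-v0.1")

response = client.generate(["DeepSpeed is", "Seattle is"], max_new_tokens=64)
for r in response:
    print(r.generated_text)  # assumed field name

# Shut down the server when finished.
client.terminate_server()
```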

Contributing

This project welcomes contributions and suggestions.

DeepSpeed-MII has adopted the DCO. All deepspeedai repos require a DCO. (DeepSpeed previously used a CLA, which is being replaced with the DCO.)

DCO is provided by including a Signed-off-by line in commit messages. Using the -s flag for git commit will automatically append this line. For example, running git commit -s -m 'commit info.' will produce a commit that has the message commit info. Signed-off-by: My Name <my_email@my_company.com>. The DCO bot will ensure commits are signed with an email address that matches the commit author before they are eligible to be merged.

This project has adopted the Microsoft Open Source Code of Conduct. For more information see the Code of Conduct FAQ or contact opencode@microsoft.com with any additional questions or comments.

Trademarks

This project may contain trademarks or logos for projects, products, or services. Authorized use of Microsoft trademarks or logos is subject to and must follow Microsoft's Trademark & Brand Guidelines. Use of Microsoft trademarks or logos in modified versions of this project must not cause confusion or imply Microsoft sponsorship. Any use of third-party trademarks or logos is subject to those third parties' policies.