Easy, Fast, and Cheap LLM Inference — SkyPilot documentation
Source: llm/vllm
This README contains instructions to run a demo for vLLM, an open-source library for fast LLM inference and serving that improves throughput by up to 24x compared to HuggingFace Transformers.
Prerequisites#
Install the latest SkyPilot and check your cloud credential setup:
pip install git+https://github.com/skypilot-org/skypilot.git
sky check
See the vLLM SkyPilot YAMLs.
Serving Llama-2 with vLLM’s OpenAI-compatible API server#
Before you get started, you need access to the Llama-2 model weights on Hugging Face. Please check the prerequisites section in the Llama-2 example for more details.
- Start serving the Llama-2 model:
sky launch -c vllm-llama2 serve-openai-api.yaml --env HF_TOKEN=YOUR_HUGGING_FACE_API_TOKEN
Optional: Currently, only GCP offers the specified L4 GPUs. To use other clouds, pass the --gpus flag to request different GPUs. For example, to use H100 GPUs:
sky launch -c vllm-llama2 serve-openai-api.yaml --gpus H100:1 --env HF_TOKEN=YOUR_HUGGING_FACE_API_TOKEN
Tip: You can also use the vLLM docker container for faster setup. Refer to serve-openai-api-docker.yaml for more.
- Check the IP for the cluster with:
IP=$(sky status --ip vllm-llama2)
- You can now use the OpenAI API to interact with the model.
- Query the models hosted on the cluster:
curl http://$IP:8000/v1/models
- Query a model with input prompts for text completion:
curl http://$IP:8000/v1/completions \
    -H "Content-Type: application/json" \
    -d '{
        "model": "meta-llama/Llama-2-7b-chat-hf",
        "prompt": "San Francisco is a",
        "max_tokens": 7,
        "temperature": 0
    }'
You should get a response similar to the following:

{
    "id": "cmpl-50a231f7f06a4115a1e4bd38c589cd8f",
    "object": "text_completion",
    "created": 1692427390,
    "model": "meta-llama/Llama-2-7b-chat-hf",
    "choices": [{
        "index": 0,
        "text": "city in Northern California that is known",
        "logprobs": null,
        "finish_reason": "length"
    }],
    "usage": {"prompt_tokens": 5, "total_tokens": 12, "completion_tokens": 7}
}
- Query a model with input prompts for chat completion:
curl http://$IP:8000/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
        "model": "meta-llama/Llama-2-7b-chat-hf",
        "messages": [
            {
                "role": "system",
                "content": "You are a helpful assistant."
            },
            {
                "role": "user",
                "content": "Who are you?"
            }
        ]
    }'
You should get a response similar to the following:

{
    "id": "cmpl-879a58992d704caf80771b4651ff8cb6",
    "object": "chat.completion",
    "created": 1692650569,
    "model": "meta-llama/Llama-2-7b-chat-hf",
    "choices": [{
        "index": 0,
        "message": {
            "role": "assistant",
            "content": " Hello! I'm just an AI assistant, here to help you"
        },
        "finish_reason": "length"
    }],
    "usage": {"prompt_tokens": 31, "total_tokens": 47, "completion_tokens": 16}
}
Serving Llama-2 with vLLM for more traffic using SkyServe#
To scale up the model serving for more traffic, we introduced SkyServe, which lets you easily deploy multiple replicas of the model:
- Add a service section to the above serve-openai-api.yaml file to make it a SkyServe service YAML:

# The newly-added service section to the serve-openai-api.yaml file.
service:
  # Specifying the path to the endpoint to check the readiness of the service.
  readiness_probe: /v1/models
  # How many replicas to manage.
  replicas: 2
The entire Service YAML can be found here: service.yaml.
- Start serving by using SkyServe CLI:
sky serve up -n vllm-llama2 service.yaml
- Use sky serve status to check the status of the service:
sky serve status vllm-llama2
You should get an output similar to the following:

Services
NAME         UPTIME  STATUS  REPLICAS  ENDPOINT
vllm-llama2  7m 43s  READY   2/2       3.84.15.251:30001

Service Replicas
SERVICE_NAME  ID  IP            LAUNCHED     RESOURCES          STATUS  REGION
vllm-llama2   1   34.66.255.4   11 mins ago  1x GCP({'L4': 1})  READY   us-central1
vllm-llama2   2   35.221.37.64  15 mins ago  1x GCP({'L4': 1})  READY   us-east4
- Check the endpoint of the service:
ENDPOINT=$(sky serve status --endpoint vllm-llama2)
- Once its status is READY, you can use the endpoint to interact with the model:
curl $ENDPOINT/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
        "model": "meta-llama/Llama-2-7b-chat-hf",
        "messages": [
            {
                "role": "system",
                "content": "You are a helpful assistant."
            },
            {
                "role": "user",
                "content": "Who are you?"
            }
        ]
    }'
Notice that this is the same curl command as before, just pointed at the service endpoint. You should get a response similar to the following:

{
    "id": "cmpl-879a58992d704caf80771b4651ff8cb6",
    "object": "chat.completion",
    "created": 1692650569,
    "model": "meta-llama/Llama-2-7b-chat-hf",
    "choices": [{
        "index": 0,
        "message": {
            "role": "assistant",
            "content": " Hello! I'm just an AI assistant, here to help you"
        },
        "finish_reason": "length"
    }],
    "usage": {"prompt_tokens": 31, "total_tokens": 47, "completion_tokens": 16}
}
Serving Mistral AI’s Mixtral 8x7b model with vLLM#
Please refer to the Mixtral 8x7b example for more details.
Included files#
serve-openai-api-docker.yaml
envs:
  MODEL_NAME: meta-llama/Llama-2-7b-chat-hf
  HF_TOKEN: # TODO: Fill with your own huggingface token, or use --env to pass.

resources:
  image_id: docker:vllm/vllm-openai:latest
  accelerators: {L4:1, A10G:1, A10:1, A100:1, A100-80GB:1}
  ports:
    - 8000

setup: |
  conda deactivate
  python3 -c "import huggingface_hub; huggingface_hub.login('${HF_TOKEN}')"

run: |
  conda deactivate
  echo 'Starting vllm openai api server...'
  python -m vllm.entrypoints.openai.api_server \
    --model $MODEL_NAME --tokenizer hf-internal-testing/llama-tokenizer \
    --host 0.0.0.0
serve-openai-api.yaml
envs:
  MODEL_NAME: meta-llama/Llama-2-7b-chat-hf
  HF_TOKEN: # TODO: Fill with your own huggingface token, or use --env to pass.

resources:
  accelerators: {L4:1, A10G:1, A10:1, A100:1, A100-80GB:1}
  ports:
    - 8000

setup: |
  conda activate vllm
  if [ $? -ne 0 ]; then
    conda create -n vllm python=3.10 -y
    conda activate vllm
  fi
  pip install transformers==4.38.0
  pip install vllm==0.3.2
  python -c "import huggingface_hub; huggingface_hub.login('${HF_TOKEN}')"

run: |
  conda activate vllm
  echo 'Starting vllm openai api server...'
  python -m vllm.entrypoints.openai.api_server \
    --model $MODEL_NAME --tokenizer hf-internal-testing/llama-tokenizer \
    --host 0.0.0.0
serve.yaml
envs:
  MODEL_NAME: decapoda-research/llama-65b-hf

resources:
  accelerators: A100-80GB:8

setup: |
  conda activate vllm
  if [ $? -ne 0 ]; then
    conda create -n vllm python=3.10 -y
    conda activate vllm
  fi
  # Install fschat and accelerate for chat completion
  git clone https://github.com/vllm-project/vllm.git || true
  pip install transformers==4.38.0
  pip install vllm==0.3.2
  pip install gradio

run: |
  conda activate vllm
  echo 'Starting vllm api server...'
  python -u -m vllm.entrypoints.api_server \
    --model $MODEL_NAME \
    --tensor-parallel-size $SKYPILOT_NUM_GPUS_PER_NODE \
    --tokenizer hf-internal-testing/llama-tokenizer 2>&1 | tee api_server.log &
  echo 'Waiting for vllm api server to start...'
  while ! cat api_server.log | grep -q 'Uvicorn running on'; do sleep 1; done
  echo 'Starting gradio server...'
  python vllm/examples/gradio_webserver.py
service-with-auth.yaml
# service.yaml
# The newly-added service section to the serve-openai-api.yaml file.
service:
  # Specifying the path to the endpoint to check the readiness of the service.
  readiness_probe:
    path: /v1/models
    # Set authorization headers here if needed.
    headers:
      Authorization: Bearer $AUTH_TOKEN
  # How many replicas to manage.
  replicas: 1

# Fields below are the same with serve-openai-api.yaml.
envs:
  MODEL_NAME: meta-llama/Llama-2-7b-chat-hf
  HF_TOKEN: # TODO: Fill with your own huggingface token, or use --env to pass.
  AUTH_TOKEN: # TODO: Fill with your own auth token (a random string), or use --env to pass.

resources:
  accelerators: {L4:1, A10G:1, A10:1, A100:1, A100-80GB:1}
  ports: 8000

setup: |
  conda activate vllm
  if [ $? -ne 0 ]; then
    conda create -n vllm python=3.10 -y
    conda activate vllm
  fi
  pip install transformers==4.38.0
  pip install vllm==0.3.2
  python -c "import huggingface_hub; huggingface_hub.login('${HF_TOKEN}')"

run: |
  conda activate vllm
  echo 'Starting vllm openai api server...'
  python -m vllm.entrypoints.openai.api_server \
    --model $MODEL_NAME --tokenizer hf-internal-testing/llama-tokenizer \
    --host 0.0.0.0 --port 8000 --api-key $AUTH_TOKEN
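Since this variant starts the server with --api-key $AUTH_TOKEN, clients must present the token as a Bearer token. A minimal sketch with the openai Python client (assumptions: the service is already deployed from this YAML, <ENDPOINT> is its host:port, and <AUTH_TOKEN> matches the value passed via --env):

from openai import OpenAI

# The client sends the key as "Authorization: Bearer <AUTH_TOKEN>",
# which is what the vLLM server started with --api-key checks.
client = OpenAI(base_url="http://<ENDPOINT>/v1", api_key="<AUTH_TOKEN>")

response = client.chat.completions.create(
    model="meta-llama/Llama-2-7b-chat-hf",
    messages=[{"role": "user", "content": "Who are you?"}],
)
print(response.choices[0].message.content)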
service.yaml
# service.yaml
# The newly-added service section to the serve-openai-api.yaml file.
service:
  # Specifying the path to the endpoint to check the readiness of the service.
  readiness_probe: /v1/models
  # How many replicas to manage.
  replicas: 2

# Fields below are the same with serve-openai-api.yaml.
envs:
  MODEL_NAME: meta-llama/Llama-2-7b-chat-hf
  HF_TOKEN: # TODO: Fill with your own huggingface token, or use --env to pass.

resources:
  accelerators: {L4:1, A10G:1, A10:1, A100:1, A100-80GB:1}
  ports:
    - 8000

setup: |
  conda activate vllm
  if [ $? -ne 0 ]; then
    conda create -n vllm python=3.10 -y
    conda activate vllm
  fi
  pip install transformers==4.38.0
  pip install vllm==0.3.2
  python -c "import huggingface_hub; huggingface_hub.login('${HF_TOKEN}')"

run: |
  conda activate vllm
  echo 'Starting vllm openai api server...'
  python -m vllm.entrypoints.openai.api_server \
    --model $MODEL_NAME --tokenizer hf-internal-testing/llama-tokenizer \
    --host 0.0.0.0