Deploying Llama3.2 Multimodal Models — AWS Neuron Documentation

Tutorial: Deploying Llama3.2 Multimodal Models#

NeuronX Distributed Inference (NxDI) enables you to deploy Llama-3.2-11B-Vision-Instruct and Llama-3.2-90B-Vision-Instruct models on AWS Trainium and Inferentia instances.

You can run Llama3.2 Multimodal with default configuration options. NxD Inference also provides several features and configuration options that you can use to optimize and tune the performance for inference. This guide walks through how to run Llama3.2 Multimodal with vLLM, and how to enable optimizations for inference performance on Trn1/Inf2 instances. It takes about 20-60 minutes to complete.


Step 1: Set up Development Environment#

  1. Launch a trn1.32xlarge or inf2.48xlarge instance on Ubuntu 22 with the Neuron Multi-Framework DLAMI. Please refer to the setup guide if you don’t have one yet. If you want to install the NxD Inference library without using the pre-existing DLAMI, please refer to the NxD Inference Setup Guide.
  2. Use default virtual environment pre-installed with the Neuron SDK.

source /opt/aws_neuronx_venv_pytorch_2_6_nxd_inference/bin/activate

  3. Install the fork of vLLM (releases/v2.23.0-v0) that supports NxD Inference by following Setup in the vLLM User Guide for NxD Inference.
  4. You should now have the Neuron SDK and other necessary packages installed, including neuronx-distributed-inference, neuronx-cc, torch, torchvision, and vllm-neuronx (a quick check is sketched below).
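
As an optional sanity check (a quick sketch, not part of the official setup steps), you can list the relevant packages in the activated environment:

pip list | grep -E 'neuronx|torch|vllm'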

Step 2: Download and Convert Checkpoints#

Download the Llama3.2 Multimodal models from either Meta’s official website or Hugging Face (HF) (11B, 90B).
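
For example, the HF checkpoint can be fetched with the Hugging Face CLI (a sketch; it assumes huggingface_hub is installed, you have accepted the model license, and you are logged in via huggingface-cli login; the target directory is illustrative):

huggingface-cli download meta-llama/Llama-3.2-90B-Vision-Instruct \
    --local-dir /home/ubuntu/model_hf/Llama-3.2-90B-Vision-Instruct-hf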

NxDI supports the HF checkpoint out of the box. To use the Meta checkpoint, you need to run the following script to convert the downloaded Meta checkpoint into the NxDI-supported format.

python -m neuronx_distributed_inference.models.mllama.convert_mllama_weights_to_neuron \
    --input-dir <path_to_meta_checkpoint> \
    --output-dir <path_to_neuron_checkpoint> \
    --instruct \
    --num-shards 8  # 1 for 11B and 8 for 90B

After the script finishes running, you should have the following configuration and checkpoint files. Verify by running ls <path_to_neuron_checkpoint>:

chat_template.json model.safetensors tokenizer_config.json config.json special_tokens_map.json generation_config.json tokenizer.json

Note

The following code examples use the HF checkpoint, which is supported by default, so no conversion script needs to be run.

Step 3: Deploy with vLLM Inference#

We provide two examples to deploy Llama3.2 Multimodal with vLLM:

  1. Offline inference: you provide prompts in a Python script and execute it.
  2. Online inference: you serve the model behind an online server and send requests to it.

If you already have a compiled model artifact in MODEL_PATH with the same specified configuration, or if you have set the environment variable NEURON_COMPILED_ARTIFACTS, the vLLM engine will load the compiled model and run inference directly. Otherwise, it will automatically compile and save a new model artifact. See the vLLM User Guide for NxD Inference for more details. We provide example configurations here; continue reading to learn how to tune them for your use case.
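
For example, to reuse artifacts compiled in an earlier run, you can point the engine at them before launching vLLM (the path below is illustrative):

# Optional: reuse previously compiled Neuron artifacts instead of recompiling
export NEURON_COMPILED_ARTIFACTS=/home/ubuntu/neuron_compiled_artifacts/llama-3.2-90b-vision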

Configurations#

You should specifically tune these configurations when optimizing performance for Llama3.2 Multimodal models. Please refer to the NxD Inference Features Configuration Guide for a detailed explanation of each configuration.

Note

Longer sequences take up more memory, so use fewer buckets. For example, to compile the 90B model on trn1.32xlarge with SEQ_LEN=16384 and BATCH_SIZE=4, you can use the buckets [1024, 2048, 16384] to cover the longest possible sequence as well as the shorter sequences where the majority of traffic comes from. Also set the environment variable NEURON_SCRATCHPAD_PAGE_SIZE by running export NEURON_SCRATCHPAD_PAGE_SIZE=1024 to increase the scratchpad size in the direct memory access engine so it can fit the large tensors.
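
As a sketch of the note above, for the 90B model with SEQ_LEN=16384 and BATCH_SIZE=4 on trn1.32xlarge, the settings might look like this (values taken from the note; the buckets are applied via --override-neuron-config or the constants in the offline script below):

# Increase the DMA scratchpad page size to fit the larger tensors
export NEURON_SCRATCHPAD_PAGE_SIZE=1024

# Buckets covering the longest sequence plus the common shorter ones, e.g.:
#   "context_encoding_buckets": [1024, 2048, 16384],
#   "token_generation_buckets": [1024, 2048, 16384]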

Model Inputs#

Offline Example#

import torch
import requests
from PIL import Image

from vllm import LLM, SamplingParams
from vllm import TextPrompt

from neuronx_distributed_inference.models.mllama.utils import add_instruct


def get_image(image_url):
    image = Image.open(requests.get(image_url, stream=True).raw)
    return image


# Configurations
MODEL_PATH = "/home/ubuntu/model_hf/Llama-3.2-90B-Vision-Instruct-hf"
BATCH_SIZE = 4
SEQ_LEN = 2048
TENSOR_PARALLEL_SIZE = 32
CONTEXT_ENCODING_BUCKETS = [1024, 2048]
TOKEN_GENERATION_BUCKETS = [1024, 2048]
SEQUENCE_PARALLEL_ENABLED = False
IS_CONTINUOUS_BATCHING = True
ON_DEVICE_SAMPLING_CONFIG = {"global_topk": 64, "dynamic": True, "deterministic": False}

# Model Inputs
PROMPTS = [
    "What is in this image? Tell me a story",
    "What is the recipe of mayonnaise in two sentences?",
    "Describe this image",
    "What is the capital of Italy famous for?",
]
IMAGES = [
    get_image("https://github.com/meta-llama/llama-models/blob/main/models/scripts/resources/dog.jpg?raw=true"),
    torch.empty((0, 0)),
    get_image("https://awsdocs-neuron.readthedocs-hosted.com/en/latest/_images/nxd-inference-block-diagram.jpg"),
    torch.empty((0, 0)),
]
SAMPLING_PARAMS = [
    dict(top_k=1, temperature=1.0, top_p=1.0, max_tokens=256),
    dict(top_k=1, temperature=0.9, top_p=1.0, max_tokens=256),
    dict(top_k=10, temperature=0.9, top_p=0.5, max_tokens=512),
    dict(top_k=10, temperature=0.75, top_p=0.5, max_tokens=1024),
]


def get_VLLM_mllama_model_inputs(prompt, single_image, sampling_params):
    # Prepare all inputs for mllama generation, including:
    # 1. put text prompt into instruct chat template
    # 2. compose single text and single image prompt into vLLM's prompt class
    # 3. prepare sampling parameters
    input_image = single_image
    has_image = torch.tensor([1])
    if isinstance(single_image, torch.Tensor) and single_image.numel() == 0:
        has_image = torch.tensor([0])

    instruct_prompt = add_instruct(prompt, has_image)
    inputs = TextPrompt(prompt=instruct_prompt)
    inputs["multi_modal_data"] = {"image": input_image}
    # Create a sampling params object.
    sampling_params = SamplingParams(**sampling_params)
    return inputs, sampling_params


def print_outputs(outputs):
    # Print the outputs.
    for output in outputs:
        prompt = output.prompt
        generated_text = output.outputs[0].text
        print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")


if __name__ == '__main__':
    assert len(PROMPTS) == len(IMAGES) == len(SAMPLING_PARAMS), \
        f"""Text, image prompts and sampling parameters should have the same batch size,
        got {len(PROMPTS)}, {len(IMAGES)}, and {len(SAMPLING_PARAMS)}"""

    # Create an LLM.
    llm = LLM(
        model=MODEL_PATH,
        max_num_seqs=BATCH_SIZE,
        max_model_len=SEQ_LEN,
        block_size=SEQ_LEN,
        device="neuron",
        tensor_parallel_size=TENSOR_PARALLEL_SIZE,
        override_neuron_config={
            "context_encoding_buckets": CONTEXT_ENCODING_BUCKETS,
            "token_generation_buckets": TOKEN_GENERATION_BUCKETS,
            "sequence_parallel_enabled": SEQUENCE_PARALLEL_ENABLED,
            "is_continuous_batching": IS_CONTINUOUS_BATCHING,
            "on_device_sampling_config": ON_DEVICE_SAMPLING_CONFIG,
        }
    )

    batched_inputs = []
    batched_sample_params = []
    for pmpt, img, params in zip(PROMPTS, IMAGES, SAMPLING_PARAMS):
        inputs, sampling_params = get_VLLM_mllama_model_inputs(pmpt, img, params)
        # test batch-size = 1
        outputs = llm.generate(inputs, sampling_params)
        print_outputs(outputs)
        batched_inputs.append(inputs)
        batched_sample_params.append(sampling_params)

    # test batch-size = 4
    outputs = llm.generate(batched_inputs, batched_sample_params)
    print_outputs(outputs)
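
To run the offline example, save the script to a file (the filename below is illustrative) and execute it inside the activated virtual environment:

python offline_mllama_inference.py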

This script prints the outputs. Below is an example output from an image-text prompt:

Prompt: '<|begin_of_text|><|start_header_id|>user<|end_header_id|>\n\n<|image|>What is in this image? Tell me a story<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n', Generated text: 'The image shows a dog riding a skateboard. The dog is standing on the skateboard, which is in the middle of the road. The dog is looking at the camera with its mouth open, as if it is smiling. The dog has floppy ears and a long tail. It is wearing a collar around its neck. The skateboard is black with red wheels. The background is blurry, but it appears to be a city street with buildings and cars in the distance.'

Online Example#

First, open a terminal and spin up a server for the model. If you specify a new set of configurations, a new Neuron model artifact will be compiled at this point.

MODEL_PATH="/home/ubuntu/model_hf/Llama-3.2-90B-Vision-Instruct-hf" python3 -m vllm.entrypoints.openai.api_server
--model $MODEL_PATH
--tensor-parallel-size 32
--max-model-len 2048
--max-num-seqs 4
--device neuron
--override-neuron-config '{ "context_encoding_buckets": [1024, 2048], "token_generation_buckets": [1024, 2048], "sequence_parallel_enabled": false, "is_continuous_batching": true, "on_device_sampling_config": { "global_topk": 64, "dynamic": true, "deterministic": false } }'

If you see logs like the following, your server is up and running:

INFO:     Started server process [284309]
INFO:     Waiting for application startup.
INFO:     Application startup complete.
INFO:     Uvicorn running on http://0.0.0.0:8000 (Press CTRL+C to quit)
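
Optionally, from another terminal you can confirm the server is reachable by listing the served models through the OpenAI-compatible API:

curl http://localhost:8000/v1/models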

Then open a new terminal as the client, from which you can send requests to the server. We’ve enabled continuous batching by default, so you can open up to --max-num-seqs client terminals to send requests. To send a text-only request:

MODEL_PATH="/home/ubuntu/model_hf/Llama-3.2-90B-Vision-Instruct-hf" curl http://localhost:8000/v1/chat/completions
-H "Content-Type: application/json"
-d '{ "model": "'"$MODEL_PATH"'", "messages": [ { "role": "user", "content": "What is the capital of Italy?" } ] }'

You should receive outputs shown in the client terminal shortly:

{"id":"chat-2df3e876738b470ab27b090e0a09736e","object":"chat.completion", "created":1734401826,"model":"/home/ubuntu/model_hf/Llama-3.2-90B-Vision-Instruct-hf/", "choices":[{"index":0,"message":{"role":"assistant","content":"The capital of Italy is Rome.","tool_calls":[]},"logprobs":null,"finish_reason":"stop","stop_reason":null}], "usage":{"prompt_tokens":42,"total_tokens":50,"completion_tokens":8},"prompt_logprobs":null}

If the request fails, increase the value of the VLLM_RPC_TIMEOUT environment variable using export VLLM_RPC_TIMEOUT=180000, then restart the server. The appropriate timeout value depends on the model and deployment configuration used.

To send a request with both text and image prompts:

curl http://localhost:8000/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
        "model": "'"$MODEL_PATH"'",
        "messages": [
            {
                "role": "user",
                "content": [
                    {
                        "type": "text",
                        "text": "Describe this image"
                    },
                    {
                        "type": "image_url",
                        "image_url": {
                            "url": "https://awsdocs-neuron.readthedocs-hosted.com/en/latest/_images/nxd-inference-block-diagram.jpg"
                        }
                    }
                ]
            }
        ]
    }'

You can expect the results to appear in the client terminal shortly:

{"id":"chat-fd1319865bd44d6aa60a4739cce61c9d","object":"chat.completion", "created":1734401984,"model":"/home/ubuntu/model_hf/Llama-3.2-90B-Vision-Instruct-hf/", "choices":[{"index":0,"message":{"role":"assistant","content":"The image presents a diagram illustrating the components of NxD Inference, with a focus on inference modules and additional modules. The diagram is divided into two main sections: "Inference Modules" and "Additional Modules." \n\nInference Modules:\n\n Attention Techniques\n KV Caching\n Continuous Batching\n\n*Additional Modules:*\n\n Speculative Decoding (Draft model and Draft heads (Medusa / Eagle))\n\nThe diagram also includes a section titled "NxD Core (Distributed Strategies, Distributed Model Tracing)" and a logo for PyTorch at the bottom.","tool_calls":[]},"logprobs":null, "finish_reason":"stop","stop_reason":null}],"usage":{"prompt_tokens":14,"total_tokens":137, "completion_tokens":123},"prompt_logprobs":null}