Using Speculative Decoding and Quantization to improve Llama-3.1-405B inference performance on Trn2 instances — AWS Neuron Documentation
Tutorial: Using Speculative Decoding and Quantization to improve Llama-3.1-405B inference performance on Trn2 instances#
NeuronX Distributed (NxD) Inference allows you to deploy Llama-3.1-405B on a single Trn2 instance. This tutorial shows you how to optimize inference performance for Llama-3.1-405B on a Trn2 instance with speculative decoding and quantization. We compile and load the model into a vLLM server and measure performance using LLMPerf. The tutorial consists of two parts. In the first part, we collect performance metrics for a base configuration with bf16 model weights. In the second part, we optimize inference performance with fp8 quantized weights and speculative decoding, and compare the results against those from part 1.
Table of contents
- Prerequisites
- Scenario 1: Run Llama-3.1-405b inference with base configuration using bf16 weights
- Scenario 2: Run Llama-3.1-405b inference with fp8 weights and fused speculation (with draft model)
- Conclusion
Prerequisites#
Set up and connect to a Trn2.48xlarge instance#
As a prerequisite, this tutorial requires that you have a Trn2 instance created from a Deep Learning AMI that has the Neuron SDK pre-installed.
To set up a Trn2 instance using Deep Learning AMI with pre-installed Neuron SDK, see NxD Inference Setup Guide.
After setting up an instance, use SSH to connect to the Trn2 instance using the key pair that you chose when you launched the instance.
After you are connected, activate the Python virtual environment that includes the Neuron SDK.
source ~/aws_neuronx_venv_pytorch_2_5_nxd_inference/bin/activate
Run pip list to verify that the Neuron SDK is installed. You should see Neuron packages, including neuronx-distributed-inference and neuronx-cc.
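For a quicker check, you can filter the installed packages for Neuron components (the exact package set and versions depend on your SDK release):
pip list | grep -i neuron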
Install packages#
NxD Inference supports running models with vLLM. This functionality is available in a fork of the vLLM GitHub repository. To run NxD Inference with vLLM, you need to download and install vLLM from this fork. Clone the Neuron vLLM fork:
git clone -b releases/v2.23.0-v0 https://github.com/aws-neuron/upstreaming-to-vllm.git
Make sure to activate the Neuron virtual environment if you are using a new terminal instead of the one from the connection step above.
source ~/aws_neuronx_venv_pytorch_2_5_nxd_inference/bin/activate
Install the Neuron vLLM fork into the virtual environment.
cd upstreaming-to-vllm
pip install -r requirements-neuron.txt
VLLM_TARGET_DEVICE="neuron" pip install -e .
cd ..
In this tutorial, you will use LLMPerf to measure the inference performance of the base Llama-3.1-405B-Instruct configuration and the more optimized configuration. We will use the load test feature of LLMPerf and measure performance with 10,000 input tokens and 1,500 output tokens per request. Install llmperf into the virtual environment.
git clone https://github.com/ray-project/llmperf.git
cd llmperf
pip install -e .
Download models#
To run inference in the first part of the tutorial, you need to download the Llama-3.1-405B-Instruct model checkpoint with bf16 weights from Hugging Face (meta-llama/Llama-3.1-405B-Instruct). For the second part of the tutorial, you will run a more optimized inference configuration. For this part, you need to download an fp8-quantized Llama-3.1-405B model checkpoint (meta-llama/Llama-3.1-405B-Instruct-FP8). With speculative decoding, you also need to specify a draft model; you can download and use the model checkpoint from meta-llama/Llama-3.2-1B-Instruct. For more information, see Downloading models in the Hugging Face documentation.
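One way to download these checkpoints is with the Hugging Face CLI, sketched below. The target directories are the paths used later in this tutorial; adjust them if you store the models elsewhere. The Meta Llama repositories are gated, so accept the license on Hugging Face and authenticate first (for example with huggingface-cli login).
pip install -U "huggingface_hub[cli]"
huggingface-cli login
huggingface-cli download meta-llama/Llama-3.1-405B-Instruct --local-dir /home/ubuntu/models/Llama-3.1-405B-Instruct
huggingface-cli download meta-llama/Llama-3.1-405B-Instruct-FP8 --local-dir /home/ubuntu/models/Llama-3.1-405B-Instruct-FP8
huggingface-cli download meta-llama/Llama-3.2-1B-Instruct --local-dir /home/ubuntu/models/Llama-3.2-1b-instruct
Note that the FP8 checkpoint still needs to be rescaled before use, as described in Scenario 2 below.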
Scenario 1: Run Llama-3.1-405b inference with base configuration using bf16 weights#
Step 1: Compile the model and run generate#
We will first compile and run generation on a sample prompt using a command installed by neuronx-distributed-inference. Save the contents of the script below to a shell script file, for example compile_model.sh, and then run it.
Note that we are using the following features, as described in the tutorial for running the 405B model, Tutorial: Deploying Llama3.1 405B (Trn2):
- Logical NeuronCore Configuration (LNC)
- Tensor parallelism (TP) on Trn2
- Optimized Kernels
The script compiles the model and runs generation on the given input prompt. Refer to the NxD Inference API Reference for more information on these inference_demo flags. Note the path used to save the compiled model; the same path should be used when launching the vLLM server for inference so that the compiled model can be loaded without recompilation.
Note
Known issue: Using kernels with a bucket length of 1024 or less may lead to a numerical error in inference:
RuntimeError: Failed to execute the model status=1003 message=Numerical Error
# Replace this with the path where you downloaded and saved the model files.
MODEL_PATH="/home/ubuntu/models/Llama-3.1-405B-Instruct/"
# This is where the compiled model will be saved. The same path
# should be used when launching the vLLM server for inference.
COMPILED_MODEL_PATH="/home/ubuntu/traced_model/Llama-3.1-405B-Instruct/"

NUM_CORES=128
TP_DEGREE=64
LNC=2

export NEURON_RT_VIRTUAL_CORE_SIZE=$LNC
export NEURON_RT_NUM_CORES=$((NUM_CORES/NEURON_RT_VIRTUAL_CORE_SIZE))
export NEURON_RT_EXEC_TIMEOUT=600

inference_demo \
    --model-type llama \
    --task-type causal-lm \
    run \
    --model-path $MODEL_PATH \
    --compiled-model-path $COMPILED_MODEL_PATH \
    --torch-dtype bfloat16 \
    --start_rank_id 0 \
    --local_ranks_size $TP_DEGREE \
    --tp-degree $TP_DEGREE \
    --batch-size 1 \
    --max-context-length 12288 \
    --seq-len 12800 \
    --on-device-sampling \
    --top-k 1 \
    --fused-qkv \
    --sequence-parallel-enabled \
    --qkv-kernel-enabled \
    --attn-kernel-enabled \
    --mlp-kernel-enabled \
    --cc-pipeline-tiling-factor 1 \
    --pad-token-id 2 \
    --enable-bucketing \
    --context-encoding-buckets 2048 4096 10240 12288 \
    --token-generation-buckets 12800 \
    --prompt "What is annapurna labs?" 2>&1 | tee log
The above script compiles a Neuron model for this base-case configuration and also runs generation on the example prompt specified with the --prompt flag. You can change this prompt to your prompt of choice. The script's output is written to log, a log file in the working directory.
On subsequent runs of this script, you can add a --skip-compile flag to skip the compilation step, since the model was already compiled in the first run. This lets you test the model with different prompts, as in the sketch below.
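For example, the tail of the inference_demo command in compile_model.sh might look like this on a rerun (the prompt shown here is only an illustration):
    --enable-bucketing \
    --context-encoding-buckets 2048 4096 10240 12288 \
    --token-generation-buckets 12800 \
    --skip-compile \
    --prompt "Write a haiku about Trainium." 2>&1 | tee log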
Step 2: Start the vLLM server with the compiled Neuron model#
After compiling the model, you can run the model using vLLM. Save the contents of the script below to another shell script file, for example start_vllm.sh, and then run it.
export NEURON_RT_VIRTUAL_CORE_SIZE=2

MODEL_PATH="/home/ubuntu/models/Llama-3.1-405B-Instruct"
COMPILED_MODEL_PATH="/home/ubuntu/traced_model/Llama-3.1-405B-Instruct"

export VLLM_NEURON_FRAMEWORK="neuronx-distributed-inference"
export NEURON_COMPILED_ARTIFACTS=$COMPILED_MODEL_PATH
VLLM_RPC_TIMEOUT=100000 python -m vllm.entrypoints.openai.api_server \
    --model $MODEL_PATH \
    --max-num-seqs 1 \
    --max-model-len 12800 \
    --tensor-parallel-size 64 \
    --device neuron \
    --use-v2-block-manager \
    --override-neuron-config "{}" \
    --port 8000 &
PID=$!
echo "vLLM server started with PID $PID"
Step 3: Measure performance using LLMPerf#
After the above steps, the vLLM server should be running. Before we can use the llmperf package, we need to make a few changes to its code. Follow the benchmarking with LLMPerf guide to apply the code changes.
We can now measure the performance using llmperf. Below is a sample shell script to run llmperf. More information about the arguments used in the script can be found in the llmperf open source code.
# This should be the same path to which the model was downloaded (also used in the above steps).
MODEL_PATH="/home/ubuntu/models/Llama-3.1-405B-Instruct"
# This is the name of the directory where the test results will be saved.
OUTPUT_PATH=llmperf-results-sonnets

export OPENAI_API_BASE="http://localhost:8000/v1"
export OPENAI_API_KEY="mock_key"

python token_benchmark_ray.py \
    --model $MODEL_PATH \
    --mean-input-tokens 10000 \
    --stddev-input-tokens 0 \
    --mean-output-tokens 1500 \
    --stddev-output-tokens 0 \
    --num-concurrent-requests 1 \
    --timeout 3600 \
    --max-num-completed-requests 50 \
    --additional-sampling-params '{}' \
    --results-dir $OUTPUT_PATH \
    --llm-api "openai"
The output of this Llama-3.1-405B run for the base case is shown below. Note that the numbers can vary slightly between runs but should be of a similar magnitude.
Results for token benchmark for /home/ubuntu/models/llama-3.1-405b queried with the openai api.
inter_token_latency_s
    p25 = 0.03783673520494379 p50 = 0.037929154633788834 p75 = 0.03799374728198055 p90 = 0.03806084386428147 p95 = 0.03818095359194858 p99 = 0.03862880035825585 mean = 0.03790912092492011 min = 0.03711292916794487 max = 0.03867580939426865 stddev = 0.0002364662521116205
ttft_s
    p25 = 2.437347081664484 p50 = 2.441959390998818 p75 = 2.4439403364085592 p90 = 2.444729209714569 p95 = 2.445114637189545 p99 = 79.22927707570342 mean = 5.451600373298861 min = 2.427013176959008 max = 153.00210832804441 stddev = 21.29264628138615
end_to_end_latency_s
    p25 = 70.06310007086722 p50 = 70.09642704750877 p75 = 70.1557097924524 p90 = 70.28295350184199 p95 = 70.56055794338462 p99 = 148.28325726192182 mean = 73.19207735829521 min = 70.00512732309289 max = 222.50397142698057 stddev = 21.54750467688136
request_output_throughput_token_per_s
    p25 = 25.417755028050912 p50 = 25.463487985775544 p75 = 25.522234144656743 p90 = 25.6487981126861 p95 = 25.729858763245502 p99 = 25.90146713883131 mean = 25.13808905954906 min = 8.080754642125802 max = 26.021214285642255 stddev = 2.465472136291901
number_input_tokens
    p25 = 10000.0 p50 = 10000.0 p75 = 10000.0 p90 = 10000.0 p95 = 10000.0 p99 = 10000.0 mean = 10000.0 min = 10000 max = 10000 stddev = 0.0
number_output_tokens
    p25 = 1783.0 p50 = 1785.0 p75 = 1789.75 p90 = 1798.1 p95 = 1803.55 p99 = 1816.67 mean = 1787.92 min = 1779 max = 1825 stddev = 8.54720386310933
Number Of Errored Requests: 0
Overall Output Throughput: 24.421011092151268
Number Of Completed Requests: 50
Completed Requests Per Minute: 0.8195336846889548
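If you prefer to post-process the results programmatically rather than reading the console output, llmperf also writes the metrics as JSON into the results directory. The sketch below assumes a summary JSON file matching *summary*.json exists in the OUTPUT_PATH used above; file naming and key names can vary between llmperf versions.
python3 - <<'EOF'
import glob, json

# Look for the summary file(s) that llmperf wrote into the results directory.
for path in glob.glob("llmperf-results-sonnets/*summary*.json"):
    with open(path) as f:
        summary = json.load(f)
    # Print the P50 values reported in the summary.
    for key, value in summary.items():
        if "p50" in str(key).lower():
            print(f"{key}: {value}")
EOF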
Scenario 2: Run Llama-3.1-405b inference with fp8 weights and fused speculation (with draft model)#
Step 1: Rescale the model weights to the Neuron FP8 format and add a modules-to-not-convert file to the model path#
The Neuron device only supports the FP8_EXP4 (IEEE-754) data type, while the Hugging Face FP8 checkpoint for Llama-405B is in a different FP8 format (OCP FP8 E4M3/e4m3fn) with a different range, so we need to rescale the public model weights. Follow this guide to rescale the FP8 model weights from Hugging Face: link.
Running a quantized model also requires a modules-to-not-convert JSON file that explicitly lists the layers that are not quantized in the model. For this tutorial, you can use the following file.
Download: modules_to_not_convert.json
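After downloading, place the file in the rescaled model directory so the compile script in the next step can reference it. The destination below matches the MTNC_FILE_PATH used later; adjust it if your paths differ.
cp modules_to_not_convert.json /home/ubuntu/models/Llama-3.1-405B-Instruct-FP8-rescaled/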
Next we will compile and run the model and record performance metrics.
Step 2: Compile the model and run generate#
We will first compile and run generation on a sample prompt using a command installed by neuronx-distributed-inference. Save the contents of the script below to a shell script file, for example compile_model.sh, and then run it.
Note that we are using the following features, as described in the tutorial for running the 405B model, Tutorial: Deploying Llama3.1 405B (Trn2):
- Logical NeuronCore Configuration (LNC)
- Tensor parallelism (TP) on Trn2
- Optimized Kernels
The compile script is similar to the one in part 1. Note that we have added the path for the draft model, along with the fused-speculation and quantization flags.
Note
Known issue: Using kernels with a bucket length of 1024 or less may lead to a numerical error in inference:
RuntimeError: Failed to execute the model status=1003 message=Numerical Error
# Replace this with the path where you downloaded and saved the model files.
MODEL_PATH="/home/ubuntu/models/Llama-3.1-405B-Instruct-FP8-rescaled/"
# Replace this with the path where you downloaded and saved the draft model files.
DRAFT_MODEL_PATH="/home/ubuntu/models/Llama-3.2-1b-instruct/"
# This is where the compiled model (.pt file) and sharded checkpoints will be saved. The same path
# should be used when launching the vLLM server for inference.
COMPILED_MODEL_PATH="/home/ubuntu/traced_models/Llama-3.1-405B-Instruct_fp8/"
# Add a modules-to-not-convert JSON file to the model path to specify non-quantized modules.
MTNC_FILE_PATH="/home/ubuntu/models/Llama-3.1-405B-Instruct-FP8-rescaled/modules_to_not_convert.json"

NUM_CORES=128
TP_DEGREE=64
LNC=2

export NEURON_RT_VIRTUAL_CORE_SIZE=$LNC
export NEURON_RT_NUM_CORES=$((NUM_CORES/NEURON_RT_VIRTUAL_CORE_SIZE))
export NEURON_RT_EXEC_TIMEOUT=600
export XLA_HANDLE_SPECIAL_SCALAR=1
export UNSAFE_FP8FNCAST=1

inference_demo \
    --model-type llama \
    --task-type causal-lm \
    run \
    --model-path $MODEL_PATH \
    --compiled-model-path $COMPILED_MODEL_PATH \
    --torch-dtype bfloat16 \
    --start_rank_id 0 \
    --local_ranks_size $TP_DEGREE \
    --tp-degree $TP_DEGREE \
    --batch-size 1 \
    --max-context-length 12288 \
    --seq-len 12800 \
    --on-device-sampling \
    --top-k 1 \
    --fused-qkv \
    --sequence-parallel-enabled \
    --qkv-kernel-enabled \
    --attn-kernel-enabled \
    --mlp-kernel-enabled \
    --cc-pipeline-tiling-factor 1 \
    --draft-model-path $DRAFT_MODEL_PATH \
    --enable-fused-speculation \
    --speculation-length 7 \
    --pad-token-id 2 \
    --quantized-mlp-kernel-enabled \
    --quantization-type per_channel_symmetric \
    --rmsnorm-quantize-kernel-enabled \
    --enable-bucketing \
    --modules-to-not-convert-file $MTNC_FILE_PATH \
    --context-encoding-buckets 2048 4096 10240 12288 \
    --token-generation-buckets 12800 \
    --prompt "What is annapurna labs?" 2>&1 | tee compile_and_generate_log
The above script compiles a Neuron model with fused speculation and also runs generation on the example prompt specified with the --prompt flag. Refer to the NxD Inference API Reference for more information on these inference_demo flags.
You can change this prompt to your prompt of choice. The script's output is written to compile_and_generate_log, a log file in the working directory.
In this script, we also set two additional environment variables, XLA_HANDLE_SPECIAL_SCALAR and UNSAFE_FP8FNCAST, which enable the Neuron compiler to treat the rescaled FP8FN weights as FP8_EXP4 weights.
In addition, on subsequent runs of this script, you can add a --skip-compile flag to skip the compilation step, since the model was already compiled in the first run. This lets you test the model with different prompts.
Step 3: Start the vLLM server with the compiled Neuron model#
After compiling the model, you can run the model using vLLM. Save the contents of the script below to another shell script file, for example start_vllm.sh, and then run it.
export NEURON_RT_INSPECT_ENABLE=0
export NEURON_RT_VIRTUAL_CORE_SIZE=2
export XLA_HANDLE_SPECIAL_SCALAR=1
export UNSAFE_FP8FNCAST=1

MODEL_PATH="/home/ubuntu/models/Llama-3.1-405B-Instruct-FP8-rescaled"
DRAFT_MODEL_PATH="/home/ubuntu/models/Llama-3.2-1b-instruct"
COMPILED_MODEL_PATH="/home/ubuntu/traced_models/Llama-3.1-405B-Instruct_fp8"

export VLLM_NEURON_FRAMEWORK="neuronx-distributed-inference"
export NEURON_COMPILED_ARTIFACTS=$COMPILED_MODEL_PATH
VLLM_RPC_TIMEOUT=100000 python -m vllm.entrypoints.openai.api_server \
    --model $MODEL_PATH \
    --max-num-seqs 1 \
    --max-model-len 12800 \
    --tensor-parallel-size 64 \
    --device neuron \
    --speculative-max-model-len 12800 \
    --speculative-model $DRAFT_MODEL_PATH \
    --num-speculative-tokens 7 \
    --use-v2-block-manager \
    --override-neuron-config '{"enable_fused_speculation": true, "quantized_mlp_kernel_enabled": true, "quantization_type": "per_channel_symmetric", "skip_warmup": true}' \
    --port 8000 &
PID=$!
echo "vLLM server started with PID $PID"
Step 4: Measure performance using LLMPerf#
After the above steps, the vLLM server should be running. Before we can use the llmperf package, we need to make a few changes to its code. Follow the benchmarking with LLMPerf guide to apply the code changes.
We can now measure the performance using llmperf. Run the following script with the modified llmperf package.
# This should be the same path to which the model was downloaded (also used in the above steps).
MODEL_PATH="/home/ubuntu/models/Llama-3.1-405B-Instruct-FP8-rescaled"
# This is the name of the directory where the test results will be saved.
OUTPUT_PATH=llmperf-results-sonnets

export OPENAI_API_BASE="http://localhost:8000/v1"
export OPENAI_API_KEY="mock_key"

python token_benchmark_ray.py \
    --model $MODEL_PATH \
    --mean-input-tokens 10000 \
    --stddev-input-tokens 0 \
    --mean-output-tokens 1500 \
    --stddev-output-tokens 0 \
    --num-concurrent-requests 1 \
    --timeout 3600 \
    --max-num-completed-requests 50 \
    --additional-sampling-params '{}' \
    --results-dir $OUTPUT_PATH \
    --llm-api "openai"
The output of this Llama-3.1-405B run with fused speculation is shown below. Note that the numbers can vary slightly between runs but should be of a similar magnitude.
Results for token benchmark for /home/ubuntu/models/Llama-3.1-405B-Instruct-FP8-rescaled queried with the openai api.
inter_token_latency_s
    p25 = 0.008220573497974934 p50 = 0.008265312568750231 p75 = 0.008438719224417583 p90 = 0.00848199803312309 p95 = 0.008495625438929224 p99 = 0.011143428944987235 mean = 0.008419798457414533 min = 0.008173695931987216 max = 0.01364151847269386 stddev = 0.0007612118573477839
ttft_s
    p25 = 2.2543624382815324 p50 = 2.254961202503182 p75 = 2.2576071268413216 p90 = 2.2596270388457924 p95 = 2.260639927221928 p99 = 2.2628143909573555 mean = 2.256157155628316 min = 2.2534945809748024 max = 2.2629711360204965 stddev = 0.0023667267664955545
end_to_end_latency_s
    p25 = 14.586015026085079 p50 = 14.65608573507052 p75 = 14.91364526405232 p90 = 14.977840351965279 p95 = 15.000083449739032 p99 = 18.969864878777866 mean = 14.886235136194154 min = 14.520539953839034 max = 22.716861865017563 stddev = 1.1415236552464672
request_output_throughput_token_per_s
    p25 = 100.64608830743339 p50 = 102.4148205461138 p75 = 102.90679421801005 p90 = 103.02201242683091 p95 = 103.26614794565539 p99 = 103.36118277211666 mean = 101.22055373532301 min = 66.0742671641385 max = 103.37081160698546 stddev = 5.19249551094185
number_input_tokens
    p25 = 10000.0 p50 = 10000.0 p75 = 10000.0 p90 = 10000.0 p95 = 10000.0 p99 = 10000.0 mean = 10000.0 min = 10000 max = 10000 stddev = 0.0
number_output_tokens
    p25 = 1501.0 p50 = 1501.0 p75 = 1501.0 p90 = 1501.0 p95 = 1501.0 p99 = 1501.0 mean = 1501.0 min = 1501 max = 1501 stddev = 0.0
Number Of Errored Requests: 0
Overall Output Throughput: 100.69986490153724
Number Of Completed Requests: 50
Completed Requests Per Minute: 4.025311055357918
Conclusion#
As seen from the table below, draft-model-based fused speculative decoding and fp8 quantization significantly improved inference performance: P50 TPOT dropped from 37.9 ms to 8.27 ms and output token throughput increased roughly 4x (25.46 to 102.41 tokens per second), while P50 TTFT decreased from 2442 ms to 2255 ms compared to the baseline without speculative decoding. Note that a batch size of 1 is used in this tutorial for computing the metrics below.

| Scenario | TTFT (P50 in ms) | TPOT (P50 in ms) | Output token throughput (tokens per second) |
|---|---|---|---|
| No speculative decoding (bf16 weights) | 2442 | 37.9 | 25.46 |
| Fused speculative decoding + rescaled fp8 weights (Llama 3.2 1B draft) | 2255 | 8.27 | 102.41 |
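As a quick sanity check, the improvement factors quoted above follow directly from the P50 numbers in the table; the short snippet below reproduces them (values copied from the table, batch size 1).
# Speedups implied by the P50 results in the table above.
awk 'BEGIN {
    printf "TPOT speedup:       %.1fx\n", 37.9 / 8.27      # ~4.6x
    printf "Throughput speedup: %.1fx\n", 102.41 / 25.46   # ~4.0x
    printf "TTFT reduction:     %d ms\n", 2442 - 2255      # 187 ms
}'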