Support overlap scheduling for speculative decoding by timmy-feng · Pull Request #9588 · sgl-project/sglang
Motivation
Speculative decoding currently does not support overlap scheduling because of the sequential dependency between the draft and target models. However, overlap scheduling has been shown to achieve up to 10% performance gains in non-speculative use cases. This PR overlaps host-side scheduling with device execution for speculative decoding, yielding a 5-10% improvement at various batch sizes.
Feature parity is in the works.
To enable this experimental feature, the `SGLANG_ENABLE_EXPERIMENTAL_EAGLE_OVERLAP_SCHEDULE=1` environment variable must be set. Additionally, using FlashAttention 3 is recommended, as the FlashInfer backend still contains a host sync.
Modifications
There should be no behavior change if the `SGLANG_ENABLE_EXPERIMENTAL_EAGLE_OVERLAP_SCHEDULE` environment variable is not set.
Host Syncs
The following was done to remove host syncs:
- Removed dynamic shapes from target verify by always padding accept length to `spec_steps` and adding padding handlers
- Moved request finished / filtering checks to `process_batch_result_decode` and `filter_batch`, respectively
- Handle both allocation and freeing of pages on the scheduler (`resolve_last_batch_result` returns an eviction mask to the scheduler)
- Overestimate `seq_lens_cpu`, since the host can only know the exact sequence length from one step ago
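The fixed-shape padding described above can be sketched as follows. This is a hypothetical illustration, not the PR's actual code: the real implementation operates on GPU tensors, while plain lists are used here only to show the shape contract that removes the host sync.

```python
# Hypothetical sketch: pad per-request accepted tokens to a fixed length so
# tensor shapes stay static across steps. Names and constants below are
# illustrative assumptions, not sglang internals.

SPEC_STEPS = 5      # assumed value of --speculative-num-steps
PAD_TOKEN_ID = -1   # assumed sentinel for padded positions

def pad_accepted_tokens(accepted, spec_steps=SPEC_STEPS):
    """Pad each request's accepted tokens to `spec_steps` entries.

    With dynamic shapes, the scheduler must sync with the device to learn
    each request's accept length before building the next batch. Padding to
    a fixed length keeps every shape static, so the next step can be
    launched without a host sync; padding handlers later mask out the
    sentinel positions.
    """
    padded, lengths = [], []
    for toks in accepted:
        lengths.append(len(toks))
        padded.append(toks + [PAD_TOKEN_ID] * (spec_steps - len(toks)))
    return padded, lengths

# Three requests accepted 2, 4, and 1 draft tokens respectively.
padded, lengths = pad_accepted_tokens([[11, 12], [21, 22, 23, 24], [31]])
# Every row now has length SPEC_STEPS; `lengths` records true accept lengths.
```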
Eagle Client
After removal of all syncs, `EagleWorkerClient` was implemented with:
- A mock `forward_speculative_batch_generation` function which puts work on a queue for the forward thread
- A `FutureSpecInfo` class which contains future buffers corresponding to each tensor in `EagleDraftInput`
- `forward_thread_func_` and `resolve_last_batch_result`, which mirror their counterparts in `tp_worker_overlap_thread.py`
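The queue-plus-future pattern above can be sketched in a few lines. This is a simplified, hypothetical illustration of the idea, assuming made-up names (`FutureBuffer`, `forward_thread`); it is not the PR's actual API, and the real future buffers are preallocated device tensors rather than Python objects.

```python
import queue
import threading

class FutureBuffer:
    """Placeholder for a result the forward thread will fill in later."""
    def __init__(self):
        self._ready = threading.Event()
        self._value = None

    def set(self, value):
        self._value = value
        self._ready.set()

    def resolve(self):
        # The only blocking point: the scheduler waits here one step later
        # (analogous to resolve_last_batch_result) instead of syncing after
        # every forward launch.
        self._ready.wait()
        return self._value

def forward_thread(work_queue):
    """Consume batches from the queue and fulfill their futures."""
    while True:
        item = work_queue.get()
        if item is None:  # shutdown sentinel
            break
        batch, future = item
        # Stand-in for the draft + target forward passes.
        future.set([tok + 1 for tok in batch])

work_queue = queue.Queue()
t = threading.Thread(target=forward_thread, args=(work_queue,), daemon=True)
t.start()

# Scheduler side: enqueue work and keep scheduling; resolve one step later.
fut = FutureBuffer()
work_queue.put(([1, 2, 3], fut))
result = fut.resolve()  # -> [2, 3, 4]
work_queue.put(None)
t.join()
```

Decoupling submission (`put`) from resolution (`resolve`) is what lets the scheduler prepare the next batch on the host while the device is still running the previous one.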
Future Work
We hope these items can be addressed in future PRs:
- Support `page_size > 1` -- this exists in a branch of this work which supports paged attention for all backends other than `fa3`
- Support returning logprobs
- Support grammar (likely with syncs)
- Support P/D disaggregation -- we briefly implemented a proof of concept showing that P/D, overlap scheduling, and speculative decoding can work together
- Reduce code duplication between `eagle_worker.py` and `eagle_worker_for_overlap_scheduer.py`; we separated these two files for now to reduce the risk of breaking changes
Accuracy Tests
I ran GSM8K on an H100.
```shell
# Main
SGLANG_ENABLE_EXPERIMENTAL_EAGLE_OVERLAP_SCHEDULE=1 python -m sglang.launch_server \
    --model-path Qwen/Qwen3-8B \
    --speculative-draft-model-path Tengyunw/qwen3_8b_eagle3 \
    --speculative-algorithm EAGLE3 \
    --speculative-num-steps 5 \
    --speculative-eagle-topk 10 \
    --speculative-num-draft-tokens 32 \
    --attention-backend fa3 \
    --mem-fraction-static 0.7 \
    --dtype bfloat16 \
    --port 30000
python benchmark/gsm8k/bench_sglang.py --num-questions 200

Accuracy: 0.955
Invalid: 0.000
Latency: 12.379 s
Output throughput: 1922.860 token/s

# This branch
Accuracy: 0.950
Invalid: 0.000
Latency: 11.918 s
Output throughput: 2039.966 token/s
```
Benchmarking and Profiling
Benchmarks were run on an H200.
Before
With concurrency 1:
============ Serving Benchmark Result ============
Backend: sglang
Traffic request rate: inf
Max request concurrency: 1
Successful requests: 200
Benchmark duration (s): 121.97
Total input tokens: 64205
Total generated tokens: 42957
Total generated tokens (retokenized): 42956
Request throughput (req/s): 1.64
Input token throughput (tok/s): 526.41
Output token throughput (tok/s): 352.20
Total token throughput (tok/s): 878.60
Concurrency: 1.00
Accept length: 3.48
----------------End-to-End Latency----------------
Mean E2E Latency (ms): 609.39
Median E2E Latency (ms): 464.94
---------------Time to First Token----------------
Mean TTFT (ms): 24.10
Median TTFT (ms): 21.63
P99 TTFT (ms): 70.97
---------------Inter-Token Latency----------------
Mean ITL (ms): 2.74
Median ITL (ms): 2.04
P95 ITL (ms): 4.98
P99 ITL (ms): 9.74
Max ITL (ms): 17.18
==================================================
With concurrency 4:
============ Serving Benchmark Result ============
Backend: sglang
Traffic request rate: inf
Max request concurrency: 4
Successful requests: 200
Benchmark duration (s): 40.23
Total input tokens: 64205
Total generated tokens: 42957
Total generated tokens (retokenized): 42956
Request throughput (req/s): 4.97
Input token throughput (tok/s): 1596.02
Output token throughput (tok/s): 1067.83
Total token throughput (tok/s): 2663.85
Concurrency: 3.94
Accept length: 3.46
----------------End-to-End Latency----------------
Mean E2E Latency (ms): 792.78
Median E2E Latency (ms): 593.81
---------------Time to First Token----------------
Mean TTFT (ms): 31.66
Median TTFT (ms): 24.22
P99 TTFT (ms): 158.63
---------------Inter-Token Latency----------------
Mean ITL (ms): 3.56
Median ITL (ms): 2.82
P95 ITL (ms): 8.44
P99 ITL (ms): 11.96
Max ITL (ms): 267.40
==================================================
After
With concurrency 1:
============ Serving Benchmark Result ============
Backend: sglang
Traffic request rate: inf
Max request concurrency: 1
Successful requests: 200
Benchmark duration (s): 112.04
Total input tokens: 64205
Total generated tokens: 42957
Total generated tokens (retokenized): 42956
Request throughput (req/s): 1.79
Input token throughput (tok/s): 573.05
Output token throughput (tok/s): 383.40
Total token throughput (tok/s): 956.45
Concurrency: 1.00
Accept length: 3.54
----------------End-to-End Latency----------------
Mean E2E Latency (ms): 559.91
Median E2E Latency (ms): 431.09
---------------Time to First Token----------------
Mean TTFT (ms): 30.63
Median TTFT (ms): 28.67
P99 TTFT (ms): 73.23
---------------Inter-Token Latency----------------
Mean ITL (ms): 2.48
Median ITL (ms): 1.83
P95 ITL (ms): 4.48
P99 ITL (ms): 8.80
Max ITL (ms): 10.16
==================================================
With concurrency 4:
============ Serving Benchmark Result ============
Backend: sglang
Traffic request rate: inf
Max request concurrency: 4
Successful requests: 200
Benchmark duration (s): 36.16
Total input tokens: 64205
Total generated tokens: 42957
Total generated tokens (retokenized): 42955
Request throughput (req/s): 5.53
Input token throughput (tok/s): 1775.41
Output token throughput (tok/s): 1187.86
Total token throughput (tok/s): 2963.27
Concurrency: 3.95
Accept length: 3.57
----------------End-to-End Latency----------------
Mean E2E Latency (ms): 713.59
Median E2E Latency (ms): 529.06
---------------Time to First Token----------------
Mean TTFT (ms): 39.49
Median TTFT (ms): 29.25
P99 TTFT (ms): 213.86
---------------Inter-Token Latency----------------
Mean ITL (ms): 3.15
Median ITL (ms): 2.33
P95 ITL (ms): 7.01
P99 ITL (ms): 11.79
Max ITL (ms): 85.62
==================================================
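The relative gain follows directly from the output-token throughput numbers in the tables above:

```python
# Output token throughput (tok/s) from the before/after benchmark tables.
before = {1: 352.20, 4: 1067.83}
after = {1: 383.40, 4: 1187.86}

for conc in (1, 4):
    gain = after[conc] / before[conc] - 1
    print(f"concurrency {conc}: {gain:.1%} higher output throughput")
# concurrency 1: 8.9% higher output throughput
# concurrency 4: 11.2% higher output throughput
```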
Repro Script
This script was run on an H200:
```bash
#!/bin/bash

# Start SGLang server
SGLANG_ENABLE_EXPERIMENTAL_EAGLE_OVERLAP_SCHEDULE=1 python -m sglang.launch_server \
    --model-path Qwen/Qwen3-8B \
    --speculative-draft-model-path Tengyunw/qwen3_8b_eagle3 \
    --speculative-algorithm EAGLE3 \
    --speculative-num-steps 5 \
    --speculative-eagle-topk 10 \
    --speculative-num-draft-tokens 32 \
    --attention-backend fa3 \
    --mem-fraction-static 0.9 \
    --dtype bfloat16 \
    --port 30000 &
PID=$!

# Wait for server to start
while ! curl -s http://localhost:30000/health > /dev/null; do
    sleep 1
done

# Run accuracy benchmark
python benchmark/gsm8k/bench_sglang.py --num-questions 200

# Flush cache
curl -X POST http://localhost:30000/flush_cache

# Run latency benchmark (bs1)
python -m sglang.bench_serving \
    --backend sglang \
    --num-prompts 200 \
    --max-concurrency 1

# Flush cache
curl -X POST http://localhost:30000/flush_cache

# Run latency benchmark (bs4)
python -m sglang.bench_serving \
    --backend sglang \
    --num-prompts 200 \
    --max-concurrency 4

# Kill server
kill $PID
```
Checklist
- Format your code according to Format code with pre-commit.
- Add unit tests according to Run and add unit tests.
- Update documentation according to Write documentations.
- Provide accuracy and speed benchmark results according to Test the accuracy and Benchmark the speed.