Support overlap scheduling for speculative decoding by timmy-feng · Pull Request #9588 · sgl-project/sglang
Motivation
Speculative decoding currently does not support overlap scheduling because of the sequential dependency between the draft and target models. However, overlap scheduling has been shown to achieve up to 10% performance gains in non-speculative use cases. This PR overlaps host-side scheduling with device execution for speculative decoding, yielding a 5-10% improvement at various batch sizes.
Feature parity is in the works.
To enable this experimental feature, the `SGLANG_ENABLE_EXPERIMENTAL_EAGLE_OVERLAP_SCHEDULE=1` environment variable must be set. Additionally, using FlashAttention 3 is recommended, as the FlashInfer backend still contains a host sync.
Modifications
There should be no behavior change if the `SGLANG_ENABLE_EXPERIMENTAL_EAGLE_OVERLAP_SCHEDULE` environment variable is not set.
Host Syncs
The following was done to remove host syncs:
- Removed dynamic shapes from target verify by always padding accept length to `spec_steps` and adding padding handlers
- Moved request finished / filtering checks to `process_batch_result_decode` and `filter_batch`, respectively
- Handle both allocation and freeing of pages on the scheduler (`resolve_last_batch_result` returns an eviction mask to the scheduler)
- Overestimate `seq_lens_cpu`, since the host can only know the exact sequence length from one step ago
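The fixed-shape padding described above can be sketched as follows. This is a hypothetical illustration, not the PR's actual code: the real implementation operates on GPU tensors, while plain lists are used here only to show the shape contract that removes the host sync.

```python
# Hypothetical sketch: pad per-request accepted tokens to a fixed length so
# tensor shapes stay static across steps. Names and constants below are
# illustrative assumptions, not sglang internals.

SPEC_STEPS = 5      # assumed value of --speculative-num-steps
PAD_TOKEN_ID = -1   # assumed sentinel for padded positions

def pad_accepted_tokens(accepted, spec_steps=SPEC_STEPS):
    """Pad each request's accepted tokens to `spec_steps` entries.

    With dynamic shapes, the scheduler must sync with the device to learn
    each request's accept length before building the next batch. Padding to
    a fixed length keeps every shape static, so the next step can be
    launched without a host sync; padding handlers later mask out the
    sentinel positions.
    """
    padded, lengths = [], []
    for toks in accepted:
        lengths.append(len(toks))
        padded.append(toks + [PAD_TOKEN_ID] * (spec_steps - len(toks)))
    return padded, lengths

# Three requests accepted 2, 4, and 1 draft tokens respectively.
padded, lengths = pad_accepted_tokens([[11, 12], [21, 22, 23, 24], [31]])
# Every row now has length SPEC_STEPS; `lengths` records true accept lengths.
```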
Eagle Client
After removal of all syncs, `EagleWorkerClient` was implemented with:
- A mock `forward_speculative_batch_generation` function which puts work on a queue for the forward thread
- A `FutureSpecInfo` class which contains future buffers corresponding to each tensor in `EagleDraftInput`
- `forward_thread_func_` and `resolve_last_batch_result`, which mirror their counterparts in `tp_worker_overlap_thread.py`
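The queue-plus-future pattern above can be sketched in a few lines. This is a simplified, hypothetical illustration of the idea, assuming made-up names (`FutureBuffer`, `forward_thread`); it is not the PR's actual API, and the real future buffers are preallocated device tensors rather than Python objects.

```python
import queue
import threading

class FutureBuffer:
    """Placeholder for a result the forward thread will fill in later."""
    def __init__(self):
        self._ready = threading.Event()
        self._value = None

    def set(self, value):
        self._value = value
        self._ready.set()

    def resolve(self):
        # The only blocking point: the scheduler waits here one step later
        # (analogous to resolve_last_batch_result) instead of syncing after
        # every forward launch.
        self._ready.wait()
        return self._value

def forward_thread(work_queue):
    """Consume batches from the queue and fulfill their futures."""
    while True:
        item = work_queue.get()
        if item is None:  # shutdown sentinel
            break
        batch, future = item
        # Stand-in for the draft + target forward passes.
        future.set([tok + 1 for tok in batch])

work_queue = queue.Queue()
t = threading.Thread(target=forward_thread, args=(work_queue,), daemon=True)
t.start()

# Scheduler side: enqueue work and keep scheduling; resolve one step later.
fut = FutureBuffer()
work_queue.put(([1, 2, 3], fut))
result = fut.resolve()  # -> [2, 3, 4]
work_queue.put(None)
t.join()
```

Decoupling submission (`put`) from resolution (`resolve`) is what lets the scheduler prepare the next batch on the host while the device is still running the previous one.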
Future Work
We hope these items can be addressed in future PRs:
- Support `page_size > 1` -- this exists in a branch of this work which supports paged attention for all backends other than `fa3`
- Support returning logprobs
- Support grammar (likely with syncs)
- Support P/D disaggregation -- we briefly implemented a proof of concept showing that P/D, overlap scheduling, and speculative decoding can work together
- Reduce code duplication between `eagle_worker.py` and `eagle_worker_for_overlap_scheduer.py`; we separated these two files for now to reduce the risk of breaking changes
Accuracy Tests
I ran GSM8K on an H100.
```shell
# Main
SGLANG_ENABLE_EXPERIMENTAL_EAGLE_OVERLAP_SCHEDULE=1 python -m sglang.launch_server \
    --model-path Qwen/Qwen3-8B \
    --speculative-draft-model-path Tengyunw/qwen3_8b_eagle3 \
    --speculative-algorithm EAGLE3 \
    --speculative-num-steps 5 \
    --speculative-eagle-topk 10 \
    --speculative-num-draft-tokens 32 \
    --attention-backend fa3 \
    --mem-fraction-static 0.7 \
    --dtype bfloat16 \
    --port 30000
python benchmark/gsm8k/bench_sglang.py --num-questions 200

Accuracy: 0.955
Invalid: 0.000
Latency: 12.379 s
Output throughput: 1922.860 token/s

# This branch
Accuracy: 0.950
Invalid: 0.000
Latency: 11.918 s
Output throughput: 2039.966 token/s
```
Benchmarking and Profiling
Benchmarks were run on an H200.
Before
With concurrency 1:
============ Serving Benchmark Result ============
Backend: sglang
Traffic request rate: inf
Max request concurrency: 1
Successful requests: 200
Benchmark duration (s): 121.97
Total input tokens: 64205
Total generated tokens: 42957
Total generated tokens (retokenized): 42956
Request throughput (req/s): 1.64
Input token throughput (tok/s): 526.41
Output token throughput (tok/s): 352.20
Total token throughput (tok/s): 878.60
Concurrency: 1.00
Accept length: 3.48
----------------End-to-End Latency----------------
Mean E2E Latency (ms): 609.39
Median E2E Latency (ms): 464.94
---------------Time to First Token----------------
Mean TTFT (ms): 24.10
Median TTFT (ms): 21.63
P99 TTFT (ms): 70.97
---------------Inter-Token Latency----------------
Mean ITL (ms): 2.74
Median ITL (ms): 2.04
P95 ITL (ms): 4.98
P99 ITL (ms): 9.74
Max ITL (ms): 17.18
==================================================
With concurrency 4:
============ Serving Benchmark Result ============
Backend: sglang
Traffic request rate: inf
Max request concurrency: 4
Successful requests: 200
Benchmark duration (s): 40.23
Total input tokens: 64205
Total generated tokens: 42957
Total generated tokens (retokenized): 42956
Request throughput (req/s): 4.97
Input token throughput (tok/s): 1596.02
Output token throughput (tok/s): 1067.83
Total token throughput (tok/s): 2663.85
Concurrency: 3.94
Accept length: 3.46
----------------End-to-End Latency----------------
Mean E2E Latency (ms): 792.78
Median E2E Latency (ms): 593.81
---------------Time to First Token----------------
Mean TTFT (ms): 31.66
Median TTFT (ms): 24.22
P99 TTFT (ms): 158.63
---------------Inter-Token Latency----------------
Mean ITL (ms): 3.56
Median ITL (ms): 2.82
P95 ITL (ms): 8.44
P99 ITL (ms): 11.96
Max ITL (ms): 267.40
==================================================
After
With concurrency 1:
============ Serving Benchmark Result ============
Backend: sglang
Traffic request rate: inf
Max request concurrency: 1
Successful requests: 200
Benchmark duration (s): 112.04
Total input tokens: 64205
Total generated tokens: 42957
Total generated tokens (retokenized): 42956
Request throughput (req/s): 1.79
Input token throughput (tok/s): 573.05
Output token throughput (tok/s): 383.40
Total token throughput (tok/s): 956.45
Concurrency: 1.00
Accept length: 3.54
----------------End-to-End Latency----------------
Mean E2E Latency (ms): 559.91
Median E2E Latency (ms): 431.09
---------------Time to First Token----------------
Mean TTFT (ms): 30.63
Median TTFT (ms): 28.67
P99 TTFT (ms): 73.23
---------------Inter-Token Latency----------------
Mean ITL (ms): 2.48
Median ITL (ms): 1.83
P95 ITL (ms): 4.48
P99 ITL (ms): 8.80
Max ITL (ms): 10.16
==================================================
With concurrency 4:
============ Serving Benchmark Result ============
Backend: sglang
Traffic request rate: inf
Max request concurrency: 4
Successful requests: 200
Benchmark duration (s): 36.16
Total input tokens: 64205
Total generated tokens: 42957
Total generated tokens (retokenized): 42955
Request throughput (req/s): 5.53
Input token throughput (tok/s): 1775.41
Output token throughput (tok/s): 1187.86
Total token throughput (tok/s): 2963.27
Concurrency: 3.95
Accept length: 3.57
----------------End-to-End Latency----------------
Mean E2E Latency (ms): 713.59
Median E2E Latency (ms): 529.06
---------------Time to First Token----------------
Mean TTFT (ms): 39.49
Median TTFT (ms): 29.25
P99 TTFT (ms): 213.86
---------------Inter-Token Latency----------------
Mean ITL (ms): 3.15
Median ITL (ms): 2.33
P95 ITL (ms): 7.01
P99 ITL (ms): 11.79
Max ITL (ms): 85.62
==================================================
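The relative gain follows directly from the output-token throughput numbers in the tables above:

```python
# Output token throughput (tok/s) from the before/after benchmark tables.
before = {1: 352.20, 4: 1067.83}
after = {1: 383.40, 4: 1187.86}

for conc in (1, 4):
    gain = after[conc] / before[conc] - 1
    print(f"concurrency {conc}: {gain:.1%} higher output throughput")
# concurrency 1: 8.9% higher output throughput
# concurrency 4: 11.2% higher output throughput
```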
Repro Script
This script was run on an H200:
```bash
#!/bin/bash

# Start SGLang server
SGLANG_ENABLE_EXPERIMENTAL_EAGLE_OVERLAP_SCHEDULE=1 python -m sglang.launch_server \
    --model-path Qwen/Qwen3-8B \
    --speculative-draft-model-path Tengyunw/qwen3_8b_eagle3 \
    --speculative-algorithm EAGLE3 \
    --speculative-num-steps 5 \
    --speculative-eagle-topk 10 \
    --speculative-num-draft-tokens 32 \
    --attention-backend fa3 \
    --mem-fraction-static 0.9 \
    --dtype bfloat16 \
    --port 30000 &
PID=$!

# Wait for server to start
while ! curl -s http://localhost:30000/health > /dev/null; do
    sleep 1
done

# Run accuracy benchmark
python benchmark/gsm8k/bench_sglang.py --num-questions 200

# Flush cache
curl -X POST http://localhost:30000/flush_cache

# Run latency benchmark (bs1)
python -m sglang.bench_serving \
    --backend sglang \
    --num-prompts 200 \
    --max-concurrency 1

# Flush cache
curl -X POST http://localhost:30000/flush_cache

# Run latency benchmark (bs4)
python -m sglang.bench_serving \
    --backend sglang \
    --num-prompts 200 \
    --max-concurrency 4

# Kill server
kill $PID
```
Checklist
- Format your code according to Format code with pre-commit.
- Add unit tests according to Run and add unit tests.
- Update documentation according to Write documentations.
- Provide accuracy and speed benchmark results according to Test the accuracy and Benchmark the speed.