Support overlap scheduling for speculative decoding by timmy-feng · Pull Request #9588 · sgl-project/sglang

Motivation

Speculative decoding currently does not support overlap scheduling because the draft and target models run in strict sequence. However, overlap scheduling has been shown to achieve up to 10% performance gains in non-speculative use cases. This PR adds host overlap to speculative decoding and improves throughput by roughly 5-10% at the batch sizes tested.

Feature parity is in the works.

To enable this experimental feature, set the SGLANG_ENABLE_EXPERIMENTAL_EAGLE_OVERLAP_SCHEDULE=1 environment variable. Using the FlashAttention 3 backend (--attention-backend fa3) is also recommended, because the FlashInfer backend still contains a host sync.

Modifications

There should be no behavior change if the SGLANG_ENABLE_EXPERIMENTAL_EAGLE_OVERLAP_SCHEDULE environment variable is not set.
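As an illustration of this opt-in gating, here is a minimal sketch of how an environment flag can leave the default path untouched. The helper names and the set of accepted truthy values are assumptions made for this example; they are not taken from the SGLang code base, which may parse the variable differently.

```python
import os

FLAG = "SGLANG_ENABLE_EXPERIMENTAL_EAGLE_OVERLAP_SCHEDULE"


def eagle_overlap_enabled(environ=os.environ) -> bool:
    """Hypothetical helper: True only when the flag is explicitly truthy."""
    # Assumption: "1" and "true" count as enabled; anything else is disabled.
    return environ.get(FLAG, "").strip().lower() in ("1", "true")


def pick_worker(environ=os.environ) -> str:
    """Return a placeholder label for the code path that would be taken."""
    # The labels are illustrative only and are not SGLang class names.
    return "overlap" if eagle_overlap_enabled(environ) else "default"


if __name__ == "__main__":
    print(pick_worker())
```

With the variable unset, empty, or set to "0", the sketch falls through to the default path, which mirrors the "no behavior change" guarantee stated above.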

Host Syncs

The following was done to remove host syncs:

Eagle Client

After removal of all syncs, EagleWorkerClient was implemented with:

Future Work

We hope these items can be addressed in future PRs.

Accuracy Tests

I ran GSM8K on an H100.

# Main
SGLANG_ENABLE_EXPERIMENTAL_EAGLE_OVERLAP_SCHEDULE=1 python -m sglang.launch_server --model-path Qwen/Qwen3-8B --speculative-draft-model-path Tengyunw/qwen3_8b_eagle3 --speculative-algorithm EAGLE3 --speculative-num-steps 5 --speculative-eagle-topk 10 --speculative-num-draft-tokens 32 --attention-backend fa3 --mem-fraction-static 0.7 --dtype bfloat16 --port 30000
python benchmark/gsm8k/bench_sglang.py --num-questions 200
Accuracy: 0.955
Invalid: 0.000
Latency: 12.379 s
Output throughput: 1922.860 token/s

# This branch
Accuracy: 0.950
Invalid: 0.000
Latency: 11.918 s
Output throughput: 2039.966 token/s

Benchmarking and Profiling

Benchmarks were run on an H200.

Before

With concurrency 1:

============ Serving Benchmark Result ============
Backend:                                 sglang    
Traffic request rate:                    inf       
Max request concurrency:                 1         
Successful requests:                     200       
Benchmark duration (s):                  121.97    
Total input tokens:                      64205     
Total generated tokens:                  42957     
Total generated tokens (retokenized):    42956     
Request throughput (req/s):              1.64      
Input token throughput (tok/s):          526.41    
Output token throughput (tok/s):         352.20    
Total token throughput (tok/s):          878.60    
Concurrency:                             1.00      
Accept length:                           3.48      
----------------End-to-End Latency----------------
Mean E2E Latency (ms):                   609.39    
Median E2E Latency (ms):                 464.94    
---------------Time to First Token----------------
Mean TTFT (ms):                          24.10     
Median TTFT (ms):                        21.63     
P99 TTFT (ms):                           70.97     
---------------Inter-Token Latency----------------
Mean ITL (ms):                           2.74      
Median ITL (ms):                         2.04      
P95 ITL (ms):                            4.98      
P99 ITL (ms):                            9.74      
Max ITL (ms):                            17.18     
==================================================

With concurrency 4:

============ Serving Benchmark Result ============
Backend:                                 sglang    
Traffic request rate:                    inf       
Max request concurrency:                 4         
Successful requests:                     200       
Benchmark duration (s):                  40.23     
Total input tokens:                      64205     
Total generated tokens:                  42957     
Total generated tokens (retokenized):    42956     
Request throughput (req/s):              4.97      
Input token throughput (tok/s):          1596.02   
Output token throughput (tok/s):         1067.83   
Total token throughput (tok/s):          2663.85   
Concurrency:                             3.94      
Accept length:                           3.46      
----------------End-to-End Latency----------------
Mean E2E Latency (ms):                   792.78    
Median E2E Latency (ms):                 593.81    
---------------Time to First Token----------------
Mean TTFT (ms):                          31.66     
Median TTFT (ms):                        24.22     
P99 TTFT (ms):                           158.63    
---------------Inter-Token Latency----------------
Mean ITL (ms):                           3.56      
Median ITL (ms):                         2.82      
P95 ITL (ms):                            8.44      
P99 ITL (ms):                            11.96     
Max ITL (ms):                            267.40    
==================================================

After

With concurrency 1:

============ Serving Benchmark Result ============
Backend:                                 sglang    
Traffic request rate:                    inf       
Max request concurrency:                 1         
Successful requests:                     200       
Benchmark duration (s):                  112.04    
Total input tokens:                      64205     
Total generated tokens:                  42957     
Total generated tokens (retokenized):    42956     
Request throughput (req/s):              1.79      
Input token throughput (tok/s):          573.05    
Output token throughput (tok/s):         383.40    
Total token throughput (tok/s):          956.45    
Concurrency:                             1.00      
Accept length:                           3.54      
----------------End-to-End Latency----------------
Mean E2E Latency (ms):                   559.91    
Median E2E Latency (ms):                 431.09    
---------------Time to First Token----------------
Mean TTFT (ms):                          30.63     
Median TTFT (ms):                        28.67     
P99 TTFT (ms):                           73.23     
---------------Inter-Token Latency----------------
Mean ITL (ms):                           2.48      
Median ITL (ms):                         1.83      
P95 ITL (ms):                            4.48      
P99 ITL (ms):                            8.80      
Max ITL (ms):                            10.16     
==================================================

With concurrency 4:

============ Serving Benchmark Result ============
Backend:                                 sglang    
Traffic request rate:                    inf       
Max request concurrency:                 4         
Successful requests:                     200       
Benchmark duration (s):                  36.16     
Total input tokens:                      64205     
Total generated tokens:                  42957     
Total generated tokens (retokenized):    42955     
Request throughput (req/s):              5.53      
Input token throughput (tok/s):          1775.41   
Output token throughput (tok/s):         1187.86   
Total token throughput (tok/s):          2963.27   
Concurrency:                             3.95      
Accept length:                           3.57      
----------------End-to-End Latency----------------
Mean E2E Latency (ms):                   713.59    
Median E2E Latency (ms):                 529.06    
---------------Time to First Token----------------
Mean TTFT (ms):                          39.49     
Median TTFT (ms):                        29.25     
P99 TTFT (ms):                           213.86    
---------------Inter-Token Latency----------------
Mean ITL (ms):                           3.15      
Median ITL (ms):                         2.33      
P95 ITL (ms):                            7.01      
P99 ITL (ms):                            11.79     
Max ITL (ms):                            85.62     
==================================================
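As a quick sanity check on the tables above, the relative change for a few headline metrics can be computed directly from the reported numbers. This is only a sketch for reading the results: the metric selection and the helper name are choices made for this example, and the values are copied from the Before and After runs.

```python
def relative_change(before: float, after: float) -> float:
    """Percentage change from before to after (positive means an increase)."""
    return (after - before) / before * 100.0


# (before, after) pairs copied from the benchmark tables, keyed by concurrency.
metrics = {
    "Output token throughput (tok/s)": {1: (352.20, 383.40), 4: (1067.83, 1187.86)},
    "Mean ITL (ms)": {1: (2.74, 2.48), 4: (3.56, 3.15)},
    "Mean TTFT (ms)": {1: (24.10, 30.63), 4: (31.66, 39.49)},
}

if __name__ == "__main__":
    for name, by_concurrency in metrics.items():
        for concurrency, (before, after) in by_concurrency.items():
            change = relative_change(before, after)
            print(f"{name}, concurrency {concurrency}: {change:+.2f}%")
```

On these numbers, output throughput rises by roughly 9% at concurrency 1 and 11% at concurrency 4, and mean ITL drops by a similar margin, while mean TTFT increases in both runs.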

Repro Script

This script was run on an H200:

#!/bin/bash

# Start SGLang server
SGLANG_ENABLE_EXPERIMENTAL_EAGLE_OVERLAP_SCHEDULE=1 python -m sglang.launch_server \
    --model-path Qwen/Qwen3-8B \
    --speculative-draft-model-path Tengyunw/qwen3_8b_eagle3 \
    --speculative-algorithm EAGLE3 \
    --speculative-num-steps 5 \
    --speculative-eagle-topk 10 \
    --speculative-num-draft-tokens 32 \
    --attention-backend fa3 \
    --mem-fraction-static 0.9 \
    --dtype bfloat16 \
    --port 30000 &
PID=$!


# Wait for server to start
while ! curl -s http://localhost:30000/health > /dev/null; do
    sleep 1
done

# Run accuracy benchmark
python benchmark/gsm8k/bench_sglang.py --num-questions 200

# Flush cache
curl -X POST http://localhost:30000/flush_cache

# Run latency benchmark (bs1)
python -m sglang.bench_serving \
    --backend sglang \
    --num-prompts 200 \
    --max-concurrency 1

# Flush cache
curl -X POST http://localhost:30000/flush_cache

# Run latency benchmark (bs4)
python -m sglang.bench_serving \
    --backend sglang \
    --num-prompts 200 \
    --max-concurrency 4

# Kill server
kill "$PID"

Checklist