SGLang Frontend Language#
The SGLang frontend language lets you define prompts in a simple, convenient, and structured way.
Launch A Server#
Launch the server in your terminal and wait for it to initialize.
import requests
import os

from sglang import assistant_begin, assistant_end
from sglang import assistant, function, gen, system, user
from sglang import image
from sglang import RuntimeEndpoint, set_default_backend
from sglang.srt.utils import load_image
from sglang.test.test_utils import is_in_ci
from sglang.utils import print_highlight, terminate_process, wait_for_server

if is_in_ci():
    from patch import launch_server_cmd
else:
    from sglang.utils import launch_server_cmd

server_process, port = launch_server_cmd(
    "python -m sglang.launch_server --model-path Qwen/Qwen2.5-7B-Instruct --host 0.0.0.0"
)

wait_for_server(f"http://localhost:{port}")
print(f"Server started on http://localhost:{port}")
[2025-06-16 16:47:24] server_args=ServerArgs(model_path='Qwen/Qwen2.5-7B-Instruct', tokenizer_path='Qwen/Qwen2.5-7B-Instruct', tokenizer_mode='auto', skip_tokenizer_init=False, load_format='auto', trust_remote_code=False, dtype='auto', kv_cache_dtype='auto', quantization=None, quantization_param_path=None, context_length=None, device='cuda', served_model_name='Qwen/Qwen2.5-7B-Instruct', chat_template=None, completion_template=None, is_embedding=False, enable_multimodal=None, revision=None, impl='auto', host='0.0.0.0', port=37057, mem_fraction_static=0.874, max_running_requests=200, max_total_tokens=20480, chunked_prefill_size=8192, max_prefill_tokens=16384, schedule_policy='fcfs', schedule_conservativeness=1.0, cpu_offload_gb=0, page_size=1, tp_size=1, pp_size=1, max_micro_batch_size=None, stream_interval=1, stream_output=False, random_seed=889184955, constrained_json_whitespace_pattern=None, watchdog_timeout=300, dist_timeout=None, download_dir=None, base_gpu_id=0, gpu_id_step=1, sleep_on_idle=False, log_level='info', log_level_http=None, log_requests=False, log_requests_level=0, show_time_cost=False, enable_metrics=False, bucket_time_to_first_token=None, bucket_e2e_request_latency=None, bucket_inter_token_latency=None, collect_tokens_histogram=False, decode_log_interval=40, enable_request_time_stats_logging=False, kv_events_config=None, api_key=None, file_storage_path='sglang_storage', enable_cache_report=False, reasoning_parser=None, tool_call_parser=None, dp_size=1, load_balance_method='round_robin', dist_init_addr=None, nnodes=1, node_rank=0, json_model_override_args='{}', preferred_sampling_params=None, lora_paths=None, max_loras_per_batch=8, lora_backend='triton', attention_backend=None, sampling_backend='flashinfer', grammar_backend='xgrammar', mm_attention_backend=None, speculative_algorithm=None, speculative_draft_model_path=None, speculative_num_steps=None, speculative_eagle_topk=None, speculative_num_draft_tokens=None, speculative_accept_threshold_single=1.0, speculative_accept_threshold_acc=1.0, speculative_token_map=None, ep_size=1, enable_ep_moe=False, enable_deepep_moe=False, deepep_mode='auto', ep_num_redundant_experts=0, ep_dispatch_algorithm='static', init_expert_location='trivial', enable_eplb=False, eplb_algorithm='auto', eplb_rebalance_num_iterations=1000, eplb_rebalance_layers_per_chunk=None, expert_distribution_recorder_mode=None, expert_distribution_recorder_buffer_size=1000, enable_expert_distribution_metrics=False, deepep_config=None, moe_dense_tp_size=None, enable_double_sparsity=False, ds_channel_config_path=None, ds_heavy_channel_num=32, ds_heavy_token_num=256, ds_heavy_channel_type='qk', ds_sparse_decode_threshold=4096, disable_radix_cache=False, cuda_graph_max_bs=None, cuda_graph_bs=None, disable_cuda_graph=True, disable_cuda_graph_padding=False, enable_profile_cuda_graph=False, enable_nccl_nvls=False, enable_tokenizer_batch_encode=False, disable_outlines_disk_cache=False, disable_custom_all_reduce=False, enable_mscclpp=False, disable_overlap_schedule=False, disable_overlap_cg_plan=False, enable_mixed_chunk=False, enable_dp_attention=False, enable_dp_lm_head=False, enable_two_batch_overlap=False, enable_torch_compile=False, torch_compile_max_bs=32, torchao_config='', enable_nan_detection=False, enable_p2p_check=False, triton_attention_reduce_in_fp32=False, triton_attention_num_kv_splits=8, num_continuous_decode_steps=1, delete_ckpt_after_loading=False, enable_memory_saver=False, allow_auto_truncate=False, enable_custom_logit_processor=False, 
enable_hierarchical_cache=False, hicache_ratio=2.0, hicache_size=0, hicache_write_policy='write_through_selective', flashinfer_mla_disable_ragged=False, disable_shared_experts_fusion=False, disable_chunked_prefix_cache=False, disable_fast_image_processor=False, enable_return_hidden_states=False, warmups=None, debug_tensor_dump_output_folder=None, debug_tensor_dump_input_file=None, debug_tensor_dump_inject=False, debug_tensor_dump_prefill_only=False, disaggregation_mode='null', disaggregation_transfer_backend='mooncake', disaggregation_bootstrap_port=8998, disaggregation_decode_tp=None, disaggregation_decode_dp=None, disaggregation_prefill_pp=1, disaggregation_ib_device=None, num_reserved_decode_tokens=512, pdlb_url=None) [2025-06-16 16:47:31] Attention backend not set. Use fa3 backend by default. [2025-06-16 16:47:31] Init torch distributed begin. [2025-06-16 16:47:32] Init torch distributed ends. mem usage=0.00 GB [2025-06-16 16:47:32] Load weight begin. avail mem=78.50 GB [2025-06-16 16:47:33] Using model weights format ['*.safetensors'] Loading safetensors checkpoint shards: 0% Completed | 0/4 [00:00<?, ?it/s] Loading safetensors checkpoint shards: 25% Completed | 1/4 [00:00<00:02, 1.47it/s] Loading safetensors checkpoint shards: 50% Completed | 2/4 [00:01<00:01, 1.36it/s] Loading safetensors checkpoint shards: 75% Completed | 3/4 [00:02<00:00, 1.32it/s] Loading safetensors checkpoint shards: 100% Completed | 4/4 [00:02<00:00, 1.37it/s] Loading safetensors checkpoint shards: 100% Completed | 4/4 [00:02<00:00, 1.37it/s]
[2025-06-16 16:47:36] Load weight end. type=Qwen2ForCausalLM, dtype=torch.bfloat16, avail mem=64.20 GB, mem usage=14.30 GB. [2025-06-16 16:47:36] KV Cache is allocated. #tokens: 20480, K size: 0.55 GB, V size: 0.55 GB [2025-06-16 16:47:36] Memory pool end. avail mem=62.90 GB [2025-06-16 16:47:36] max_total_num_tokens=20480, chunked_prefill_size=8192, max_prefill_tokens=16384, max_running_requests=200, context_len=32768, available_gpu_mem=62.81 GB [2025-06-16 16:47:37] INFO: Started server process [259288] [2025-06-16 16:47:37] INFO: Waiting for application startup. [2025-06-16 16:47:37] INFO: Application startup complete. [2025-06-16 16:47:37] INFO: Uvicorn running on http://0.0.0.0:37057 (Press CTRL+C to quit) [2025-06-16 16:47:37] INFO: 127.0.0.1:33874 - "GET /v1/models HTTP/1.1" 200 OK [2025-06-16 16:47:38] INFO: 127.0.0.1:33880 - "GET /get_model_info HTTP/1.1" 200 OK [2025-06-16 16:47:38] Prefill batch. #new-seq: 1, #new-token: 6, #cached-token: 0, token usage: 0.00, #running-req: 0, #queue-req: 0 [2025-06-16 16:47:39] INFO: 127.0.0.1:33890 - "POST /generate HTTP/1.1" 200 OK [2025-06-16 16:47:39] The server is fired up and ready to roll!
NOTE: Typically, the server runs in a separate terminal. In this notebook, we run the server and notebook code together, so their outputs are combined. To improve clarity, the server logs are displayed in the original black color, while the notebook outputs are highlighted in blue. We run these notebooks in a parallel CI environment, so the throughput is not representative of actual performance.
Server started on http://localhost:37057
Set the default backend. Note: besides the local server, you may also use OpenAI or other API endpoints.
set_default_backend(RuntimeEndpoint(f"http://localhost:{port}"))
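For reference, here is a minimal sketch of pointing the frontend at a hosted OpenAI endpoint instead of the local RuntimeEndpoint. It assumes an OPENAI_API_KEY environment variable is set, and the model name is only an example; constrained decoding features shown later (e.g., regex) are only supported on local models.

# Hypothetical alternative backend (not used in the rest of this notebook).
from sglang import OpenAI

# Requires OPENAI_API_KEY; "gpt-4o-mini" is an illustrative model name.
set_default_backend(OpenAI("gpt-4o-mini"))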
[2025-06-16 16:47:42] INFO: 127.0.0.1:33904 - "GET /get_model_info HTTP/1.1" 200 OK
Basic Usage#
The simplest way to use the SGLang frontend language is a basic question-and-answer dialog between a user and an assistant.
@function
def basic_qa(s, question):
    s += system("You are a helpful assistant that can answer questions.")
    s += user(question)
    s += assistant(gen("answer", max_tokens=512))
state = basic_qa("List 3 countries and their capitals.")
print_highlight(state["answer"])
[2025-06-16 16:47:42] Prefill batch. #new-seq: 1, #new-token: 31, #cached-token: 0, token usage: 0.00, #running-req: 0, #queue-req: 0 [2025-06-16 16:47:42] Decode batch. #running-req: 1, #token: 1, token usage: 0.00, cuda graph: False, gen throughput (token/s): 6.67, #queue-req: 0 [2025-06-16 16:47:42] INFO: 127.0.0.1:33916 - "POST /generate HTTP/1.1" 200 OK
Sure! Here are three countries with their respective capitals:
1. **France - Paris** 2. **Germany - Berlin** 3. **Japan - Tokyo**
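The decorated function can also be invoked through .run, which accepts sampling parameters alongside the prompt arguments (the Streaming section below uses this as well). A small sketch of the same example with an illustrative temperature:

# Same basic_qa function as above, invoked via .run with a sampling parameter.
# The temperature value is illustrative.
state = basic_qa.run(
    question="List 3 countries and their capitals.",
    temperature=0.2,
)
print_highlight(state["answer"])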
Multi-turn Dialog#
The SGLang frontend language can also be used to define multi-turn dialogs.
@function
def multi_turn_qa(s):
    s += system("You are a helpful assistant that can answer questions.")
    s += user("Please give me a list of 3 countries and their capitals.")
    s += assistant(gen("first_answer", max_tokens=512))
    s += user("Please give me another list of 3 countries and their capitals.")
    s += assistant(gen("second_answer", max_tokens=512))
    return s
state = multi_turn_qa()
print_highlight(state["first_answer"])
print_highlight(state["second_answer"])
[2025-06-16 16:47:42] Prefill batch. #new-seq: 1, #new-token: 18, #cached-token: 18, token usage: 0.00, #running-req: 0, #queue-req: 0 [2025-06-16 16:47:43] INFO: 127.0.0.1:33924 - "POST /generate HTTP/1.1" 200 OK
Here's a list of three countries along with their respective capitals:
1. France - Paris 2. Japan - Tokyo 3. Brazil - Brasília
[2025-06-16 16:47:43] Prefill batch. #new-seq: 1, #new-token: 23, #cached-token: 67, token usage: 0.00, #running-req: 0, #queue-req: 0 [2025-06-16 16:47:43] Decode batch. #running-req: 1, #token: 98, token usage: 0.00, cuda graph: False, gen throughput (token/s): 100.85, #queue-req: 0 [2025-06-16 16:47:43] INFO: 127.0.0.1:33936 - "POST /generate HTTP/1.1" 200 OK
Certainly! Here's another list of three countries along with their respective capitals:
1. Canada - Ottawa 2. Egypt - Cairo 3. Mexico - Mexico City
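Beyond the named captures, the returned state keeps the whole conversation. A minimal sketch, assuming the state helpers text() and messages() behave as in other SGLang frontend examples (returning the raw prompt text and the chat-formatted message list, respectively):

# Assumed helpers on the returned state object: text() and messages().
print_highlight(state.text())  # the full prompt plus generations as one string
for m in state.messages():     # assumed: list of {"role": ..., "content": ...} dicts
    print_highlight(f"{m['role']}: {m['content'][:60]}")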
Control flow#
You may use any Python code within the function to define more complex control flows.
@function
def tool_use(s, question):
    s += assistant(
        "To answer this question: "
        + question
        + ". I need to use a "
        + gen("tool", choices=["calculator", "search engine"])
        + ". "
    )

    if s["tool"] == "calculator":
        s += assistant("The math expression is: " + gen("expression"))
    elif s["tool"] == "search engine":
        s += assistant("The key word to search is: " + gen("word"))
state = tool_use("What is 2 * 2?")
print_highlight(state["tool"])
print_highlight(state["expression"])
[2025-06-16 16:47:43] Prefill batch. #new-seq: 1, #new-token: 25, #cached-token: 8, token usage: 0.00, #running-req: 0, #queue-req: 0 [2025-06-16 16:47:43] INFO: 127.0.0.1:33952 - "POST /generate HTTP/1.1" 200 OK [2025-06-16 16:47:43] Prefill batch. #new-seq: 1, #new-token: 2, #cached-token: 31, token usage: 0.00, #running-req: 0, #queue-req: 0 [2025-06-16 16:47:43] Prefill batch. #new-seq: 1, #new-token: 3, #cached-token: 31, token usage: 0.00, #running-req: 1, #queue-req: 0 [2025-06-16 16:47:43] INFO: 127.0.0.1:33964 - "POST /generate HTTP/1.1" 200 OK
[2025-06-16 16:47:43] Prefill batch. #new-seq: 1, #new-token: 13, #cached-token: 33, token usage: 0.00, #running-req: 0, #queue-req: 0 [2025-06-16 16:47:43] Decode batch. #running-req: 1, #token: 58, token usage: 0.00, cuda graph: False, gen throughput (token/s): 95.33, #queue-req: 0 [2025-06-16 16:47:43] INFO: 127.0.0.1:33968 - "POST /generate HTTP/1.1" 200 OK
2 * 2.
Let's calculate it without a calculator:
\[ 2 * 2 = 4 \]
So, 2 * 2 equals 4.
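Since the function body is ordinary Python, loops work the same way as the if/elif branch above. A hypothetical sketch (function name and prompts are made up for illustration) that asks for several follow-up ideas in a loop:

# Hypothetical example: a Python for-loop driving multiple generations.
@function
def brainstorm(s, topic, rounds=3):
    s += system("You are a helpful assistant.")
    s += user(f"Brainstorm short ideas about {topic}.")
    for i in range(rounds):
        s += assistant(gen(f"idea_{i}", max_tokens=64, stop="\n"))
        s += user("Give one more idea, different from the previous ones.")


state = brainstorm("staying healthy")
print_highlight(state["idea_0"])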
Parallelism#
Use fork to launch parallel prompts. Because sgl.gen is non-blocking, the for loop below issues two generation calls in parallel.
@function
def tip_suggestion(s):
    s += assistant(
        "Here are two tips for staying healthy: "
        "1. Balanced Diet. 2. Regular Exercise.\n\n"
    )

    forks = s.fork(2)
    for i, f in enumerate(forks):
        f += assistant(
            f"Now, expand tip {i+1} into a paragraph:\n"
            + gen("detailed_tip", max_tokens=256, stop="\n\n")
        )

    s += assistant("Tip 1:" + forks[0]["detailed_tip"] + "\n")
    s += assistant("Tip 2:" + forks[1]["detailed_tip"] + "\n")
    s += assistant(
        "To summarize the above two tips, I can say:\n" + gen("summary", max_tokens=512)
    )
state = tip_suggestion()
print_highlight(state["summary"])
[2025-06-16 16:47:43] Prefill batch. #new-seq: 1, #new-token: 35, #cached-token: 14, token usage: 0.00, #running-req: 0, #queue-req: 0 [2025-06-16 16:47:43] Prefill batch. #new-seq: 1, #new-token: 35, #cached-token: 14, token usage: 0.00, #running-req: 1, #queue-req: 0 [2025-06-16 16:47:44] Decode batch. #running-req: 2, #token: 84, token usage: 0.00, cuda graph: False, gen throughput (token/s): 94.71, #queue-req: 0 [2025-06-16 16:47:44] Decode batch. #running-req: 2, #token: 164, token usage: 0.01, cuda graph: False, gen throughput (token/s): 218.97, #queue-req: 0 [2025-06-16 16:47:44] Decode batch. #running-req: 2, #token: 244, token usage: 0.01, cuda graph: False, gen throughput (token/s): 218.29, #queue-req: 0 [2025-06-16 16:47:45] Decode batch. #running-req: 2, #token: 324, token usage: 0.02, cuda graph: False, gen throughput (token/s): 217.82, #queue-req: 0 [2025-06-16 16:47:45] INFO: 127.0.0.1:33980 - "POST /generate HTTP/1.1" 200 OK [2025-06-16 16:47:45] Decode batch. #running-req: 1, #token: 224, token usage: 0.01, cuda graph: False, gen throughput (token/s): 143.40, #queue-req: 0 [2025-06-16 16:47:45] INFO: 127.0.0.1:33982 - "POST /generate HTTP/1.1" 200 OK [2025-06-16 16:47:45] Prefill batch. #new-seq: 1, #new-token: 357, #cached-token: 39, token usage: 0.00, #running-req: 0, #queue-req: 0 [2025-06-16 16:47:45] Decode batch. #running-req: 1, #token: 426, token usage: 0.02, cuda graph: False, gen throughput (token/s): 104.73, #queue-req: 0 [2025-06-16 16:47:46] Decode batch. #running-req: 1, #token: 466, token usage: 0.02, cuda graph: False, gen throughput (token/s): 114.41, #queue-req: 0 [2025-06-16 16:47:46] Decode batch. #running-req: 1, #token: 506, token usage: 0.02, cuda graph: False, gen throughput (token/s): 115.16, #queue-req: 0 [2025-06-16 16:47:47] Decode batch. #running-req: 1, #token: 546, token usage: 0.03, cuda graph: False, gen throughput (token/s): 110.91, #queue-req: 0 [2025-06-16 16:47:47] INFO: 127.0.0.1:33986 - "POST /generate HTTP/1.1" 200 OK
1. **Balanced Diet**: Consuming a variety of nutrient-rich foods from all food groups in appropriate portions. This includes plenty of fruits and vegetables, whole grains, lean proteins, and healthy fats. Avoiding excessive intake of processed foods, sugary drinks, and unhealthy fats further supports your health.
2. **Regular Exercise**: Engaging in regular physical activity that includes at least 150 minutes of moderate aerobic activity or 75 minutes of vigorous activity each week. Additionally, incorporating strength training exercises at least twice a week is beneficial. Regular exercise helps keep your body in shape, improves cardiovascular health, boosts mood, and supports overall well-being.
By combining these two practices, you can significantly enhance your physical and mental health, leading to a healthier and more active lifestyle.
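Because fork takes an integer count, the same pattern scales to a variable number of branches. A sketch under that assumption, with made-up topics, where each fork generates independently and the results are joined back into the main stream:

# Sketch: fork once per topic and merge the results back into the main prompt.
@function
def parallel_summaries(s, topics):
    s += system("You are a helpful assistant.")
    s += user("Write a one-sentence summary for each of these topics: " + ", ".join(topics))
    forks = s.fork(len(topics))
    for f, topic in zip(forks, topics):
        f += assistant(f"Summary of {topic}: " + gen("summary", max_tokens=64, stop="\n"))
    s += assistant("\n".join(f"{t}: {f['summary']}" for t, f in zip(topics, forks)))


state = parallel_summaries(["balanced diet", "regular exercise", "good sleep"])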
Constrained Decoding#
Use regex to specify a regular expression as a decoding constraint. This is only supported for local models.
@function
def regular_expression_gen(s):
    s += user("What is the IP address of the Google DNS servers?")
    s += assistant(
        gen(
            "answer",
            temperature=0,
            regex=r"((25[0-5]|2[0-4]\d|[01]?\d\d?).){3}(25[0-5]|2[0-4]\d|[01]?\d\d?)",
        )
    )
state = regular_expression_gen()
print_highlight(state["answer"])
[2025-06-16 16:47:47] Prefill batch. #new-seq: 1, #new-token: 18, #cached-token: 12, token usage: 0.00, #running-req: 0, #queue-req: 0 [2025-06-16 16:47:47] INFO: 127.0.0.1:34002 - "POST /generate HTTP/1.1" 200 OK
Use regex to define a JSON decoding schema.
character_regex = (
    r"""{\n"""
    + r"""    "name": "[\w\d\s]{1,16}",\n"""
    + r"""    "house": "(Gryffindor|Slytherin|Ravenclaw|Hufflepuff)",\n"""
    + r"""    "blood status": "(Pure-blood|Half-blood|Muggle-born)",\n"""
    + r"""    "occupation": "(student|teacher|auror|ministry of magic|death eater|order of the phoenix)",\n"""
    + r"""    "wand": {\n"""
    + r"""        "wood": "[\w\d\s]{1,16}",\n"""
    + r"""        "core": "[\w\d\s]{1,16}",\n"""
    + r"""        "length": [0-9]{1,2}.[0-9]{0,2}\n"""
    + r"""    },\n"""
    + r"""    "alive": "(Alive|Deceased)",\n"""
    + r"""    "patronus": "[\w\d\s]{1,16}",\n"""
    + r"""    "bogart": "[\w\d\s]{1,16}"\n"""
    + r"""}"""
)
@function
def character_gen(s, name):
    s += user(
        f"{name} is a character in Harry Potter. Please fill in the following information about this character."
    )
    s += assistant(gen("json_output", max_tokens=256, regex=character_regex))
state = character_gen("Harry Potter")
print_highlight(state["json_output"])
[2025-06-16 16:47:48] Prefill batch. #new-seq: 1, #new-token: 24, #cached-token: 14, token usage: 0.00, #running-req: 0, #queue-req: 0 [2025-06-16 16:47:48] Decode batch. #running-req: 1, #token: 54, token usage: 0.00, cuda graph: False, gen throughput (token/s): 25.31, #queue-req: 0 [2025-06-16 16:47:48] Decode batch. #running-req: 1, #token: 94, token usage: 0.00, cuda graph: False, gen throughput (token/s): 110.88, #queue-req: 0 [2025-06-16 16:47:49] Decode batch. #running-req: 1, #token: 134, token usage: 0.01, cuda graph: False, gen throughput (token/s): 106.23, #queue-req: 0 [2025-06-16 16:47:49] INFO: 127.0.0.1:46102 - "POST /generate HTTP/1.1" 200 OK
{ "name": "Harry Potter", "house": "Gryffindor", "blood status": "Half-blood", "occupation": "student", "wand": { "wood": "Holly", "core": "Phoenix feather", "length": 11.0 }, "alive": "Deceased", "patronus": " stag", "bogart": "Dementors" }
Batching#
Use run_batch to run a batch of prompts.
@function
def text_qa(s, question):
    s += user(question)
    s += assistant(gen("answer", stop="\n"))
states = text_qa.run_batch(
    [
        {"question": "What is the capital of the United Kingdom?"},
        {"question": "What is the capital of France?"},
        {"question": "What is the capital of Japan?"},
    ],
    progress_bar=True,
)
for i, state in enumerate(states):
    print_highlight(f"Answer {i+1}: {states[i]['answer']}")
[2025-06-16 16:47:49] Prefill batch. #new-seq: 1, #new-token: 1, #cached-token: 13, token usage: 0.00, #running-req: 0, #queue-req: 0 [2025-06-16 16:47:49] INFO: 127.0.0.1:46114 - "POST /generate HTTP/1.1" 200 OK
100%|██████████| 3/3 [00:00<00:00, 21.93it/s]
[2025-06-16 16:47:49] Prefill batch. #new-seq: 1, #new-token: 11, #cached-token: 17, token usage: 0.00, #running-req: 0, #queue-req: 0 [2025-06-16 16:47:49] Prefill batch. #new-seq: 1, #new-token: 9, #cached-token: 17, token usage: 0.00, #running-req: 1, #queue-req: 0 [2025-06-16 16:47:49] Prefill batch. #new-seq: 1, #new-token: 7, #cached-token: 19, token usage: 0.00, #running-req: 2, #queue-req: 0 [2025-06-16 16:47:49] INFO: 127.0.0.1:46140 - "POST /generate HTTP/1.1" 200 OK [2025-06-16 16:47:49] INFO: 127.0.0.1:46150 - "POST /generate HTTP/1.1" 200 OK [2025-06-16 16:47:49] INFO: 127.0.0.1:46126 - "POST /generate HTTP/1.1" 200 OK
Answer 1: The capital of the United Kingdom is London.
Answer 2: The capital of France is Paris.
Answer 3: The capital of Japan is Tokyo.
Streaming#
Use stream to stream the output to the user.
@function
def text_qa(s, question):
    s += user(question)
    s += assistant(gen("answer", stop="\n"))
state = text_qa.run(
    question="What is the capital of France?", temperature=0.1, stream=True
)
for out in state.text_iter():
    print(out, end="", flush=True)
<|im_start|>system You are a helpful assistant.<|im_end|> <|im_start|>user What is the capital of France?<|im_end|> <|im_start|>assistant [2025-06-16 16:47:49] INFO: 127.0.0.1:46166 - "POST /generate HTTP/1.1" 200 OK [2025-06-16 16:47:49] Prefill batch. #new-seq: 1, #new-token: 1, #cached-token: 25, token usage: 0.00, #running-req: 0, #queue-req: 0 The[2025-06-16 16:47:49] Decode batch. #running-req: 1, #token: 28, token usage: 0.00, cuda graph: False, gen throughput (token/s): 118.89, #queue-req: 0 capital of France is Paris.<|im_end|>
Complex Prompts#
You may use {system|user|assistant}_{begin|end} to define complex prompts.
@function
def chat_example(s):
    s += system("You are a helpful assistant.")
    # Same as: s += s.system("You are a helpful assistant.")

    with s.user():
        s += "Question: What is the capital of France?"

    s += assistant_begin()
    s += "Answer: " + gen("answer", max_tokens=100, stop="\n")
    s += assistant_end()
state = chat_example()
print_highlight(state["answer"])
[2025-06-16 16:47:49] Prefill batch. #new-seq: 1, #new-token: 17, #cached-token: 14, token usage: 0.00, #running-req: 0, #queue-req: 0 [2025-06-16 16:47:49] INFO: 127.0.0.1:46178 - "POST /generate HTTP/1.1" 200 OK
The capital of France is Paris.
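The begin/end helpers exist for the other roles as well. A sketch assuming user_begin and user_end are exported from sglang just like assistant_begin and assistant_end:

# Assumption: user_begin/user_end mirror assistant_begin/assistant_end.
from sglang import user_begin, user_end


@function
def chat_example_2(s):
    s += system("You are a helpful assistant.")
    s += user_begin()
    s += "Question: What is the capital of Germany?"
    s += user_end()
    s += assistant("Answer: " + gen("answer", max_tokens=100, stop="\n"))


state = chat_example_2()
print_highlight(state["answer"])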
terminate_process(server_process)
[2025-06-16 16:47:49] Child process unexpectedly failed with exitcode=9. pid=259507
Multi-modal Generation#
You may use the SGLang frontend language to define multi-modal prompts. See here for supported models.
server_process, port = launch_server_cmd(
    "python -m sglang.launch_server --model-path Qwen/Qwen2.5-VL-7B-Instruct --host 0.0.0.0"
)

wait_for_server(f"http://localhost:{port}")
print(f"Server started on http://localhost:{port}")
[2025-06-16 16:47:56] server_args=ServerArgs(model_path='Qwen/Qwen2.5-VL-7B-Instruct', tokenizer_path='Qwen/Qwen2.5-VL-7B-Instruct', tokenizer_mode='auto', skip_tokenizer_init=False, load_format='auto', trust_remote_code=False, dtype='auto', kv_cache_dtype='auto', quantization=None, quantization_param_path=None, context_length=None, device='cuda', served_model_name='Qwen/Qwen2.5-VL-7B-Instruct', chat_template=None, completion_template=None, is_embedding=False, enable_multimodal=None, revision=None, impl='auto', host='0.0.0.0', port=31215, mem_fraction_static=0.874, max_running_requests=200, max_total_tokens=20480, chunked_prefill_size=8192, max_prefill_tokens=16384, schedule_policy='fcfs', schedule_conservativeness=1.0, cpu_offload_gb=0, page_size=1, tp_size=1, pp_size=1, max_micro_batch_size=None, stream_interval=1, stream_output=False, random_seed=941290998, constrained_json_whitespace_pattern=None, watchdog_timeout=300, dist_timeout=None, download_dir=None, base_gpu_id=0, gpu_id_step=1, sleep_on_idle=False, log_level='info', log_level_http=None, log_requests=False, log_requests_level=0, show_time_cost=False, enable_metrics=False, bucket_time_to_first_token=None, bucket_e2e_request_latency=None, bucket_inter_token_latency=None, collect_tokens_histogram=False, decode_log_interval=40, enable_request_time_stats_logging=False, kv_events_config=None, api_key=None, file_storage_path='sglang_storage', enable_cache_report=False, reasoning_parser=None, tool_call_parser=None, dp_size=1, load_balance_method='round_robin', dist_init_addr=None, nnodes=1, node_rank=0, json_model_override_args='{}', preferred_sampling_params=None, lora_paths=None, max_loras_per_batch=8, lora_backend='triton', attention_backend=None, sampling_backend='flashinfer', grammar_backend='xgrammar', mm_attention_backend=None, speculative_algorithm=None, speculative_draft_model_path=None, speculative_num_steps=None, speculative_eagle_topk=None, speculative_num_draft_tokens=None, speculative_accept_threshold_single=1.0, speculative_accept_threshold_acc=1.0, speculative_token_map=None, ep_size=1, enable_ep_moe=False, enable_deepep_moe=False, deepep_mode='auto', ep_num_redundant_experts=0, ep_dispatch_algorithm='static', init_expert_location='trivial', enable_eplb=False, eplb_algorithm='auto', eplb_rebalance_num_iterations=1000, eplb_rebalance_layers_per_chunk=None, expert_distribution_recorder_mode=None, expert_distribution_recorder_buffer_size=1000, enable_expert_distribution_metrics=False, deepep_config=None, moe_dense_tp_size=None, enable_double_sparsity=False, ds_channel_config_path=None, ds_heavy_channel_num=32, ds_heavy_token_num=256, ds_heavy_channel_type='qk', ds_sparse_decode_threshold=4096, disable_radix_cache=False, cuda_graph_max_bs=None, cuda_graph_bs=None, disable_cuda_graph=True, disable_cuda_graph_padding=False, enable_profile_cuda_graph=False, enable_nccl_nvls=False, enable_tokenizer_batch_encode=False, disable_outlines_disk_cache=False, disable_custom_all_reduce=False, enable_mscclpp=False, disable_overlap_schedule=False, disable_overlap_cg_plan=False, enable_mixed_chunk=False, enable_dp_attention=False, enable_dp_lm_head=False, enable_two_batch_overlap=False, enable_torch_compile=False, torch_compile_max_bs=32, torchao_config='', enable_nan_detection=False, enable_p2p_check=False, triton_attention_reduce_in_fp32=False, triton_attention_num_kv_splits=8, num_continuous_decode_steps=1, delete_ckpt_after_loading=False, enable_memory_saver=False, allow_auto_truncate=False, enable_custom_logit_processor=False, 
enable_hierarchical_cache=False, hicache_ratio=2.0, hicache_size=0, hicache_write_policy='write_through_selective', flashinfer_mla_disable_ragged=False, disable_shared_experts_fusion=False, disable_chunked_prefix_cache=False, disable_fast_image_processor=False, enable_return_hidden_states=False, warmups=None, debug_tensor_dump_output_folder=None, debug_tensor_dump_input_file=None, debug_tensor_dump_inject=False, debug_tensor_dump_prefill_only=False, disaggregation_mode='null', disaggregation_transfer_backend='mooncake', disaggregation_bootstrap_port=8998, disaggregation_decode_tp=None, disaggregation_decode_dp=None, disaggregation_prefill_pp=1, disaggregation_ib_device=None, num_reserved_decode_tokens=512, pdlb_url=None)
You have video processor config saved in preprocessor.json file which is deprecated. Video processor configs should be saved in their own video_preprocessor.json file. You can rename the file or load and save the processor back which renames it automatically. Loading from preprocessor.json will be removed in v5.0.
[2025-06-16 16:47:57] You have video processor config saved in preprocessor.json file which is deprecated. Video processor configs should be saved in their own video_preprocessor.json file. You can rename the file or load and save the processor back which renames it automatically. Loading from preprocessor.json will be removed in v5.0.
[2025-06-16 16:47:58] Infer the chat template name from the model path and obtain the result: qwen2-vl.
You have video processor config saved in preprocessor.json file which is deprecated. Video processor configs should be saved in their own video_preprocessor.json file. You can rename the file or load and save the processor back which renames it automatically. Loading from preprocessor.json will be removed in v5.0.
[2025-06-16 16:48:03] You have video processor config saved in preprocessor.json file which is deprecated. Video processor configs should be saved in their own video_preprocessor.json file. You can rename the file or load and save the processor back which renames it automatically. Loading from preprocessor.json will be removed in v5.0.
[2025-06-16 16:48:04] Attention backend not set. Use flashinfer backend by default.
[2025-06-16 16:48:04] Automatically reduce --mem-fraction-static to 0.787 because this is a multimodal model.
[2025-06-16 16:48:04] Init torch distributed begin.
[2025-06-16 16:48:05] Init torch distributed ends. mem usage=0.00 GB
[2025-06-16 16:48:05] Load weight begin. avail mem=58.28 GB
[2025-06-16 16:48:05] Multimodal attention backend not set. Use sdpa.
[2025-06-16 16:48:05] Using sdpa as multimodal attention backend.
[2025-06-16 16:48:06] Using model weights format ['*.safetensors']
Loading safetensors checkpoint shards: 0% Completed | 0/5 [00:00<?, ?it/s]
Loading safetensors checkpoint shards: 20% Completed | 1/5 [00:00<00:02, 1.45it/s]
Loading safetensors checkpoint shards: 40% Completed | 2/5 [00:01<00:02, 1.35it/s]
Loading safetensors checkpoint shards: 60% Completed | 3/5 [00:02<00:01, 1.33it/s]
Loading safetensors checkpoint shards: 80% Completed | 4/5 [00:03<00:00, 1.31it/s]
Loading safetensors checkpoint shards: 100% Completed | 5/5 [00:03<00:00, 1.68it/s]
Loading safetensors checkpoint shards: 100% Completed | 5/5 [00:03<00:00, 1.51it/s]
[2025-06-16 16:48:09] Load weight end. type=Qwen2_5_VLForConditionalGeneration, dtype=torch.bfloat16, avail mem=62.79 GB, mem usage=-4.51 GB. [2025-06-16 16:48:09] KV Cache is allocated. #tokens: 20480, K size: 0.55 GB, V size: 0.55 GB [2025-06-16 16:48:09] Memory pool end. avail mem=61.43 GB [2025-06-16 16:48:11] max_total_num_tokens=20480, chunked_prefill_size=8192, max_prefill_tokens=16384, max_running_requests=200, context_len=128000, available_gpu_mem=60.85 GB [2025-06-16 16:48:11] INFO: Started server process [261355] [2025-06-16 16:48:11] INFO: Waiting for application startup. [2025-06-16 16:48:11] INFO: Application startup complete. [2025-06-16 16:48:11] INFO: Uvicorn running on http://0.0.0.0:31215 (Press CTRL+C to quit) [2025-06-16 16:48:12] INFO: 127.0.0.1:44246 - "GET /v1/models HTTP/1.1" 200 OK [2025-06-16 16:48:12] INFO: 127.0.0.1:44262 - "GET /get_model_info HTTP/1.1" 200 OK [2025-06-16 16:48:12] Prefill batch. #new-seq: 1, #new-token: 6, #cached-token: 0, token usage: 0.00, #running-req: 0, #queue-req: 0 [2025-06-16 16:48:13] INFO: 127.0.0.1:44272 - "POST /generate HTTP/1.1" 200 OK [2025-06-16 16:48:13] The server is fired up and ready to roll!
NOTE: Typically, the server runs in a separate terminal. In this notebook, we run the server and notebook code together, so their outputs are combined. To improve clarity, the server logs are displayed in the original black color, while the notebook outputs are highlighted in blue. We run these notebooks in a parallel CI environment, so the throughput is not representative of actual performance.
Server started on http://localhost:31215
set_default_backend(RuntimeEndpoint(f"http://localhost:{port}"))
[2025-06-16 16:48:17] INFO: 127.0.0.1:44288 - "GET /get_model_info HTTP/1.1" 200 OK
Ask a question about an image.
@function
def image_qa(s, image_file, question):
    s += user(image(image_file) + question)
    s += assistant(gen("answer", max_tokens=256))
image_url = "https://github.com/sgl-project/sglang/blob/main/test/lang/example_image.png?raw=true"
image_bytes, _ = load_image(image_url)
state = image_qa(image_bytes, "What is in the image?")
print_highlight(state["answer"])
[2025-06-16 16:48:18] Prefill batch. #new-seq: 1, #new-token: 307, #cached-token: 0, token usage: 0.00, #running-req: 0, #queue-req: 0 [2025-06-16 16:48:19] Decode batch. #running-req: 1, #token: 340, token usage: 0.02, cuda graph: False, gen throughput (token/s): 4.99, #queue-req: 0 [2025-06-16 16:48:19] Decode batch. #running-req: 1, #token: 380, token usage: 0.02, cuda graph: False, gen throughput (token/s): 64.29, #queue-req: 0 [2025-06-16 16:48:20] INFO: 127.0.0.1:46436 - "POST /generate HTTP/1.1" 200 OK
The image depicts a person on a street interacting with a yellow vehicle, likely a taxi. The person appears to be half-way into the taxi, hanging some bedding or clotheslines out onto the street. The context suggests some form of unconventional clothing drying technique, although this specific interaction might seem unusual and potentially illegal without the appropriate establishments typically SNMP for this purpose. The background includes a yellow car crafted like a regular taxi for the scene and other city elements.
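In other SGLang examples, image() also accepts a local file path instead of a decoded image. A sketch under that assumption; the file name below is hypothetical and must exist locally:

# Assumption: image() accepts a local path; "example_image.png" is a hypothetical local file.
state = image_qa("./example_image.png", "Describe the image in one sentence.")
print_highlight(state["answer"])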
terminate_process(server_process)
[2025-06-16 16:48:20] Child process unexpectedly failed with exitcode=9. pid=261567 [2025-06-16 16:48:20] Child process unexpectedly failed with exitcode=9. pid=261501