OpenAI APIs - Vision#
SGLang provides OpenAI-compatible APIs to enable a smooth transition from OpenAI services to self-hosted local models. A complete reference for the API is available in the OpenAI API Reference. This tutorial covers the vision APIs for vision language models.
SGLang supports various vision language models such as Llama 3.2, LLaVA-OneVision, Qwen2.5-VL, Gemma3 and more.
As an alternative to the OpenAI API, you can also use the SGLang offline engine.
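For reference, the sketch below shows minimal offline-engine usage. It is a hedged example rather than part of this tutorial: it assumes sgl.Engine accepts an image_data argument and that the prompt already contains the model's image placeholder tokens, so consult the offline engine documentation for the exact prompt format of your model.
import sglang as sgl

if __name__ == "__main__":
    # Hedged sketch: exact prompt formatting depends on the model's chat template.
    llm = sgl.Engine(model_path="Qwen/Qwen2.5-VL-7B-Instruct")
    prompt = (
        "<|im_start|>user\n<|vision_start|><|image_pad|><|vision_end|>"
        "What is in this image?<|im_end|>\n<|im_start|>assistant\n"
    )
    out = llm.generate(
        prompt=prompt,
        sampling_params={"max_new_tokens": 128},
        image_data="https://github.com/sgl-project/sglang/blob/main/test/lang/example_image.png?raw=true",
    )
    print(out["text"])
    llm.shutdown()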
Launch A Server#
Launch the server in your terminal and wait for it to initialize.
from sglang.test.test_utils import is_in_ci
if is_in_ci():
    from patch import launch_server_cmd
else:
    from sglang.utils import launch_server_cmd
from sglang.utils import wait_for_server, print_highlight, terminate_process
vision_process, port = launch_server_cmd(
    """
python3 -m sglang.launch_server --model-path Qwen/Qwen2.5-VL-7B-Instruct
"""
)
wait_for_server(f"http://localhost:{port}")
[2025-06-14 19:51:05] server_args=ServerArgs(model_path='Qwen/Qwen2.5-VL-7B-Instruct', tokenizer_path='Qwen/Qwen2.5-VL-7B-Instruct', tokenizer_mode='auto', skip_tokenizer_init=False, load_format='auto', trust_remote_code=False, dtype='auto', kv_cache_dtype='auto', quantization=None, quantization_param_path=None, context_length=None, device='cuda', served_model_name='Qwen/Qwen2.5-VL-7B-Instruct', chat_template=None, completion_template=None, is_embedding=False, enable_multimodal=None, revision=None, impl='auto', host='127.0.0.1', port=35598, mem_fraction_static=0.874, max_running_requests=200, max_total_tokens=20480, chunked_prefill_size=8192, max_prefill_tokens=16384, schedule_policy='fcfs', schedule_conservativeness=1.0, cpu_offload_gb=0, page_size=1, tp_size=1, pp_size=1, max_micro_batch_size=None, stream_interval=1, stream_output=False, random_seed=612092913, constrained_json_whitespace_pattern=None, watchdog_timeout=300, dist_timeout=None, download_dir=None, base_gpu_id=0, gpu_id_step=1, sleep_on_idle=False, log_level='info', log_level_http=None, log_requests=False, log_requests_level=0, show_time_cost=False, enable_metrics=False, bucket_time_to_first_token=None, bucket_e2e_request_latency=None, bucket_inter_token_latency=None, collect_tokens_histogram=False, decode_log_interval=40, enable_request_time_stats_logging=False, kv_events_config=None, api_key=None, file_storage_path='sglang_storage', enable_cache_report=False, reasoning_parser=None, tool_call_parser=None, dp_size=1, load_balance_method='round_robin', dist_init_addr=None, nnodes=1, node_rank=0, json_model_override_args='{}', preferred_sampling_params=None, lora_paths=None, max_loras_per_batch=8, lora_backend='triton', attention_backend=None, sampling_backend='flashinfer', grammar_backend='xgrammar', mm_attention_backend=None, speculative_algorithm=None, speculative_draft_model_path=None, speculative_num_steps=None, speculative_eagle_topk=None, speculative_num_draft_tokens=None, speculative_accept_threshold_single=1.0, speculative_accept_threshold_acc=1.0, speculative_token_map=None, ep_size=1, enable_ep_moe=False, enable_deepep_moe=False, deepep_mode='auto', ep_num_redundant_experts=0, ep_dispatch_algorithm='static', init_expert_location='trivial', enable_eplb=False, eplb_algorithm='auto', eplb_rebalance_num_iterations=1000, eplb_rebalance_layers_per_chunk=None, expert_distribution_recorder_mode=None, expert_distribution_recorder_buffer_size=1000, enable_expert_distribution_metrics=False, deepep_config=None, moe_dense_tp_size=None, enable_double_sparsity=False, ds_channel_config_path=None, ds_heavy_channel_num=32, ds_heavy_token_num=256, ds_heavy_channel_type='qk', ds_sparse_decode_threshold=4096, disable_radix_cache=False, cuda_graph_max_bs=None, cuda_graph_bs=None, disable_cuda_graph=True, disable_cuda_graph_padding=False, enable_profile_cuda_graph=False, enable_nccl_nvls=False, enable_tokenizer_batch_encode=False, disable_outlines_disk_cache=False, disable_custom_all_reduce=False, enable_mscclpp=False, disable_overlap_schedule=False, disable_overlap_cg_plan=False, enable_mixed_chunk=False, enable_dp_attention=False, enable_dp_lm_head=False, enable_two_batch_overlap=False, enable_torch_compile=False, torch_compile_max_bs=32, torchao_config='', enable_nan_detection=False, enable_p2p_check=False, triton_attention_reduce_in_fp32=False, triton_attention_num_kv_splits=8, num_continuous_decode_steps=1, delete_ckpt_after_loading=False, enable_memory_saver=False, allow_auto_truncate=False, enable_custom_logit_processor=False, 
enable_hierarchical_cache=False, hicache_ratio=2.0, hicache_size=0, hicache_write_policy='write_through_selective', flashinfer_mla_disable_ragged=False, disable_shared_experts_fusion=False, disable_chunked_prefix_cache=False, disable_fast_image_processor=False, enable_return_hidden_states=False, warmups=None, debug_tensor_dump_output_folder=None, debug_tensor_dump_input_file=None, debug_tensor_dump_inject=False, debug_tensor_dump_prefill_only=False, disaggregation_mode='null', disaggregation_transfer_backend='mooncake', disaggregation_bootstrap_port=8998, disaggregation_ib_device=None, num_reserved_decode_tokens=512, pdlb_url=None)
You have video processor config saved in preprocessor.json
file which is deprecated. Video processor configs should be saved in their own video_preprocessor.json
file. You can rename the file or load and save the processor back which renames it automatically. Loading from preprocessor.json
will be removed in v5.0.
[2025-06-14 19:51:10] You have video processor config saved in preprocessor.json
file which is deprecated. Video processor configs should be saved in their own video_preprocessor.json
file. You can rename the file or load and save the processor back which renames it automatically. Loading from preprocessor.json
will be removed in v5.0.
[2025-06-14 19:51:10] Infer the chat template name from the model path and obtain the result: qwen2-vl.
You have video processor config saved in preprocessor.json
file which is deprecated. Video processor configs should be saved in their own video_preprocessor.json
file. You can rename the file or load and save the processor back which renames it automatically. Loading from preprocessor.json
will be removed in v5.0.
[2025-06-14 19:51:16] You have video processor config saved in preprocessor.json
file which is deprecated. Video processor configs should be saved in their own video_preprocessor.json
file. You can rename the file or load and save the processor back which renames it automatically. Loading from preprocessor.json
will be removed in v5.0.
[2025-06-14 19:51:17] Attention backend not set. Use flashinfer backend by default.
[2025-06-14 19:51:17] Automatically reduce --mem-fraction-static to 0.787 because this is a multimodal model.
[2025-06-14 19:51:17] Init torch distributed begin.
[2025-06-14 19:51:17] Init torch distributed ends. mem usage=0.09 GB
[2025-06-14 19:51:17] Load weight begin. avail mem=46.58 GB
[2025-06-14 19:51:17] Multimodal attention backend not set. Use sdpa.
[2025-06-14 19:51:17] Using sdpa as multimodal attention backend.
[2025-06-14 19:51:18] Using model weights format ['*.safetensors']
Loading safetensors checkpoint shards: 0% Completed | 0/5 [00:00<?, ?it/s]
Loading safetensors checkpoint shards: 20% Completed | 1/5 [00:00<00:02, 1.62it/s]
Loading safetensors checkpoint shards: 40% Completed | 2/5 [00:01<00:01, 1.51it/s]
Loading safetensors checkpoint shards: 60% Completed | 3/5 [00:02<00:01, 1.45it/s]
Loading safetensors checkpoint shards: 80% Completed | 4/5 [00:02<00:00, 1.41it/s]
Loading safetensors checkpoint shards: 100% Completed | 5/5 [00:03<00:00, 1.81it/s]
Loading safetensors checkpoint shards: 100% Completed | 5/5 [00:03<00:00, 1.64it/s]
[2025-06-14 19:51:21] Load weight end. type=Qwen2_5_VLForConditionalGeneration, dtype=torch.bfloat16, avail mem=30.88 GB, mem usage=15.70 GB.
[2025-06-14 19:51:21] KV Cache is allocated. #tokens: 20480, K size: 0.55 GB, V size: 0.55 GB
[2025-06-14 19:51:21] Memory pool end. avail mem=29.51 GB
[2025-06-14 19:51:23] max_total_num_tokens=20480, chunked_prefill_size=8192, max_prefill_tokens=16384, max_running_requests=200, context_len=128000, available_gpu_mem=28.94 GB
[2025-06-14 19:51:23] INFO: Started server process [1580987]
[2025-06-14 19:51:23] INFO: Waiting for application startup.
[2025-06-14 19:51:23] INFO: Application startup complete.
[2025-06-14 19:51:23] INFO: Uvicorn running on http://127.0.0.1:35598 (Press CTRL+C to quit)
[2025-06-14 19:51:24] INFO: 127.0.0.1:42178 - "GET /v1/models HTTP/1.1" 200 OK
[2025-06-14 19:51:24] INFO: 127.0.0.1:42188 - "GET /get_model_info HTTP/1.1" 200 OK
[2025-06-14 19:51:24] Prefill batch. #new-seq: 1, #new-token: 6, #cached-token: 0, token usage: 0.00, #running-req: 0, #queue-req: 0
[2025-06-14 19:51:25] INFO: 127.0.0.1:42200 - "POST /generate HTTP/1.1" 200 OK
[2025-06-14 19:51:25] The server is fired up and ready to roll!
NOTE: Typically, the server runs in a separate terminal. In this notebook, we run the server and notebook code together, so their outputs are combined. To improve clarity, the server logs are displayed in the original black color, while the notebook outputs are highlighted in blue. These notebooks run in a parallel CI environment, so the reported throughput is not representative of actual performance.
Using cURL#
Once the server is up, you can send test requests using cURL or Python's requests library.
import subprocess
curl_command = f""" curl -s http://localhost:{port}/v1/chat/completions \ -d '{{ "model": "Qwen/Qwen2.5-VL-7B-Instruct", "messages": [ {{ "role": "user", "content": [ {{ "type": "text", "text": "What’s in this image?" }}, {{ "type": "image_url", "image_url": {{ "url": "https://github.com/sgl-project/sglang/blob/main/test/lang/example_image.png?raw=true" }} }} ] }} ], "max_tokens": 300 }}' """
response = subprocess.check_output(curl_command, shell=True).decode()
print_highlight(response)

# Send the same request a second time; the shared prompt prefix is served
# from the radix cache (see the #cached-token count in the server log).
response = subprocess.check_output(curl_command, shell=True).decode()
print_highlight(response)
[2025-06-14 19:51:30] Prefill batch. #new-seq: 1, #new-token: 307, #cached-token: 0, token usage: 0.00, #running-req: 0, #queue-req: 0
[2025-06-14 19:51:31] Decode batch. #running-req: 1, #token: 340, token usage: 0.02, cuda graph: False, gen throughput (token/s): 4.75, #queue-req: 0
[2025-06-14 19:51:32] Decode batch. #running-req: 1, #token: 380, token usage: 0.02, cuda graph: False, gen throughput (token/s): 59.23, #queue-req: 0
[2025-06-14 19:51:32] Decode batch. #running-req: 1, #token: 420, token usage: 0.02, cuda graph: False, gen throughput (token/s): 60.57, #queue-req: 0
[2025-06-14 19:51:32] INFO: 127.0.0.1:40634 - "POST /v1/chat/completions HTTP/1.1" 200 OK
{"id":"35b3f128101249a194cb0b4d57e1d561","object":"chat.completion","created":1749930689,"model":"Qwen/Qwen2.5-VL-7B-Instruct","choices":[{"index":0,"message":{"role":"assistant","content":"The image shows a man on a busy street engaged in a humorous or unconventional form of advertising or promotion. He appears to be balancing a shirt on makeshift skis on the back of a moving taxi. The shirt's \"ski rack\" suggests a playful take on mountain skiing equipment, while the context implies he’s using the taxi to \"ski\" or \"slide\" past pedestrians. The taxi driver may be assisting or controlling the movement through traffic. The scene seems staged as part of an advertisement or promotional stunt, potentially for a clothing store, ski brand, or a humorous setup.","reasoning_content":null,"tool_calls":null},"logprobs":null,"finish_reason":"stop","matched_stop":151645}],"usage":{"prompt_tokens":307,"total_tokens":424,"completion_tokens":117,"prompt_tokens_details":null}}
[2025-06-14 19:51:33] Prefill batch. #new-seq: 1, #new-token: 1, #cached-token: 306, token usage: 0.01, #running-req: 0, #queue-req: 0
[2025-06-14 19:51:33] Decode batch. #running-req: 1, #token: 343, token usage: 0.02, cuda graph: False, gen throughput (token/s): 37.85, #queue-req: 0
[2025-06-14 19:51:34] Decode batch. #running-req: 1, #token: 383, token usage: 0.02, cuda graph: False, gen throughput (token/s): 61.93, #queue-req: 0
[2025-06-14 19:51:35] Decode batch. #running-req: 1, #token: 1, token usage: 0.00, cuda graph: False, gen throughput (token/s): 62.23, #queue-req: 0
[2025-06-14 19:51:35] INFO: 127.0.0.1:40650 - "POST /v1/chat/completions HTTP/1.1" 200 OK
{"id":"2036482475f94ed2a74480e881678cfb","object":"chat.completion","created":1749930692,"model":"Qwen/Qwen2.5-VL-7B-Instruct","choices":[{"index":0,"message":{"role":"assistant","content":"The image shows an individual performing laundry service on the side of a road. The person is leaning forward using an ironing board, which features two legs extended outward, thus keeping it upright, and is pressing or sorting uniforms with their hands. They are dressed casually in a yellow shirt and are standing beside the open trunk or cargo area of a yellow taxi vehicle. The taxi vehicle also appears branded as \"Mobilien,\" suggesting perhaps a specialized business. Surrounding the scene, there's urban architecture including buildings and a taxi passing by, adding to the street-side urban setting.","reasoning_content":null,"tool_calls":null},"logprobs":null,"finish_reason":"stop","matched_stop":151645}],"usage":{"prompt_tokens":307,"total_tokens":423,"completion_tokens":116,"prompt_tokens_details":null}}
Using Python Requests#
import requests
url = f"http://localhost:{port}/v1/chat/completions"
data = { "model": "Qwen/Qwen2.5-VL-7B-Instruct", "messages": [ { "role": "user", "content": [ {"type": "text", "text": "What’s in this image?"}, { "type": "image_url", "image_url": { "url": "https://github.com/sgl-project/sglang/blob/main/test/lang/example_image.png?raw=true" }, }, ], } ], "max_tokens": 300, }
response = requests.post(url, json=data)
print_highlight(response.text)
[2025-06-14 19:51:35] Prefill batch. #new-seq: 1, #new-token: 1, #cached-token: 306, token usage: 0.01, #running-req: 0, #queue-req: 0
[2025-06-14 19:51:36] Decode batch. #running-req: 1, #token: 347, token usage: 0.02, cuda graph: False, gen throughput (token/s): 38.70, #queue-req: 0
[2025-06-14 19:51:36] Decode batch. #running-req: 1, #token: 387, token usage: 0.02, cuda graph: False, gen throughput (token/s): 62.22, #queue-req: 0
[2025-06-14 19:51:37] INFO: 127.0.0.1:40662 - "POST /v1/chat/completions HTTP/1.1" 200 OK
{"id":"900b1306d64d4e88af6faf4e1d57ff9c","object":"chat.completion","created":1749930695,"model":"Qwen/Qwen2.5-VL-7B-Instruct","choices":[{"index":0,"message":{"role":"assistant","content":"This image shows a man standing near the trunk of a yellow taxi, engaging in what looks like an unusual public service behavior—ironing clothes placed on some makeshift irons. The taxi has \"Taxi Plus\" written on its rear door panel, and there are signs of an urban street scene, with buildings, storefronts, and other taxis in the background. The man appears to be ironing a shirt and pants, with the backdrop seeming chaotic yet humorous.","reasoning_content":null,"tool_calls":null},"logprobs":null,"finish_reason":"stop","matched_stop":151645}],"usage":{"prompt_tokens":307,"total_tokens":401,"completion_tokens":94,"prompt_tokens_details":null}}
Using OpenAI Python Client#
from openai import OpenAI
client = OpenAI(base_url=f"http://localhost:{port}/v1", api_key="None")
response = client.chat.completions.create(
    model="Qwen/Qwen2.5-VL-7B-Instruct",
    messages=[
        {
            "role": "user",
            "content": [
                {
                    "type": "text",
                    "text": "What is in this image?",
                },
                {
                    "type": "image_url",
                    "image_url": {
                        "url": "https://github.com/sgl-project/sglang/blob/main/test/lang/example_image.png?raw=true"
                    },
                },
            ],
        }
    ],
    max_tokens=300,
)
print_highlight(response.choices[0].message.content)
[2025-06-14 19:51:37] Prefill batch. #new-seq: 1, #new-token: 292, #cached-token: 15, token usage: 0.00, #running-req: 0, #queue-req: 0
[2025-06-14 19:51:38] Decode batch. #running-req: 1, #token: 333, token usage: 0.02, cuda graph: False, gen throughput (token/s): 35.75, #queue-req: 0
[2025-06-14 19:51:38] Decode batch. #running-req: 1, #token: 373, token usage: 0.02, cuda graph: False, gen throughput (token/s): 60.61, #queue-req: 0
[2025-06-14 19:51:39] INFO: 127.0.0.1:40678 - "POST /v1/chat/completions HTTP/1.1" 200 OK
The image appears to show a man wearing a bright yellow shirt standing on the rear window of a yellow SUV parked on a city street, configuring what looks like an improvised backseat drying rack or drying system. The rack holds clothing which he is handling, suggesting he might be using it to clean or sort the clothes. The SUV resembles a New York City taxi, evident from its distinctive yellow body color. The surrounding streetscape includes additional taxis in transit, buildings, and street signs indicating a busy urban street scene.
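The same endpoint also supports streaming with the OpenAI client. The sketch below assumes the standard stream=True behavior of the chat completions API and prints tokens as they arrive.
# Stream the reply token by token using the same OpenAI client.
stream = client.chat.completions.create(
    model="Qwen/Qwen2.5-VL-7B-Instruct",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "Describe this image briefly."},
                {
                    "type": "image_url",
                    "image_url": {
                        "url": "https://github.com/sgl-project/sglang/blob/main/test/lang/example_image.png?raw=true"
                    },
                },
            ],
        }
    ],
    max_tokens=128,
    stream=True,
)

for chunk in stream:
    # Each chunk carries an incremental delta; skip empty keep-alive chunks.
    if chunk.choices and chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)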
Multiple-Image Inputs#
The server also accepts multiple images, as well as text interleaved with images, if the model supports it.
from openai import OpenAI
client = OpenAI(base_url=f"http://localhost:{port}/v1", api_key="None")
response = client.chat.completions.create(
    model="Qwen/Qwen2.5-VL-7B-Instruct",
    messages=[
        {
            "role": "user",
            "content": [
                {
                    "type": "image_url",
                    "image_url": {
                        "url": "https://github.com/sgl-project/sglang/blob/main/test/lang/example_image.png?raw=true",
                    },
                },
                {
                    "type": "image_url",
                    "image_url": {
                        "url": "https://raw.githubusercontent.com/sgl-project/sglang/main/assets/logo.png",
                    },
                },
                {
                    "type": "text",
                    "text": "I have two very different images. They are not related at all. "
                    "Please describe the first image in one sentence, and then describe the second image in another sentence.",
                },
            ],
        }
    ],
    temperature=0,
)
print_highlight(response.choices[0].message.content)
[2025-06-14 19:51:40] Prefill batch. #new-seq: 1, #new-token: 2532, #cached-token: 14, token usage: 0.00, #running-req: 0, #queue-req: 0
[2025-06-14 19:51:40] Decode batch. #running-req: 1, #token: 2549, token usage: 0.12, cuda graph: False, gen throughput (token/s): 18.38, #queue-req: 0
[2025-06-14 19:51:41] Decode batch. #running-req: 1, #token: 2589, token usage: 0.13, cuda graph: False, gen throughput (token/s): 61.38, #queue-req: 0
[2025-06-14 19:51:41] INFO: 127.0.0.1:57452 - "POST /v1/chat/completions HTTP/1.1" 200 OK
The first image shows a man ironing clothes on the back of a taxi in a busy urban street. The second image is a stylized logo featuring the letters "SGL" with a book and a computer icon incorporated into the design.
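Text can also be interleaved between the images rather than appended at the end. The following hedged variation of the request above asks for a caption before each image:
# Hedged sketch: interleave a text instruction before each image.
response = client.chat.completions.create(
    model="Qwen/Qwen2.5-VL-7B-Instruct",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "Describe the first image in one sentence."},
                {
                    "type": "image_url",
                    "image_url": {
                        "url": "https://github.com/sgl-project/sglang/blob/main/test/lang/example_image.png?raw=true"
                    },
                },
                {"type": "text", "text": "Now describe the second image in one sentence."},
                {
                    "type": "image_url",
                    "image_url": {
                        "url": "https://raw.githubusercontent.com/sgl-project/sglang/main/assets/logo.png"
                    },
                },
            ],
        }
    ],
    temperature=0,
)
print_highlight(response.choices[0].message.content)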
terminate_process(vision_process)
[2025-06-14 19:51:41] Child process unexpectedly failed with exitcode=9. pid=1581501