Contents
- Common launch commands
- Model, processor and tokenizer
- Serving: HTTP & API
- Parallelism
- Memory and scheduling
- Other runtime options
- Logging
- Multi-node distributed serving
- LoRA
- Kernel backend
- Constrained Decoding
- Speculative decoding
- Debug options
- Optimization
Server Arguments#
This page provides a list of server arguments used in the command line to configure the behavior and performance of the language model server during deployment. These arguments enable users to customize key aspects of the server, including model selection, parallelism policies, memory management, and optimization techniques.
Common launch commands#
- To enable multi-GPU tensor parallelism, add `--tp 2`. If it reports the error "peer access is not supported between these two devices", add `--enable-p2p-check` to the server launch command.

  ```bash
  python -m sglang.launch_server --model-path meta-llama/Meta-Llama-3-8B-Instruct --tp 2
  ```

- To enable multi-GPU data parallelism, add `--dp 2`. Data parallelism is better for throughput if there is enough memory. It can also be used together with tensor parallelism. The following command uses 4 GPUs in total. We recommend SGLang Router for data parallelism.

  ```bash
  python -m sglang_router.launch_server --model-path meta-llama/Meta-Llama-3-8B-Instruct --dp 2 --tp 2
  ```
- If you see out-of-memory errors during serving, try to reduce the memory usage of the KV cache pool by setting a smaller value of `--mem-fraction-static`. The default value is `0.9`.

  ```bash
  python -m sglang.launch_server --model-path meta-llama/Meta-Llama-3-8B-Instruct --mem-fraction-static 0.7
  ```

- See the hyperparameter tuning guide for advice on tuning hyperparameters for better performance.
- For docker and Kubernetes runs, you need to set up shared memory, which is used for communication between processes. See `--shm-size` for docker and the `/dev/shm` size update for Kubernetes manifests.
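  For example, a docker invocation might look like the sketch below; the image tag, shared-memory size, port, and cache mount are illustrative assumptions and should be adapted to your setup.

  ```bash
  # Allocate shared memory for inter-process communication (--shm-size),
  # expose the server port, and reuse the host's Hugging Face cache.
  docker run --gpus all --shm-size 32g -p 30000:30000 \
      -v ~/.cache/huggingface:/root/.cache/huggingface \
      lmsysorg/sglang:latest \
      python3 -m sglang.launch_server --model-path meta-llama/Meta-Llama-3-8B-Instruct --host 0.0.0.0 --port 30000
  ```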
- If you see out-of-memory errors during prefill for long prompts, try to set a smaller chunked prefill size.

  ```bash
  python -m sglang.launch_server --model-path meta-llama/Meta-Llama-3-8B-Instruct --chunked-prefill-size 4096
  ```

- To enable `torch.compile` acceleration, add `--enable-torch-compile`. It accelerates small models on small batch sizes. By default, the cache path is located at `/tmp/torchinductor_root`; you can customize it using the environment variable `TORCHINDUCTOR_CACHE_DIR`. For more details, please refer to the official PyTorch documentation and Enabling cache for torch.compile.
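  For example, the following invocation sets a persistent cache directory so compiled artifacts can be reused across restarts; the path is an arbitrary example.

  ```bash
  # Reuse TorchInductor compilation artifacts across server restarts.
  TORCHINDUCTOR_CACHE_DIR=/data/torchinductor_cache \
      python -m sglang.launch_server --model-path meta-llama/Meta-Llama-3-8B-Instruct --enable-torch-compile
  ```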
- To enable torchao quantization, add `--torchao-config int4wo-128`. It supports other quantization strategies (INT8/FP8) as well.
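  For example (illustrative):

  ```bash
  # Apply torchao int4 weight-only quantization with group size 128.
  python -m sglang.launch_server --model-path meta-llama/Meta-Llama-3-8B-Instruct --torchao-config int4wo-128
  ```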
- To enable fp8 weight quantization, add `--quantization fp8` on a fp16 checkpoint, or directly load a fp8 checkpoint without specifying any arguments.

- To enable fp8 kv cache quantization, add `--kv-cache-dtype fp8_e5m2`.
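  As a sketch, the two fp8 options can typically be combined on a fp16 checkpoint (illustrative, not the only valid combination):

  ```bash
  # Quantize weights to fp8 and store the KV cache in fp8 (e5m2).
  python -m sglang.launch_server --model-path meta-llama/Meta-Llama-3-8B-Instruct --quantization fp8 --kv-cache-dtype fp8_e5m2
  ```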
- If the model does not have a chat template in the Hugging Face tokenizer, you can specify a custom chat template.
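  A minimal sketch, assuming a template file at `./my_chat_template.json` (the path and file are hypothetical; see the custom chat template documentation for the expected format):

  ```bash
  # Use a custom chat template instead of the tokenizer's built-in one.
  python -m sglang.launch_server --model-path meta-llama/Meta-Llama-3-8B-Instruct --chat-template ./my_chat_template.json
  ```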
- To run tensor parallelism on multiple nodes, add `--nnodes 2`. If you have two nodes with two GPUs on each node and want to run TP=4, let `sgl-dev-0` be the hostname of the first node and `50000` be an available port; you can then use the following commands. If you encounter a deadlock, please try to add `--disable-cuda-graph`.

  Node 0:

  ```bash
  python -m sglang.launch_server --model-path meta-llama/Meta-Llama-3-8B-Instruct --tp 4 --dist-init-addr sgl-dev-0:50000 --nnodes 2 --node-rank 0
  ```

  Node 1:

  ```bash
  python -m sglang.launch_server --model-path meta-llama/Meta-Llama-3-8B-Instruct --tp 4 --dist-init-addr sgl-dev-0:50000 --nnodes 2 --node-rank 1
  ```
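Once the server is up, you can send a quick request to check that it is serving. A minimal check, assuming the default port `30000` (adjust the host and port if you launched with `--host`/`--port`):

```bash
# Query the native /generate endpoint on the node that serves HTTP (node rank 0 here).
curl -s http://sgl-dev-0:30000/generate \
  -H "Content-Type: application/json" \
  -d '{"text": "The capital of France is", "sampling_params": {"max_new_tokens": 16, "temperature": 0}}'
```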
Please consult the documentation below and server_args.py to learn more about the arguments you may provide when launching a server.
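To list the available flags from the command line, you can also run the following (assuming SGLang is installed in the current environment):

```bash
# Print all supported server arguments with their help text.
python -m sglang.launch_server --help
```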
Model, processor and tokenizer#
Serving: HTTP & API#
HTTP Server configuration#
API configuration#
Parallelism#
Tensor parallelism#
Data parallelism#
Expert parallelism#
Memory and scheduling#
Other runtime options#
Logging#
Multi-node distributed serving#
LoRA#
Kernel backend#
Constrained Decoding#
Speculative decoding#
Debug options#
Note: For the best possible performance, we recommend staying with the defaults and using these options only for debugging.
Optimization#
Note: Some of these options are still experimental.