[Bugfix] fix v1 cpu worker fails on macOS by kebe7jun · Pull Request #19121 · vllm-project/vllm

Fixed an issue where CPU v1 mode could not be enabled on macOS.
INFO 06-04 11:39:59 [config.py:822] This model supports multiple tasks: {'generate', 'score', 'embed', 'reward', 'classify'}. Defaulting to 'generate'.
WARNING 06-04 11:39:59 [config.py:3199] Your device 'cpu' doesn't support torch.bfloat16. Falling back to torch.float16 for compatibility.
WARNING 06-04 11:39:59 [config.py:3250] Casting torch.bfloat16 to torch.float16.
INFO 06-04 11:39:59 [config.py:1933] Defaulting to use mp for distributed inference
INFO 06-04 11:39:59 [config.py:1967] Disabled the custom all-reduce kernel because it is not supported on current platform.
WARNING 06-04 11:39:59 [cpu.py:135] Environment variable VLLM_CPU_KVCACHE_SPACE (GiB) for CPU backend is not set, using 4 by default.
/Users/kebeliu/workspace/vllm/.venv/lib/python3.9/site-packages/urllib3/__init__.py:35: NotOpenSSLWarning: urllib3 v2 only supports OpenSSL 1.1.1+, currently the 'ssl' module is compiled with 'LibreSSL 2.8.3'. See: https://github.com/urllib3/urllib3/issues/3020
  warnings.warn(
INFO 06-04 11:40:05 [importing.py:17] Triton not installed or not compatible; certain GPU-related functions will not be available.
WARNING 06-04 11:40:05 [importing.py:29] Triton is not installed. Using dummy decorators. Install it via `pip install triton` to enable kernel compilation.
INFO 06-04 11:40:08 [__init__.py:244] Automatically detected platform cpu.
INFO 06-04 11:40:13 [core.py:455] Waiting for init message from front-end.
WARNING 06-04 11:40:13 [cpu.py:135] Environment variable VLLM_CPU_KVCACHE_SPACE (GiB) for CPU backend is not set, using 4 by default.
INFO 06-04 11:40:13 [core.py:70] Initializing a V1 LLM engine (v0.9.1.dev345+g8e5939caf) with config: model='/Users/kebeliu/.cache/huggingface/hub/models--Qwen--Qwen2.5-0.5B-Instruct/snapshots/7ae557604adf67be50417f59c2c2f167def9a775/', speculative_config=None, tokenizer='/Users/kebeliu/.cache/huggingface/hub/models--Qwen--Qwen2.5-0.5B-Instruct/snapshots/7ae557604adf67be50417f59c2c2f167def9a775/', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config={}, tokenizer_revision=None, trust_remote_code=True, dtype=torch.float16, max_seq_len=2048, download_dir=None, load_format=auto, tensor_parallel_size=2, pipeline_parallel_size=1, disable_custom_all_reduce=True, quantization=None, enforce_eager=True, kv_cache_dtype=auto,  device_config=cpu, decoding_config=DecodingConfig(backend='auto', disable_fallback=False, disable_any_whitespace=False, disable_additional_properties=False, reasoning_backend=''), observability_config=ObservabilityConfig(show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None), seed=0, served_model_name=model, num_scheduler_steps=1, multi_step_stream_outputs=True, enable_prefix_caching=False, chunked_prefill_enabled=False, use_async_output_proc=False, pooler_config=None, compilation_config={"level":2,"debug_dump_path":"","cache_dir":"","backend":"eager","custom_ops":["none","none"],"splitting_ops":["vllm.unified_attention","vllm.unified_attention_with_output"],"use_inductor":true,"compile_sizes":[],"inductor_compile_config":{"enable_auto_functionalized_v2":false,"dce":true,"size_asserts":false,"nan_asserts":false,"memory_planning":true,"epilogue_fusion":true},"inductor_passes":{},"use_cudagraph":true,"cudagraph_num_of_warmups":1,"cudagraph_capture_sizes":[],"cudagraph_copy_inputs":false,"full_cuda_graph":false,"max_capture_size":0,"local_cache_dir":null}
INFO 06-04 11:40:13 [shm_broadcast.py:251] vLLM message queue communication handle: Handle(local_reader_ranks=[0, 1], buffer_handle=(2, 10485760, 10, 'psm_b1adad50'), local_subscribe_addr='ipc:///var/folders/f4/fp0rrg2123nbs7c6rghvl77w0000gn/T/07bee9ee-d946-4855-9e0c-3c473795ec74', remote_subscribe_addr=None, remote_addr_ipv6=False)
/Users/kebeliu/workspace/vllm/.venv/lib/python3.9/site-packages/urllib3/__init__.py:35: NotOpenSSLWarning: urllib3 v2 only supports OpenSSL 1.1.1+, currently the 'ssl' module is compiled with 'LibreSSL 2.8.3'. See: https://github.com/urllib3/urllib3/issues/3020
  warnings.warn(
/Users/kebeliu/workspace/vllm/.venv/lib/python3.9/site-packages/urllib3/__init__.py:35: NotOpenSSLWarning: urllib3 v2 only supports OpenSSL 1.1.1+, currently the 'ssl' module is compiled with 'LibreSSL 2.8.3'. See: https://github.com/urllib3/urllib3/issues/3020
  warnings.warn(
INFO 06-04 11:40:17 [importing.py:17] Triton not installed or not compatible; certain GPU-related functions will not be available.
INFO 06-04 11:40:17 [importing.py:17] Triton not installed or not compatible; certain GPU-related functions will not be available.
WARNING 06-04 11:40:17 [importing.py:29] Triton is not installed. Using dummy decorators. Install it via `pip install triton` to enable kernel compilation.
WARNING 06-04 11:40:17 [importing.py:29] Triton is not installed. Using dummy decorators. Install it via `pip install triton` to enable kernel compilation.
INFO 06-04 11:40:20 [__init__.py:244] Automatically detected platform cpu.
INFO 06-04 11:40:20 [__init__.py:244] Automatically detected platform cpu.
WARNING 06-04 11:40:26 [utils.py:2722] Methods determine_num_available_blocks,device_config,get_cache_block_size_bytes,initialize_cache not implemented in <vllm.v1.worker.cpu_worker.CPUWorker object at 0x31c09d070>
WARNING 06-04 11:40:26 [utils.py:2722] Methods determine_num_available_blocks,device_config,get_cache_block_size_bytes,initialize_cache not implemented in <vllm.v1.worker.cpu_worker.CPUWorker object at 0x30d0de070>
(VllmWorker rank=1 pid=57317) (VllmWorker rank=0 pid=57316) INFO 06-04 11:40:26 [shm_broadcast.py:251] vLLM message queue communication handle: Handle(local_reader_ranks=[0], buffer_handle=(1, 10485760, 10, 'psm_96bb3b4c'), local_subscribe_addr='ipc:///var/folders/f4/fp0rrg2123nbs7c6rghvl77w0000gn/T/52915ade-4a80-4b9d-8ec9-eee67485db7f', remote_subscribe_addr=None, remote_addr_ipv6=False)
INFO 06-04 11:40:26 [shm_broadcast.py:251] vLLM message queue communication handle: Handle(local_reader_ranks=[0], buffer_handle=(1, 10485760, 10, 'psm_c1de700d'), local_subscribe_addr='ipc:///var/folders/f4/fp0rrg2123nbs7c6rghvl77w0000gn/T/e60cdb14-153c-4ae8-a9a1-298fc4af1f44', remote_subscribe_addr=None, remote_addr_ipv6=False)
(VllmWorker rank=0 pid=57316) INFO 06-04 11:40:26 [shm_broadcast.py:251] vLLM message queue communication handle: Handle(local_reader_ranks=[1], buffer_handle=(1, 4194304, 6, 'psm_05615c62'), local_subscribe_addr='ipc:///var/folders/f4/fp0rrg2123nbs7c6rghvl77w0000gn/T/0c82ec11-8443-4ca4-b15c-fbfbb8d79d5c', remote_subscribe_addr=None, remote_addr_ipv6=False)
(VllmWorker rank=0 pid=57316) (VllmWorker rank=1 pid=57317) INFO 06-04 11:40:26 [parallel_state.py:1065] rank 0 in world size 2 is assigned as DP rank 0, PP rank 0, TP rank 0, EP rank 0
INFO 06-04 11:40:26 [parallel_state.py:1065] rank 1 in world size 2 is assigned as DP rank 0, PP rank 0, TP rank 1, EP rank 1
(VllmWorker rank=1 pid=57317) WARNING 06-04 11:40:26 [cpu.py:242] Pin memory is not supported on CPU.
(VllmWorker rank=0 pid=57316) WARNING 06-04 11:40:26 [cpu.py:242] Pin memory is not supported on CPU.
(VllmWorker rank=1 pid=57317) INFO 06-04 11:40:26 [cpu_model_runner.py:52] Starting to load model /Users/kebeliu/.cache/huggingface/hub/models--Qwen--Qwen2.5-0.5B-Instruct/snapshots/7ae557604adf67be50417f59c2c2f167def9a775/...
(VllmWorker rank=0 pid=57316) INFO 06-04 11:40:26 [cpu_model_runner.py:52] Starting to load model /Users/kebeliu/.cache/huggingface/hub/models--Qwen--Qwen2.5-0.5B-Instruct/snapshots/7ae557604adf67be50417f59c2c2f167def9a775/...
(VllmWorker rank=1 pid=57317) (VllmWorker rank=0 pid=57316) INFO 06-04 11:40:26 [cpu.py:69] Using Torch SDPA backend.
INFO 06-04 11:40:26 [cpu.py:69] Using Torch SDPA backend.
Loading safetensors checkpoint shards:   0% Completed | 0/1 [00:00<?, ?it/s]
Loading safetensors checkpoint shards: 100% Completed | 1/1 [00:00<00:00,  1.01it/s]
Loading safetensors checkpoint shards: 100% Completed | 1/1 [00:00<00:00,  1.01it/s]
(VllmWorker rank=0 pid=57316) 
(VllmWorker rank=1 pid=57317) INFO 06-04 11:40:27 [default_loader.py:272] Loading weights took 0.99 seconds
(VllmWorker rank=0 pid=57316) INFO 06-04 11:40:27 [default_loader.py:272] Loading weights took 1.00 seconds
INFO 06-04 11:40:27 [kv_cache_utils.py:679] GPU KV cache size: 699,040 tokens
INFO 06-04 11:40:27 [kv_cache_utils.py:683] Maximum concurrency for 2,048 tokens per request: 341.33x
INFO 06-04 11:40:27 [kv_cache_utils.py:679] GPU KV cache size: 699,040 tokens
INFO 06-04 11:40:27 [kv_cache_utils.py:683] Maximum concurrency for 2,048 tokens per request: 341.33x
(VllmWorker rank=1 pid=57317) (VllmWorker rank=0 pid=57316) INFO 06-04 11:40:27 [cpu.py:69] Using Torch SDPA backend.
INFO 06-04 11:40:27 [cpu.py:69] Using Torch SDPA backend.
(VllmWorker rank=0 pid=57316) INFO 06-04 11:40:28 [cpu_model_runner.py:61] Warming up model for the compilation...
(VllmWorker rank=1 pid=57317) INFO 06-04 11:40:28 [cpu_model_runner.py:61] Warming up model for the compilation...
(VllmWorker rank=0 pid=57316) (VllmWorker rank=1 pid=57317) INFO 06-04 11:40:40 [cpu_model_runner.py:64] Warming up done.
INFO 06-04 11:40:40 [cpu_model_runner.py:64] Warming up done.
INFO 06-04 11:40:40 [core.py:171] init engine (profile, create kv cache, warmup model) took 13.27 seconds
INFO 06-04 11:40:41 [loggers.py:137] Engine 000: vllm cache_config_info with initialization after num_gpu_blocks is: 43690
WARNING 06-04 11:40:41 [config.py:1362] Default sampling parameters have been overridden by the model's Hugging Face generation config recommended from the model creator. If this is not intended, please relaunch vLLM instance with `--generation-config vllm`.
INFO 06-04 11:40:41 [serving_chat.py:118] Using default chat sampling params from model: {'repetition_penalty': 1.1, 'temperature': 0.7, 'top_k': 20, 'top_p': 0.8}
INFO 06-04 11:40:41 [serving_completion.py:66] Using default completion sampling params from model: {'repetition_penalty': 1.1, 'temperature': 0.7, 'top_k': 20, 'top_p': 0.8}
INFO 06-04 11:40:41 [api_server.py:1351] Starting vLLM API server 0 on http://0.0.0.0:8000
INFO 06-04 11:40:41 [launcher.py:29] Available routes are:
INFO 06-04 11:40:41 [launcher.py:37] Route: /openapi.json, Methods: HEAD, GET
INFO 06-04 11:40:41 [launcher.py:37] Route: /docs, Methods: HEAD, GET
INFO 06-04 11:40:41 [launcher.py:37] Route: /docs/oauth2-redirect, Methods: HEAD, GET
INFO 06-04 11:40:41 [launcher.py:37] Route: /redoc, Methods: HEAD, GET
INFO 06-04 11:40:41 [launcher.py:37] Route: /health, Methods: GET
INFO 06-04 11:40:41 [launcher.py:37] Route: /load, Methods: GET
INFO 06-04 11:40:41 [launcher.py:37] Route: /ping, Methods: POST
INFO 06-04 11:40:41 [launcher.py:37] Route: /ping, Methods: GET
INFO 06-04 11:40:41 [launcher.py:37] Route: /tokenize, Methods: POST
INFO 06-04 11:40:41 [launcher.py:37] Route: /detokenize, Methods: POST
INFO 06-04 11:40:41 [launcher.py:37] Route: /v1/models, Methods: GET
INFO 06-04 11:40:41 [launcher.py:37] Route: /version, Methods: GET
INFO 06-04 11:40:41 [launcher.py:37] Route: /v1/chat/completions, Methods: POST
INFO 06-04 11:40:41 [launcher.py:37] Route: /v1/completions, Methods: POST
INFO 06-04 11:40:41 [launcher.py:37] Route: /v1/embeddings, Methods: POST
INFO 06-04 11:40:41 [launcher.py:37] Route: /pooling, Methods: POST
INFO 06-04 11:40:41 [launcher.py:37] Route: /classify, Methods: POST
INFO 06-04 11:40:41 [launcher.py:37] Route: /score, Methods: POST
INFO 06-04 11:40:41 [launcher.py:37] Route: /v1/score, Methods: POST
INFO 06-04 11:40:41 [launcher.py:37] Route: /v1/audio/transcriptions, Methods: POST
INFO 06-04 11:40:41 [launcher.py:37] Route: /rerank, Methods: POST
INFO 06-04 11:40:41 [launcher.py:37] Route: /v1/rerank, Methods: POST
INFO 06-04 11:40:41 [launcher.py:37] Route: /v2/rerank, Methods: POST
INFO 06-04 11:40:41 [launcher.py:37] Route: /invocations, Methods: POST
INFO 06-04 11:40:41 [launcher.py:37] Route: /metrics, Methods: GET
INFO:     Started server process [56977]
INFO:     Waiting for application startup.
INFO:     Application startup complete.
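
The KV cache figures in the log above (`num_gpu_blocks is: 43690`, `GPU KV cache size: 699,040 tokens`, `Maximum concurrency for 2,048 tokens per request: 341.33x`) are consistent with one another. The sketch below reproduces them as a sanity check. It is not vLLM's own code: the helper names are hypothetical, and the block size of 16 tokens and the model shape (24 layers, 2 KV heads split across 2 tensor-parallel ranks, head dimension 64) are assumptions that do not appear in the log. The log itself supplies only the 4 GiB default for `VLLM_CPU_KVCACHE_SPACE`, `dtype=torch.float16`, `tensor_parallel_size=2`, and `max_seq_len=2048`.

```python
GIB = 1024 ** 3


def kv_cache_blocks(cache_bytes, num_layers, kv_heads_per_rank,
                    head_dim, dtype_bytes, block_size):
    """Number of whole KV cache blocks that fit in cache_bytes on one rank."""
    # Each token stores one K and one V vector per layer per KV head.
    bytes_per_token = 2 * num_layers * kv_heads_per_rank * head_dim * dtype_bytes
    bytes_per_block = bytes_per_token * block_size
    return cache_bytes // bytes_per_block


def max_concurrency(num_blocks, block_size, max_seq_len):
    """How many full-length requests the cache can hold at once."""
    return num_blocks * block_size / max_seq_len


# From the log: 4 GiB default cache space, float16 (2 bytes), TP=2, max_seq_len=2048.
# Assumed (not in the log): 24 layers, 2 KV heads in total, head_dim 64, block size 16.
blocks = kv_cache_blocks(
    cache_bytes=4 * GIB,
    num_layers=24,
    kv_heads_per_rank=2 // 2,  # assumed 2 KV heads split over 2 TP ranks
    head_dim=64,
    dtype_bytes=2,
    block_size=16,
)
tokens = blocks * 16
print(blocks, tokens, round(max_concurrency(blocks, 16, 2048), 2))
```

Under these assumptions the script arrives at the same block count, token count, and concurrency that the log reports, which suggests the CPU worker sized its cache from the 4 GiB default as expected.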