[Perf] Optimize the performance of structured output + reasoning by chaunceyjiang · Pull Request #33557 · vllm-project/vllm

Purpose

  1. Optimize the performance of structured output + reasoning

Move reasoner.is_reasoning_end(request.prompt_token_ids or []) from the core engine to the frontend.

  2. Fix #33215, "[Bug]: DeepSeek V3.2 tool_choice==required in thinking mode gives internal server error."
    The root cause of #33215 is the use of the parameter
    "chat_template_kwargs": {"thinking": true, "enable_thinking": true}.

When this parameter is set, the frontend uses the DeepSeek reasoning parser. However, StructuredOutputManager does not take the enable_thinking parameter into account when it initializes its own reasoning_parser, so the frontend and the core engine end up with inconsistent reasoning parser behavior.

Therefore, this PR gradually moves the reasoning_parser in StructuredOutputManager to the frontend.
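
A minimal sketch of the idea, not the actual vLLM code: the frontend, which already sees chat_template_kwargs, evaluates is_reasoning_end once on the prompt tokens and attaches the result to the request, so the core engine only reads a flag. ToyReasoningParser, ToyRequest, frontend_prepare, engine_should_apply_grammar, the reasoning_ended field, and THINK_END_TOKEN_ID are all hypothetical names invented for illustration.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical token id standing in for the end-of-reasoning marker.
THINK_END_TOKEN_ID = 2


class ToyReasoningParser:
    """Illustrative stand-in for a reasoning parser; not vLLM's class."""

    def __init__(self, enable_thinking: bool) -> None:
        self.enable_thinking = enable_thinking

    def is_reasoning_end(self, token_ids: list[int]) -> bool:
        # With thinking disabled there is no reasoning section to wait for.
        if not self.enable_thinking:
            return True
        return THINK_END_TOKEN_ID in token_ids


@dataclass
class ToyRequest:
    """Hypothetical request object passed from frontend to core engine."""

    prompt_token_ids: Optional[list[int]]
    reasoning_ended: Optional[bool] = None


def frontend_prepare(
    prompt_token_ids: Optional[list[int]], chat_template_kwargs: dict
) -> ToyRequest:
    # The frontend knows the per-request thinking flags, so it builds the
    # parser with them and computes the initial reasoning state here.
    enable_thinking = bool(
        chat_template_kwargs.get("enable_thinking")
        or chat_template_kwargs.get("thinking")
    )
    parser = ToyReasoningParser(enable_thinking)
    ended = parser.is_reasoning_end(prompt_token_ids or [])
    return ToyRequest(prompt_token_ids, reasoning_ended=ended)


def engine_should_apply_grammar(request: ToyRequest) -> bool:
    # The engine no longer scans the prompt or builds its own parser for
    # this check; it trusts the flag computed by the frontend.
    return bool(request.reasoning_ended)
```

In this sketch both sides agree by construction, because only one parser instance, configured from the request's own chat_template_kwargs, decides whether reasoning has ended. It also removes the prompt scan from the engine's per-request path, which is the performance motivation stated above.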

Test Plan

See tests/entrypoints/openai/test_completion_with_function_calling.py::test_function_tool_use

Test Result

