[Perf] Optimize the performance of structured output + reasoning by chaunceyjiang · Pull Request #33557 · vllm-project/vllm (original) (raw)

Purpose

Optimize the performance of structured output + reasoning

Move reasoner.is_reasoning_end(request.prompt_token_ids or []) from the core engine to the frontend.

fix [Bug]: DeepSeek V3.2 tool_choice==required in thinking mode gives internal server error. #33215
The root cause of [Bug]: DeepSeek V3.2 tool_choice==required in thinking mode gives internal server error. #33215 is the use of the parameter
"chat_template_kwargs": {"thinking": true, "enable_thinking": true}.

After this parameter is enabled, the frontend uses the DeepSeek reasoning parser. However, when initializing reasoning_parser, StructuredOutputManager does not take the enable_thinking parameter into account, which leads to inconsistent reasoning parser behavior between the frontend and the core engine.

Therefore, this PR gradually moves the reasoning_parser in StructuredOutputManager to the frontend.

Test Plan

see tests/entrypoints/openai/test_completion_with_function_calling.py::test_function_tool_use

Test Result

Essential Elements of an Effective PR Description Checklist

The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
The test plan, such as providing test command.
The test results, such as pasting the results comparison before and after, or e2e results
(Optional) The necessary documentation update, such as updating supported_models.md and examples for a new model.
(Optional) Release notes update. If your change is user facing, please update the release notes draft in the Google Doc.