[Perf] Optimize the performance of structured output + reasoning by chaunceyjiang · Pull Request #33557 · vllm-project/vllm
Purpose
- Optimize the performance of structured output + reasoning
Move reasoner.is_reasoning_end(request.prompt_token_ids or []) from the core engine to the frontend.
- Fix [Bug]: DeepSeek V3.2 tool_choice==required in thinking mode gives internal server error. #33215
The root cause of #33215 is the use of the parameter "chat_template_kwargs": {"thinking": true, "enable_thinking": true}.
After this parameter is enabled, the frontend uses the DeepSeek reasoning parser. However, when initializing reasoning_parser, StructuredOutputManager does not take the enable_thinking parameter into account, which leads to inconsistent reasoning parser behavior between the frontend and the core engine.
Therefore, this PR starts moving the reasoning_parser logic out of StructuredOutputManager and into the frontend, as the first step of a gradual migration.
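The idea above can be illustrated with a minimal, self-contained sketch. It does not use vLLM's actual classes: `ToyReasoningParser`, `ToyRequest`, `frontend_prepare`, `engine_should_apply_grammar`, the `reasoning_ended` field, and the `THINK_END` token id are all hypothetical names chosen for illustration. The only pieces taken from this PR are the call `is_reasoning_end(request.prompt_token_ids or [])` and the `chat_template_kwargs` keys. The point is that the frontend builds the parser from the request's own kwargs, evaluates the prompt once, and hands the engine a plain flag, so the two sides cannot disagree.

```python
from dataclasses import dataclass, field
from typing import Optional

THINK_END = 2  # hypothetical token id marking the end of the reasoning section


class ToyReasoningParser:
    """Hypothetical stand-in for a reasoning parser (not vLLM's class)."""

    def __init__(self, enable_thinking: bool) -> None:
        self.enable_thinking = enable_thinking

    def is_reasoning_end(self, token_ids: list) -> bool:
        # Assumption: with thinking disabled there is no reasoning phase.
        if not self.enable_thinking:
            return True
        return THINK_END in token_ids


@dataclass
class ToyRequest:
    prompt_token_ids: Optional[list]
    chat_template_kwargs: dict = field(default_factory=dict)
    # Hypothetical field: filled in by the frontend, read by the engine.
    reasoning_ended: Optional[bool] = None


def frontend_prepare(request: ToyRequest) -> ToyRequest:
    """Frontend side: build the parser from the request's kwargs, check once."""
    enable = bool(request.chat_template_kwargs.get("enable_thinking", False))
    parser = ToyReasoningParser(enable_thinking=enable)
    request.reasoning_ended = parser.is_reasoning_end(
        request.prompt_token_ids or []
    )
    return request


def engine_should_apply_grammar(request: ToyRequest) -> bool:
    """Engine side: trust the precomputed flag instead of re-running a parser."""
    return bool(request.reasoning_ended)


if __name__ == "__main__":
    req = frontend_prepare(
        ToyRequest([5, 6], {"thinking": True, "enable_thinking": True})
    )
    print(req.reasoning_ended, engine_should_apply_grammar(req))
```

In this sketch the engine never constructs its own parser, which removes both the repeated prompt scan and the chance of a configuration mismatch; how the real PR transports the result to the core engine may differ.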
Test Plan
See tests/entrypoints/openai/test_completion_with_function_calling.py::test_function_tool_use
Test Result
Essential Elements of an Effective PR Description Checklist
- The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
- The test plan, such as providing test command.
- The test results, such as pasting the results comparison before and after, or e2e results
- (Optional) The necessary documentation update, such as updating supported_models.md and examples for a new model.
- (Optional) Release notes update. If your change is user facing, please update the release notes draft in the Google Doc.