BUG: VLLMModel breaks when using vllm > 0.10.1

Description

VLLMModel in smolagents breaks with vllm versions newer than 0.10.1 due to API changes in vllm that removed the guided_decoding_backend parameter.

Steps to Reproduce

  1. Install vllm > 0.10.1
  2. Install smolagents 1.22.0
  3. Initialize a VLLMModel
  4. Create a CodeAgent with the VLLMModel
  5. Run GradioUI with the CodeAgent
  6. Chat with the agent

Code to Reproduce

from smolagents import VLLMModel, CodeAgent, GradioUI


def main():
    model = VLLMModel(
        model_id="HuggingFaceTB/SmolLM3-3B",
        model_kwargs={
            "max_model_len": 4096,
            "max_num_batched_tokens": 4096,
        },
    )

    agent = CodeAgent(model=model, tools=[])
    gradio_ui = GradioUI(agent)
    gradio_ui.launch()


if __name__ == "__main__":
    main()

Expected Behavior

The agent should work normally with vllm versions newer than 0.10.1.

Actual Behavior

The following exception is raised:

gradio.exceptions.Error: "Error in interaction: Error in generating model output:\nLLM.generate() got an unexpected keyword argument 'guided_options_request'"

Root Cause

In vllm versions after 0.10.1, guided_decoding_backend was removed (PR #21347). According to the vllm structured outputs documentation, the migration path is to drop guided_decoding_backend and instead pass a StructuredOutputsParams object as the structured_outputs field of SamplingParams.
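
For reference, this is roughly what the new-style call looks like (a minimal sketch based on the parameters used in the fix below; the schema and prompt are illustrative placeholders, and exact import paths may differ between vllm releases):

from vllm import LLM, SamplingParams
from vllm.sampling_params import StructuredOutputsParams

llm = LLM(model="HuggingFaceTB/SmolLM3-3B")

# Old (removed): llm.generate(prompt, guided_options_request=...)
# New: constrain decoding through SamplingParams instead.
schema = {"type": "object", "properties": {"answer": {"type": "string"}}}
sampling_params = SamplingParams(
    max_tokens=128,
    structured_outputs=StructuredOutputsParams(json=schema),
)
out = llm.generate("Reply as JSON with an 'answer' field.", sampling_params=sampling_params)
print(out[0].outputs[0].text)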

Proposed Solution

The VLLMModel.generate() method needs to be updated to convert the old guided_options_request format to the new structured_outputs format. Here's a potential fix:

# NOTE: import paths below are assumed for smolagents 1.22.0; adjust them if your
# version exposes ChatMessage, MessageRole, or TokenUsage elsewhere.
from smolagents import VLLMModel
from smolagents.models import ChatMessage, MessageRole, TokenUsage


class PatchedVLLMModel(VLLMModel):
    def generate(
        self,
        messages,
        stop_sequences=None,
        response_format=None,
        tools_to_call_from=None,
        **kwargs,
    ) -> ChatMessage:
        # NOTE: This overrides smolagents' VLLMModel.generate to convert
        # the old 'guided_options_request' to the new 'structured_outputs' format.
        from vllm import SamplingParams  # type: ignore
        from vllm.sampling_params import StructuredOutputsParams  # type: ignore

        completion_kwargs = self._prepare_completion_kwargs(
            messages=messages,
            flatten_messages_as_text=(not self._is_vlm),
            stop_sequences=stop_sequences,
            tools_to_call_from=tools_to_call_from,
            **kwargs,
        )

        messages = completion_kwargs.pop("messages")
        prepared_stop_sequences = completion_kwargs.pop("stop", [])
        tools = completion_kwargs.pop("tools", None)
        completion_kwargs.pop("tool_choice", None)

        prompt = self.tokenizer.apply_chat_template(
            messages,
            tools=tools,
            add_generation_prompt=True,
            tokenize=False,
        )

        # Convert the old guided_options_request format to the new structured_outputs format
        structured_outputs_params = None
        if response_format:
            if "json_schema" in response_format:
                # Extract the JSON schema from the response_format
                json_schema = response_format["json_schema"]["schema"]
                structured_outputs_params = StructuredOutputsParams(json=json_schema)
            elif "choice" in response_format:
                # Handle choice-based structured outputs
                structured_outputs_params = StructuredOutputsParams(choice=response_format["choice"])
            elif "regex" in response_format:
                # Handle regex-based structured outputs
                structured_outputs_params = StructuredOutputsParams(regex=response_format["regex"])
            elif "grammar" in response_format:
                # Handle grammar-based structured outputs
                structured_outputs_params = StructuredOutputsParams(grammar=response_format["grammar"])
            elif "structural_tag" in response_format:
                # Handle structural-tag-based structured outputs
                structured_outputs_params = StructuredOutputsParams(structural_tag=response_format["structural_tag"])
            else:
                print(f"WARNING: Unsupported response_format type: {response_format}")
                structured_outputs_params = None

        sampling_params = SamplingParams(
            n=kwargs.get("n", 1),
            temperature=kwargs.get("temperature", 0.7),
            max_tokens=kwargs.get("max_tokens", 64),
            stop=prepared_stop_sequences,
            structured_outputs=structured_outputs_params,
        )

        out = self.model.generate(
            prompt,
            sampling_params=sampling_params,
        )

        output_text = out[0].outputs[0].text

        return ChatMessage(
            role=MessageRole.ASSISTANT,
            content=output_text,
            raw={"out": output_text, "completion_kwargs": completion_kwargs},
            token_usage=TokenUsage(
                input_tokens=len(out[0].prompt_token_ids),
                output_tokens=len(out[0].outputs[0].token_ids),
            ),
        )

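With this override in place, the reproduction script above only needs to instantiate the patched class instead of VLLMModel; a usage sketch (assuming PatchedVLLMModel is defined or imported in the same module):

model = PatchedVLLMModel(
    model_id="HuggingFaceTB/SmolLM3-3B",
    model_kwargs={
        "max_model_len": 4096,
        "max_num_batched_tokens": 4096,
    },
)
agent = CodeAgent(model=model, tools=[])
GradioUI(agent).launch()
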
Environment

Additional Context

This is a breaking change in vllm that affects backward compatibility. The fix should maintain compatibility with both older and newer versions of vllm if possible.
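
One way to support both is to branch on the installed vllm version when choosing how to build the sampling parameters; a rough sketch (the 0.10.1 boundary is taken from this report and should be adjusted to the release where the API actually changed; requires the packaging package):

from importlib.metadata import version

from packaging.version import Version

# Assumed boundary: releases newer than 0.10.1 use the structured_outputs API.
VLLM_USES_STRUCTURED_OUTPUTS = Version(version("vllm")) > Version("0.10.1")

if VLLM_USES_STRUCTURED_OUTPUTS:
    # New-style API (see the proposed fix above).
    from vllm.sampling_params import StructuredOutputsParams
else:
    # Older vllm: keep using the legacy guided-decoding path.
    StructuredOutputsParams = None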