Improve agent image handling by aaron-ang · Pull Request #1601 · huggingface/smolagents
Context
I use smolagents to create a multimodal agent running an on-device VLM via llama-server and found inference to be quite slow. Looking at the logs, I realized that the image inputs are being reprocessed in every VLM call (i.e., every agent step), which takes a lot of processing time (2 seconds per image). Ideally, the images should be processed only once, since their latent representations are already saved in the KV cache.
Changes proposed
We should not include images in ActionSteps in agent.run, as they are already included in the TaskStep and saved to agent.memory. In other words, images are initialized only once, in agent.memory. This allows the underlying VLM to process the images only once during the initial prefill and reduces latency for subsequent steps by exploiting the KV cache.
The same idea applies for provide_final_answer: the TaskStep is already stored in agent.memory which includes the original images passed into agent.run. There shouldn't be a need to include images in the final answer system prompt since they will be retrieved from agent.memory.
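To make the idea concrete, here is a minimal sketch of memory-to-messages flattening where images live only in the TaskStep. The class and function names (`TaskStep`, `ActionStep`, `write_memory_to_messages`) are illustrative stand-ins, not the library's exact API:

```python
from dataclasses import dataclass, field

# Illustrative stand-ins for smolagents memory steps (not the exact API).
@dataclass
class TaskStep:
    task: str
    images: list = field(default_factory=list)  # images stored once, here only

@dataclass
class ActionStep:
    output: str
    # No per-step images: the VLM sees them via the TaskStep already in
    # memory, so the image prefill in the KV cache can be reused.

def write_memory_to_messages(memory):
    """Flatten memory into chat messages; images appear exactly once."""
    messages = []
    for step in memory:
        if isinstance(step, TaskStep):
            content = [{"type": "text", "text": step.task}]
            content += [{"type": "image", "image": img} for img in step.images]
            messages.append({"role": "user", "content": content})
        else:
            messages.append({"role": "assistant", "content": step.output})
    return messages

memory = [TaskStep("Describe the scene", images=["img0"]),
          ActionStep("It is a beach.")]
msgs = write_memory_to_messages(memory)
image_parts = [p for m in msgs
               for p in (m["content"] if isinstance(m["content"], list) else [])
               if p["type"] == "image"]
assert len(image_parts) == 1  # the image is included exactly once
```

Since every later step (and the final-answer prompt) rebuilds messages from the same memory, the image tokens occupy a stable prefix that the server's KV cache can skip on subsequent calls.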
Next, I added support for sharing images between agents, passing image(s) to a model only if it supports them. I considered opening a separate PR but decided against it, since this builds on the previous change. The motivation comes from my use case: running a task from a controller agent that is not multimodal but can delegate tasks to managed agents that can handle images.
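The delegation behavior described above can be sketched as a capability check before forwarding images. The attribute and method names (`supports_vision`, `delegate`) are hypothetical, chosen only to illustrate the gating logic:

```python
# Hypothetical sketch: forward images to a managed agent only when its
# model reports vision support. Names are illustrative, not smolagents' API.
class Model:
    def __init__(self, supports_vision: bool):
        self.supports_vision = supports_vision

class Agent:
    def __init__(self, model, managed_agents=()):
        self.model = model
        self.managed_agents = list(managed_agents)

    def delegate(self, task: str, images=None):
        """Plan sub-tasks, attaching images only for vision-capable models."""
        plan = []
        for sub in self.managed_agents:
            sub_images = images if sub.model.supports_vision else None
            plan.append((sub, task, sub_images))
        return plan

# A text-only controller delegating to one vision-capable and one
# text-only managed agent.
controller = Agent(Model(supports_vision=False),
                   managed_agents=[Agent(Model(True)), Agent(Model(False))])
plan = controller.delegate("Inspect the chart", images=["chart.png"])
assert plan[0][2] == ["chart.png"]  # vision-capable agent receives the image
assert plan[1][2] is None           # text-only agent does not
```

This keeps the controller's own context image-free while still letting vision-capable sub-agents receive the original inputs.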