COMMA: A Communicative Multimodal Multi-Agent Benchmark

1University of Wisconsin-Madison, 2Nanjing University

COMMA is a multimodal benchmark designed to assess the collaborative abilities of multimodal agents. Our benchmark is inspired by the cooperative gameplay of Keep Talking and Nobody Explodes. In this game, two players work together to defuse a bomb under time pressure. One player, the defuser, can see the bomb but lacks the instructions to disarm it. The other, the expert, has access to the bomb's manual but cannot see the bomb itself. The two must rely on effective communication to exchange information, navigate challenges, and defuse the bomb.

Example puzzles presented to the solver in our benchmark. Notably, each puzzle is multimodal and cannot be solved without communicating with the expert.

Abstract

We introduce a novel benchmark designed to evaluate the collaborative performance of multimodal multi-agent systems through language communication. Our benchmark features a variety of puzzles, providing a comprehensive evaluation across four key categories of agentic capability in a communicative collaboration setting. By testing both agent-agent and agent-human collaborations using open-source and closed-source models, our findings reveal surprising weaknesses in state-of-the-art models, including proprietary models like GPT-4o. These models struggle to outperform even a simple random agent baseline in agent-agent collaboration and only surpass the random baseline when a human is involved.

Agent Interaction

Agent Setup

Overview of the interaction between the Solver and Expert agents in our benchmark. Both agents operate with structured input corresponding to working and episodic memory. The Solver receives an image of the puzzle state (working memory) and makes decisions based on the available actions described in the task prompt. The Expert, guided by instruction manuals (working memory), provides advice based on the Solver's descriptions, such as indicating which buttons to press. The Solver can choose to execute actions by interacting with the environment or communicate with the Expert for further guidance. Their interaction is documented through a dialogue, showcasing the cooperation required to complete the task. Both agents engage in self-reflection by referencing the conversation history, which is continuously updated and incorporated into their input as episodic memory.
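To make the interaction loop concrete, below is a minimal Python sketch of one Solver-Expert episode as described above. The environment interface (observe/step), the call_solver and call_expert wrappers around multimodal LLM calls, and the message formats are illustrative assumptions for exposition, not the benchmark's actual API.

# Minimal sketch of one Solver-Expert episode. The environment and the
# call_solver / call_expert LLM wrappers are hypothetical placeholders.

from dataclasses import dataclass, field


@dataclass
class Episode:
    """Episodic memory shared by both agents: the running dialogue."""
    dialogue: list[str] = field(default_factory=list)


def call_solver(image, actions, dialogue):
    """Placeholder for the Solver LLM: returns ('act', action) or ('ask', message)."""
    raise NotImplementedError


def call_expert(manual, dialogue):
    """Placeholder for the Expert LLM: returns an advice string."""
    raise NotImplementedError


def run_episode(env, manual, max_turns=10):
    episode = Episode()
    for _ in range(max_turns):
        # Working memory for the Solver: current puzzle image and legal actions.
        image, actions = env.observe()

        kind, content = call_solver(image, actions, episode.dialogue)
        if kind == "act":
            # The Solver commits to an action and the environment advances.
            episode.dialogue.append(f"Solver action: {content}")
            done, success = env.step(content)
            if done:
                return success
        else:
            # The Solver asks the Expert, who answers from the manual
            # (its working memory) plus the shared dialogue history.
            episode.dialogue.append(f"Solver: {content}")
            advice = call_expert(manual, episode.dialogue)
            episode.dialogue.append(f"Expert: {advice}")
    return False  # ran out of turns without solving the puzzle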

Results

Graph of Conversation Length and Success Rate.

We find that even the most powerful closed-source LLMs struggle to communicate effectively without human involvement. Here we plot the success rate as a function of conversation length for different settings. Open-source models such as LLaVA and InternVL actually underperform the random baseline at most conversation lengths, suggesting a limited ability to utilize episodic memory, and their success rates plateau after about 4-5 conversation turns.

Pie chart of common failure modes.

We further analyze the underlying reasons for failure based on the conversations from all 1,000 puzzles. We examine the conversations of the best-performing closed-source and open-source models, GPT-4o (left) and LLaMA 3.2 (right), and categorize their common failure modes as shown in the pie charts above.

For more detailed examples and results, feel free to check out our paper!

Leaderboard 🏆

Table of performance.

Using COMMA, we benchmark the collaborative capabilities of closed-source and open-source multimodal LLMs, summarized in the table above.

BibTeX

@article{ossowski2024comma,
  title={COMMA: A Communicative Multimodal Multi-Agent Benchmark},
  author={Ossowski, Timothy and Chen, Jixuan and Maqbool, Danyal and Cai, Zefan and Bradshaw, Tyler and Hu, Junjie},
  journal={arXiv preprint arXiv:2410.07553},
  year={2024}
}