EscapeCraft: A 3D Room Escape Environment for Benchmarking Complex Multimodal Reasoning Ability

Ziyue Wang1 ♠ *, Yurui Dong3 *, Fuwen Luo1, Minyuan Ruan1, Zhili Cheng1,
Chi Chen1, Peng Li2 🖂, Yang Liu1,2 🖂
* Equal contribution, ♠ Project lead, 🖂 Corresponding author
1Dept. of Comp. Sci. & Tech., Institute for AI, Tsinghua University, Beijing, China
2Institute for AI Industry Research (AIR), Tsinghua University, Beijing, China
3School of Management, Fudan University, Shanghai, China

Abstract

The rapid advancement of Multimodal Large Language Models (MLLMs) has spurred interest in complex multimodal reasoning tasks in real-world and virtual environments, which require coordinating multiple abilities, including visual perception, visual reasoning, spatial awareness, and target deduction. However, existing evaluations primarily assess final task completion, often reducing assessment to isolated abilities such as visual grounding and visual question answering. Less attention is given to comprehensively and quantitatively analyzing the reasoning process in multimodal environments, which is crucial for understanding model behaviors and underlying reasoning mechanisms beyond mere task success. To address this, we introduce MM-Escape, an extensible benchmark for investigating multimodal reasoning, inspired by real-world escape games. MM-Escape emphasizes intermediate model behaviors alongside final task completion. To achieve this, we develop EscapeCraft, a customizable and open environment that enables models to engage in free-form exploration for assessing multimodal reasoning. Extensive experiments show that MLLMs, regardless of scale, can successfully complete the simplest room escape tasks, with some exhibiting human-like exploration strategies. Yet, performance drops dramatically as task difficulty increases. Moreover, we observe that performance bottlenecks vary across models, revealing distinct failure modes and limitations in their multimodal reasoning abilities, such as repetitive trajectories without adaptive exploration, getting stuck in corners due to poor visual spatial awareness, and ineffective use of acquired props, such as the key. We hope our work sheds light on new challenges in multimodal reasoning and uncovers potential improvements in MLLM capabilities. GitHub repo: https://github.com/THUNLP-MT/EscapeCraft. Home page: https://thunlp-mt.github.io/EscapeCraft.

Figure 1: Illustration of our proposed room escape environment EscapeCraft, which allows us to generate customized room scenes (left) and define the groundtruth reasoning paths of tasks (right). Based on EscapeCraft, we create the MM-Escape benchmark, which aims to evaluate both the task completion performance and the entire multimodal reasoning process of MLLMs.

1 Introduction

The rapid development of Large Language Models (LLMs) and Multimodal Large Language Models (MLLMs) has driven the advancement of diverse multimodal systems and applications for academic research [43, 2], industrial engineering [16], and everyday assistance [35, 14]. Multimodal reasoning is essential for these applications, which require integrating multiple abilities such as visual perception, spatial awareness, and visual grounding [36]. For example, it enhances autonomous driving by improving the holistic understanding of multi-view information and localization, which are essential for vehicle actions and planning [11, 13], and also helps general-purpose assistants better perform visual and multimodal tasks in the wild [18, 22].

Despite significant attention and effort towards improving the multimodal reasoning abilities of MLLMs [36, 23, 48], comprehensive evaluation remains underexplored for two key reasons. First, fundamental tasks such as visual grounding [49, 47, 9] and image captioning [26, 1] are conducted in constrained environments with straightforward objectives, reducing the need for autonomous exploration. They primarily focus on identifying correct answers, such as bounding boxes and objects, without requiring the coordination of multiple multimodal abilities. Second, while multimodal tasks in open-world settings [27, 21, 30] involve complex environments and objectives, they emphasize final task completion, often measured by success rate [19]. This results in a lack of in-depth analysis of the reasoning process, leading to potentially inaccurate assessments of multimodal reasoning capabilities. Moreover, some open-world tasks provide structured knowledge libraries [10, 37] that standardize the reasoning mechanisms. These consequently limit the autonomy of models to conduct multimodal reasoning, making the reasoning more reliant on predefined knowledge than on exploration of multimodal surroundings.

We argue that in open multimodal environments, including real-world settings and virtual simulators, complex multimodal reasoning should not be assessed solely by task completion results or isolated tasks. Instead, it is more practical and realistic to examine how models autonomously coordinate multiple multimodal reasoning abilities. Recently, open environments such as Habitat [31], AI2-THOR [17], and OSWorld [41] have been widely discussed, where MLLMs are required to exhibit complex reasoning skills including visual searching, spatial understanding, tool utilization, and long-term decision-making. However, as summarized in Table 1, few tasks or environments are specifically designed for evaluating multimodal reasoning in open-ended and interactive settings, leaving the evaluation of complex multimodal reasoning underexplored.

In this paper, we introduce MM-Escape, an extensible benchmark for evaluating complex multimodal reasoning, inspired by real-world escape games. To achieve this, we develop EscapeCraft, a customizable open environment that engages models in free-form exploration through the room escape task, assisting in comprehensive assessments of their multimodal reasoning abilities. MM-Escape addresses the above issues of excessive focus on task completion and simplified objectives by emphasizing intermediate behaviors alongside final task completion. It measures the entire reasoning process, which consists of compound abilities. While our designed games are intuitive for human players, they pose significant challenges for models regarding reasoning paths and interaction behaviors, exposing key limitations in current multimodal reasoning capabilities. As an additional bonus, we introduce a post-game debriefing task, which is feasible only for the most challenging settings. It requires models to recall the escape process and reconstruct the story from collected clues, which assists in the future assessment of more complicated reasoning tasks.

We conduct comprehensive experiments and derive two key findings. First, although overall performance remains far from human-level, recent MLLMs demonstrate strong multimodal reasoning ability. Notably, models like GPT-4o and Gemini-1.5-Pro achieve high escape rates and exhibit human-like exploration and reasoning strategies. However, their interaction success rates are half those of humans, and they take more than twice as many steps. Second, as game difficulty increases, model performance sharply declines, revealing distinct failure modes across models. For instance, GPT-4o tends to repeat actions and get trapped in historical trajectories when reasoning paths grow longer, while Gemini struggles with spatial awareness and often gets stuck in corners. We also thoroughly discuss post-game debriefing and other extensible settings in Section 5. These findings highlight the need for greater attention to comprehensive analysis and improvement of multimodal reasoning abilities. Our contributions are as follows:

Table 1: Benchmarks aiming at evaluating complex reasoning ability of models. URP means unpredictable reasoning path. PA means process assessment.

2.1 Complex Reasoning Abilities of MLLMs

Recent research on MLLMs has moved beyond addressing simple tasks, such as image captioning and image retrieval, and instead focuses on enhancing model abilities towards more complex tasks and more realistic scenarios. For example, MLLMs can operate digital devices such as computers [44, 28] and mobile phones [34, 40], play video games requiring long action sequences [3, 7], and perform robotic manipulation in the real world [12, 50]. Unlike tasks such as visual question answering and visual grounding, which have standardized answers, these complex tasks allow diverse solutions and engage in open environments. As a result, multimodal reasoning ability is becoming increasingly important for achieving more advanced targets.

2.2 Complex Multimodal Reasoning Evaluation

Researchers have made efforts to evaluate the visual reasoning ability of MLLMs in complex tasks which require models to coordinate multiple multimodal abilities [25, 24, 45]. However, many of them leverage videos with predefined perspectives and trajectories as input. Models cannot interact with environments by themselves.

Some other works focus on complex open worlds, such as Minecraft [33, 38, 30, 39, 21, 20], to evaluate the multimodal reasoning abilities of models. However, two issues remain. First, due to clearly defined game rules, standardized guides are available from websites like the Minecraft Wiki (https://minecraft.wiki/) and from training datasets of game demonstrations [15], making it feasible for models to reason based on their commonsense or external knowledge rather than on observations during free-form exploration. Second, evaluations are often outcome-oriented, for example, focusing on specific goals like obtaining a diamond in Minecraft, while how models explore the open world is ignored. Similar issues exist in other complex open-ended tasks in open worlds, such as computer use [41] and robotic manipulation [32, 46].

Compared to existing benchmarks, our work differs in three aspects: a) we support sufficient interactions between models and environments; b) environments can be flexibly configured, preventing models from relying solely on external knowledge to achieve the goals; c) reasoning processes are evaluated alongside final results. These enable us to understand model abilities more comprehensively.

2.3 Multimodal vs. Pure Text Environments

Previous works investigated the reasoning ability of models in pure text scenarios by allowing models to interact with environments via text interfaces [6, 42, 29]. However, some multimodal information cannot be easily and precisely expressed in verbal form, which prevents reasoning ability based on multimodal information from being evaluated in pure text environments. Additionally, models enjoy greater decision-making autonomy within multimodal environments than in text environments. For example, models may decide their rotation angles or travel distances based on multimodal inputs when they need to observe target objects or nearby surroundings, but it is hard to make such decisions based on text descriptions. Therefore, we believe that multimodal environments are essential for evaluating the reasoning abilities of models, and propose EscapeCraft and MM-Escape to address the above issues.

Figure 2: An example of a successful escape path from Gemini-1.5-pro. For the first four steps, it rotates, without moving, to observe the surroundings, and then starts moving in the direction of the observed tables in the following steps. For steps 5 and 6, we zoom in on the key to make it clearer. The key is distorted because the model's view is tilted downwards to observe objects on the table. From step 10 to 14, the model is already close to the door, and is slowly rotating and tilting to locate the door.

3 MM-Escape

In this paper, we address the comprehensive investigation of complex multimodal reasoning for MLLMs by introducing MM-Escape, an extensible benchmark. To support this, we design EscapeCraft, a highly customizable and open environment that engages models in free-form exploration for quantitatively assessing the intermediate multimodal reasoning process, as depicted in Figure 1. We highlight that complex multimodal reasoning requires an autonomous integration of multiple multimodal abilities, and that evaluating the reasoning process is as crucial as assessing final task completion. We first describe our introduced tasks in the following section, and then detail the construction of our benchmark along with its corresponding environment.

3.1 Task Definition

Inspired by the real-world game of room escape, we propose placing MLLMs in a room escape scenario to investigate their multimodal reasoning abilities in an open environment, where models should autonomously explore multimodal surroundings and reason toward both temporary and long-term goals. We name the introduced task multimodal room escape. It consists of a mandatory target, the room escaping task, which aims at escaping a locked room, and an optional target, the post-game debriefing task, which requires reconstructing the story discovered during the escape. Our multimodal room escape task assesses the entire reasoning process rather than focusing solely on final game completion.

Room Escaping Task.

This task presents an ultimate goal of exiting the room, as shown in Figure 1 (Game Target), and a detailed example is demonstrated in Figure 2. It requires models to fully explore and interact with the multimodal environment, search for props and clues, identify the exit, and correctly use props to unlock the door. We do not use step-by-step instructions to restrict model actions, so that models can freely explore the environment without constraints and automatically deduce current or short-term goals. This task evaluates integrated multimodal reasoning ability, including object recognition, visual search, visual reasoning, target deduction, spatial reasoning, and prop utilization. Please refer to Supplementary Materials Sec. J for a detailed discussion of the full set of required abilities.

The post-game debriefing task differs from the room escaping task, which involves reasoning about past experiences, current states, and future plans, in that it takes place after the game is completed (this task is also common in real-world escape games). It serves as an optional task because its prerequisite is to collect all props in the game and exit the room successfully, which is quite challenging for most current models. Therefore, for a fair comparison, we only apply this target to cases where models correctly complete the game, as shown in Figure 1. It emphasizes the logicality and consistency of the reconstruction, requiring models to reflect on past experiences and events.

3.2 Construction and Design of Environment

We develop EscapeCraft to place models in an escape game, a free-form exploration environment, to comprehensively investigate complex multimodal reasoning ability of models. It supports customizable and extensible scene generation for our escape game with minimal manual effort.

Room Scene Generation

To support the room escape task, we generate diverse and interactable room environments, which requires efficient and large-scale scene data generation and rendering. We develop an environment, EscapeCraft, by extending ProcTHOR [8] and Legent [4], whose original scene generation processes are labor-intensive. We enable automatic size adaptation to predefined or customized configurations, such as room scale, number of rooms, and required furniture, by incorporating 3D furniture models with annotated size information. Critical objects are made fully interactable to serve as props and clues that assist in completing the task. The flexibility and interactivity of objects allow for automatic large-scale 3D room generation following the requirements of our benchmark. EscapeCraft is highly customizable and extensible, making it well-suited for the room escape task. Detailed construction can be found in Supplementary Material C.1.

Action Space

We define three types of actions: moving, view adjustment, and interaction. The moving action, i.e. moving forward, allows the model to change its position to perceive objects at different depths. View adjustment, which includes horizontal or vertical rotation and looking at specific coordinates, enables perception from different angles and facilitates object selection for interaction. Interaction actions comprise grabbing, using, reading, and inputting, allowing models to obtain and utilize props from the environment and to process messages displayed by the props. These actions can be executed individually or combined into a single multi-action step. See Supplementary Materials Sec. C.1.2 for details.
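As an illustration of how this action space could be represented, the following is a minimal sketch and not the EscapeCraft API. The names `ActionType`, `Action`, `CATEGORY`, and `categories`, as well as the parameter fields and their units, are hypothetical; only the three action types and the individual actions come from the text.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List, Optional, Tuple


class ActionType(Enum):
    MOVE_FORWARD = "move_forward"
    ROTATE = "rotate"      # horizontal and/or vertical view rotation
    LOOK_AT = "look_at"    # look at specific coordinates
    GRAB = "grab"
    USE = "use"
    READ = "read"
    INPUT = "input"


# Map every action to one of the three action types named in the text.
CATEGORY = {
    ActionType.MOVE_FORWARD: "moving",
    ActionType.ROTATE: "view adjustment",
    ActionType.LOOK_AT: "view adjustment",
    ActionType.GRAB: "interaction",
    ActionType.USE: "interaction",
    ActionType.READ: "interaction",
    ActionType.INPUT: "interaction",
}


@dataclass(frozen=True)
class Action:
    kind: ActionType
    distance: float = 0.0                         # MOVE_FORWARD (hypothetical unit)
    yaw: float = 0.0                              # ROTATE, horizontal degrees (assumed)
    pitch: float = 0.0                            # ROTATE, vertical degrees (assumed)
    target: Optional[Tuple[float, float]] = None  # LOOK_AT coordinates
    payload: Optional[str] = None                 # item name for USE, text for INPUT


def categories(step: List[Action]) -> List[str]:
    """Return the action type of each action in one (possibly multi-action) step."""
    return [CATEGORY[a.kind] for a in step]


# One multi-action step: turn, walk forward, then try to grab.
step = [
    Action(ActionType.ROTATE, yaw=30.0),
    Action(ActionType.MOVE_FORWARD, distance=1.0),
    Action(ActionType.GRAB),
]
print(categories(step))
```

Representing a step as a list of actions is one simple way to model the multi-action execution mentioned above; the actual interface is described in the supplementary material.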

Inventory system

To enable model players to acquire and utilize props within the environment, we design an inventory system that allows models to store and manage acquired items, access detailed information about them, and use them as needed. This system assists models in effectively using props, and successfully escaping the room.
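The three capabilities described here (storing items, accessing their details, and using them) can be sketched as a small container class. This is an illustrative assumption rather than the actual implementation: the class name `Inventory`, its method names, and the example item are hypothetical, and whether using a prop consumes it is not stated in the text, so the sketch only checks availability.

```python
from typing import Dict, List


class Inventory:
    """Stores acquired props and exposes their details to the model."""

    def __init__(self) -> None:
        self._items: Dict[str, str] = {}  # prop name -> detailed description

    def store(self, name: str, description: str) -> None:
        """Add a prop obtained through a successful grab."""
        self._items[name] = description

    def list_items(self) -> List[str]:
        """Names of all props currently held, in sorted order."""
        return sorted(self._items)

    def describe(self, name: str) -> str:
        """Detailed information about a held prop (KeyError if not held)."""
        return self._items[name]

    def can_use(self, name: str) -> bool:
        """A prop can only be used if it has been acquired."""
        return name in self._items


inventory = Inventory()
inventory.store("key", "A small metal key.")  # hypothetical example item
print(inventory.list_items(), inventory.can_use("key"), inventory.can_use("note"))
```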

3.3 MM-Escape Benchmark

Figure 3: Illustration of difficulties, showing the required paths for the reasoning process of successful escapes at each level. The levels can be customized and extended, as depicted by the “difficulty-n” example and as demonstrated in Figure 1 (lower right part).

Game Settings.

We introduce an automatic reasoning chain generation procedure by configuring the Prop Chain. In detail, this is a singly linked list representing the ordered sequence of items and interactions required to complete the game. Each node in the chain corresponds to an interactive element, such as a key, a locked box, or a note with a password, where the tail node represents the exit point of the game. To construct a complete escape game setting, we annotate the links between nodes in the prop chain to define the ways to obtain different props (such as unlimited acquisition or requiring a key to open, etc.) and their inclusion relationships (for example, a key can be placed in a box). See Supplementary Material Sec. C.2 for details.
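The Prop Chain described above can be pictured with a short sketch of a singly linked list. The class `PropNode`, its field names, the annotation values `"free"`, `"key"`, and `"password"`, and the example chain are all hypothetical illustrations; the actual configuration format is given in the supplementary material.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class PropNode:
    name: str
    obtain: str = "free"               # link annotation: "free", "key", or "password"
    container: Optional[str] = None    # inclusion relation, e.g. a key placed in a box
    next: Optional["PropNode"] = None  # single forward link; None marks the tail (exit)


def build_chain(nodes: List[PropNode]) -> PropNode:
    """Link nodes in order and return the head of the chain."""
    if not nodes:
        raise ValueError("a prop chain needs at least the exit node")
    for current, following in zip(nodes, nodes[1:]):
        current.next = following
    return nodes[0]


def walk(head: PropNode) -> List[str]:
    """Ordered sequence of interactive elements from the head to the exit."""
    names = []
    node: Optional[PropNode] = head
    while node is not None:
        names.append(node.name)
        node = node.next
    return names


# Hypothetical chain: read a note, open a password-locked box holding a key,
# then unlock the door (the tail node, i.e. the exit).
chain = build_chain([
    PropNode("note"),
    PropNode("key", obtain="password", container="box"),
    PropNode("door", obtain="key"),
])
print(walk(chain))
```

Under this representation, the length of the list returned by `walk` is the number of reasoning hops, so appending nodes before the exit lengthens the chain.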

Following this, we employ varying difficulty levels in MM-Escape to facilitate in-depth assessments of complex multimodal reasoning. As shown in Figure 3, difficulty is primarily determined by the predefined prop chain of a game, where longer chains correspond to higher difficulty. We define three standard difficulty levels for individual rooms:

Since the prop chain can grow infinitely, our difficulty levels are inherently extendable. Moreover, the types of questions or tasks in each reasoning hop are customizable and interchangeable, further enhancing the difficulty and flexibility of MM-Escape. We also explore some extended settings that incorporate other tasks, such as embodied QA and visual logical reasoning, with case studies in Section 5.2.

Additionally, to further investigate the behavior and complex multimodal reasoning abilities of MLLMs, we introduce a multi-room setting by combining two standard single rooms. We create multiple multi-room combinations, each containing two individual rooms. The configurations include: two Difficulty-1 rooms, two Difficulty-2 rooms, and a mixed setting of Difficulty-1 and Difficulty-2. In the multi-room setting, models start in the first room, which has only one exit. Upon successfully exiting, models enter the second room and search for the final game exit. However, this setting presents a greater challenge than the single-room scenario, as there are two doors in the second room, requiring models to distinguish between exit and entrance based on their corresponding surroundings.

Statistics of MM-Escape

For individual room settings, we generated 11 scenes for each of Difficulty-1 and Difficulty-2, and 21 scenes for Difficulty-3. As there are two types of props required by Difficulty-3, we enable both key-first and password-first prop chains. For multi-room settings, we generated 10 scenes for the three different combinations introduced above. In total, there are 63 scenes for standard evaluation of our benchmark, which could be further extended for future research without adapting the environment. These scenes are categorized into four distinct styles: living room (14), kitchen (19), bathroom (19), and bedroom (11). Different objects are automatically placed within each scene to correspond to its specific style. On average, Difficulty-1 scenes contain 20.18 objects, Difficulty-2 scenes contain 14.55, and Difficulty-3 scenes contain 15.24. The objects are logically arranged in a manner consistent with real-life settings and randomly distributed within the rooms to serve as part of the tasks for models.

Table 2: Results of the standard single room setting. Prop: Prop Gain; Steps: average steps used to complete the game; Grab SR: the precision of grabbing; Grab Ratio: the proportion of grabbing actions among the total consumed steps. Note that Difficulty-1 requires no prop, and the prop gain is therefore omitted for this setting. The max allowed steps are 50, 75, 100 for Difficulty-1, -2, -3 respectively. The best score for each metric is bolded and the second best is underlined.

Table 3: Performance on multi-room setting for different room scenes. To assist in the more challenging setting, we provide models with a full successful escape path from Room 1 (9 steps) for self-reflection when they try to unlock Room 2. Hence, the Prop Gain (Prop (%)) in the results refers only to Room 2. Further challenges of escaping from the very beginning of multi-room setting are discussed in Supplementary Material F.

Difficulty-1 & Difficulty-1
Models | ER(%)↑ | Steps↓ | Grab SR(%) | Grab Ratio
GPT-4o | 75.00 | 35.50 | 34.25 | 0.32
Gemini-1.5-pro | 22.22 | 40.22 | 22.89 | 0.38
Llama-3.2-11b-vision | 55.56 | 31.00 | 36.25 | 0.36
Claude 3.5 Sonnet | 22.22 | 45.22 | 10.62 | 0.08
Qwen-VL-max | 22.22 | 40.33 | 12.96 | 0.16

Difficulty-1 & Difficulty-2
Models | ER(%)↑ | Prop(%) | Steps↓ | Grab SR(%) | Grab Ratio
GPT-4o | 90.00 | 100.00 | 34.90 | 35.52 | 0.31
Gemini-1.5-pro | 40.00 | 50.00 | 56.60 | 16.79 | 0.05
Llama-3.2-11b-vision | 10.00 | 60.00 | 66.40 | 4.40 | 0.40
Claude 3.5 Sonnet | 20.00 | 20.00 | 71.90 | 6.75 | 0.09
Qwen-VL-max | 30.00 | 50.00 | 57.70 | 42.30 | 0.28

Difficulty-2 & Difficulty-2
Models | ER(%)↑ | Prop(%) | Steps↓ | Grab SR(%) | Grab Ratio
GPT-4o | 70.00 | 80.00 | 39.50 | 42.32 | 0.37
Gemini-1.5-pro | 60.00 | 80.00 | 60.00 | 22.71 | 0.34
Llama-3.2-11b-vision | 10.00 | 40.00 | 76.80 | 27.00 | 0.19
Claude 3.5 Sonnet | 10.00 | 10.00 | 80.00 | 23.20 | 0.06
Qwen-VL-max | 0.00 | 10.00 | 80.00 | 23.66 | 0.32

Metrics of MM-Escape

To comprehensively evaluate complex multimodal reasoning ability, we propose a set of metrics for the intermediate process, alongside an indicator for final task completion. For the room escape task, we use the average escape rate (ER) as the indicator of game completion, and design five metrics for measuring intermediate interactions: prop gain, average steps, grab count, grab success rate, and grab ratio. Denoting the total steps as $S$, the number of successful grabbing actions as $N_{grab}^{TP}$, the grab success rate as GSR, and the grab ratio as $\text{R}_{grab}$, we have,

$$\text{Prop Gain}=\frac{N_{grab}^{TP}}{\sum\text{Prop count}}, \tag{1}$$
$$\text{GSR}=\frac{N_{grab}^{TP}}{\sum\text{Grabbing actions}}, \tag{2}$$
$$\text{R}_{grab}=\frac{\sum\text{Grabbing actions}}{S}. \tag{3}$$
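A worked sketch of Equations (1) to (3) may help make the three ratios concrete. The class `GrabCounts` and the function names are hypothetical, and the example counts are invented for illustration. The text does not state whether the sums run over a single game or over all scenes, so the sketch simply takes already-summed counts; returning `None` for an empty denominator is also an assumption (for example, a setting with no props, where prop gain is omitted).

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class GrabCounts:
    total_steps: int       # S
    grab_actions: int      # sum of grabbing actions
    successful_grabs: int  # N_grab^TP
    prop_count: int        # sum of prop count


def _ratio(numerator: int, denominator: int) -> Optional[float]:
    # Undefined ratios are reported as None instead of raising.
    return numerator / denominator if denominator > 0 else None


def prop_gain(c: GrabCounts) -> Optional[float]:
    return _ratio(c.successful_grabs, c.prop_count)    # Eq. (1)


def grab_success_rate(c: GrabCounts) -> Optional[float]:
    return _ratio(c.successful_grabs, c.grab_actions)  # Eq. (2)


def grab_ratio(c: GrabCounts) -> Optional[float]:
    return _ratio(c.grab_actions, c.total_steps)       # Eq. (3)


# Invented example: 40 steps, 10 grab attempts, 2 successful, 2 props in the scene.
counts = GrabCounts(total_steps=40, grab_actions=10, successful_grabs=2, prop_count=2)
print(prop_gain(counts), grab_success_rate(counts), grab_ratio(counts))
```

In this example the model collects every prop (prop gain 1.0) but only one in five grab attempts succeeds, which shows how a high prop gain can coexist with a low grab success rate.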

The debriefing task is applied as a bonus only to cases where models successfully escape, to further investigate the multimodal understanding and reasoning process, because models that successfully complete the game tend to achieve high scores on the metrics mentioned above. We employ large language models as the evaluator for this task, assessing the consistency ($[0,1]$) of reconstructed stories with the groundtruth stories.

4 Experiments

Evaluation Setups.

We investigate both open-source and proprietary models of different scales, and evaluate their multimodal reasoning ability level by level, as outlined in Section 3.3. Specifically, we employ both single room and multi-room settings. The multi-room setting can be viewed as a combination of two single rooms, with the second room featuring two doors, one for entry and one for exit. For robust comparison, we set the temperature to 0 to eliminate token decoding diversity. The prompt templates used for evaluation are detailed in Supplementary Material Sec. K. Since the game can continue indefinitely, we limit the maximum steps for the three difficulty levels to 50, 75, and 100 respectively, for quantitative measurements. We slightly increase the max step to 80 for the most challenging multi-room setting. Since the proposed room escape task requires models to fully utilize past trajectories to gain a comprehensive understanding of the room environment, we investigate models supporting long input contexts, at least 128k, as shown in Table 2. Proprietary models include GPT-4o (https://platform.openai.com/docs/models/gpt-4o), Gemini-1.5-pro (https://ai.google.dev/gemini-api/docs/models#gemini-1.5-pro), Claude 3.5 Sonnet (https://www.anthropic.com/news/claude-3-5-sonnet), and Doubao 1.5 Pro (https://www.volcengine.com/docs/82379/1330310#doubao-1-5-vision); open-source models include Phi-3-vision, Llama-3.2-11b, and Qwen2.5-VL.

Figure 4: Analysis on the grabbing behaviors with respect to the escape rate and the visibility of the exit at initial location.

Main Results of Game Completion

Results of standard single room setting, along with manual evaluation are reported in Table 2. See Supplementary Material Sec. B for details of human evaluation.

The results indicate that model performance falls significantly behind human capabilities across all metrics, highlighting the remarkable gap between current multimodal reasoning abilities and human-level capabilities. Among the models, GPT-4o demonstrates the strongest overall performance across all difficulty levels, achieving the highest average escape rate (81.36%), followed by Gemini-1.5-pro (61.06%). Other models achieve limited success in task completion, except for Phi-3-vision-128k, which fails to complete the task. Claude 3.5 Sonnet and Llama-3.2-11b-vision achieve moderate ER, but their performance drops significantly in Difficulty-2 and -3, suggesting weaknesses in multi-hop multimodal reasoning.

For the intermediate process, GPT-4o efficiently completes tasks with relatively fewer steps while maintaining a high grabbing success rate. Notably, Gemini-1.5-pro achieves a lower escape rate (ER) in Difficulty-2 than GPT-4o, yet it presents the highest Prop Gain, suggesting a strong visual search ability. For models that fail to complete the task, we can still measure their multimodal reasoning ability in terms of Prop Gain, GSR, and Grab Ratio. Llama-3.2-11b-vision and Qwen-VL-Max receive identical Prop Gain, while the latter presents a higher GSR with a lower Grab Ratio. This implies that Qwen-VL-Max can precisely identify scenes with useful props, while Llama-3.2 is likely to adopt a greedy policy of increasing grab actions so as to find more props, often unintentionally.

For the more challenging multi-room setting, we can derive similar conclusions. Additionally, we notice that providing a successful path for the first room helps most models better conduct multimodal reasoning in our task, but the improvements take different forms. For example, GPT-4o achieves an ER of 90% for Difficulty-2 when bootstrapped by a first room of Difficulty-1, and the prop gain of Gemini and Llama 3.2 is improved in the Difficulty-2 & -2 combination compared with the setting of a single room of Difficulty-2.

Table 4: Steps taken to get key props and to open the door, denoted as step, and the cost ratio of each corresponding stage, denoted as cost, which is the proportion of the corresponding steps out of the total steps. #PW: step/cost to obtain passwords from the beginning. #Key: step/cost to obtain keys from the beginning (Difficulty-2), and to obtain keys after the acquisition of the password (Difficulty-3). #Exit: step/cost to escape after the acquisition of the last key prop.

5 Analysis and Discussions

Takeaway Observations

Please refer to Supplementary Material Sec. A for the full version of the overall observations. In the remainder of this section, we further discuss model performance concerning the escape process, including the escaping path, the consistency of sub-goals, and post-game debriefing, and also provide cases to demonstrate the extensibility of our EscapeCraft.

5.1 Analysis of Entire Path

Figure 5: The extended scenario where the required password is displayed via a numerical pattern on the wall, rather than explicitly written on notes. GPT-4o reads it at once and exits within five steps, while Gemini struggles, repeatedly searching the room. Moving the pattern away from the door further challenges GPT-4o, leading to a failure to escape.

We investigate three key questions in this section: Q1) How many steps are required to obtain props? Q2) How many steps are needed to exit the room after acquiring the core prop (key or password)? Q3) What is the relationship between grab success rate and escape outcome for each test?

For Q1, GPT-4o presents a significant advantage in the number of steps required to obtain the key, followed by Gemini, as shown in Table 4. While Claude requires fewer steps to find props in Difficulty-3, this comes at the cost of a significant decrease in escape rate. The superior performance in locating and obtaining the core props can be attributed to a better understanding of task objectives and the holistic environment, and to enhanced reasoning abilities in this context. For Q2, Gemini can locate and acquire the key at a lower cost in Difficulty-2, but GPT-4o performs better in Difficulty-3, which is more complex. GPT-4o benefits from its prior memory and understanding of the room environment, gained in the process of obtaining key props, which helps it efficiently locate the exit and escape with fewer steps than other models. For Q3, escape success is positively correlated with grab success rate (GSR), as shown in Figure 4(a). A higher GSR implies that models have experienced more successful interactions with the environment, potentially indicating a clearer understanding of the overall environment and the ultimate goals of our task. While GSR declines with difficulty, the scores of GPT-4o and Claude 3.5 remain relatively stable compared to others, with less variation in grabbing behavior and GSR across difficulties. The low GSR of Qwen in Difficulty-2 and -3 can be partly attributed to ineffective perception of the environment and inferior reasoning and interaction decisions in these complex tasks, while the low GSR of Llama 3.2 is limited by its support for only one input image at a time.

Please refer to Supplementary Materials Sec. E for detailed discussion, and Sec. D for additional examination about the moving distance and arrangement of the room.

5.2 The Extensibility of EscapeCraft

We provide an extended case study in this section, and also discuss the fully autonomous version of the multi-room setting in Supplementary Materials Sec. F, and additional customizations of the escape path in Supplementary Materials Sec. G.

We introduce an extended scenario where the required password is displayed as a numerical pattern on the wall, rather than explicitly written on notes, as shown in Figure 5. Models should recognize the pattern on the wall (the password) and infer its relevance to the door. When the pattern appears near the door, GPT-4o quickly identifies it and exits within the following five steps, while Gemini, despite seeing the pattern, fails to recognize it as the password and instead repeatedly searches the room exhaustively.

We further move the pattern away from the door and observe unchanged behavior from Gemini. GPT-4o, however, performs differently: it repeatedly moves between the bed and the wall without recognizing the password. It also fails to interact with the door until the game stops at the maximum step allowance, revealing limitations in its long-term reasoning and spatial reasoning.

5.3 Analysis of Intermediate Consistency

Table 5: Consistency compared with GSR on Difficulty-3.

During experiments, we observe that some successful interactions are completed by accident rather than through a proper reasoning process. For example, a model aims at a TV but obtains a key instead. To further analyze the reasoning process behind successful sub-goals (i.e., successful grabbing), we introduce an additional metric, intent-outcome consistency, denoted by $C_{IO}$ and ranging in $[0,1]$. This metric quantitatively assesses the intermediate process of achieving sub-goals. Specifically, it measures whether the model-generated rationale (intent) aligns with the actual environment feedback (outcome). We use GPT-4o for auto-evaluation with the prompt template provided in Supplementary Materials Sec. K, where a score of 0 implies accidental success and 1 reflects valid reasoning. We calculate scores on Difficulty-3, as it involves more steps and can therefore better reflect the intermediate consistency, and report some of the results in Table 5. GPT-4o achieves only 26.51% consistency despite a GSR of 31.36%, implying that many completed sub-goals are accidental. The same holds for other models, suggesting that accidental completion of sub-goals is common and indicating a need to improve the consistency between reasoning and perception capabilities.

5.4 Analysis on Post-game Debriefing

The post-game debriefing task requires models to recall their escape process and the clues they obtained, and to reconstruct the whole story. As a successful escape is necessary for post-game debriefing, we only evaluate models with high success rates, that is, GPT-4o and Gemini-1.5-pro. Results show that both models fall short in retelling the stories. Models pay strong attention to the processes that are directly related to completing the escape, such as password acquisition. Meanwhile, they ignore background stories, which are less important but still helpful for escaping. Given limited model abilities, this may be an effective strategy for completing tasks. However, as model abilities improve in the future, it will be necessary to improve their memorization of background information. For the experiment results, please refer to Supplementary Materials Sec. I.

6 Conclusions

In this paper, we introduce MM-Escape, an extensible benchmark for investigating multimodal reasoning, inspired by real-world escape games. To construct our benchmark, we also develop EscapeCraft, which enables models to engage in free-form exploration for assessing multimodal reasoning. We find that MLLMs can successfully complete the simplest level of tasks, and some models even exhibit human-like behaviors and strategies. However, performance dramatically drops as task difficulty increases, while human testers consistently succeed. MM-Escape reveals distinct failure modes across models, such as repetitive trajectories without adaptive exploration and getting trapped in corners due to poor spatial awareness. We hope our work sheds light on new challenges, and uncovers potential improvements for MLLMs.

Contributions

Ziyue Wang: Design of the escape process and post-game debriefing, implementation of EscapeCraft, all reported experiments. Paper writing: all sections and figures.

Yurui Dong: Design of the escape process and post-game debriefing, construction of 3D environment, all engineering and coding works. Paper writing: method sections and appendix sections, figures of case study.

Fuwen Luo: Design of the escape process, design and experiments of post-game debriefing. Paper writing: related works, post-game debriefing experiments and analysis.

Minyuan Ruan: Implementation of EscapeCraft, scene generation, construction of homepage. Paper writing: analysis, human evaluations, figures of analysis.

Zhili Cheng: Construction of 3D environment, design of room escape process.

Chi Chen: Design of room escape process and post-game debriefing, support on experiments.

Peng Li: Project supervision, advising of all designs, engineering, experiments, and paper writing.

Yang Liu: Project supervision, advising of all designs, engineering, experiments, and paper writing.

References

Supplementary Material

Appendix A Full Takeaway Observations

Appendix B Human Evaluation

Table 6: Detailed results for various levels. Since humans completed all escape tasks in the evaluation, the escape rate is 100% and is not reflected above.

We conduct manual evaluation on MM-Escape, and report detailed results in Table 6. Human participants in EscapeCraft exhibit a clear understanding of how to complete tasks efficiently. By observing objects in the environment, they can make reasonable judgments about which items to pick up, leading to a higher success rate in effective item acquisition and usage. Additionally, when unable to open doors or interactable objects, humans are more adept at promptly shifting their approach to seek alternative clues in the environment rather than getting stuck. In terms of spatial awareness, they demonstrate a strong ability to perceive the relative positions of objects, enabling logically reasonable, smoother, and more coordinated exploratory actions. Across Difficulty-1 to Difficulty-3, human participants consistently identify all necessary items with fewer interaction attempts than MLLMs, and successfully complete all the tasks within a limited number of steps.

Appendix C Construction Details

C.1 Environment Construction

C.1.1 Room Generation

We adopt the automated 3D room generation method ProcTHOR [8], with additional improvements to enhance its flexibility and applicability to diverse types of scenes. Following ProcTHOR, we generate 3D environments that simulate diverse real-world scenes, such as bedrooms, living rooms, and offices, by maintaining collections of typical objects that are common in each scene, for instance, desks in offices and workbenches in laboratories. We enable automatic creation of 3D rooms from the collection of each scene, ensuring that the generated rooms accurately reflect their respective environments.

We use a configuration file to generate each room, specifying the items along with required styles, positions, sizes, and interactivity. This enables precise control over the placement of prop objects, ensuring that they are arranged in a manner aligning with real-world expectations on spatial arrangement.

Benefits of the Automated 3D Room Generation include:

The automated 3D room generation framework we developed not only enhances the diversity and realism of testing environments but also optimizes the efficiency of large-scale evaluations. By providing a mechanism for dynamically altering the environment and agent conditions, it offers a more robust and fair evaluation process, ensuring that models are evaluated under realistic, varied, and challenging conditions.

C.1.2 Action Space

In our EscapeCraft environment, the agent is allowed to perform a set of actions that facilitate its interaction with the environment. These actions include moving forward, rotating right, rotating down, looking at specific coordinates, grabbing objects, and interacting with elements in the environment. Each of these actions plays a distinct role in enabling the agent to explore and solve tasks within the escape scenario.

Throughout the agent’s exploration, its interactions with the environment yield varying types of feedback. The environment is populated with different types of objects, classified as follows:

These interaction dynamics are crucial for the agent’s progression in the environment, as they form the basis for decision-making, object management, and problem-solving. The design of these interactions reflects the need for both exploration and strategy, with the agent needing to acquire, manage, and apply various items in order to navigate and ultimately escape the environment.

C.2 Data Construction

C.2.1 Prop Chain

We propose a procedural generation approach for constructing game settings, tailored to the inherent limitations of current language models, such as restricted context length and reduced reasoning capabilities. To address these constraints, we introduce the concept of the Prop Chain, a singly linked list that organizes interactive game elements in a sequence, ensuring a coherent flow of gameplay interactions. Each node in the linked list corresponds to a distinct interactive item or action, such as a key, a locked box, or a note with a password. The tail node of the chain signifies the game’s exit point, thereby serving as the conclusion of the sequence. Table 7 shows the Prop Chain for the Difficulty-3 level.

In our implementation of the Prop Chain, we initially focus on a set of fundamental game elements: a key, a locked box (which can only be opened with a key or password), a note (carrying both password and story-related information), and an exit (which is locked and requires either a key or password to access). These components are used to construct a series of interconnected nodes, where each item or action is represented by a node in the chain. The links between the nodes define the relationships between the props and the ways in which they can be obtained or used during the game. For instance, some props may be freely accessible, while others require specific conditions, such as possessing a key to unlock a box, or using a specific password to open the door.

The inter-node relationships can be annotated to represent different interaction modes. For example, a key can be placed within a box, requiring the player to first unlock the box before acquiring the key. Additionally, nodes are allowed to carry multiple roles. A note revealing the password to the exit is both narrative (to see or infer the textual password) and functional (to open the door).

Each node has an additional show property indicating whether the item should appear directly in the scene. For example, for a key placed in a box only the box needs to be shown, whereas a key that can be obtained directly must itself be shown. Reading this property from the game settings lets us determine which props need to be generated in the 3D scene.

While our initial focus on a limited set of props and interactions, such as the key, locked box, note, and exit, suffices for creating a variety of escape game settings that challenge current language models, the system is highly extensible. The procedural nature of Prop Chain allows for the seamless integration of new props, interactions, and unlocking mechanisms. As such, the framework can easily accommodate additional types of interactive items, more intricate unlock conditions, and customized gameplay mechanics in future iterations. This scalability ensures that the approach remains adaptable to more complex and diverse game scenarios, further enhancing its applicability for testing language models in a variety of settings.

The Prop Chain framework provides a robust and flexible methodology for the procedural generation of game settings. By focusing on a set of core interactive elements and defining their relationships within a linked list structure, we have developed a scalable approach that can evolve to incorporate new game dynamics and meet the increasing demands of future language models.
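The linked-list structure described above can be sketched as follows. This is a minimal illustration rather than the EscapeCraft implementation: the class name `PropNode`, its fields `requires` and `provides`, and the helper functions are assumptions introduced here, and only the `show` property and the Difficulty-3 sequence (taken from the caption of Table 7) come from the text.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class PropNode:
    """One node of a Prop Chain (field names are illustrative assumptions)."""
    name: str                       # e.g. "note_1", "box_1", "exit"
    show: bool                      # True if the prop must be placed in the 3D scene
    requires: Optional[str] = None  # key or password needed to unlock this node
    provides: List[str] = field(default_factory=list)  # items or clues gained here
    next: Optional["PropNode"] = None  # singly linked: the following node


def build_chain(nodes: List[PropNode]) -> PropNode:
    """Link the nodes in order and return the head; the last node is the exit."""
    for current, following in zip(nodes, nodes[1:]):
        current.next = following
    return nodes[0]


def visible_props(head: PropNode) -> List[str]:
    """Collect the props that need to be generated in the scene."""
    names = []
    node = head
    while node is not None:
        if node.show:
            names.append(node.name)
        node = node.next
    return names


def is_solvable(head: PropNode) -> bool:
    """Check that every requirement is provided by an earlier node in the chain."""
    inventory = set()
    node = head
    while node is not None:
        if node.requires is not None and node.requires not in inventory:
            return False
        inventory.update(node.provides)
        node = node.next
    return True


# Difficulty-3 chain as described in the caption of Table 7.
# Assumption: the exit is the room door rather than a placed prop, so show=False.
difficulty_3 = build_chain([
    PropNode("note_1", show=True, provides=["password_1"]),
    PropNode("box_1", show=True, requires="password_1",
             provides=["key_1", "note_2"]),
    PropNode("exit", show=False, requires="key_1"),
])

print(visible_props(difficulty_3))
print(is_solvable(difficulty_3))
```

Under these assumptions, adding a new prop type amounts to adding a node with its own `requires` and `provides` entries, which is one way to read the extensibility claim above.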

Table 7: Representation of the Prop Chain for the Difficulty-3 Level. The level includes a sequence of interactive props where only box_1 and note_1 are visible in the room. The gameplay progression follows a structured sequence: the agent first discovers note_1, which contains the password_1 needed to unlock box_1. Inside box_1, the agent retrieves key_1 and note_2, the latter of which contains a story element of the game. Finally, the agent uses key_1 to unlock the exit and complete the game.

Appendix D Analysis of Moving Distance

| | GPT | Gemini | Claude | LLaMA | Qwen |
| --- | --- | --- | --- | --- | --- |
| Correlation | -0.06 | 0.06 | 0.49 | 0.63 | -0.48 |

Table 8: Correlation between optimal distance and model moving distance.

We calculate the optimal distance required for the escape task in each scene and compare it with the actual distance traveled by the models. Contrary to our expectations, the traveled distance does not exhibit a significant correlation with the distance among the key props and the exit within the scene, as shown in Table 8. This discrepancy may be attributed to the models' lack of holistic environmental perception, which prevents them from further reasoning and planning based on current and ultimate goals, so that they fail to generate an effective and optimal route to complete the task.
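As an illustration of how a per-model correlation of this kind could be computed, the sketch below uses the Pearson coefficient. The choice of Pearson correlation is an assumption, since the text does not name the coefficient, and the distance values are hypothetical placeholders rather than measured data.

```python
import math
from typing import Sequence


def pearson(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation coefficient between two equally long samples."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two samples of equal length (at least 2 points)")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sample")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical per-scene distances in meters (placeholders, not results).
optimal_distance = [3.0, 4.5, 6.0, 7.5]
model_distance = [12.0, 9.0, 15.0, 11.0]

r = pearson(optimal_distance, model_distance)
print(f"correlation = {r:.2f}")
```

A value near zero would mean that scenes with longer optimal routes do not systematically lead to longer model trajectories.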

Appendix E Analysis of Grabbing Behaviors

In Figure 4(b)(c)(d), we analyze three performance metrics, steps, GSR, and $R_{\text{grab}}$, during task completion, grouped by whether the exit is visible from the initial location and orientation. The results indicate that, as a general trend, being able to see the exit from the initial position helps the model escape the room in fewer steps. This aligns with our intuition, as the exit, which is crucially related to the ultimate task goal, plays a significant role in the model’s visual recognition, reasoning, and interaction with the environment to collect information. However, there are exceptions. For Difficulty-1, some well-performing models still struggle to achieve a high GSR and consume more steps despite being able to see the exit from their initial location. They do not interact directly with the exit at the very beginning, and instead choose to rotate around and explore the environment for more information before reasoning and taking action. This is also evident in Difficulty-2 and -3, where these models, after acquiring the key prop, can locate the exit and escape more efficiently, as reflected in better performance in terms of GSR, $R_{\text{grab}}$, and steps.

We further raise three questions for the analysis of the reasoning process during escaping: i) How many steps does it take to obtain props? ii) How many steps does it take to exit the room after obtaining the core prop (key or password to the door)? iii) What is the relationship between grab success rate (GSR) and escape outcome for each test?

For question 1, GPT-4o demonstrates a significant advantage in the number of steps required to obtain the key, followed by Gemini, as shown in Table 4. Although Claude requires fewer average steps to find props in Difficulty-3, this comes at the cost of a significant decrease in escape rate. The superior performance in locating and obtaining the core prop can be attributed to the model’s better understanding of task objectives and the environment in the escape room, as well as its enhanced reasoning abilities in this context.

For question 2, Gemini is able to locate and acquire the key at a lower cost in Difficulty-2, but in the more complex Difficulty-3, GPT-4o performs better. It finds the core prop in fewer steps, and its prior memory and understanding of the room environment, gained in the process of obtaining key props, help it locate the exit and escape in even fewer steps than other models.

For question 3, we observe that escape success is positively correlated with GSR, as shown in Figure 4(a). A higher Grab SR implies that models have experienced more successful interactions with the environment. This potentially indicates a clearer understanding of the overall environment and the ultimate goals of the room escape task, leading to a higher success rate. As difficulty increases, the Grab SR of most models declines, and many of them fail to escape. However, GPT-4o and Claude 3.5 remain relatively stable, with less variation in grabbing behavior and success rate across difficulty settings compared to others. The low success rate of Qwen and Llama 3.2 11B in Difficulty-2 and -3 can be partly attributed to their inability to effectively perceive the environment, reason, and make appropriate object interaction choices in more complex tasks.

Table 9: Detailed results of note-key and key-note settings of Difficulty-3.

Appendix F Discussion of Fully Autonomous Multi-room Escape

We discussed a simplified multi-room setting in Table 3, and further study how models behave in this section. The ER of GPT-4o decreases to only 50% on average when Difficulty-2 is applied to room 2. The grabbing behaviors also change: both the Grab SR and the Grab Ratio decrease. Similar trends are observed for Gemini and Claude. These results indicate that models can learn from a successful escape history. We also note that setting the two rooms to the same difficulty level further helps models escape, while different levels do not benefit them as expected.

Appendix G Discussion of Customizing Difficulties

We enable two different settings of Difficulty-3, a key-note setting and a note-key setting. We observe that human annotators perform equally well in both settings (see Table 6), while some models show preferences between the key-first and the note-first (i.e., the password-first) settings, as shown in Table 9. Gemini approaches GPT-4o-level results in the note-key setting, but scores the worst on all calculated metrics among the three reported models in the key-note setting, indicating a preference for searching for the note rather than recognizing and interacting with the key. Additionally, Claude scores higher in Grab SR in the key-first setting than in the note-first setting, potentially indicating that it attends more to the key (directly used to unlock the door) than to the note (which carries clues and implicitly assists the escape process).

For the multi-room setting, whose results are reported in Table 3, we further extend the experiments to a fully autonomous scenario that requires models to escape both rooms in order entirely by themselves. This means that the first room no longer serves as bootstrapping guidance. We notice a performance drop in both the escape rate and the grabbing behaviors.

Appendix H Experiments with Reasoning Models

Results of recently released reasoning models, such as Claude 3.7 and o1, are reported in Table 10. We also provide results for GPT-4o and Claude 3.5 for comparison. Notably, o1 and Claude 3.7 attempt fewer grabs but yield higher GSR and Prop Gain, indicating more efficient and intelligent reasoning compared to their previous versions.

Table 10: Results of reasoning models on Difficulty-2.

Appendix I Results of Post-game Debriefing

We choose the two models with the highest escape rates for evaluation, that is, GPT-4o and Gemini-1.5-pro. We first use the debriefing prompt, shown in Table 16, to obtain recovered stories from the models. We then use GPT-4o to assess each recovered story according to its similarity to our groundtruth story and to assign it a score between 0 and 5. As shown in Table 11, the average scores of both models are low. We further show a case in Table 12. Models can recall and describe their escape process relatively well, but they do not remember clues about the background stories well, which leads to the failure to restore the full picture of the stories.

| Model | GPT-4o | Gemini-1.5-pro |
| --- | --- | --- |
| Average Score | 0.33 | 0.44 |

Table 11: Average scores of models on post-game debriefing task. Scores are between 0 and 5.

Groundtruth
The story revolves around a secret society that hides within the walls of an ancient mansion. Once you enter, you must follow cryptic clues and navigate through a maze of rituals and secret rooms. The society’s true purpose is far darker than its intellectual allure, and the deeper you go, the harder it is to escape. The goal is to uncover the society’s mysteries and find a way out before becoming part of its twisted legacy.
Recovered Story from GPT-4o
The room appeared to be a deliberately designed escape challenge, possibly created for entertainment or as a test of problem-solving skills. The presence of the locked box and the notes suggested a sequence of puzzles that needed to be solved to escape. The sequence of events began with the discovery of Note 1, which provided the numerical code ”9926.” This code was used to unlock the box on the table, revealing Key 1 and Note 2. The key was then used to unlock the door, allowing for the escape from the room. The story on Note 2, while not explicitly detailed, might have provided a narrative or thematic context for the escape challenge, perhaps hinting at the creator’s intentions or the purpose of the room. The overall experience was one of exploration, deduction, and the satisfaction of solving a series of interconnected puzzles to achieve freedom.
Recovered Story from Gemini-1.5-pro
The room was clearly designed as an escape room challenge. The purpose was to test my problem-solving skills and ability to find and interpret clues. The creator of the challenge remains unknown, but the sterile and impersonal nature of the room suggests a deliberate design choice. The sequence of events was as follows: I started in the center of the room, explored the cabinet near the door, and then the sofa and stool. I then found the note with the number ”1148”. Realizing the box on the side table was likely password-protected, I used the number from the note to unlock it. Inside, I found the key (key_1) and a second note (note_2). I then used the key to unlock the door and escape the room. The story on note_2 likely served as a distraction or perhaps a thematic element to the escape room’s design, but ultimately, the key to escaping was the numerical code ”1148”.

Table 12: A case of post-game debriefing.

Appendix J Required Abilities

In our MM-Escape Benchmark, the free-form task challenges models with the following abilities:

Appendix K Prompt Template

System Prompt

The System Prompt consists of two primary components: the Instruction Prompt and the Operation Prompt. The Instruction Prompt provides the model with contextual information regarding the current environment, its overarching objective, and the approach required to achieve this objective. In contrast, the Operation Prompt delineates, in precise detail, the permissible actions and exploratory methods that the model can employ within the environment. Additionally, it specifies the format of the structured data that the model is expected to generate in response. The complete prompt is shown in Table 13.

Instruction Prompt
You find yourself locked inside a room, and your ultimate goal is to escape the room. i.e. the room escape game.
You can explore the room, interact with objects, inspect items, and resolve puzzles. If you find doors locked or uninteractable, you probably need to search for keys or passwords to unlock the door when interacting with the environment. You can adopt the following actions to explore the room and interact with objects:
Operation Prompt
- move_forward: float, ranged between [-10, 10]. This is the number of meters you want to move forward (negative value means moving backward).
- rotate_right: float, ranged between [-180, 180]. This is the number of degrees you want to turn right (negative value means turn left).
- rotate_down: float, ranged between [-90, 90]. This is the angle you want to adjust your view vertically. Positive value means looking downward, while a negative value means looking upward. Angle 0 means looking straight ahead.
- jump: bool, whether you want to jump (can be used together with moving forward), e.g., True represents the action ”to jump”.
- look_at: list[x: foat, y: float], the range of x and y is [0, 1]. This parameter is the coordinates of the point in the image you want to look at. For reference, the coordinates of the upper left corner of the scene are (0, 0) and the coordinates of the lower right corner are (1, 1). Also to mention that there are on clues on the ceiling.
- grab: bool, whether you require to interact with the object located exactly at the center of the scene (marked by a red dot). e.g., to grab the key or to interact with (or open) a box at the center of the scene, set grab=True. The red dot assists in locating the object you require to interact with. You might need to adjust the view or move closer to ensure the red dot is on your target object, through the rotate_right, rotate_down, and move_forward actions. To successfully grab an object, you should center the object via the red dot and be in a certain distance to it. If the grabbing fails, try move closer towards the object. If it fails multiple times at the same position, you should be aware that not all objects are interactable, do not get stucked in uninteractable position.
- interactions : dict:{”use_item_id”: str, this is the item_id you require to view or use (when used together with grab=True, it means to use this item to interact with the target object you want to grab, e.g. using item_id of the key to open the door in the scene), ”input”: str, this is the message you want to input when interacting with the center object}.
- read: str, this is the item_id that you want to get detailed information from your bag.
- rationale: str, represents the rationale of your action. This should explain your decision-making process and help the agent understand your thinking process.
You need to return data in the following format of JSON_string to interact with the scene:
{
  "move_forward": float,
  "rotate_right": float,
  "rotate_down": float,
  "jump": bool,
  "look_at": [x: float, y: float],
  "grab": bool,
  "interactions": {
    "use_item_id": str,
    "input": str
  },
  "read": str,
  "rationale": str
}
All of the above operations are optional. If no value is passed in, the interactive operation will not be performed.
You must follow the above instructions and don’t say anything else except for the JSON_string of operations.

Table 13: The System Prompt
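To make the action format above concrete, the following sketch parses a model reply and checks it against the ranges stated in the Operation Prompt. It is an illustration under stated assumptions, not the EscapeCraft parser: the function name `parse_action` is hypothetical, and rejecting out-of-range values (rather than clamping them) is an assumed policy that the text does not specify.

```python
import json
from typing import Any, Dict

# Numeric ranges copied from the Operation Prompt.
RANGES = {
    "move_forward": (-10.0, 10.0),    # meters, negative means backward
    "rotate_right": (-180.0, 180.0),  # degrees, negative means turn left
    "rotate_down": (-90.0, 90.0),     # degrees, negative means look upward
}


def parse_action(reply: str) -> Dict[str, Any]:
    """Parse a JSON reply and validate the fields that are present.

    All operations are optional, so missing keys are simply skipped.
    Raising on invalid values is an assumption made for this sketch.
    """
    action = json.loads(reply)
    if not isinstance(action, dict):
        raise ValueError("the reply must be a JSON object")
    for name, (low, high) in RANGES.items():
        if name in action:
            value = float(action[name])
            if not low <= value <= high:
                raise ValueError(f"{name}={value} is outside [{low}, {high}]")
            action[name] = value
    if "look_at" in action:
        x, y = action["look_at"]
        if not (0.0 <= x <= 1.0 and 0.0 <= y <= 1.0):
            raise ValueError("look_at coordinates must lie in [0, 1]")
    for flag in ("jump", "grab"):
        if flag in action and not isinstance(action[flag], bool):
            raise ValueError(f"{flag} must be a boolean")
    return action


reply = '{"rotate_right": 90, "grab": true, "rationale": "Turn toward the box."}'
print(parse_action(reply))
```

Validating the reply before it reaches the simulator keeps malformed outputs from being counted as environment interactions, which matters when step counts are themselves a reported metric.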

Step Prompt

The Step Prompt is designed to provide feedback to the model regarding the outcome of its previous interaction with the environment (if an interaction was attempted). Simultaneously, it informs the model in real time about the items currently available in its inventory for potential use. Additionally, the prompt serves as a directive, encouraging the model to continue exploration or engage in further interactions. The complete prompt is shown in Table 14.

Interaction Result
{interaction_result}
===
Inventory
The items in your bag usable include:
{bag_desc}
===
Step Prompt
Please determine the next action(s) that could help you observe the room or obtain useful tools or clues.
If you find yourself stuck in a corner, try turn around by passing rotate_right.
You need to return data in the following format of JSON_string to interact with the scene and don’t say anything else:
{
  "move_forward": float,
  "rotate_right": float,
  "rotate_down": float,
  "jump": bool,
  "look_at": [x: float, y: float],
  "grab": bool,
  "interactions": {
    "use_item_id": str,
    "input": str
  },
  "read": str,
  "rationale": str
}

Table 14: The Step Prompt
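The two placeholders in the Step Prompt, `{interaction_result}` and `{bag_desc}`, can be filled as in the sketch below. Only the placeholder names and the surrounding wording come from Table 14; the function name, the one-item-per-line inventory layout, and the `(empty)` marker are assumptions made for illustration.

```python
from typing import List

# Feedback and inventory part of the Step Prompt, with the two placeholders
# from Table 14. The instruction text and JSON schema would follow this header.
STEP_HEADER = (
    "{interaction_result}\n"
    "===\n"
    "The items in your bag usable include:\n"
    "{bag_desc}\n"
    "===\n"
)


def render_step_header(interaction_result: str, bag: List[str]) -> str:
    """Fill the placeholders; an empty bag is shown as '(empty)' (assumption)."""
    bag_desc = "\n".join(bag) if bag else "(empty)"
    return STEP_HEADER.format(
        interaction_result=interaction_result,
        bag_desc=bag_desc,
    )


print(render_step_header("You picked up key_1.", ["key_1"]))
```

The JSON schema is kept out of the formatted string on purpose: its literal braces would otherwise need escaping for `str.format`.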

Prompt for Consistency Evaluation

The Consistency Evaluation Prompt is designed to assess whether the multimodal agent’s reasoning aligns with the actual outcomes of its actions during an escape room interaction. After each interaction, the evaluator model is given the agent’s internal rationale, describing its belief or intended action, and the environment’s response, which records what actually occurred. The prompt guides the evaluator to judge whether the target object mentioned in the rationale matches the object that was actually interacted with, thereby evaluating whether the behavior is intentional or accidental. A special case is defined for successful escapes: the rationale must explicitly or implicitly indicate the agent’s goal to exit the room. The evaluator outputs a binary judgment in JSON format, indicating consistency (‘1’) or inconsistency (‘0’). The complete prompt is shown in Table 15.

Table 15: The Consistency Evaluation Prompt
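One way to turn the binary judgments into a single score in $[0,1]$ is to average them, as sketched below. Both the JSON key `consistent` and the plain averaging over judged interactions are assumptions: the text states only that the evaluator returns a binary judgment in JSON format, not the key name or the exact normalization behind the reported percentages.

```python
import json
from typing import Iterable


def intent_outcome_consistency(judgments: Iterable[str],
                               key: str = "consistent") -> float:
    """Average binary judgments (1 = consistent, 0 = accidental).

    `key` is a hypothetical field name for the evaluator's JSON output.
    """
    scores = []
    for raw in judgments:
        value = int(json.loads(raw)[key])
        if value not in (0, 1):
            raise ValueError(f"expected a binary judgment, got {value}")
        scores.append(value)
    if not scores:
        raise ValueError("no judgments to aggregate")
    return sum(scores) / len(scores)


# Hypothetical evaluator outputs for four judged interactions.
outputs = [
    '{"consistent": 1}',
    '{"consistent": 0}',
    '{"consistent": 1}',
    '{"consistent": 0}',
]
print(intent_outcome_consistency(outputs))
```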

Debriefing Prompt

The Story Recovery Prompt guides the model to recall and infer the background and story of the entire game from its interaction records after it successfully escapes the room. The model is guided to describe the room environment, recall the items that may contain information or clues, and finally piece together the whole story to complete the story recovery. The complete prompt is shown in Table 16.

Story Recovery
You have successfully escaped the room. Now, reconstruct the entire story based on the items you discovered during the game and the overall environment you observed. Follow the steps below to guide your recollection and piece together the full narrative.
Describe the room environment
Step 1: Describe the room environment ”Begin by describing the room where you started. What did the room look like? What was the overall atmosphere? Were there any notable features, such as furniture, lighting, or strange objects? Include sensory details like smells, sounds, and the arrangement of the room. This will help set the scene for the story.”
Recall the items that may contain
Step 2: Recall the items that may contain information or clues ”Think back to the objects you found throughout the game. What items did you come across? Were any of them unusual or seemed important? These could include physical items like keys, notes, or devices, or even abstract clues like symbols or markings on the wall. Reflect on how each item might have connected to the next step in your escape.”
Piece together the whole story
Step 3: Piece together the whole story ”Now, use the information from the room description and the items you’ve found to piece together the full story. What was the purpose of the room? Who or what might have created the escape challenge, and why? What was the sequence of events that led you to the escape? Try to connect the dots between the environment, the clues, and the items you encountered, and reconstruct the narrative from start to finish.”

Table 16: The Story Recovery Prompt