SWE-Compass: Towards Unified Evaluation of Agentic Coding Abilities for Large Language Models

Jingxuan Xu∗, Ken Deng∗, Weihao Li∗, Songwei Yu∗, Huaixi Tang∗, Haoyang Huang∗, Zhiyi Lai∗, Zizheng Zhan∗, Yanan Wu∗, Chenchen Zhang∗, Kepeng Lei, Yifan Yao, Xinping Lei, Wenqiang Zhu, Zongxian Feng, Han Li, Junqi Xiong, Dailin Li, Zuchen Gao, Kun Wu, Wen Xiang, Ziqi Zhan, Yuanxing Zhang, Wuxuan Gong, Ziyuan Gao, Guanxiang Wang, Yirong Xue, Mengtong Li, Mengfei Xie, Xiaojiang Zhang, Jinghui Wang, Wenhao Zhuang, Zheng Lin, Huiming Wang, Zhaoxiang Zhang, Yuqun Zhang, Haotian Zhang, Bin Chen, Jiaheng Liu†
Kuaishou Technology, Nanjing University
xujingxuan05@kuaishou.com, dengken@kuaishou.com, liujiaheng@nju.edu.cn

Abstract

Evaluating large language models (LLMs) for software engineering has been limited by narrow task coverage, language bias, and insufficient alignment with real-world developer workflows. Existing benchmarks often focus on algorithmic problems or Python-centric bug fixing, leaving critical dimensions of software engineering underexplored. To address these gaps, we introduce SWE-Compass (https://huggingface.co/datasets/Kwaipilot/SWE-Compass), a comprehensive benchmark that unifies heterogeneous code-related evaluations into a structured and production-aligned framework. SWE-Compass spans 8 task types, 8 programming scenarios, and 10 programming languages, with 2,000 high-quality instances curated from authentic GitHub pull requests and refined through systematic filtering and validation. We benchmark ten state-of-the-art LLMs under two agentic frameworks, SWE-Agent and Claude Code, revealing a clear hierarchy of difficulty across task types, languages, and scenarios. Moreover, by aligning evaluation with real-world developer practices, we hope SWE-Compass can provide a rigorous and reproducible foundation for diagnosing and advancing agentic coding capabilities in large language models.

∗ Equal Contribution. † Corresponding Author.

1 Introduction

Large language models (LLMs) trained on code have rapidly advanced from solving algorithmic puzzles to assisting with production-scale software development. Modern coding LLMs (Team et al., 2025a; b; Anthropic, 2025; Team et al., 2025c; Team, 2025) now exhibit strong multi-turn reasoning, long-context handling, and tool-use capabilities, enabling them to serve as autonomous coding agents that plan, edit, test, and deploy software. This shift has motivated a wave of benchmarks designed to measure their utility. However, existing evaluations fall short in capturing the full scope of real-world software engineering: most remain restricted to single-file tasks, Python-centric bug fixing, or synthetic algorithmic problems (Zheng et al., 2023a; Austin et al., 2021; Li et al., 2022; Jain et al., 2024; Zhuo et al., 2025), leaving critical developer activities such as feature implementation, refactoring, configuration, and performance optimization underexplored.

Figure 1: Comparative analysis: model performance across task types (left) and language coverage across benchmarks (right). Panels: (a) Task-Specific Resolve Rates; (b) Language Distribution Across Benchmarks.

Recent repository-grounded benchmarks, such as SWE-bench and its variants (Jimenez et al., 2024; Yang et al., 2025a; Badertdinov et al., 2025), have improved ecological validity by embedding evaluations in real issues, integrating test oracles, and introducing multi-language (Rashid et al., 2025; Yang et al., 2025a) or multimodal (Yang et al., 2025b) extensions. Yet these efforts largely converge on bug fixing as the dominant evaluation axis. As a result, they neglect the breadth of software engineering workflows that unfold across diverse scenarios (ranging from infrastructure and security engineering to machine learning system development) and across heterogeneous programming ecosystems. This narrowness prevents systematic capability diagnosis and obscures whether strong performance arises from generalizable reasoning or from artifact-specific adaptation.

To address these limitations, as shown in Figure 1, we present SWE-Compass, a unified benchmark comprising 2,000 verified instances for evaluating LLMs’ agentic coding abilities. SWE-Compass spans 8 task types, 8 programming scenarios, and 10 programming languages, combining broad coverage with rigorous evaluation fidelity. Each instance is paired with executable environments and reproducible tests, enabling fair comparison across prompting and agent-based methods under controlled budgets. Importantly, SWE-Compass is built upon four design principles: (i) real-world alignment, ensuring data originates from genuine developer interactions; (ii) comprehensive coverage across diverse tasks and languages; (iii) systematic taxonomy, providing structured labeling and balanced distributions; and (iv) evaluation fidelity, guaranteeing that all instances are executable and verifiable. Together, these principles yield a benchmark that reflects the complexity, diversity, and reproducibility demanded by modern software engineering.

Our contributions are threefold:

•
A comprehensive, execution-grounded benchmark for software engineering. We introduce SWE-Compass, a large-scale benchmark comprising 2,000 curated instances that span eight task types, eight programming scenarios, and ten programming languages. Each instance is drawn from real-world GitHub pull requests and paired with a reproducible execution environment, enabling rigorous and faithful evaluation of model performance in realistic development workflows.
•
A systematic evaluation framework aligned with real-world developer activities. SWE-Compass establishes a structured taxonomy to assess models across different dimensions such as feature implementation, refactoring, test generation, and deployment. This design enables fine-grained diagnosis of LLM capabilities and provides a principled foundation for comparing agentic coding systems under consistent conditions.
•
Comprehensive empirical analysis and insights into LLM coding behavior. Experiments with state-of-the-art LLMs and agentic systems reveal persistent gaps across tasks, languages, and scenarios, highlighting the difficulty of scaling beyond bug fixing and emphasizing the need for benchmarks that reflect the full complexity of real-world software engineering.

2 Related Work

Coding LLMs and Agents. Code large language models (Code LLMs) (Chen et al., 2021; Zhao et al., 2024; Chowdhery et al., 2023; Nijkamp et al., 2023; Fried et al., 2023; Xu et al., 2022; Roziere et al., 2023; Hui et al., 2024a; Deng et al., 2025; Que et al., 2024) excel at a wide range of programming tasks, including code generation, completion, repair, translation, code comprehension, documentation generation, and cross-language migration. Crucially, modern Code LLMs combine ultra-long context support with robust tool-calling capabilities, enabling them to maintain global awareness across large codebases while actively invoking editors, shells, debuggers, or web browsers (Liu et al., 2024a; 2025; Wang et al., 2024). This synergy has fueled the rise of agentic coding systems, such as SWE-Agent (Yang et al., 2024), OpenHands (Wang et al., 2025), Claude Code (Anthropic, 2025), QwenCode (Team, 2025), Codex (OpenAI, 2025), and Cline (Cline, 2024), that autonomously plan, search, edit, test, and even browse the web to fetch live API documentation or solutions. As evaluations move toward dynamic, repository-scale workflows, these agent-based systems are showing improved performance over traditional code-generation approaches, particularly in tasks requiring persistent context, environment interaction, and multi-step reasoning (Liu et al., 2024b; He et al., 2025).

Coding Benchmarks. Single-file code benchmarks, such as HumanEval (Zheng et al., 2023a), MBPP (Austin et al., 2021), CodeContests (Li et al., 2022), LiveCodeBench (Jain et al., 2024), and BigCodeBench (Zhuo et al., 2025), evaluate models on isolated algorithmic problems, abstracting away the structural, contextual, and environmental complexity inherent in real-world software engineering (Liu et al., 2024c). SWE-bench (Jimenez et al., 2024) and its variants, including Multimodal SWE-bench (Yang et al., 2025c), SWE-bench Multilingual (Yang et al., 2025a), SWE-bench-Live (Zhang et al., 2025a), SWE-Lancer (Miserendino et al., 2025), SWE-rebench (Badertdinov et al., 2025), and others, have substantially improved ecological validity by grounding evaluation in real repository issues and incorporating dimensions such as visual context, multi-language support, tool interaction, and repository-scale execution. However, they remain overwhelmingly confined to bug fixing as the de facto evaluation paradigm. They neglect the broader spectrum of developer activities such as feature implementation, refactoring, configuration, performance optimization, and test generation, which unfold across diverse engineering contexts including application and infrastructure development, ML/AI systems, security, UI/UX, and beyond. This omission precludes fine-grained, scenario-aware capability analysis and obscures whether model performance stems from general reasoning, domain adaptation, or artifact overfitting. To address this gap, we introduce a benchmark that explicitly structures evaluation along orthogonal axes of task type and programming scenario, enabling systematic diagnosis of model strengths and weaknesses across the multifaceted reality of software development, rather than reducing it to a single, narrow slice.

3 SWE-Compass

3.1 Overview

Existing software engineering benchmarks primarily focus on Python-centric bug fixing tasks, exhibiting limited task coverage and insufficient alignment with real-world developer activities. In contrast to such benchmarks, which concentrate on a single programming language and task type (see the comparison in Table 1), SWE-Compass is constructed from authentic software engineering requirements. It is built from a large volume of high-quality GitHub pull requests that pass through a multi-stage filtering and construction process. The resulting benchmark encompasses 2,000 instances across 8 types of code-related tasks, 8 programming scenarios, and 10 programming languages, as shown in Figure 2. It enables a comprehensive evaluation of key software engineering capabilities, including bug fixing, performance optimization, and other related tasks, offering a holistic assessment of model performance in realistic software engineering contexts.

Figure 2: Distributions across task types, programming scenarios and languages.

Table 1: Comprehensive comparison of SWE-Compass with existing benchmarks across different dimensions.

3.2 Design Principles

SWE-Compass is designed around four guiding principles that distinguish it from existing software engineering benchmarks:

•
Real-World Alignment: The benchmark is grounded in authentic developer workflows by collecting tasks from large-scale discussions on GitHub and Stack Overflow. This ensures that evaluation scenarios directly reflect the diversity and complexity of real-world software engineering requirements rather than simplified or synthetic problem settings.
•
Comprehensive and Balanced Coverage: SWE-Compass systematically spans the full spectrum of software engineering activities (including implementation, enhancement, maintenance, testing, and deployment) while ensuring balanced distributions across diverse programming scenarios and 10 programming languages. Unlike prior benchmarks, which are heavily skewed toward Python-centric bug fixing tasks, SWE-Compass deliberately broadens coverage to underrepresented categories such as refactoring, performance optimization, and code understanding.
•
Systematic Taxonomy: Through an iterative active learning pipeline, SWE-Compass distills raw developer discussions into a structured taxonomy of task types, programming scenarios, and languages. This taxonomy provides a principled framework for data collection, classification, and synthesis, ensuring both granularity and scalability.
•
Evaluation Fidelity: All benchmark instances are tied to executable test patches and reproducible environments, ensuring that evaluation results reflect genuine functional correctness. For task categories underrepresented in real-world repositories, SWE-Compass supplements data through carefully controlled synthesis while maintaining consistency with real-world task characteristics.

Figure 3: Construction of SWE-Compass.

3.3 Benchmark Construction

The construction of SWE-Compass follows a systematic and scalable approach organized into five major steps to ensure comprehensive coverage, balance, and real-world relevance: (1) user analysis, (2) data collection, (3) environment building, (4) task construction, and (5) data validation, as illustrated in Figure 3. Specifically, through an iterative Active Learning procedure applied to real-world coding conversations, we first identified that user needs predominantly fall into eight distinct task types, eight representative programming scenarios, and ten programming languages. We then collected a large volume of high-quality pull request (PR) data from GitHub repositories. By combining automated processing with expert annotation, we successfully built a set of executable development environments. Next, for each of the eight task types, we constructed and synthesized the corresponding task instances. Finally, after a multi-round filtering and quality validation process, we curated the SWE-Compass benchmark as the final dataset.

3.3.1 Step 1: User Analysis

To ensure that the evaluation accurately reflects model capabilities in realistic software development contexts, we collected repository-level coding discussions from two major platforms, Stack Overflow and GitHub. To discover emerging task categories, we designed an automated Active Learning framework for category discovery. Specifically, four popular software-related topics were chosen as initial label seeds for both task types and programming scenarios. Using an In-Context Learning (ICL)-based labeling approach, a large language model (LLM) was employed to annotate the collected conversations across three dimensions: task type, programming scenario, and programming language. Subsequently, tag clustering and LLM-guided seed optimization (via addition, modification, or deletion of tags) were applied to refine the label pool. The iterative process continued until convergence, that is, until the updated seed pool no longer differed significantly from the previous ICL-generated pool. In our experiments, the Qwen3-Coder-30B-A3B-Instruct model (https://huggingface.co/Qwen/Qwen3-Coder-30B-A3B-Instruct) was used as the LLM annotator, and five iterations were performed in total. Ultimately, we identified eight task types, eight programming scenarios, and ten major programming languages as follows.

Task Types:

•
Feature Implementation (FI): Developing features or modules from scratch, representing a core activity distinct from modifications or bug fixes.
•
Feature Enhancement (FE): Modifying or enhancing existing features to improve functionality, excluding any bug-related changes.
•
Bug Fixing (BF): Identifying, diagnosing, and resolving defects in the code, including troubleshooting and debugging.
•
Refactoring (RF): Improving the structure and maintainability of the code without altering its external behavior or functionality.
•
Performance Optimization (PO): Enhancing system efficiency and resource utilization, focusing specifically on performance improvements and distinct from refactoring.
•
Code Understanding (CU): Exploring, analyzing, and understanding code through static and dynamic analysis, including generating reports.
•
Test Case Generation (TG): Automatically generating unit and integration tests to validate code and ensure quality assurance.
•
Configuration & Deployment (CD): Setting up environments, managing dependencies, and writing deployment scripts to ensure smooth application operation.

Programming Scenarios:

•
Application Development (AD): Developing applications for specific environments such as web or desktop platforms, with an emphasis on feature implementation and platform adaptation.
•
Database Systems (DS): Designing, developing, managing, and optimizing databases to ensure efficient data storage, access, and consistency.
•
Data Science & Engineering (DE): Handling data processing, analysis, mining, ETL, and feature engineering, emphasizing data-driven decision-making and efficient pipeline construction.
•
Machine Learning & AI (ML): Training models, building recommendation systems, applying algorithms to enable intelligent decision-making and predictions.
•
Infrastructure Development (ID): Building foundational systems such as distributed architectures, system deployment, and DevOps tools, emphasizing stability, scalability, and automation.
•
Specialized Programming Domains (SPD): Addressing areas such as graphics, gaming, multimedia, and networking that require specialized technical expertise and tailored solutions.
•
Security Engineering (SE): Ensuring application and system security, identifying vulnerabilities, and implementing measures such as encryption to maintain compliance with security standards.
•
UI/UX Engineering (UI/UX): Designing and optimizing user interfaces and experiences across platforms to enhance visual appeal, usability, and consistency.

Programming Languages: Python, JavaScript, TypeScript, Java, C, C++, Go, Rust, Kotlin, C#.
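
The category-discovery loop of Step 1 can be illustrated with a minimal sketch. It is not the pipeline used to build SWE-Compass: the `annotate` callable stands in for the ICL-based LLM annotator, a simple frequency threshold (`min_support`) replaces tag clustering and LLM-guided seed optimization, and Jaccard similarity with a tolerance `tol` is one assumed way to decide that the pool "no longer significantly differs". All names and thresholds are hypothetical.

```python
from collections import Counter
from typing import Callable, Iterable


def jaccard(a: set[str], b: set[str]) -> float:
    """Overlap between two label pools (1.0 means identical)."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def refine_label_pool(
    conversations: Iterable[str],
    seeds: set[str],
    annotate: Callable[[str, set[str]], str],
    min_support: int = 2,   # assumed: drop labels seen fewer times than this
    tol: float = 0.9,       # assumed convergence threshold on Jaccard overlap
    max_iters: int = 5,
) -> set[str]:
    """Iteratively re-label conversations and update the label pool."""
    conversations = list(conversations)
    pool = set(seeds)
    for _ in range(max_iters):
        # Label every conversation given the current pool (ICL stand-in).
        counts = Counter(annotate(text, pool) for text in conversations)
        # Stand-in for clustering/seed optimization: keep well-supported labels.
        new_pool = {label for label, n in counts.items() if n >= min_support}
        if jaccard(pool, new_pool) >= tol:
            return new_pool  # pool has stabilized
        pool = new_pool
    return pool


# Toy keyword annotator used only to exercise the loop; a real annotator
# would prompt an LLM with the current pool as in-context examples.
KEYWORDS = {
    "bug": "Bug Fixing",
    "slow": "Performance Optimization",
    "refactor": "Refactoring",
}


def toy_annotate(text: str, pool: set[str]) -> str:
    for keyword, label in KEYWORDS.items():
        if keyword in text.lower():
            return label
    return "Other"
```

With six toy conversations (two about bugs, two about slowness, one refactoring request, one unrelated) and the single seed "Bug Fixing", the loop adds "Performance Optimization" in the first pass and stops in the second, because the pool no longer changes.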

3.3.2 Step 2: Data Collection

To ensure both coverage and realism in benchmark construction, we conducted the following data collection process. Specifically, we first gathered existing open-source SWE benchmarks (e.g., Python bug fixing datasets) and mapped them to our defined taxonomy of task types, programming scenarios, and languages. As shown in Appendix A.5, these benchmarks exhibit severe deficiencies across multiple dimensions: many task types are missing, scenario distributions are highly imbalanced, and programming languages are heavily skewed. To address these limitations, we further supplemented the dataset with high-quality repositories from GitHub using the following strategy.

•
High-Quality Repository Acquisition. Repositories were filtered using multiple quality indicators, including: valid open-source licenses, at least 500 stars, active maintenance within the past six months, at least three distinct contributors, more than 1000 issues and PRs, more than 200 forks, and the presence of executable unit tests. This process yielded a large set of diverse, actively maintained repositories across 10 programming languages.
•
High-Quality PR Acquisition. Within the filtered repositories, we extracted all associated PRs and applied multi-stage filtering to retain only those with clear and meaningful modification semantics. Specifically, we kept PRs that were successfully merged into the main branch, linked to descriptive Issues, and contained identifiable file- or line-level changes. Each retained PR was required to have complete metadata, including repository, issue description, commit, test patch, and code patch.

After all filtering stages, approximately 50,000 high-quality PRs were preserved, serving as the foundation for subsequent environment and task construction.
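
The repository-level quality indicators listed above can be expressed as a single predicate. The sketch below is illustrative only: the `RepoStats` record and its field names are hypothetical, "six months" is approximated as 183 days, and how each statistic is obtained (for example, from a hosting API) is left out.

```python
from dataclasses import dataclass
from datetime import date, timedelta


@dataclass
class RepoStats:
    """Hypothetical summary of one repository's metadata."""
    has_valid_license: bool
    stars: int
    last_commit: date
    contributors: int
    issues_and_prs: int
    forks: int
    has_unit_tests: bool


def is_high_quality(repo: RepoStats, today: date) -> bool:
    """Apply the quality thresholds stated in the text."""
    # Assumption: "past six months" is approximated as 183 days.
    recently_active = today - repo.last_commit <= timedelta(days=183)
    return (
        repo.has_valid_license
        and repo.stars >= 500          # at least 500 stars
        and recently_active            # maintained within six months
        and repo.contributors >= 3     # at least three contributors
        and repo.issues_and_prs > 1000 # more than 1000 issues and PRs
        and repo.forks > 200           # more than 200 forks
        and repo.has_unit_tests        # executable unit tests present
    )
```

Keeping every threshold in one function makes it easy to audit the filter against the stated criteria and to adjust a single cutoff without touching the rest of the pipeline.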

3.3.3 Step 3: Environment Building

To enable reproducible execution and evaluation of each software engineering instance, we constructed isolated containerized environments for all selected PRs. For each PR, we automatically extracted environment dependency information—such as package managers, required libraries, build tools, and runtime versions—from configuration files (e.g., requirements.txt, setup.py, Makefile, and CI/CD scripts). These dependencies were programmatically organized into corresponding Dockerfiles, from which initial Docker images were generated.

Each successfully built image was then validated by executing the repository’s native test suite to verify that it could run end-to-end and reproduce the functionality and performance behavior before and after patch application, i.e., fail-to-pass and pass-to-pass (F2P/P2P) consistency. The initial automated build success rate was around 2%, reflecting the inherent complexity and dependency fragility in real-world repositories.
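
As a rough illustration of the F2P/P2P check, the sketch below compares per-test outcomes before and after applying a patch. It reads F2P and P2P as fail-to-pass and pass-to-pass, following common SWE-bench usage. The function names, the dictionary representation of test results, and the acceptance rule (at least one F2P test and no regressions) are assumptions, not the validation logic used for SWE-Compass.

```python
def split_test_transitions(
    before: dict[str, bool], after: dict[str, bool]
) -> tuple[list[str], list[str], list[str]]:
    """Classify tests by their pass/fail transition across the patch.

    `before` and `after` map test names to True (passed) or False (failed).
    Assumption: a test missing from a run is treated as failing in that run,
    so tests introduced by the test patch count as fail-to-pass.
    """
    f2p = sorted(t for t, ok in after.items() if ok and not before.get(t, False))
    p2p = sorted(t for t, ok in after.items() if ok and before.get(t, False))
    p2f = sorted(t for t, ok in before.items() if ok and not after.get(t, False))
    return f2p, p2p, p2f


def is_consistent(before: dict[str, bool], after: dict[str, bool]) -> bool:
    """Assumed acceptance rule: the patch fixes something and breaks nothing."""
    f2p, _p2p, p2f = split_test_transitions(before, after)
    return len(f2p) > 0 and len(p2f) == 0
```

An environment whose test runs do not reproduce the expected transitions would be flagged for inspection rather than retained.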

To address build failures, 30 expert annotators inspected the corresponding build logs, identified root causes (e.g., missing dependencies, version conflicts, or OS-level mismatches), and applied targeted fixes before re-triggering the build on Kubernetes. This expert-assisted retry process raised the overall retention rate to approximately 8%.

Finally, we obtained about 4,000 successfully runnable Docker images, each providing a fully reproducible and verifiable execution environment for downstream task synthesis and evaluation.

3.3.4 Step 4: Task Building

Given the heterogeneity of the eight software engineering task types, we designed three complementary strategies to construct diverse and representative task instances: (1) Checklist Synthesis, (2) Reverse Masking, and (3) Targeted Filtering. Each strategy was tailored to the specific characteristics of the corresponding task type, ensuring both task realism and evaluation reliability.

•
Checklist Synthesis. For the Code Understanding task, where the goal is to evaluate a model’s capability to comprehend and reason about code semantics, we adopted a data synthesis pipeline. Each instance was built from a combination of Issue, Code Patch, and Test Patch extracted from real PRs. We then prompted GPT-5 to generate multiple natural-language queries for each PR (e.g., “Which functionality does this change affect?”). These generated queries were filtered by a difficulty-aware scoring function to remove trivial or ambiguous cases. For each remaining query, GPT-5 was further instructed to produce checklists of key reasoning points (covering functional intent, dependency relationships, and potential code effects) to support consistent LLM-as-a-Judge evaluation. This approach ensures that each instance contains both high-quality queries and verifiable reasoning anchors, enhancing its diagnostic value in assessing model understanding.
•
Reverse Masking. For tasks related to deployment and test generation, we employed a reverse construction strategy starting from verified “golden” artifacts and introducing controlled perturbations. For Configuration & Deployment, we randomly removed or replaced dependency packages in the Dockerfile, generating candidate Dockerfiles that may fail. Only those triggering reproducible build failures or functional inconsistencies (P2F) were retained, enabling evaluation of models’ ability to detect and fix deployment issues. For Test Case Generation, we selected PRs with more than five new test functions and formulated prompts using the corresponding Code Patch and Test Patch. GPT-5 generated evaluation queries such as “Generate unit tests to verify the correctness of the following patch.” Instances that resulted in incomplete tests were retained to assess models’ capability to produce complete and correct test suites. Correctness and coverage metrics served as quantitative evaluation criteria.
•
Targeted Filtering. For patch-based tasks such as Performance Optimization, Refactoring, Feature Enhancement, Feature Implementation, and Bug Fixing, we directly leveraged real-world PRs and applied targeted filtering rules to identify representative examples. Specifically, PRs that passed all unit tests both before and after patch application but exhibited runtime performance improvements exceeding 30% were labeled as Performance Optimization seeds. We then used GPT-5 to verify whether the changes explicitly addressed performance concerns described in the associated Issue. For Refactoring tasks, we selected PRs that introduced substantial structural or readability improvements (e.g., function decomposition, code abstraction, or naming consistency) without altering external functionality. For the remaining three task types, classification was guided by patch intent and behavioral context: (1) Feature Implementation instances corresponded to cases introducing entirely new modules or functionalities; (2) Feature Enhancement focused on improvements or extensions to existing components; and (3) Bug Fixing captured patches that directly resolved error logs, exceptions, or failing tests. Each identified instance was validated for logical consistency, build reproducibility, and semantic clarity to ensure task fidelity.
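
Two of the construction rules above lend themselves to a short sketch: perturbing a golden dependency list for Configuration & Deployment, and the 30% runtime rule for Performance Optimization seeds. Both functions are simplified stand-ins. `mask_dependency` works on a plain list of package names rather than a real Dockerfile and only removes (never replaces) a package, and `is_po_seed` reads "improvement exceeding 30%" as a relative reduction in runtime. All names are hypothetical.

```python
import random


def mask_dependency(
    dependencies: list[str], rng: random.Random
) -> tuple[list[str], str]:
    """Remove one randomly chosen package from a golden dependency list.

    Returns the perturbed list and the removed package. The caller is
    expected to keep the case only if the perturbed environment fails
    reproducibly, as described in the text.
    """
    if not dependencies:
        raise ValueError("need at least one dependency to mask")
    index = rng.randrange(len(dependencies))
    masked = dependencies[:index] + dependencies[index + 1:]
    return masked, dependencies[index]


def is_po_seed(
    tests_pass_before: bool,
    tests_pass_after: bool,
    runtime_before_s: float,
    runtime_after_s: float,
    min_improvement: float = 0.30,
) -> bool:
    """Flag a PR as a Performance Optimization seed candidate.

    Assumption: improvement is measured as the relative runtime reduction,
    (before - after) / before, and must strictly exceed `min_improvement`.
    """
    if not (tests_pass_before and tests_pass_after):
        return False
    if runtime_before_s <= 0:
        return False
    improvement = (runtime_before_s - runtime_after_s) / runtime_before_s
    return improvement > min_improvement
```

Passing an explicit `random.Random` instance keeps the perturbation reproducible, which matters when a failing case must be rebuilt later for verification.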

3.3.5 Step 5: Data Validation

To ensure both diversity and quality in the final benchmark, we applied a structured sampling and validation process designed to balance task coverage, control instance difficulty, and guarantee overall dataset reliability.

•
Difficulty Filtering. Each candidate instance was first evaluated based on the number of modified files, the number of changed lines, and additional signals derived from multiple model inferences. This screening process ensures that all retained samples exhibit moderate and meaningful problem complexity, making them suitable for rigorous model evaluation.
•
Task-balanced Sampling. Balanced sampling was performed to maintain diversity across task and scenario dimensions. Sampling weights were further adjusted to reflect realistic distributions across 10 major programming languages, aligning with real-world open-source practices.
•
Manual Verification. All sampled instances underwent expert validation to confirm executability, correctness, and semantic consistency between commits, queries, Docker images, and corresponding test cases. Only verified instances were retained in the benchmark.

As a result, we constructed the comprehensive benchmark SWE-Compass, which contains 2,000 high-quality instances, well-balanced across task categories, programming scenarios, and languages, providing a rigorous and representative evaluation framework for assessing the capabilities of large language models in real-world software engineering tasks.
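
The difficulty filtering and task-balanced sampling steps can be sketched as follows. This is a minimal illustration under stated assumptions: instances are plain dictionaries with hypothetical keys (`files`, `lines`, `task`), the difficulty band uses made-up thresholds, and the model-inference signals and language-specific sampling weights mentioned above are omitted.

```python
import random
from collections import defaultdict
from typing import Callable, Hashable


def within_difficulty_band(
    instance: dict,
    max_files: int = 10,    # hypothetical thresholds, not from the paper
    min_lines: int = 5,
    max_lines: int = 500,
) -> bool:
    """Keep instances whose patch size suggests moderate complexity."""
    return (
        1 <= instance["files"] <= max_files
        and min_lines <= instance["lines"] <= max_lines
    )


def balanced_sample(
    instances: list[dict],
    key: Callable[[dict], Hashable],
    per_group: int,
    seed: int = 0,
) -> list[dict]:
    """Draw up to `per_group` instances from every group defined by `key`."""
    rng = random.Random(seed)  # fixed seed for reproducible sampling
    groups: dict[Hashable, list[dict]] = defaultdict(list)
    for instance in instances:
        groups[key(instance)].append(instance)
    sample: list[dict] = []
    for name in sorted(groups, key=str):  # deterministic group order
        members = groups[name]
        sample.extend(rng.sample(members, min(per_group, len(members))))
    return sample
```

For example, `balanced_sample(kept, key=lambda i: (i["task"], i["scenario"]), per_group=30)` would cap every task-scenario cell at 30 instances, assuming each instance also carries a `scenario` field.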

3.4 Evaluation Metrics

For each type of task, we select appropriate evaluation metrics to measure the model’s performance. These include:

1. Pass@1: The fraction of resolved samples achieved under a single attempt with fixed decoding and resource budgets.
2. Performance Optimization Score: A binary indicator (0/1). The score is 1 if the model’s optimized code passes a single test and the time spent on execution is less than 80% of the time taken by the unoptimized code; otherwise, the score is 0.
3. Line Coverage: This metric evaluates the extent to which the program code has been executed during test case execution. The formula for calculating line coverage is:
  
  \[ \text{Line Coverage} = \frac{\text{Number of Executed Code Lines}}{\text{Total Number of Code Lines}} \times 100\% \]
4. LLM-As-A-Judge Score: Following prior work (Zheng et al., 2023b; Zhang et al., 2025b; c; Li et al., 2025), we use a large language model (LLM) to review the model output according to a checklist; the final score is the proportion of checkpoints passed by the model output.

For specific tasks, the following metrics are used. For Feature Implementation, Feature Enhancement, Bug Fixing, and Refactoring, Pass@1 is used to measure the model’s performance. For Performance Optimization, the Performance Optimization Score is used to evaluate the model’s performance. For Test Case Generation, we employ Line Coverage to assess the quality of the test cases generated by the model. In our implementation, we use pytest (Pajankar, 2017) to compute line coverage for Python. For TypeScript and JavaScript, we use C8 (Vassudanagunta, 2025) to calculate line coverage. For Code Understanding, we use the LLM-As-A-Judge Score to evaluate the accuracy of the model’s understanding of the code. The specific prompt can be found in Appendix A.1.
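
The four metrics in this section reduce to simple computations once the per-instance outcomes are known. The sketch below shows one way to write them down. The function names and input representations are illustrative assumptions, and the hard parts (running tests, timing code, collecting coverage, querying the judge model) are outside its scope.

```python
def pass_at_1(resolved: list[bool]) -> float:
    """Fraction of instances resolved in a single attempt."""
    if not resolved:
        raise ValueError("no instances to score")
    return sum(resolved) / len(resolved)


def performance_optimization_score(
    tests_passed: bool, optimized_s: float, baseline_s: float
) -> int:
    """1 if tests pass and runtime is below 80% of the baseline, else 0."""
    return int(tests_passed and optimized_s < 0.8 * baseline_s)


def line_coverage(executed_lines: int, total_lines: int) -> float:
    """Executed lines over total lines, as a percentage."""
    if total_lines <= 0:
        raise ValueError("total_lines must be positive")
    return 100.0 * executed_lines / total_lines


def llm_judge_score(checkpoints_passed: list[bool]) -> float:
    """Proportion of checklist items the judge marked as satisfied."""
    if not checkpoints_passed:
        raise ValueError("empty checklist")
    return sum(checkpoints_passed) / len(checkpoints_passed)
```

For instance, a test suite that executes 45 of 60 lines has a line coverage of 75%, and an answer satisfying three of four checklist items receives a judge score of 0.75.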

4 Experiments

4.1 Evaluated LLMs and Frameworks

Benchmarks and Tracks

We evaluate SWE-Compass under two tracks (Executable and Non-executable). By default, results are aggregated in a distribution-aligned manner over Task Type × Programming Scenario × Language, following §3, with fixed seeds. Construction scale and composition are in §3 (Table 1).

Frameworks

We evaluate two offline agent workflows with identical executors: SWE-Agent (hardened edit–diff–execute loop) and Claude Code (sandboxed, editor-centric with parallel tool calls). Both use containerized, network-disabled toolchains with standardized build/test commands and execution hardening for reproducibility; complete workflow notes (including the parallel tool-call prompt) and the command matrix are in Appendix A.4.

Environment, Budgets, and Metrics

All evaluations run in fixed offline containers with unified budgets and executors; networking is disabled, and retries are not used. We adopt a single-attempt setting with standard decoding and turn/time limits, and evaluate using the task-type–aligned decision rules and metrics defined in §3. Exact configuration (timeouts, hardware, container versions, context windows, caches) and any method-specific deviations are provided in Appendix A.4.

LLMs

We evaluate 10 models under a unified leaderboard (no reasoning/non-reasoning split): Claude-Sonnet-4-20250514, Qwen3-Coder-480B-A35B-Instruct, Qwen3-Coder-30B-A3B-Instruct, Qwen3-235B-A22B-Instruct-2507, Kimi-K2-Instruct-0905, Gemini-2.5-Pro, Gemini-2.5-Flash, GPT-4.1-2025-04-14, DeepSeek-V3-0324, and SWE-agent-LM-32B. Model API pages and open-source deployment links are provided in Appendix A.6.

4.2 Experimental Results

4.2.1 Main Results

Table 2: Main results by task types on SWE-Compass. AVG is the macro-average across task types. Abbreviations: FI=Feature Implementation; FE=Feature Enhancement; BF=Bug Fixing; RF=Refactoring; PO=Performance Optimization; CU=Code Understanding; TG=Test Case Generation; CD=Configuration & Deployment.

Table 2 reports Pass@1 by task type. Claude-Sonnet-4-20250514 ranks first under both workflows (32.9% with Claude Code; 31.8% with SWE-Agent). Scores largely cluster in the low-to-mid 20s (overall range roughly 10–33%). Neither workflow is uniformly better: among the five models evaluated under both, only two achieve higher AVG with Claude Code, whereas three obtain higher AVG with SWE-Agent. Among open-weight systems, Qwen3-Coder-480B-A35B-Instruct reaches 27.2% with SWE-Agent and 21.9% with Claude Code, still below the best proprietary model. The findings across task types and workflows are as follows:

Findings by task type.

A consistent but nuanced hierarchy emerges (Table 2). Code Understanding (CU) is among the strongest categories across models. Configuration & Deployment (CD) can be high for some systems (e.g., Claude-Sonnet-4) but exhibits sizable cross-model variance, so it is not uniformly easy. Feature Enhancement (FE) and Refactoring (RF) occupy a middle tier. Feature Implementation (FI) and Bug Fixing (BF) are harder, reflecting localization and integration challenges. Test Case Generation (TG) and Performance Optimization (PO) remain challenging, but not to single-digit averages; results typically fall in the mid-teens to mid-20s depending on the model. Method-wise, SWE-Agent tends to be stronger on BF and parts of FI that benefit from iterative localization, whereas Claude Code shows advantages on TG and some CD cases with more deterministic signals; CU is broadly comparable across the two.

Framework comparison.

Across most settings, the two agents exhibit complementary strengths. Mechanistically, SWE-Agent’s edit–diff–execute loop favors investigative, multi-file tasks that reward iterative localization, at the cost of higher timeout exposure; Claude Code’s sandboxed, editor-centric workflow yields strong performance on well-scoped, deterministic tasks (e.g., CD, CU, TG), benefitting from lower tool overhead. We also observe a trade-off with interaction efficiency: improvements in Eval score often coincide with higher average interaction turns (cf. Figure 4 and Figure 5), with diminishing returns beyond moderate turn counts, suggesting that future gains require better localization and hypothesis pruning rather than simply more exploration.

4.2.2 Further Analysis

Figure 4: Comparison of Pass@1 (%) across the top programming languages for SWE-Agent. Bars represent Pass@1; languages are ordered by overall Pass@1. This plot highlights whether improvements are concentrated in specific languages.

Figure 5: Distribution of interaction turns required per language for selected models to reveal trade-offs between effort (turns) and success. This highlights whether models achieve high Pass@1 by spending more turns on particular languages.

Language-level observations.

Figure 4 and Appendix Table 4 indicate a consistent cross-language stratification across models and agents. JVM ecosystems and JavaScript tend to score higher (Java/Kotlin/JavaScript), while TypeScript is notably lower; systems languages (C/C++/Rust/Go) are harder; Python appears mid-tier overall, partly reflecting dataset selection effects: open-source benchmarks over-index on difficult Python bug-fixing cases (Appendix A.5). For Claude-Sonnet-4, Claude Code shows gains on Java/JavaScript, but this is not universal across models; in C#, C/C++/Rust/Go, SWE-Agent often matches or outperforms it. These patterns suggest that performance is governed more by tooling determinism and diagnosability than by raw coding difficulty. Priorities therefore differ by stack: repository-level localization and environment hardening for systems/Python stacks, and hypothesis pruning for deterministic JVM/JS pipelines. See Figure 6 for a visualization.

Interaction turns vs. success (by language).

Figure 5 shows per-language turn distributions with Pass@1 overlays. Deterministic ecosystems (Java/Kotlin/JavaScript/C#) have lower medians and tighter IQRs under Claude Code, while achieving similar or higher Pass@1—gains come from reliable signals rather than more turns. Systems languages (C/C++/Rust/Go) exhibit heavier tails, especially for SWE-Agent, with clear diminishing returns; Rust is most brittle. Python shows high variance: pinned environments converge quickly under Claude Code, while heterogeneity pushes SWE-Agent to many low-yield turns. Overall, prioritize repository-level localization for systems languages and environment hardening for Python; in JVM/JS, focus on sharper hypothesis pruning and parallel validation.
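The median and interquartile range (IQR) used to describe the turn distributions can be computed as in the minimal sketch below. The helper name `turn_summary`, the inclusive quartile method, and the turn counts are illustrative assumptions, not values or code from the benchmark.

```python
from statistics import median, quantiles


def turn_summary(turns):
    """Return (median, IQR) for a list of interaction-turn counts."""
    # quantiles(..., n=4) yields the three quartile cut points Q1, Q2, Q3.
    q1, _, q3 = quantiles(turns, n=4, method="inclusive")
    return median(turns), q3 - q1


# Hypothetical turn counts per language (illustrative only, not benchmark data).
turns_by_language = {
    "Java": [10, 12, 14, 16, 18],   # tight distribution
    "Rust": [10, 20, 30, 40, 90],   # heavy right tail
}

summary = {lang: turn_summary(t) for lang, t in turns_by_language.items()}
for lang, (med, iqr) in summary.items():
    print(f"{lang}: median={med}, IQR={iqr}")
```

A heavier tail shows up as a wider IQR and a larger gap between the median and the maximum, which is the pattern the text attributes to systems languages.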

Figure 6: (Two panels) (a) This plot illustrates the relationship between skill specialization and performance consistency. Skill specialization is measured by the variability in performance across different tasks/languages, with higher values indicating stronger performance in specific tasks/languages compared to others. Performance consistency, on the other hand, reflects how stable the performance is across different difficulty levels, with higher values indicating more consistent performance over time. (b) The graph shows the relationship between performance variability and overall performance. Performance variability measures how much performance fluctuates across different tasks/languages, with higher values indicating greater inconsistency. Overall performance is calculated as the average score across tasks/languages, with higher values indicating better overall performance.

Consistency and specialization across languages.

Figure 6 summarizes whether strong aggregate results are broad-based or concentrated. In panel (a), aggregate Pass@1 and the per-language median Pass@1 show a clear visual positive trend, indicating that higher-ranked systems tend to improve more consistently across languages rather than relying on a single language. Top systems cluster toward the right with higher consistency, whereas mid-tier models are more dispersed with lower medians. In panel (b), higher overall performance visually coincides with lower cross-language/task variability (coefficient of variation, CV), suggesting that stronger models are generally less variable. Together these observations support our earlier findings: (i) improvements at the top reflect broad gains in localization and execution reliability rather than narrow specialization; (ii) reducing cross-language variance is an effective lever for closing the gap, especially on systems languages that contribute disproportionately to variability; and (iii) evaluation protocols should report both central tendency and dispersion to avoid overstating gains driven by a subset of languages.

Notably, Claude-Sonnet-4-20250514 lies at the far right with one of the lowest variabilities, reflecting broad cross-language gains.
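As a rough illustration of the dispersion statistics discussed above, the sketch below computes the mean, median, and coefficient of variation (CV) of per-language Pass@1 for one model. The function name `language_profile`, the use of the population standard deviation, and the scores are assumptions made for this example; the paper does not specify the exact estimator.

```python
from statistics import mean, median, pstdev


def language_profile(pass_at_1):
    """Summarize per-language Pass@1 (%) for a single model."""
    scores = list(pass_at_1.values())
    avg = mean(scores)
    # CV = standard deviation / mean; population std is an assumption here.
    cv = pstdev(scores) / avg if avg else float("nan")
    return {"mean": avg, "median": median(scores), "cv": cv}


# Hypothetical per-language scores (illustrative only, not benchmark data).
specialized = {"Java": 30.0, "Python": 20.0, "Rust": 10.0}
uniform = {"Java": 20.0, "Python": 20.0, "Rust": 20.0}

profile_specialized = language_profile(specialized)
profile_uniform = language_profile(uniform)
print(profile_specialized)
print(profile_uniform)
```

Both hypothetical models have the same mean, but only the CV separates the uniform profile from the specialized one, which is why reporting dispersion alongside the average matters.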

Table 3: Scores on different scenarios. Abbreviations: AD=Application Development; DE=Data Science & Engineering; DS=Database Systems; ID=Infrastructure Development; ML=Machine Learning & AI; SE=Security Engineering; SPD=Specialized Programming Domains; UI/UX=UI/UX Engineering; AVG=macro-average.

Fine-grained scenario analysis.

Table 3 shows that scenario difficulty closely tracks tooling determinism and the locality of required edits. High-scoring categories such as UI/UX Engineering, Security Engineering, and Application Development combine mature frameworks with clear oracles and fast-running tests, where Claude Code’s editor-centric workflow converts stable feedback into higher Pass@1 with fewer turns. In contrast, Database Systems, Infrastructure Development, ML/AI, and Specialized Programming Domains involve multi-stage builds, cross-process dependencies, or non-deterministic outputs; here, SWE-Agent’s iterative localization is often more resilient but also more exposed to timeouts. The consistent average advantage of Claude Code over SWE-Agent (across scenarios in our setting) is therefore concentrated in pipelines with reliable, low-variance signals. To close the remaining gaps, future systems should (i) enhance repository-level observability and reproducibility for complex stacks (minimal repro scripts, pinned environments, artifact isolation), and (ii) invest in hypothesis pruning and parallel verification for deterministic stacks, where the bottleneck is search efficiency rather than raw exploration budget.
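Table 3 reports AVG as a macro-average, which weights every scenario equally regardless of how many instances it contains. The sketch below contrasts this with an instance-weighted (micro) average, assuming each scenario score is simply the fraction of instances resolved on the first attempt. The function names and the counts are hypothetical.

```python
def pass_at_1(resolved, total):
    """Pass@1 (%) as the share of instances resolved on the first attempt."""
    return 100.0 * resolved / total


def macro_average(per_scenario):
    """Unweighted mean of per-scenario Pass@1 values."""
    rates = [pass_at_1(r, t) for r, t in per_scenario.values()]
    return sum(rates) / len(rates)


def micro_average(per_scenario):
    """Pass@1 over all instances pooled together."""
    resolved = sum(r for r, _ in per_scenario.values())
    total = sum(t for _, t in per_scenario.values())
    return pass_at_1(resolved, total)


# Hypothetical (resolved, total) counts per scenario; not benchmark data.
counts = {"AD": (30, 100), "DS": (5, 50)}

macro = macro_average(counts)
micro = micro_average(counts)
print(f"macro={macro:.1f}, micro={micro:.1f}")
```

When scenario sizes differ, the two averages diverge; the macro-average keeps a small but hard scenario from being drowned out by a large, easier one.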

Failure Mode Analysis.

Figure 7: Distribution of trajectory failure modes on SWE-Compass. Abbreviations: RMI=Requirement Misinterpretation, ISE=Incomplete Solution & Side Effects, TIE=Tool Invocation Error, IAT=Inadequate Testing, TKG=Technical Knowledge Gap, INF=Infinite Loop, OTH=Others.

To systematically understand the limitations of current coding agents, we perform a post-hoc failure analysis on SWE-Agent trajectories from our SWE-Compass benchmark. Following Yang et al. (2024)—who report 87% agreement between automated LLM judges and human experts—we adopt an LLM-as-Judge protocol with Claude-Sonnet-4 as the judge; the exact prompt is provided in Appendix A.3. We sample 600 failed trajectories per model for three representative systems: Claude-Sonnet-4, Qwen3-Coder-480B, and Gemini-2.5-Pro.

Specifically, through manual inspection of submitted-but-failed trajectories, we develop a comprehensive six-category taxonomy capturing actual root causes:

1. Requirement Misinterpretation: The agent failed to properly understand and locate the problem, including misidentifying affected files, misjudging severity, confusing problem types, or failing to identify root causes and understand dependencies, data flow, or system architecture.
2. Inadequate Testing: The agent provided incomplete test coverage, missing edge cases, compatibility issues, performance impacts, integration scenarios, or multi-platform testing requirements.
3. Incomplete Solution & Side Effects: The agent provided an incomplete fix that only addressed symptoms rather than root causes, or introduced new issues, including regressions, security vulnerabilities, environment configuration errors, data corruption risks, or breaking changes to existing functionality.
4. Technical Knowledge Gap: The agent demonstrated insufficient technical proficiency or violated domain-specific conventions, including lacking necessary knowledge in specialized domains (UI/frontend, security, accessibility, DevOps, performance, analytics) or incorrectly handling domain-specific issues (data processing, security implementations, UI/UX standards, API design, documentation synchronization).
5. Tool Invocation Error: The agent encountered errors while using tools due to incorrect syntax, context overflow from file operations, or parse/analysis tool failures.
6. Infinite Loop: The agent got stuck in loops without convergence, including repeated attempts at the same solution, oscillating between decisions, or endlessly reading files without making progress.

Note that OTH (Other) denotes rare cases that are not covered by the above taxonomy (e.g., corrupted artifacts or external executor glitches). The figure caption lists all abbreviations for completeness. As shown in Figure 7, we report the per-model distribution of failure modes on SWE-Compass. Based on 600 error traces per model, we draw the following conclusions: (1) Shared bottlenecks in comprehension and implementation. All models exhibit high error rates in Requirement Misinterpretation (30–34%) and Incomplete Solution & Side Effects (29–42%), together accounting for >60% of failures. By contrast, Technical Knowledge Gap is consistently low (5–8%), suggesting the core limitations lie in requirement grounding and holistic solution design rather than basic coding proficiency. (2) Distinct model characteristics. Claude-Sonnet-4 is the most balanced, showing the lowest Technical Knowledge Gap (4.7%) but room to improve on Inadequate Testing (20.8%). Qwen3-Coder-480B has the highest Incomplete Solution & Side Effects rate (42%, vs. Claude-Sonnet-4’s 32.7%), revealing weaknesses in end-to-end design. Gemini-2.5-Pro shows the highest Requirement Misinterpretation (34%) and a notable Infinite Loop issue (8.3%), posing reliability risks in production.
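Once a judge has assigned one label per failed trajectory, the per-model distribution is a simple tally. The following is a minimal sketch, assuming labels arrive as the abbreviation strings from the Figure 7 caption; the function name, the fallback of unknown labels to OTH, and the label list are hypothetical and do not reproduce the paper's pipeline.

```python
from collections import Counter

# Abbreviations from the Figure 7 caption.
CATEGORIES = ["RMI", "ISE", "TIE", "IAT", "TKG", "INF", "OTH"]


def failure_distribution(labels):
    """Map each category to its percentage share of the labelled trajectories."""
    if not labels:
        raise ValueError("at least one labelled trajectory is required")
    # Assumption: any label outside the taxonomy is counted as OTH.
    counts = Counter(label if label in CATEGORIES else "OTH" for label in labels)
    total = len(labels)
    return {c: 100.0 * counts[c] / total for c in CATEGORIES}


# Hypothetical judge outputs for ten trajectories (not benchmark data).
labels = ["RMI"] * 3 + ["ISE"] * 4 + ["IAT"] * 2 + ["INF"]

dist = failure_distribution(labels)
for category, share in dist.items():
    print(f"{category}: {share:.1f}%")
```

Keeping zero-count categories in the output makes distributions from different models directly comparable.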

5 Conclusion

We introduced SWE-Compass, a unified benchmark that enables systematic evaluation of large language models across diverse software engineering tasks, scenarios, and languages. By integrating 2,000 verified instances derived from real-world GitHub repositories with reproducible execution environments, SWE-Compass provides comprehensive coverage of the software development lifecycle. Our large-scale experiments with ten state-of-the-art LLMs under two agentic frameworks reveal consistent hierarchies of task difficulty, language-specific variability, and dominant failure modes rooted in requirement misinterpretation and incomplete solutions. These findings highlight that future progress in automated software engineering depends less on isolated code generation improvements and more on enhancing requirement grounding, environment reliability, and reasoning consistency. SWE-Compass offers a rigorous, scalable, and reproducible foundation for advancing the next generation of robust, general-purpose coding agents.

6 Future Works

We see several directions to extend SWE-Compass and strengthen the community’s ability to measure and drive progress:

• Scale and coverage. Expand the dataset size, languages (e.g., mobile stacks and diverse SQL dialects), and repository types (monorepos, polyglot services), while maintaining distribution alignment across task, scenario, language, and difficulty.
• Harder long-context settings. Introduce multi-module, cross-process, and build-pipeline tasks that stress architectural coherence, multi-file reasoning, and cross-session memory under strict executability.
• Metrics and protocols. Enrich task-type–aligned metrics with long-context diagnostics (e.g., consistency and variance reporting), stabilize timing/coverage signals, and unify cost/efficiency reporting (turns, wall-clock, tool invocations) under fixed budgets.
• Evaluation tracks. Explore safe online or incremental tracks with evolving repositories and dependency drift, paired with sandboxing, artifact isolation, and replayable logs for fair comparison over time.
• Human-in-the-loop calibration. Establish human adjudication subsets and reliability audits for LLM-as-Judge, improving rubric calibration and model–human agreement on non-executable tasks.
• Reproducibility, safety, and accessibility. Continue releasing containers, minimal repro scripts, and a lightweight subset with stable seeds; strengthen privacy/safety filtering and provide clearer documentation to lower the barrier to participation.

References

Team et al. [2025a] Kimi Team, Yifan Bai, Yiping Bao, Guanduo Chen, Jiahao Chen, Ningxin Chen, Ruijue Chen, Yanru Chen, Yuankun Chen, Yutian Chen, Zhuofu Chen, Jialei Cui, Hao Ding, Mengnan Dong, Angang Du, Chenzhuang Du, Dikang Du, Yulun Du, Yu Fan, Yichen Feng, Kelin Fu, Bofei Gao, Hongcheng Gao, Peizhong Gao, Tong Gao, Xinran Gu, Longyu Guan, Haiqing Guo, Jianhang Guo, Hao Hu, Xiaoru Hao, Tianhong He, Weiran He, Wenyang He, Chao Hong, Yangyang Hu, Zhenxing Hu, Weixiao Huang, Zhiqi Huang, Zihao Huang, Tao Jiang, Zhejun Jiang, Xinyi Jin, Yongsheng Kang, Guokun Lai, Cheng Li, Fang Li, Haoyang Li, Ming Li, Wentao Li, Yanhao Li, Yiwei Li, Zhaowei Li, Zheming Li, Hongzhan Lin, Xiaohan Lin, Zongyu Lin, Chengyin Liu, Chenyu Liu, Hongzhang Liu, Jingyuan Liu, Junqi Liu, Liang Liu, Shaowei Liu, T. Y. Liu, Tianwei Liu, Weizhou Liu, Yangyang Liu, Yibo Liu, Yiping Liu, Yue Liu, Zhengying Liu, Enzhe Lu, Lijun Lu, Shengling Ma, Xinyu Ma, Yingwei Ma, Shaoguang Mao, Jie Mei, Xin Men, Yibo Miao, Siyuan Pan, Yebo Peng, Ruoyu Qin, Bowen Qu, Zeyu Shang, Lidong Shi, Shengyuan Shi, Feifan Song, Jianlin Su, Zhengyuan Su, Xinjie Sun, Flood Sung, Heyi Tang, Jiawen Tao, Qifeng Teng, Chensi Wang, Dinglu Wang, Feng Wang, Haiming Wang, Jianzhou Wang, Jiaxing Wang, Jinhong Wang, Shengjie Wang, Shuyi Wang, Yao Wang, Yejie Wang, Yiqin Wang, Yuxin Wang, Yuzhi Wang, Zhaoji Wang, Zhengtao Wang, Zhexu Wang, Chu Wei, Qianqian Wei, Wenhao Wu, Xingzhe Wu, Yuxin Wu, Chenjun Xiao, Xiaotong Xie, Weimin Xiong, Boyu Xu, Jing Xu, Jinjing Xu, L. H. Xu, Lin Xu, Suting Xu, Weixin Xu, Xinran Xu, Yangchuan Xu, Ziyao Xu, Junjie Yan, Yuzi Yan, Xiaofei Yang, Ying Yang, Zhen Yang, Zhilin Yang, Zonghan Yang, Haotian Yao, Xingcheng Yao, Wenjie Ye, Zhuorui Ye, Bohong Yin, Longhui Yu, Enming Yuan, Hongbang Yuan, Mengjie Yuan, Haobing Zhan, Dehao Zhang, Hao Zhang, Wanlu Zhang, Xiaobin Zhang, Yangkun Zhang, Yizhi Zhang, Yongting Zhang, Yu Zhang, Yutao Zhang, Yutong Zhang, Zheng Zhang, Haotian Zhao, Yikai Zhao, Huabin Zheng, Shaojie Zheng, Jianren Zhou, Xinyu Zhou, Zaida Zhou, Zhen Zhu, Weiyu Zhuang, and Xinxing Zu. Kimi k2: Open agentic intelligence, 2025a. URL https://arxiv.org/abs/2507.20534.
Team et al. [2025b] Meituan LongCat Team, Bayan, Bei Li, Bingye Lei, Bo Wang, Bolin Rong, Chao Wang, Chao Zhang, Chen Gao, Chen Zhang, Cheng Sun, Chengcheng Han, Chenguang Xi, Chi Zhang, Chong Peng, Chuan Qin, Chuyu Zhang, Cong Chen, Congkui Wang, Dan Ma, Daoru Pan, Defei Bu, Dengchang Zhao, Deyang Kong, Dishan Liu, Feiye Huo, Fengcun Li, Fubao Zhang, Gan Dong, Gang Liu, Gang Xu, Ge Li, Guoqiang Tan, Guoyuan Lin, Haihang Jing, Haomin Fu, Haonan Yan, Haoxing Wen, Haozhe Zhao, Hong Liu, Hongmei Shi, Hongyan Hao, Hongyin Tang, Huantian Lv, Hui Su, Jiacheng Li, Jiahao Liu, Jiahuan Li, Jiajun Yang, Jiaming Wang, Jian Yang, Jianchao Tan, Jiaqi Sun, Jiaqi Zhang, Jiawei Fu, Jiawei Yang, Jiaxi Hu, Jiayu Qin, Jingang Wang, Jiyuan He, Jun Kuang, Junhui Mei, Kai Liang, Ke He, Kefeng Zhang, Keheng Wang, Keqing He, Liang Gao, Liang Shi, Lianhui Ma, Lin Qiu, Lingbin Kong, Lingtong Si, Linkun Lyu, Linsen Guo, Liqi Yang, Lizhi Yan, Mai Xia, Man Gao, Manyuan Zhang, Meng Zhou, Mengxia Shen, Mingxiang Tuo, Mingyang Zhu, Peiguang Li, Peng Pei, Peng Zhao, Pengcheng Jia, Pingwei Sun, Qi Gu, Qianyun Li, Qingyuan Li, Qiong Huang, Qiyuan Duan, Ran Meng, Rongxiang Weng, Ruichen Shao, Rumei Li, Shizhe Wu, Shuai Liang, Shuo Wang, Suogui Dang, Tao Fang, Tao Li, Tefeng Chen, Tianhao Bai, Tianhao Zhou, Tingwen Xie, Wei He, Wei Huang, Wei Liu, Wei Shi, Wei Wang, Wei Wu, Weikang Zhao, Wen Zan, Wenjie Shi, Xi Nan, Xi Su, Xiang Li, Xiang Mei, Xiangyang Ji, Xiangyu Xi, Xiangzhou Huang, Xianpeng Li, Xiao Fu, Xiao Liu, Xiao Wei, Xiaodong Cai, Xiaolong Chen, Xiaoqing Liu, Xiaotong Li, Xiaowei Shi, Xiaoyu Li, Xili Wang, Xin Chen, Xing Hu, Xingyu Miao, Xinyan He, Xuemiao Zhang, Xueyuan Hao, Xuezhi Cao, Xunliang Cai, Xurui Yang, Yan Feng, Yang Bai, Yang Chen, Yang Yang, Yaqi Huo, Yerui Sun, Yifan Lu, Yifan Zhang, Yipeng Zang, Yitao Zhai, Yiyang Li, Yongjing Yin, Yongkang Lv, Yongwei Zhou, Yu Yang, Yuchen Xie, Yueqing Sun, Yuewen Zheng, Yuhuai Wei, Yulei Qian, Yunfan Liang, Yunfang Tai, Yunke Zhao, Zeyang Yu, Zhao Zhang, Zhaohua Yang, Zhenchao Zhang, Zhikang Xia, Zhiye Zou, Zhizhao Zeng, Zhongda Su, Zhuofan Chen, Zijian Zhang, Ziwen Wang, Zixu Jiang, Zizhe Zhao, Zongyu Wang, and Zunhai Su. Longcat-flash technical report, 2025b. URL https://arxiv.org/abs/2509.01322.
Anthropic [2025] Anthropic. System Card: Claude Opus 4 & Claude Sonnet 4. https://www-cdn.anthropic.com/6be99a52cb68eb70eb9572b4cafad13df32ed995.pdf, May 2025. Accessed: 2024-05-22.
Team et al. [2025c] GLM-4.5 Team, Aohan Zeng, Xin Lv, Qinkai Zheng, Zhenyu Hou, Bin Chen, Chengxing Xie, Cunxiang Wang, Da Yin, Hao Zeng, Jiajie Zhang, Kedong Wang, Lucen Zhong, Mingdao Liu, Rui Lu, Shulin Cao, Xiaohan Zhang, Xuancheng Huang, Yao Wei, Yean Cheng, Yifan An, Yilin Niu, Yuanhao Wen, Yushi Bai, Zhengxiao Du, Zihan Wang, Zilin Zhu, Bohan Zhang, Bosi Wen, Bowen Wu, Bowen Xu, Can Huang, Casey Zhao, Changpeng Cai, Chao Yu, Chen Li, Chendi Ge, Chenghua Huang, Chenhui Zhang, Chenxi Xu, Chenzheng Zhu, Chuang Li, Congfeng Yin, Daoyan Lin, Dayong Yang, Dazhi Jiang, Ding Ai, Erle Zhu, Fei Wang, Gengzheng Pan, Guo Wang, Hailong Sun, Haitao Li, Haiyang Li, Haiyi Hu, Hanyu Zhang, Hao Peng, Hao Tai, Haoke Zhang, Haoran Wang, Haoyu Yang, He Liu, He Zhao, Hongwei Liu, Hongxi Yan, Huan Liu, Huilong Chen, Ji Li, Jiajing Zhao, Jiamin Ren, Jian Jiao, Jiani Zhao, Jianyang Yan, Jiaqi Wang, Jiayi Gui, Jiayue Zhao, Jie Liu, Jijie Li, Jing Li, Jing Lu, Jingsen Wang, Jingwei Yuan, Jingxuan Li, Jingzhao Du, Jinhua Du, Jinxin Liu, Junkai Zhi, Junli Gao, Ke Wang, Lekang Yang, Liang Xu, Lin Fan, Lindong Wu, Lintao Ding, Lu Wang, Man Zhang, Minghao Li, Minghuan Xu, Mingming Zhao, Mingshu Zhai, Pengfan Du, Qian Dong, Shangde Lei, Shangqing Tu, Shangtong Yang, Shaoyou Lu, Shijie Li, Shuang Li, Shuang-Li, Shuxun Yang, Sibo Yi, Tianshu Yu, Wei Tian, Weihan Wang, Wenbo Yu, Weng Lam Tam, Wenjie Liang, Wentao Liu, Xiao Wang, Xiaohan Jia, Xiaotao Gu, Xiaoying Ling, Xin Wang, Xing Fan, Xingru Pan, Xinyuan Zhang, Xinze Zhang, Xiuqing Fu, Xunkai Zhang, Yabo Xu, Yandong Wu, Yida Lu, Yidong Wang, Yilin Zhou, Yiming Pan, Ying Zhang, Yingli Wang, Yingru Li, Yinpei Su, Yipeng Geng, Yitong Zhu, Yongkun Yang, Yuhang Li, Yuhao Wu, Yujiang Li, Yunan Liu, Yunqing Wang, Yuntao Li, Yuxuan Zhang, Zezhen Liu, Zhen Yang, Zhengda Zhou, Zhongpei Qiao, Zhuoer Feng, Zhuorui Liu, Zichen Zhang, Zihan Wang, Zijun Yao, Zikang Wang, Ziqiang Liu, Ziwei Chai, Zixuan Li, Zuodong Zhao, Wenguang Chen, Jidong Zhai, Bin Xu, Minlie Huang, Hongning Wang, Juanzi Li, Yuxiao Dong, and Jie Tang. Glm-4.5: Agentic, reasoning, and coding (arc) foundation models, 2025c. URL https://arxiv.org/abs/2508.06471.
Team [2025] Qwen Team. Qwen3 technical report, 2025. URL https://arxiv.org/abs/2505.09388.
Zheng et al. [2023a] Qinkai Zheng, Xiao Xia, Xu Zou, Yuxiao Dong, Shan Wang, Yufei Xue, Zihan Wang, Lei Shen, Andi Wang, Yang Li, Teng Su, Zhilin Yang, and Jie Tang. Codegeex: A pre-trained model for code generation with multilingual evaluations on humaneval-x. arXiv preprint arXiv:2303.17568, 2023a. doi: 10.48550/ARXIV.2303.17568. URL https://doi.org/10.48550/arXiv.2303.17568.
Austin et al. [2021] Jacob Austin, Augustus Odena, Maxwell Nye, Maarten Bosma, Henryk Michalewski, David Dohan, Ellen Jiang, Carrie Cai, Michael Terry, Quoc Le, and Charles Sutton. Program synthesis with large language models, 2021. URL https://arxiv.org/abs/2108.07732.
Li et al. [2022] Yujia Li, David Choi, Junyoung Chung, Nate Kushman, Julian Schrittwieser, Rémi Leblond, Tom Eccles, James Keeling, Felix Gimeno, Agustin Dal Lago, Thomas Hubert, Peter Choy, Cyprien de Masson d’Autume, Igor Babuschkin, Xinyun Chen, Po-Sen Huang, Johannes Welbl, Sven Gowal, Alexey Cherepanov, James Molloy, Daniel J. Mankowitz, Esme Sutherland Robson, Pushmeet Kohli, Nando de Freitas, Koray Kavukcuoglu, and Oriol Vinyals. Competition-level code generation with alphacode. Science, 378(6624):1092–1097, December 2022. ISSN 1095-9203. doi: 10.1126/science.abq1158. URL http://dx.doi.org/10.1126/science.abq1158.
Jain et al. [2024] Naman Jain, King Han, Alex Gu, Wen-Ding Li, Fanjia Yan, Tianjun Zhang, Sida Wang, Armando Solar-Lezama, Koushik Sen, and Ion Stoica. Livecodebench: Holistic and contamination free evaluation of large language models for code, 2024. URL https://arxiv.org/abs/2403.07974.
Zhuo et al. [2025] Terry Yue Zhuo, Minh Chien Vu, Jenny Chim, Han Hu, Wenhao Yu, Ratnadira Widyasari, Imam Nur Bani Yusuf, Haolan Zhan, Junda He, Indraneil Paul, Simon Brunner, Chen Gong, Thong Hoang, Armel Randy Zebaze, Xiaoheng Hong, Wen-Ding Li, Jean Kaddour, Ming Xu, Zhihan Zhang, Prateek Yadav, Naman Jain, Alex Gu, Zhoujun Cheng, Jiawei Liu, Qian Liu, Zijian Wang, Binyuan Hui, Niklas Muennighoff, David Lo, Daniel Fried, Xiaoning Du, Harm de Vries, and Leandro Von Werra. Bigcodebench: Benchmarking code generation with diverse function calls and complex instructions, 2025. URL https://arxiv.org/abs/2406.15877.
Jimenez et al. [2024] Carlos E Jimenez, John Yang, Alexander Wettig, Shunyu Yao, Kexin Pei, Ofir Press, and Karthik R Narasimhan. SWE-bench: Can language models resolve real-world github issues? In The Twelfth International Conference on Learning Representations, 2024. URL https://openreview.net/forum?id=VTF8yNQM66.
Yang et al. [2025a] John Yang, Kilian Lieret, Carlos E. Jimenez, Alexander Wettig, Kabir Khandpur, Yanzhe Zhang, Binyuan Hui, Ofir Press, Ludwig Schmidt, and Diyi Yang. Swe-smith: Scaling data for software engineering agents, 2025a. URL https://arxiv.org/abs/2504.21798.
Badertdinov et al. [2025] Ibragim Badertdinov, Alexander Golubev, Maksim Nekrashevich, Anton Shevtsov, Simon Karasik, Andrei Andriushchenko, Maria Trofimova, Daria Litvintseva, and Boris Yangel. Swe-rebench: An automated pipeline for task collection and decontaminated evaluation of software engineering agents, 2025. URL https://arxiv.org/abs/2505.20411.
Rashid et al. [2025] Muhammad Shihab Rashid, Christian Bock, Yuan Zhuang, Alexander Buchholz, Tim Esler, Simon Valentin, Luca Franceschi, Martin Wistuba, Prabhu Teja Sivaprasad, Woo Jung Kim, Anoop Deoras, Giovanni Zappella, and Laurent Callot. Swe-polybench: A multi-language benchmark for repository level evaluation of coding agents, 2025. URL https://arxiv.org/abs/2504.08703.
Yang et al. [2025b] John Yang, Carlos E Jimenez, Alex L Zhang, Kilian Lieret, Joyce Yang, Xindi Wu, Ori Press, Niklas Muennighoff, Gabriel Synnaeve, Karthik R Narasimhan, Diyi Yang, Sida Wang, and Ofir Press. Swe-bench multimodal: Do ai systems generalize to visual software domains? In The Thirteenth International Conference on Learning Representations, 2025b. URL https://openreview.net/forum?id=riTiq3i21b.
Chen et al. [2021] Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Ponde De Oliveira Pinto, Jared Kaplan, Harri Edwards, Yuri Burda, Nicholas Joseph, Greg Brockman, et al. Evaluating large language models trained on code. arXiv preprint arXiv:2107.03374, 2021.
Zhao et al. [2024] CodeGemma Team, Heri Zhao, Jeffrey Hui, Joshua Howland, Nam Nguyen, Siqi Zuo, Andrea Hu, Christopher A. Choquette-Choo, Jingyue Shen, Joe Kelley, Kshitij Bansal, Luke Vilnis, Mateo Wirth, Paul Michel, Peter Choy, Pratik Joshi, Ravin Kumar, Sarmad Hashmi, Shubham Agrawal, Zhitao Gong, Jane Fine, Tris Brian Warkentin, Ale Jakse Hartman, Bin Ni, Kathy Korevec, Kelly Schaefer, and Scott Huffman. Codegemma: Open code models based on gemma. ArXiv, abs/2406.11409, 2024. URL https://api.semanticscholar.org/CorpusID:270560319.
Chowdhery et al. [2023] Aakanksha Chowdhery, Sharan Narang, Jacob Devlin, Maarten Bosma, Gaurav Mishra, Adam Roberts, Paul Barham, Hyung Won Chung, Charles Sutton, Sebastian Gehrmann, et al. Palm: Scaling language modeling with pathways. Journal of Machine Learning Research, 24(240):1–113, 2023.
Nijkamp et al. [2023] Erik Nijkamp, Bo Pang, Hiroaki Hayashi, Lifu Tu, Huan Wang, Yingbo Zhou, Silvio Savarese, and Caiming Xiong. Codegen: An open large language model for code with multi-turn program synthesis. In International Conference on Learning Representations, 2023. URL https://openreview.net/forum?id=iaYcJKpY2B_.
Fried et al. [2023] Daniel Fried, Armen Aghajanyan, Jessy Lin, Sida Wang, Eric Wallace, Freda Shi, Ruiqi Zhong, Scott Yih, Luke Zettlemoyer, and Mike Lewis. Incoder: A generative model for code infilling and synthesis. In The Eleventh International Conference on Learning Representations, 2023. URL https://openreview.net/forum?id=hQwb-lbM6EL.
Xu et al. [2022] Frank F Xu, Uri Alon, Graham Neubig, and Vincent Josua Hellendoorn. A systematic evaluation of large language models of code. In Proceedings of the 6th ACM SIGPLAN International Symposium on Machine Programming, pages 1–10, 2022.
Roziere et al. [2023] Baptiste Roziere, Jonas Gehring, Fabian Gloeckle, Sten Sootla, Itai Gat, Xiaoqing Ellen Tan, Yossi Adi, Jingyu Liu, Tal Remez, Jérémy Rapin, et al. Code llama: Open foundation models for code. 2023.
Hui et al. [2024a] Binyuan Hui, Jian Yang, Zeyu Cui, Jiaxi Yang, Dayiheng Liu, Lei Zhang, Tianyu Liu, Jiajun Zhang, Bowen Yu, Kai Dang, et al. Qwen2.5-coder technical report. arXiv preprint arXiv:2409.12186, 2024a.
Deng et al. [2025] Ken Deng, Zizheng Zhan, Wen Xiang, Wenqiang Zhu, Tianhao Peng, Xinping Lei, Weihao Li, Jingxuan Xu, Kun Wu, Yifan Yao, et al. Hipo: Hybrid policy optimization for dynamic reasoning in llms. arXiv preprint arXiv:2509.23967, 2025.
Que et al. [2024] Haoran Que, Jiaheng Liu, Ge Zhang, Chenchen Zhang, Xingwei Qu, Yinghao Ma, Feiyu Duan, Zhiqi Bai, Jiakai Wang, Yuanxing Zhang, et al. D-cpt law: Domain-specific continual pre-training scaling law for large language models. Advances in Neural Information Processing Systems, 37:90318–90354, 2024.
Liu et al. [2024a] Jiaheng Liu, Chenchen Zhang, Jinyang Guo, Yuanxing Zhang, Haoran Que, Ken Deng, Jie Liu, Ge Zhang, Yanan Wu, Congnan Liu, et al. Ddk: Distilling domain knowledge for efficient large language models. Advances in Neural Information Processing Systems, 37:98297–98319, 2024a.
Liu et al. [2025] Jiaheng Liu, Dawei Zhu, Zhiqi Bai, Yancheng He, Huanxuan Liao, Haoran Que, Zekun Wang, Chenchen Zhang, Ge Zhang, Jiebin Zhang, et al. A comprehensive survey on long context language modeling. arXiv preprint arXiv:2503.17407, 2025.
Wang et al. [2024] Pei Wang, Yanan Wu, Zekun Wang, Jiaheng Liu, Xiaoshuai Song, Zhongyuan Peng, Ken Deng, Chenchen Zhang, Jiakai Wang, Junran Peng, et al. Mtu-bench: A multi-granularity tool-use benchmark for large language models. arXiv preprint arXiv:2410.11710, 2024.
Yang et al. [2024] John Yang, Carlos E. Jimenez, Alexander Wettig, Kilian Lieret, Shunyu Yao, Karthik Narasimhan, and Ofir Press. Swe-agent: Agent-computer interfaces enable automated software engineering, 2024. URL https://arxiv.org/abs/2405.15793.
Wang et al. [2025] Xingyao Wang, Boxuan Li, Yufan Song, Frank F. Xu, Xiangru Tang, Mingchen Zhuge, Jiayi Pan, Yueqi Song, Bowen Li, Jaskirat Singh, Hoang H. Tran, Fuqiang Li, Ren Ma, Mingzhang Zheng, Bill Qian, Yanjun Shao, Niklas Muennighoff, Yizhe Zhang, Binyuan Hui, Junyang Lin, Robert Brennan, Hao Peng, Heng Ji, and Graham Neubig. Openhands: An open platform for AI software developers as generalist agents. In The Thirteenth International Conference on Learning Representations, 2025. URL https://openreview.net/forum?id=OJd3ayDDoF.
Anthropic [2025] Anthropic. claude-code, 2025. URL https://github.com/anthropics/claude-code.
OpenAI [2025] OpenAI. Codex, 2025. URL https://github.com/openai/codex.
Cline [2024] Cline. Cline, 2024. URL https://github.com/cline/cline.
Liu et al. [2024b] Jiaheng Liu, Zehao Ni, Haoran Que, Tao Sun, Zekun Wang, Jian Yang, Jiakai Wang, Hongcheng Guo, Zhongyuan Peng, Ge Zhang, et al. Roleagent: Building, interacting, and benchmarking high-quality role-playing agents from scripts. Advances in Neural Information Processing Systems, 37:49403–49428, 2024b.
He et al. [2025] Yancheng He, Shilong Li, Jiaheng Liu, Weixun Wang, Xingyuan Bu, Ge Zhang, Zhongyuan Peng, Zhaoxiang Zhang, Zhicheng Zheng, Wenbo Su, et al. Can large language models detect errors in long chain-of-thought reasoning? arXiv preprint arXiv:2502.19361, 2025.
Liu et al. [2024c] Jiaheng Liu, Ken Deng, Congnan Liu, Jian Yang, Shukai Liu, He Zhu, Peng Zhao, Linzheng Chai, Yanan Wu, Ke Jin, et al. M2rc-eval: Massively multilingual repository-level code completion evaluation. arXiv preprint arXiv:2410.21157, 2024c.
Yang et al. [2025c] John Yang, Carlos E. Jimenez, Alex L. Zhang, Kilian Lieret, Joyce Yang, Xindi Wu, Ori Press, Niklas Muennighoff, Gabriel Synnaeve, Karthik R. Narasimhan, Diyi Yang, Sida I. Wang, and Ofir Press. SWE-bench multimodal: Do ai systems generalize to visual software domains? In The Thirteenth International Conference on Learning Representations, 2025c. URL https://openreview.net/forum?id=riTiq3i21b.
Zhang et al. [2025a] Linghao Zhang, Shilin He, Chaoyun Zhang, Yu Kang, Bowen Li, Chengxing Xie, Junhao Wang, Maoquan Wang, Yufan Huang, Shengyu Fu, Elsie Nallipogu, Qingwei Lin, Yingnong Dang, Saravan Rajmohan, and Dongmei Zhang. Swe-bench goes live! arXiv preprint arXiv:2505.23419, 2025a.
Miserendino et al. [2025] Samuel Miserendino, Michele Wang, Tejal Patwardhan, and Johannes Heidecke. Swe-lancer: Can frontier llms earn $1 million from real-world freelance software engineering?, 2025.
Zheng et al. [2023b] Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, Zi Lin, Zhuohan Li, Dacheng Li, Eric P. Xing, Hao Zhang, Joseph E. Gonzalez, and Ion Stoica. Judging llm-as-a-judge with mt-bench and chatbot arena, 2023b. URL https://arxiv.org/abs/2306.05685.
Zhang et al. [2025b] Alexander Zhang, Marcus Dong, Jiaheng Liu, Wei Zhang, Yejie Wang, Jian Yang, Ge Zhang, Tianyu Liu, Zhongyuan Peng, Yingshui Tan, et al. Codecriticbench: A holistic code critique benchmark for large language models. arXiv preprint arXiv:2502.16614, 2025b.
Zhang et al. [2025c] Chenchen Zhang, Yuhang Li, Can Xu, Jiaheng Liu, Ao Liu, Shihui Hu, Dengpeng Wu, Guanhua Huang, Kejiao Li, Qi Yi, et al. Artifactsbench: Bridging the visual-interactive gap in llm code generation evaluation. arXiv preprint arXiv:2507.04952, 2025c.
Li et al. [2025] Yuhang Li, Chenchen Zhang, Ruilin Lv, Ao Liu, Ken Deng, Yuanxing Zhang, Jiaheng Liu, Wiggin Zhou, and Bo Zhou. Relook: Vision-grounded rl with a multimodal llm critic for agentic web coding. arXiv preprint arXiv:2510.11498, 2025.
Pajankar [2017] Ashwin Pajankar. pytest. In Python Unit Test Automation: Practical Techniques for Python Developers and Testers, pages 87–100. Springer, 2017.
Vassudanagunta [2025] Vassudanagunta. c8 - native v8 code-coverage, 2025. URL https://github.com/bcoe/c8. Accessed: 2025-10-30.
Gemini Team and Google [2023] Gemini Team and Google. Gemini: A family of highly capable multimodal models, 2023.
OpenAI et al. [2024] OpenAI, Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, Red Avila, Igor Babuschkin, Suchir Balaji, Valerie Balcom, Paul Baltescu, Haiming Bao, Mohammad Bavarian, Jeff Belgum, Irwan Bello, Jake Berdine, Gabriel Bernadett-Shapiro, Christopher Berner, Lenny Bogdonoff, Oleg Boiko, Madelaine Boyd, Anna-Luisa Brakman, Greg Brockman, Tim Brooks, Miles Brundage, Kevin Button, Trevor Cai, Rosie Campbell, Andrew Cann, Brittany Carey, Chelsea Carlson, Rory Carmichael, Brooke Chan, Che Chang, Fotis Chantzis, Derek Chen, Sully Chen, Ruby Chen, Jason Chen, Mark Chen, Ben Chess, Chester Cho, Casey Chu, Hyung Won Chung, Dave Cummings, Jeremiah Currier, Yunxing Dai, Cory Decareaux, Thomas Degry, Noah Deutsch, Damien Deville, Arka Dhar, David Dohan, Steve Dowling, Sheila Dunning, Adrien Ecoffet, Atty Eleti, Tyna Eloundou, David Farhi, Liam Fedus, Niko Felix, Simón Posada Fishman, Juston Forte, Isabella Fulford, Leo Gao, Elie Georges, Christian Gibson, Vik Goel, Tarun Gogineni, Gabriel Goh, Rapha Gontijo-Lopes, Jonathan Gordon, Morgan Grafstein, Scott Gray, Ryan Greene, Joshua Gross, Shixiang Shane Gu, Yufei Guo, Chris Hallacy, Jesse Han, Jeff Harris, Yuchen He, Mike Heaton, Johannes Heidecke, Chris Hesse, Alan Hickey, Wade Hickey, Peter Hoeschele, Brandon Houghton, Kenny Hsu, Shengli Hu, Xin Hu, Joost Huizinga, Shantanu Jain, Shawn Jain, Joanne Jang, Angela Jiang, Roger Jiang, Haozhun Jin, Denny Jin, Shino Jomoto, Billie Jonn, Heewoo Jun, Tomer Kaftan, Łukasz Kaiser, Ali Kamali, Ingmar Kanitscheider, Nitish Shirish Keskar, Tabarak Khan, Logan Kilpatrick, Jong Wook Kim, Christina Kim, Yongjik Kim, Jan Hendrik Kirchner, Jamie Kiros, Matt Knight, Daniel Kokotajlo, Łukasz Kondraciuk, Andrew Kondrich, Aris Konstantinidis, Kyle Kosic, Gretchen Krueger, Vishal Kuo, Michael Lampe, Ikai Lan, Teddy Lee, Jan Leike, Jade Leung, Daniel Levy, Chak Ming Li, Rachel Lim, 
Molly Lin, Stephanie Lin, Mateusz Litwin, Theresa Lopez, Ryan Lowe, Patricia Lue, Anna Makanju, Kim Malfacini, Sam Manning, Todor Markov, Yaniv Markovski, Bianca Martin, Katie Mayer, Andrew Mayne, Bob McGrew, Scott Mayer McKinney, Christine McLeavey, Paul McMillan, Jake McNeil, David Medina, Aalok Mehta, Jacob Menick, Luke Metz, Andrey Mishchenko, Pamela Mishkin, Vinnie Monaco, Evan Morikawa, Daniel Mossing, Tong Mu, Mira Murati, Oleg Murk, David Mély, Ashvin Nair, Reiichiro Nakano, Rajeev Nayak, Arvind Neelakantan, Richard Ngo, Hyeonwoo Noh, Long Ouyang, Cullen O’Keefe, Jakub Pachocki, Alex Paino, Joe Palermo, Ashley Pantuliano, Giambattista Parascandolo, Joel Parish, Emy Parparita, Alex Passos, Mikhail Pavlov, Andrew Peng, Adam Perelman, Filipe de Avila Belbute Peres, Michael Petrov, Henrique Ponde de Oliveira Pinto, Michael Pokorny, Michelle Pokrass, Vitchyr H. Pong, Tolly Powell, Alethea Power, Boris Power, Elizabeth Proehl, Raul Puri, Alec Radford, Jack Rae, Aditya Ramesh, Cameron Raymond, Francis Real, Kendra Rimbach, Carl Ross, Bob Rotsted, Henri Roussez, Nick Ryder, Mario Saltarelli, Ted Sanders, Shibani Santurkar, Girish Sastry, Heather Schmidt, David Schnurr, John Schulman, Daniel Selsam, Kyla Sheppard, Toki Sherbakov, Jessica Shieh, Sarah Shoker, Pranav Shyam, Szymon Sidor, Eric Sigler, Maddie Simens, Jordan Sitkin, Katarina Slama, Ian Sohl, Benjamin Sokolowsky, Yang Song, Natalie Staudacher, Felipe Petroski Such, Natalie Summers, Ilya Sutskever, Jie Tang, Nikolas Tezak, Madeleine B. 
Thompson, Phil Tillet, Amin Tootoonchian, Elizabeth Tseng, Preston Tuggle, Nick Turley, Jerry Tworek, Juan Felipe Cerón Uribe, Andrea Vallone, Arun Vijayvergiya, Chelsea Voss, Carroll Wainwright, Justin Jay Wang, Alvin Wang, Ben Wang, Jonathan Ward, Jason Wei, CJ Weinmann, Akila Welihinda, Peter Welinder, Jiayi Weng, Lilian Weng, Matt Wiethoff, Dave Willner, Clemens Winter, Samuel Wolrich, Hannah Wong, Lauren Workman, Sherwin Wu, Jeff Wu, Michael Wu, Kai Xiao, Tao Xu, Sarah Yoo, Kevin Yu, Qiming Yuan, Wojciech Zaremba, Rowan Zellers, Chong Zhang, Marvin Zhang, Shengjia Zhao, Tianhao Zheng, Juntang Zhuang, William Zhuk, and Barret Zoph. GPT-4 technical report, 2024. URL https://arxiv.org/abs/2303.08774.
Hui et al. [2024b] Binyuan Hui, Jian Yang, Zeyu Cui, Jiaxi Yang, Dayiheng Liu, Lei Zhang, Tianyu Liu, Jiajun Zhang, Bowen Yu, Kai Dang, et al. Qwen2.5-Coder technical report. arXiv preprint arXiv:2409.12186, 2024b.
Liu et al. [2024d] Aixin Liu, Bei Feng, Bing Xue, Bingxuan Wang, Bochao Wu, Chengda Lu, Chenggang Zhao, Chengqi Deng, Chenyu Zhang, Chong Ruan, et al. DeepSeek-V3 technical report. arXiv preprint arXiv:2412.19437, 2024d.

Appendix A Appendix

A.1 Judge Prompt for Code Understanding Task

Judge Prompt for Code Understanding Task

Evaluate if the answer satisfies the question requirements using a natural language explanation.

QUESTION: {question_text}

REQUIREMENTS (checklist items):
{chr(10).join(checklist_text)}
{patch_section}
ANSWER: {truncated_answer}

EVALUATION RULES:

Answer MUST use clear English explanations, NOT just code diffs, or other types of content
Only give 1.0 when ALL checklist items are thoroughly satisfied with clear explanations.
Score = (satisfied items) / (total items)
Penalize: code diffs without explanation, vague statements, wrong info. Give 0.0 when the answer is just code diffs or completely wrong.

JSON response: {{ "reasoning": "Brief explanation of which items satisfied/unsatisfied and why", "score": , "satisfied_items": ["item_id1", ...] }}
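The scoring rule in the prompt above (score = satisfied items / total items) can be illustrated with a minimal sketch of how a judge reply might be post-processed. This is not the authors' evaluation code: the function names, the fallback of 0.0 for unparseable replies, and the clamping of the score to [0, 1] are assumptions made for illustration.

```python
import json


def checklist_ratio(satisfied_items, checklist_ids):
    """Fraction of known checklist ids that the judge marked as satisfied."""
    total = len(checklist_ids)
    if total == 0:
        return 0.0
    # Count each id once and ignore ids that are not on the checklist.
    satisfied = set(satisfied_items) & set(checklist_ids)
    return len(satisfied) / total


def parse_judge_response(raw, checklist_ids):
    """Parse the judge's JSON reply (hypothetical helper).

    Returns the judge's own score clamped to [0, 1], plus the ratio
    recomputed from satisfied_items as a consistency check.
    """
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        data = None
    if not isinstance(data, dict):
        # Assumption: an unparseable reply is scored as 0.0.
        return {"score": 0.0, "checklist_ratio": 0.0, "satisfied_items": []}

    items = data.get("satisfied_items", [])
    if not isinstance(items, list):
        items = []
    try:
        score = float(data.get("score", 0.0))
    except (TypeError, ValueError):
        score = 0.0
    score = min(1.0, max(0.0, score))
    return {
        "score": score,
        "checklist_ratio": checklist_ratio(items, checklist_ids),
        "satisfied_items": items,
    }
```

The judge's score and the recomputed ratio can legitimately differ, since the prompt also instructs the judge to give 0.0 to answers that consist only of code diffs.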

A.2 Claude Code: Parallel Tool-Calls System Prompt

The following prompt is appended via SDK to encourage parallel tool invocations when operations are independent.

Claude Code: Parallel Tool-Calls System Prompt

For maximum efficiency, whenever you perform multiple independent operations, invoke all relevant tools simultaneously rather than sequentially. Prioritize calling tools in parallel whenever possible. For example, when reading 3 files, run 3 tool calls in parallel to read all 3 files into context at the same time. When running multiple read-only commands like ‘ls‘ or ‘list_dir‘, always run all of the commands in parallel. Err on the side of maximizing parallel tool calls rather than running too many tools sequentially.
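The prompt above asks the agent to batch independent operations. As a rough analogy outside the agent (this is not how Claude Code dispatches tool calls internally), the sketch below issues several independent file reads at once with a thread pool instead of one after another. The helper names and the worker limit are illustrative assumptions.

```python
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path


def read_file(path):
    """Read one file as UTF-8 text."""
    return Path(path).read_text(encoding="utf-8")


def read_all_parallel(paths, max_workers=8):
    """Submit all independent reads together; results keep the input order."""
    paths = list(paths)
    if not paths:
        return []
    workers = min(max_workers, len(paths))
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map preserves the order of `paths` in its results.
        return list(pool.map(read_file, paths))
```

Batching only pays off when the operations do not depend on each other's results, which is the condition the prompt states.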

A.3 Trajectory Failure Analysis Prompt

Trajectory Failure Analysis Prompt

System Prompt:

You are an expert software engineer analyzing why a software engineering agent failed to resolve an issue.

AVAILABLE AGENT ACTIONS:

---- BEGIN FUNCTION #1: bash ---- Description: Execute a bash command in the terminal.

Can generate very large outputs when listing files (ls, find, grep)
Output contributes directly to context window usage
Commands like ’find /repo -name "*.py"’ can list thousands of files
Large outputs can quickly fill the context window

Parameters: (1) command (string, required): The bash command to execute. Can be empty to view additional logs when previous exit code is ‘-1‘. Can be ‘ctrl+c‘ to interrupt the currently running process. ---- END FUNCTION #1 ----

---- BEGIN FUNCTION #2: submit ---- Description: Finish the interaction when the task is complete OR if the assistant cannot proceed further with the task.

Used when agent thinks task is done (may be correct or incorrect solution)
Also used when agent is stuck and cannot make progress
No parameters are required for this function. ---- END FUNCTION #2 ----

---- BEGIN FUNCTION #3: str_replace_editor ---- Description: Custom editing tool for viewing, creating and editing files

State is persistent across command calls and discussions with the user
If ‘path‘ is a file, ‘view‘ displays the result of applying ‘cat -n‘. If ‘path‘ is a directory, ‘view‘ lists non-hidden files and directories up to 2 levels deep
Directory views can generate large outputs contributing to context usage
The ‘create‘ command cannot be used if the specified ‘path‘ already exists as a file
If a ‘command‘ generates a long output, it will be truncated and marked with ‘‘
The ‘undo_edit‘ command will revert the last edit made to the file at ‘path‘

Notes for using the ‘str_replace‘ command:

The ‘old_str‘ parameter should match EXACTLY one or more consecutive lines from the original file. Be mindful of whitespaces!
If the ‘old_str‘ parameter is not unique in the file, the replacement will not be performed. Make sure to include enough context in ‘old_str‘ to make it unique
The ‘new_str‘ parameter should contain the edited lines that should replace the ‘old_str‘

Parameters: (1) command (string, required): The commands to run. Allowed options are: ‘view‘, ‘create‘, ‘str_replace‘, ‘insert‘, ‘undo_edit‘. (2) path (string, required): Absolute path to file or directory, e.g. ‘/repo/file.py‘ or ‘/repo‘. (3) file_text (string, optional): Required parameter of ‘create‘ command, with the content of the file to be created. (4) old_str (string, optional): Required parameter of ‘str_replace‘ command containing the string in ‘path‘ to replace. (5) new_str (string, optional): Optional parameter of ‘str_replace‘ command containing the new string (if not given, no string will be added). Required parameter of ‘insert‘ command containing the string to insert. (6) insert_line (integer, optional): Required parameter of ‘insert‘ command. The ‘new_str‘ will be inserted AFTER the line ‘insert_line‘ of ‘path‘. (7) view_range (array, optional): Optional parameter of ‘view‘ command when ‘path‘ points to a file. If none is given, the full file is shown. If provided, the file will be shown in the indicated line number range, e.g. [11, 12] will show lines 11 and 12. Indexing at 1 to start. Setting ‘[start_line, -1]‘ shows all lines from ‘start_line‘ to the end of the file. ---- END FUNCTION #3 ----

---- BEGIN FUNCTION #4: file_viewer ---- Description: Interactive file viewer for opening and navigating files in the editor.

open <path> [<line_number>]: Opens the file at path. If line_number is provided, the view moves to include that line.
goto <line_number>: Moves the window to show the specified line number.
scroll_down: Moves the window down 100 lines.
scroll_up: Moves the window up 100 lines.

Parameters: (1) command (string, required): One of ‘open‘, ‘goto‘, ‘scroll_down‘, ‘scroll_up‘. (2) path_or_line (string/int, optional): For ‘open‘, a path (and optional line). For ‘goto‘, a line number. ---- END FUNCTION #4 ----

---- BEGIN FUNCTION #5: search_tools ---- Description: Searching utilities for locating text or files within the workspace.

search_file <search_term> [<file>]: Searches for search_term in file. If file is not provided, searches the current open file.
search_dir <search_term> [<dir>]: Searches for search_term in all files in dir. If dir is not provided, searches in the current directory.
find_file <file_name> [<dir>]: Finds all files with the given name in dir. If dir is not provided, searches in the current directory.

Parameters: (1) subcommand (string, required): One of ‘search_file‘, ‘search_dir‘, ‘find_file‘. (2) arg1 (string, required): The search term or file name, depending on subcommand. (3) arg2 (string, optional): Target file (for search_file) or directory (for search_dir/find_file). ---- END FUNCTION #5 ----

---- BEGIN FUNCTION #6: edit_block ---- Description: Block editor for replacing ranges in the current open file and finalizing edits.

edit : : Replaces lines n through m (inclusive) with the given text in the open file. Ensure indentation is correct.
end_of_edit: Applies the pending changes. Python files are syntax-checked after the edit; if an error is found, the edit is rejected.

Parameters: (1) command (string, required): ‘edit‘ or ‘end_of_edit‘. (2) range_and_text (varies): For ‘edit‘, a line range ‘n:m‘ and the replacement text. ---- END FUNCTION #6 ----

---- BEGIN FUNCTION #7: create_file ---- Description: Creates and opens a new file with the given name.

Parameters: (1) filename (string, required): Absolute or workspace-relative path to create. The file must not already exist. ---- END FUNCTION #7 ----

##PROBLEM STATEMENT## {problem_statement}

##TRAJECTORY SUMMARY##

Total steps: {total_steps}
Final state: Failed (no successful patch generated / failed on some unit test)

##ANALYSIS INSTRUCTIONS##

IMPORTANT: This trajectory FAILED in final evaluation. The agent likely believed it succeeded, but it was WRONG.

The agent may have:

Claimed the issue was resolved or fixed
Written custom tests that passed
Expressed high confidence in the solution
Stated "the implementation is complete" or "all tests pass"
Created demo scripts showing the fix "works"
Manually verified outputs that looked correct

Despite these apparent indicators of success, the final evaluation proves the solution was INCORRECT. Therefore, ignore the agent’s self-assessment and focus on identifying the actual flaws.

Select ONE category below that best describes the actual flaw:

Requirement Misinterpretation: The agent failed to properly understand and locate the problem, including misidentifying affected files, misjudging severity, confusing problem types, or failing to identify root causes and understand dependencies, data flow, or system architecture.
Inadequate Testing: The agent provided incomplete test coverage, missing edge cases, compatibility issues, performance impacts, integration scenarios, or multi-platform testing requirements.
Incomplete Solution & Side Effects: The agent provided an incomplete fix that only addressed symptoms rather than root causes, or introduced new issues including regressions, security vulnerabilities, environment configuration errors, data corruption risks, or breaking changes to existing functionality.
Technical Knowledge Gap: The agent demonstrated insufficient technical proficiency or violated domain-specific conventions, including lacking necessary knowledge in specialized domains (UI/frontend, security, accessibility, DevOps, i18n, performance, analytics) or incorrectly handling domain-specific issues (data processing, security implementations, UI/UX standards, API design, documentation synchronization).
Tool Invocation Error: The agent encountered errors while using tools due to incorrect syntax, context overflow from file operations, or parse/analysis tool failures.
infinite_loop: The agent got stuck in loops without convergence, including repeated attempts at the same solution, oscillating between decisions, or endlessly reading files without making progress.
other: The agent failed to resolve the issue for reasons not covered by the above categories.

Do NOT invent or propose new categories. If none fits, use "other". Category must be all lowercase with underscores. Remember to write two new lines before the category.

User Prompt:

##INSTANCE INFORMATION## Instance ID: {instance_id}

##The complete trajectory of the interaction (to be analyzed)## {traj_text}

##OUTPUT FORMAT## You MUST provide your response in this exact format: xxx

xxx If the Assistant gets stuck in a loop or encounters a tool_error error, indicate the incorrect action and parameters. If the Assistant misunderstands the question, set error_action="None".
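The prompt above requires a single category, written in lowercase with underscores, and falls back to "other" when nothing fits. A minimal sketch of how an analysis pipeline might enforce that constraint on the returned label is shown below. The slug spellings in `ALLOWED_CATEGORIES` are assumptions derived from the category names in the prompt (only `infinite_loop` and `other` appear verbatim in that form), and `normalize_category` is a hypothetical helper, not part of the authors' tooling.

```python
import re

# Hypothetical label set: slugs derived from the category names in the prompt.
ALLOWED_CATEGORIES = {
    "requirement_misinterpretation",
    "inadequate_testing",
    "incomplete_solution_and_side_effects",
    "technical_knowledge_gap",
    "tool_invocation_error",
    "infinite_loop",
    "other",
}


def normalize_category(raw):
    """Map a free-form label to an allowed slug, defaulting to 'other'."""
    slug = raw.strip().lower().replace("&", "and")
    # Collapse every run of non-alphanumeric characters into one underscore.
    slug = re.sub(r"[^a-z0-9]+", "_", slug).strip("_")
    return slug if slug in ALLOWED_CATEGORIES else "other"
```

Mapping unknown labels to "other" mirrors the instruction not to invent new categories.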

A.4 Executor Details and Method-Specific Settings

Unless otherwise noted, all runs are strictly offline. Below we record method-specific configurations referenced in §4:

• SWE-Agent. max turns = 150; per-tool step timeout = 600 s; parse_function set to function calling; long observations truncated; compiled artifacts filtered via “.gitignore”; language-specific build/test commands repaired for stability.
• Claude Code. max turns = 150; permission_mode = bypassPermissions; system prompt appended to encourage parallel tool calls (Appx. A.2); may internally invoke a SubAgent; networking disabled.

Standardized offline build/test commands (per language).

We standardize non-interactive commands to ensure reproducible builds and comparable feedback signals across languages:

• Python: pytest -q
• JavaScript/TypeScript: npm ci && npm test --run
• Java: mvn -B -DskipTests=false test
• Go: go test ./...
• Rust: cargo test --locked
• C/C++: cmake --build && ctest -j1
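As a rough illustration of how the commands listed above might be driven non-interactively under the 600 s per-step budget, consider the sketch below. The command strings are copied from the list; the dictionary keys, the `run_step` helper, and the use of -1 as the exit code for a timed-out step are assumptions for illustration, not the authors' executor.

```python
import subprocess

# Commands copied verbatim from the list above; the keys are illustrative.
TEST_COMMANDS = {
    "python": "pytest -q",
    "javascript/typescript": "npm ci && npm test --run",
    "java": "mvn -B -DskipTests=false test",
    "go": "go test ./...",
    "rust": "cargo test --locked",
    "c/c++": "cmake --build && ctest -j1",
}

STEP_TIMEOUT_S = 600  # per-step budget stated in this appendix


def run_step(command, cwd=None, timeout_s=STEP_TIMEOUT_S):
    """Run one shell command without stdin; return (exit_code, output)."""
    try:
        proc = subprocess.run(
            command,
            shell=True,
            cwd=cwd,
            stdin=subprocess.DEVNULL,   # never wait for interactive input
            stdout=subprocess.PIPE,
            stderr=subprocess.STDOUT,   # merge stderr into one stream
            text=True,
            timeout=timeout_s,
        )
    except subprocess.TimeoutExpired as exc:
        partial = exc.output or ""
        if isinstance(partial, bytes):
            partial = partial.decode("utf-8", errors="replace")
        # Assumption: report a timed-out step with exit code -1.
        return -1, partial + "\n[step timed out]"
    return proc.returncode, proc.stdout
```

A call such as `run_step(TEST_COMMANDS["go"], cwd=repo_dir)` would run the Go tests inside a checked-out repository, where `repo_dir` is a placeholder path.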

To minimize offline flakiness and improve determinism, we apply:

• Pinned toolchains inside containers; pre-populated offline caches/proxies for pip, npm, cargo, Maven/Gradle, and Go modules.
• Truncation of long observations and logs; filtering of binaries and dev servers via “.gitignore”.
• Normalization of EOL/encoding (LF, UTF-8); git safe.directory set; whitespace-tolerant patching.
• Repository navigation pruning by extensions; optional function/class extraction (read-only) to speed up localization.
• Budgets and quotas: max 150 turns; per-step 600 s; global job limits; auto-kill of long-lived processes.
• Language-specific repairs for brittle stacks (e.g., Java multi-module builds, Node lockfile drift, Rust workspaces, C/C++ out-of-tree builds).
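Two of the measures above, truncating long observations and normalizing line endings to LF, can be sketched as follows. The character limits, the head-plus-tail strategy, and the marker text are illustrative assumptions; the paper does not specify the thresholds or the truncation scheme it uses.

```python
def truncate_observation(text, max_chars=10000, keep_tail=2000):
    """Keep the head and tail of a long tool output and drop the middle.

    max_chars and keep_tail are hypothetical limits, not values from the paper.
    """
    if not 0 <= keep_tail <= max_chars:
        raise ValueError("keep_tail must be between 0 and max_chars")
    if len(text) <= max_chars:
        return text
    head_len = max_chars - keep_tail
    omitted = len(text) - max_chars
    marker = f"\n[... {omitted} characters truncated ...]\n"
    # Slice from an explicit index so that keep_tail=0 yields an empty tail.
    return text[:head_len] + marker + text[len(text) - keep_tail:]


def normalize_eol(data: bytes) -> bytes:
    """Convert CRLF and lone CR line endings to LF."""
    return data.replace(b"\r\n", b"\n").replace(b"\r", b"\n")
```

Keeping the tail as well as the head is a common choice for build and test logs, because the failure summary usually appears at the end.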

Table 4: Top-10 languages: Pass@1 (%) per model. Columns are languages; rows are models grouped by agent.


Figure 8: Distribution of modified files and lines involved in golden patches.

A.5 Analysis of Open-Source Benchmark Distributions


Figure 9: Distributions across task types, programming scenarios, and languages in open-source SWE benchmarks and GitHub PRs & issues. Abbreviations: FE: Feature Enhancement, FI: Feature Implementation, CD: Configuration & Deployment, CU: Code Understanding, PO: Performance Optimization, TG: Test Case Generation, BF: Bug Fixing, RF: Refactoring; ID: Infrastructure Development, SPD: Specialized Programming Domains, DE: Data Science & Engineering, SE: Security Engineering, AD: Application Development, DS: Database Systems, ML: Machine Learning & AI, UI/UX: UI/UX Engineering.

We annotated several repository-level SWE benchmark datasets, including SWE-bench-Verified (n = 500), SWE-bench-Live (n = 500), SWE-bench-Multilingual (n = 300), SWE-bench-Pro (n = 731), and SWE-rebench (n = 449), totaling 2,480 instances across multiple programming languages and scenarios. Figure 9(a) presents the distribution of these open-source datasets across task types, programming scenarios, and languages. Through detailed analysis, we identified the following limitations:

• Incomplete Task Type Coverage. Existing benchmarks are entirely focused on Bug Fixing tasks, which comprise 100% of all instances. In contrast, several important task types (Feature Enhancement, Feature Implementation, Configuration & Deployment, Code Understanding, Performance Optimization, Test Case Generation, and Refactoring) are completely absent.
• Imbalanced Scenario Distribution. A large portion of the data focuses on Application Development (32.6%), followed by Infrastructure Development (22.1%), whereas other critical scenarios such as UI/UX Engineering (3.0%), Database Systems (2.9%), and Security Engineering (7.5%) receive significantly less coverage.
• Severe Programming Language Imbalance. The datasets are overwhelmingly dominated by Python (71.7%), with limited coverage of other languages such as Go (13.5%) and JavaScript (8.7%); all remaining languages combined account for less than 6%.
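As a quick check of the instance total reported above, the per-dataset counts can be summed directly, and each dataset's share of the annotated pool follows from the same numbers. The counts are taken from the text; the variable and function names are illustrative.

```python
# Instance counts as reported for the annotated open-source benchmarks.
BENCHMARK_SIZES = {
    "SWE-bench-Verified": 500,
    "SWE-bench-Live": 500,
    "SWE-bench-Multilingual": 300,
    "SWE-bench-Pro": 731,
    "SWE-rebench": 449,
}


def dataset_shares(sizes):
    """Return each dataset's percentage share of the pooled instances."""
    total = sum(sizes.values())
    return {name: round(100.0 * n / total, 1) for name, n in sizes.items()}


total_instances = sum(BENCHMARK_SIZES.values())
print(total_instances)  # 2480
print(dataset_shares(BENCHMARK_SIZES))
```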

A.6 Model link list

We evaluate a diverse set of state-of-the-art language models, covering both closed-source and open-source models. The closed-source models include Claude-Sonnet-4-20250514 [Anthropic, 2025], Gemini-2.5-Flash, Gemini-2.5-Pro [Gemini Team and Google, 2023], and GPT-4.1-2025-04-14 [OpenAI et al., 2024]. The open-source models include the Qwen3-Coder series [Team, 2025, Hui et al., 2024b], Kimi-K2-Instruct-0905 [Team et al., 2025a], Deepseek-V3-0324 [Liu et al., 2024d], and SWE-agent-LM-32B [Yang et al., 2025a]. The complete list of models with their official links is provided in Table 5.

Table 5: Model List.
