SWE-Compass: Towards Unified Evaluation of Agentic Coding Abilities for Large Language Models

Jingxuan Xu∗, Ken Deng∗, Weihao Li∗, Songwei Yu∗, Huaixi Tang∗, Haoyang Huang∗, Zhiyi Lai∗, Zizheng Zhan∗, Yanan Wu∗, Chenchen Zhang∗, Kepeng Lei, Yifan Yao, Xinping Lei, Wenqiang Zhu, Zongxian Feng, Han Li, Junqi Xiong, Dailin Li, Zuchen Gao, Kun Wu, Wen Xiang, Ziqi Zhan, Yuanxing Zhang, Wuxuan Gong, Ziyuan Gao, Guanxiang Wang, Yirong Xue, Mengtong Li, Mengfei Xie, Xiaojiang Zhang, Jinghui Wang, Wenhao Zhuang, Zheng Lin, Huiming Wang, Zhaoxiang Zhang, Yuqun Zhang, Haotian Zhang, Bin Chen, Jiaheng Liu†
Kuaishou Technology, Nanjing University
xujingxuan05@kuaishou.com, dengken@kuaishou.com, liujiaheng@nju.edu.cn

Abstract

Evaluating large language models (LLMs) for software engineering has been limited by narrow task coverage, language bias, and insufficient alignment with real-world developer workflows. Existing benchmarks often focus on algorithmic problems or Python-centric bug fixing, leaving critical dimensions of software engineering underexplored. To address these gaps, we introduce SWE-Compass (https://huggingface.co/datasets/Kwaipilot/SWE-Compass), a comprehensive benchmark that unifies heterogeneous code-related evaluations into a structured and production-aligned framework. SWE-Compass spans 8 task types, 8 programming scenarios, and 10 programming languages, with 2,000 high-quality instances curated from authentic GitHub pull requests and refined through systematic filtering and validation. We benchmark ten state-of-the-art LLMs under two agentic frameworks, SWE-Agent and Claude Code, revealing a clear hierarchy of difficulty across task types, languages, and scenarios. Moreover, by aligning evaluation with real-world developer practices, we hope SWE-Compass can provide a rigorous and reproducible foundation for diagnosing and advancing agentic coding capabilities in large language models.

∗ Equal Contribution. † Corresponding Author.

1 Introduction

Large language models (LLMs) trained on code have rapidly advanced from solving algorithmic puzzles to assisting with production-scale software development. Modern coding LLMs (Team et al., 2025a; b; Anthropic, 2025; Team et al., 2025c; Team, 2025) now exhibit strong multi-turn reasoning, long-context handling, and tool-use capabilities, enabling them to serve as autonomous coding agents that plan, edit, test, and deploy software. This shift has motivated a wave of benchmarks designed to measure their utility. However, existing evaluations fall short in capturing the full scope of real-world software engineering: most remain restricted to single-file tasks, Python-centric bug fixing, or synthetic algorithmic problems (Zheng et al., 2023a; Austin et al., 2021; Li et al., 2022; Jain et al., 2024; Zhuo et al., 2025), leaving critical developer activities such as feature implementation, refactoring, configuration, and performance optimization underexplored.

Figure 1: Comparative analysis: (a) task-specific resolve rates, showing model performance across task types (left), and (b) language distribution across benchmarks, showing language coverage (right).

Recent repository-grounded benchmarks, such as SWE-bench and its variants (Jimenez et al., 2024; Yang et al., 2025a; Badertdinov et al., 2025), have improved ecological validity by embedding evaluations in real issues, integrating test oracles, and introducing multi-language (Rashid et al., 2025; Yang et al., 2025a) or multimodal (Yang et al., 2025b) extensions. Yet these efforts largely converge on bug fixing as the dominant evaluation axis. As a result, they neglect the breadth of software engineering workflows that unfold across diverse scenarios, ranging from infrastructure and security engineering to machine learning system development, and across heterogeneous programming ecosystems. This narrowness prevents systematic capability diagnosis and obscures whether strong performance arises from generalizable reasoning or from artifact-specific adaptation.

To address these limitations, as shown in Figure 1, we present SWE-Compass, a unified benchmark comprising 2,000 verified instances for evaluating LLMs’ agentic coding abilities. SWE-Compass spans 8 task types, 8 programming scenarios, and 10 programming languages, combining broad coverage with rigorous evaluation fidelity. Each instance is paired with executable environments and reproducible tests, enabling fair comparison across prompting and agent-based methods under controlled budgets. Importantly, SWE-Compass is built upon four design principles: (i) real-world alignment, ensuring data originates from genuine developer interactions; (ii) comprehensive coverage across diverse tasks and languages; (iii) systematic taxonomy, providing structured labeling and balanced distributions; and (iv) evaluation fidelity, guaranteeing that all instances are executable and verifiable. Together, these principles yield a benchmark that reflects the complexity, diversity, and reproducibility demanded by modern software engineering.

Our contributions are threefold:

Coding LLMs and Agents. Code large language models (Code LLMs) (Chen et al., 2021; Zhao et al., 2024; Chowdhery et al., 2023; Nijkamp et al., 2023; Fried et al., 2023; Xu et al., 2022; Roziere et al., 2023; Hui et al., 2024a; Deng et al., 2025; Que et al., 2024) excel at a wide range of programming tasks, including code generation, completion, repair, translation, code comprehension, documentation generation, and cross-language migration. Crucially, modern Code LLMs combine ultra-long context support with robust tool-calling capabilities, enabling them to maintain global awareness across large codebases while actively invoking editors, shells, debuggers, or web browsers (Liu et al., 2024a; 2025; Wang et al., 2024). This synergy has fueled the rise of agentic coding systems, such as SWE-Agent (Yang et al., 2024), OpenHands (Wang et al., 2025), Claude Code (Anthropic, 2025), QwenCode (Team, 2025), Codex (OpenAI, 2025), and Cline (Cline, 2024), that autonomously plan, search, edit, and test code, and can even browse the web to fetch live API documentation or solutions. As evaluations move toward dynamic, repository-scale workflows, these agent-based systems are showing improved performance over traditional code-generation approaches, particularly in tasks requiring persistent context, environment interaction, and multi-step reasoning (Liu et al., 2024b; He et al., 2025).

Coding Benchmarks. Single-file code benchmarks, such as HumanEval (Zheng et al., 2023a), MBPP (Austin et al., 2021), CodeContests (Li et al., 2022), LiveCodeBench (Jain et al., 2024), and BigCodeBench (Zhuo et al., 2025), evaluate models on isolated algorithmic problems, abstracting away the structural, contextual, and environmental complexity inherent in real-world software engineering (Liu et al., 2024c). SWE-bench (Jimenez et al., 2024) and its variants, including Multimodal SWE-bench (Yang et al., 2025c), SWE-bench Multilingual (Yang et al., 2025a), SWE-bench-Live (Zhang et al., 2025a), SWE-Lancer (Miserendino et al., 2025), SWE-rebench (Badertdinov et al., 2025), and others, have substantially improved ecological validity by grounding evaluation in real repository issues and incorporating dimensions such as visual context, multi-language support, tool interaction, and repository-scale execution. However, they remain overwhelmingly confined to bug fixing as the de facto evaluation paradigm. They neglect the broader spectrum of developer activities, such as feature implementation, refactoring, configuration, performance optimization, and test generation, which unfold across diverse engineering contexts including application and infrastructure development, ML/AI systems, security, and UI/UX. This omission precludes fine-grained, scenario-aware capability analysis and obscures whether model performance stems from general reasoning, domain adaptation, or artifact overfitting. To address this gap, we introduce a benchmark that explicitly structures evaluation along orthogonal axes of task type and programming scenario, enabling systematic diagnosis of model strengths and weaknesses across the multifaceted reality of software development rather than reducing it to a single, narrow slice.

3 SWE-Compass

3.1 Overview

Existing software engineering benchmarks primarily focus on Python-centric bug fixing tasks, exhibiting limited task coverage and insufficient alignment with real-world developer activities. In contrast to such benchmarks, which concentrate on a single programming language and task type, SWE-Compass is constructed from authentic software engineering requirements, as shown in Table 1. It is built from a large volume of high-quality GitHub pull requests that pass through a multi-stage filtering and construction process. The resulting benchmark encompasses 2,000 instances across 8 types of code-related tasks, 8 programming scenarios, and 10 programming languages, as shown in Figure 2. It enables a comprehensive evaluation of key software engineering capabilities, including bug fixing and performance optimization, offering a holistic assessment of model performance in realistic software engineering contexts.

Figure 2: Distributions across task types, programming scenarios and languages.

Table 1: Comprehensive comparison of SWE-Compass with existing benchmarks across different dimensions.

3.2 Design Principles

SWE-Compass is designed around four guiding principles that distinguish it from existing software engineering benchmarks:

Figure 3: Construction of SWE-Compass.

3.3 Benchmark Construction

The construction of SWE-Compass follows a systematic and scalable approach organized into five major steps to ensure comprehensive coverage, balance, and real-world relevance: (1) user analysis, (2) data collection, (3) environment building, (4) task construction, and (5) data validation, as illustrated in Figure 3. Specifically, through an iterative Active Learning procedure applied to real-world coding conversations, we first identified that user needs predominantly fall into eight distinct task types, eight representative programming scenarios, and ten programming languages. We then collected a large volume of high-quality pull request (PR) data from GitHub repositories. By combining automated processing with expert annotation, we successfully built a set of executable development environments. Next, for each of the eight task types, we constructed and synthesized the corresponding task instances. Finally, after a multi-round filtering and quality validation process, we curated the SWE-Compass benchmark as the final dataset.

3.3.1 Step 1: User Analysis

To ensure that the evaluation accurately reflects model capabilities in realistic software development contexts, we collected repository-level coding discussions from two major platforms: Stack Overflow and GitHub. To discover emerging task categories, we designed an automated Active Learning framework for category discovery. Specifically, four popular software-related topics were chosen as initial label seeds for both task types and programming scenarios. Using an In-Context Learning (ICL)-based labeling approach, a large language model (LLM) was employed to annotate the collected conversations across three dimensions: task type, programming scenario, and programming language. Subsequently, tag clustering and LLM-guided seed optimization (via addition, modification, or deletion of tags) were applied to refine the label pool. The iterative process continued until convergence, that is, until the updated seed pool no longer differed significantly from the previous ICL-generated pool. In our experiments, the Qwen3-Coder-30B-A3B-Instruct model (https://huggingface.co/Qwen/Qwen3-Coder-30B-A3B-Instruct) was used as the LLM annotator, and five iterations were performed in total. Ultimately, we identified eight task types, eight programming scenarios, and ten major programming languages as follows.

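Task Types: Feature Implementation, Feature Enhancement, Bug Fixing, Refactoring, Performance Optimization, Code Understanding, Test Case Generation, Configuration & Deployment.

Programming Scenarios: Application Development, Data Science & Engineering, Database Systems, Infrastructure Development, Machine Learning & AI, Security Engineering, Specialized Programming Domains, UI/UX Engineering.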

Programming Languages: Python, JavaScript, TypeScript, Java, C, C++, Go, Rust, Kotlin, C#.
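The iterative discovery loop of Step 1 can be sketched roughly as follows. This is a simplified illustration, not the authors' implementation: the `annotate` callable stands in for the ICL-based LLM annotator, the clustering and seed-optimization steps are collapsed into "the set of labels the annotator produced", and the Jaccard threshold used as the convergence test is an assumption (the text only says the pool "no longer differed significantly").

```python
from typing import Callable, Iterable


def jaccard(a: set[str], b: set[str]) -> float:
    """Overlap between two label pools (1.0 means identical pools)."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def discover_labels(
    conversations: Iterable[str],
    seed_labels: set[str],
    annotate: Callable[[str, set[str]], str],  # hypothetical annotator stub
    threshold: float = 0.9,                    # assumed convergence threshold
    max_iters: int = 5,                        # the text reports five iterations
) -> set[str]:
    """Re-annotate until the label pool stops changing noticeably."""
    texts = list(conversations)
    pool = set(seed_labels)
    for _ in range(max_iters):
        # Label every conversation using the current pool as in-context seeds.
        new_pool = {annotate(text, pool) for text in texts}
        if jaccard(pool, new_pool) >= threshold:
            return new_pool
        pool = new_pool
    return pool


def toy_annotate(text: str, pool: set[str]) -> str:
    """Keyword-based stand-in for the LLM annotator (illustration only)."""
    for label in ("bug fixing", "refactoring", "test generation"):
        if label.split()[0] in text:
            return label
    return "other"


labels = discover_labels(
    ["bug in parser", "refactoring the module", "test coverage is low"],
    seed_labels={"bug fixing"},
    annotate=toy_annotate,
)
```

In this toy run the pool grows from one seed label to three in the first pass and is unchanged in the second, so the loop stops early.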

3.3.2 Step 2: Data Collection

To ensure both coverage and realism in benchmark construction, we conducted the following data collection process. Specifically, we first gathered existing open-source SWE benchmarks (e.g., Python bug fixing datasets) and mapped them to our defined taxonomy of task types, programming scenarios, and languages. As shown in Appendix A.5, these benchmarks exhibit severe deficiencies across multiple dimensions: many task types are missing, scenario distributions are highly imbalanced, and programming languages are heavily skewed. To address these limitations, we further supplemented the dataset with high-quality repositories from GitHub using the following strategy.

After all filtering stages, approximately 50,000 high-quality PRs were preserved, serving as the foundation for subsequent environment and task construction.

3.3.3 Step 3: Environment Building

To enable reproducible execution and evaluation of each software engineering instance, we constructed isolated containerized environments for all selected PRs. For each PR, we automatically extracted environment dependency information—such as package managers, required libraries, build tools, and runtime versions—from configuration files (e.g., requirements.txt, setup.py, Makefile, and CI/CD scripts). These dependencies were programmatically organized into corresponding Dockerfiles, from which initial Docker images were generated.
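As a rough illustration of turning detected configuration files into a Dockerfile, consider the sketch below. The file-to-command mapping, the base image, and the function name are all assumptions made for the example; the paper does not specify how its Dockerfiles are generated.

```python
# Hypothetical mapping from a detected config file to the build step it implies.
INSTALL_COMMANDS = {
    "requirements.txt": "pip install -r requirements.txt",
    "setup.py": "pip install -e .",
    "Makefile": "make",
}


def render_dockerfile(repo_files: list[str], base_image: str = "python:3.11") -> str:
    """Render a minimal Dockerfile from the config files found in a repository."""
    lines = [f"FROM {base_image}", "WORKDIR /repo", "COPY . /repo"]
    for name, command in INSTALL_COMMANDS.items():
        if name in repo_files:
            lines.append(f"RUN {command}")
    return "\n".join(lines) + "\n"


dockerfile = render_dockerfile(["README.md", "requirements.txt", "setup.py"])
print(dockerfile)
```

A real pipeline would also need to pin runtime versions and system packages, which is where most of the build failures described next are likely to originate.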

Each successfully built image was then validated by executing the repository’s native test suite to verify that it could run end-to-end and reproduce the expected test behavior before and after patch application (fail-to-pass and pass-to-pass, i.e., F2P/P2P, consistency). The initial automated build success rate was around 2%, reflecting the inherent complexity and dependency fragility in real-world repositories.
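The F2P/P2P check can be illustrated with a small sketch. It assumes the usual SWE-bench-style meaning of the terms (F2P: a test that fails before the patch and passes after; P2P: a test that passes both before and after); the acceptance rule in `is_consistent` is an assumption for illustration, not the paper's stated criterion.

```python
def split_tests(before: dict[str, bool], after: dict[str, bool]) -> dict[str, list[str]]:
    """Group tests by their pass/fail status before and after the patch."""
    f2p = sorted(t for t, ok in after.items() if ok and not before.get(t, False))
    p2p = sorted(t for t, ok in after.items() if ok and before.get(t, False))
    return {"F2P": f2p, "P2P": p2p}


def is_consistent(before: dict[str, bool], after: dict[str, bool]) -> bool:
    """Assumed rule: at least one F2P test and no previously passing test broken."""
    groups = split_tests(before, after)
    regressions = [t for t, ok in before.items() if ok and not after.get(t, False)]
    return bool(groups["F2P"]) and not regressions


before = {"test_parse": False, "test_render": True}
after = {"test_parse": True, "test_render": True}
groups = split_tests(before, after)
```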

To address build failures, 30 expert annotators inspected the corresponding build logs, identified root causes (e.g., missing dependencies, version conflicts, or OS-level mismatches), and applied targeted fixes before re-triggering the build on Kubernetes. This expert-assisted retry process raised the overall retention rate to approximately 8%.

Finally, we obtained about 4,000 successfully runnable Docker images, each providing a fully reproducible and verifiable execution environment for downstream task synthesis and evaluation.
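The reported figures are mutually consistent if the percentages are taken relative to the roughly 50,000 filtered PRs. That denominator is an assumption, since the text does not state it explicitly, but the arithmetic below reproduces the reported count of about 4,000 images.

```python
# Approximate figures reported in the text.
FILTERED_PRS = 50_000
AUTO_BUILD_RATE_PCT = 2   # initial automated build success rate
FINAL_RETENTION_PCT = 8   # retention after expert-assisted retries


def retained(total: int, rate_pct: int) -> int:
    """Number of items kept at an integer percentage rate."""
    return total * rate_pct // 100


auto_built = retained(FILTERED_PRS, AUTO_BUILD_RATE_PCT)
final_images = retained(FILTERED_PRS, FINAL_RETENTION_PCT)
print(auto_built, final_images)  # 1000 4000
```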

3.3.4 Step 4: Task Building

Given the heterogeneity of the eight software engineering task types, we designed three complementary strategies to construct diverse and representative task instances: (1) Checklist Synthesis, (2) Reverse Masking, and (3) Targeted Filtering. Each strategy was tailored to the specific characteristics of the corresponding task type, ensuring both task realism and evaluation reliability.

3.3.5 Step 5: Data Validation

To ensure both diversity and quality in the final benchmark, we applied a structured sampling and validation process designed to balance task coverage, control instance difficulty, and guarantee overall dataset reliability.

As a result, we constructed the comprehensive benchmark SWE-Compass, which contains 2,000 high-quality instances, well-balanced across task categories, programming scenarios, and languages, providing a rigorous and representative evaluation framework for assessing the capabilities of large language models in real-world software engineering tasks.

3.4 Evaluation Metrics

For each type of task, we select appropriate evaluation metrics to measure the model’s performance. These include:

    1. Pass@1: The fraction of resolved samples achieved under a single attempt with fixed decoding and resource budgets.
    2. Performance Optimization Score: A binary indicator (0/1). The score is 1 if the model’s optimized code passes a single test and its execution time is less than 80% of the time taken by the unoptimized code; otherwise, the score is 0.
    3. Line Coverage: This metric evaluates the extent to which the program code has been executed during test case execution. The formula for calculating line coverage is:
      $$\text{Line Coverage}=\frac{\text{Number of Executed Code Lines}}{\text{Total Number of Code Lines}}\times 100\%$$
    4. LLM-As-A-Judge Score: Following (Zheng et al., 2023b; Zhang et al., 2025b; c; Li et al., 2025), we use a large language model (LLM) to review the model output according to a checklist; the final score is the proportion of checkpoints passed by the model output.
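The four metrics above reduce to simple ratios and one threshold test. The sketch below restates them as functions; the function and parameter names are illustrative and are not taken from the benchmark's code.

```python
def pass_at_1(resolved: list[bool]) -> float:
    """Fraction of instances resolved in a single attempt."""
    return sum(resolved) / len(resolved)


def perf_opt_score(test_passes: bool, optimized_s: float, baseline_s: float) -> int:
    """1 if the test passes and runtime is below 80% of the baseline, else 0."""
    return int(test_passes and optimized_s < 0.8 * baseline_s)


def line_coverage(executed_lines: int, total_lines: int) -> float:
    """Percentage of code lines executed by the generated tests."""
    return 100.0 * executed_lines / total_lines


def judge_score(satisfied_items: int, total_items: int) -> float:
    """Proportion of checklist items the judge marks as satisfied."""
    return satisfied_items / total_items


print(pass_at_1([True, False, False, True]))   # 0.5
print(perf_opt_score(True, 7.0, 10.0))         # 1
print(line_coverage(45, 60))                   # 75.0
print(judge_score(3, 4))                       # 0.75
```

Note that the threshold is strict: an optimized run taking exactly 80% of the baseline time scores 0 under this reading.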

The metrics are assigned to tasks as follows. For Feature Implementation, Feature Enhancement, Bug Fixing, and Refactoring, Pass@1 is used to measure the model’s performance. For Performance Optimization, the Performance Optimization Score is used. For Test Case Generation, we employ Line Coverage to assess the quality of the test cases generated by the model. In our implementation, we use pytest (Pajankar, 2017) to compute line coverage for Python, and C8 (Vassudanagunta, 2025) for TypeScript and JavaScript. For Code Understanding, we use the LLM-As-A-Judge Score to evaluate the accuracy of the model’s understanding of the code. The specific prompt can be found in Appendix A.1.

4 Experiments

4.1 Evaluated LLMs and Frameworks

Benchmarks and Tracks

We evaluate SWE-Compass under two tracks (Executable and Non-executable). By default, we aggregate results in a distribution-aligned manner over Task Type × Programming Scenario × Language, following §3, with fixed seeds. Construction scale and composition are described in §3 (Table 1).

Frameworks

We evaluate two offline agent workflows with identical executors: SWE-Agent (hardened edit–diff–execute loop) and Claude Code (sandboxed, editor-centric with parallel tool calls). Both use containerized, network-disabled toolchains with standardized build/test commands and execution hardening for reproducibility; complete workflow notes (including the parallel tool-call prompt) and the command matrix are in Appendix A.4.

Environment, Budgets, and Metrics

All evaluations run in fixed offline containers with unified budgets and executors; networking is disabled, and retries are not used. We adopt a single-attempt setting with standard decoding and turn/time limits, and evaluate using the task-type–aligned decision rules and metrics defined in §3. Exact configuration (timeouts, hardware, container versions, context windows, caches) and any method-specific deviations are provided in Appendix A.4.
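A single-attempt, time-limited evaluation step can be sketched as below. This is only an illustration of the "no retries, fixed time limit" policy: the function name and the three outcome labels are assumptions, and the actual harness (containers, disabled networking, turn limits) is described in Appendix A.4, not here.

```python
import subprocess
import sys


def run_once(command: list[str], timeout_s: float) -> str:
    """Run a test command exactly once; never retry on failure or timeout."""
    try:
        result = subprocess.run(command, capture_output=True, timeout=timeout_s)
    except subprocess.TimeoutExpired:
        return "timeout"
    return "pass" if result.returncode == 0 else "fail"


# Stand-in commands; a real harness would invoke the repository's test suite.
ok = run_once([sys.executable, "-c", "pass"], timeout_s=30)
bad = run_once([sys.executable, "-c", "import sys; sys.exit(1)"], timeout_s=30)
slow = run_once([sys.executable, "-c", "import time; time.sleep(5)"], timeout_s=0.5)
```

Treating a timeout as its own outcome, rather than as a plain failure, makes it possible to separate budget exhaustion from incorrect patches when analyzing results.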

LLMs

We evaluate 10 models under a unified leaderboard (no reasoning/non-reasoning split): Claude-Sonnet-4-20250514, Qwen3-Coder-480B-A35B-Instruct, Qwen3-Coder-30B-A3B-Instruct, Qwen3-235B-A22B-Instruct-2507, Kimi-K2-Instruct-0905, Gemini-2.5-Pro, Gemini-2.5-Flash, GPT-4.1-2025-04-14, DeepSeek-V3-0324, and SWE-agent-LM-32B. Model API pages and open-source deployment links are provided in Appendix A.6.

4.2 Experimental Results

4.2.1 Main Results

Table 2: Main results by task types on SWE-Compass. AVG is the macro-average across task types. Abbreviations: FI=Feature Implementation; FE=Feature Enhancement; BF=Bug Fixing; RF=Refactoring; PO=Performance Optimization; CU=Code Understanding; TG=Test Case Generation; CD=Configuration & Deployment.
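Because AVG is a macro-average, every task type contributes equally regardless of how many instances it has. The sketch below contrasts it with an instance-weighted (micro) average; the scores and counts are made-up numbers for illustration, not values from Table 2.

```python
from statistics import mean


def macro_average(scores: dict[str, float]) -> float:
    """Unweighted mean of per-task-type scores."""
    return mean(list(scores.values()))


def micro_average(scores: dict[str, float], counts: dict[str, int]) -> float:
    """Mean weighted by the number of instances in each task type."""
    total = sum(counts.values())
    return sum(scores[k] * counts[k] for k in scores) / total


# Illustrative values only.
scores = {"BF": 20.0, "CU": 40.0}
counts = {"BF": 300, "CU": 100}
print(macro_average(scores))          # 30.0
print(micro_average(scores, counts))  # 25.0
```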

Table 2 reports Pass@1 by task type. Claude-Sonnet-4-20250514 ranks first under both workflows (32.9% with Claude Code; 31.8% with SWE-Agent). Scores largely cluster in the low-to-mid 20s (overall range roughly 10–33%). Contrary to a monotonic advantage, the two workflows are complementary: among the five overlapping models, only two achieve higher AVG with Claude Code, whereas three obtain higher AVG with SWE-Agent. Among open-weight systems, Qwen3-Coder-480B-A35B-Instruct reaches 27.2% with SWE-Agent and 21.9% with Claude Code, still below the best proprietary model. The findings across task types and workflows are as follows:

Findings by task type.

A consistent but nuanced hierarchy emerges (Table 2). Code Understanding (CU) is among the strongest categories across models. Configuration & Deployment (CD) can be high for some systems (e.g., Claude-Sonnet-4) but exhibits sizable cross-model variance, so it is not uniformly easy. Feature Enhancement (FE) and Refactoring (RF) occupy a middle tier. Feature Implementation (FI) and Bug Fixing (BF) are harder, reflecting localization and integration challenges. Test Case Generation (TG) and Performance Optimization (PO) remain challenging, but not to single-digit averages; results typically fall in the mid-teens to mid-20s depending on the model. Method-wise, SWE-Agent tends to be stronger on BF and parts of FI that benefit from iterative localization, whereas Claude Code shows advantages on TG and some CD cases with more deterministic signals; CU is broadly comparable across the two.

Framework comparison.

Across most settings, the two agents exhibit complementary strengths. Mechanistically, SWE-Agent’s edit–diff–execute loop favors investigative, multi-file tasks that reward iterative localization, at the cost of higher timeout exposure; Claude Code’s sandboxed, editor-centric workflow yields strong performance on well-scoped, deterministic tasks (e.g., CD, CU, TG), benefitting from lower tool overhead. We also observe a trade-off with interaction efficiency: improvements in Eval score often coincide with higher average interaction turns (cf. Figure 4 and Figure 5), with diminishing returns beyond moderate turn counts, suggesting that future gains require better localization and hypothesis pruning rather than simply more exploration.

4.2.2 Further Analysis

Figure 4: Comparison of Pass@1 (%) across the top programming languages for SWE-Agent. Bars represent Pass@1; languages are ordered by overall Pass@1. This plot highlights whether improvements are concentrated in specific languages.

Figure 5: Distribution of interaction turns required per language for selected models to reveal trade-offs between effort (turns) and success. This highlights whether models achieve high Pass@1 by spending more turns on particular languages.

Language-level observations.

Figure 4 and Appendix Table 4 indicate a consistent cross-language stratification across models and agents. JVM ecosystems and JavaScript tend to score higher (Java/Kotlin/JavaScript), while TypeScript is notably lower; systems languages (C/C++/Rust/Go) are harder; Python appears mid-tier overall, partly reflecting dataset selection effects, since open-source benchmarks over-index on difficult Python bug-fixing cases (Appendix A.5). For Claude-Sonnet-4, Claude Code shows gains on Java/JavaScript, but this is not universal across models; in C# and C/C++/Rust/Go, SWE-Agent often matches or outperforms. These patterns suggest that performance is governed more by tooling determinism and diagnosability than by raw coding difficulty. They point to prioritizing repository-level localization and environment hardening for systems/Python stacks, and hypothesis pruning for deterministic JVM/JS pipelines. See Figure 6 for a visualization.

Interaction turns vs. success (by language).

Figure 5 shows per-language turn distributions with Pass@1 overlays. Deterministic ecosystems (Java/Kotlin/JavaScript/C#) have lower medians and tighter IQRs under Claude Code, while achieving similar or higher Pass@1—gains come from reliable signals rather than more turns. Systems languages (C/C++/Rust/Go) exhibit heavier tails, especially for SWE-Agent, with clear diminishing returns; Rust is most brittle. Python shows high variance: pinned environments converge quickly under Claude Code, while heterogeneity pushes SWE-Agent to many low-yield turns. Overall, prioritize repository-level localization for systems languages and environment hardening for Python; in JVM/JS, focus on sharper hypothesis pruning and parallel validation.
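The per-language summary statistics mentioned above (median and interquartile range of interaction turns) can be computed as in the sketch below. The turn counts are invented for the example, and the choice of the "inclusive" quantile method is an assumption; the paper does not say how its IQRs were computed.

```python
from statistics import median, quantiles


def turn_summary(turns: list[int]) -> dict[str, float]:
    """Median and interquartile range (IQR) of interaction turns."""
    q1, _, q3 = quantiles(turns, n=4, method="inclusive")
    return {"median": median(turns), "iqr": q3 - q1}


# Illustrative turn counts for one language and one agent.
summary = turn_summary([10, 20, 30, 40, 50])
print(summary)  # {'median': 30, 'iqr': 20.0}
```

A heavy right tail, as described for systems languages, would show up as a large gap between the median and the maximum even when the IQR stays moderate, so it is worth reporting a tail statistic alongside these two.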

Figure 6 (two panels): (a) This plot illustrates the relationship between skill specialization and performance consistency. Skill specialization is measured by the variability in performance across different tasks/languages, with higher values indicating stronger performance in specific tasks/languages compared to others. Performance consistency, on the other hand, reflects how stable the performance is across different difficulty levels, with higher values indicating more consistent performance over time. (b) The graph shows the relationship between performance variability and overall performance. Performance variability measures how much performance fluctuates across different tasks/languages, with higher values indicating greater inconsistency. Overall performance is calculated as the average score across tasks/languages, with higher values indicating better overall performance.

Consistency and specialization across languages.

Figure 6 summarizes whether strong aggregate results are broad-based or concentrated. In panel (a), aggregate Pass@1 and the per-language median Pass@1 show a clear visual positive trend, indicating that higher-ranked systems tend to improve more consistently across languages rather than relying on a single language. Top systems cluster toward the right with higher consistency, whereas mid-tier models are more dispersed with lower medians. In panel (b), higher overall performance visually coincides with lower cross-language/task variability (coefficient of variation, CV), suggesting that stronger models are generally less variable. Together these observations support our earlier findings: (i) improvements at the top reflect broad gains in localization and execution reliability rather than narrow specialization; (ii) reducing cross-language variance is an effective lever for closing the gap, especially on systems languages that contribute disproportionately to variability; and (iii) evaluation protocols should report both central tendency and dispersion to avoid overstating gains driven by a subset of languages.
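The coefficient of variation (CV) used as the variability measure is the standard deviation divided by the mean. A minimal sketch follows; it assumes the population standard deviation, which the text does not specify, and uses made-up per-language scores.

```python
from statistics import mean, pstdev


def coefficient_of_variation(scores: list[float]) -> float:
    """CV = population standard deviation divided by the mean."""
    return pstdev(scores) / mean(scores)


# Illustrative per-language Pass@1 values for two hypothetical models.
uniform_model = [20.0, 20.0, 20.0]
uneven_model = [10.0, 30.0]
print(coefficient_of_variation(uniform_model))  # 0.0
print(coefficient_of_variation(uneven_model))   # 0.5
```

Both hypothetical models have the same mean score of 20, which is why reporting dispersion alongside the average matters: the CV separates them where the mean cannot.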

Notably, Claude-Sonnet-4-20250514 lies at the far right with one of the lowest variabilities, reflecting broad cross-language gains.

Table 3: Scores on different scenarios: Abbreviations: AD=Application Development; DE=Data Science & Engineering; DS=Database Systems; ID=Infrastructure Development; ML=Machine Learning & AI; SE=Security Engineering; SPD=Specialized Programming Domains; UI/UX=UI/UX Engineering; AVG=macro-average.

Fine-grained scenario analysis.

Table 3 shows that scenario difficulty closely tracks tooling determinism and the locality of required edits. High-scoring categories such as UI/UX Engineering, Security Engineering, and Application Development combine mature frameworks with clear oracles and fast-running tests, where Claude Code’s editor-centric workflow converts stable feedback into higher Pass@1 with fewer turns. In contrast, Database Systems, Infrastructure Development, ML/AI, and Specialized Programming Domains involve multi-stage builds, cross-process dependencies, or non-deterministic outputs; here, SWE-Agent’s iterative localization is often more resilient but also more exposed to timeouts. The consistent average advantage of Claude Code over SWE-Agent across scenarios (in our setting) is therefore concentrated in pipelines with reliable, low-variance signals. To close the remaining gaps, future systems should (i) enhance repository-level observability and reproducibility for complex stacks (minimal repro scripts, pinned environments, artifact isolation), and (ii) invest in hypothesis pruning and parallel verification for deterministic stacks, where the bottleneck is search efficiency rather than raw exploration budget.

Failure Mode Analysis.

Figure 7: Distribution of trajectory failure modes on SWE-Compass. Abbreviations: RMI=Requirement Misinterpretation, ISE=Incomplete Solution & Side Effects, TIE=Tool Invocation Error, IAT=Inadequate Testing, TKG=Technical Knowledge Gap, INF=Infinite Loop, OTH=Others.

To systematically understand the limitations of current coding agents, we perform a post-hoc failure analysis on SWE-Agent trajectories from our SWE-Compass benchmark. Following Yang et al. (2024)—who report 87% agreement between automated LLM judges and human experts—we adopt an LLM-as-Judge protocol with Claude-Sonnet-4 as the judge; the exact prompt is provided in Appendix A.3. We sample 600 failed trajectories per model for three representative systems: Claude-Sonnet-4, Qwen3-Coder-480B, and Gemini-2.5-Pro.

Specifically, through manual inspection of submitted-but-failed trajectories, we develop a comprehensive six-category taxonomy capturing actual root causes:

    1. Requirement Misinterpretation: The agent failed to properly understand and locate the problem, including misidentifying affected files, misjudging severity, confusing problem types, or failing to identify root causes and understand dependencies, data flow, or system architecture.
    2. Inadequate Testing: The agent provided incomplete test coverage, missing edge cases, compatibility issues, performance impacts, integration scenarios, or multi-platform testing requirements.
    3. Incomplete Solution & Side Effects: The agent provided an incomplete fix that only addressed symptoms rather than root causes, or introduced new issues, including regressions, security vulnerabilities, environment configuration errors, data corruption risks, or breaking changes to existing functionality.
    4. Technical Knowledge Gap: The agent demonstrated insufficient technical proficiency or violated domain-specific conventions, including lacking necessary knowledge in specialized domains (UI/frontend, security, accessibility, DevOps, performance, analytics) or incorrectly handling domain-specific issues (data processing, security implementations, UI/UX standards, API design, documentation synchronization).
    5. Tool Invocation Error: The agent encountered errors while using tools due to incorrect syntax, context overflow from file operations, or parse/analysis tool failures.
    6. Infinite Loop: The agent got stuck in loops without convergence, including repeated attempts at the same solution, oscillating between decisions, or endlessly reading files without making progress.

Note that OTH (Other) denotes rare cases that are not covered by the above taxonomy (e.g., corrupted artifacts or external executor glitches). The figure caption lists all abbreviations for completeness. As shown in Figure 7, we report the per-model distribution of failure modes on SWE-Compass. Based on 600 error traces per model, we draw the following conclusions: (1) Shared bottlenecks in comprehension and implementation. All models exhibit high error rates in Requirement Misinterpretation (30–34%) and Incomplete Solution & Side Effects (29–42%), together accounting for more than 60% of failures. By contrast, Technical Knowledge Gap is consistently low (5–8%), suggesting the core limitations lie in requirement grounding and holistic solution design rather than basic coding proficiency. (2) Distinct model characteristics. Claude-Sonnet-4 is the most balanced, showing the lowest Technical Knowledge Gap (4.7%) but room to improve on Inadequate Testing (20.8%). Qwen3-Coder-480B has the highest Incomplete Solution & Side Effects rate (42%, vs. Claude-Sonnet-4’s 32.7%), revealing weaknesses in end-to-end design. Gemini-2.5-Pro shows the highest Requirement Misinterpretation (34%) and a notable Infinite Loop issue (8.3%), posing reliability risks in production.
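Turning judge-assigned labels into the per-model percentages discussed above is a simple counting step. The sketch below uses the abbreviations from Figure 7 with an invented list of ten labels; the real analysis uses 600 trajectories per model.

```python
from collections import Counter


def failure_distribution(labels: list[str]) -> dict[str, float]:
    """Percentage of failed trajectories assigned to each failure mode."""
    counts = Counter(labels)
    total = len(labels)
    return {mode: 100.0 * n / total for mode, n in counts.items()}


# Illustrative labels only (RMI, ISE, IAT, TKG as defined in Figure 7).
labels = ["RMI"] * 3 + ["ISE"] * 4 + ["IAT"] * 2 + ["TKG"]
dist = failure_distribution(labels)
combined = dist["RMI"] + dist["ISE"]  # share of the two dominant modes
```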

5 Conclusion

We introduced SWE-Compass, a unified benchmark that enables systematic evaluation of large language models across diverse software engineering tasks, scenarios, and languages. By integrating 2,000 verified instances derived from real-world GitHub repositories with reproducible execution environments, SWE-Compass provides comprehensive coverage of the software development lifecycle. Our large-scale experiments with ten state-of-the-art LLMs under two agentic frameworks reveal consistent hierarchies of task difficulty, language-specific variability, and dominant failure modes rooted in requirement misinterpretation and incomplete solutions. These findings highlight that future progress in automated software engineering depends less on isolated code generation improvements and more on enhancing requirement grounding, environment reliability, and reasoning consistency. SWE-Compass offers a rigorous, scalable, and reproducible foundation for advancing the next generation of robust, general-purpose coding agents.

6 Future Works

We see several directions to extend SWE-Compass and strengthen the community’s ability to measure and drive progress:

References

Appendix A Appendix

A.1 Judge Prompt for Code Understanding Task

Judge Prompt for Code Understanding Task

Evaluate if the answer satisfies the question requirements using a natural language explanation.

QUESTION: {question_text}

REQUIREMENTS (checklist items): {chr(10).join(checklist_text)}

{patch_section}

ANSWER: {truncated_answer}

EVALUATION RULES:

  1. Answer MUST use clear English explanations, NOT just code diffs, or other types of content
  2. Only give 1.0 when ALL checklist items are thoroughly satisfied with clear explanations.
  3. Score = (satisfied items) / (total items)
  4. Penalize: code diffs without explanation, vague statements, wrong info. Give 0.0 when the answer is just code diffs or completely wrong.

JSON response: {{ "reasoning": "Brief explanation of which items satisfied/unsatisfied and why", "score": , "satisfied_items": ["item_id1", ...] }}
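Rule 3 defines the score as the fraction of satisfied checklist items. A minimal sketch of how a judge reply could be parsed and checked against that rule, assuming the reply is a JSON object with the fields shown in the prompt; the helper `check_judge_reply` and the sample item IDs are hypothetical.

```python
import json

def check_judge_reply(reply_text, checklist_ids):
    """Parse a judge reply and recompute score = satisfied items / total items."""
    reply = json.loads(reply_text)
    valid = set(checklist_ids)
    # Keep only IDs that are actually on the checklist, ignoring duplicates.
    satisfied = {i for i in reply.get("satisfied_items", []) if i in valid}
    expected = len(satisfied) / len(valid) if valid else 0.0
    reported = float(reply.get("score", 0.0))
    return {
        "reported": reported,
        "expected": expected,
        "consistent": abs(reported - expected) < 1e-6,
    }

# Hypothetical reply for a four-item checklist.
sample = (
    '{"reasoning": "two of four items covered", '
    '"score": 0.5, "satisfied_items": ["c1", "c3"]}'
)
result = check_judge_reply(sample, ["c1", "c2", "c3", "c4"])
print(result)
```

A mismatch between the reported and recomputed score is not necessarily a judge error: rule 4 permits a score of 0.0 when the answer consists only of code diffs, regardless of which items are listed as satisfied.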

A.2 Claude Code: Parallel Tool-Calls System Prompt

The following prompt is appended via SDK to encourage parallel tool invocations when operations are independent.

Claude Code: Parallel Tool-Calls System Prompt

For maximum efficiency, whenever you perform multiple independent operations, invoke all relevant tools simultaneously rather than sequentially. Prioritize calling tools in parallel whenever possible. For example, when reading 3 files, run 3 tool calls in parallel to read all 3 files into context at the same time. When running multiple read-only commands like `ls` or `list_dir`, always run all of the commands in parallel. Err on the side of maximizing parallel tool calls rather than running too many tools sequentially.
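The prompt asks the agent to issue independent read-only operations together rather than one after another. As a rough analogy only (this is not how Claude Code dispatches tool calls, and `read_all` is a hypothetical helper), the same idea applied to reading three independent files looks like this:

```python
import tempfile
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path

def read_all(paths):
    """Read independent files concurrently; results keep the input order."""
    with ThreadPoolExecutor(max_workers=max(1, len(paths))) as pool:
        return list(pool.map(lambda p: Path(p).read_text(encoding="utf-8"), paths))

# Create three throwaway files so the example is self-contained.
with tempfile.TemporaryDirectory() as tmp:
    paths = []
    for i in range(3):
        p = Path(tmp) / f"file_{i}.txt"
        p.write_text(f"contents {i}", encoding="utf-8")
        paths.append(p)
    contents = read_all(paths)

print(contents)
```

The reads do not depend on one another, so submitting them together is safe; operations whose inputs depend on earlier outputs would still have to run sequentially.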

A.3 Trajectory Failure Analysis Prompt

Trajectory Failure Analysis Prompt

System Prompt:

You are an expert software engineer analyzing why a software engineering agent failed to resolve an issue.

AVAILABLE AGENT ACTIONS:

---- BEGIN FUNCTION #1: bash ---- Description: Execute a bash command in the terminal.

Parameters: (1) command (string, required): The bash command to execute. Can be empty to view additional logs when previous exit code is `-1`. Can be `ctrl+c` to interrupt the currently running process. ---- END FUNCTION #1 ----

---- BEGIN FUNCTION #2: submit ---- Description: Finish the interaction when the task is complete OR if the assistant cannot proceed further with the task.

---- BEGIN FUNCTION #3: str_replace_editor ---- Description: Custom editing tool for viewing, creating and editing files

Notes for using the `str_replace` command:

Parameters: (1) command (string, required): The commands to run. Allowed options are: `view`, `create`, `str_replace`, `insert`, `undo_edit`. (2) path (string, required): Absolute path to file or directory, e.g. `/repo/file.py` or `/repo`. (3) file_text (string, optional): Required parameter of `create` command, with the content of the file to be created. (4) old_str (string, optional): Required parameter of `str_replace` command containing the string in `path` to replace. (5) new_str (string, optional): Optional parameter of `str_replace` command containing the new string (if not given, no string will be added). Required parameter of `insert` command containing the string to insert. (6) insert_line (integer, optional): Required parameter of `insert` command. The `new_str` will be inserted AFTER the line `insert_line` of `path`. (7) view_range (array, optional): Optional parameter of `view` command when `path` points to a file. If none is given, the full file is shown. If provided, the file will be shown in the indicated line number range, e.g. [11, 12] will show lines 11 and 12. Indexing at 1 to start. Setting `[start_line, -1]` shows all lines from `start_line` to the end of the file. ---- END FUNCTION #3 ----

---- BEGIN FUNCTION #4: file_viewer ---- Description: Interactive file viewer for opening and navigating files in the editor.

Parameters: (1) command (string, required): One of `open`, `goto`, `scroll_down`, `scroll_up`. (2) path_or_line (string/int, optional): For `open`, a path (and optional line). For `goto`, a line number. ---- END FUNCTION #4 ----

---- BEGIN FUNCTION #5: search_tools ---- Description: Searching utilities for locating text or files within the workspace.

Parameters: (1) subcommand (string, required): One of `search_file`, `search_dir`, `find_file`. (2) arg1 (string, required): The search term or file name, depending on subcommand. (3) arg2 (string, optional): Target file (for search_file) or directory (for search_dir/find_file). ---- END FUNCTION #5 ----

---- BEGIN FUNCTION #6: edit_block ---- Description: Block editor for replacing ranges in the current open file and finalizing edits.

Parameters: (1) command (string, required): `edit` or `end_of_edit`. (2) range_and_text (varies): For `edit`, a line range `n:m` and the replacement text. ---- END FUNCTION #6 ----

---- BEGIN FUNCTION #7: create_file ---- Description: Creates and opens a new file with the given name.

Parameters: (1) filename (string, required): Absolute or workspace-relative path to create. The file must not already exist. ---- END FUNCTION #7 ----

##PROBLEM STATEMENT## {problem_statement}

##TRAJECTORY SUMMARY##

##ANALYSIS INSTRUCTIONS##

IMPORTANT: This trajectory FAILED in final evaluation. The agent likely believed it succeeded, but it was WRONG.

The agent may have:

Despite these apparent indicators of success, the final evaluation proves the solution was INCORRECT. Therefore, ignore the agent’s self-assessment and focus on identifying the actual flaws. Select ONE category below that best describes the actual flaw:

Requirement Misinterpretation: The agent failed to properly understand and locate the problem, including misidentifying affected files, misjudging severity, confusing problem types, or failing to identify root causes and understand dependencies, data flow, or system architecture.

Inadequate Testing: The agent provided incomplete test coverage, missing edge cases, compatibility issues, performance impacts, integration scenarios, or multi-platform testing requirements.

Incomplete Solution & Side Effects: The agent provided an incomplete fix that only addressed symptoms rather than root causes, or introduced new issues including regressions, security vulnerabilities, environment configuration errors, data corruption risks, or breaking changes to existing functionality.

Technical Knowledge Gap: The agent demonstrated insufficient technical proficiency or violated domain-specific conventions, including lacking necessary knowledge in specialized domains (UI/frontend, security, accessibility, DevOps, i18n, performance, analytics) or incorrectly handling domain-specific issues (data processing, security implementations, UI/UX standards, API design, documentation synchronization).

Tool Invocation Error: The agent encountered errors while using tools due to incorrect syntax, context overflow from file operations, or parse/analysis tool failures.

infinite_loop: The agent got stuck in loops without convergence, including repeated attempts at the same solution, oscillating between decisions, or endlessly reading files without making progress.

other: The agent failed to resolve the issue for reasons not covered by the above categories.

Do NOT invent or propose new categories. If none fits, use "other". Category must be all lowercase with underscores. Remember to write two new lines before the category.

User Prompt:

##INSTANCE INFORMATION## Instance ID: {instance_id}

##The complete trajectory of the interaction (to be analyzed)## {traj_text}

##OUTPUT FORMAT## You MUST provide your response in this exact format: xxx

xxx If the Assistant gets stuck in a loop or encounters a tool_error error, indicate the incorrect action and parameters. If the Assistant misunderstands the question, set error_action="None".
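The system prompt above requires exactly one category, written in lowercase with underscores, with "other" as the fallback. A minimal sketch of normalizing a judge's label under those rules follows; the snake_case keys for the title-case category names are an assumption, since the prompt spells out only `infinite_loop` and `other` in that form, and `normalize_category` is a hypothetical helper.

```python
import re

# Assumed snake_case keys for the seven categories listed in the prompt.
ALLOWED = {
    "requirement_misinterpretation",
    "inadequate_testing",
    "incomplete_solution_side_effects",
    "technical_knowledge_gap",
    "tool_invocation_error",
    "infinite_loop",
    "other",
}

def normalize_category(raw):
    """Map a free-form label to an allowed key, falling back to 'other'."""
    # Lowercase, then collapse every run of non-alphanumerics into one underscore.
    key = re.sub(r"[^a-z0-9]+", "_", raw.strip().lower()).strip("_")
    return key if key in ALLOWED else "other"

print(normalize_category("Incomplete Solution & Side Effects"))
print(normalize_category("Hallucinated API"))
```

Mapping unrecognized labels to "other" mirrors the instruction not to invent new categories, so downstream counts stay within the fixed taxonomy.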

A.4 Executor Details and Method-Specific Settings

Unless otherwise noted, all runs are strictly offline. Below we record method-specific configurations referenced in §4:

Standardized offline build/test commands (per language).

We standardize non-interactive commands to ensure reproducible builds and comparable feedback signals across languages:

Execution hardening and navigation controls.

To minimize offline flakiness and improve determinism, we apply:

Table 4: Top-10 languages: Pass@1 (%) per model. Columns are languages; rows are models grouped by agent.


Figure 8: Distribution of modified files and lines involved in golden patches.

A.5 Analysis of Open-Source Benchmark Distributions


Figure 9: Distributions across task types, programming scenarios, and languages in Open-Source SWE Benchmarks and GitHub PR & Issue. Abbreviations: FE: Feature Enhancement, FI: Feature Implementation, CD: Configuration & Deployment, CU: Code Understanding, PO: Performance Optimization, TG: Test Case Generation, BF: Bug Fixing, RF: Refactoring; ID: Infrastructure Development, SPD: Specialized Programming Domains, DE: Data Science & Engineering, SE: Security Engineering, AD: Application Development, DS: Database Systems, ML: Machine Learning & AI, UI/UX: UI/UX Engineering.

We annotated several repository-level SWE benchmark datasets, including SWE-bench-Verified (n=500), SWE-bench-Live (n=500), SWE-bench-Multilingual (n=300), SWE-bench-Pro (n=731), and SWE-rebench (n=449), totaling 2,480 instances across multiple programming languages and scenarios. Figure 9(a) presents the distribution of these open-source datasets across task types, programming scenarios, and languages. Through detailed analysis, we identified the following limitations:

We evaluate a diverse set of state-of-the-art language models, including both closed-source and open-source models. The closed-source models include Claude-Sonnet-4-20250514 [Anthropic, 2025], Gemini-2.5-Flash, Gemini-2.5-Pro [Gemini Team and Google, 2023], and GPT-4.1-2025-04-14 [OpenAI et al., 2024]. For open-source models, we use the Qwen3-Coder series [Team, 2025, Hui et al., 2024b], Kimi-K2-Instruct-0905 [Team et al., 2025a], DeepSeek-V3-0324 [Liu et al., 2024d], and SWE-agent-LM-32B [Yang et al., 2025a]. The complete list of models with their official links is provided in Table 5.

Table 5: Model List.