CodeClash: Benchmarking Goal-Oriented Software Engineering


👋 Overview

CodeClash is a benchmark for evaluating AI systems on goal-oriented software engineering.

Today's AI coding evals are _task_-oriented (e.g., HumanEval, SWE-bench): models are given explicit instructions, and correctness is then verified with unit tests.

But building software is fundamentally driven by goals ("improve user retention", "reduce costs", "increase revenue"). Reaching our goals via code is a self-directed, iterative, and often competitive process. To capture this dynamism of real software development, we introduce CodeClash!

Check out our arXiv paper and website for the full details!

🏎️ Quick Start

Prerequisites

Installation

Clone the repository

git clone https://github.com/CodeClash-ai/CodeClash.git
cd CodeClash

Install uv (if you haven't already)

curl -LsSf https://astral.sh/uv/install.sh | sh

Install dependencies and create virtual environment

uv sync --extra dev

Set up your environment variables

cp .env.example .env # Then edit .env with your GITHUB_TOKEN

Run a test battle

uv run python main.py configs/test/battlesnake.yaml

Tip

CodeClash requires Docker to create execution environments. CodeClash was developed and tested on Ubuntu 22.04.4 LTS. The same instructions should work for Mac. If not, check out #81 for an alternative solution.

Alternative: Using pip (not recommended)

pip install -e '.[dev]'
python main.py configs/test/battlesnake.yaml

Once this works, you should be set up to run a real tournament! To run Claude Sonnet 4.5 against o3 in a BattleSnake tournament with 5 rounds and 1000 competition simulations per round, run:

uv run python main.py configs/examples/BattleSnake__claude-sonnet-4-5-20250929__o3__r5__s1000.yaml
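The example config name appears to encode the tournament settings described above: arena, competing models, round count, and simulations per round. The sketch below parses that pattern. Note that the `<arena>__<player>__...__r<rounds>__s<sims>.yaml` convention is inferred from this single filename, and `TournamentSpec` and `parse_config_name` are hypothetical helpers, not part of CodeClash; the YAML file itself remains the source of truth.

```python
import re
from dataclasses import dataclass
from pathlib import Path


@dataclass(frozen=True)
class TournamentSpec:
    """Settings read off a config filename (hypothetical helper type)."""
    arena: str
    players: tuple[str, ...]
    rounds: int
    sims_per_round: int


def parse_config_name(path: str) -> TournamentSpec:
    """Parse a name of the assumed form <arena>__<player>__...__r<N>__s<M>.yaml."""
    parts = Path(path).stem.split("__")
    if len(parts) < 4:
        raise ValueError(f"unexpected config name: {path}")

    # First field is the arena, last two are the round and simulation tags,
    # everything in between is treated as a player (model) name.
    arena, *players, rounds_tag, sims_tag = parts
    rounds = re.fullmatch(r"r(\d+)", rounds_tag)
    sims = re.fullmatch(r"s(\d+)", sims_tag)
    if rounds is None or sims is None:
        raise ValueError(f"unexpected config name: {path}")

    return TournamentSpec(
        arena=arena,
        players=tuple(players),
        rounds=int(rounds.group(1)),
        sims_per_round=int(sims.group(1)),
    )
```

Calling `parse_config_name` on the example path above yields the arena `BattleSnake`, the players `claude-sonnet-4-5-20250929` and `o3`, 5 rounds, and 1000 simulations per round.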

⚔️ How It Works

In CodeClash, 2+ LM agents compete in a code arena over the course of a multi-round tournament.

Throughout the tournament, each agent iteratively improves its own codebase to achieve a high-level, competitive objective (e.g., accumulate resources or survive the longest).

Each round consists of two phases:

Critically, LMs don't play the game directly. Their code serves as their competitive proxy. The winner is the LM agent who wins the most rounds.
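The tournament structure can be sketched as a simple loop. This is a minimal illustration, not the CodeClash implementation: it assumes each round has every agent edit its codebase and then has the arena run the resulting code to pick a round winner. The `edit` and `compete` callbacks, the `run_tournament` name, and the tie handling (returning `None`) are all hypothetical.

```python
from typing import Callable, Optional, Sequence

# Hypothetical callback types; neither is part of the CodeClash API.
EditFn = Callable[[str, int], None]              # (agent_name, round_number)
CompeteFn = Callable[[Sequence[str], int], str]  # returns the round winner's name


def run_tournament(
    agents: Sequence[str],
    rounds: int,
    edit: EditFn,
    compete: CompeteFn,
) -> tuple[Optional[str], dict[str, int]]:
    """Run `rounds` rounds and return (overall winner, round wins per agent)."""
    round_wins = {name: 0 for name in agents}

    for round_number in range(1, rounds + 1):
        # Assumed first phase: every agent revises its own codebase.
        for name in agents:
            edit(name, round_number)

        # Assumed second phase: the codebases, not the LMs, play the game.
        round_wins[compete(agents, round_number)] += 1

    # The agent with the most round wins takes the tournament.
    best = max(round_wins.values())
    leaders = [name for name in agents if round_wins[name] == best]
    # Tie handling is an assumption: report no winner when leaders are tied.
    winner = leaders[0] if len(leaders) == 1 else None
    return winner, round_wins
```

In a real run, `edit` would hand control to an LM agent working in its own environment and `compete` would launch the arena simulations; here both are left as plug-in functions so the control flow stays visible.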

🧩 Available Arenas

CodeClash includes competitive programming games and simulation-backed arenas, including BattleSnake, CoreWar, CybORG, Halite, HuskyBench, RoboCode, RobotRumble, and SCML.

SCML is a supply-chain negotiation arena based on the ANAC Supply Chain Management League OneShot track. Agents edit a Python scml_agent.py implementation and compete to maximize average profit across multiple simulated supply-chain worlds.

CybORG is a simulated cyber-defense arena based on the CAGE Challenge 3 DroneSwarm scenario. Agents edit a Python cyborg_agent.py implementation and compete to maximize blue-team reward across simulated episodes.
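Both arenas above score an agent by averaging an outcome (profit or blue-team reward) over many simulated runs. A minimal sketch of that kind of ranking follows, assuming per-run scores have already been collected. The `rank_by_mean_score` function and the input layout are hypothetical and do not reflect how CodeClash, SCML, or CybORG report results.

```python
from statistics import mean


def rank_by_mean_score(scores: dict[str, list[float]]) -> list[tuple[str, float]]:
    """Rank agents by their mean score across runs, best first.

    `scores` maps an agent name to its per-run results, e.g. profit per
    simulated world or reward per episode (hypothetical input layout).
    """
    for name, runs in scores.items():
        if not runs:
            raise ValueError(f"no runs recorded for {name}")

    averaged = {name: mean(runs) for name, runs in scores.items()}
    return sorted(averaged.items(), key=lambda item: item[1], reverse=True)
```

Averaging over many worlds or episodes reduces the effect of a single lucky or unlucky simulation on the final ranking.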

🚀 Get Involved

💫 Contributions

We're actively working on several follow-ups! Check out the Contributing Guide for more.

Contact: John Yang, Kilian Lieret (Email: johnby@stanford.edu, kl5675@princeton.edu)

🪪 License

MIT. Check LICENSE for more information.

✍️ Citation

@misc{yang2025codeclashbenchmarkinggoalorientedsoftware,
  title={CodeClash: Benchmarking Goal-Oriented Software Engineering},
  author={John Yang and Kilian Lieret and Joyce Yang and Carlos E. Jimenez and Ofir Press and Ludwig Schmidt and Diyi Yang},
  year={2025},
  eprint={2511.00839},
  archivePrefix={arXiv},
  primaryClass={cs.SE},
  url={https://arxiv.org/abs/2511.00839},
}

📕 Our Other Projects

SWE-bench, SWE-agent, Mini-SWE-Agent, SWE-ReX, SWE-smith