SWE-bench Leaderboards

SWE-bench Verified is a human-filtered subset of 500 instances; use the Agent dropdown to compare LMs with mini-SWE-agent or view all agents [Post].
SWE-bench Multilingual features 300 tasks across 9 programming languages [Post].
SWE-bench Lite is a subset curated for less costly evaluation [Post].
SWE-bench Multimodal features issues with visual elements [Post].

Each entry reports the % Resolved metric: the percentage of instances solved, out of 2294 for Full, 500 for Verified, 300 each for Lite and Multilingual, and 517 for Multimodal.
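As a rough illustration of how the % Resolved figure relates to the split sizes listed above, here is a minimal Python sketch. The names `SPLIT_SIZES` and `percent_resolved` are hypothetical and not part of any SWE-bench tooling, and the resolved count in the usage line is made up purely for the arithmetic.

```python
# Instance counts per split, taken from the text above.
SPLIT_SIZES = {
    "Full": 2294,
    "Verified": 500,
    "Lite": 300,
    "Multilingual": 300,
    "Multimodal": 517,
}


def percent_resolved(resolved: int, split: str) -> float:
    """Return the percentage of instances resolved in the given split.

    Hypothetical helper for illustration only.
    """
    total = SPLIT_SIZES[split]
    if not 0 <= resolved <= total:
        raise ValueError(f"resolved must be between 0 and {total} for {split}")
    return 100.0 * resolved / total


# Illustrative count, not a real leaderboard result.
print(f"{percent_resolved(325, 'Verified'):.1f}%")  # prints "65.0%"
```

Because the denominators differ by split, scores are comparable only within the same split: the same number of resolved instances gives a much lower percentage on Full than on Lite.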

Acknowledgements

We thank the following institutions for their generous support: Open Philanthropy, AWS, Modal, Andreessen Horowitz, OpenAI, and Anthropic.