Tejal Patwardhan (@tejalpatwardhan) on X
Excited to open-source PaperBench, our latest frontier eval to measure AI research ability! Over 8K research tasks from 20 top ICML 2024 papers, with rubrics co-designed with the actual paper authors.
We’re releasing PaperBench, a benchmark evaluating the ability of AI agents to replicate state-of-the-art AI research, as part of our Preparedness Framework. Agents must replicate top ICML 2024 papers, including understanding the paper, writing code, and executing experiments.

