A new initiative for developing third-party model evaluations (original) (raw)

A computer chip inside a flame

A robust, third-party evaluation ecosystem is essential for assessing AI capabilities and risks, but the current evaluations landscape is limited. Developing high-quality, safety-relevant evaluations remains challenging, and the demand is outpacing the supply. To address this, today we're introducing a new initiative to fund evaluations developed by third-party organizations that can effectively measure advanced capabilities in AI models. Our investment in these evaluations is intended to elevate the entire field of AI safety, providing valuable tools that benefit the whole ecosystem.

In this post, we describe our initiative to source new evaluations for measuring advanced model capabilities and outline our motivations and the specific types of evaluations we're prioritizing.

If you have a proposal,apply through our application form.

Our highest priority focus areas

We are interested in sourcing three key areas of evaluation development, which we'll describe further in the post:

  1. AI Safety Level assessments
  2. Advanced capability and safety metrics
  3. Infrastructure, tools, and methods for developing evaluations

AI Safety Level assessments

We're seeking evaluations that help us measure the AI Safety Levels (ASLs) defined in our Responsible Scaling Policy. These levels determine the safety and security requirements for models with specific capabilities. Robust ASL evaluations are crucial for ensuring we develop and deploy our models responsibly. This category includes:

Advanced capability and safety metrics

Beyond our ASL assessments, we want to develop evaluations that assess advanced model capabilities and relevant safety criteria. These metrics will provide a more comprehensive understanding of our models' strengths and potential risks. This category includes:

Infrastructure, tools, and methods for developing evaluations

We're interested in funding tools and infrastructure that streamline the development of high-quality evaluations. These will be critical to achieve more efficient and effective testing across the AI community. This category includes:

Principles of good evaluations

Developing great evaluations is hard. Even some of the most experienced developers fall into common traps, and even the best evaluations are not always indicative of risks they purport to measure. Below we list some of the characteristics of good evaluations that we’ve learned through trial-and-error:

1. Sufficiently difficult: Evaluations should be relevant for measuring the capabilities listed for levels ASL-3 or ASL-4 in our Responsible Scaling Policy, and/or human-expert level behavior.

2. Not in the training data: Too often, evaluations end up measuring model memorization because the data is in its training set. Where possible and useful, make sure the model hasn’t seen the evaluation. This helps indicate that the evaluation is capturing behavior that generalizes beyond the training data.

3. Efficient, scalable, ready-to-use: Evaluations should be optimized for efficient execution, leveraging automation where possible. They should be easily deployable using existing infrastructure with minimal setup.

4. High volume where possible: All else equal, evaluations with 1,000 or 10,000 tasks or questions are preferable to those with 100. However, high-quality, low-volume evaluations are also valuable.

  1. Domain expertise: If the evaluation is about expert performance on a particular subject matter (e.g. science), make sure to use subject matter experts to develop or review the evaluation.

6. Diversity of formats: Consider using formats that go beyond multiple choice, such as task-based evaluations (for example, seeing if code passes a test or a flag is captured in a CTF), model-graded evaluations, or human trials.

7. Expert baselines for comparison: It is often useful to compare the model’s performance to the performance of human experts on that domain.

  1. Good documentation and reproducibility: We recommend documenting exactly how the evaluation was developed and any limitations or pitfalls it is likely to have. Use standards like the Inspect or the METR standard where possible.

9. Start small, iterate, and scale: Start by writing just one to five questions or tasks, run a model on the evaluation, and read the model transcripts. Frequently, you’ll realize the evaluation doesn’t capture what you want to test, or it’s too easy.

  1. Realistic, safety-relevant threat modeling: Safety evaluations should ideally have the property that if a model scored highly, experts would believe that a major incident could be caused. Most of the time, when models have performed highly, experts have realized that high performance on that version of the evaluation is not sufficient to worry them.

How to submit a proposal

You can submit a proposal on our application form. For any questions, please reach out to eval-initiative@anthropic.com.

Our team will review submissions on a rolling basis and follow up with selected proposals to discuss next steps. We offer a range of funding options tailored to the needs and stage of each project.

Our experience has shown that refining an evaluation typically requires several iterations. You will have the opportunity to interact directly with our domain experts from the Frontier Red Team, Finetuning, Trust & Safety, and other relevant teams. Our teams can provide guidance to help shape your evaluations for maximum impact.

We hope this initiative serves as a catalyst for progress towards a future where comprehensive AI evaluation is an industry standard. We invite you to join us in this important work and help shape the path forward.

Statement on the US government directive to suspend access to Fable 5 and Mythos 5

The US government has issued an export control directive to suspend all access to Fable 5 and Mythos 5.

Read more

Results from the first Anthropic Public Record

Read more

TCS and Anthropic partner to bring Claude to regulated industries

We’re announcing a partnership with Tata Consultancy Services (TCS). TCS will provide Claude to 50,000 of its own employees across 56 countries; build Claude-powered products for clients in financial services, healthcare, the public sector, and other regulated industries; and join the Claude Partner Network.

Read more