Alignment Science Blog

Petri 2.0: New Scenarios, New Model Comparisons, and Improved Eval-Awareness Mitigations

We've updated Petri, our automated behavioral auditing tool, with stronger realism mitigations to counter eval-awareness, an expanded seed library with 70 new scenarios, and evaluation results for more recent frontier models.

Stress-testing model specs reveals character differences among language models

We generated 300,000+ queries testing value trade-offs in AI models from Anthropic, OpenAI, Google DeepMind, and xAI. Each model showed distinct value prioritization patterns, and we found thousands of cases of direct contradictions or interpretive ambiguities in model specifications.

Auditing Language Models for Hidden Objectives

Marks,* Treutlein,* et al., 2025

We deliberately train a language model with a hidden objective and use it as a testbed for studying alignment audits.

Constitutional Classifiers: Defending against Universal Jailbreaks across Thousands of Hours of Red Teaming

Sharma,* Tong,* Mu,* Wei,* Kruthoff,* Goodfriend,* Ong,* Peng et al., 2025

We built a system of constitutional classifiers to prevent jailbreaks. A prototype version of our system withstood over 3,000 hours of expert red teaming with no universal jailbreaks found. Newer versions of our system also have minimal over-refusals and moderate run-time overhead.

Forecasting Rare Language Model Behaviors

Jones,* Tong,* et al., 2025

We forecast whether risks will occur after a model is deployed, using even very limited sets of test data.

Sabotage Evaluations for Frontier Models

Benton et al., 2024

How well could AI models mislead us, or secretly sabotage tasks, if they were trying to? We describe a novel set of evaluations that test a model's capacity for sabotage.

Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training

Hubinger et al., 2024

We train LLMs to act in secretly malicious ways. We find that, despite our best efforts at alignment training, deception still slips through.

Measuring Faithfulness in Chain-of-Thought Reasoning

Lanham et al., 2023

We investigate hypotheses for how language models' Chain-of-Thought (CoT) reasoning may be unfaithful, by examining how the model predictions change when we intervene on the CoT (e.g., by adding mistakes or paraphrasing it).

Question Decomposition Improves the Faithfulness of Model-Generated Reasoning

Radhakrishnan et al., 2023

To improve the faithfulness of Chain-of-Thought (CoT) reasoning, we have models generate their CoT by decomposing questions into subquestions.

Discovering Language Model Behaviors with Model-Written Evaluations

Perez et al., 2022

We develop an automated way to generate language model (LM) evaluations with LMs, significantly reducing the effort involved. We test LMs using >150 LM-written evaluations, uncovering novel LM behaviors.

Constitutional AI: Harmlessness from AI Feedback

Bai et al., 2022

We introduce Constitutional AI, allowing us to give language models explicit values determined by a constitution, rather than values determined implicitly via large-scale human feedback.