Alignment Science Blog
Petri 2.0: New Scenarios, New Model Comparisons, and Improved Eval-Awareness Mitigations
We've updated Petri, our automated behavioral auditing tool, with better realism mitigations to counter eval-awareness, an expanded seed library with 70 new scenarios, and evaluation results for more recent frontier models.
Stress-testing model specs reveals character differences among language models
We generated 300,000+ queries testing value trade-offs in AI models from Anthropic, OpenAI, Google DeepMind, and xAI. Each model showed distinct value prioritization patterns, and we found thousands of cases of direct contradictions or interpretive ambiguities in model specifications.
Auditing Language Models for Hidden Objectives
Marks,* Treutlein,* et al., 2025
We deliberately train a language model with a hidden objective and use it as a testbed for studying alignment audits.
Constitutional Classifiers: Defending against Universal Jailbreaks across Thousands of Hours of Red Teaming
Sharma,* Tong,* Mu,* Wei,* Kruthoff,* Goodfriend,* Ong,* Peng et al., 2025
We built a system of constitutional classifiers to prevent jailbreaks. A prototype version of our system withstood over 3,000 hours of expert red teaming with no universal jailbreaks found. Newer versions of our system also have minimal over-refusals and moderate run-time overhead.
Forecasting Rare Language Model Behaviors
Jones,* Tong,* et al., 2025
We forecast whether risks will occur after a model is deployed, using even very limited sets of test data.
Sabotage Evaluations for Frontier Models
Benton et al., 2024
How well could AI models mislead us, or secretly sabotage tasks, if they were trying to? We describe a novel set of evaluations that test a model's capacity for sabotage.
Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training
Hubinger et al., 2024
We train LLMs to act in secretly malicious ways. We find that, despite our best efforts at alignment training, deception still slips through.
Measuring Faithfulness in Chain-of-Thought Reasoning
Lanham et al., 2023
We investigate hypotheses for how language models' Chain-of-Thought (CoT) reasoning may be unfaithful, by examining how the model predictions change when we intervene on the CoT (e.g., by adding mistakes or paraphrasing it).
Question Decomposition Improves the Faithfulness of Model-Generated Reasoning
Radhakrishnan et al., 2023
To improve the faithfulness of Chain-of-Thought (CoT) reasoning, we have models generate their CoT by decomposing questions into subquestions.
Discovering Language Model Behaviors with Model-Written Evaluations
Perez et al., 2022
We develop an automated way to generate language model (LM) evaluations with LMs, significantly reducing the effort involved. We test LMs using >150 LM-written evaluations, uncovering novel LM behaviors.
Constitutional AI: Harmlessness from AI Feedback
Bai et al., 2022
We introduce Constitutional AI, allowing us to give language models explicit values determined by a constitution, rather than values determined implicitly via large-scale human feedback.