Redwood Research

Pioneering threat assessment and mitigation for AI systems

Redwood Research is a nonprofit AI safety and security research organization.

In the coming years or decades, AI systems will very plausibly match or exceed human capabilities across most intellectual tasks, fundamentally transforming society. Our research specifically addresses the risks that could arise if these powerful AI systems purposefully act against the interests of their developers and human institutions broadly.

We work to better understand these risks, and to develop methodologies that will allow us to manage them while still realizing the benefits of AI.

Our Focus Areas

Research Area

Evaluations and demonstrations of risk from strategic deception

In Alignment Faking in Large Language Models, we (in collaboration with Anthropic) demonstrated that Claude sometimes hides misaligned intentions. This work is the strongest concrete evidence that LLMs might naturally fake alignment in order to resist attempts to train them.

Applied Work

Consulting on risks from misalignment

We collaborate with governments and advise AI companies, including Google DeepMind and Anthropic, on practices for assessing and mitigating risks from misaligned AI agents. For example, we partnered with UK AISI to produce A sketch of an AI control safety case, which describes how developers can construct a structured argument that models are incapable of subverting control measures.

Highlighted Research

Our most impactful work on AI safety and security
