Alignment Research Center
The Alignment Research Center (ARC) is a non-profit research organization whose mission is to align future machine learning systems with human interests.
Our current research focus is developing a theoretical foundation for mechanistic explanations of neural network behavior.
Recent research
Over the past 15 months or so, ARC's technical agenda has developed quite a bit. The advent of the Matching Sampling Principle (MSP), and ideas like it, has begotten a host…
ARC has teamed up with AIcrowd to launch the ARC White-Box Estimation Challenge, a contest to improve upon our estimation algorithms for random MLPs. The warm-up round begins this week, and…
This post covers joint work with Wilson Wu, George Robinson, Mike Winer, Victor Lecomte and Paul Christiano. Thanks to Geoffrey Irving and Jess Riedel for comments on the post. In ARC's…
In 2025, ARC has been making conceptual and theoretical progress at the fastest pace that I've seen since I first interned in 2022. Most of this progress has come about because…
About ARC
What is “alignment”? ML systems can exhibit goal-directed behavior, but it is difficult to understand or control what they are “trying” to do. Powerful models could cause harm if they were trying to manipulate and deceive humans. The goal of intent alignment is to instead train these models to be helpful and honest.
Motivation: We expect that new techniques will be needed to align AI systems as they surpass human capabilities. As AI progress accelerates, it may become increasingly difficult to adapt quickly enough to keep up with the changing technology. We would be better prepared if we had methods that could be safely scaled over many orders of magnitude, in the same way that generative pretraining and reinforcement learning have been scaled up dramatically over the last decade.
What we’re working on: We're designing algorithms that predict neural network behavior by mechanistically analyzing a network’s weights rather than running it on a large set of samples. Our main focus is building methods that are more computationally efficient than sampling while using only mechanistic analysis; we believe these methods will more gracefully handle critical issues like predicting out-of-distribution performance and detecting anomalies.
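To make the contrast between weight-based analysis and sampling concrete, here is a toy sketch. It is not ARC's algorithm. It assumes a single ReLU neuron with standard Gaussian inputs, a case where the expected output has a closed form in terms of the weights. The function names `expected_relu_analytic` and `expected_relu_sampled` are hypothetical and chosen only for illustration.

```python
import math
import random


def expected_relu_analytic(w, b):
    """E[ReLU(w.x + b)] for x ~ N(0, I), computed from the weights alone.

    Assumption: inputs are standard Gaussian, so the pre-activation is
    N(mu, sigma^2) with mu = b and sigma = ||w||.
    """
    mu = b
    sigma = math.sqrt(sum(wi * wi for wi in w))
    if sigma == 0.0:
        return max(mu, 0.0)
    t = mu / sigma
    cdf = 0.5 * (1.0 + math.erf(t / math.sqrt(2.0)))        # Phi(t)
    pdf = math.exp(-0.5 * t * t) / math.sqrt(2.0 * math.pi)  # phi(t)
    # Mean of a rectified Gaussian: mu * Phi(mu/sigma) + sigma * phi(mu/sigma)
    return mu * cdf + sigma * pdf


def expected_relu_sampled(w, b, n_samples, rng):
    """Monte Carlo estimate of the same quantity by running the neuron."""
    total = 0.0
    for _ in range(n_samples):
        z = b + sum(wi * rng.gauss(0.0, 1.0) for wi in w)
        total += max(z, 0.0)
    return total / n_samples


if __name__ == "__main__":
    rng = random.Random(0)
    w, b = [0.5, -1.2, 0.3], 0.1
    print("analytic:", expected_relu_analytic(w, b))
    print("sampled: ", expected_relu_sampled(w, b, 20000, rng))
```

The analytic estimate costs one pass over the weights, while the sampled estimate has error that shrinks only as the square root of the number of samples. Real networks do not admit such a simple closed form, which is part of what makes the general problem hard.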
Looking for ARC Evals? See METR.