AI Safety Needs Social Scientists
The goal of long-term artificial intelligence (AI) safety is to ensure that advanced AI systems are reliably aligned with human values — that they reliably do things that people want them to do. Roughly, by human values we mean whatever it is that causes people to choose one option over another in each case, suitably corrected by reflection, with differences between groups of people taken into account. There are a lot of subtleties in this notion, some of which we will discuss in later sections and others of which are beyond the scope of this paper. Since it is difficult to write down precise rules describing human values, one approach is to treat aligning with human values as another learning problem. We ask humans a large number of questions about what they want, train an ML model of their values, and optimize the AI system to do well according to the learned values.
If humans reliably and accurately answered all questions about their values, the only uncertainties in this scheme would be on the machine learning (ML) side. If the ML works, our model of human values would improve as data is gathered, and broaden to cover all the decisions relevant to our AI system as it learns. Unfortunately, humans have limited knowledge and reasoning ability, and exhibit a variety of cognitive and ethical biases. If we learn values by asking humans questions, we expect different ways of asking questions to interact with human biases in different ways, producing higher or lower quality answers. Direct questions about preferences (“Do you prefer A or B?”) may be less accurate than questions which target the reasoning behind these preferences (“Do you prefer A or B in light of argument S?”). Different people may vary significantly in their ability to answer questions well, and disagreements will persist across people even setting aside answer quality. Although we have candidates for ML methods which try to learn from human reasoning, we do not know how they behave with real people in realistic situations.
We believe the AI safety community needs to invest research effort in the human side of AI alignment. Many of the uncertainties involved are empirical, and can only be answered by experiment. They relate to the psychology of human rationality, emotion, and biases. Critically, we believe investigations into how people interact with AI alignment algorithms should not be held back by the limitations of existing machine learning. Current AI safety research is often limited to simple tasks in video games, robotics, or gridworlds, but problems on the human side may only appear in more realistic scenarios such as natural language discussion of value-laden questions. This is particularly important since many aspects of AI alignment change as ML systems increase in capability.
To avoid the limitations of ML, we can instead conduct experiments consisting entirely of people, replacing ML agents with people playing the role of those agents. This is a variant of the “Wizard of Oz” technique from the human-computer interaction (HCI) community, though in our case the replacements will not be secret. These experiments will be motivated by ML algorithms but will not involve any ML systems or require an ML background. In all cases, they will require careful experimental design to build constructively on existing knowledge about how humans think. Most AI safety researchers are focused on machine learning, which we do not believe is sufficient background to carry out these experiments. To fill the gap, we need social scientists with experience in human cognition, behavior, and ethics, and in the careful design of rigorous experiments. Since the questions we need to answer are interdisciplinary and somewhat unusual relative to existing research, we believe many fields of social science are applicable, including experimental psychology, cognitive science, economics, political science, and social psychology, as well as adjacent fields like neuroscience and law.
This paper is a call for social scientists in AI safety. We believe close collaborations between social scientists and ML researchers will be necessary to improve our understanding of the human side of AI alignment, and hope this paper sparks both conversation and collaboration. We do not claim novelty: previous work mixing AI safety and social science includes the Factored Cognition project at Ought, accounting for hyperbolic discounting and suboptimal planning when learning human preferences, and comparing different methods of gathering demonstrations from fallible human supervisors. Other areas mixing ML and social science include computational social science and fairness. Our main goal is to enlarge these collaborations and emphasize their importance to long-term AI safety, particularly for tasks which current ML cannot reach.
An overview of AI alignment
Before discussing how social scientists can help with AI safety and the AI alignment problem, we provide some background. We do not attempt to be exhaustive: the goal is to provide sufficient background for the remaining sections on social science experiments. Throughout, we will speak primarily about aligning to the values of an individual human rather than a group: this is because the problem is already hard for a single person, not because the group case is unimportant.
AI alignment (or value alignment) is the task of ensuring that artificial intelligence systems reliably do what humans want. We distinguish between training AI systems to identify actions that humans consider good and training AI systems to identify actions that are “good” in some objective and universal sense, even if most current humans do not consider them so. Whether there are actions that are good in this latter sense is a subject of debate. Regardless of what position one takes on this philosophical question, this sense of good is not yet available as a target for AI training. Here we focus on the machine learning approach to AI: gathering a large amount of data about what a system should do and using learning algorithms to infer patterns from that data that generalize to other situations. Since we are trying to behave in accord with people’s values, the most important data will be data from humans about their values. Within this frame, the AI alignment problem breaks down into a few interrelated subproblems:
- Have a satisfactory definition of human values.
- Gather data about human values, in a manner compatible with the definition.
- Find reliable ML algorithms that can learn and generalize from this data.
We have significant uncertainty about all three of these problems. We will leave the third problem to other ML papers and focus on the first two, which concern uncertainties about people.
Learning values by asking humans questions
We start with the premise that human values are too complex to describe with simple rules. By “human values” we mean our full set of detailed preferences, not general goals such as “happiness” or “loyalty”. One source of complexity is that values are entangled with a large number of facts about the world, and we cannot cleanly separate facts from values when building ML models. For example, a rule that refers to “gender” would require an ML model that accurately recognizes this concept, but Buolamwini and Gebru found that several commercial gender classifiers with a 1% error rate on white men failed to recognize black women up to 34% of the time. Even where people have correct intuition about values, we may be unable to specify precise rules behind these intuitions. Finally, our values may vary across cultures, legal systems, or situations: no learned model of human values will be universally applicable.
If humans can’t reliably report the reasoning behind their intuitions about values, perhaps they can still make reliable value judgements in specific cases. To realize this approach in an ML context, we ask humans a large number of questions about whether an action or outcome is better or worse, then train on this data. “Better or worse” will include both factual and value-laden components: for an AI system trained to say things, “better” statements might include “rain falls from clouds”, “rain is good for plants”, “many people dislike rain”, etc. If the training works, the resulting ML system will be able to replicate human judgement about particular situations, and thus have the same “fuzzy access to approximate rules” about values as humans. We also train the ML system to come up with proposed actions, so that it knows both how to perform a task and how to judge its performance. This approach works at least in simple cases, such as Atari games, simple robotics tasks, and language-specified goals in gridworlds. The questions we ask change as the system learns to perform different types of actions, which is necessary as the model of what is better or worse will only be accurate if we have applicable data to generalize from.
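To make this question-asking loop concrete, here is a minimal sketch of learning a reward model from pairwise human comparisons, in the spirit of the Atari and robotics work mentioned above. The simulated annotator, the linear reward model, and all hyperparameters are illustrative assumptions rather than details of that work.

```python
import numpy as np

rng = np.random.default_rng(0)

# Each "outcome" is summarized by a feature vector; a human is shown pairs of
# outcomes and says which one they prefer. The features, the linear reward
# model, and the simulated annotator are all illustrative assumptions.
n_features, n_questions = 8, 2000
true_w = rng.normal(size=n_features)             # hidden "human values", used only to simulate answers

xa = rng.normal(size=(n_questions, n_features))  # option A for each question
xb = rng.normal(size=(n_questions, n_features))  # option B for each question
d = xa - xb

# Simulated human answers: a noisy (Bradley-Terry style) preference for the higher-value option.
p_prefer_a = 1.0 / (1.0 + np.exp(-d @ true_w))
answers = (rng.random(n_questions) < p_prefer_a).astype(float)

# Fit a reward model r(x) = w . x by gradient ascent on the log-likelihood of the answers.
w = np.zeros(n_features)
for _ in range(500):
    p = 1.0 / (1.0 + np.exp(-d @ w))
    w += 0.5 * d.T @ (answers - p) / n_questions

# The learned model can now rank new outcomes on the human's behalf.
print("correlation with simulated values:", np.corrcoef(w, true_w)[0, 1])
```

In the real setting the reward model would be a neural network over raw observations, the comparisons would come from actual people, and the learned reward would be used to train the agent that proposes actions.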
In practice, data in the form of interactive human questions may be quite limited, since people are slow and expensive relative to computers on many tasks. Therefore, we can augment the “train from human questions” approach with static data from other sources, such as books or the internet. Ideally, the static data can be treated only as information about the world devoid of normative content: we can use it to learn patterns about the world, but the human data is needed to distinguish good patterns from bad.
Definitions of alignment: reasoning and reflective equilibrium
So far we have discussed asking humans direct questions about whether something is better or worse. Unfortunately, we do not expect people to provide reliably correct answers in all cases, for several reasons:
- Cognitive and ethical biases: Humans exhibit a variety of biases which interfere with reasoning, including cognitive biases and ethical biases such as in-group bias. In general, we expect direct answers to questions to reflect primarily Type 1 thinking (fast, heuristic judgment), while we would like to target a combination of Type 1 and Type 2 thinking (the latter being slow, deliberative judgment).
- Lack of domain knowledge: We may be interested in questions that require domain knowledge unavailable to people answering the questions. For example, a correct answer to whether a particular injury constitutes medical malpractice may require detailed knowledge of medicine and law. In some cases, a question might require so many areas of specialized expertise that no one person is sufficient, or (if AI is sufficiently advanced) deeper expertise than any human possesses.
- Limited cognitive capacity: Some questions may require too much computation for a human to reasonably evaluate, especially in a short period of time. This includes synthetic tasks such as chess and Go (where AIs already surpass human ability), or large real world tasks such as “design the best transit system”.
- “Correctness” may be local: For questions involving a community of people, “correct” may be a function of complex processes or systems. For example, in a trust game the correct action for a trustee in one community may be to return at least half of the money handed over by the investor. The “correctness” of this answer could be determined by asking a group of participants in a previous game “how much should the trustee return to the investor?”, but not by asking them “how much do most trustees return?”. The answer may be different in other communities or cultures.
In these cases, a human may be unable to provide the right answer, but we still believe the right answer exists as a meaningful concept. Humans have many cognitive biases: imagine we point out these biases in a way that helps the human avoid them. Imagine the human has access to all the knowledge in the world, and is able to think for an arbitrarily long time. We could define alignment as “the answer they give then, after these limitations have been removed”; in philosophy this is known as “reflective equilibrium”. We discuss a particular algorithm that tries to approximate it in the next section.
However, the behavior of reflective equilibrium with actual humans is subtle; as Sugden states, a human is not “a neoclassically rational entity encased in, and able to interact with the world only through, an error-prone psychological shell.” Our actual moral judgments are made via a messy combination of many different brain areas, where reasoning plays a “restricted but significant role”. A reliable solution to the alignment problem that uses human judgment as input will need to engage with this complexity, and ask how specific alignment techniques interact with actual humans.
Disagreements, uncertainty, and inaction: a hopeful note
A solution to alignment does not mean knowing the answer to every question. Even at reflective equilibrium, we expect disagreements will persist about which actions are good or bad, across both different individuals and different cultures. Since we lack perfect knowledge about the world, reflective equilibrium will not eliminate uncertainty about either future predictions or values, and any real ML system will be at best an approximation of reflective equilibrium. In these cases, we consider an AI aligned if it recognizes what it does not know and chooses actions which work however that uncertainty plays out.
Admitting uncertainty is not always enough. If our brakes fail while driving a car, we may be uncertain whether to dodge left or right around an obstacle, but we have to pick one — and fast. For long-term safety, however, we believe a safe fallback usually exists: inaction. If an ML system recognizes that a question hinges on disagreements between people, it can either choose an action which is reasonable regardless of the disagreement or fall back to further human deliberation. If we are about to make a decision that might be catastrophic, we can delay and gather more data. Inaction or indecision may not be optimal, but it is hopefully safe, and matches the default scenario of not having any powerful AI system.
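A minimal sketch of this “act only where uncertainty doesn’t matter, otherwise defer” idea, with the value-model ensemble, the acceptability threshold, and the action set all invented for illustration:

```python
def choose_action(actions, value_models, acceptable=0.0):
    """Pick an action that is acceptable under every value model, else defer.

    actions      -- list of candidate actions
    value_models -- list of functions action -> estimated value, representing
                    disagreement between people or residual model uncertainty
    acceptable   -- minimum worst-case value we are willing to act on (assumed)
    """
    best_action, best_worst_case = None, float("-inf")
    for a in actions:
        worst_case = min(m(a) for m in value_models)  # value under the least favorable model
        if worst_case > best_worst_case:
            best_action, best_worst_case = a, worst_case
    if best_worst_case >= acceptable:
        return best_action
    return "defer-to-humans"  # inaction / further deliberation as the safe fallback

# Toy usage: two "people" disagree sharply about action B, agree that A is mildly good.
models = [lambda a: {"A": 0.2, "B": 1.0}[a], lambda a: {"A": 0.3, "B": -2.0}[a]]
print(choose_action(["A", "B"], models))  # -> "A": reasonable under both models
```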
Alignment gets harder as ML systems get smarter
Alignment is already a problem for present-day AI, due to biases reflected in training data and mismatch between human values and easily available data sources (such as training news feeds based on clicks and likes instead of deliberate human preferences). However, we expect the alignment problem to get harder as AI systems grow more advanced, for two reasons. First, advanced systems will apply to increasingly consequential tasks: hiring, medicine, scientific analysis, public policy, etc. Besides raising the stakes, these tasks require more reasoning, leading to more complex alignment algorithms.
Second, advanced systems may be capable of answers that sound plausible but are wrong in nonobvious ways, even if an AI is better than humans only in a limited domain (examples of which already exist). This type of misleading behavior is not the same as intentional deception: an AI system trained from human data might have no notion of truth separate from what answers humans say are best. Ideally, we want AI alignment algorithms to reveal misleading behavior as part of the training process, surfacing failures to humans and helping us provide more accurate data. As with human-to-human deception, misleading behavior might take advantage of our biases in complicated ways, such as learning to express policy arguments in coded racial language to sound more convincing.
Debate: learning human reasoning
Before we discuss social science experiments for AI alignment in detail, we need to describe a particular method for AI alignment. Although the need for social science experiments applies even to direct questioning, this need intensifies for methods which try to get at reasoning and reflective equilibrium. As discussed above, it is unclear whether reflective equilibrium is a well defined concept when applied to humans, and at a minimum we expect it to interact with cognitive and ethical biases in complex ways. Thus, for the remainder of this paper we focus on a specific proposal for learning reasoning-oriented alignment, called debate. Alternatives to debate include iterated amplification and recursive reward modeling; we pick just one in the interest of depth over breadth.
We describe the debate approach to AI alignment in the question answering setting. Given a question, we have two AI agents engage in a debate about the correct answer, then show the transcript of the debate to a human to judge. The judge decides which debater gave the most true, useful information, and declares that debater the winner. We can also allow ties. Indeed, if telling the truth is the winning strategy, ties will be common with strong play, as disagreeing with a true statement would lose. This defines a two player zero sum game between the debaters, where the goal is to convince the human that one’s answer is correct. Arguments in a debate can consist of anything: reasons for an answer, rebuttals of reasons for the alternate answer, subtleties the judge might miss, or pointing out biases which might mislead the judge. Once we have defined this game, we can train AI systems to play it similarly to how we train AIs to play other games such as Go or Dota 2. Our hope is that the following hypothesis holds:
Hypothesis: Optimal play in the debate game (giving the argument most convincing to a human) results in true, useful answers to questions.
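The game itself is simple to write down; the hard questions are all about the judge. The sketch below fixes an arbitrary turn structure and judge interface, which are assumptions for illustration rather than part of the debate proposal:

```python
from typing import Callable, List, Tuple

# A debater maps (question, transcript so far) to its next statement.
Debater = Callable[[str, List[str]], str]
# A judge maps (question, the two answers, full transcript) to 0 or 1, the winner's index.
Judge = Callable[[str, Tuple[str, str], List[str]], int]

def run_debate(question: str, answers: Tuple[str, str],
               debaters: Tuple[Debater, Debater], judge: Judge,
               rounds: int = 3) -> int:
    """Play one debate and return the index (0 or 1) of the winning debater.

    The game is zero sum: one debater's win is the other's loss, so training
    debaters to win is analogous to self-play training in Go or Dota 2.
    """
    transcript: List[str] = []
    for _ in range(rounds):
        for i, debater in enumerate(debaters):
            statement = debater(question, transcript)
            transcript.append(f"Debater {i}: {statement}")
    return judge(question, answers, transcript)
```

Training then amounts to optimizing both debaters to win this zero sum game, for example via self-play, while the judge (here just a function) is the part we propose to study with human participants.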
An example of debate
Imagine we’re building a personal assistant that helps people decide where to go on vacation. The assistant has knowledge of people’s values, and is trained via debate to come up with convincing arguments that back up vacation decisions. As the human judge, you know what destinations you intuitively think are better, but have limited knowledge about the wide variety of possible vacation destinations and their advantages and disadvantages. A debate about the question “Where should I go on vacation?” might open as follows:
- Where should I go on vacation?
- Red: Alaska.
- Blue: Bali.
If you are able to reliably decide between these two destinations, we could end here. Unfortunately, Bali has a hidden flaw:
- Red: Bali is out since your passport won’t arrive in time.
At this point it looks like Red wins, but Blue has one more countermove:
- Blue: Expedited passport service only takes two weeks.
Here Red fails to think of additional points, and loses to Blue and Bali. Note that a debate does not need to cover all possible arguments. There are many other ways the debate could have gone, such as:
- Red: Alaska.
- Blue: Bali.
- Red: Bali is way too hot.
- Blue: You prefer too hot to too cold.
- Red: Alaska is pleasantly warm in the summer.
- Blue: It's January.
This debate is also a loss for Red (arguably a worse loss). Say we believe Red is very good at debate, and is able to predict in advance which debates are more likely to win. If we see only the first debate about passports and decide in favor of Bali, we can take that as evidence that any other debate would have also gone for Bali, and thus that Bali is the correct answer. A larger portion of this hypothetical debate tree is shown below:
Figure 1: A hypothetical partial debate tree for the question “Where should I go on vacation?” A single debate would explore only one of these paths, but a single path chosen by good debaters is evidence that other paths would not change the result of the game.
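The claim that one well-played path is evidence about the whole tree is the standard game-tree argument: under optimal play, the value of the root equals the value along the line the debaters actually choose. A small minimax sketch over a hand-labeled toy tree (the labels are invented) makes this concrete:

```python
# Each node is either a judge verdict at a leaf (+1 if Bali survives the line of
# argument, -1 if it is refuted) or a dict of possible replies. The tree and its
# labels are made up for illustration.
def minimax(node, maximizing=True):
    if not isinstance(node, dict):  # leaf: judge's verdict on this line of argument
        return node
    values = [minimax(child, not maximizing) for child in node.values()]
    return max(values) if maximizing else min(values)

bali_tree = {
    "passport won't arrive in time": {        # objection raised against Bali
        "use expedited passport service": +1,  # judge accepts the rebuttal
    },
    "Bali is way too hot": {
        "you prefer too hot to too cold": +1,
    },
}

# If every objection has a rebuttal the judge accepts, Bali wins the whole tree,
# even though any single debate walks down only one branch.
print(minimax(bali_tree, maximizing=False))  # objector moves first here -> 1
```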
If trained debaters are bad at predicting which debates will win, answer quality will degrade since debaters will be unable to think of important arguments and counterarguments. However, as long as the two sides are reasonably well matched, we can hope that at least the results are not malicious: that misleading behavior is still a losing strategy. Let’s set aside the ability of the debaters for now, and turn to the ability of the judge.
Are people good enough as judges?
“In fact, almost everything written at a practical level about the Turing test is about how to make good bots, with a small remaining fraction about how to be a good judge.”
Brian Christian, The Most Human Human
As with learning by asking humans direct questions, whether debate produces aligned behavior depends on the reasoning abilities of the human judge. Unlike direct questioning, debate has the potential to give correct answers beyond what the judge could provide without assistance. This is because a sufficiently strong judge could follow along with arguments the judge could not come up with on their own, checking complex reasoning for both self consistency and consistency with human-checkable facts. A judge who is biased but willing to adjust once those biases are revealed could result in unbiased debates, or a judge who is able to check facts but does not know where to look could be helped along by honest debaters. If the hypothesis holds, a misleading debater would not be able to counter the points of an honest debater, since the honest points would appear more consistent to the judge.
On the other hand, we can also imagine debate going the other way: amplifying biases and failures of reason. A judge with an ethical bias who is happy to accept statements reinforcing that bias could result in even more biased debates. A judge with too much confirmation bias might happily accept misleading sources of evidence, and be unwilling to accept arguments showing why that evidence is wrong. In this case, an optimal debate agent might be quite malicious, taking advantage of biases and weaknesses in the judge to win with convincing but wrong arguments. The difficulties that cognitive biases, prejudice, and social influence introduce to persuasion ‒ as well as methods for reducing these factors ‒ are being increasingly explored in psychology, communication science, and neuroscience.
In both these cases, debate acts as an amplifier. For strong judges, this amplification is positive, removing biases and simulating extra reasoning abilities for the judge. For weak judges, the biases and weaknesses would themselves be amplified. If this model holds, debate would have threshold behavior: it would work for judges above some threshold of ability and fail below the threshold. The threshold model is only intuition, and could fail for a variety of reasons: the intermediate region could be very large, or the threshold could differ widely per question so that even quite strong judges are insufficient for many questions. Assuming the threshold exists, it is unclear whether people are above or below it. People are capable of general reasoning, but our ability is limited and riddled with cognitive biases. People are capable of advanced ethical sentiment but also full of biases, both conscious and unconscious.
Thus, if debate is the method we use to align an AI, we need to know if people are strong enough as judges. In other words, we need to know whether human judges are sufficiently good at discerning whether a debater is telling the truth. This question depends on many details: the type of questions under consideration, whether judges are trained or not, and restrictions on what debaters can say. We believe experiments will be necessary to determine whether people are sufficient judges, and which form of debate is most truth-seeking.
From superforecasters to superjudges
An analogy with the task of probabilistic forecasting is useful here. Tetlock’s “Good Judgment Project” showed that some amateurs were significantly better at forecasting world events than both their peers and many professional forecasters. These “superforecasters” maintained their prediction accuracy over years (without regression to the mean), were able to make predictions with limited time and information, and seemed to be less prone to cognitive biases than non-superforecasters (p. 234-236). The superforecasting trait was not immutable: it was traceable to particular methods and thought processes, improved with careful practice, and could be amplified if superforecasters were collected into teams. For forecasters in general, brief probabilistic training significantly improved forecasting ability even 1-2 years after the training. We believe a similar research program is possible for debate and other AI alignment algorithms. In the best case, we would be able to find, train, or assemble “superjudges”, and have high confidence that optimal debate with them as judges would produce aligned behavior.
In the forecasting case, much of the research difficulty lay in assembling a large corpus of high quality forecasting questions. Similarly, measuring how good people are as debate judges will not be easy. We would like to apply debate to problems where there is no other source of truth: if we had that source of truth, we would train ML models on it directly. But if there is no source of truth, there is no way to measure whether debate produced the correct answer. This problem can be avoided by starting with simple, verifiable domains, where the experimenters know the answer but the judge would not. “Success” then means that the winning debate argument is telling the externally known truth. The challenge gets harder as we scale up to more complex, value-laden questions, as we discuss in detail later.
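In such verifiable domains, “success” reduces to a simple statistic: how often the judge sides with the debater arguing the externally known truth. A minimal sketch of that measurement (the record format and the normal-approximation interval are assumptions):

```python
import math

def judge_accuracy(debate_records):
    """debate_records: list of (judge_verdict, honest_debater) index pairs.

    Returns the fraction of debates where the judge sided with the debater
    arguing the externally known truth, plus a rough 95% confidence interval.
    """
    n = len(debate_records)
    correct = sum(1 for verdict, honest in debate_records if verdict == honest)
    p = correct / n
    half_width = 1.96 * math.sqrt(p * (1 - p) / n)  # normal approximation
    return p, (max(0.0, p - half_width), min(1.0, p + half_width))

# Toy usage: 80 of 100 hypothetical debates decided for the honest side.
records = [(0, 0)] * 80 + [(1, 0)] * 20
print(judge_accuracy(records))  # ~0.80 with an interval of roughly +/- 0.08
```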
Debate is only one possible approach
As mentioned, debate is not the only scheme trying to learn human reasoning. Debate is a modified version of iterated amplification, which uses humans to break down hard questions into easier questions and trains ML models to be consistent with this decomposition. Recursive reward modeling is a further variant. Inverse reinforcement learning, inverse reward design, and variants try to back out goals from human actions, taking into account limitations and biases that might affect this reasoning. The need to study how humans interact with AI alignment applies to any of these approaches. Some of this work has already begun: Ought’s Factored Cognition project uses teams of humans to decompose questions and reassemble answers, mimicking iterated amplification. We believe knowledge gained about how humans perform with one approach is likely to partially generalize to other approaches: knowledge about how to structure truth-seeking debates could inform how to structure truth-seeking amplification, and vice versa.
Experiments needed for debate
To recap, in debate we have two AI agents engaged in debate, trying to convince a human judge. The debaters are trained only to win the game, and are not motivated by truth separate from the human’s judgments. On the human side, we would like to know whether people are strong enough as judges in debate to make this scheme work, or how to modify debate to fix it if it doesn’t. Unfortunately, actual debates in natural language are well beyond the capabilities of present AI systems, so previous work on debate and similar schemes has been restricted to synthetic or toy tasks.
Rather than waiting for ML to catch up to natural language debate, we propose simulating our eventual setting (two AI debaters and one human judge) with all human debates: two human debaters and one human judge. Since an all human debate doesn’t involve any machine learning, it becomes a pure social science experiment: motivated by ML considerations but not requiring ML expertise to run. This lets us focus on the component of AI alignment uncertainty specific to humans.
Figure 2: Our goal is ML+ML+human debates, but ML is currently too primitive to do many interesting tasks. Therefore, we propose replacing ML debaters with human debaters, learning how to best conduct debates in this human-only setting, and eventually applying what we learn to the ML+ML+human case.
To make human+human+human debate experiments concrete, we must choose who to use as judges and debaters and which tasks to consider. We can also choose to structure the debate in various ways, a choice which partly overlaps with the choice of judge since we can instruct a judge to penalize deviations from a given format. By task we mean the questions our debates will try to resolve, together with any information provided to the debaters or to the judge. Such an experiment would then try to answer the following question:
Question: For a given task and judge, is the winning debate strategy honest?
The “winning strategy” proviso is important: an experiment that picked debaters at random might conclude that honest behavior won, missing the fact that more practiced debaters would learn to successfully lie. We can try to solve this by training debaters, letting them practice against each other, filtering out debaters that win more often, and so on, but we will still be left with uncertainty about whether a better strategy exists. Even assuming we can find or train strong debaters, the choice of task and judge is quite tricky if we want an informative proxy for our eventual AI+AI+human setting. Here are some desiderata constraining our choice of task:
- True answers are known: Although our eventual goal is debates where no source of truth is available, to conduct a useful experiment we must be able to measure success. This means we must know what the correct answer is separate from debate, in order to compare with the results of debate.
- False answers are plausible: If the truth is obvious or no plausible counternarrative exists, honest debates will win for uninteresting and uninformative reasons regardless of judge ability. In particular, the judge shouldn’t know the answer upfront.
- Debaters know more than the judge: Debate can produce interesting results only when the debaters know more than the judge; otherwise asking direct questions is enough.
- Definitive argument longer than debate limit: If one debater can write out a full proof of their answer (ignoring their opponent’s moves), the task won’t be a good test of interactive debate.
- Some checkable facts: There must be some facts which the judge is able to check, either because they can recognize them as true once presented or look them up. It is impossible to usefully debate a question where the judge has nothing to check: consider debating the result of a coin flip shown to the two debaters but not the judge.
- No “tells”: Human tells of deception could result in honest debaters winning for reasons that wouldn’t apply to an AI. These tells include tone of voice, eye contact, or additional time required to construct plausible lies. These tells can be reduced by showing judges completed debate transcripts instead of engaging in interactive debates, but others might remain.
- Available data: We need a large enough pool of questions, judges, and debaters to achieve statistical significance. This is made more difficult because we may have a large number of hypotheses to test, in the form of many variations on debate or interventions to improve judging.
- Known biases (optional): We are particularly interested in debate tasks which test specific types of cognitive or ethical biases. Are judges with some racial or gender bias able to set those biases aside if they are highlighted, or does debate amplify bias? Do debates about statistical or probabilistic questions make it too easy to lie with statistics?
- Realistic tasks (ideally): If possible, we would like to try debate with interesting, real world tasks that reflect the types of questions we would like to apply AI to in the future, including science, mathematics, ethics, etc.
It may not be possible to meet all of these criteria with a single experiment. Several of the criteria are in tension: the first two are essentially “not too hard” and “not too easy”, and any restriction on the types of questions may make it difficult to find large numbers of questions, judges, or debaters. Realistic tasks are much harder than synthetic tasks, which easily fulfill many of the criteria as discussed below. Thus, we may need to begin with synthetic tasks and move up towards realistic tasks over time. We turn next to a few examples of experiments to see how many criteria we can meet simultaneously.
Synthetic experiments: single pixel image debate
As a first prototype of a human+human+human debate experiment, we previously built a website where two debaters argue over the contents of an image. We choose an image of a cat or dog, and show the image to the two debaters but not the judge. One debater is honest and argues for the true contents of the image; the other debater lies. The debaters can talk to the judge and illustrate their points by drawing rectangles on the image, but the judge sees only the rectangles. At the end of the debate, each debater is allowed to reveal a single pixel to the judge, which is the only part of the debate which cannot be a lie.
Figure 3: An example debate with two human debaters and a human judge. Only the debaters can see the image. Red is arguing that the image is a dog, Blue is arguing for cat. Image credit: Wikipedia, CC-BY-SA.
In informal play, the honest debater wins most of the time. Although the judge only gets to check a single pixel, the honest player can try to pin down the liar towards disagreements on smaller and smaller regions of the image until one pixel is enough to reveal the lie. However, it is hard to know whether this is a real result or an artifact that would not carry across to training an AI. There are many ways to ruin the game, such as the judge asking each player to spend 15 seconds naming as many details about the image as possible. Worse, this trick ruins the game in favor of the honest player, who can easily name many true details while a human liar may have difficulty lying fluently at speed. Therefore, we are wary of trying to perform a full version of this experiment without more careful experimental design that builds on the knowledge base of existing social science research.
As a synthetic experiment, single pixel debate satisfies many of our experimental criteria for all human debate. The true answer is obvious to the debaters, and they know more than the judge since only the debaters see the image. Anything is plausible to the judge who sees only one or two pixels over the course of the debate. Since the argument bottoms out at single pixels, the “full argument” would be a huge list of pixels and how they relate, forcing the debaters to pin each other down and focus in on disagreements. The single pixels constitute the checkable facts, and we have an endless supply of questions in the form of images chosen at random. Less fortunately, it is easy for the judge to force “tells” which reveal who is lying, the task has no obvious relation to biases, and is quite unrealistic.
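For concreteness, here is a minimal sketch of the single pixel debate protocol as an interface, which makes the information asymmetry explicit. The message format and judge signature are assumptions; the real prototype was an interactive website rather than this code.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

@dataclass
class Claim:
    debater: int                                           # 0 or 1
    text: str                                              # free-form statement; may be a lie
    rectangle: Optional[Tuple[int, int, int, int]] = None  # judge sees only the box, not the pixels

def single_pixel_debate(image, debaters, judge, rounds=6):
    """Run one debate about an image that only the debaters can see.

    `debaters` is a pair of objects with .claim(image, transcript) -> Claim and
    .reveal_pixel(image, transcript) -> ((x, y), pixel_value). The judge sees the
    transcript and the two revealed pixels, which are the only statements that
    cannot be lies, and returns 0 or 1 for which debater to believe.
    """
    transcript: List[Claim] = []
    for _ in range(rounds):
        for d in debaters:
            transcript.append(d.claim(image, transcript))
    reveals = [d.reveal_pixel(image, transcript) for d in debaters]
    return judge(transcript, reveals)
```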
Realistic experiments: domain expert debate
For a more interesting task, we can find two debaters who are experts in a domain, pick a question in their area of expertise, and use a layperson as the judge. The debaters could be experts in some area of science, law, or ethics, but “domain expertise” could also mean knowledge about hobbies, local customs, sports, or any other subject the judge does not know. We again choose one of the debaters to lie and one to tell the truth. To guarantee a source of truth, we can choose a question with an agreed upon answer, either between the two debaters or more broadly in their field. This is only approximate truth, but is good enough for informative experiments. We also specify what facts the judge can access: a debate about physics might allow the debaters to quote a sentence or paragraph from Wikipedia, perhaps with restrictions on what pages are allowed.
Expert debate satisfies most of our desiderata, and it is likely possible to target specific biases (such as race or gender bias) by picking domain areas that overlap with these biases. It may be quite difficult or expensive to find suitable debaters, but this may be solvable either by throwing resources at the problem (ML is a well funded field), enlarging the kinds of domain expertise considered (soccer, football, cricket), or by making the experiments interesting enough that volunteers are available. However, even if domain experts can be found, there is no guarantee that they will be experts in debate viewed as a game. With the possible exception of law, politics, or philosophy, domain experts may not be trained to construct intentionally misleading but self consistent narratives: they may be experts only in trying to tell the truth.
We’ve tried a few informal expert debates using theoretical computer science questions, and the main lesson is that the structure of the debate matters a great deal. The debaters were allowed to point to a small snippet of a mathematical definition on Wikipedia, but not to any page that directly answered the question. To reduce tells, we first tried to write a full debate transcript with only minimal interaction with a layperson, then showed the completed transcript to several more laypeople judges. Unfortunately, even the layperson present when the debate was conducted picked the lying debater as honest, due to a misunderstanding of the question (which was whether the complexity classes P and BPP are probably equal). As a result, throughout the debate the honest debater did not understand what the judge was thinking, and failed to correct an easy but important misunderstanding. We fixed this in a second debate by letting a judge ask questions throughout, but still showing the completed transcript to a second set of judges to reduce tells. See the appendix for the transcript of this second debate.
Other tasks: bias tests, probability puzzles, etc.
Synthetic image debates and expert debates are just two examples of possible tasks. More thought will be required to find tasks that satisfy all our criteria, and these criteria will change as experiments progress. Pulling from existing social science research will be useful, as there are many cognitive tasks with existing research results. If we can map these tasks to debate, we can compare debate directly against baselines in psychology and other fields.
For example, Bertrand and Mullainathan sent around 5000 resumes in response to real employment ads, randomizing the resumes between White and African American sounding names. With otherwise identical resumes, the choice of name significantly changed the probability of a response. This experiment corresponds to the direct question “Should we call back given this resume?” What if we introduce a few steps of debate? An argument against a candidate based on name or implicit inferences from that name might come across as obviously racist, and convince at least some judges away from discrimination. Unfortunately, such an experiment would necessarily differ from Bertrand et al.’s original, where employers did not realize they were part of an experiment. Note that this experiment works even though the source of truth is partial: we do not know whether a particular resume should be hired or not, but most would agree that the answer should not depend on the candidate’s name.
For biases affecting probabilistic reasoning and decision making, there is a long literature exploring how people decide between gambles such as “Would you prefer $2 with certainty, or $1 40% of the time and $3 otherwise?”. For example, Erev et al. constructed an 11-dimensional space of gambles sufficient to reproduce 14 known cognitive biases, from which new instances can be algorithmically generated. Would debates about gambles reduce cognitive biases? One difficulty here is that simple gambles might fail the “definitive argument longer than debate limit” criterion if an expected utility calculation is sufficient to prove the answer, making it difficult for a lying debater to meaningfully compete.
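To see why such gambles can fail that criterion, note how short the honest side’s complete argument can be; the log utility used to illustrate risk aversion below is an arbitrary assumption:

```python
import math

# Expected values for the example gamble: $2 with certainty vs. $1 with
# probability 0.4 and $3 otherwise.
certain = 2.0
risky_ev = 0.4 * 1.0 + 0.6 * 3.0
print(certain, risky_ev)  # 2.0 vs 2.2: the gamble wins on expected value

# A risk-averse judge (illustrated here with log utility) flips the ranking,
# but either way the whole argument fits in a couple of lines.
print(math.log(certain), 0.4 * math.log(1.0) + 0.6 * math.log(3.0))  # ~0.69 vs ~0.66
```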
Interestingly, Chen et al. used a similar setup to human+human+human debate to improve the quality of human data collected in a synthetic “Relation Extraction” task. People were first asked for direct answers, then pairs of people who disagreed were asked to discuss and possibly update their answers. Here the debaters and judges are the same, but the overall goal of extracting higher quality information from humans is shared with debate.
Questions social science can help us answer
We’ve laid out the general program for learning AI goals by asking humans questions, and discussed how to use debate to strengthen what we can learn by targeting the reasoning behind conclusions. Whether we use direct questions or something like debate, any intervention that gives us higher quality answers is more likely to produce aligned AI. The quality of those answers depends on the human judges, and social science research can help to measure answer quality and improve it. Let’s go into more detail about what types of questions we want to answer, and what we hope to do with that information. Although we will frame these questions as they apply to debate, most of them apply to any other method which learns goals from humans.
- How skilled are people as judges by default? If we ran debate using a person chosen at random as the judge, and gave them no training, would the result be aligned? A person picked at random might be vulnerable to convincing fallacious reasoning, leading AI to employ such reasoning. Note that the debaters are not chosen at random: once the judge is fixed, we care about debaters who either learn to help the judge (in the good case) or to exploit the judge’s weaknesses (in the bad case).
- Can we distinguish good judges from bad judges? People likely differ in the ability to judge debates. There are many filters we could use to identify good judges: comparing their verdicts to those of other judges, to people given more time to think, or to known expert judgment. Note that domain expertise may be quite different from what makes a good judge of debate. Although there is evidence that domain expertise reduces bias, “expert” political forecasters may actually be worse than non-experts (chapter 3). Ideally we would like filters that do not require an independent source of truth, though at experiment time we will need a source of truth to know whether a filter works. It is not obvious a priori that good filters exist, and any filter would need careful scrutiny to ensure it does not introduce bias into our choice of judges.
- Does judge ability generalize across domains? If judge ability in one domain fails to transfer to other domains, we will have low confidence that it transfers to new questions and arguments arising from highly capable AI debaters. This generalization is necessary to trust debate as a method for alignment, especially once we move to questions where no independent source of truth is available. We emphasize that judge ability is not the same as knowledge: there is evidence that expertise often fails to generalize across domains, but argument evaluation could transfer where expertise does not.
- Can we train people to be better judges? Peer review, practice, debiasing, formal training such as argument mapping, expert panels, tournaments, and other interventions may make people better at judging debates. Which mechanisms work best?
- What questions are people better at answering? If we know that humans are bad at answering certain types of questions, we can switch to reliable formulations. For example, phrasing questions in frequentist terms may reduce known cognitive biases. Graham et al. argue that different political views follow from different weights placed on fundamental moral considerations, and similar analysis could help understand where we can expect moral disagreements to persist after reflective equilibrium. In cases where reliable answers are unavailable, we need to ensure that trained models know their own limits, and express uncertainty or disagreement as required.
- Are there ways to restrict debate to make it easier to judge? People might be better at judging debates formulated in terms of calm, factual statements, and worse at judging debates designed to trigger strong emotions. Or, counterintuitively, it could be the other way around. If we know which styles of debate people are better at judging, we may be able to restrict AI debaters to these styles.
- How can people work together to improve quality? If individuals are insufficient judges, are teams of judges better? Majority vote is the simplest option, but perhaps several people talking through an answer together is stronger, either actively or after the fact through peer review. Condorcet’s jury theorem implies that majority votes can amplify weakly good judgments to strong judgments (or weakly bad judgments to worse), but aggregation may be more complex in cases of probabilistic judgment. Teams could be informal or structured; see the Delphi technique for an example of structured teams applied to forecasting.
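As a quick illustration of the jury-theorem point in the last item, a small simulation shows majority vote amplifying individual judges in both directions (independent judges and a fixed per-judge accuracy are simplifying assumptions):

```python
import random

def majority_accuracy(per_judge_accuracy, n_judges, trials=20_000, seed=0):
    """Probability that a majority of independent judges reaches the right verdict."""
    rng = random.Random(seed)
    wins = 0
    for _ in range(trials):
        correct_votes = sum(rng.random() < per_judge_accuracy for _ in range(n_judges))
        wins += correct_votes > n_judges // 2
    return wins / trials

for p in (0.6, 0.45):  # weakly good vs. weakly bad individual judges
    print(p, [round(majority_accuracy(p, n), 3) for n in (1, 9, 51)])
# 0.6  -> roughly [0.60, 0.73, 0.93]: weak judges amplified toward reliable judgments
# 0.45 -> roughly [0.45, 0.38, 0.24]: weak errors are amplified just as readily
```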
We believe these questions require social science experiments to satisfactorily answer.
Given our lack of experience outside of ML, we are not able to precisely articulate all of the different experiments we need. The only way to fix this is to talk to more people with different backgrounds and expertise. We have started this process, but are eager for more conversations with social scientists about what experiments could be run, and encourage other AI safety efforts to engage similarly.
Reasons for optimism
We believe that understanding how humans interact with long-term AI alignment is difficult but possible. However, this would be a new research area, and we want to be upfront about the uncertainties involved. In this section and the next, we discuss some reasons for optimism and pessimism about whether this research will succeed. We focus on issues specific to human uncertainty and associated social science research; for similar discussion on ML uncertainty in the case of debate we refer to our previous work.
Engineering vs. science
Most social science seeks to understand humans “in the wild”: results that generalize to people going about their everyday lives. With limited control over these lives, differences between laboratory and real life are bad from the scientific perspective. In contrast, AI alignment seeks to extract the best version of what humans want: our goal is engineering rather than science, and we have more freedom to intervene. If judges in debate need training to perform well, we can provide that training. If some people still do not provide good data, we can remove them from experiments (as long as this filter does not create too much bias). This freedom to intervene means that some of the difficulty in understanding and improving human reasoning may not apply. However, science is still required: once our interventions are in place, we need to correctly know whether our methods work. Since our experiments will be an imperfect model of the final goal, careful design will be necessary to minimize this mismatch, just as is required by existing social science.
We don’t need to answer all questions
Our most powerful intervention is to give up: to recognize that we are unable to answer some types of questions, and instead prevent AI systems from pretending to answer. Humans might be good judges on some topics but not others, or with some types of reasoning but not others; if we discover that, we can adjust our goals appropriately. Giving up on some types of questions is achievable either on the ML side, using careful uncertainty modeling to know when we do not know, or on the human side by training judges to understand their own areas of uncertainty. Although we will attempt to formulate ML systems that automatically detect areas of uncertainty, any information we can gain on the social science side about human uncertainty can be used both to augment ML uncertainty modeling and to test whether ML uncertainty modeling works.
Relative accuracy may be enough
Say we have a variety of different ways to structure debate with humans. Ideally, we would like to achieve results of the form “debate structure A is truth-seeking with 90% confidence”. Unfortunately, we may be unconfident that an absolute result of this form will generalize to advanced AI systems: it may hold for an experiment with simple tasks but break down later on. However, even if we can’t achieve such absolute results, we can still hope for relative results of the form “debate structure A is reliably better than debate structure B”. Such a result may be more likely to generalize into the future, and assuming it does we will know to use structure A rather than B.
We don’t need to pin down the best alignment scheme
As the AI safety field progresses to increasingly advanced ML systems, we expect research on the ML side and the human side to merge. Starting social science experiments prior to this merging will give the field a head start, but we can also take advantage of the expected merging to make our goals easier. If social science research narrows the design space of human-friendly AI alignment algorithms but does not produce a single best scheme, we can test the smaller design space once the machines are ready.
A negative result would be important!
If we test an AI alignment scheme from the social science perspective and it fails, we’ve learned valuable information. There are a variety of proposed alignment schemes, and learning which don’t work early gives us more time to switch to others, or to intervene on a policy level to slow down dangerous development. In fact, given our belief that AI alignment is harder for more advanced agents, a negative result might be easier to believe and thus more valuable than a less trustworthy positive result.
Reasons to worry
We turn next to reasons social science experiments about AI alignment might fail to produce useful results. We emphasize that useful results might be either positive or negative, so these are not reasons why alignment schemes might fail. Our primary worry is one-sided: that experiments would say an alignment scheme works when in fact it does not, though errors in the other direction are also undesirable.
Our desiderata are conflicting
As mentioned before, some of our criteria when picking experimental tasks are in conflict. We want tasks that are sufficiently interesting (not too easy), that have a source of verifiable ground truth, that are not too hard, and so on. “Not too easy” and “not too hard” are in obvious conflict, but there are other more subtle difficulties. Domain experts with the knowledge to debate interesting tasks may not be the same people capable of lying effectively, and both restrictions make it hard to gather large volumes of data. Lying effectively is required for a meaningful experiment, since a trained AI may have no trouble lying unless lying is a poor strategy to win debates. Experiments to test whether ethical biases interfere with judgment may make it more difficult to find tasks with reliable ground truth, especially on subjects with significant disagreement across people. The natural way out is to use many different experiments to cover different aspects of our uncertainty, but this would take more time and might fail to notice interactions between desiderata.
We want to measure judge quality given optimal debaters
For debate, our end goal is to understand if the judge is capable of determining who is telling the truth. However, we specifically care whether the judge performs well given that the debaters are performing well. Thus our experiments have an inner/outer optimization structure: we first train the debaters to debate well, then measure how well the judges perform. This increases time and cost: if we change the task, we may need to find new debaters or retrain existing debaters. Worse, the human debaters may be bad at performing the task, either out of inclination or ability. Poor performance is particularly bad if it is one sided and applies only to lying: a debater might be worse at lying out of inclination or lack of practice, and thus a win for the honest debater might be misleading.
ML algorithms will change
It is unclear when or if ML systems will reach various levels of capability, and the algorithms used to train them will evolve over time. The AI alignment algorithms of the future may be similar to the proposed algorithms of today, or they may be very different. However, we believe that knowledge gained on the human side will partially transfer: results about debate will teach us about how to gather data from humans even if debate is superseded. The algorithms may change; humans will not.
Need strong out-of-domain generalization
Regardless of how carefully designed our experiments are, human+human+human debate will not be a perfect match to AI+AI+human debate. We are seeking research results that generalize to the setting where we replace the human debaters (or similar) with AIs of the future, which is a hard ask. This problem is fundamental: we do not have the advanced AI systems of the future to play with, and want to learn about human uncertainty starting now.
Lack of philosophical clarity
Any AI alignment scheme will be both an algorithm for training ML systems and a proposed definition of what it means to be aligned. However, we do not expect humans to conform to any philosophically consistent notion of values, and concepts like reflective equilibrium must be treated with caution in case they break down when applied to real human judgement. Fortunately, algorithms like debate need not presuppose philosophical consistency: a back and forth conversation to convince a human judge makes sense even if the human is leaning on heuristics, intuition, and emotion. It is not obvious that debate works in this messy setting, but there is hope if we take advantage of inaction bias, uncertainty modeling, and other escape hatches. We believe lack of philosophical clarity is an argument for investing in social science research: if humans are not simple, we must engage with their complexity.
The scale of the challenge
Long-term AI safety is particularly important if we develop artificial general intelligence (AGI), which the OpenAI Charter defines as highly autonomous systems that outperform humans at most economically valuable work. If we want to train an AGI with reward learning from humans, it is unclear how many samples will be required to align it. As much as possible, we can try to replace human samples with knowledge about the world gained by reading language, the internet, and other sources of information. But it is likely that a fairly large number of samples from people will still be required. Since more samples means less noise and more safety, if we are uncertain about how many samples we need then we will want a lot of samples.
A lot of samples would mean recruiting a lot of people. We cannot rule out needing to involve thousands to tens of thousands of people for millions to tens of millions of short interactions: answering questions, judging debates, etc. We may need to train these people to be better judges, arrange for peers to judge each other’s reasoning, determine who is doing better at judging and give them more weight or a more supervisory role, and so on. Many researchers would be required on the social science side to extract the highest quality information from the judges.
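As a purely back-of-envelope illustration of these numbers (the per-interaction time and working hours per year are assumptions):

```python
# Upper end of the ranges mentioned above, with an assumed per-interaction time.
interactions = 10_000_000  # "tens of millions of short interactions"
people       = 10_000      # "tens of thousands of people"
minutes_each = 3           # assumption: a few minutes per question answered or debate judged

per_person_hours = interactions / people * minutes_each / 60
total_person_years = interactions * minutes_each / 60 / 2000  # ~2000 working hours per year

print(per_person_hours)    # 50.0 hours of judging per participant
print(total_person_years)  # 250.0 person-years of human judgment in total
```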
A task of this scale would be a large interdisciplinary project, requiring close collaborations in which people of different backgrounds fill in each other’s missing knowledge. If machine learning reaches this scale, it is important to get a head start on the collaborations soon.
Conclusion: how you can help
We have argued that the AI safety community needs social scientists to tackle a major source of uncertainty about AI alignment algorithms: will humans give good answers to questions? This uncertainty is difficult to tackle with conventional machine learning experiments, since current machine learning is too primitive. We are still in the early days of performance on natural language and other tasks, and problems with human reward learning may only show up on tasks we cannot yet tackle.
Our proposed solution is to replace machine learning with people, at least until ML systems can participate in the complexity of debates we are interested in. If we want to understand a game played with ML and human participants, we replace the ML participants with people, and see how the all human game plays out. For the specific example of debate, this means replacing debates with two ML debaters and a human judge by debates with two human debaters and a human judge. The result is a pure human experiment, motivated by machine learning but available to anyone with a solid background in experimental social science. It won’t be an easy experiment, which is all the more reason to start soon.
If you are a social scientist interested in these questions, please talk to AI safety researchers! We are interested in both conversation and close collaboration. There are many institutions engaged with safety work using reward learning, including our own institution OpenAI, DeepMind, and Berkeley’s CHAI. The AI safety organization Ought is already exploring similar questions, asking how iterated amplification behaves with humans.
If you are a machine learning researcher interested in or already working on safety, please think about how alignment algorithms will work once we advance to tasks beyond the abilities of current machine learning. If your preferred alignment scheme uses humans in an important way, can you simulate the future by replacing some or all ML components with people? If you can imagine these experiments but don’t feel you have the expertise to perform them, find someone who does.