Neural computations underlying arbitration between model-based and model-free learning

Sang Wan Lee et al. Neuron. 2014.

Abstract

There is accumulating neural evidence to support the existence of two distinct systems for guiding action selection, a deliberative "model-based" and a reflexive "model-free" system. However, little is known about how the brain determines which of these systems controls behavior at one moment in time. We provide evidence for an arbitration mechanism that allocates the degree of control over behavior by model-based and model-free systems as a function of the reliability of their respective predictions. We show that the inferior lateral prefrontal and frontopolar cortex encode both reliability signals and the output of a comparison between those signals, implicating these regions in the arbitration process. Moreover, connectivity between these regions and model-free valuation areas is negatively modulated by the degree of model-based control in the arbitrator, suggesting that arbitration may work through modulation of the model-free valuation system when the arbitrator deems that the model-based system should drive behavior.

Copyright © 2014 Elsevier Inc. All rights reserved.

Figures

Figure 1

Task design. (A) Sequential two-choice Markov decision task. Participants move from one state to the other with a certain state-transition probability _p_ following a binary choice (left or right). (B) Illustration of the specific goal condition, in which the color of the collecting box (either yellow, blue, or red) should match the color of the coin, and the flexible condition, in which participants are allowed to collect any kind of coin. The high uncertainty condition corresponds to _p_=(0.5,0.5) and the low uncertainty condition corresponds to _p_=(0.9,0.1). (C) Illustration of the task. The specific goal block requires participants to rely on a model-based strategy for guiding choices in each state, while, in the flexible goal block, an initial model-based strategy during early experience can give way to a model-free strategy after extensive experience. See also Figure S2.

Figure 2

Computational hypothesis to account for arbitration between model-based and model-free learning strategies. A Bayesian model computes the reliability of the model-based system from the state-prediction error used to update that system's state-action values, and a Pearce-Hall-type associability model computes the reliability of the model-free system from the reward-prediction error used to update its state-action values. The computed reliabilities function as transition rates in a two-state transition model, in which the two states represent the probability of choosing the model-based learning strategy (PMB) and the model-free strategy (1-PMB), respectively. The state-action value regulating actual choice behavior is the weighted average of the values from the two reinforcement learning systems. See also Figure S1 and Table S1.
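The arbitration scheme in this caption can be illustrated with a minimal Python sketch. It is not the paper's fitted model: the function names, the `step` parameter, and the choice to use each reliability directly as a transition rate are assumptions made here for illustration only.

```python
def update_p_mb(p_mb, rel_mb, rel_mf, step=0.1):
    """One discrete step of a two-state transition model (illustrative).

    Assumption: rel_mb is used directly as the rate of moving from
    model-free to model-based control, and rel_mf as the rate of the
    opposite move. The paper's actual rate functions are not given
    in the caption.
    """
    p_new = p_mb + step * (rel_mb * (1.0 - p_mb) - rel_mf * p_mb)
    # Keep the result a valid probability.
    return min(1.0, max(0.0, p_new))


def integrate_values(q_mb, q_mf, p_mb):
    """Weighted average of model-based and model-free action values."""
    return [p_mb * mb + (1.0 - p_mb) * mf for mb, mf in zip(q_mb, q_mf)]


# Example: a reliable model-based system pulls control toward itself.
p_mb = update_p_mb(0.5, rel_mb=1.0, rel_mf=0.0)
q = integrate_values([1.0, 0.0], [0.0, 1.0], 0.75)
```

With equal reliabilities and PMB at 0.5, the update leaves PMB unchanged, which matches the intuition that neither system gains control when both are equally reliable.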

Figure 3

Behavioral results. (A) Performance of the subjects in the form of the mean total reward accrued, the reward rate, and the proportion of optimal choices. The left bar graph shows the reward value received per trial, averaged over all subjects. The middle bar graph shows the reward rate, the proportion of trials on which the rewarding goal is reached. The right bar graph shows the proportion of optimal choices, defined by the ideal agent’s behavior in each condition (Figure S2A). The bold line in the bars marks the baseline given by an agent making random choices. Green corresponds to the low state-transition uncertainty condition, and yellow corresponds to the high uncertainty condition. Error bars = SEM across subjects. (B) Performance of the arbitrator in capturing variation in subjects’ choice behavior, demonstrating that the model predicts subjects’ choices well. The model-predicted probability of choosing the right action has been split into five equal-sized bins. The proportion of subjects’ right choices increases with the model’s action probability. Error bars are SEM. (C) Performance of the arbitrator in capturing trial-by-trial variation in the consistency of participants’ choice behavior, plotted separately for situations in which the arbitrator favors model-based control (PMB>0.5) and situations in which it favors model-free control (PMB<0.5). The choice consistency is the proportion of changes of choices from trial to trial in each state. Choice consistency is significantly higher when the arbitrator predicts predominantly model-free control compared to when it predicts predominantly model-based control. On the other hand, simply plotting the choice consistency as a function of the experimental conditions (specific vs flexible goal) is not sufficient to reveal robust differences on this behavioral measure. Results are plotted separately for two different states in the task (States 1 and 4, the states at layers 1 and 2 of the task, respectively). States 2, 3, and 5 are rarely sampled by participants because they lead to relatively low-valued outcomes, and hence are not plotted here, as there are insufficient samples to extract meaningful performance plots. Error bars are SEM. (D) Results from a log-likelihood test comparing how well model-based vs model-free reinforcement learning accounts for participants’ choices, plotted separately for (i) situations in which the arbitrator favors model-based control (PMB>0.5) and (ii) situations in which the arbitrator favors model-free control (PMB<0.5). The model-based and the model-free models were fitted independently to prevent circularity. The test statistic of the likelihood-ratio test is the log-likelihood value of the model-based model minus that of the model-free model. The more negative the ratio, the better the model-free system accounts for behavior, while the more positive the ratio, the better the model-based system accounts for behavior. As can be seen, in the strategic goal-condition the ratio test favors the model-free system (significant at p<1e-4), while in the flexible goal-condition the ratio test favors the model-based system (significant at p<1e-11). These findings thereby validate the task manipulations by showing that the task can successfully manipulate control to be governed predominantly by either the model-based or model-free system. Error bars are SEM. See also Table S2.
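The analyses in panels (B) and (D) can be sketched in Python as follows. This is an illustration under stated assumptions, not the authors' analysis code: the function names are hypothetical, the five bins are assumed to be equal-width over [0, 1] (the caption does not say whether bins are equal-width or equal-count), and choices are coded 1 for "right" and 0 for "left".

```python
import math


def binned_choice_rate(pred_right, chose_right, n_bins=5):
    """Proportion of 'right' choices within each predicted-probability bin.

    Assumption: bins are equal-width over [0, 1]. Empty bins yield None.
    """
    counts = [0] * n_bins
    rights = [0] * n_bins
    for p, c in zip(pred_right, chose_right):
        idx = min(int(p * n_bins), n_bins - 1)  # p == 1.0 goes in the last bin
        counts[idx] += 1
        rights[idx] += int(c)
    return [r / n if n else None for r, n in zip(rights, counts)]


def log_likelihood(pred_right, chose_right, eps=1e-12):
    """Log-likelihood of binary choices under predicted probabilities."""
    total = 0.0
    for p, c in zip(pred_right, chose_right):
        p = min(1.0 - eps, max(eps, p))  # avoid log(0)
        total += math.log(p if c else 1.0 - p)
    return total


def ll_difference(pred_mb, pred_mf, chose_right):
    """Model-based minus model-free log-likelihood (positive favors MB)."""
    return log_likelihood(pred_mb, chose_right) - log_likelihood(pred_mf, chose_right)
```

A positive value from `ll_difference` means the model-based predictions fit the choices better, matching the sign convention described in panel (D).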

Figure 4

Neural correlates of reliability-based arbitration. (A) (Top) Bilateral inferior lateral prefrontal cortex (ilPFC) encodes reliability signals for the model-based (RelMB) and the model-free (RelMF) systems individually. The two reliabilities are, by and large, not highly correlated (mean: −0.26, standard deviation: 0.106), suggesting that our task successfully dissociates the model-based from the model-free system. Effects significant at p<0.05 (FWE corrected) are shown in yellow. (Bottom) A region of rostral anterior cingulate cortex (rACC) was found to encode the difference in reliability between the model-based and model-free systems (RelMB-RelMF), while an area of bilateral ilPFC and right frontopolar cortex (FPC) was correlated with the reliability of whichever system had the higher reliability index on each trial (max(RelMB, RelMF)). (B) The mean percent signal change for a parametric modulator encoding a max and a difference reliability signal in lateral prefrontal cortex (lPFC) and rACC. The signal has been split into two equal-sized bins according to the 50th and 100th percentile. The error bars are SEM across subjects. See also Figure S3 and Table S3.
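The two comparison signals named in this caption, max(RelMB, RelMF) and RelMB-RelMF, can be computed from trial-wise reliabilities as in the short Python sketch below. The function name and the list-based inputs are assumptions for illustration; how the reliabilities themselves are estimated, and how the resulting regressors enter the fMRI model, is not specified here.

```python
def reliability_regressors(rel_mb, rel_mf):
    """Trial-wise max and difference reliability signals (illustrative).

    rel_mb, rel_mf: sequences of per-trial reliabilities for the
    model-based and model-free systems, assumed to be the same length.
    """
    max_rel = [max(mb, mf) for mb, mf in zip(rel_mb, rel_mf)]
    diff_rel = [mb - mf for mb, mf in zip(rel_mb, rel_mf)]
    return max_rel, diff_rel


# Example with two trials.
max_rel, diff_rel = reliability_regressors([0.25, 0.75], [0.5, 0.5])
```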

Figure 5

Results of a model comparison on BOLD correlates of the arbitration process. We implemented a Bayesian model selection analysis and show voxels for which the exceedance probability in favor of a given model exceeds 0.9. UncBayesArb refers to the uncertainty-based arbitration used by Daw et al. (2005), dualBayesArb refers to the dualBayesArb-dynamic model, and mixedArb refers to the mixedArb-dynamic model. The colored blobs mark the voxels in which the exceedance probability is >0.9, indicating that the corresponding model provides a significantly better account of the neural activity in that region. See also Figure S4.

Figure 6

Neural correlates of model-based and model-free value signals. QMB refers to the chosen value of the model-based system and QMF to the chosen value of the model-free system; the areas labeled QMB|MF respond to chosen values for both systems. QArb refers to the encoding of the chosen minus unchosen value signals, in which the value signals are a weighted combination of model-based and model-free values determined by the output of the arbitrator (PMB). See also Figure S5 and Table S4.

Figure 7

Neural correlates of value integration. (A) Connectivity analyses between reliability regions in inferior lateral prefrontal cortex and model-free value areas. The shaded circles represent seed regions from which physiological signals were extracted, and colored blobs show the psychophysiological interaction effect. Shown are significant negative correlations, modulated by PMB, between activity in the left inferior lateral prefrontal cortex and a region of posterior putamen (in orange), between the right inferior lateral prefrontal cortex and the bilateral anterior putamen (in green), and between the right frontopolar cortex (FPC) and the right posterior putamen (in purple). (B) Connectivity analyses between model-value areas and the vmPFC area involved in encoding the integrated value signal. Shown in cyan is the negative modulation by PMB of the effect of posterior putamen activity on ventromedial prefrontal cortex activity. All images are shown thresholded at p<0.001 for display purposes. See also Figure S5 and Table S5.
