Review

Neural basis of reinforcement learning and decision making

Daeyeol Lee et al. Annu Rev Neurosci. 2012.

Abstract

Reinforcement learning is an adaptive process in which an animal utilizes its previous experience to improve the outcomes of future choices. Computational theories of reinforcement learning play a central role in the newly emerging areas of neuroeconomics and decision neuroscience. In this framework, actions are chosen according to their value functions, which describe how much future reward is expected from each action. Value functions can be adjusted not only through reward and penalty, but also by the animal's knowledge of its current environment. Studies have revealed that a large proportion of the brain is involved in representing and updating value functions and using them to choose an action. However, how the nature of a behavioral task affects the neural mechanisms of reinforcement learning remains incompletely understood. Future studies should uncover the principles by which different computational elements of reinforcement learning are dynamically coordinated across the entire brain.

Figures

Figure 1. Economic and reinforcement learning theories of decision making

(a) In economic theories, decision making corresponds to selecting an action with the maximum utility. (b) In reinforcement learning, actions are chosen probabilistically (i.e., softmax) on the basis of their value functions. In addition, value functions are updated on the basis of the outcome (reward or penalty) resulting from the action chosen by the animal. RPE, reward prediction error.
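The contrast in panels (a) and (b) can be sketched in a few lines of code. This is a minimal illustration, not the specific model fitted in any of the reviewed studies: the function names, the inverse temperature `beta`, and the learning rate `alpha` are assumptions introduced here, and the update shown is a simple delta rule driven by the reward prediction error.

```python
import math
import random


def greedy_choice(utilities):
    """Panel (a): pick the action with the maximum utility (deterministic)."""
    return max(range(len(utilities)), key=lambda i: utilities[i])


def softmax_probs(values, beta=1.0):
    """Panel (b): turn action values into choice probabilities.

    beta is an assumed inverse-temperature parameter; larger values make
    the choice more deterministic.
    """
    m = max(values)  # subtract the maximum for numerical stability
    exps = [math.exp(beta * (v - m)) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def update_value(values, action, reward, alpha=0.1):
    """Update only the chosen action's value using the RPE.

    RPE = reward - current value of the chosen action.
    alpha is an assumed learning rate.
    """
    rpe = reward - values[action]
    new_values = list(values)
    new_values[action] += alpha * rpe
    return new_values, rpe


if __name__ == "__main__":
    rng = random.Random(0)
    values = [0.0, 0.0]
    reward_prob = [0.2, 0.8]  # hypothetical reward probabilities per action
    for _ in range(500):
        probs = softmax_probs(values, beta=3.0)
        action = rng.choices([0, 1], weights=probs)[0]
        reward = 1.0 if rng.random() < reward_prob[action] else 0.0
        values, _ = update_value(values, action, reward)
    print(values)
```

Over repeated trials, the value of each action drifts toward its average reward, and the softmax rule shifts choices toward the richer option while still sampling the other one occasionally.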

Figure 2. Time course of signals related to different state value functions during decision making

(a) Signals related to the state value functions before (red) and after (blue) decision making in the dorsolateral prefrontal cortex (DLPFC; Kim et al. 2008, Kim & Lee 2011) and striatum (Cai et al. 2011) during an intertemporal choice task. These two state value functions correspond to the average of the action value functions for two options and the chosen value, respectively. During these studies, monkeys chose between a small immediate reward and a large delayed reward, and the magnitude of neural signals related to different value functions was estimated by the coefficient of partial determination (CPD). Lines correspond to the mean CPD for all the neurons recorded in each brain area with the shaded area corresponding to the standard error of the mean. (b) Proportion of neurons carrying chosen value signals in the rodent lateral (AGl) and medial (AGm) agranular cortex, corresponding to the primary and secondary motor cortex, respectively, dorsal (DS) and ventral (VS) striatum, anterior cingulate cortex (ACC), prelimbic (PLC)/infralimbic (ILC) cortex, and orbitofrontal cortex (OFC). During these studies (Kim et al. 2009, Sul et al. 2010, 2011), the rats performed a dynamic foraging task. Large symbols indicate that the proportions are significantly (p<0.05) above the chance level.
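As a rough illustration of the quantities named in panel (a), the sketch below computes the two state values from a pair of action values and the CPD of one regressor in an ordinary least-squares model. It uses the standard definition, CPD = (SSE_reduced - SSE_full) / SSE_reduced, where the reduced model omits the regressor of interest. The function names and the plain-Python solver are assumptions made for this example; the regression models actually used in the cited studies are not given here.

```python
def pre_decision_state_value(action_values):
    """State value before the choice: average of the action values."""
    return sum(action_values) / len(action_values)


def chosen_value(action_values, choice):
    """State value after the choice: value of the chosen action."""
    return action_values[choice]


def solve_linear(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x


def sse(columns, y):
    """Residual sum of squares for OLS of y on an intercept plus columns."""
    n = len(y)
    design = [[1.0] + [col[t] for col in columns] for t in range(n)]
    k = len(design[0])
    xtx = [[sum(design[t][i] * design[t][j] for t in range(n))
            for j in range(k)] for i in range(k)]
    xty = [sum(design[t][i] * y[t] for t in range(n)) for i in range(k)]
    beta = solve_linear(xtx, xty)
    return sum((y[t] - sum(b * x for b, x in zip(beta, design[t]))) ** 2
               for t in range(n))


def cpd(columns, y, index):
    """Coefficient of partial determination for the regressor at `index`."""
    reduced = [c for i, c in enumerate(columns) if i != index]
    sse_reduced = sse(reduced, y)
    sse_full = sse(columns, y)
    return (sse_reduced - sse_full) / sse_reduced
```

Here `y` would stand for a neuron's firing rate across trials and each entry of `columns` for one trial-by-trial regressor such as a value function; a CPD near 1 means the regressor accounts for almost all of the variance left unexplained by the others.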

Figure 3. Time course of signals related to the animal’s choice, its outcome, and action-outcome conjunction in multiple brain areas of primates and rodents

(a) Spatial layout of the choice targets during a matching pennies task used in single-neuron recording experiments in monkeys. (b) Brain regions tested during the studies on monkeys (Barraclough et al. 2004, Seo & Lee 2007, Seo et al. 2009). ACCd, dorsal anterior cingulate cortex; DLPFC, dorsolateral prefrontal cortex; LIP, lateral intraparietal cortex. (c) Fraction of neurons significantly modulating their activity according to the animal’s choice (top), its outcome (middle), and choice-outcome conjunction (bottom) during the current (trial lag = 0) and 3 previous trials (trial lags = 1 to 3). (d) Modified T-maze used in a rodent dynamic foraging task. (e) Anatomical areas tested in single-neuron recording experiments in rodents (Kim et al. 2009, Sul et al. 2010, 2011). Same abbreviations as in Figure 2b. (f) Fraction of neurons significantly modulating their activity according to the animal’s choice (top), its outcome (middle), and choice-outcome conjunction (bottom) during the current (lag = 0) and previous trials (lag = 1). Large symbols indicate that the proportions are significantly (p<0.05) above the chance level.
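One common way to decide whether a fraction of neurons is "significantly above the chance level" is a one-sided binomial test against the false-positive rate of the per-neuron test. The sketch below assumes that approach and a chance level of 0.05; the caption does not state which test the original studies used, so both the test and the function names are illustrative assumptions.

```python
from math import comb


def binomial_tail(k, n, p):
    """P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


def above_chance(n_significant, n_neurons, chance=0.05, alpha=0.05):
    """True if the count of significant neurons exceeds what chance predicts.

    chance: assumed probability that a single neuron passes its test by
    chance alone. alpha: assumed criterion for the population-level test.
    """
    return binomial_tail(n_significant, n_neurons, chance) < alpha
```

For example, with 100 recorded neurons, 5 significant neurons would be about what chance alone predicts, whereas 20 would not.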

Figure 4. Areas in the human brain involved in updating model-free and model-based value functions (Behrens et al. 2008)

(a) Regions in which the activity is correlated with the volatility in estimating the value functions based on reward history (green) and social information (red). (b) Activity in the ventromedial prefrontal cortex was correlated with the value functions regardless of whether they were estimated from reward history or social information. (c) Subjects more strongly influenced by reward history (ordinate) tended to show greater signal change in the anterior cingulate cortex in association with reward history (abscissa; green region in a). (d) Subjects more strongly influenced by social information (ordinate) showed greater signal change in the anterior cingulate cortex in association with social information (abscissa; red region in a).

Figure 5. Neuronal activity related to hypothetical outcomes in the primate orbitofrontal cortex

(a) Rock-paper-scissors task used for single-neuron recording studies in monkeys (Abe & Lee 2011). (b) An example neuron recorded in the orbitofrontal cortex that modulated its activity according to the magnitude of reward that was available from the unchosen winning target (indicated by ‘W’ in the top panels). The spike density function of this neuron was estimated separately according to the position of the winning target (columns), the position of the target chosen by the animal (rows), and the magnitude of the reward available from the winning target (colors).

References

    1. Abe H, Lee D. Distributed coding of actual and hypothetical outcomes in the orbital and dorsolateral prefrontal cortex. Neuron. 2011;70:731–741.
    2. Andersen RA, Essick GK, Siegel RM. Neurons of area 7 activated by both visual stimuli and oculomotor behavior. Exp. Brain Res. 1987;67:316–322.
    3. Balleine BW, Dickinson A. Goal-directed instrumental action: contingency and incentive learning and their cortical substrates. Neuropharmacology. 1998;37:407–419.
    4. Barraclough DJ, Conroy ML, Lee D. Prefrontal cortex and decision making in a mixed-strategy game. Nat. Neurosci. 2004;7:404–410.
    5. Beck JM, Ma WJ, Kiani R, Hanks T, Churchland AK, Roitman J, Shadlen MN, Latham PE, Pouget A. Probabilistic population codes for Bayesian decision making. Neuron. 2008;60:1142–1152.