Dopamine, uncertainty and TD learning

Yael Niv et al. Behav Brain Funct. 2005.

Abstract

Substantial evidence suggests that the phasic activities of dopaminergic neurons in the primate midbrain represent a temporal difference (TD) error in predictions of future reward, with increases above and decreases below baseline consequent on positive and negative prediction errors, respectively. However, dopamine cells have very low baseline activity, which implies that the representation of these two sorts of error is asymmetric. We explore the implications of this seemingly innocuous asymmetry for the interpretation of dopaminergic firing patterns in experiments with probabilistic rewards, which bring about persistent prediction errors. In particular, we show that when the non-stationary prediction errors are averaged across trials, a ramp in the activity of the dopamine neurons should be apparent, whose magnitude depends on the learning rate. This exact phenomenon was observed in a recent experiment, although it was interpreted there in antipodal terms, as a within-trial encoding of uncertainty.

Figures

Figure 1

Averaged prediction errors in a probabilistic reward task. (a) DA response in trials with different reward probabilities. Population peri-stimulus time histograms (PSTHs) show the summed spiking activity of several DA neurons over many trials, for each p_r, pooled over rewarded and unrewarded trials at intermediate probabilities. (b) TD prediction error with asymmetric scaling. In the simulated task, in each trial one of five stimuli was randomly chosen and displayed at time t = 5. The stimulus was turned off at t = 25, at which time a reward was given with a probability p_r specified by the stimulus. We used a tapped delay-line representation of the stimuli (see text), with each stimulus represented by a different set of units ('neurons'). The TD error was δ(t) = r(t) + w(t - 1)·x(t) - w(t - 1)·x(t - 1), with r(t) the reward at time t, and x(t) and w(t) the state and weight vectors for the unit. A standard online TD learning rule was used with a fixed learning rate α, w(t) = w(t - 1) + αδ(t)x(t - 1), so each weight represented an expected future reward value. As in Fiorillo et al., we depict the prediction error δ(t) averaged over many trials, after the task has been learned. The representational asymmetry arises because negative values of δ(t) have been scaled by d = 1/6 prior to summation of the simulated PSTH, although learning proceeds according to unscaled errors. Finally, to account for the small positive responses at the time of the stimulus for p_r = 0 and at the time of the (predicted) reward for p_r = 1 seen in (a), we assumed a small (8%) chance that a predictive stimulus is misidentified. (c) DA response in p_r = 0.5 trials, separated into rewarded (left) and unrewarded (right) trials. (d) TD model of (c). (a,c) Reprinted with permission from [15] © 2003 AAAS. Permission from AAAS is required for all other uses.
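
For readers who want to reproduce panel (b), the following is a minimal sketch of the simulation as we understand it from the caption, not the authors' original code. It assumes 30 timesteps per trial with stimulus onset at t = 5 and reward at t = 25, a single tapped delay-line stimulus (one unit per post-stimulus timestep), the TD error and learning rule given above with α = 0.8, and the scaling d = 1/6 applied to negative errors only when accumulating the simulated PSTH; the 8% misidentification noise is omitted, and all function and variable names are our own.

    import numpy as np

    T = 30                 # timesteps per trial
    T_STIM, T_REW = 5, 25  # stimulus onset and reward delivery times
    ALPHA = 0.8            # learning rate, as in the caption
    D = 1.0 / 6            # scaling of negative errors, applied for display only

    def run(p_r, alpha=ALPHA, n_trials=5000, seed=0):
        """Return the trial-averaged, asymmetrically scaled TD error ('PSTH')."""
        rng = np.random.default_rng(seed)
        w = np.zeros(T)        # one weight per delay-line unit
        psth = np.zeros(T)
        for _ in range(n_trials):
            rewarded = rng.random() < p_r
            x_prev = np.zeros(T)
            for t in range(T):
                # tapped delay line: one unit active per timestep since stimulus onset
                x = np.zeros(T)
                if T_STIM <= t < T_REW:
                    x[t - T_STIM] = 1.0
                r = 1.0 if (rewarded and t == T_REW) else 0.0
                delta = r + w @ x - w @ x_prev     # TD error, as in the caption
                w += alpha * delta * x_prev        # online TD learning rule (unscaled)
                # asymmetric representation: shrink negative errors before averaging
                psth[t] += delta if delta >= 0 else D * delta
                x_prev = x
        return psth / n_trials

    for p_r in (0.0, 0.25, 0.5, 0.75, 1.0):
        avg = run(p_r)
        print(f"p_r = {p_r:.2f}: mean ramp (t = 20..24) = {avg[20:25].mean():+.3f}, "
              f"at reward (t = 25) = {avg[25]:+.3f}")

Sweeping p_r from 0 to 1 should qualitatively reproduce the pattern in (b): a stimulus response that grows with p_r, and an averaged above-baseline deflection before and at the time of reward that is largest at intermediate probabilities.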

Figure 2

Backpropagation of prediction errors explains ramping activity. (a) The TD prediction error across each of six consecutive trials (top to bottom) from the simulation in Figure 1b, with p_r = 0.5. Highlighted in red is the error at the time of the reward in the first of these trials, and its gradual backpropagation towards the time of the stimulus in subsequent trials. Block letters indicate the outcome of each specific trial (R = rewarded; N = not rewarded). The sequence of rewards preceding these trials is given at the top right. (b) The TD error from these six trials, and four more following them, superimposed. The red and green lines illustrate the envelope of the errors in these trials. Summing over these trials results in no above-baseline activity on average (black line), as positive and negative errors each occur on roughly 50% of the trials and so cancel each other. (c) However, when the prediction errors are represented asymmetrically above and below the baseline firing rate (here negative errors were scaled by d = 1/6 to simulate the asymmetric encoding of prediction errors by DA neurons), a ramp of activity emerges when averaging over trials, as illustrated by the black line. All simulation parameters are the same as in Figure 1b,d.
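
The cancellation-versus-ramping point in (b) and (c) can also be checked with back-of-the-envelope arithmetic. The snippet below is our own illustration with assumed values, not taken from the paper: with p_r = 0.5 and a learned prediction of roughly 0.5 at the time of reward, the TD error is about +0.5 on rewarded trials and -0.5 on unrewarded trials; the raw average is zero, but scaling the negative errors by d = 1/6 leaves a positive mean, which is the above-baseline 'activity' in (c).

    p_r, V, d = 0.5, 0.5, 1.0 / 6
    delta_rewarded = 1.0 - V          # TD error at reward time, rewarded trial
    delta_unrewarded = 0.0 - V        # TD error at reward time, unrewarded trial
    mean_raw = p_r * delta_rewarded + (1 - p_r) * delta_unrewarded
    mean_scaled = p_r * delta_rewarded + (1 - p_r) * d * delta_unrewarded
    print(mean_raw)      # 0.0   -> errors cancel, no net deflection
    print(mean_scaled)   # ~0.21 -> net positive deflection above baseline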

Figure 3

Trace conditioning with probabilistic rewards. (a) An illustration of one trial of the delay conditioning task of Fiorillo et al. [15]. A trial consists of a 2-second visual stimulus, the offset of which coincides with the delivery of the juice reward, if such a reward is programmed according to the probability associated with the visual cue. In unrewarded trials the stimulus terminated without a reward. In both cases an inter-trial interval of 9 seconds on average separates trials. (b) An illustration of one trial of the trace conditioning task of Morris et al. [16]. The crucial difference is that there is now a substantial temporal delay between the offset of the stimulus and the onset of the reward (the "trace" period), and no external stimulus indicates the expected time of reward. This confers additional uncertainty, as the precise timing of the predicted reward must be resolved internally, especially in unrewarded trials. In this task, as in [15], one of several visual stimuli (not shown) was presented in each trial, and each stimulus was associated with a probability of reward. Here, too, the monkey was required to perform an instrumental response (pressing the key corresponding to the side on which the stimulus was presented), failure of which terminated the trial without a reward. Trials were separated by variable inter-trial intervals. (c,d) DA firing rate (smoothed) relative to baseline, around the expected time of the reward, in rewarded trials (c) and in unrewarded trials (d). (c,d) Reprinted from [16] © 2004 with permission from Elsevier. The traces imply an overall positive response at the expected time of the reward, but with only a very small ramp, or none at all, preceding it. Similar results were obtained in a classical conditioning task briefly described in [15], which employed a trace conditioning procedure, confirming that the trace period, and not the instrumental nature of the task depicted in (b), was the crucial difference from (a).

Figure 4

Dependence of the ramp on the learning rate. The shape of the ramp, but not the height of its peak, depends on the learning rate. The graph shows simulated activity for the case of p_r = 0.5 near the time of the expected reward, for different learning rates, averaged over both rewarded and unrewarded trials. According to TD learning with persistent asymmetrically coded prediction errors, averaging over activity in rewarded and unrewarded trials results in a ramp up to the time of reward. The height of the peak of the ramp is determined by the ratio of rewarded to unrewarded trials; the breadth of the ramp, however, is determined by the rate of back-propagation of the error signals from the time of the (expected) reward back to the time of the predictive stimulus. A higher learning rate results in a larger fraction of the error propagating back, and thus a higher ramp. With lower learning rates the ramp becomes negligible, although the positive activity (on average) at the time of the reward is still maintained. Note that although the learning rate used in the simulations depicted in Figure 1b,d was 0.8, this should not be taken as the literal synaptic learning rate of the neural substrate, given our schematic representation of the stimulus. With a more realistic representation in which a population of neurons is active at every timestep, a much lower learning rate would produce similar results.
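
To examine this dependence numerically, the simulation sketched after Figure 1 can be rerun with different learning rates. The version below is a self-contained, illustrative re-implementation of that same model (again not the authors' code); the probe timesteps, trial count, and parameter names are our own choices.

    import numpy as np

    def ramp(alpha, p_r=0.5, T=30, t_stim=5, t_rew=25, d=1.0 / 6,
             n_trials=5000, seed=0):
        """Trial-averaged, asymmetrically scaled TD error for one learning rate."""
        rng = np.random.default_rng(seed)
        w, psth = np.zeros(T), np.zeros(T)
        for _ in range(n_trials):
            rewarded = rng.random() < p_r
            x_prev = np.zeros(T)
            for t in range(T):
                x = np.zeros(T)
                if t_stim <= t < t_rew:
                    x[t - t_stim] = 1.0
                r = 1.0 if (rewarded and t == t_rew) else 0.0
                delta = r + w @ x - w @ x_prev   # TD error
                w += alpha * delta * x_prev      # learning uses unscaled errors
                psth[t] += delta if delta >= 0 else d * delta
                x_prev = x
        return psth / n_trials

    for alpha in (0.1, 0.3, 0.8):
        a = ramp(alpha)
        # breadth of the ramp: activity well before the reward vs. just before it
        print(f"alpha = {alpha}: 10 steps before reward = {a[15]:+.3f}, "
              f"1 step before = {a[24]:+.3f}, at reward = {a[25]:+.3f}")

With a small alpha the averaged error stays concentrated at the reward time, whereas larger values let it propagate back toward the stimulus, broadening and raising the ramp, in line with the caption's description.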

References

    1. Ljungberg T, Apicella P, Schultz W. Responses of monkey dopamine neurons during learning of behavioral reactions. J Neurophysiol. 1992;67:145–163.
    2. Schultz W. Predictive reward signal of dopamine neurons. J Neurophysiol. 1998;80:1–27. http://jn.physiology.org/cgi/content/full/80/1/1
    3. O'Doherty J, Dayan P, Friston K, Critchley H, Dolan R. Temporal difference models and reward-related learning in the human brain. Neuron. 2003;38:329–337. doi: 10.1016/S0896-6273(03)00169-7.
    4. Seymour B, O'Doherty J, Dayan P, Koltzenburg M, Jones A, Dolan R, Friston K, Frackowiak R. Temporal difference models describe higher order learning in humans. Nature. 2004;429:664–667. doi: 10.1038/nature02581.
    5. Montague PR, Hyman SE, Cohen JD. Computational roles for dopamine in behavioural control. Nature. 2004;431:760–767. doi: 10.1038/nature03015.
