Striatal and Tegmental Neurons Code Critical Signals for Temporal-Difference Learning of State Value in Domestic Chicks
Related papers
Cognitive Brain Research, 2004
To reveal the functional roles of the striatum, we examined the effects of excitotoxic lesions to the bilateral medial striatum (mSt) and nucleus accumbens (Ac) in a food reinforcement color discrimination operant task. With a food reward as reinforcement, 1-week-old domestic chicks were trained to peck selectively at red and yellow beads (S+) and not to peck at a blue bead (S−). Those chicks then received either lesions or sham operations and were tested in extinction training sessions, during which yellow turned out to be nonrewarding (S−), whereas red and blue remained unchanged. To further examine the effects on postoperant noninstrumental aspects of behavior, we also measured the "waiting time", during which chicks stayed at the empty feeder after pecking at yellow. Although the lesioned chicks showed significantly higher error rates in the nonrewarding yellow trials, their postoperant waiting time gradually decreased similarly to the sham controls. Furthermore, the lesioned chicks waited significantly longer than the controls, even from the first extinction block. In the blue trials, both lesioned and sham chicks consistently refrained from pecking, indicating that the delayed extinction was not due to a general disinhibition of pecking. Similarly, no effects were found in the novel training sessions, suggesting that the lesions had selective effects on the extinction of a learned operant. These results suggest that a neural representation of memory-based reward anticipation in the mSt/Ac could contribute to the anticipation error required for extinction. © 2004 Elsevier B.V. All rights reserved.
Behavioural Brain Research, 2002
Effects of bilateral chemical lesions of the ventro-medial basal ganglia (lobus parolfactorius, LPO) were examined in 3–9-day-old domestic chicks. In experiment 1, chicks were trained to peck at a blue bead that was associated with drops of water as a reward. Addition of passive avoidance training using a bitter yellow bead resulted in highly selective pecking between blue and yellow. LPO lesion (given 3–5 h after training) did not impair the selectivity when chicks were tested 24 h afterwards, while the novel reinforcement using a red bead was severely impaired. In experiment 2, chicks were trained in a GO/NO-GO color discrimination task with food reward. Trained chicks received bilateral LPO lesions, and they were tested 48 h afterwards for the number of pecks and the latency of the first peck in each trial. The LPO lesion did not impair the recall of memorized color discrimination in tests, while the chicks were severely deficient in post-operative novel training. These results confirm that: (1) bilateral LPO ablation does not interfere with selective pecking based on the memorized color cues; but (2) it impairs reinforcement in novel training. The LPO is thus supposed to be involved in the acquisition, rather than the execution, of memorized behaviors.
Behavioural brain research, 2014
To investigate the role of social contexts in controlling the neuronal representation of food reward, we recorded single neuron activity in the medial striatum/nucleus accumbens of domestic chicks and examined whether activities differed between two blocks with different contexts. Chicks were trained in an operant task to associate light-emitting diode color cues with three trial types that differed in the type of food reward: no reward (S-), a small reward/short-delay option (SS), and a large reward/long-delay alternative (LL). Amount and duration of reward were set such that both SS and LL were chosen roughly equally. Neurons showing distinct cue-period activity in rewarding trials (SS and LL) were identified during an isolation block, and activity patterns were compared with those recorded from the same neuron during a subsequent pseudo-competition block in which another chick was allowed to forage in the same area, but was separated by a transparent window. In some neurons, c...
Anticipatory reward signals in ventral striatal neurons of behaving rats
Eur. J. Neurosci, 2008
It has been proposed that the striatum plays a crucial role in learning to select appropriate actions, optimizing rewards according to the principles of 'Actor-Critic' models of trial-and-error learning. The ventral striatum (VS), as Critic, would employ a temporal difference (TD) learning algorithm to predict rewards and drive dopaminergic neurons. This study examined this model's adequacy for VS responses to multiple rewards in rats. The respective arms of a plus-maze provided rewards of varying magnitudes; multiple rewards were provided at 1-s intervals while the rat stood still. Neurons discharged phasically prior to each reward, during both initial approach and immobile waiting, demonstrating that this signal is predictive and not simply motor-related. In different neurons, responses could be greater for early, middle or late droplets in the sequence. Strikingly, this activity often reappeared after the final reward, as if in anticipation of yet another. In contrast, previous TD learning models show decremental reward-prediction profiles during reward consumption due to a temporal-order signal introduced to reproduce accurate timing in dopaminergic reward-prediction error signals. To resolve this inconsistency in a biologically plausible manner, we adapted the TD learning model such that input information is nonhomogeneously distributed among different neurons. By suppressing reward temporal-order signals and varying richness of spatial and visual input information, the model reproduced the experimental data. This validates the feasibility of a TD learning architecture where different groups of neurons participate in solving the task based on varied input information.
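The Critic computation referred to above can be sketched as a minimal tabular TD(0) learner that predicts upcoming reward and emits a prediction-error signal δ, the quantity proposed to drive dopaminergic neurons. This is a generic textbook sketch, not the authors' model: the state count, parameter values, and reward placement below are illustrative assumptions.

```python
import numpy as np

# Minimal tabular TD(0) 'Critic' sketch (illustrative assumptions only):
# states 0..n-1 are successive time steps in a trial; a unit reward
# arrives at the final step. delta is the reward-prediction error.

def run_critic(n_states=5, n_trials=300, alpha=0.1, gamma=0.9):
    V = np.zeros(n_states)          # state-value estimates
    deltas = []                     # prediction errors, trial by trial
    for _ in range(n_trials):
        trial_deltas = []
        for t in range(n_states):
            r = 1.0 if t == n_states - 1 else 0.0   # reward at trial end
            v_next = V[t + 1] if t + 1 < n_states else 0.0
            delta = r + gamma * v_next - V[t]        # TD prediction error
            V[t] += alpha * delta                    # value update
            trial_deltas.append(delta)
        deltas.append(trial_deltas)
    return V, np.array(deltas)

V, deltas = run_critic()
# Early in training the error is large at reward delivery; after learning
# the reward is predicted and the error there shrinks toward zero.
print(deltas[0][-1], deltas[-1][-1])
```

In this stripped-down form the value profile ramps up toward the reward, which is the decremental-prediction behavior the abstract contrasts with the recorded VS activity.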
Behavioural Brain Research, 2006
The effects of bilateral chemical lesions of the ventral striatum (nucleus accumbens and the surrounding areas in the medial striatum) and arcopallium (major descending area of the avian telencephalon) were examined in 1–2-week-old domestic chicks. Using a Y-maze, we analyzed the lesion effects on the choices that subject chicks made in two tasks with identical economical consequences, i.e., a small-and-close food reward vs. a large-and-distant food reward. In task 1, red, yellow, and green beads were associated with a feeder placed at various distances from the chicks; chicks thus anticipated the spatial proximity of food by the bead's color, whereas the quantity of the food was fixed. In task 2, red and yellow flags on the feeders were associated with various amounts of food; the chicks thus anticipated the quantity of food by the flag's color, whereas the proximity of the reward could be directly visually determined. In task 1, bilateral lesions of the ventral striatum (but not the arcopallium) enhanced the impulsiveness of the chicks' choices, suggesting that choices based on the anticipated proximity were selectively changed. In task 2, similar lesions of the ventral striatum did not change choices. In both experiments, motor functions of the chicks remained unchanged, suggesting that the lesions did not affect the foraging efficiency, i.e., objective values of food. Neural correlates of anticipated food rewards in the ventral striatum (but not those in the arcopallium) could allow chicks to invest an appropriate amount of work-cost in approaching distant food resources.
2005
Behavioral conditioning of cue-reward pairing results in a shift of midbrain dopamine (DA) cell activity from responding to the reward to responding to the predictive cue. However, the precise time course and mechanism underlying this shift remain unclear. Here, we report a combined single-unit recording and temporal difference (TD) modeling approach to this question. The data from recordings in conscious rats showed that DA cells retain responses to predicted reward after responses to conditioned cues have developed, at least early in training. This contrasts with previous TD models that predict a gradual stepwise shift in latency with responses to rewards lost before responses develop to the conditioned cue. By exploring the TD parameter space, we demonstrate that the persistent reward responses of DA cells during conditioning are only accurately replicated by a TD model with long-lasting eligibility traces (nonzero values for the trace parameter λ) and a low learning rate (α). These physiological constraints for TD parameters suggest that eligibility traces and low per-trial rates of plastic modification may be essential features of neural circuits for reward learning in the brain. Such properties enable rapid but stable initiation of learning when the number of stimulus-reward pairings is limited, conferring significant adaptive advantages in real-world environments.
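The interplay of the trace parameter λ and learning rate α can be illustrated with a small tabular TD(λ) sketch. This is a generic sketch under assumed parameters, not the serial-compound model used in the study: a long-lasting trace lets a single reward update the values of all preceding time steps, so learning progresses even when α is low and trials are few.

```python
import numpy as np

# Illustrative tabular TD(lambda) over a cue-to-reward interval
# (all parameter values are assumptions for the sketch): states
# 0..n-1 are successive time steps after the cue; reward arrives
# at the final step.

def td_lambda(n_states=10, n_trials=200, alpha=0.05, lam=0.9, gamma=0.95):
    V = np.zeros(n_states)
    for _ in range(n_trials):
        e = np.zeros(n_states)                   # eligibility traces
        for t in range(n_states):
            r = 1.0 if t == n_states - 1 else 0.0
            v_next = V[t + 1] if t + 1 < n_states else 0.0
            delta = r + gamma * v_next - V[t]    # TD error
            e[t] += 1.0                          # accumulating trace
            V += alpha * delta * e               # credit all traced states
            e *= gamma * lam                     # trace decay
    return V

V = td_lambda()
print(V)
```

With λ near 1, the reward-driven error at the last step propagates credit back along the whole interval in a single trial; with λ = 0 the same low α would need many more trials to carry value back to the cue.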
1991
Spontaneous bursting (5 or more spikes of 200–450 µV amplitude at 400 Hz) occurs in many areas of the chick forebrain. Day-old chicks trained on a one-trial passive avoidance task show a bilateral increase of up to 350% in bursting following training in one of these areas: the intermediate medial hyperstriatum ventrale, or IMHV (Mason & Rose, 1987; 1988). An investigation was carried out into the time course and lateralization of this change in bursting activity following the training of day-old chicks on a passive avoidance task. Chicks were trained either to avoid a bead coated with the bitter-tasting substance methyl anthranilate (M-birds) or to peck a water-coated bead (W-birds). Bursting was recorded sequentially from the IMHV of both hemispheres at 8 time points over the period 1 to 9 h post-test. The results indicate that there are significant differences in bursting activity recorded from M-birds, compared to W-birds, only during the period 3–7 h post-test. Betw...
Neural control of dopamine neurotransmission: implications for reinforcement learning
European Journal of Neuroscience, 2012
In the past few decades there has been remarkable convergence of machine learning with neurobiological understanding of reinforcement learning mechanisms, exemplified by temporal difference (TD) learning models. The anatomy of the basal ganglia provides a number of potential substrates for instantiation of the TD mechanism. In contrast to the traditional concept of direct and indirect pathway outputs from the striatum, we emphasize that projection neurons of the striatum are branched and individual striatofugal neurons innervate both the globus pallidus externa and the globus pallidus interna/substantia nigra (GPi/SNr). This suggests that the GPi/SNr has the necessary inputs to operate as the source of a TD signal. We also discuss the mechanism for the timing processes necessary for learning in the TD framework. The TD framework has been particularly successful in analysing electrophysiological recordings from dopamine (DA) neurons during learning, in terms of reward prediction error. However, present understanding of the neural control of DA release is limited, and hence the neural mechanisms involved are incompletely understood. Inhibition is very conspicuously present among the inputs to the DA neurons, with inhibitory synapses accounting for the majority of synapses on DA neurons. Furthermore, synchronous firing of the DA neuron population requires disinhibition and excitation to occur together in a coordinated manner. We conclude that the inhibitory circuits impinging directly or indirectly on the DA neurons play a central role in the control of DA neuron activity and further investigation of these circuits may provide important insight into the biological mechanisms of reinforcement learning.