Simple statistical gradient-following algorithms for connectionist reinforcement learning
References
Barto, A.G. (1985). Learning by statistical cooperation of self-interested neuron-like computing elements. Human Neurobiology, 4, 229–256.
Barto, A.G. & Anandan, P. (1985). Pattern recognizing stochastic learning automata. IEEE Transactions on Systems, Man, and Cybernetics, 15, 360–374.
Barto, A.G. & Anderson, C.W. (1985). Structural learning in connectionist systems. Proceedings of the Seventh Annual Conference of the Cognitive Science Society (pp. 43–53). Irvine, CA.
Barto, A.G., Sutton, R.S., & Anderson, C.W. (1983). Neuronlike elements that can solve difficult learning control problems. IEEE Transactions on Systems, Man, and Cybernetics, 13, 835–846.
Barto, A.G. & Jordan, M.I. (1987). Gradient following without back-propagation in layered networks. Proceedings of the First Annual International Conference on Neural Networks, Vol. II (pp. 629–636). San Diego, CA.
Barto, A.G., Sutton, R.S., & Watkins, C.J.C.H. (1990). Learning and sequential decision making. In M. Gabriel & J.W. Moore (Eds.), Learning and computational neuroscience: Foundations of adaptive networks. Cambridge, MA: MIT Press.
Dayan, P. (1990). Reinforcement comparison. In D.S. Touretzky, J.L. Elman, T.J. Sejnowski, & G.E. Hinton (Eds.), Proceedings of the 1990 Connectionist Models Summer School (pp. 45–51). San Mateo, CA: Morgan Kaufmann.
Goodwin, G.C. & Sin, K.S. (1984). Adaptive filtering prediction and control. Englewood Cliffs, NJ: Prentice-Hall.
Gullapalli, V. (1990). A stochastic reinforcement learning algorithm for learning real-valued functions. Neural Networks, 3, 671–692.
Hinton, G.E. & Sejnowski, T.J. (1986). Learning and relearning in Boltzmann machines. In D.E. Rumelhart & J.L. McClelland (Eds.), Parallel distributed processing: Explorations in the microstructure of cognition. Vol. 1: Foundations. Cambridge, MA: MIT Press.
Jordan, M.I. & Rumelhart, D.E. (1990). Forward models: Supervised learning with a distal teacher. (Occasional Paper 40). Cambridge, MA: Massachusetts Institute of Technology, Center for Cognitive Science.
leCun, Y. (1985). Une procédure d'apprentissage pour réseau à seuil asymétrique [A learning procedure for an asymmetric threshold network]. Proceedings of Cognitiva 85, 599–604.
Munro, P. (1987). A dual back-propagation scheme for scalar reward learning. Proceedings of the Ninth Annual Conference of the Cognitive Science Society (pp. 165–176). Seattle, WA.
Narendra, K.S. & Thathachar, M.A.L. (1989). Learning automata: An introduction. Englewood Cliffs, NJ: Prentice Hall.
Narendra, K.S. & Wheeler, R.M., Jr. (1983). An N-player sequential stochastic game with identical payoffs. IEEE Transactions on Systems, Man, and Cybernetics, 13, 1154–1158.
Nilsson, N.J. (1980). Principles of artificial intelligence. Palo Alto, CA: Tioga.
Parker, D.B. (1985). Learning-logic. (Technical Report TR-47). Cambridge, MA: Massachusetts Institute of Technology, Center for Computational Research in Economics and Management Science.
Rohatgi, V.K. (1976). An introduction to probability theory and mathematical statistics. New York: Wiley.
Rumelhart, D.E., Hinton, G.E., & Williams, R.J. (1986). Learning internal representations by error propagation. In D.E. Rumelhart & J.L. McClelland (Eds.), Parallel distributed processing: Explorations in the microstructure of cognition. Vol. 1: Foundations. Cambridge, MA: MIT Press.
Schmidhuber, J.H. & Huber, R. (1990). Learning to generate focus trajectories for attentive vision. (Technical Report FKI-128-90). Munich: Technische Universität München, Institut für Informatik.
Sutton, R.S. (1984). Temporal credit assignment in reinforcement learning. Ph.D. dissertation, Department of Computer and Information Science, University of Massachusetts, Amherst, MA.
Sutton, R.S. (1988). Learning to predict by the methods of temporal differences. Machine Learning, 3, 9–44.
Thathachar, M.A.L. & Sastry, P.S. (1985). A new approach to the design of reinforcement schemes for learning automata. IEEE Transactions on Systems, Man, and Cybernetics, 15, 168–175.
Watkins, C.J.C.H. (1989). Learning from delayed rewards. Ph.D. dissertation, Cambridge University, Cambridge, England.
Werbos, P.J. (1974). Beyond regression: New tools for prediction and analysis in the behavioral sciences. Ph.D. dissertation, Harvard University, Cambridge, MA.
Wheeler, R.M., Jr. & Narendra, K.S. (1986). Decentralized learning in finite Markov chains. IEEE Transactions on Automatic Control, 31, 519–526.
Williams, R.J. (1986). Reinforcement learning in connectionist networks: A mathematical analysis. (Technical Report 8605). San Diego, CA: University of California, Institute for Cognitive Science.
Williams, R.J. (1987a). Reinforcement-learning connectionist systems. (Technical Report NU-CCS-87-3). Boston, MA: Northeastern University, College of Computer Science.
Williams, R.J. (1987b). A class of gradient-estimating algorithms for reinforcement learning in neural networks. Proceedings of the First Annual International Conference on Neural Networks, Vol. II (pp. 601–608). San Diego, CA.
Williams, R.J. (1988a). On the use of backpropagation in associative reinforcement learning. Proceedings of the Second Annual International Conference on Neural Networks, Vol. I (pp. 263–270). San Diego, CA.
Williams, R.J. (1988b). Toward a theory of reinforcement-learning connectionist systems. (Technical Report NU-CCS-88-3). Boston, MA: Northeastern University, College of Computer Science.
Williams, R.J. & Peng, J. (1991). Function optimization using connectionist reinforcement learning algorithms. Connection Science, 3, 241–268.