Policy Gradient Methods for Reinforcement Learning with Function Approximation
Related papers
Policy gradient methods for reinforcement learning with function approximation
Advances in Neural Information Processing Systems
Function approximation is essential to reinforcement learning, but the standard approach of approximating a value function and determining a policy from it has so far proven theoretically intractable. In this paper we explore an alternative approach in which the policy is explicitly ...
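As a rough illustration of the approach described above (not code from the paper), the sketch below performs the kind of update a directly parameterised policy receives: sample an action from a softmax policy, observe a reward, and move the policy parameters along the gradient of the log-probability scaled by a baseline-corrected reward. The two-armed bandit, step sizes, and running-average baseline are all invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy 2-armed bandit with stochastic rewards; arm 1 is better on average.
def pull(arm):
    return rng.normal(loc=[0.0, 1.0][arm], scale=1.0)

def softmax(z):
    z = z - z.max()
    p = np.exp(z)
    return p / p.sum()

theta = np.zeros(2)          # policy parameters: one logit per arm
alpha = 0.05                 # policy step size
baseline = 0.0               # running-average baseline to reduce variance

for t in range(2000):
    probs = softmax(theta)
    a = rng.choice(2, p=probs)
    r = pull(a)
    baseline += 0.01 * (r - baseline)
    # grad of log pi(a | theta) for a softmax policy: one_hot(a) - probs
    grad_log_pi = -probs
    grad_log_pi[a] += 1.0
    theta += alpha * (r - baseline) * grad_log_pi

print("final policy:", softmax(theta))   # should strongly favour arm 1
```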
Neural Value Function Approximation in Continuous State Reinforcement Learning Problems
2018
Recent development of Deep Reinforcement Learning (DRL) has demonstrated the superior performance of neural networks in solving challenging problems with large or continuous state spaces. In this work, we focus on the problem of minimising the expected one-step Temporal Difference (TD) error with a neural function approximator for a continuous state space, from a smooth optimisation perspective. An approximate Newton's algorithm is proposed. The effectiveness of the algorithm is demonstrated on both finite and continuous state space benchmarks. We show that, in order to benefit from the second-order approximate Newton's algorithm, the gradient of the TD target needs to be considered during training.
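The final point above, that the gradient of the TD target matters, can be illustrated with a residual-gradient-style update, which differentiates the squared one-step TD error with respect to both the current state's value and the successor state's value. The sketch below is not the paper's approximate Newton's algorithm; the random-walk chain, linear one-hot value function, and step size are assumptions chosen to keep the example minimal and runnable.

```python
import numpy as np

rng = np.random.default_rng(1)
n_states, gamma = 5, 0.9
w = np.zeros(n_states)                       # linear value function over one-hot features

def step(s):
    """Random-walk chain: move left/right; reward 1 only on reaching the right end."""
    s2 = min(max(s + rng.choice([-1, 1]), 0), n_states - 1)
    r = 1.0 if s2 == n_states - 1 else 0.0
    done = s2 in (0, n_states - 1)
    return s2, r, done

alpha = 0.05
for episode in range(500):
    s, done = n_states // 2, False
    while not done:
        s2, r, done = step(s)
        target = r + (0.0 if done else gamma * w[s2])
        delta = target - w[s]                # one-step TD error
        # Full (residual) gradient of 0.5 * delta^2 also differentiates the target,
        # adding a gamma-scaled term at the successor state s2.
        grad = np.zeros(n_states)
        grad[s] -= delta
        if not done:
            grad[s2] += gamma * delta
        w -= alpha * grad
        s = s2

print("learned values:", np.round(w, 2))
```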
Compatible Value Gradients for Reinforcement Learning of Continuous Deep Policies
This paper proposes GProp, a deep reinforcement learning algorithm for continuous policies with compatible function approximation. The algorithm is based on two innovations. Firstly, we present a temporal-difference based method for learning the gradient of the value function. Secondly, we present the deviator-actor-critic (DAC) model, which comprises three neural networks that estimate the value function, estimate its gradient, and determine the actor's policy, respectively. We evaluate GProp on two challenging tasks: a contextual bandit problem constructed from nonparametric regression datasets, designed to probe the ability of reinforcement learning algorithms to accurately estimate gradients; and the octopus arm, a challenging reinforcement learning benchmark. GProp is competitive with fully supervised methods on the bandit task and achieves the best performance to date on the octopus arm.
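Purely as a structural sketch of the deviator-actor-critic arrangement described above (three networks: a value estimate, an estimate of its gradient with respect to the action, and the policy), and not the paper's actual training procedure: the network sizes, the toy batch, and the simple actor loss below are assumptions, and the TD-based learning of the value gradient and the compatibility conditions are omitted.

```python
import torch
import torch.nn as nn

def mlp(in_dim, out_dim, hidden=64):
    return nn.Sequential(nn.Linear(in_dim, hidden), nn.Tanh(),
                         nn.Linear(hidden, out_dim))

state_dim, action_dim = 8, 2

actor    = mlp(state_dim, action_dim)                # deterministic policy mu(s)
critic   = mlp(state_dim + action_dim, 1)            # value estimate Q(s, a)
deviator = mlp(state_dim + action_dim, action_dim)   # estimate of dQ/da

s = torch.randn(16, state_dim)                        # toy batch of states
a = actor(s)
q = critic(torch.cat([s, a], dim=-1))                 # would be trained by TD (omitted)
g = deviator(torch.cat([s, a], dim=-1))               # would be trained to match dQ/da (omitted)

# Actor update direction: follow the estimated value gradient at the chosen action,
# rather than backpropagating through the critic itself.
actor_loss = -(a * g.detach()).sum(dim=-1).mean()
actor_loss.backward()
```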
Localizing policy gradient estimates to action transitions
2000
Function Approximation (FA) representations of the state-action value function Q have been proposed in order to reduce variance in performance-gradient estimates, and thereby improve the performance of Policy Gradient (PG) reinforcement learning in large continuous domains (e.g., the PIFA algorithm of Sutton et al. (in press)).
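In the spirit of the idea above, plugging a learned approximation of Q into the policy-gradient estimate rather than sampled returns, here is a minimal actor-critic sketch. It is not the PIFA algorithm or the method of this paper; the toy MDP, step sizes, and SARSA-style critic update are assumptions chosen only to make the example runnable.

```python
import numpy as np

rng = np.random.default_rng(2)
n_states, n_actions, gamma = 4, 2, 0.95

# Toy MDP: action 1 moves right (reward 1 at the last state), action 0 resets to state 0.
def step(s, a):
    s2 = min(s + 1, n_states - 1) if a == 1 else 0
    r = 1.0 if s2 == n_states - 1 else 0.0
    return s2, r

def softmax(z):
    z = z - z.max()
    p = np.exp(z)
    return p / p.sum()

theta = np.zeros((n_states, n_actions))   # policy parameters (logits per state)
Q = np.zeros((n_states, n_actions))       # approximated action values
alpha_pi, alpha_q = 0.05, 0.1

s = 0
for t in range(20000):
    probs = softmax(theta[s])
    a = rng.choice(n_actions, p=probs)
    s2, r = step(s, a)
    # Critic: one-step SARSA-style update of the Q approximation.
    a2 = rng.choice(n_actions, p=softmax(theta[s2]))
    Q[s, a] += alpha_q * (r + gamma * Q[s2, a2] - Q[s, a])
    # Actor: policy-gradient step using Q(s, a) in place of a sampled return.
    grad_log_pi = -probs
    grad_log_pi[a] += 1.0
    theta[s] += alpha_pi * Q[s, a] * grad_log_pi
    s = s2

print("greedy action per state:", theta.argmax(axis=1))  # expect action 1 in each state
```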