The Gambler's Problem and Beyond

Policy Gradient vs. Value Function Approximation: A Reinforcement Learning Shootout

2006

This paper compares the performance of policy gradient techniques with traditional value function approximation methods for reinforcement learning in a difficult problem domain. We introduce the Spacewar task, a continuous, stochastic, partially-observable, competitive multi-agent environment. We demonstrate that a neural-network-based implementation of an online policy gradient algorithm (OLGARB (Weaver & Tao, 2001)) is able to perform well in this task and is competitive with the better-established value function approximation algorithms (Sarsa(λ) and Q-learning (Sutton & Barto, 1998)).
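
Not from the paper, but a minimal Python sketch (with illustrative sizes and step sizes) of the two update styles being compared: a tabular Q-learning/value-function update versus a REINFORCE-style softmax policy-gradient update. OLGARB itself additionally uses eligibility traces and an online baseline, which this sketch omits.

    import numpy as np

    n_states, n_actions = 16, 4
    alpha, gamma = 0.1, 0.99

    # Value-function approach: tabular Q-learning update for one transition.
    Q = np.zeros((n_states, n_actions))
    def q_learning_step(s, a, r, s_next):
        td_target = r + gamma * Q[s_next].max()
        Q[s, a] += alpha * (td_target - Q[s, a])

    # Policy-gradient approach: softmax policy with a REINFORCE-style per-step
    # update, the (return - baseline) term reduced to a single reward for brevity.
    theta = np.zeros((n_states, n_actions))
    def softmax_policy(s):
        p = np.exp(theta[s] - theta[s].max())
        return p / p.sum()

    def policy_gradient_step(s, a, r, baseline=0.0):
        grad_log_pi = -softmax_policy(s)   # d log pi(a|s) / d theta[s, :]
        grad_log_pi[a] += 1.0
        theta[s] += alpha * (r - baseline) * grad_log_pi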

An analysis of reinforcement learning with function approximation

Proceedings of the 25th international conference on Machine learning - ICML '08, 2008

We address the problem of computing the optimal Q-function in Markov decision problems with infinite state-space. We analyze the convergence properties of several variations of Q-learning when combined with function approximation, extending the analysis of TD-learning in (Tsitsiklis & Van Roy, 1996a) to stochastic control settings. We identify conditions under which such approximate methods converge with probability 1. We conclude with a brief discussion on the general applicability of our results and compare them with several related works.
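
For reference (notation assumed here, not quoted from the paper), the kind of update these convergence analyses concern is Q-learning combined with a parameterised approximation such as a linear one, $Q_\theta(s,a) = \theta^\top \phi(s,a)$:

    \[
    \theta_{t+1} \;=\; \theta_t \;+\; \alpha_t \Big( r_t + \gamma \max_{a'} \theta_t^\top \phi(s_{t+1}, a') \;-\; \theta_t^\top \phi(s_t, a_t) \Big)\, \phi(s_t, a_t),
    \]

whose fixed points and convergence conditions are what results of this type characterise.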

Convergent temporal-difference learning with arbitrary smooth function …

Advances in Neural …, 2009

We introduce the first temporal-difference learning algorithms that converge with smooth value function approximators, such as neural networks. Conventional temporal-difference (TD) methods, such as TD(λ), Q-learning and Sarsa, have been used successfully with function approximation in many applications. However, it is well known that off-policy sampling, as well as nonlinear function approximation, can cause these algorithms to become unstable (i.e., the parameters of the approximator may diverge). Recent work solved the problem of off-policy learning with linear TD algorithms by introducing a new objective function, related to the Bellman error, and algorithms that perform stochastic gradient descent on this function. These methods can be viewed as natural generalizations of previous TD methods, as they converge to the same limit points when used with linear function approximation methods. We generalize this work to nonlinear function approximation. We present a Bellman error objective function and two gradient-descent TD algorithms that optimize it. We prove the asymptotic almost-sure convergence of both algorithms, for any finite Markov decision process and any smooth value function approximator, to a locally optimal solution. The algorithms are incremental and the computational complexity per time step scales linearly with the number of parameters of the approximator. Empirical results obtained in the game of Go demonstrate the algorithms' effectiveness.
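
The nonlinear algorithms in this paper add a correction term involving second-order information that is not reproduced here; as a hedged illustration only, the Python sketch below shows the linear gradient-TD (TDC-style) update that such methods generalise, with the feature vectors phi, step sizes alpha/beta and the auxiliary weight vector w all assumed for the example.

    import numpy as np

    gamma, alpha, beta = 0.99, 0.01, 0.05

    def tdc_step(theta, w, phi, r, phi_next):
        """One linear TDC-style update for the transition (phi, r, phi_next)."""
        delta = r + gamma * theta @ phi_next - theta @ phi          # TD error
        theta = theta + alpha * (delta * phi - gamma * phi_next * (phi @ w))
        w = w + beta * (delta - phi @ w) * phi
        return theta, w

    # Usage with 8 illustrative features:
    theta, w = np.zeros(8), np.zeros(8)
    phi, phi_next = np.ones(8) / 8, np.ones(8) / 8
    theta, w = tdc_step(theta, w, phi, r=1.0, phi_next=phi_next)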

No-Regret Reinforcement Learning with Value Function Approximation: a Kernel Embedding Approach

ArXiv, 2020

We consider the regret minimisation problem in reinforcement learning (RL) in the episodic setting. In many real-world RL environments, the state and action spaces are continuous or very large. Existing approaches establish regret guarantees by either a low-dimensional representation of the probability transition model or a functional approximation of Q functions. However, the understanding of function approximation schemes for state value functions remains largely missing. In this paper, we propose an online model-based RL algorithm, namely CME-RL, that learns representations of transition distributions as embeddings in a reproducing kernel Hilbert space while carefully balancing the exploitation-exploration tradeoff. We demonstrate the efficiency of our algorithm by proving a frequentist (worst-case) regret bound of order $\tilde{O}\big(H\gamma_N\sqrt{N}\big)$ (here $\tilde{O}(\cdot)$ hides only absolute constants and poly-logarithmic factors), where $H$ is the epis...
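
As a point of reference, the episodic regret that such bounds control is usually written as below; the notation ($N$ episodes, starting states $s_1^n$, policies $\pi_n$) is assumed here rather than quoted from the paper.

    \[
    \mathrm{Regret}(N) \;=\; \sum_{n=1}^{N} \Big( V^{*}_{1}(s^{n}_{1}) \;-\; V^{\pi_n}_{1}(s^{n}_{1}) \Big),
    \]

so a bound of order $\tilde{O}(H\gamma_N\sqrt{N})$ is sublinear in the number of episodes whenever the kernel-dependent quantity $\gamma_N$ grows sufficiently slowly.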

Coarse Q-Learning: Addressing the convergence problem when quantizing continuous state variables

Value-based approaches to reinforcement learning (RL) maintain a value function that measures the long-term utility of a state or state-action pair. A long-standing issue in RL is how to create a finite representation in a continuous, and therefore infinite, state environment. The common approach is to use function approximators such as tile coding or memory- and instance-based methods. These provide some balance between generalisation, resolution, and storage, but converge slowly in multidimensional state environments. Another approach, quantizing the state into lookup tables, has been commonly regarded as highly problematic due to large memory requirements and poor generalisation. In particular, attempting to reduce memory requirements and increase generalisation by using coarser quantization forms a non-Markovian system that does not converge. This paper investigates the problems with quantized lookup tables and presents an extension to the Q-Learning algorithm, referred to as Coarse Q-Learning (CQL), which resolves these issues. The presented algorithm is shown to drastically reduce memory requirements and increase generalisation by simulating the Markov property. In particular, with this algorithm the size of the input space is determined by the granularity required by the policy being learnt, rather than by the inadequacies of the learning algorithm or the nature of the state-reward dynamics of the environment. Importantly, the presented method solves the problem posed by the curse of dimensionality.
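
For contrast with the paper's CQL extension, the following Python sketch (bounds, bin counts and step sizes are illustrative) shows the plain quantized-lookup-table Q-learning baseline whose convergence problems the abstract describes; CQL itself adds machinery to recover the Markov property that is not reproduced here.

    import numpy as np

    bins_per_dim = 10
    low, high = np.array([-1.0, -1.0]), np.array([1.0, 1.0])

    def quantize(state):
        """Map a continuous 2-D state to a single lookup-table row."""
        ratios = (np.clip(state, low, high) - low) / (high - low)
        idx = np.minimum((ratios * bins_per_dim).astype(int), bins_per_dim - 1)
        return idx[0] * bins_per_dim + idx[1]

    n_actions, alpha, gamma = 3, 0.1, 0.99
    Q = np.zeros((bins_per_dim ** 2, n_actions))

    def q_update(state, action, reward, next_state):
        s, s_next = quantize(state), quantize(next_state)
        Q[s, action] += alpha * (reward + gamma * Q[s_next].max() - Q[s, action])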

QLBS: Q-Learner in the Black-Scholes(-Merton) Worlds

The Journal of Derivatives, 2020

This paper presents a discrete-time option pricing model that is rooted in Reinforcement Learning (RL), and more specifically in the famous Q-Learning method of RL. We construct a risk-adjusted Markov Decision Process for a discrete-time version of the classical Black-Scholes-Merton (BSM) model, where the option price is an optimal Q-function, while the optimal hedge is a second argument of this optimal Q-function, so that both the price and the hedge are parts of the same formula. Pricing is done by learning to dynamically optimize risk-adjusted returns for an option-replicating portfolio, as in Markowitz portfolio theory. Once created in a parametric setting, the model is able, using Q-Learning and related methods, to go model-free and learn to price and hedge an option directly from data, without an explicit model of the world. This suggests that RL may provide efficient data-driven and model-free methods for optimal pricing and hedging of options once we depart from the academic continuous-time limit, and, conversely, that option pricing methods developed in Mathematical Finance may be viewed as special cases of model-based Reinforcement Learning. Further, due to the simplicity and tractability of our model, which only needs basic linear algebra (plus Monte Carlo simulation if we work with synthetic data), and its close relation to the original BSM model, we suggest that it could be used for benchmarking different RL algorithms for financial trading applications.
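
A compact way to write the relation the abstract describes, with the symbols ($X_t$ for the state, $a_t$ for the hedge, $C_t$ for the option price) and the sign convention assumed here rather than quoted from the paper: the optimal hedge is the maximising second argument of the optimal Q-function, and the price is read off its optimal value,

    \[
    a^{*}_{t}(X_t) \;=\; \arg\max_{a}\, Q^{*}_{t}(X_t, a),
    \qquad
    C_t(X_t) \;=\; -\,Q^{*}_{t}\big(X_t, a^{*}_{t}(X_t)\big),
    \]

with the Q-function itself built from Markowitz-style risk-adjusted returns of the replicating portfolio.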

Neural Value Function Approximation in Continuous State Reinforcement Learning Problems

2018

Recent development of Deep Reinforcement Learning (DRL) has demonstrated the superior performance of neural networks in solving challenging problems with large or continuous state spaces. In this work, we focus on the problem of minimising the expected one-step Temporal Difference (TD) error with a neural function approximator for a continuous state space, from a smooth optimisation perspective. An approximate Newton's algorithm is proposed. The effectiveness of the algorithm is demonstrated on both finite and continuous state-space benchmarks. We show that, in order to benefit from the second-order approximate Newton's algorithm, the gradient of the TD target needs to be considered during training.
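
As a hedged illustration of the final point, the Python sketch below (with a linear value function standing in for the paper's neural approximator, and no approximate Newton step) contrasts the usual semi-gradient of the squared one-step TD error with the full gradient that also differentiates through the TD target.

    import numpy as np

    gamma = 0.99

    def td_error_gradients(theta, phi, r, phi_next):
        """Gradients of 0.5 * delta**2 for delta = r + gamma*V(s') - V(s),
        with V(s) = theta @ phi(s) (linear, used here only for brevity)."""
        delta = r + gamma * theta @ phi_next - theta @ phi
        semi_grad = -delta * phi                        # ignores the target's dependence on theta
        full_grad = delta * (gamma * phi_next - phi)    # differentiates through the TD target
        return semi_grad, full_grad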

Experiments with Reinforcement Learning in Problems with Continuous State and Action Spaces

Adaptive Behavior, 1997

A key element in the solution of reinforcement learning problems is the value function. The purpose of this function is to measure the long-term utility or value of any given state. The function is important because an agent can use this measure to decide what to do next. A common problem in reinforcement learning when applied to systems with continuous state and action spaces is that the value function must operate over a domain consisting of real-valued variables, which means that it should be able to represent the value of infinitely many state and action pairs. For this reason, function approximators are used to represent the value function when a closed-form solution of the optimal policy is not available. In this article, we extend a previously proposed reinforcement learning algorithm so that it can be used with function approximators that generalize the value of individual experiences across both state and action spaces. In particular, we discuss the benefits of using spar...
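
A minimal Python sketch of the general idea of generalising across both state and action spaces (the feature construction here, a random projection over the joint (state, action) vector, is illustrative and is not the approximator used in the article):

    import numpy as np

    alpha, gamma, d, state_dim = 0.05, 0.99, 32, 2        # illustrative sizes
    rng = np.random.default_rng(0)
    W = rng.normal(size=(d, state_dim + 1))                # projection over (state, action)

    def features(state, action):
        """Joint state-action features, so one experience also updates nearby (s, a) pairs."""
        return np.cos(W @ np.concatenate([state, [action]]))

    theta = np.zeros(d)

    def sarsa_step(s, a, r, s_next, a_next):
        phi, phi_next = features(s, a), features(s_next, a_next)
        delta = r + gamma * theta @ phi_next - theta @ phi
        theta[:] += alpha * delta * phi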