Metrics for Finite Markov Decision Processes

Metrics for Markov decision processes with infinite state spaces

arXiv preprint arXiv:1207.1386, 2012

Abstract: We present metrics for measuring state similarity in Markov decision processes (MDPs) with infinitely many states, including MDPs with continuous state spaces. Such metrics provide a stable quantitative analogue of the notion of bisimulation for MDPs, and are suitable for use in MDP approximation. We show that the optimal value function associated with a discounted infinite horizon planning task varies continuously with respect to our metric distances.
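For orientation, the finite-state bisimulation metric of Ferns et al. (2004), which this line of work extends to infinite state spaces, is the least fixed point of an operator of the following form (a hedged sketch in the usual notation; the constants c_R, c_T are weights with c_R + c_T <= 1, not quoted from this abstract):

d(s, s') = \max_{a \in A} \Big( c_R \, \bigl| r(s,a) - r(s',a) \bigr| + c_T \, T_K(d)\bigl( P(\cdot \mid s, a),\, P(\cdot \mid s', a) \bigr) \Big),

where T_K(d) denotes the Kantorovich metric induced by d. The continuity result mentioned in the abstract is then of the form |V^*(s) - V^*(s')| \le d(s, s') / c_R, provided c_T \ge \gamma.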

Methods for computing state similarity in Markov decision processes

2012

Abstract: A popular approach to solving large probabilistic systems relies on aggregating states based on a measure of similarity. Many approaches in the literature are heuristic. A number of recent methods rely instead on metrics based on the notion of bisimulation, or behavioral equivalence between states (Givan et al., 2001, 2003; Ferns et al., 2004). An integral component of such metrics is the Kantorovich metric between probability distributions.
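As a concrete illustration (a minimal sketch, not code from the paper), the Kantorovich metric between two finite distributions with a given ground metric is the optimal value of a transportation linear program, which can be computed for example with scipy:

# Minimal sketch: Kantorovich (Wasserstein-1) distance between two finite
# distributions p and q over n points, with ground distances d[i][j].
import numpy as np
from scipy.optimize import linprog

def kantorovich(p, q, d):
    # min sum_{i,j} lam[i,j]*d[i,j]  s.t.  sum_j lam[i,j] = p[i],  sum_i lam[i,j] = q[j]
    n = len(p)
    c = np.asarray(d, dtype=float).reshape(n * n)
    A_eq = np.zeros((2 * n, n * n))
    for i in range(n):
        A_eq[i, i * n:(i + 1) * n] = 1.0   # row-sum constraint for p[i]
        A_eq[n + i, i::n] = 1.0            # column-sum constraint for q[i]
    b_eq = np.concatenate([p, q])
    res = linprog(c, A_eq=A_eq, b_eq=b_eq, bounds=(0, None), method="highs")
    return res.fun

p = np.array([0.5, 0.5, 0.0])
q = np.array([0.0, 0.5, 0.5])
d = 1.0 - np.eye(3)                        # discrete ground metric
print(kantorovich(p, q, d))                # 0.5

In the bisimulation setting the ground distances d[i][j] are themselves the current metric estimates, so an LP of this kind is solved repeatedly inside a fixed-point iteration.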

Bisimulation Metrics for Continuous Markov Decision Processes

2011

In recent years, various metrics have been developed for measuring the behavioral similarity of states in probabilistic transition systems [J. Desharnais et al., Proceedings of CONCUR'99, Springer-Verlag, London, 1999, pp. 258-273; F. van Breugel and J. Worrell, Proceedings of ICALP'01, Springer-Verlag, London, 2001, pp. 421-432]. In the context of finite Markov decision processes (MDPs), we have built on these metrics to provide a robust quantitative analogue of stochastic bisimulation [N. Ferns, P. Panangaden, and D.

Optimality and Approximation with Policy Gradient Methods in Markov Decision Processes

2019

Policy gradient methods are among the most effective methods in challenging reinforcement learning problems with large state and/or action spaces. However, little is known about even their most basic theoretical convergence properties, including: if and how fast they converge to a globally optimal solution (say with a sufficiently rich policy class); how they cope with approximation error due to using a restricted class of parametric policies; or their finite sample behavior. Such characterizations are important not only to compare these methods to their approximate value function counterparts (where such issues are relatively well understood, at least in the worst case), but also to help with more principled approaches to algorithm design. This work provides provable characterizations of computational, approximation, and sample size issues with regards to policy gradient methods in the context of discounted Markov Decision Processes (MDPs). We focus on both: 1) “tabular” policy par...
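For reference, the policy gradient identity that such analyses start from can be stated as follows (standard notation, assumed rather than quoted from the abstract):

\nabla_\theta V^{\pi_\theta}(\mu) = \frac{1}{1-\gamma}\, \mathbb{E}_{s \sim d_\mu^{\pi_\theta},\, a \sim \pi_\theta(\cdot \mid s)} \bigl[ \nabla_\theta \log \pi_\theta(a \mid s)\, A^{\pi_\theta}(s, a) \bigr],

where d_\mu^{\pi_\theta} is the discounted state-visitation distribution started from \mu and A^{\pi_\theta} is the advantage function; in the tabular softmax case \theta has one parameter per state-action pair.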

Finite state Markov decision models with average reward criteria

Stochastic Processes and their Applications, 1994

This paper deals with a discrete time Markov decision model with a finite state space, arbitrary action space, and bounded reward function under the average reward criteria. We consider four average reward criteria and prove the existence of persistently nearly optimal strategies in various classes of strategies for models with complete state information. We show that such strategies exist in any class of strategies satisfying the following condition: along any trajectory at different epochs the controller knows different information about the past. Though neither optimal nor stationary nearly optimal strategies may exist, we show that for some nonempty set of states the described nearly optimal strategies may be chosen either stationary or optimal. Keywords: Markov decision models; average reward criteria; persistently nearly optimal strategies; Markov strategies; stationary strategies; non-repeating condition.

Reinforcement learning in finite MDPs: PAC analysis

2009

We study the problem of learning near-optimal behavior in finite Markov Decision Processes (MDPs) with a polynomial number of samples. These "PAC-MDP" algorithms include the well-known E^3 and R-MAX algorithms as well as the more recent Delayed Q-learning algorithm. We summarize the current state of the art by presenting bounds for the problem in a unified theoretical framework. A more refined analysis of upper and lower bounds is presented to yield insight into the differences between the model-free Delayed Q-learning and the model-based R-MAX.
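To make the model-based side concrete, here is a minimal sketch (not the algorithm as specified in the paper) of the optimism device at the core of R-MAX: state-action pairs with fewer than m visits are modelled as maximum-reward self-loops, and the agent plans in that optimistic model.

import numpy as np

def rmax_model(counts, trans_counts, reward_sums, r_max, m):
    # counts[s, a]: visits to (s, a); trans_counts[s, a, s2]: observed transitions;
    # reward_sums[s, a]: accumulated reward.  Pairs with fewer than m visits are
    # treated optimistically as max-reward self-loops.
    n_s, n_a = counts.shape
    P_hat = np.zeros((n_s, n_a, n_s))
    R_hat = np.zeros((n_s, n_a))
    for s in range(n_s):
        for a in range(n_a):
            if counts[s, a] >= m:                     # "known" pair: empirical model
                P_hat[s, a] = trans_counts[s, a] / counts[s, a]
                R_hat[s, a] = reward_sums[s, a] / counts[s, a]
            else:                                     # "unknown" pair: optimism
                P_hat[s, a, s] = 1.0
                R_hat[s, a] = r_max
    return P_hat, R_hat

def plan(P, R, gamma=0.95, iters=500):
    # Value iteration in the optimistic model; the agent then acts greedily.
    V = np.zeros(P.shape[0])
    for _ in range(iters):
        V = np.max(R + gamma * (P @ V), axis=1)
    return V

The PAC-MDP bounds discussed above control how large m must be, and how often the agent can encounter an "unknown" pair, before its policy is near-optimal on all but a polynomial number of steps.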

Theory of Finite Horizon Markov Decision Processes

Universitext, 2011

In this chapter we will establish the theory of Markov Decision Processes with a finite time horizon and with general state and action spaces. Optimization problems of this kind can be solved by a backward induction algorithm. Since the state and action spaces are arbitrary, we will impose a structure assumption on the problem in order to prove the validity of the backward induction and the existence of optimal policies. The chapter is organized as follows. Section 2.1 provides the basic model data and the definition of policies. The precise mathematical model is then presented in Section 2.2, along with a sufficient integrability assumption which implies a well-defined problem. The solution technique for these problems is explained in Section 2.3. Under structure assumptions on the model it will be shown that Markov Decision Problems can be solved recursively by the so-called Bellman equation. The next section summarizes a number of important special cases in which the structure assumption is satisfied. Conditions on the model data are given such that the value functions are upper semicontinuous, continuous, measurable, increasing, concave, or convex, respectively. The monotonicity of the optimal policy under some conditions is also established; this is an essential property for computations. Finally, the important concept of upper bounding functions is introduced in this section. Whenever an upper bounding function for a Markov Decision Model exists, the integrability assumption is satisfied. This concept will be very fruitful when dealing with infinite horizon Markov Decision Problems in Chapter 7. In Section 2.5 the important case of stationary Markov Decision Models is investigated. The notion 'stationary' indicates that the model data does not depend on the time index. The relevant theory is adapted here from the non-stationary case. Finally, Section 2.6 highlights the application of the developed theory by investigating three simple examples: the first is a special card game, the second a cash balance problem, and the last deals with classical stochastic LQ problems. The last section contains some notes and references.
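A minimal tabular sketch of the backward induction the chapter formalizes (the chapter itself allows general state and action spaces; the names P, r, and horizon below are illustrative):

import numpy as np

def backward_induction(P, r, horizon, terminal_value=None):
    # P[s, a, s2]: transition probabilities, r[s, a]: one-step rewards.
    # V_N = terminal value; V_n(s) = max_a ( r(s,a) + sum_{s2} P(s2|s,a) V_{n+1}(s2) ).
    n_s, n_a, _ = P.shape
    V = np.zeros(n_s) if terminal_value is None else np.asarray(terminal_value, dtype=float)
    policy = np.zeros((horizon, n_s), dtype=int)
    for n in reversed(range(horizon)):
        Q = r + P @ V                    # Bellman operator at stage n
        policy[n] = Q.argmax(axis=1)     # a maximizer always exists in the tabular case
        V = Q.max(axis=1)
    return V, policy

In the general-space setting, the structure assumption mentioned in the abstract is exactly what guarantees that the maximum at each stage is attained and the resulting value functions remain well behaved.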

Reinforcement Learning exploiting state-action equivalence

2019

Leveraging an equivalence property on the set of states or state-action pairs of a Markov Decision Process (MDP) has been suggested by many authors. We take the study of equivalence classes to the reinforcement learning (RL) setup, when transition distributions are no longer assumed to be known, in a discrete MDP with average reward criterion and no reset. We study powerful similarities between state-action pairs related to optimal transport. We first analyze a variant of the UCRL2 algorithm called C-UCRL2, which highlights the clear benefit of leveraging this equivalence structure when it is known ahead of time: the regret bound scales as Õ(D√(KCT)), where C is the number of classes of equivalent state-action pairs and K bounds the size of the support of the transitions. A nontrivial question is whether this benefit can still be observed when the structure is unknown and must be learned while minimizing the regret. We propose a sound clustering technique that provably learns the u...
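A hypothetical sketch of the pooling idea when the classes are known in advance (simplified: it ignores the within-class re-ordering of transition profiles that C-UCRL2 uses, and the names below are illustrative):

import numpy as np
from collections import defaultdict

def pooled_estimates(transitions, classes, n_states):
    # transitions: iterable of (s, a, r, s_next); classes[(s, a)]: class label,
    # assumed known in advance here.  All pairs in a class share their samples.
    counts = defaultdict(float)
    next_counts = defaultdict(lambda: np.zeros(n_states))
    reward_sums = defaultdict(float)
    for s, a, r, s_next in transitions:
        c = classes[(s, a)]
        counts[c] += 1.0
        next_counts[c][s_next] += 1.0
        reward_sums[c] += r
    p_hat = {c: next_counts[c] / counts[c] for c in counts}
    r_hat = {c: reward_sums[c] / counts[c] for c in counts}
    return p_hat, r_hat, dict(counts)

Because the confidence widths shrink with the pooled class count rather than the per-pair count, the regret depends on the number of classes C rather than on the number of individual state-action pairs, which is the source of the improvement claimed in the abstract.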
