Finite state Markov decision models with average reward criteria

Necessary conditions for the optimality equation in average-reward Markov decision processes

Applied Mathematics & Optimization, 1989

An average-reward Markov decision process (MDP) with discrete-time parameter, denumerable state space, and bounded reward function is considered. With such a model, we associate a family of MDPs. Then, we determine necessary conditions for the existence of a bounded solution to the optimality equation for each one of the models in the family. Moreover, necessary and sufficient conditions are given so that the optimality equations have a bounded solution with an additional property.
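For orientation, the average-reward optimality equation referred to in this abstract is usually written in the form below. The notation is an assumption made here for illustration and is not taken from the paper: g is the optimal gain, h a bounded relative value function, A(x) the admissible actions at state x, r the reward, and p the transition law.

```latex
g + h(x) \;=\; \sup_{a \in A(x)} \Big[\, r(x,a) + \sum_{y} p(y \mid x, a)\, h(y) \Big]
\qquad \text{for every state } x .
```

A bounded solution is a pair (g, h) with h bounded that satisfies this identity at every state.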

Estimation and control in finite Markov decision processes with the average reward criterion

Applicationes Mathematicae, 2004

This work concerns Markov decision chains with finite state and action sets. The transition law satisfies the simultaneous Doeblin condition but is unknown to the controller, and the problem of determining an optimal adaptive policy with respect to the average reward criterion is addressed. A subset of policies is identified so that, when the system evolves under a policy in that class, the frequency estimators of the transition law are consistent on an essential set of admissible state-action pairs, and the non-stationary value iteration scheme is used to select an optimal adaptive policy within that family.
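The two ingredients named in this abstract, frequency estimation of the transition law and value iteration, can be sketched as follows. This is a simplified illustration and not the paper's scheme: it runs ordinary relative value iteration on one fixed estimate, whereas the paper uses a non-stationary value iteration that evolves with the estimates. The function names, the nested-list layout `counts[s][a][t]`, and the uniform fallback for unvisited pairs are all assumptions.

```python
def estimate_transitions(counts):
    """Frequency estimator: counts[s][a][t] -> empirical P(t | s, a)."""
    n_states = len(counts)
    p_hat = []
    for state_counts in counts:
        rows = []
        for row in state_counts:
            total = sum(row)
            if total > 0:
                rows.append([c / total for c in row])
            else:
                # Unvisited pair: fall back to a uniform row (illustrative choice).
                rows.append([1.0 / n_states] * n_states)
        p_hat.append(rows)
    return p_hat


def q_values(p, r, h):
    """One-step lookahead values r(s, a) + sum_t p(t | s, a) h(t)."""
    n = len(p)
    return [[r[s][a] + sum(p[s][a][t] * h[t] for t in range(n))
             for a in range(len(p[s]))] for s in range(n)]


def relative_value_iteration(p, r, iters=10_000, tol=1e-12):
    """Relative value iteration with state 0 as the reference state.

    Returns (gain, relative values h with h[0] == 0, greedy policy).
    Convergence is assumed (e.g. an aperiodic unichain model).
    """
    h = [0.0] * len(p)
    gain = 0.0
    for _ in range(iters):
        v = [max(row) for row in q_values(p, r, h)]
        gain = v[0]
        h_new = [x - v[0] for x in v]
        converged = max(abs(x - y) for x, y in zip(h_new, h)) < tol
        h = h_new
        if converged:
            break
    q = q_values(p, r, h)
    policy = [max(range(len(row)), key=row.__getitem__) for row in q]
    return gain, h, policy
```

In an adaptive setting one would re-estimate the transition law as data arrive and interleave the iteration with the updated estimates; the sketch only shows the certainty-equivalent step for a single estimate.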

A Convex Programming Approach for Discrete-Time Markov Decision Processes under the Expected Total Reward Criterion

SIAM Journal on Control and Optimization, 2020

In this work, we study discrete-time Markov decision processes (MDPs) under constraints with Borel state and action spaces and where all the performance functions have the same form of the expected total reward (ETR) criterion over the infinite time horizon. One of our objectives is to propose a convex programming formulation for this type of MDPs. It will be shown that the values of the constrained control problem and the associated convex program coincide and that if there exists an optimal solution to the convex program then there exists a stationary randomized policy which is optimal for the MDP. It will also be shown that in the framework of constrained control problems, the supremum of the expected total rewards over the set of randomized policies is equal to the supremum of the expected total rewards over the set of stationary randomized policies. We consider standard hypotheses such as the so-called continuity-compactness conditions and a Slater-type condition. Our assumptions are weak enough to deal with cases that have not yet been addressed in the literature. An example is presented to illustrate our results with respect to those of the literature.

Optimal control of average reward constrained continuous-time finite Markov decision processes

Proceedings of the 41st IEEE Conference on Decision and Control, 2002.

The paper studies optimization of average-reward continuous-time finite state and action Markov Decision Processes with multiple criteria and constraints. Under the standard unichain assumption, we prove the existence of optimal K-switching strategies for feasible problems with K constraints. For switching randomized strategies, the decisions depend on the current state and the time spent in the current state after the last jump. For stationary strategies, these functions do not depend on sojourn times, i.e. they are constant in time. For K-switching strategies, these functions are piece-wise constant and the total number of jumps is limited by K. If there are no absorbing states, there also exist optimal K-randomized policies. We consider the linear programming approach and provide algorithms for calculations of optimal policies.

A Note on the Existence of Optimal Policies in Total Reward Dynamic Programs with Compact Action Sets

Mathematics of Operations Research, 2000

This work deals with Markov decision processes (MDPs) with expected total rewards, discrete state spaces, and compact action sets. Within this framework, a question on the existence of optimal stationary policies, formulated by Puterman (1994, p. 326), is considered. The paper concerns the possibility of obtaining an affirmative answer when additional assumptions are imposed on the decision model. Three conditions ensuring the existence of average optimal stationary policies in finite-state MDPs are analyzed, and it is shown that only the so-called structural continuity condition is a natural sufficient assumption under which the existence of total-reward optimal stationary policies can be guaranteed. In particular, this existence result holds for unichain MDPs with finite state space, but an example is provided to show that this general conclusion does not have an extension to the denumerable state space case.

Sample-path optimal stationary policies in stable Markov decision chains with the average reward criterion

Journal of Applied Probability, 2015

This paper concerns discrete-time Markov decision chains with denumerable state and compact action sets. Besides standard continuity requirements, the main assumption on the model is that it admits a Lyapunov function ℓ. In this context the average reward criterion is analyzed from the sample-path point of view. The main conclusion is that if the expected average reward associated to ℓ² is finite under any policy then a stationary policy obtained from the optimality equation in the standard way is sample-path average optimal in a strong sense.

On Near Optimality of the Set of Finite-State Controllers for Average Cost POMDP

Mathematics of Operations Research, 2008

We consider the average cost problem for partially observable Markov decision processes (POMDP) with finite state, observation, and control spaces. We prove that there exists an ε-optimal finite-state controller (FSC) functionally independent of initial distributions for any ε > 0, under the assumption that the optimal liminf average cost function of the POMDP is constant. As part of our proof, we establish that if the optimal liminf average cost function is constant, then the optimal limsup average cost function is also constant, and the two are equal. We also discuss the connection between the existence of nearly optimal finite-history controllers and two other important issues for average cost POMDP: the existence of an average cost that is independent of the initial state distribution, and the existence of a bounded solution to the constant average cost optimality equation.

On the Expected Total Reward with Unbounded Returns for Markov Decision Processes

Applied Mathematics & Optimization, 2018

We consider a discrete-time Markov decision process with Borel state and action spaces. The performance criterion is to maximize a total expected utility determined by an unbounded return function. The existence of optimal strategies is shown under general conditions that allow the reward function to be unbounded both from above and below and the action sets available to the decision maker at each step to be not necessarily compact. To deal with unbounded reward functions, a new characterization for the weak convergence of probability measures is derived. Our results are illustrated by examples.

Markov Decision Processes with Long-Term Average Constraints

ArXiv, 2021

We consider the problem of constrained Markov Decision Process (CMDP) where an agent interacts with a unichain Markov Decision Process. At every interaction, the agent obtains a reward. Further, there are K cost functions. The agent aims to maximize the long-term average reward while simultaneously keeping the K long-term average costs lower than a certain threshold. In this paper, we propose CMDP-PSRL, a posterior sampling based algorithm with which the agent can learn optimal policies to interact with the CMDP. Further, for an MDP with S states, A actions, and diameter D, we prove that by following the CMDP-PSRL algorithm, the agent can bound the regret of not accumulating rewards from the optimal policy by Õ(poly(DSA)√T). Further, we show that the violation of any of the K constraints is also bounded by Õ(poly(DSA)√T). To the best of our knowledge, this is the first work which obtains Õ(√T) regret bounds for ergodic MDPs with long-term average constraints.
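The posterior-sampling step that gives this family of algorithms its name can be sketched as below. This is a generic illustration, not the CMDP-PSRL algorithm itself: the abstract does not specify the prior, so the symmetric Dirichlet prior, the function name, and the `counts[s][a][t]` layout are assumptions. The subsequent step, solving the constrained problem for the sampled model, is omitted.

```python
import random


def sample_transition_model(counts, prior=1.0, rng=random):
    """Draw one transition model from independent Dirichlet posteriors.

    counts[s][a][t] holds observed transition counts; `prior` is an assumed
    symmetric Dirichlet parameter added to every count.
    """
    model = []
    for state_counts in counts:
        rows = []
        for row in state_counts:
            # A Dirichlet draw is a vector of normalized Gamma draws.
            gammas = [rng.gammavariate(c + prior, 1.0) for c in row]
            total = sum(gammas)
            rows.append([g / total for g in gammas])
        model.append(rows)
    return model
```

Each call returns a plausible transition law given the data so far; a posterior-sampling agent would plan against the sampled model for a while, collect more counts, and resample.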