Average optimality for continuous-time Markov decision processes in Polish spaces
2006, The Annals of Applied Probability
X. Guo and U. Rieder
This paper studies average optimality in continuous-time Markov decision processes with fairly general state and action spaces. The criterion to be maximized is the expected average reward. The transition rates of the underlying continuous-time jump Markov processes are allowed to be unbounded, and the reward rates may have neither upper nor lower bounds. We first provide two optimality inequalities with opposed directions, and give suitable conditions under which the existence of solutions to the two optimality inequalities is ensured. Then, from the two optimality inequalities, we prove the existence of optimal (deterministic) stationary policies by using the Dynkin formula. Moreover, we present a "semimartingale characterization" of an optimal stationary policy. Finally, we use a generalized Potlach process with control to illustrate the difference between our conditions and those in the previous literature, and then apply our results to average optimal control problems of generalized birth-death systems, upwardly skip-free processes and two queueing systems. The approach developed in this paper is slightly different from the "optimality inequality approach" widely used in the previous literature.
This reprint differs from the original in pagination and typographic detail.
Continuous-time Markov decision processes (MDPs) are specified by four primitive data: a state space S; an action space A with subsets A(x) of admissible actions, which may depend on the current state x ∈ S; transition rates q(·|x, a); and reward (or cost) rates r(x, a). Using these terms, we now briefly describe some existing works on the expected average criterion. When the state space is finite, a bounded solution to the average optimality equation (AOE) and methods for computing optimal stationary policies have been investigated in .
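As a minimal illustration of the four primitive data in the finite-state case, the following Python sketch builds a two-state model and finds the best deterministic stationary policy by brute-force enumeration. The two-state model, its rates and rewards, and the function names are assumptions made for this example only. Enumeration stands in for the computational methods referred to above, which are not reproduced here, and it is not available in the general state and action spaces treated in this paper.

```python
from itertools import product

# Hypothetical two-state model; every number below is an illustrative assumption.
STATES = (0, 1)
ACTIONS = {0: ("slow", "fast"), 1: ("slow", "fast")}  # admissible sets A(x)

# RATE[(x, a)]: transition rate from state x to the other state under action a
RATE = {(0, "slow"): 1.0, (0, "fast"): 3.0,
        (1, "slow"): 2.0, (1, "fast"): 4.0}

# REWARD[(x, a)]: reward rate r(x, a) earned per unit time in state x
REWARD = {(0, "slow"): 0.0, (0, "fast"): -1.0,
          (1, "slow"): 5.0, (1, "fast"): 2.0}


def average_reward(policy):
    """Expected average reward of a deterministic stationary policy.

    For a two-state chain with rates a (0 -> 1) and b (1 -> 0), the
    stationary distribution is (b, a) / (a + b).
    """
    a = RATE[(0, policy[0])]
    b = RATE[(1, policy[1])]
    pi0, pi1 = b / (a + b), a / (a + b)
    return pi0 * REWARD[(0, policy[0])] + pi1 * REWARD[(1, policy[1])]


def best_policy():
    """Enumerate all deterministic stationary policies and keep the best one."""
    candidates = [dict(zip(STATES, choice))
                  for choice in product(*(ACTIONS[x] for x in STATES))]
    return max(candidates, key=average_reward)
```

Calling `best_policy()` returns a dictionary mapping each state to its chosen action, and `average_reward` evaluates any such dictionary. With both rates positive the chain has a unique stationary distribution, so the average reward does not depend on the initial state.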
Since then, most work has focused on the case of a denumerable state space; for instance, see for bounded transition and reward rates, for bounded transition rates but unbounded reward rates, [16, 35] for unbounded transition rates but bounded reward rates and [12, 13, 17] for unbounded transition and reward rates. For the case of an arbitrary state space, to the best of our knowledge, only Doshi [5] and Hernández-Lerma [19] have addressed this issue, and both ensured the existence of optimal stationary policies. However, the treatments in [5] and [19] are restricted to uniformly bounded reward rates and nonnegative cost rates, respectively, and the AOE plays a key role in the proof of the existence of average optimal policies. Moreover, to establish the AOE, Doshi [5] needed the hypotheses that all admissible action sets are finite and that the relative difference of the optimal discounted value function is equicontinuous, whereas in [19] the existence of a solution to the AOE is assumed. It is also worth mentioning that some of the conditions in are imposed on the family of weak infinitesimal operators deduced from all admissible policies, instead of on the primitive data.
In this paper we study a much more general case: the reward rates may have neither upper nor lower bounds, the state and action spaces are fairly general, and the transition rates are allowed to be unbounded. We first provide two optimality inequalities, rather than the single one of the "optimality inequality approach" used in , for instance. Under suitable assumptions we not only prove the existence of solutions to the two optimality inequalities, but also ensure the existence of optimal stationary policies by using the two inequalities and the Dynkin formula. To verify our assumptions, we further give sufficient conditions that are imposed on the primitive data. Moreover, we present a semimartingale characterization of an optimal stationary policy.
Finally, we use controlled generalized Potlach processes to show that all conditions in this paper are satisfied, whereas the earlier conditions fail to hold. Then we further apply our results to average optimal control problems of generalized birth-death systems and upwardly skip-free processes [1], a pair of controlled queues in tandem , and M/M/N/0 queue systems . It should be noted that, on the one hand, the optimality inequality approach used in the previous literature (see, e.g., for continuous-time MDPs and [20, 21, 31, 34] for discrete-time MDPs) does not apply to our case, because in our model the reward rates may have neither upper nor lower bounds. On the other hand, we not only