Learning-based model predictive control for Markov decision processes

We propose the use of Model Predictive Control (MPC) for controlling systems described by Markov decision processes. First, we consider a straightforward MPC algorithm for Markov decision processes. Then, we propose the use of value functions as a means to deal with issues arising in conventional MPC, e.g., computational requirements and sub-optimality of actions. We use reinforcement learning to let an MPC agent learn a value function incrementally. The agent incorporates experience from its interaction with the system into its decision making. Our approach initially relies on pure MPC. Over time, as experience increases, the learned value function is increasingly taken into account. This speeds up the decision making, allows decisions to be made over an infinite instead of a finite horizon, and provides adequate control actions, even if the system and desired performance slowly vary over time.
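As a rough illustration of the idea in this abstract, the sketch below performs one brute-force finite-horizon MPC step on a tiny MDP and optionally adds a value estimate at the end of the horizon. The function name `mpc_action`, the two-state toy model, the reward (rather than cost) convention, and the open-loop enumeration of action sequences are all assumptions made for illustration; none of them is taken from the paper.

```python
from itertools import product

def mpc_action(P, R, state, horizon, V=None, gamma=0.95):
    """Return the first action of the best open-loop action sequence.

    P[a][s][t] : probability of moving from state s to t under action a
    R[s][a]    : expected immediate reward (hypothetical toy convention)
    V[s]       : optional value estimate applied at the end of the horizon
    """
    n_states, n_actions = len(R), len(R[0])
    terminal = V if V is not None else [0.0] * n_states
    best_value, best_first = float("-inf"), None
    for seq in product(range(n_actions), repeat=horizon):
        # Start from a point mass on the current state.
        dist = [1.0 if s == state else 0.0 for s in range(n_states)]
        value, discount = 0.0, 1.0
        for a in seq:
            # Expected reward of action a under the predicted state distribution.
            value += discount * sum(dist[s] * R[s][a] for s in range(n_states))
            # Propagate the state distribution one step through the model.
            dist = [sum(dist[s] * P[a][s][t] for s in range(n_states))
                    for t in range(n_states)]
            discount *= gamma
        # Terminal term: zero for pure MPC, the value estimate otherwise.
        value += discount * sum(dist[s] * terminal[s] for s in range(n_states))
        if value > best_value:
            best_value, best_first = value, seq[0]
    return best_first

# Toy model (assumed): action 0 stays, action 1 swaps states; state 1 pays 1.
P = [[[1.0, 0.0], [0.0, 1.0]],
     [[0.0, 1.0], [1.0, 0.0]]]
R = [[0.0, 0.0],
     [1.0, 1.0]]

short_sighted = mpc_action(P, R, state=0, horizon=1)
longer_horizon = mpc_action(P, R, state=0, horizon=2)
with_value = mpc_action(P, R, state=0, horizon=1, V=[0.0, 10.0])
```

In this toy model, a horizon of 1 without a value function cannot see the reward in state 1, whereas the same horizon with a hand-set value estimate selects the swap action, as does the longer horizon. This mirrors the claim that a learned value function can stand in for a longer prediction horizon.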

Experience-based model predictive control using reinforcement learning

2004

Model predictive control (MPC) is becoming an increasingly popular method to select actions for controlling dynamic systems. Traditionally, MPC uses a model of the system to be controlled and a performance function to characterize the desired behavior of the system. The MPC agent finds actions over a finite horizon that lead the system into a desired direction. A significant problem with conventional MPC is the amount of computation required and the suboptimality of the chosen actions.

Model predictive control and reinforcement learning as two complementary frameworks

Model predictive control (MPC) and reinforcement learning (RL) are two popular families of methods to control system dynamics. In their traditional setting, they formulate the control problem as a discrete-time optimal control problem and compute a suboptimal control policy. In this paper, we present these two families of methods in a unified framework. We run simulations of MPC and RL algorithms on a benchmark control problem taken from the power system literature and discuss the results obtained.

Blending MPC & Value Function Approximation for Efficient Reinforcement Learning

2021

Model-Predictive Control (MPC) is a powerful tool for controlling complex, real-world systems that uses a model to make predictions about future behavior. For each state encountered, MPC solves an online optimization problem to choose a control action that will minimize future cost. This is a surprisingly effective strategy, but real-time performance requirements warrant the use of simple models. If the model is not sufficiently accurate, then the resulting controller can be biased, limiting performance. We present a framework for improving on MPC with model-free reinforcement learning (RL). The key insight is to view MPC as constructing a series of local Q-function approximations. We show that by using a parameter $\lambda$, similar to the trace decay parameter in TD($\lambda$), we can systematically trade off learned value estimates against the local Q-function approximations. We present a theoretical analysis that shows how error from inaccurate models in MPC and value function e...
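To make the role of $\lambda$ concrete, the sketch below computes the generic truncated $\lambda$-weighted blend of n-step cost-to-go estimates known from TD($\lambda$). It is an assumption that this textbook form is close to what the abstract describes; the paper's exact estimator may differ. The function name `blended_estimate` and its arguments are hypothetical.

```python
def blended_estimate(costs, values, lam, gamma=0.99):
    """Truncated lambda-weighted blend of n-step cost-to-go estimates.

    costs[k]  : model-predicted stage cost at step k, for k = 0..H-1
    values[k] : learned value estimate at the state reached after k+1 steps
    lam       : 0 trusts the learned values early, 1 trusts the full rollout
    """
    H = len(costs)
    n_step = []
    running, discount = 0.0, 1.0
    for k in range(H):
        running += discount * costs[k]
        discount *= gamma
        # (k+1)-step estimate: model costs so far plus bootstrapped value.
        n_step.append(running + discount * values[k])
    # Geometric weights; the last estimate absorbs the remaining mass so that
    # the weights sum to one.
    weights = [(1 - lam) * lam ** n for n in range(H - 1)] + [lam ** (H - 1)]
    return sum(w * g for w, g in zip(weights, n_step))

# Hand-picked numbers (assumed) with gamma = 1 to keep the arithmetic simple.
costs = [1.0, 1.0, 1.0]
values = [10.0, 10.0, 10.0]
one_step = blended_estimate(costs, values, lam=0.0, gamma=1.0)
full_rollout = blended_estimate(costs, values, lam=1.0, gamma=1.0)
mixed = blended_estimate(costs, values, lam=0.5, gamma=1.0)
```

With `lam=0.0` the estimate reduces to one model step plus the learned value, and with `lam=1.0` it reduces to the full model rollout with the learned value used only at the end of the horizon. Intermediate values interpolate between the two.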

An experimental study of two predictive reinforcement learning methods and comparison with model-predictive control

IFAC-PapersOnLine

Reinforcement learning (RL) has been successfully used in various simulations and computer games. Industry-related applications, such as autonomous mobile robot motion control, remain somewhat challenging for RL to date. This paper presents an experimental evaluation of predictive RL controllers for optimal mobile robot motion control. As a baseline for comparison, model-predictive control (MPC) is used. Two RL methods are tested: roll-out Q-learning, which may be considered as MPC with the terminal cost being a Q-function approximation, and so-called stacked Q-learning, which in turn is like MPC with the running cost replaced by a Q-function approximation. The experimental platform is a mobile robot with a differential drive (Robotis Turtlebot3). Experimental results showed that both RL methods beat the baseline in terms of the accumulated cost, with the stacked variant performing best. Given the series of previous works on stacked Q-learning, this particular study supports the idea that MPC with a running cost adaptation inspired by Q-learning has the potential to boost performance while retaining the nice properties of MPC.

Stability-Constrained Markov Decision Processes Using MPC

ArXiv, 2021

In this paper, we consider solving discounted Markov Decision Processes (MDPs) under the constraint that the resulting policy is stabilizing. In practice MDPs are solved based on some form of policy approximation. We will leverage recent results proposing to use Model Predictive Control (MPC) as a structured policy in the context of Reinforcement Learning to make it possible to introduce stability requirements directly inside the MPC-based policy. This will restrict the solution of the MDP to stabilizing policies by construction. The stability theory for MPC is most mature for the undiscounted MPC case. Hence, we will first show in this paper that stable discounted MDPs can be reformulated as undiscounted ones. This observation will entail that the MPC-based policy with stability requirements will produce the optimal policy for the discounted MDP if it is stable, and the best stabilizing policy otherwise.

Reinforcement Learning and Markov Decision Processes

Reinforcement Learning, 2012

Situated in between supervised learning and unsupervised learning, the paradigm of reinforcement learning deals with learning in sequential decision making problems in which there is limited feedback. This text introduces the intuitions and concepts behind Markov decision processes and two classes of algorithms for computing optimal behaviors: reinforcement learning and dynamic programming. First, the formal framework of Markov decision processes is defined, accompanied by the definition of value functions and policies. ...
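For readers who want a concrete instance of the dynamic programming side mentioned above, the following is a minimal value iteration sketch on a two-state toy MDP. The toy model, the reward convention, and the function name `value_iteration` are assumptions for illustration and do not come from the text being summarized.

```python
def value_iteration(P, R, gamma=0.9, tol=1e-8):
    """Compute an approximately optimal value function and a greedy policy.

    P[a][s][t] : probability of moving from state s to t under action a
    R[s][a]    : expected immediate reward
    """
    n_states, n_actions = len(R), len(R[0])
    V = [0.0] * n_states
    while True:
        # Bellman optimality backup for every state-action pair.
        Q = [[R[s][a] + gamma * sum(P[a][s][t] * V[t] for t in range(n_states))
              for a in range(n_actions)]
             for s in range(n_states)]
        V_new = [max(q) for q in Q]
        if max(abs(x - y) for x, y in zip(V_new, V)) < tol:
            # Greedy policy with respect to the final Q-values.
            policy = [q.index(max(q)) for q in Q]
            return V_new, policy
        V = V_new

# Toy model (assumed): action 0 stays, action 1 swaps states; state 1 pays 1.
P = [[[1.0, 0.0], [0.0, 1.0]],
     [[0.0, 1.0], [1.0, 0.0]]]
R = [[0.0, 0.0],
     [1.0, 1.0]]

V, policy = value_iteration(P, R)
```

In this toy model the greedy policy swaps out of state 0 and stays in state 1, and the value of state 1 approaches 1 / (1 - 0.9) = 10.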

Combining Markov Decision Processes with Linear Optimal Controllers

Linear Quadratic Gaussian (LQG) control has a known analytical solution [1] but non-linear problems do not. The state-of-the-art method used to find approximate solutions to non-linear control problems (iterative LQG) [3] carries a large computational cost associated with iterative calculations [4]. We propose a novel approach for solving non-linear Optimal Control (OC) problems which combines Reinforcement Learning (RL) with OC. The new algorithm, RLOC, uses a small set of localized optimal linear controllers and applies a Monte Carlo algorithm that learns the mapping from the state space to controllers. We illustrate our approach by solving a non-linear OC problem of the 2-joint arm operating in a plane with two point masses. We show that controlling the arm with RLOC is less costly than using the Linear Quadratic Regulator (LQR). This finding shows that non-linear optimal control problems can be solved using a novel approach of adaptive RL.

Value Function Based Reinforcement Learning in Changing Markovian Environments

Journal of Machine Learning Research - JMLR, 2008

The paper investigates the possibility of applying value function based reinforcement learning (RL) methods in cases when the environment may change over time. First, theorems are presented which show that the optimal value function of a discounted Markov decision process (MDP) Lipschitz continuously depends on the immediate-cost function and the transition-probability function. Dependence on the discount factor is also analyzed and shown to be non-Lipschitz. Afterwards, the concept of $(\varepsilon, \delta)$-MDPs is introduced, which is a generalization of MDPs and $\varepsilon$-MDPs. In this model the environment may change over time, more precisely, the transition function and the cost function may vary from time to time, but the changes must be bounded in the limit. Then, learning algorithms in changing environments are analyzed. A general relaxed convergence theorem for stochastic iterative algorithms is presented. We also demonstrate the results through three classical RL methods: asynchronou...
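The Lipschitz dependence on the immediate-cost function can be illustrated with the standard contraction argument. The notation below ($V_1^*$, $V_2^*$, $g_1$, $g_2$, $\gamma$) is assumed for illustration, and the constants stated in the paper itself may be phrased differently. Consider two discounted MDPs that share states, actions, transition probabilities, and discount factor $\gamma \in [0, 1)$, and differ only in their immediate-cost functions $g_1$ and $g_2$:

```latex
% T_i denotes the Bellman optimality operator of MDP i (assumed notation).
\begin{align*}
\|V_1^* - V_2^*\|_\infty
  &= \|T_1 V_1^* - T_2 V_2^*\|_\infty \\
  &\le \|T_1 V_1^* - T_1 V_2^*\|_\infty + \|T_1 V_2^* - T_2 V_2^*\|_\infty \\
  &\le \gamma \, \|V_1^* - V_2^*\|_\infty + \|g_1 - g_2\|_\infty ,
\end{align*}
% Rearranging gives the Lipschitz bound with constant 1 / (1 - \gamma):
\[
\|V_1^* - V_2^*\|_\infty \;\le\; \frac{1}{1 - \gamma} \, \|g_1 - g_2\|_\infty .
\]
```

The constant $1/(1-\gamma)$ grows without bound as $\gamma \to 1$, which is consistent with the abstract's remark that the dependence on the discount factor behaves differently from the dependence on costs.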

Self Learning Control of Constrained Markov Decision Processes - A Gradient Approach

2003

We present stochastic approximation algorithms for computing the locally optimal policy of a constrained average cost finite state Markov Decision process. Because the optimal control strategy is known to be a randomized policy, we consider here a parameterization of the action probabilities to establish the optimization problem. The stochastic approximation algorithms require computation of the gradient of the cost function with respect to the parameter that characterizes the randomized policy. This is computed by novel simulation based gradient estimation schemes involving weak derivatives. Similar to neuro-dynamic programming algorithms (e.g. Q-learning or