Using Rewards for Belief State Updates in Partially Observable Markov Decision Processes

Partially Observable Markov Decision Processes

Springer eBooks, 2011

Partially Observable Markov Decision Processes (POMDPs) provide a general framework for AI planning, but they lack the structure for representing real-world planning problems in a convenient and efficient way. Representations built on logic allow problems to be specified in a compact and transparent manner. Moreover, decision-making algorithms can assume and exploit structure found in the state space, actions, observations, and success criteria, and can solve problems with large state spaces relatively efficiently. In recent years researchers have sought to combine the benefits of logic with the expressiveness of POMDPs. In this paper, we show how to build upon and extend the results in this fusing of logic and decision theory. In particular, we present a compact representation of POMDPs and a method to update beliefs after actions and observations. The key contribution is our compact representation of belief states and of the operations used to update them. We then use heuristic search to find optimal plans that maximize expected total reward given an initial belief state.
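For readers unfamiliar with the belief update mentioned in this abstract, the following is a minimal sketch of the standard flat (non-compact) Bayes-filter update, not the logic-based representation the paper proposes. The array layout `T[a, s, s']` and `O[a, s', o]`, the function name `belief_update`, and the two-state numbers are all illustrative assumptions.

```python
import numpy as np

def belief_update(belief, T, O, action, observation):
    """Standard POMDP belief update (assumed flat-array layout).

    b'(s') is proportional to O[a, s', o] * sum_s T[a, s, s'] * b(s).
    """
    # Prediction step: push the belief through the transition model.
    predicted = belief @ T[action]
    # Correction step: weight each successor state by the observation likelihood.
    unnormalized = O[action][:, observation] * predicted
    norm = unnormalized.sum()
    if norm == 0.0:
        raise ValueError("observation has zero probability under this belief")
    return unnormalized / norm

# Hypothetical two-state, one-action, two-observation model.
T = np.array([[[0.9, 0.1],
               [0.2, 0.8]]])          # T[a, s, s']
O = np.array([[[0.7, 0.3],
               [0.1, 0.9]]])          # O[a, s', o]
b0 = np.array([0.5, 0.5])
b1 = belief_update(b0, T, O, action=0, observation=1)
```

The normalizing constant `norm` is the probability of receiving the observation given the previous belief and action, which is why a zero value is treated as an error rather than silently divided through.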

A POMDP extension with belief-dependent rewards

2010

Partially Observable Markov Decision Processes (POMDPs) model sequential decision-making problems under uncertainty and partial observability. Unfortunately, some problems cannot be modeled with state-dependent reward functions, e.g., problems whose objective explicitly implies reducing the uncertainty on the state. To that end, we introduce ρPOMDPs, an extension of POMDPs where the reward function ρ depends on the belief state. We show that, under the common assumption that ρ is convex, the value function is also convex, which makes it possible to (1) approximate ρ arbitrarily well with a piecewise linear and convex (PWLC) function, and (2) use state-of-the-art exact or approximate solving algorithms with limited changes.
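As an illustration of a convex belief-dependent reward and its PWLC approximation, the sketch below assumes negative entropy as the reward (a common uncertainty-reduction objective, chosen here only as an example) and builds the approximation from tangent hyperplanes at a few hand-picked interior beliefs. On the probability simplex the tangent of negative entropy at a point b0 reduces to the linear function b · log(b0). The function names and sample points are hypothetical, and this tangent construction is one possible approach rather than necessarily the one used in the paper.

```python
import numpy as np

def neg_entropy(belief):
    """Assumed example reward: rho(b) = sum_s b(s) log b(s), convex in b."""
    b = np.asarray(belief, dtype=float)
    nz = b > 0                      # treat 0 * log(0) as 0
    return float(np.sum(b[nz] * np.log(b[nz])))

def tangent_vectors(points):
    """One linear piece per tangent point b0; on the simplex it equals b . log(b0).

    Points must be strictly positive so the logarithm is finite.
    """
    return np.log(np.asarray(points, dtype=float))

def pwlc_value(belief, alphas):
    """Piecewise linear and convex lower bound: max over the linear pieces."""
    return float(np.max(alphas @ np.asarray(belief, dtype=float)))

# Hypothetical tangent points on the two-state simplex.
points = [[0.1, 0.9], [0.3, 0.7], [0.5, 0.5], [0.7, 0.3], [0.9, 0.1]]
alphas = tangent_vectors(points)

b = np.array([0.4, 0.6])
approx = pwlc_value(b, alphas)      # never exceeds the exact reward
exact = neg_entropy(b)
```

Adding more tangent points tightens the lower bound, which is the sense in which a convex reward can be approximated increasingly well by a PWLC function.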

Partially observable Markov decision processes with imprecise parameters

Artificial Intelligence, 2007

This study extends the framework of partially observable Markov decision processes (POMDPs) to allow their parameters, i.e., the probability values in the state transition functions and the observation functions, to be imprecisely specified. It is shown that this extension can reduce the computational costs associated with the solution of these problems. First, the new framework, POMDPs with imprecise parameters (POMDPIPs), is formulated. We consider (1) the interval case, in which each parameter is imprecisely specified by an interval that indicates possible values of the parameter, and (2) the point-set case, in which each probability distribution is imprecisely specified by a set of possible distributions. Second, a new optimality criterion for POMDPIPs is introduced. As in POMDPs, the criterion is to regard a policy, i.e., an action-selection rule, as optimal if it maximizes the expected total reward. The expected total reward, however, cannot be calculated precisely in POMDPIPs, because of the parameter imprecision. Instead, we estimate the total reward by adopting arbitrary second-order beliefs, i.e., beliefs in the imprecisely specified state transition functions and observation functions. Although there are many possible choices for these second-order beliefs, we regard a policy as optimal as long as there is at least one such choice with which the policy maximizes the total reward. Thus there can be multiple optimal policies for a POMDPIP. We regard these policies as equally optimal, and aim at obtaining one of them. By appropriately choosing which second-order beliefs to use in estimating the total reward, computational costs incurred in obtaining such an optimal policy can be reduced significantly. We provide an exact solution algorithm for POMDPIPs that does this efficiently. Third, the performance of such an optimal policy, as well as the computational complexity of the algorithm, are analyzed theoretically. Last, empirical studies show that our algorithm quickly obtains satisfactory policies for many POMDPIPs.

Algorithms for partially observable Markov decision processes

Chapter 1 Introduction
  1.1 Planning
  1.2 Applications
  1.3 Thesis
  1.4 Outline
Chapter 2 POMDP Theory and Algorithms
  2.1 POMDP Model
    2.1.1 Model definition
    2.1.2 Belief states, policies and value functions
    2.1.3 Belief space MDP
    2.1.4 Value iteration
  2.2 Properties of Value Functions
    2.2.1 Policy tree
    2.2.2 Piecewise linear and convex property
    2.2.3 Parsimonious representations
  2.3 Difficulties in Solving a POMDP
  2.4 Standard Algorithms
    2.4.1 Value iteration
    2.4.2 Policy iteration
  2.5 Theoretical Results
  2.6 An Overview of POMDP Algorithms
    2.6.1 Decomposing value functions
    2.6.2 Value iteration: superset algorithms
    2.6.3 Value iteration: subset algorithms
  2.7 Current Research Status
Chapter 3 Modified Value Iteration
  3.1 Motivation
  3.2 Uniformly Improvable Value Function
  3.3 Modified Value Iteration: the Algorithm
    3.3.1 Backing up on witness points of input vectors
    3.3.2 Retaining uniform improvability
    3.3.3 The algorithm
    3.3.4 Stopping point-based value iteration
    3.3.5 Convergence of modified value iteration
    3.3.6 Computing the Bellman residual
  3.4 Empirical Studies
    3.4.1 Effectiveness of point-based improvements
    3.4.2 Variations of point-based DP update
  3.5 Related Work
    3.5.1 Point-based and standard DP updates
    3.5.2 Point-based procedure and value function approximation
    3.5.3 Previous work related to modified value iteration
  3.6 Conclusion
Chapter 4 Value Iteration over subspace 4.

Learning and solving partially observable markov decision processes

2007

Partially Observable Markov Decision Processes (POMDPs) provide a rich representation for agents acting in a stochastic domain under partial observability. POMDPs optimally balance key properties such as the need for information and the sum of collected rewards. However, POMDPs are difficult to use for two reasons: first, it is difficult to obtain the environment dynamics, and second, even given the environment dynamics, solving POMDPs optimally is intractable. This dissertation deals with both difficulties. We begin with a number of methods for learning POMDPs. Methods for learning POMDPs are usually categorized as either model-free or model-based. We show how model-free methods fail to provide good policies as noise in the environment increases. We then suggest how to transform model-free into model-based methods, thus improving their solution. This transformation is first demonstrated in an offline process, after the model-free method has computed a policy, and then in an online setting, where a model of the environment is learned together with a policy through interactions with the environment. The second part of the dissertation focuses on ways to solve predefined POMDPs. Point-based methods for computing value functions have shown great potential for solving large-scale POMDPs. We provide a number of new algorithms that outperform existing point-based methods. We first show how properly ordering the value function updates can greatly reduce the required number of updates. We then present a trial-based algorithm that outperforms all current point-based algorithms. Due to the success of point-based algorithms on large domains, a need arises for compact representations of the environment. We thoroughly investigate the use of Algebraic Decision Diagrams (ADDs) for representing system dynamics. We show how all operations required for point-based algorithms can be implemented efficiently using ADDs.

Reinforcement Learning in Partially Observable Markov Decision Processes using Hybrid Probabilistic Logic Programs

Computing Research Repository, 2010

We present a probabilistic logic programming framework for reinforcement learning that integrates reinforcement learning in POMDP environments with normal hybrid probabilistic logic programs with probabilistic answer set semantics, and that is capable of representing domain-specific knowledge. We formally prove the correctness of our approach. We show that the complexity of finding a policy for a reinforcement learning problem in our approach is NP-complete. In addition, we show that any reinforcement learning problem can be encoded as a classical logic program with answer set semantics. We also show that a reinforcement learning problem can be encoded as a SAT problem. We present a new high-level action description language that allows the factored representation of POMDPs. Moreover, we modify the original POMDP model so that it is able to distinguish between knowledge-producing actions and actions that change the environment.

Value-directed belief state approximation for POMDPs

2000

We consider the problem of belief-state monitoring for the purposes of implementing a policy for a partially observable Markov decision process (POMDP), specifically how one might approximate the belief state. Other schemes for belief-state approximation (e.g., based on minimizing a measure such as KL-divergence between the true and estimated state) are not necessarily appropriate for POMDPs.
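A small numerical sketch may help show why a KL-based criterion can be misleading here. It assumes the policy is represented by one value vector per action (an illustrative simplification), and compares two hypothetical approximate beliefs: one that is close to the true belief in KL-divergence but flips the greedy action, and one that is farther away in KL-divergence but leaves the action unchanged. All numbers and function names are invented for illustration and do not come from the paper.

```python
import numpy as np

def kl_divergence(p, q):
    """KL(p || q) for discrete distributions; assumes q > 0 wherever p > 0."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    mask = p > 0
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))

def greedy_action(belief, alphas):
    """Action whose value vector scores highest at the given belief."""
    return int(np.argmax(alphas @ belief))

def value_loss(true_belief, approx_belief, alphas):
    """Value lost, measured at the true belief, by acting on the approximation."""
    values = alphas @ true_belief
    return float(values.max() - values[greedy_action(approx_belief, alphas)])

# Hypothetical value vectors, one row per action, over three states.
alphas = np.array([[1.0, 0.0, 0.5],
                   [0.0, 1.0, 0.5]])
true_b = np.array([0.42, 0.38, 0.20])
approx_a = np.array([0.38, 0.42, 0.20])   # close in KL, but flips the action
approx_b = np.array([0.60, 0.20, 0.20])   # farther in KL, same action

kl_a, kl_b = kl_divergence(true_b, approx_a), kl_divergence(true_b, approx_b)
loss_a = value_loss(true_b, approx_a, alphas)
loss_b = value_loss(true_b, approx_b, alphas)
```

In this toy setting the approximation with the smaller KL-divergence incurs the larger loss in value, which is the kind of mismatch that motivates measuring approximation quality by its effect on decisions.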

Planning and acting in partially observable stochastic domains

Artificial Intelligence, 1998

In this paper, we bring techniques from operations research to bear on the problem of choosing optimal actions in partially observable stochastic domains. We begin by introducing the theory of Markov decision processes (MDPs) and partially observable MDPs (POMDPs). We then outline a novel algorithm for solving POMDPs off line and show how, in some cases, a finite-memory controller can be extracted from the solution to a POMDP. We conclude with a discussion of how our approach relates to previous work, the complexity of finding exact solutions to POMDPs, and of some possibilities for finding approximate solutions.

A Bayesian approach for learning and planning in partially observable Markov decision processes

2011

Bayesian learning methods have recently been shown to provide an elegant solution to the exploration-exploitation trade-off in reinforcement learning. However, most investigations of Bayesian reinforcement learning to date focus on standard Markov Decision Processes (MDPs). The primary focus of this paper is to extend these ideas to the case of partially observable domains, by introducing Bayes-Adaptive Partially Observable Markov Decision Processes.

Belief Selection in Point-Based Planning Algorithms for POMDPs

Lecture Notes in Computer Science, 2006

Current point-based planning algorithms for solving partially observable Markov decision processes (POMDPs) have demonstrated that a good approximation of the value function can be derived by interpolation from the values of a specially selected set of points. The performance of these algorithms can be improved by eliminating unnecessary backups or concentrating on more important points in the belief simplex. We study three methods designed to improve point-based value iteration algorithms. The first two methods are based on reachability analysis on the POMDP belief space. This approach relies on prioritizing the beliefs based on how they are reached from the given initial belief state. The third approach is motivated by the observation that beliefs which are the most overestimated or underestimated have greater influence on the precision of the value function than other beliefs. We present an empirical evaluation illustrating how the performance of point-based value iteration varies with these approaches.
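For context, the backup that point-based value iteration applies at each selected belief can be sketched as below. This is the generic textbook-style point-based backup, not the belief-selection methods studied in the paper. The array layouts `T[a, s, s']`, `O[a, s', o]`, and `R[a, s]`, the function name `point_based_backup`, and the toy model are assumptions made for illustration.

```python
import numpy as np

def point_based_backup(belief, alphas, T, O, R, gamma):
    """Return the single best backed-up value vector at one belief point.

    alphas: current value vectors, shape (k, n_states).
    """
    n_actions = T.shape[0]
    n_obs = O.shape[2]
    best_value, best_alpha = -np.inf, None
    for a in range(n_actions):
        alpha_a = R[a].astype(float)          # immediate reward vector (copy)
        for o in range(n_obs):
            # g[i, s] = gamma * sum_{s'} T[a, s, s'] * O[a, s', o] * alphas[i, s']
            g = gamma * (alphas * O[a][:, o]) @ T[a].T
            # Keep only the vector that is best at this belief for this observation.
            alpha_a += g[np.argmax(g @ belief)]
        value = float(alpha_a @ belief)
        if value > best_value:
            best_value, best_alpha = value, alpha_a
    return best_alpha

# Hypothetical two-state, two-action, two-observation model.
T = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.5, 0.5], [0.5, 0.5]]])
O = np.array([[[0.7, 0.3], [0.1, 0.9]],
              [[0.5, 0.5], [0.5, 0.5]]])
R = np.array([[1.0, 0.0],
              [0.0, 1.0]])
alphas = np.array([[1.0, 1.0]])               # constant initial value function
b = np.array([0.2, 0.8])
new_alpha = point_based_backup(b, alphas, T, O, R, gamma=0.9)
```

Because each backup costs roughly the same regardless of which belief it is applied to, the choice of belief points largely determines how much each backup improves the approximation, which is the question the abstract addresses.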