Active Learning in Partially Observable Markov Decision Processes
2005
https://doi.org/10.1007/11564096_59
Abstract
Learning in Partially Observable Markov Decision Processes is a notoriously difficult problem. The goal of our research is to address this problem for environments in which a partial model may be available at the outset, but in which there is uncertainty about the model parameters. We developed an algorithm called MEDUSA, which is based on ideas from active learning [1,2,3].
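The abstract only names the approach, so here is a minimal sketch of the underlying idea, assuming a Dirichlet-count representation of model uncertainty and an oracle that can reveal the hidden state; the class and method names are illustrative, not the authors' implementation.

```python
import numpy as np

class DirichletPOMDPModel:
    """Illustrative sketch: Dirichlet uncertainty over POMDP transition
    and observation parameters, updated from oracle-labelled transitions.
    This is not the exact MEDUSA algorithm, only the general idea."""

    def __init__(self, n_states, n_actions, n_obs, prior=1.0):
        # alpha_T[s, a, s'] and alpha_O[a, s', z] are Dirichlet counts.
        self.alpha_T = np.full((n_states, n_actions, n_states), prior)
        self.alpha_O = np.full((n_actions, n_states, n_obs), prior)

    def query_update(self, s, a, s_next, z, weight=1.0):
        # After an oracle query reveals the hidden states s and s',
        # increment the corresponding Dirichlet counts.
        self.alpha_T[s, a, s_next] += weight
        self.alpha_O[a, s_next, z] += weight

    def sample_model(self, rng):
        # Sample one concrete POMDP model from the current posterior;
        # a planner can then be run on the sampled model.
        T = np.array([[rng.dirichlet(self.alpha_T[s, a])
                       for a in range(self.alpha_T.shape[1])]
                      for s in range(self.alpha_T.shape[0])])
        O = np.array([[rng.dirichlet(self.alpha_O[a, sp])
                       for sp in range(self.alpha_O.shape[1])]
                      for a in range(self.alpha_O.shape[0])])
        return T, O
```

In a loop of this kind, a planner would repeatedly call sample_model, solve the sampled POMDP, act, and issue query_update only when the expected value of the information justifies the query cost.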
Related papers
A Bayesian approach for learning and planning in partially observable Markov decision processes
2011
Abstract Bayesian learning methods have recently been shown to provide an elegant solution to the exploration-exploitation trade-off in reinforcement learning. However most investigations of Bayesian reinforcement learning to date focus on the standard Markov Decision Processes (MDPs). The primary focus of this paper is to extend these ideas to the case of partially observable domains, by introducing the Bayes-Adaptive Partially Observable Markov Decision Processes.
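As a rough illustration of the Bayes-Adaptive construction (the particle structure and field names below are assumptions for the sketch, not the paper's notation), the unknown dynamics are folded into the hidden state as Dirichlet counts, so a belief becomes a distribution over (state, counts) pairs:

```python
import numpy as np

class BAPOMDPParticle:
    """Illustrative particle for a Bayes-Adaptive POMDP belief: the
    physical state is hidden, and the unknown dynamics travel along as
    Dirichlet counts inside each particle."""

    def __init__(self, state, trans_counts, obs_counts):
        self.state = state                # hidden physical state
        self.trans_counts = trans_counts  # counts[s, a, s']
        self.obs_counts = obs_counts      # counts[a, s', z]

    def step(self, action, rng):
        # Sample the next state and observation from the mean model
        # implied by the counts, then update this particle's counts.
        p_next = self.trans_counts[self.state, action]
        p_next = p_next / p_next.sum()
        s_next = rng.choice(len(p_next), p=p_next)

        p_obs = self.obs_counts[action, s_next]
        p_obs = p_obs / p_obs.sum()
        z = rng.choice(len(p_obs), p=p_obs)

        self.trans_counts[self.state, action, s_next] += 1
        self.obs_counts[action, s_next, z] += 1
        self.state = s_next
        return z
```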
2006
Abstract When an agent evolves in a partially observable environment, it has to deal with uncertainties when choosing its actions. An efficient way to model such environments is to use partially observable Markov decision processes (POMDPs). Many algorithms have been developed for POMDPs. Some use an offline approach, learning a complete policy before the execution. Others use an online approach, constructing the policy online for the current belief state.
Learning without state-estimation in partially observable Markovian decision problems
ICML, 1994
Reinforcement learning (RL) algorithms provide a sound theoretical basis for building learning control architectures for embedded agents. Unfortunately, all of the theory and much of the practice (see for an exception) of RL is limited to Markovian decision processes (MDPs). Many real-world decision tasks, however, are inherently non-Markovian, i.e., the state of the environment is only incompletely known to the learning agent. In this paper we consider only partially observable MDPs (POMDPs), a useful class of non-Markovian decision processes. Most previous approaches to such problems have combined computationally expensive state-estimation techniques with learning control. This paper investigates learning in POMDPs without resorting to any form of state estimation. We present results about what TD(0) and Q-learning will do when applied to POMDPs. It is shown that the conventional discounted RL framework is inadequate to deal with POMDPs. Finally we develop a new framework for learning without state-estimation in POMDPs by including stochastic policies in the search space, and by defining the value or utility of a distribution over states.
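To make the "no state estimation" setting concrete, here is a generic memoryless Q-learning sketch that treats each observation as if it were a state; this is the kind of learner whose limitations the paper analyzes, and the environment interface (reset() returning an observation, step() returning observation, reward, done) is an assumption of the sketch.

```python
import numpy as np

def memoryless_q_learning(env, n_obs, n_actions, episodes=500,
                          alpha=0.1, gamma=0.95, epsilon=0.1, seed=0):
    """Q-learning that treats the current observation as if it were the
    state. In a POMDP this is generally unsound, which is the point the
    paper makes; env is assumed to expose reset() and step(a)."""
    rng = np.random.default_rng(seed)
    q = np.zeros((n_obs, n_actions))
    for _ in range(episodes):
        obs, done = env.reset(), False
        while not done:
            if rng.random() < epsilon:
                a = int(rng.integers(n_actions))
            else:
                a = int(np.argmax(q[obs]))
            next_obs, reward, done = env.step(a)
            target = reward + (0 if done else gamma * np.max(q[next_obs]))
            q[obs, a] += alpha * (target - q[obs, a])
            obs = next_obs
    return q
```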
Partially observable Markov decision processes with imprecise parameters
Artificial Intelligence, 2007
This study extends the framework of partially observable Markov decision processes (POMDPs) to allow their parameters, i.e., the probability values in the state transition functions and the observation functions, to be imprecisely specified. It is shown that this extension can reduce the computational costs associated with the solution of these problems. First, the new framework, POMDPs with imprecise parameters (POMDPIPs), is formulated. We consider (1) the interval case, in which each parameter is imprecisely specified by an interval that indicates possible values of the parameter, and (2) the point-set case, in which each probability distribution is imprecisely specified by a set of possible distributions. Second, a new optimality criterion for POMDPIPs is introduced. As in POMDPs, the criterion is to regard a policy, i.e., an action-selection rule, as optimal if it maximizes the expected total reward. The expected total reward, however, cannot be calculated precisely in POMDPIPs, because of the parameter imprecision. Instead, we estimate the total reward by adopting arbitrary second-order beliefs, i.e., beliefs in the imprecisely specified state transition functions and observation functions. Although there are many possible choices for these second-order beliefs, we regard a policy as optimal as long as there is at least one of such choices with which the policy maximizes the total reward. Thus there can be multiple optimal policies for a POMDPIP. We regard these policies as equally optimal, and aim at obtaining one of them. By appropriately choosing which second-order beliefs to use in estimating the total reward, computational costs incurred in obtaining such an optimal policy can be reduced significantly. We provide an exact solution algorithm for POMDPIPs that does this efficiently. Third, the performance of such an optimal policy, as well as the computational complexity of the algorithm, are analyzed theoretically. Last, empirical studies show that our algorithm quickly obtains satisfactory policies to many POMDPIPs.
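A small illustration of the interval case (the helper name and tolerance handling are assumptions of this sketch): each transition distribution is only constrained to an element-wise box of intervals, and any second-order belief must pick a proper probability distribution inside that box.

```python
import numpy as np

def in_interval_credal_set(p, lower, upper, tol=1e-9):
    """Check that a candidate distribution p over next states lies in the
    set described by element-wise interval bounds (illustrative helper;
    the paper's solver chooses such a p to its own computational advantage)."""
    p, lower, upper = map(np.asarray, (p, lower, upper))
    return (abs(p.sum() - 1.0) <= tol
            and np.all(p >= lower - tol)
            and np.all(p <= upper + tol))

# Example: T(. | s, a) known only up to intervals.
lower = [0.6, 0.1, 0.0]
upper = [0.8, 0.3, 0.2]
print(in_interval_credal_set([0.7, 0.2, 0.1], lower, upper))  # True
```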
Learning to Act in Continuous Dec-POMDPs
2018
We address a long-standing open problem of reinforcement learning in continuous decentralized partially observable Markov decision processes. Previous attempts focused on different forms of generalized policy iteration, which at best led to local optima. In this paper, we restrict attention to plans, which are simpler to store and update than policies. We derive, under mild conditions, the first optimal cooperative multi-agent reinforcement learning algorithm. To achieve significant scalability gains, we replace the greedy maximization by mixed-integer linear programming. Experiments show our approach can learn to act optimally in many finite domains from the literature.
Learning in POMDPs with Monte Carlo Tree Search
2017
The POMDP is a powerful framework for reasoning under outcome and information uncertainty, but constructing an accurate POMDP model is difficult. Bayes-Adaptive Partially Observable Markov Decision Processes (BA-POMDPs) extend POMDPs to allow the model to be learned during execution. BA-POMDPs are a Bayesian RL approach that, in principle, allows for an optimal trade-off between exploitation and exploration. Unfortunately, BA-POMDPs are currently impractical to solve for any non-trivial domain. In this paper, we extend the Monte-Carlo Tree Search method POMCP to BA-POMDPs and show that the resulting method, which we call BA-POMCP, is able to tackle problems that previous solution methods have been unable to solve. Additionally, we introduce several techniques that exploit the BA-POMDP structure to improve the efficiency of BA-POMCP along with proof of their convergence.
Active Learning of Dynamic Bayesian Networks in Markov Decision Processes
Lecture Notes in Computer Science
Several recent techniques for solving Markov decision processes use dynamic Bayesian networks to compactly represent tasks. The dynamic Bayesian network representation may not be given, in which case it is necessary to learn it if one wants to apply these techniques. We develop an algorithm for learning dynamic Bayesian network representations of Markov decision processes using data collected through exploration in the environment. To accelerate data collection we develop a novel scheme for active learning of the networks. We assume that it is not possible to sample the process in arbitrary states, only along trajectories, which prevents us from applying existing active learning techniques. Our active learning scheme selects actions that maximize the total entropy of distributions used to evaluate potential refinements of the networks.
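A hedged sketch of entropy-driven action selection, simplified to Dirichlet counts over next-state distributions rather than the paper's DBN refinement machinery; the criterion and function name are illustrative assumptions.

```python
import numpy as np
from scipy.stats import dirichlet

def most_uncertain_action(alpha_counts):
    """alpha_counts[a][s] is a Dirichlet count vector over next states for
    taking action a in state s. Returns the action whose parameters have
    the largest summed Dirichlet entropy, a simplified stand-in for the
    paper's total-entropy objective."""
    totals = []
    for per_state in alpha_counts:
        totals.append(sum(dirichlet.entropy(alpha) for alpha in per_state))
    return int(np.argmax(totals))

counts = [
    [np.array([5.0, 1.0]), np.array([4.0, 2.0])],  # action 0: well explored
    [np.array([1.0, 1.0]), np.array([1.0, 1.0])],  # action 1: unexplored
]
print(most_uncertain_action(counts))  # 1
```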
Learning and solving partially observable Markov decision processes
2007
Partially Observable Markov Decision Processes (POMDPs) provide a rich representation for agents acting in a stochastic domain under partial observability. POMDPs optimally balance key properties such as the need for information and the sum of collected rewards. However, POMDPs are difficult to use for two reasons: first, it is difficult to obtain the environment dynamics, and second, even given the environment dynamics, solving POMDPs optimally is intractable. This dissertation deals with both difficulties. We begin with a number of methods for learning POMDPs. Methods for learning POMDPs are usually categorized as either model-free or model-based. We show how model-free methods fail to provide good policies as noise in the environment increases. We continue to suggest how to transform model-free into model-based methods, thus improving their solution. This transformation is first demonstrated in an offline process, after the model-free method has computed a policy, and then in an online setting, where a model of the environment is learned together with a policy through interactions with the environment. The second part of the dissertation focuses on ways to solve predefined POMDPs. Point-based methods for computing value functions have shown a great potential for solving large scale POMDPs. We provide a number of new algorithms that outperform existing point-based methods. We first show how properly ordering the value function updates can greatly reduce the required number of updates. We then present a trial-based algorithm that outperforms all current point-based algorithms. Due to the success of point-based algorithms on large domains, a need arises for compact representations of the environment. We thoroughly investigate the use of Algebraic Decision Diagrams (ADDs) for representing system dynamics. We show how all operations required for point-based algorithms can be implemented efficiently using ADDs.
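For context, a standard point-based backup at a single belief point looks roughly as follows; this is the generic operation that the dissertation's update-ordering and trial-based algorithms build on, with array shapes assumed as noted in the docstring.

```python
import numpy as np

def point_based_backup(b, Gamma, T, O, R, gamma=0.95):
    """One standard point-based backup at belief b (generic sketch, not the
    dissertation's specific algorithms). Assumed shapes: T[a, s, s'] is the
    transition model, O[a, s', z] the observation model, R[a, s] the reward,
    and Gamma is a list of alpha-vectors (arrays over states)."""
    n_actions, n_states, _ = T.shape
    n_obs = O.shape[2]
    best_vec, best_val = None, -np.inf
    for a in range(n_actions):
        g_a = R[a].astype(float)
        for z in range(n_obs):
            # Project each alpha-vector back through action a, observation z.
            projections = [T[a] @ (O[a, :, z] * alpha) for alpha in Gamma]
            # Keep the projection that is best at this particular belief.
            g_a = g_a + gamma * max(projections, key=lambda v: b @ v)
        if b @ g_a > best_val:
            best_vec, best_val = g_a, b @ g_a
    return best_vec
```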
A Bayesian Approach to Model Learning in Non-Markovian Environments
Most reinforcement learning (RL) algorithms assume that the learning processes of embedded agents can be formulated as Markov Decision Processes (MDPs). However, the assumption is not valid for many realistic problems. Therefore, research on RL techniques for non-Markovian environments is gaining more attention recently. We have developed a Bayesian approach to RL in non-Markovian environments, in which the environment is modeled as a history tree model, a stochastic model with variable memory length. In our approach, given a class of history trees, the agent explores the environment and learns the maximum a posteriori (MAP) model on the basis of Bayesian statistics. The optimal policy can be computed by Dynamic Programming, after the agent has learned the environment model. Unlike many other model learning techniques, our approach does not suffer from the problems of noise and overfitting, thanks to the Bayesian framework. We have analyzed the asymptotic behavior of the proposed algorithm and have proved that if the given class contains the exact model of the environment, the model learned by our algorithm converges to it. We also present the results of our experiments in two non-Markovian environments.
Learning to explore and exploit in POMDPs
2009
A fundamental objective in reinforcement learning is the maintenance of a proper balance between exploration and exploitation. This problem becomes more challenging when the agent can only partially observe the states of its environment. In this paper we propose a dual-policy method for jointly learning the agent behavior and the balance between exploration and exploitation, in partially observable environments. The method subsumes traditional exploration, in which the agent takes actions to gather information about the environment, and active learning, in which the agent queries an oracle for optimal actions (with an associated cost for employing the oracle). The form of the employed exploration is dictated by the specific problem. Theoretical guarantees are provided concerning the optimality of the balancing of exploration and exploitation. The effectiveness of the method is demonstrated by experimental results on benchmark problems.
References (5)
- Anderson, B. and Moore, A. "Active Learning in HMMs". ICML 2005.
- Cohn, D. A., Ghahramani, Z. and Jordan, M. I. "Active Learning with Statistical Models". NIPS 1996.
- Dearden, R., Friedman, N. and Andre, D. "Model Based Bayesian Exploration". UAI 1999.
- Jaulmes, R., Pineau, J., Precup, D. "Active learning in Partially Observable Markov Decision Processes". ECML 2005.
- Jaulmes, R., Pineau, J., Precup, D. "Learning in non-stationary Partially Observable Markov Decision Processes". ECML 2005 Workshop on learning in non-stationary environments.
Related papers
Lecture Notes in Computer Science, 2012
We consider the active learning problem of inferring the transition model of a Markov Decision Process by acting and observing transitions. This is particularly useful when no reward function is a priori defined. Our proposal is to cast the active learning task as a utility maximization problem using Bayesian reinforcement learning with belief-dependent rewards. After presenting three possible performance criteria, we derive from them the belief-dependent rewards to be used in the decision-making process. As computing the optimal Bayesian value function is intractable for large horizons, we use a simple algorithm to approximately solve this optimization problem. Despite the sub-optimality of this technique, we show experimentally that our proposal is efficient in a number of domains.
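One simple belief-dependent reward of the kind this line of work uses is the negative entropy of the current belief, so that actions leading to more informative beliefs score higher; this particular choice is illustrative and not necessarily one of the criteria the paper derives.

```python
import numpy as np

def belief_entropy_reward(belief, eps=1e-12):
    """A simple belief-dependent reward: the negative Shannon entropy of
    the current belief, so more peaked (informative) beliefs score higher.
    Illustrative criterion only."""
    b = np.asarray(belief, dtype=float)
    b = b / b.sum()
    return float(np.sum(b * np.log(b + eps)))

print(belief_entropy_reward([0.25, 0.25, 0.25, 0.25]))  # most uncertain, lowest
print(belief_entropy_reward([0.97, 0.01, 0.01, 0.01]))  # most certain, highest
```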
Learning in non-stationary Partially Observable Markov Decision Processes
We study the problem of finding an optimal policy for a Partially Observable Markov Decision Process (POMDP) when the model is not perfectly known and may change over time. We present the algorithm MEDUSA+, which incrementally improves a POMDP model using selected queries, while still optimizing the reward. Empirical results show the response of the algorithm to changes in the parameters of a model: the changes are learned quickly and the agent still accumulates high reward throughout the process.
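A minimal sketch of one common way to track a changing model with count-based Bayesian learning, namely exponential forgetting of old Dirichlet counts; this mechanism is an assumption for illustration rather than MEDUSA+'s exact update.

```python
import numpy as np

def decayed_count_update(alpha, index, decay=0.99, weight=1.0):
    """Exponentially forget old evidence before adding the new observation,
    so a drifting model can be tracked (illustrative, assumed mechanism).
    alpha: Dirichlet count vector; index: outcome revealed by the query."""
    alpha = decay * np.asarray(alpha, dtype=float)
    alpha[index] += weight
    return alpha

alpha = np.array([8.0, 2.0])               # old belief: outcome 0 is likely
for _ in range(30):                        # the environment has drifted:
    alpha = decayed_count_update(alpha, 1) # outcome 1 keeps being observed
print(alpha / alpha.sum())                 # posterior mean now favours outcome 1
```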
Model-Based Online Learning of POMDPs
Lecture Notes in Computer Science, 2005
Learning to act in an unknown partially observable domain is a difficult variant of the reinforcement learning paradigm. Research in the area has focused on model-free methods, that is, methods that learn a policy without learning a model of the world. When sensor noise increases, model-free methods provide less accurate policies. The model-based approach, learning a POMDP model of the world and computing an optimal policy for the learned model, may generate superior results in the presence of sensor noise, but learning and solving a model of the environment is a difficult problem. We have previously shown how such a model can be obtained from the learned policy of model-free methods, but this approach implies a distinction between a learning phase and an acting phase that is undesirable. In this paper we present a novel method for learning a POMDP model online, based on McCallum's Utile Suffix Memory (USM), in conjunction with an approximate policy obtained using an incremental POMDP solver. We show that the incrementally improving policy provides superior results to the original USM algorithm, especially in the presence of increasing sensor and action noise.
A Partially-Observable Markov Decision Process for Dealing with Dynamically Changing Environments
Lecture Notes in Computer Science, 2014
Partially Observable Markov Decision Processes (POMDPs) have been met with great success in planning domains where agents must balance actions that provide knowledge and actions that provide reward. Recently, nonparametric Bayesian methods have been successfully applied to POMDPs to obviate the need of a priori knowledge of the size of the state space, allowing to assume that the number of visited states may grow as the agent explores its environment. These approaches rely on the assumption that the agent's environment remains stationary; however, in real-world scenarios the environment may change over time. In this work, we aim to address this inadequacy by introducing a dynamic nonparametric Bayesian POMDP model that both allows for automatic inference of the (distributional) representations of POMDP states, and for capturing non-stationarity in the modeled environments. Formulation of our method is based on imposition of a suitable dynamic hierarchical Dirichlet process (dHDP) prior over state transitions. We derive efficient algorithms for model inference and action planning and evaluate it on several benchmark tasks.
Probabilistic robot planning under model uncertainty: an active learning approach
2005
While recent POMDP techniques have been successfully applied to the problem of robot control under uncertainty, they typically assume a known (and stationary) model of the environment. In this paper, we study the problem of finding an optimal policy for controlling a robot in a partially observable domain, where the model is not perfectly known, and may change over time. We present an algorithm called MEDUSA which incrementally learns a POMDP model using oracle queries, while still optimizing a reward function. We demonstrate the effectiveness of the approach for realistic robot planning scenarios, with minimal a priori knowledge of the model.
Soft Methodology and Random Information Systems, 2004
In this paper, we investigate the conditions under which dynamic programming yields a solution to simultaneous learning and optimal control of a Markov decision process. First, we introduce a new optimality criterion that allows act-state dependence. This criterion is based on a partial preference ordering induced by an imprecise probability model of the dynamics of the system, updated by observations of the state and control history of the system. Then, we show that dynamic programming yields the set of all optimal solutions if the imprecise probability model satisfies particular properties. When we model learning of the system dynamics by an imprecise Dirichlet model, these properties turn out to be satisfied.
Partially Observable Markov Decision Processes
Springer eBooks, 2011
Partially Observable Markov Decision Processes (POMDPs) provide a general framework for AI planning, but they lack the structure for representing real world planning problems in a convenient and efficient way. Representations built on logic allow for problems to be specified in a compact and transparent manner. Moreover, decision making algorithms can assume and exploit structure found in the state space, actions, observations, and success criteria, and can solve with relative efficiency problems with large state spaces. In recent years researchers have sought to combine the benefits of logic with the expressiveness of POMDPs. In this paper, we show how to build upon and extend the results in this fusing of logic and decision theory. In particular, we present a compact representation of POMDPs and a method to update beliefs after actions and observations. The key contribution is our compact representation of belief states and of the operations used to update them. We then use heuristic search to find optimal plans that maximize expected total reward given an initial belief state.
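For reference, the exact belief update that such compact representations must reproduce is the standard one, shown here in flat enumerated-state form (the array layout is an assumption of the sketch):

```python
import numpy as np

def belief_update(b, a, z, T, O):
    """Standard exact POMDP belief update: b'(s') is proportional to
    O(z | s', a) * sum_s T(s' | s, a) * b(s). Shown in flat form; the
    paper's contribution is performing this update over a compact
    logical representation instead. Assumed shapes: T[a, s, s'], O[a, s', z]."""
    b_pred = T[a].T @ b          # predicted distribution over next states
    b_new = O[a, :, z] * b_pred  # weight by the likelihood of observation z
    return b_new / b_new.sum()
```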
Scalable Bayesian Reinforcement Learning for Multiagent POMDPs
2013
Bayesian methods for reinforcement learning (RL) allow model uncertainty to be considered explicitly and offer a principled way of dealing with the exploration/exploitation tradeoff. However, for multiagent systems there have been few such approaches, and none of them apply to problems with state uncertainty. In this paper, we fill this gap by proposing a Bayesian RL framework for multiagent partially observable Markov decision processes that is able to take advantage of structure present in many problems. In this framework, a team of agents operates in a centralized fashion, but has uncertainty about the model of the environment. Fitting many real-world situations, we consider the case where agents learn the appropriate models while acting in an online fashion. Because it can quickly become intractable to choose the optimal action in naive versions of this online learning problem, we propose a more scalable approach based on sample-based search and factored value functions for the ...
Bayesian reinforcement learning in continuous POMDPs with application to robot navigation
2008
Abstract We consider the problem of optimal control in continuous and partially observable environments when the parameters of the model are not known exactly. Partially observable Markov decision processes (POMDPs) provide a rich mathematical model to handle such environments but require a known model to be solved by most approaches. This is a limitation in practice as the exact model parameters are often difficult to specify exactly.
Efficient Exploitation of Factored Domains in Bayesian Reinforcement Learning for POMDPs
2018
While the POMDP has proven to be a powerful framework to model and solve partially observable stochastic problems, it assumes accurate and complete knowledge of the environment. When such information is not available, as is the case in many real world applications, one must learn such a model. The BA-POMDP considers the model as part of the hidden state and explicitly considers the uncertainty over it, and as a result transforms the learning problem into a planning problem. This model, however, grows exponentially with the underlying POMDP size, and becomes intractable for non-trivial problems. In this article we propose a factored framework, the FBA-POMDP, which represents the model as a Bayes-Net, drastically decreasing the number of parameters required to describe the dynamics of the environment. We demonstrate that our approach allows solvers to tackle problems much larger than possible in the BA-POMDP.
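A toy calculation of why a factored, Bayes-Net style parameterisation shrinks the learning problem (the feature counts and parent structure below are made-up numbers, not the paper's domains):

```python
n_features, n_values = 10, 2                 # 2^10 = 1024 joint states

# Flat parameterisation: a parameter for every (joint state, joint next
# state) pair, per action (actions omitted from the count here).
flat_params = (n_values ** n_features) * (n_values ** n_features)

# Factored parameterisation: each next-state feature depends on only a
# few parent features, so its conditional table stays tiny.
parents_per_feature = 2
factored_params = n_features * (n_values ** parents_per_feature) * n_values

print(flat_params, factored_params)          # 1048576 vs 80
```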