Constraint-based dynamic programming for decentralized POMDPs with structured interactions

Point-based policy generation for decentralized POMDPs

2010

Memory-bounded techniques have shown great promise in solving complex multi-agent planning problems modeled as DEC-POMDPs. Much of the performance gain can be attributed to pruning techniques that alleviate the complexity of the exhaustive backup step of the original MBDP algorithm. Despite these improvements, state-of-the-art algorithms can still handle only a relatively small pool of candidate policies, which limits solution quality on some benchmark problems. We present a new algorithm, Point-Based Policy Generation, which avoids searching the entire joint policy space altogether. The key observation is that the best joint policy for each reachable belief state can be constructed directly, instead of first producing a large set of candidates. We also provide an efficient approximate implementation of this operation. The experimental results show that our solution technique improves performance significantly in terms of both runtime and solution quality.
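
To make the key observation concrete, here is a minimal sketch of a per-belief backup that builds the best joint policy tree directly at a sampled belief point instead of enumerating all candidates first. It is not the authors' exact procedure; the model accessors (reward, obs_prob, belief_update) and the subtrees value table are hypothetical placeholders for a DEC-POMDP model.

def point_based_backup(belief, joint_actions, joint_observations, subtrees, model, gamma=1.0):
    # `belief` maps states to probabilities; `subtrees` maps each previously
    # kept joint policy tree to a callable value(belief) from the prior DP stage.
    best_value, best_tree = float("-inf"), None
    for a in joint_actions:
        # Expected immediate reward under the belief.
        value = sum(p * model.reward(s, a) for s, p in belief.items())
        children = {}
        for z in joint_observations:
            p_z = model.obs_prob(belief, a, z)          # hypothetical: Pr(z | b, a)
            if p_z == 0.0:
                continue
            b_next = model.belief_update(belief, a, z)  # hypothetical Bayes update
            # Pick the best previously kept subtree for this successor belief.
            sub, sub_val = max(((t, v(b_next)) for t, v in subtrees.items()),
                               key=lambda tv: tv[1])
            children[z] = sub
            value += gamma * p_z * sub_val
        if value > best_value:
            best_value, best_tree = value, (a, children)
    return best_tree, best_value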

Towards Computing Optimal Policies for Decentralized POMDPs

2002

The problem of deriving joint policies for a group of agents that maximize some joint reward function can be modeled as a decentralized partially observable Markov decision process (DEC-POMDP). Significant algorithms have been developed for single-agent POMDPs; however, with a few exceptions, effective algorithms for deriving policies for decentralized POMDPs have not been developed. As a first step, we present new algorithms for solving decentralized POMDPs. In particular, we describe an exhaustive search algorithm for a globally optimal solution and analyze the complexity of this algorithm, which we find to be doubly exponential in the number of agents and time, highlighting the importance of more feasible approximations. We define a class of algorithms which we refer to as "Joint Equilibrium-based Search for Policies" (JESP) and describe an exhaustive algorithm and a dynamic programming algorithm for JESP. Finally, we empirically compare the exhaustive JESP algorithm with the globally optimal exhaustive algorithm.

Taming decentralized POMDPs: Towards efficient policy computation for multiagent settings

2003

The problem of deriving joint policies for a group of agents that maximize some joint reward function can be modeled as a decentralized partially observable Markov decision process (POMDP). Yet, despite the growing importance and applications of decentralized POMDP models in the multiagent arena, few algorithms have been developed for efficiently deriving joint policies for these models. This paper presents a new class of locally optimal algorithms called "Joint Equilibrium-based Search for Policies" (JESP). We first describe an exhaustive version of JESP and subsequently a novel dynamic programming approach to JESP. Our complexity analysis reveals the potential for exponential speedups due to the dynamic programming approach. These theoretical results are verified via empirical comparisons of the two JESP versions with each other and with a globally optimal brute-force search algorithm. Finally, we prove piecewise linearity and convexity (PWLC) properties, thus taking steps towards developing algorithms for continuous belief states.
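
The core loop behind both JESP variants is an alternating best-response search: hold all but one agent's policy fixed, replace that agent's policy with a best response, and stop when no single agent can improve. A minimal sketch follows; evaluate and best_response are hypothetical helpers standing in for the exhaustive or dynamic-programming best-response computations described in the two entries above.

def jesp(initial_joint_policy, num_agents, evaluate, best_response, max_iters=100):
    # Alternating best responses until a joint equilibrium (local optimum) is reached.
    joint = list(initial_joint_policy)
    value = evaluate(joint)
    for _ in range(max_iters):
        improved = False
        for i in range(num_agents):
            candidate = list(joint)
            candidate[i] = best_response(i, joint)      # other agents' policies held fixed
            candidate_value = evaluate(candidate)
            if candidate_value > value:
                joint, value, improved = candidate, candidate_value, True
        if not improved:
            break   # no unilateral change helps: locally optimal joint policy
    return joint, value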

Planning with Macro-Actions in Decentralized POMDPs

International Conference on Autonomous Agents & Multiagent Systems

Decentralized partially observable Markov decision processes (Dec-POMDPs) are general models for decentralized decision making under uncertainty. However, they typically model a problem at a low level of granularity, where each agent's actions are primitive operations lasting exactly one time step. We address the case where each agent has macro-actions: temporally extended actions which may require different amounts of time to execute. We model macro-actions as options in a factored Dec-POMDP model, focusing on options which depend only on information available to an individual agent while executing. This enables us to model systems where coordination decisions only occur at the level of deciding which macro-actions to execute, and the macro-actions themselves can then be executed to completion. The core technical difficulty when using options in a Dec-POMDP is that the options chosen by the agents no longer terminate at the same time. We present extensions of two leading Dec-POMDP algorithms for generating a policy with options and discuss the resulting form of optimality. Our results show that these algorithms retain agent coordination while allowing near-optimal solutions to be generated for significantly longer horizons and larger state spaces than previous Dec-POMDP methods.
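
As a point of reference for what a macro-action looks like in this setting, here is a minimal sketch of an agent-level option in the standard options style, where everything the option consults is local to the executing agent. The field names and the step_option helper are illustrative, not the paper's API.

from dataclasses import dataclass
from typing import Callable, Set
import random

@dataclass
class Option:
    initiation_set: Set[str]                   # local histories where the option may be started
    policy: Callable[[str], str]               # local observation history -> primitive action
    termination_prob: Callable[[str], float]   # probability the option terminates at this history

def step_option(option: Option, local_history: str, rng: random.Random):
    # Execute one primitive step of the option; report whether it terminated.
    action = option.policy(local_history)
    terminated = rng.random() < option.termination_prob(local_history)
    return action, terminated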

Modeling and Planning with Macro-Actions in Decentralized POMDPs

Journal of Artificial Intelligence Research, 2019

Decentralized partially observable Markov decision processes (Dec-POMDPs) are general models for decentralized multi-agent decision making under uncertainty. However, they typically model a problem at a low level of granularity, where each agent's actions are primitive operations lasting exactly one time step. We address the case where each agent has macro-actions: temporally extended actions that may require different amounts of time to execute. We model macro-actions as options in a Dec-POMDP, focusing on actions that depend only on information directly available to the agent during execution. Therefore, we model systems where coordination decisions only occur at the level of deciding which macro-actions to execute. The core technical difficulty in this setting is that the options chosen by each agent no longer terminate at the same time. We extend three leading Dec-POMDP algorithms for policy generation to the macro-action case, and demonstrate their effectiveness in both sta...

Point-based Dynamic Programming for DEC-POMDPs

2006

We introduce point-based dynamic programming (DP) for decentralized partially observable Markov decision processes (DEC-POMDPs), a new discrete DP algorithm for planning strategies for cooperative multi-agent systems. Our approach makes a connection between optimal DP algorithms for partially observable stochastic games and point-based approximations for single-agent POMDPs. We show for the first time how relevant multi-agent belief states can be computed. Building on this insight, we then show how the linear programming part in current multi-agent DP algorithms can be avoided, and how multi-agent DP can thus be applied to solve larger problems. We derive both an optimal and an approximate version of our algorithm, and we show its efficiency on test examples from the literature.
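
The way the linear-programming step can be avoided, roughly, is to test candidate policy trees only at the sampled multi-agent belief points rather than over the whole belief simplex: a tree is kept exactly when it is a maximizer at some point. The sketch below illustrates that idea under the assumption that values[t] is a callable returning a tree's expected value at a belief point; it is a simplification, not the paper's full procedure.

def point_based_prune(candidate_trees, belief_points, values):
    # Keep only trees that are best at some sampled multi-agent belief point.
    kept = set()
    for b in belief_points:
        best_tree = max(candidate_trees, key=lambda t: values[t](b))
        kept.add(best_tree)
    return kept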

Bounded Dynamic Programming for Decentralized POMDPs

2007

Solving decentralized POMDPs (DEC-POMDPs) optimally is a very hard problem. As a result, several approximate algorithms have been developed, but these do not have satisfactory error bounds. In this paper, we first discuss optimal dynamic programming and some approximate finite horizon DEC-POMDP algorithms. We then present a bounded dynamic programming algorithm. Given a problem and an error bound, the algorithm will return a solution within that bound when it is able to solve the problem. We give a proof of this bound and provide some experimental results showing high quality solutions to large DEC-POMDPs for large horizons.
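
To give a flavor of how an error bound of this kind can arise, the sketch below shows an epsilon-pruning step over a set of policy trees: a tree is discarded only if some kept tree is within epsilon of it at every belief point considered, so each pruning stage can cost at most epsilon in value. This is a simplification checked only at sampled belief points (the paper's test and bound are stated more carefully); trees, values, and the sampled points are hypothetical inputs.

def epsilon_prune(trees, belief_points, values, epsilon):
    # Consider stronger trees first so they are available to dominate weaker ones.
    ordered = sorted(trees, key=lambda t: -max(values[t](b) for b in belief_points))
    kept = []
    for t in ordered:
        dominated = any(
            all(values[k](b) >= values[t](b) - epsilon for b in belief_points)
            for k in kept
        )
        if not dominated:
            kept.append(t)
    return kept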

Incremental Clustering and Expansion for Faster Optimal Planning in Decentralized POMDPs

2013

This article presents the state-of-the-art in optimal solution methods for decentralized partially observable Markov decision processes (Dec-POMDPs), which are general models for collaborative multiagent planning under uncertainty. Building on the generalized multiagent A* (GMAA*) algorithm, which reduces the problem to a tree of one-shot collaborative Bayesian games (CBGs), we describe several advances that greatly expand the range of Dec-POMDPs that can be solved optimally. First, we introduce lossless incremental clustering of the CBGs solved by GMAA*, which achieves exponential speedups without sacrificing optimality. Second, we introduce incremental expansion of nodes in the GMAA* search tree, which avoids the need to expand all children, the number of which is in the worst case doubly exponential in the node's depth. This is particularly beneficial when little clustering is possible. In addition, we introduce new hybrid heuristic representations that are more compact and th...

Improved Memory-Bounded Dynamic Programming for Decentralized POMDPs

Memory-Bounded Dynamic Programming (MBDP) has proved extremely effective in solving decentralized POMDPs with large horizons. We generalize the algorithm and improve its scalability by reducing the complexity with respect to the number of observations from exponential to polynomial. We derive error bounds on solution quality with respect to this new approximation and analyze the convergence behavior. To evaluate the effectiveness of the improvements, we introduce a new, larger benchmark problem. Experimental results show that despite the high complexity of decentralized POMDPs, scalable solution techniques such as MBDP perform surprisingly well.
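
What keeps MBDP-style algorithms memory-bounded is the selection step after each backup: of the many candidate policy trees produced, only a small fixed number are retained, chosen as the best joint trees at belief states generated by a heuristic. A minimal sketch of that step follows; the names (candidate_joint_trees, heuristic_beliefs, joint_value, max_trees) are illustrative, not the paper's implementation.

def select_trees(candidate_joint_trees, heuristic_beliefs, joint_value, max_trees):
    # Retain at most `max_trees` joint policy trees: for each heuristic belief,
    # keep the best not-yet-selected candidate.
    selected = []
    for b in heuristic_beliefs:
        if len(selected) >= max_trees:
            break
        remaining = [t for t in candidate_joint_trees if t not in selected]
        if not remaining:
            break
        selected.append(max(remaining, key=lambda t: joint_value(t, b)))
    return selected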

Scalable Planning and Learning for Multiagent POMDPs

Online, sample-based planning algorithms for POMDPs have shown great promise in scaling to problems with large state spaces, but they become intractable for large action and observation spaces. This is particularly problematic in multiagent POMDPs, where the action and observation spaces grow exponentially with the number of agents. To combat this intractability, we propose a novel scalable approach based on sample-based planning and factored value functions that exploits structure present in many multiagent settings. This approach applies not only in the planning case, but also in the Bayesian reinforcement learning setting. Experimental results show that we are able to provide high-quality solutions to large multiagent planning and learning problems.

2.1 Multiagent POMDPs

An MPOMDP (Messias, Spaan, and Lima, 2011) is a multiagent planning model that unfolds over a number of steps. At every stage, agents take individual actions and receive individual observations. However, in an MPOMDP, all individual observations are shared via communication, allowing the team of agents to act in a 'centralized manner'. We will restrict ourselves to the setting where such communication is free of noise, costs, and delays. An MPOMDP is a tuple ⟨I, S, {A_i}, T, R, {Z_i}, O, h⟩ with: I, a set of agents; S, a set of states with designated initial state distribution b_0; A = ×_i A_i, the set of joint actions, built from the action set A_i of each agent i; T, the state transition probabilities T(s, a, s') = Pr(s' | s, a), the probability of transitioning from state s to s' when joint action a is taken by the agents; R, a reward function R(s, a), the immediate reward for being in state s and taking joint action a; Z = ×_i Z_i, the set of joint observations, built from the observation set Z_i of each agent i; O, the observation probabilities O(a, s', z) = Pr(z | a, s'), the probability of observing z given that joint action a was taken and resulted in state s'; and h, the horizon. An MPOMDP can be reduced to a POMDP with a single centralized controller that takes joint actions and receives joint observations (Pynadath and Tambe, 2002). Therefore, MPOMDPs can be solved with POMDP solution methods, some of which will be described in the remainder of this section. However, such approaches do not exploit the particular structure inherent to many MASs. In Sec. 4, we present a first online planning method that overcomes this deficiency.
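
Since the reduction above turns an MPOMDP into a centralized POMDP over joint actions and joint observations, the joint belief can be maintained with the standard Bayes filter. A minimal sketch, assuming T and O are callables matching the tuple definition above and that beliefs are dictionaries from states to probabilities:

def joint_belief_update(belief, a, z, states, T, O):
    # b'(s') is proportional to O(a, s', z) * sum_s T(s, a, s') * b(s).
    unnormalized = {
        s_next: O(a, s_next, z) * sum(T(s, a, s_next) * belief[s] for s in states)
        for s_next in states
    }
    norm = sum(unnormalized.values())   # equals Pr(z | b, a)
    if norm == 0.0:
        raise ValueError("Observation has zero probability under this belief and action.")
    return {s: p / norm for s, p in unnormalized.items()}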