Convergence to Pareto Optimality in General Sum Games via Learning Opponent's Preference

We consider the learning problem faced by two self-interested agents repeatedly playing a general-sum game in which the opponent's payoffs are unknown. The concept of Nash Equilibrium in repeated games provides an individually rational solution for playing such games and can be achieved by playing the Nash Equilibrium strategy for the single-shot game in every iteration. However, such a strategy can sometimes lead to a Pareto-dominated outcome for the repeated game. Our goal is to design learning strategies that converge to a Pareto-efficient outcome that also produces a Nash Equilibrium payoff for repeated two-player, n-action general-sum games. We present a learning algorithm, POSNEL, which learns the opponent's preference structure and produces, under self-play, Nash Equilibrium payoffs in the limit in all such games. We also show that such learning will generate Pareto-optimal payoffs in a large majority of games. We derive a probability bound for convergence to a Nash Equilibrium payoff and experimentally demonstrate convergence to Pareto optimality for all structurally distinct 2-player 2-action conflict games. We also compare our algorithm with existing algorithms such as WOLF-IGA and JAL and show that POSNEL, on average, outperforms both.
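The gap between a single-shot Nash Equilibrium and a Pareto-efficient outcome can be seen in a small example. The sketch below is not POSNEL; it only enumerates pure-strategy Nash equilibria and Pareto-optimal outcomes of a 2-action game. The Prisoner's Dilemma payoff values and the helper names `pure_nash` and `pareto_optimal` are assumptions chosen for illustration.

```python
from itertools import product

# Illustrative Prisoner's Dilemma payoffs (assumed values, not from the paper).
# Action 0 = cooperate, 1 = defect.
# PAYOFFS[(row_action, col_action)] = (row payoff, column payoff)
PAYOFFS = {
    (0, 0): (3, 3),
    (0, 1): (0, 5),
    (1, 0): (5, 0),
    (1, 1): (1, 1),
}
ACTIONS = (0, 1)


def pure_nash(payoffs, actions):
    """Return joint actions from which neither player gains by deviating alone."""
    equilibria = []
    for a, b in product(actions, repeat=2):
        row_ok = all(payoffs[(a, b)][0] >= payoffs[(x, b)][0] for x in actions)
        col_ok = all(payoffs[(a, b)][1] >= payoffs[(a, y)][1] for y in actions)
        if row_ok and col_ok:
            equilibria.append((a, b))
    return equilibria


def pareto_optimal(payoffs):
    """Return joint actions whose payoff pair is not Pareto-dominated by another."""
    def dominates(u, v):
        # u dominates v: at least as good for both players, strictly better for one.
        return all(x >= y for x, y in zip(u, v)) and any(x > y for x, y in zip(u, v))

    return [
        outcome
        for outcome, u in payoffs.items()
        if not any(dominates(v, u) for v in payoffs.values())
    ]


if __name__ == "__main__":
    print("Pure Nash equilibria:", pure_nash(PAYOFFS, ACTIONS))
    print("Pareto-optimal outcomes:", pareto_optimal(PAYOFFS))
```

With these assumed payoffs, mutual defection is the only pure Nash equilibrium of the stage game, yet it is the one outcome that is Pareto-dominated (by mutual cooperation).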

Reaching pareto-optimality in prisoner’s dilemma using conditional joint action learning

Autonomous Agents and Multi-Agent Systems, 2007

We consider the learning problem faced by two self-interested agents repeatedly playing a general-sum stage game. We assume that the players can observe each other's actions but not the payoffs received by the other player. The concept of Nash Equilibrium in repeated games provides an individually rational solution for playing such games and can be achieved by playing the Nash Equilibrium strategy for the single-shot game in every iteration. Such a strategy, however, can sometimes lead to a Pareto-Dominated outcome for games like Prisoner's Dilemma. So we prefer learning strategies that converge to a Pareto-Optimal outcome that also produces a Nash Equilibrium payoff for repeated two-player, n-action general-sum games. The Folk Theorem enables us to identify such outcomes. In this paper, we introduce the Conditional Joint Action Learner (CJAL), which learns the conditional probability of an action taken by the opponent given its own actions and uses it to decide its next course of action. We empirically show that, under self-play and if the payoff structure of the Prisoner's Dilemma game satisfies certain conditions, a CJAL learner using a random exploration strategy followed by a completely greedy exploitation technique will learn to converge to a Pareto-Optimal solution. We also show that such learning will generate Pareto-Optimal payoffs in a large majority of other two-player general-sum games. We compare the performance of CJAL with that of existing algorithms such as WOLF-PHC and JAL on all structurally distinct two-player conflict games with ordinal payoffs.
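The idea of estimating the opponent's action conditioned on one's own action can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' CJAL implementation: the class name, the count-based estimate, the uniform prior for unseen actions, and the fixed-length exploration phase are all hypothetical choices made here.

```python
import random


class ConditionalJointActionLearner:
    """Hypothetical sketch: estimate P(opponent action | own action) from counts."""

    def __init__(self, payoff, explore_rounds, seed=0):
        self.payoff = payoff                  # payoff[own][opp]: own payoff only
        self.n_own = len(payoff)
        self.n_opp = len(payoff[0])
        self.explore_rounds = explore_rounds  # assumed length of random exploration
        self.rng = random.Random(seed)
        self.counts = [[0] * self.n_opp for _ in range(self.n_own)]
        self.rounds = 0

    def conditional_prob(self, own, opp):
        """Empirical P(opp | own); uniform if `own` has never been played."""
        total = sum(self.counts[own])
        if total == 0:
            return 1.0 / self.n_opp
        return self.counts[own][opp] / total

    def expected_payoff(self, own):
        """Expected own payoff of `own` under the conditional estimate."""
        return sum(
            self.conditional_prob(own, opp) * self.payoff[own][opp]
            for opp in range(self.n_opp)
        )

    def choose(self):
        """Explore uniformly at random first, then act greedily."""
        if self.rounds < self.explore_rounds:
            return self.rng.randrange(self.n_own)
        return max(range(self.n_own), key=self.expected_payoff)

    def observe(self, own, opp):
        """Record the joint action seen in the round just played."""
        self.counts[own][opp] += 1
        self.rounds += 1
```

A usage example: with own payoffs `[[3, 0], [5, 1]]` (assumed Prisoner's Dilemma values), a learner that has only ever seen the opponent mirror its action estimates an expected payoff of 3 for cooperating and 1 for defecting, so the greedy phase picks cooperation. Whether two such learners actually reach that state depends on conditions the abstract mentions but does not spell out.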

Efficient learning in games

2006

We consider the problem of learning strategy selection in games. The theoretical solution to this problem is a distribution over strategies that responds to a Nash equilibrium of the game. When the payoff function of the game is not known to the participants, such a ...

Learning payoff functions in infinite games

Machine Learning, 2007

We consider a class of games with real-valued strategies and payoff information available only in the form of data from a given sample of strategy profiles. Solving such games with respect to the underlying strategy space requires generalizing from the data to a complete payoff-function representation. We address payoff-function learning as a standard regression problem, with provision for capturing known structure (symmetry) in the multiagent environment. To measure learning performance, we consider the relative utility of prescribed strategies, rather than the accuracy of payoff functions per se. We demonstrate our approach and evaluate its effectiveness on two examples: a two-player version of the first-price sealed-bid auction (with known analytical form), and a five-player market-based scheduling game (with no known solution).

Actor-Critic Algorithms for Learning Nash Equilibria in N-player General-Sum Games

arXiv (Cornell University), 2014

We consider the problem of finding stationary Nash equilibria (NE) in a finite discounted general-sum stochastic game. We first generalize a non-linear optimization problem from Filar and Vrieze [2004] to an N-player setting and break down this problem into simpler sub-problems that ensure there is no Bellman error for a given state and an agent. We then provide a characterization of solution points of these sub-problems that correspond to Nash equilibria of the underlying game and for this purpose, we derive a set of necessary and sufficient SG-SP (Stochastic Game-Sub-Problem) conditions. Using these conditions, we develop two actor-critic algorithms: OFF-SGSP (model-based) and ON-SGSP (model-free). Both algorithms use a critic that estimates the value function for a fixed policy and an actor that performs descent in the policy space using a descent direction that avoids local minima. We establish that both algorithms converge, in self-play, to the equilibria of a certain ordinary differential equation (ODE), whose stable limit points coincide with stationary NE of the underlying general-sum stochastic game. On a single-state non-generic game (see Hart and Mas-Colell [2005]) as well as on a synthetic two-player game setup with 810,000 states, we establish that ON-SGSP consistently outperforms NashQ [Hu and Wellman, 2003] and FFQ [Littman, 2001] algorithms.

Regret testing: A simple payoff-based procedure for learning Nash equilibrium

2006

A learning rule is uncoupled if a player does not condition his strategy on the opponent’s payoffs. It is radically uncoupled if a player does not condition his strategy on the opponent’s actions or payoffs. We demonstrate a family of simple, radically uncoupled learning rules whose period-by-period behavior comes arbitrarily close to Nash equilibrium behavior in any finite two-person game.

Learning with minimal information in continuous games

Theoretical Economics, 2020

While payoff‐based learning models are almost exclusively devised for finite action games, where players can test every action, it is harder to design such learning processes for continuous games. We construct a stochastic learning rule, designed for games with continuous action sets, which requires no sophistication from the players and is simple to implement: players update their actions according to variations in own payoff between current and previous action. We then analyze its behavior in several classes of continuous games and show that convergence to a stable Nash equilibrium is guaranteed in all games with strategic complements as well as in concave games, while convergence to Nash equilibrium occurs in all locally ordinal potential games as soon as Nash equilibria are isolated.

Learning to commit in repeated games

Proceedings of the fifth international joint conference on Autonomous agents and multiagent systems - AAMAS '06, 2006

Learning to converge to an efficient, i.e., Pareto-optimal, Nash equilibrium of the repeated game is an open problem in multiagent learning. Our goal is to facilitate the learning of efficient outcomes in repeated plays of incomplete information games when only the opponent's actions but not its payoffs are observable. We use a two-stage protocol that allows a player to unilaterally commit to an action, allowing the other player to choose an action knowing the action chosen by the committed player. The motivation behind commitment is to promote trust between the players and prevent them from making mutually harmful choices to preclude worst-case outcomes. Our agents learn whether commitment is beneficial or not. Interestingly, the decision to commit can be thought of as expanding the action space, and our proposed protocol can be incorporated into any learning strategy used for playing repeated games. We show the improvement of the outcome efficiency of standard learning algorithms when using our proposed commitment protocol. We propose convergence to a Pareto-optimal Nash equilibrium of repeated games as a desirable learning outcome. The performance evaluation in this paper uses a similarly motivated metric that measures the percentage of Nash equilibria for repeated games that dominate the observed outcome.

Learning with repeated-game strategies

Frontiers in Neuroscience, 2014

We use the self-tuning Experience Weighted Attraction model with repeated-game strategies as a computer testbed to examine the relative frequency, speed of convergence and progression of a set of repeated-game strategies in four symmetric 2 × 2 games: Prisoner's Dilemma, Battle of the Sexes, Stag-Hunt, and Chicken. In the Prisoner's Dilemma game, we find that the strategy with the most occurrences is the "Grim-Trigger." In the Battle of the Sexes game, a cooperative pair that alternates between the two pure-strategy Nash equilibria emerges as the one with the most occurrences. In the Stag-Hunt and Chicken games, the "Win-Stay, Lose-Shift" and "Grim-Trigger" strategies are the ones with the most occurrences. Overall, the pairs that converged quickly ended up at the cooperative outcomes, whereas the ones that were extremely slow to reach convergence ended up at non-cooperative outcomes.
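For readers unfamiliar with the two strategies named above, the sketch below gives their textbook forms for the Prisoner's Dilemma. It is an illustration only, not the automata or the Experience Weighted Attraction model used in the paper; the function names and the choice to open with cooperation are assumptions.

```python
C, D = "C", "D"  # cooperate, defect


def grim_trigger(opponent_history):
    """Cooperate until the opponent has defected once, then defect forever."""
    return D if D in opponent_history else C


def win_stay_lose_shift(my_last, opponent_last):
    """Textbook Prisoner's Dilemma form: repeat the last action after a good
    payoff (the opponent cooperated), switch after a bad one (the opponent
    defected). Equivalently, cooperate exactly when both players made the
    same move in the previous round."""
    if my_last is None:          # first round: open with cooperation (assumed)
        return C
    return C if my_last == opponent_last else D


def play_wsls_pair(rounds, first_moves):
    """Simulate two Win-Stay, Lose-Shift players from given opening moves."""
    a, b = first_moves
    history = [(a, b)]
    for _ in range(rounds - 1):
        a, b = win_stay_lose_shift(a, b), win_stay_lose_shift(b, a)
        history.append((a, b))
    return history


if __name__ == "__main__":
    # One player opens with defection; the pair still settles on cooperation.
    print(play_wsls_pair(4, (D, C)))
```

Starting from the openings `(D, C)`, the simulated pair plays `(D, D)` next and then cooperates from the third round on.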

Learning the optimum as a Nash equilibrium

Journal of Economic Dynamics and Control, 2000

This paper shows the computational benefits of a game theoretic approach to optimization of high dimensional control problems. A dynamic noncooperative game framework is adopted to partition the control space and to search the optimum as the equilibrium of a k-person dynamic game played by k-parallel genetic algorithms. When there are multiple inputs, we delegate control authority over a set of control variables exclusively to one player so that k artificially intelligent players explore and communicate to learn the global optimum as the Nash equilibrium. In the case of a single input, each player's decision authority becomes active on exclusive sets of dates so that k GAs construct the optimal control trajectory as the equilibrium of evolving best-to-date responses. Sample problems are provided to demonstrate the gains in computational speed and accuracy.