Stochastic Multiplicative Weights Updates in Zero-Sum Games

Multiplicative Weights Update in Zero-Sum Games

Proceedings of the 2018 ACM Conference on Economics and Computation, 2018

We study the classic setting where two agents compete against each other in a zero-sum game by applying the Multiplicative Weights Update (MWU) algorithm. In a twist on the standard approach of [11], we focus on the K-L divergence from the equilibrium, but instead of providing an upper bound we provide a nonnegative lower bound on its rate of change for games with interior equilibria. This implies movement away from equilibria and towards the boundary. In the case of zero-sum games without interior equilibria, convergence to the boundary (and in fact to the minimal product of subsimplexes that contains all Nash equilibria) follows via an orthogonal argument. In that subspace, divergence from the set of NE applies for all nonequilibrium initial conditions via the first argument. We argue the robustness of this non-equilibrating behavior by considering the following generalizations:

• Step size: Agents may be using different and even decreasing step sizes.
• Dynamics: Agents may be using Follow-the-Regularized-Leader algorithms and possibly apply different regularizers (e.g. MWU versus Gradient Descent). We also consider a linearized version of MWU.
• More than two agents: Multiple agents can interact via arbitrary networks of zero-sum polymatrix games and their affine variants.

Our results come in stark contrast with the standard interpretation of the behavior of MWU (and more generally regret-minimizing dynamics) in zero-sum games, which is typically referred to as "converging to equilibrium". If equilibria are indeed predictive even for the benchmark class of zero-sum games, agents in practice must deviate robustly from the axiomatic perspective of optimization-driven dynamics as captured by MWU and its variants.
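As a rough illustration of the divergence phenomenon described in this abstract, the following sketch (not taken from the paper; the Matching Pennies payoffs, step size, and initial strategies are assumptions chosen for illustration) runs MWU in self-play and prints the summed KL divergence from the interior equilibrium, which trends upward rather than shrinking.

```python
# Hedged sketch: MWU self-play in Matching Pennies (assumed example game),
# tracking the summed KL divergence from the interior equilibrium (1/2, 1/2).
import numpy as np

A = np.array([[1.0, -1.0],
              [-1.0, 1.0]])          # row player's payoffs; column player receives -A
eta = 0.1                            # fixed step size (illustrative assumption)
x = np.array([0.6, 0.4])             # row player's mixed strategy
y = np.array([0.4, 0.6])             # column player's mixed strategy
eq = np.array([0.5, 0.5])            # the unique (interior) Nash equilibrium

def kl(p, q):
    """Kullback-Leibler divergence KL(p || q)."""
    return float(np.sum(p * np.log(p / q)))

for t in range(301):
    if t % 100 == 0:
        print(f"t={t:3d}  summed KL from equilibrium = {kl(eq, x) + kl(eq, y):.4f}")
    u_x = A @ y                      # row player's payoff vector against y
    u_y = -A.T @ x                   # column player's payoff vector against x
    x = x * np.exp(eta * u_x); x /= x.sum()   # multiplicative weights update
    y = y * np.exp(eta * u_y); y /= y.sum()
# The divergence trends upward: play spirals away from the interior equilibrium
# and toward the boundary of the product of simplexes.
```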

Competition and Coordination in Stochastic Games

Lecture Notes in Computer Science, 2007

Agent competition and coordination are two of the classical and most important tasks in multiagent systems. In recent years, a number of learning algorithms have been proposed to solve such problems. Among them is an important class of algorithms, called adaptive learning algorithms, which have been shown to converge in self-play to a solution in a wide variety of repeated matrix games. Although certain algorithms of this class, such as Infinitesimal Gradient Ascent (IGA), Policy Hill-Climbing (PHC) and Adaptive Play Q-learning (APQ), have been extensively studied in the recent literature, the question of how these algorithms perform against each other in general-form stochastic games remains little studied. In this work we address this question: we analyse these algorithms in detail and give a comparative analysis of their behavior on a set of competition and coordination stochastic games. We also introduce a new multiagent learning algorithm, called ModIGA, an extension of IGA that is able to estimate the strategy of its opponents when they do not explicitly play mixed strategies (e.g., APQ) and that can be applied to games with more than two actions.
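As a loose, self-contained sketch of gradient-ascent learning with opponent-strategy estimation (the payoff matrix, step size, and empirical-frequency estimator below are illustrative assumptions, not the actual ModIGA construction):

```python
# Hedged sketch: gradient ascent on the probability of playing the first action in a
# 2x2 game, with the opponent's mixed strategy estimated from observed pure actions.
import numpy as np

R1 = np.array([[3.0, 0.0],
               [5.0, 1.0]])          # row player's payoffs (assumed example game)
R2 = R1.T                            # symmetric payoffs for the column player

def payoff_gradient(R, q_other):
    """Derivative of the expected payoff w.r.t. the own probability of action 0."""
    opp = np.array([q_other, 1.0 - q_other])
    return (R[0] - R[1]) @ opp

rng = np.random.default_rng(0)
p, q = 0.9, 0.2                      # probabilities of playing action 0
p_hat, q_hat = 0.5, 0.5              # each player's estimate of the opponent's strategy
step = 0.01
for t in range(1, 3001):
    a1 = 0 if rng.random() < p else 1        # sampled pure actions
    a2 = 0 if rng.random() < q else 1
    # empirical-frequency estimates of the opponent's mixing probabilities
    q_hat += ((1.0 if a2 == 0 else 0.0) - q_hat) / t
    p_hat += ((1.0 if a1 == 0 else 0.0) - p_hat) / t
    # projected gradient steps on the own mixing probabilities
    p = float(np.clip(p + step * payoff_gradient(R1, q_hat), 0.0, 1.0))
    q = float(np.clip(q + step * payoff_gradient(R2, p_hat), 0.0, 1.0))
print(f"final strategies: p={p:.2f}, q={q:.2f}")
```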

Heterogeneous learning in zero-sum stochastic games with incomplete information

2010 49th IEEE Conference on Decision and Control (CDC 2010), 2010

Learning algorithms are essential for applying game theory in a networking environment. In dynamic and decentralized settings where the traffic, topology and channel states may vary over time and communication between agents is impractical, it is important to formulate and study games of incomplete information and fully distributed learning algorithms that require from each agent only a minimal amount of information regarding the remaining agents. In this paper, we address this major challenge and introduce heterogeneous learning schemes in which each agent adopts a distinct learning pattern in the context of games with incomplete information. We use stochastic approximation techniques to show that the heterogeneous learning schemes can be studied in terms of their deterministic ordinary differential equation (ODE) counterparts. Depending on the learning rates of the players, these ODEs can differ from the standard replicator dynamics, (myopic) best response (BR) dynamics, logit dynamics, and fictitious play dynamics. We apply the results to a class of security games in which the attacker and the defender adopt different learning schemes due to differences in their rationality levels and the information they acquire.
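A loose illustration of heterogeneous learning under incomplete information (not the schemes analyzed in the paper): both players observe only their realized payoffs, maintain stochastic-approximation estimates of their action values, and map them to mixed strategies with different rules and learning rates. All payoffs, rates, and temperatures below are assumptions.

```python
# Hedged sketch: two players with heterogeneous learning rules in a 2x2 zero-sum game.
# Each keeps stochastic-approximation payoff estimates and plays a softmax strategy;
# the learning rates and temperatures differ per player (illustrative values only).
import numpy as np

A = np.array([[2.0, -1.0],
              [-1.0, 1.0]])               # attacker's payoff; defender receives -A
rng = np.random.default_rng(1)

def softmax(v, temp):
    z = np.exp((v - v.max()) / temp)
    return z / z.sum()

q_att = np.zeros(2)                        # attacker's payoff estimates
q_def = np.zeros(2)                        # defender's payoff estimates
for t in range(1, 5001):
    lr_att = 1.0 / t ** 0.6                # faster learner
    lr_def = 1.0 / t ** 0.9                # slower learner (different rate)
    x = softmax(q_att, temp=0.5)           # logit map with low temperature
    y = softmax(q_def, temp=2.0)           # more exploratory map
    a = rng.choice(2, p=x)                 # sampled actions; only the realized
    d = rng.choice(2, p=y)                 # payoff is observed (incomplete info)
    q_att[a] += lr_att * (A[a, d] - q_att[a])
    q_def[d] += lr_def * (-A[a, d] - q_def[d])
print("attacker strategy:", np.round(softmax(q_att, 0.5), 2))
print("defender strategy:", np.round(softmax(q_def, 2.0), 2))
```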

Stability of Multi-Agent Learning: Convergence in Network Games with Many Players

arXiv (Cornell University), 2023

The behaviour of multi-agent learning in many-player games has been shown to display complex dynamics outside of restrictive examples such as network zero-sum games. In addition, it has been shown that convergent behaviour is less likely to occur as the number of players increases. To make progress in resolving this problem, we study Q-Learning dynamics and determine a sufficient condition for the dynamics to converge to a unique equilibrium in any network game. We find that this condition depends on the nature of pairwise interactions and on the network structure, but is explicitly independent of the total number of agents in the game. We evaluate this result on a number of representative network games and show that, under suitable network conditions, stable learning dynamics can be achieved with an arbitrary number of agents.
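The following is a rough sketch of smoothed Q-learning dynamics on a network (the network, pairwise game, exploration temperature, and stopping rule are all assumptions for illustration, not the paper's model): each agent updates Q-values toward its expected payoff against its neighbours' current softmax strategies.

```python
# Hedged sketch: Boltzmann Q-learning dynamics in a network game (illustrative only).
import numpy as np

edges = [(0, 1), (1, 2), (2, 3), (3, 0), (0, 2)]   # assumed interaction network
n, temp, lr = 4, 1.0, 0.1                          # agents, exploration temperature, rate
B = np.array([[1.0, 0.0],
              [0.0, 1.0]])                         # assumed pairwise coordination game

def softmax(v, temp):
    z = np.exp((v - v.max()) / temp)
    return z / z.sum()

neighbours = {i: [j for e in edges for j in e if i in e and j != i] for i in range(n)}
rng = np.random.default_rng(3)
Q = rng.normal(scale=0.1, size=(n, 2))             # small random initial Q-values
for step in range(2000):
    strategies = np.array([softmax(Q[i], temp) for i in range(n)])
    new_Q = Q.copy()
    for i in range(n):
        # expected payoff of each of i's actions against the neighbours' current mixes
        u = sum(B @ strategies[j] for j in neighbours[i])
        new_Q[i] += lr * (u - Q[i])
    if np.max(np.abs(new_Q - Q)) < 1e-9:           # crude fixed-point check
        break
    Q = new_Q
print(np.round(np.array([softmax(Q[i], temp) for i in range(n)]), 3))
```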

Regret Minimization and Convergence to Equilibria in General-sum Markov Games

arXiv (Cornell University), 2022

An abundance of recent impossibility results establish that regret minimization in Markov games with adversarial opponents is both statistically and computationally intractable. Nevertheless, none of these results preclude the possibility of regret minimization under the assumption that all parties adopt the same learning procedure. In this work, we present the first (to our knowledge) algorithm for learning in general-sum Markov games that provides sublinear regret guarantees when executed by all agents. The bounds we obtain are for swap regret, and thus, along the way, imply convergence to a correlated equilibrium. Our algorithm is decentralized, computationally efficient, and does not require any communication between agents. Our key observation is that online learning via policy optimization in Markov games essentially reduces to a form of weighted regret minimization, with unknown weights determined by the path length of the agents' policy sequence. Consequently, controlling the path length leads to weighted regret objectives for which sufficiently adaptive algorithms provide sublinear regret guarantees.
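For concreteness, here is a small standalone helper illustrating the swap-regret notion used above (a definition-level sketch with made-up data, not the paper's algorithm): swap regret measures the gain obtainable by the best fixed map that reroutes each action to another.

```python
# Hedged sketch: computing the swap regret of a sequence of play in a matrix game.
from itertools import product
import numpy as np

def swap_regret(strategies, payoff_vectors):
    """strategies: (T, n) array of mixed strategies played.
    payoff_vectors: (T, n) array, payoff_vectors[t][a] = payoff of action a at time t.
    Swap regret = max over maps phi: actions -> actions of the cumulative gain from
    replaying each action a as phi(a), weighted by the probability mass put on a."""
    strategies = np.asarray(strategies)
    payoffs = np.asarray(payoff_vectors)
    T, n = strategies.shape
    realized = float(np.sum(strategies * payoffs))
    best = realized
    for phi in product(range(n), repeat=n):          # enumerate all n**n swap maps
        deviated = float(np.sum(strategies * payoffs[:, list(phi)]))
        best = max(best, deviated)
    return best - realized

# Tiny usage example with assumed data: two actions, three rounds.
xs = [[0.5, 0.5], [0.8, 0.2], [0.3, 0.7]]
us = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]]
print(round(swap_regret(xs, us), 3))
```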

Multiplicative updates outperform generic no-regret learning in congestion games

Proceedings of the 41st annual ACM symposium on Symposium on theory of computing - STOC '09, 2009

We study the outcome of natural learning algorithms in atomic congestion games. Atomic congestion games have a wide variety of equilibria, often with vastly differing social costs. We show that in almost all such games, the well-known multiplicative-weights learning algorithm results in convergence to pure equilibria. Our results show that natural learning behavior can avoid bad outcomes predicted by the price of anarchy in atomic congestion games such as the load-balancing game introduced by Koutsoupias and Papadimitriou, which has a super-constant price of anarchy and correlated equilibria that are exponentially worse than any mixed Nash equilibrium.
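A rough simulation sketch of this setting (the player and machine counts, the step size, and the use of expected rather than realized loads are assumptions for illustration): identical players maintain mixed strategies over machines and apply a (1 - ε·cost) multiplicative update; the mixtures typically collapse toward pure assignments.

```python
# Hedged sketch: multiplicative weights in a simple load-balancing congestion game.
# n identical players choose among m machines; a player's cost is the expected load
# on its machine. The (1 - eps*cost) variant is used, with costs rescaled to [0, 1].
import numpy as np

n, m, eps, rounds = 8, 3, 0.05, 4000
rng = np.random.default_rng(2)
X = rng.dirichlet(np.ones(m), size=n)           # each row: a player's mixed strategy

for _ in range(rounds):
    load = X.sum(axis=0)                        # expected load on each machine
    for i in range(n):
        # expected cost of machine j for player i: others' load plus own weight of 1
        cost = (load - X[i] + 1.0) / n          # rescaled to lie in [0, 1]
        X[i] = X[i] * (1.0 - eps * cost)
        X[i] /= X[i].sum()

print(np.round(X, 2))                           # rows are (near-)pure in most runs
print("expected loads:", np.round(X.sum(axis=0), 2))
```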

Fast and Furious Learning in Zero-Sum Games: Vanishing Regret with Non-Vanishing Step Sizes

2019

We show for the first time, to our knowledge, that it is possible to reconcile in online learning in zero-sum games two seemingly contradictory objectives: vanishing time-average regret and non-vanishing step sizes. This phenomenon, which we coin "fast and furious" learning in games, sets a new benchmark for what is possible both in max-min optimization and in multi-agent systems. Our analysis does not depend on introducing a carefully tailored dynamic. Instead we focus on the most well-studied online dynamic, gradient descent. Similarly, we focus on the simplest textbook class of games, two-agent two-strategy zero-sum games, such as Matching Pennies. Even for this simplest of benchmarks, the best known bound for total regret prior to our work was the trivial one of O(T), which is immediately applicable even to a non-learning agent. Based on a tight understanding of the geometry of the non-equilibrating trajectories in the dual space, we prove a regret bound of Θ(√T).
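A minimal sketch of the setting (the game, the fixed step size, and the initial strategies are assumptions): both players run projected online gradient descent with a constant step size in Matching Pennies, and the row player's time-average regret against the best fixed action is printed at a few horizons; it tends to shrink even though the step size never does.

```python
# Hedged sketch: projected online gradient descent with a constant step size in
# Matching Pennies, tracking the row player's regret against the best fixed action.
import numpy as np

A = np.array([[1.0, -1.0],
              [-1.0, 1.0]])              # row player's payoffs; column player gets -A
eta = 0.1                                # non-vanishing step size
x = np.array([0.7, 0.3])                 # row player's mixed strategy
y = np.array([0.4, 0.6])                 # column player's mixed strategy

def project_simplex_2(v):
    """Projection onto the 2-simplex via the probability of the first action."""
    p = np.clip(0.5 + (v[0] - v[1]) / 2.0, 0.0, 1.0)
    return np.array([p, 1.0 - p])

cum_payoff = 0.0                         # row player's realized cumulative payoff
cum_vectors = np.zeros(2)                # cumulative payoff of each fixed row action
for t in range(1, 10001):
    u_x = A @ y                          # payoff vector for the row player
    u_y = -A.T @ x                       # payoff vector for the column player
    cum_payoff += float(x @ u_x)
    cum_vectors += u_x
    x = project_simplex_2(x + eta * u_x) # gradient ascent step, then projection
    y = project_simplex_2(y + eta * u_y)
    if t in (100, 1000, 10000):
        regret = cum_vectors.max() - cum_payoff
        print(f"T={t:6d}  time-average regret = {regret / t:.4f}")
```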

On Finding and Learning Effective Strategies for Complex Non-zero-sum Repeated Games

IAT, 2012

We study complex non-zero-sum iterated two-player games, more specifically, various strategies and their performances in the iterated traveler's dilemma (ITD). We focus on the relative performances of several types of parameterized strategies, where each such strategy type corresponds to a particular "philosophy" on how to best predict the opponent's future behavior and/or entice the opponent to alter its behavior. We are particularly interested in adaptable, learning and/or evolving strategies that try to predict the future behavior of the other player in order to optimize their own behavior in the long run. We also study strategies that strive to minimize risk, as risk minimization has recently been suggested to be the appropriate paradigm for ITD and other complex games that have posed difficulties to classical game theory. We share the key insights from an elaborate round-robin tournament that we have implemented and analyzed. We draw some conclusions on what kinds of adaptability and models of the other player's behavior seem to be most effective in the long run. Lastly, we indicate some promising ways forward toward a better understanding of learning how to play complex iterated games well.
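For reference, a tiny sketch of the traveler's dilemma stage game and one iterated pairing (the claim range, reward/penalty parameter, and the two toy strategies are illustrative assumptions, not the tournament strategies studied in the paper):

```python
# Hedged sketch: the traveler's dilemma stage game and a short iterated match between
# two toy strategies (parameters and strategies are illustrative assumptions only).
LOW, HIGH, BONUS = 2, 100, 2          # claim range and reward/penalty parameter

def payoffs(c1, c2):
    """Both receive the lower claim; the lower claimant gets +BONUS, the higher -BONUS."""
    low = min(c1, c2)
    if c1 == c2:
        return low, low
    return (low + BONUS, low - BONUS) if c1 < c2 else (low - BONUS, low + BONUS)

def undercutter(history_opp):
    """Claims just below the opponent's last claim (a simple adaptive strategy)."""
    return HIGH if not history_opp else max(LOW, history_opp[-1] - 1)

def stubborn_high(history_opp):
    """Always claims the maximum, betting on mutual cooperation."""
    return HIGH

h1, h2, score1, score2 = [], [], 0, 0
for _ in range(200):                   # one iterated match of a round-robin pairing
    c1, c2 = undercutter(h2), stubborn_high(h1)
    p1, p2 = payoffs(c1, c2)
    score1, score2 = score1 + p1, score2 + p2
    h1.append(c1); h2.append(c2)
print("average payoffs:", score1 / 200, score2 / 200)
```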

Convergence to Pareto Optimality in General Sum Games Via Learning Opponent Preferences

2005

We consider the learning problem faced by two self-interested agents playing any general-sum game repeatedly, where the opponent's payoff is unknown. The concept of Nash Equilibrium in repeated games provides us with an individually rational solution for playing such games, which can be achieved by playing the Nash Equilibrium strategy of the single-shot game in every iteration. However, such a strategy can sometimes lead to a Pareto-dominated outcome for the repeated game. Our goal is to design learning strategies that converge to a Pareto-efficient outcome that also produces a Nash Equilibrium payoff for repeated two-player n-action general-sum games. We present a learning algorithm, POSNEL, which learns the opponent's preference structure and produces, under self-play, Nash equilibrium payoffs in the limit in all such games. We also show that such learning generates Pareto-optimal payoffs in a large majority of games. We derive a probability bound for convergence to the Nash Equilibrium payoff and experimentally demonstrate convergence to Pareto optimality for all structurally distinct 2-player 2-action conflict games. We also compare our algorithm with existing algorithms such as WOLF-IGA and JAL and show that POSNEL, on average, outperforms both.