The Irrevocable Multi-Armed Bandit Problem
Related papers
Multi-armed bandit problems with dependent arms
2007
We provide a framework to exploit dependencies among arms in multi-armed bandit problems, when the dependencies are in the form of a generative model on clusters of arms. We find an optimal MDP-based policy for the discounted reward case, and also give an approximation of it with a formal error guarantee. We discuss lower bounds on regret in the undiscounted reward scenario, and propose a general two-level bandit policy for it. We propose three different instantiations of our general policy and provide theoretical justifications of how the regret of the instantiated policies depends on the characteristics of the clusters. Finally, we empirically demonstrate the efficacy of our policies on large-scale real-world and synthetic data, and show that they significantly outperform classical policies designed for bandits with independent arms.
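The two-level structure can be pictured with a short sketch. The version below is only an illustration of that structure, not any of the paper's three instantiations, and it ignores the generative cluster model: a UCB-style rule (with an assumed sqrt(2 ln t / n) bonus) first chooses a cluster, then chooses an arm inside the chosen cluster from that cluster's own statistics; all names and parameters are illustrative.

```python
import math

def ucb_pick(counts, sums, t):
    """UCB choice over one level: play any untried option first, otherwise
    maximize empirical mean plus a sqrt(2 ln t / n_i) exploration bonus."""
    for i, c in enumerate(counts):
        if c == 0:
            return i
    return max(range(len(counts)),
               key=lambda k: sums[k] / counts[k]
               + math.sqrt(2.0 * math.log(t) / counts[k]))

def two_level_ucb(clusters, horizon):
    """`clusters` is a list of lists of no-argument reward callables.
    Level 1 treats each cluster as a meta-arm; level 2 picks an arm
    inside the chosen cluster using that cluster's own statistics."""
    c_counts = [0] * len(clusters)
    c_sums = [0.0] * len(clusters)
    a_counts = [[0] * len(c) for c in clusters]
    a_sums = [[0.0] * len(c) for c in clusters]
    total = 0.0
    for t in range(1, horizon + 1):
        c = ucb_pick(c_counts, c_sums, t)                      # pick a cluster
        a = ucb_pick(a_counts[c], a_sums[c], c_counts[c] + 1)  # pick an arm in it
        r = clusters[c][a]()
        c_counts[c] += 1; c_sums[c] += r
        a_counts[c][a] += 1; a_sums[c][a] += r
        total += r
    return total
```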
Linear Programming for Finite State Multi-Armed Bandit Problems
Mathematics of Operations Research, 1986
... Soc. Ser. B 42, 143-149. (1982). Optimization over Time, Vol. 1. John Wiley, New York. Varaiya, P., Walrand, J. and Buyukkoc, C. (1984). Extensions of the Multi-Armed Bandit Problem. Electronic Research Laboratory, University of California, Berkeley, Technical Report, 41 pp. ...
The multi-armed bandit, with constraints
Annals of Operations Research, 2012
The early sections of this paper present an analysis of a Markov decision model that is known as the multi-armed bandit under the assumption that the utility function of the decision maker is either linear or exponential. The analysis includes efficient procedures for computing the expected utility associated with the use of a priority policy and for identifying a priority policy that is optimal. The methodology in these sections is novel, building on the use of elementary row operations. In the later sections of this paper, the analysis is adapted to accommodate constraints that link the bandits. It was demonstrated in [12, 10] that, at each multi-state, it is optimal to play any Markov chain (bandit) whose current state has the largest index (lowest label). Following [12, 10], the multi-armed bandit problem has stimulated research in control theory, economics, probability, and operations research. A sampling of noteworthy papers includes Bergemann and Välimäki [2], Bertsimas and Niño-Mora [4], El Karoui and Karatzas [8], Katehakis and Veinott [15], Schlag [17], Sonin [18], Tsitsiklis [19], Varaiya, Walrand and Buyukkoc [20], Weber [22], and Whittle [24]. Books on the subject (that list many references) include Berry and Fristedt [3], Gittins [11], and Gittins, Glazebrook and Weber [13]. The last and most recent of these books provides a status report on the multi-armed bandit that is almost up-to-date. An implication of the analysis in [12, 10] is that the largest of all of the indices equals the maximum over all states of the ratio r(i)/(1 − c), where r(i) denotes the expectation of the reward that is earned if state i's bandit is played once while state i is observed and where c is the discount factor. In 1994, Tsitsiklis [19] ...
Multi-armed bandits with simple arms
Advances in Applied Mathematics, 1986
An exact solution to certain multi-armed bandit problems with independent and simple arms is presented. An arm is simple if the observations associated with the arm have one of two distributions conditional on the value of an unknown dichotomous parameter. This solution is obtained by relating Gittins indices for the arms to ladder variables for associated random walks.
Multi-armed bandit problem with precedence relations
Institute of Mathematical Statistics Lecture Notes - Monograph Series, 2006
Consider a multi-phase project management problem where the decision maker needs to deal with two issues: (a) how to allocate resources to projects within each phase, and (b) when to enter the next phase, so that the total expected reward is as large as possible. We formulate the problem as a multi-armed bandit problem with precedence relations. In Chan, Fuh and Hu (2005), a class of asymptotically optimal arm-pulling strategies is constructed to minimize the shortfall from the perfect-information payoff. Here we further explore optimality properties of the proposed strategies. First, we show that the efficiency benchmark, which is given by the regret lower bound, reduces to those in Lai and Robbins (1985), Hu and Wei (1989), and Fuh and Hu (2000). This implies that the proposed strategy is also optimal under the settings of the aforementioned papers. Secondly, we establish the super-efficiency of the proposed strategies when the bad set is empty. Thirdly, we show that they are still optimal with a constant switching cost between arms. In addition, we prove that Wald's equation holds for Markov chains under a Harris recurrence condition, which is an important tool in studying the efficiency of the proposed strategies.
Finite-time Analysis of the Multiarmed Bandit Problem
2002
Reinforcement learning policies face the exploration versus exploitation dilemma, i.e. the search for a balance between exploring the environment to find profitable actions and taking the empirically best action as often as possible. A popular measure of a policy's success in addressing this dilemma is the regret, that is, the loss due to the fact that the globally optimal policy is not followed all the time. One of the simplest examples of the exploration/exploitation dilemma is the multi-armed bandit problem. Lai and Robbins were the first to show that the regret for this problem has to grow at least logarithmically in the number of plays. Since then, policies which asymptotically achieve this regret have been devised by Lai and Robbins and many others. In this work we show that the optimal logarithmic regret is also achievable uniformly over time, with simple and efficient policies, and for all reward distributions with bounded support.
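The "simple and efficient policies" in question include UCB1, which adds a sqrt(2 ln t / n_i) exploration bonus to each arm's empirical mean. A minimal sketch, assuming rewards in [0, 1]; the Bernoulli test arms at the bottom are purely illustrative:

```python
import math
import random

def ucb1(arms, horizon):
    """UCB1: after one pass over all arms, always pull the arm maximizing
    empirical mean + sqrt(2 ln t / n_i), where t is the number of plays
    so far and n_i the number of pulls of arm i."""
    k = len(arms)
    counts = [0] * k
    sums = [0.0] * k
    total = 0.0
    for t in range(1, horizon + 1):
        if t <= k:
            i = t - 1                              # initial round-robin pass
        else:
            i = max(range(k),
                    key=lambda j: sums[j] / counts[j]
                    + math.sqrt(2.0 * math.log(t) / counts[j]))
        r = arms[i]()
        counts[i] += 1
        sums[i] += r
        total += r
    return total

if __name__ == "__main__":
    # Two illustrative Bernoulli arms with means 0.4 and 0.6.
    arms = [lambda: float(random.random() < 0.4),
            lambda: float(random.random() < 0.6)]
    print(ucb1(arms, horizon=10000))
```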
The N-armed bandit with unimodal structure
Metrika, 1983
In this paper we study a special class of bandit problems, which are characterized by a unimodal structure of the expected rewards of the arms. In Section 1, the motivation for studying this problem is explained. In the next two sections, two different decision procedures are analyzed, which are based on a stochastic approximation of the best arm of the bandit. Finally, in Section 4, a special procedure is discussed and some numerical data are presented, which were obtained by applying it to a concrete N-armed bandit with unimodal structure.
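As a rough illustration of why unimodality helps (this is a hypothetical local-search sketch, not one of the stochastic-approximation procedures analyzed in the paper), a policy can restrict its sampling to the current arm and its immediate neighbours and move toward the locally best one, since under unimodality local improvement leads toward the globally best arm; the batch size and starting arm below are arbitrary choices.

```python
def unimodal_hill_climb(arms, horizon, batch=50):
    """Illustrative local-search policy for a unimodal bandit: sample the
    current arm and its neighbours in small batches and move to whichever
    has the best empirical mean; unimodality is what makes purely local
    moves sufficient."""
    n = len(arms)
    pos = n // 2                                   # arbitrary starting arm
    rounds = max(horizon // (3 * batch), 1)
    for _ in range(rounds):
        candidates = [i for i in (pos - 1, pos, pos + 1) if 0 <= i < n]
        means = [sum(arms[i]() for _ in range(batch)) / batch
                 for i in candidates]
        pos = candidates[means.index(max(means))]
    return pos
```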
The multi-armed bandit problem with covariates
The Annals of Statistics, 2013
We consider a multi-armed bandit problem in a setting where each arm produces a noisy reward realization which depends on an observable random covariate. As opposed to the traditional static multi-armed bandit problem, this setting allows for dynamically changing rewards that better describe applications where side information is available. We adopt a nonparametric model where the expected rewards are smooth functions of the covariate and where the hardness of the problem is captured by a margin parameter. To maximize the expected cumulative reward, we introduce a policy called Adaptively Binned Successive Elimination (ABSE) that adaptively decomposes the global problem into suitably "localized" static bandit problems. This policy constructs an adaptive partition using a variant of the Successive Elimination (SE) policy. Our results include sharper regret bounds for the SE policy in a static bandit problem and minimax optimal regret bounds for the ABSE policy in the dynamic problem.
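ABSE builds on Successive Elimination, which keeps every arm whose confidence interval still overlaps that of the empirically best arm and discards the rest. A minimal sketch of the static SE building block, with an illustrative confidence-radius constant and without the adaptive covariate binning that distinguishes ABSE:

```python
import math

def conf_radius(count, n, delta):
    # Hoeffding-style radius; the exact constant is an illustrative choice.
    return math.sqrt(math.log(2.0 * n * count / delta) / (2.0 * count))

def successive_elimination(arms, horizon, delta=0.05):
    """Sample all surviving arms in rounds; drop an arm once its upper
    confidence bound falls below the best arm's lower confidence bound."""
    n = len(arms)
    counts = [0] * n
    sums = [0.0] * n
    alive = set(range(n))
    pulls = 0
    best = 0
    while pulls < horizon:
        for i in list(alive):                      # one pass over survivors
            sums[i] += arms[i]()
            counts[i] += 1
            pulls += 1
        best = max(alive, key=lambda i: sums[i] / counts[i])
        lower = sums[best] / counts[best] - conf_radius(counts[best], n, delta)
        alive = {i for i in alive
                 if sums[i] / counts[i] + conf_radius(counts[i], n, delta) >= lower}
    return best
```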
Adaptive Tug-of-war Model for Two-armed Bandit Problem
The "tug-of-war (TOW) model" proposed in our previous studies ] is a unique dynamical system inspired by the photoavoidance behavior of a single-celled amoeba of the true slime mold Physarum polycephalum. The TOW model is applied to solving the "multi-armed bandit problem," a problem of finding the most rewarding one from multiple options as accurately and speedy as possible. We showed that the model exhibits better performances compared with other well-known algorithms. However, in order to maximize its performance, the TOW model is required an optimized parameter w. In this study, we propose a new TOW model which adaptively produces the estimates to determine w in its own way and thus has no parameter. We show that in some asymmetric problems the new model is more efficient than the UCB1tuned algorithm , which is known as the best algorithm.