Finite-time Analysis of the Multiarmed Bandit Problem
Abstract
Reinforcement learning policies face the exploration versus exploitation dilemma, i.e. the search for a balance between exploring the environment to find profitable actions and taking the empirically best action as often as possible. A popular measure of a policy's success in addressing this dilemma is the regret, that is, the loss incurred because the globally optimal policy is not followed all the time. One of the simplest examples of the exploration/exploitation dilemma is the multi-armed bandit problem. Lai and Robbins were the first to show that the regret for this problem must grow at least logarithmically in the number of plays. Since then, policies that asymptotically achieve this regret have been devised by Lai and Robbins and many others. In this work we show that the optimal logarithmic regret is also achievable uniformly over time, with simple and efficient policies, and for all reward distributions with bounded support.
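The simplest of the policies analyzed in the paper is UCB1, which plays each arm once and thereafter selects the arm maximizing the index (sample mean) + sqrt(2 ln n / n_j), where n is the total number of plays so far and n_j the number of plays of arm j. A minimal sketch in Python (the `pull` reward callback and the Bernoulli simulation below are illustrative, not from the paper):

```python
import math
import random

def ucb1(pull, n_arms, horizon):
    """Run UCB1 for `horizon` rounds; `pull(j)` returns a reward in [0, 1].

    Returns per-arm play counts and empirical mean rewards.
    """
    counts = [0] * n_arms          # n_j: number of times arm j was played
    means = [0.0] * n_arms         # empirical mean reward of arm j
    for t in range(horizon):
        if t < n_arms:
            arm = t                # initialization: play each arm once
        else:
            # upper confidence index: mean + sqrt(2 ln n / n_j)
            arm = max(range(n_arms),
                      key=lambda j: means[j] + math.sqrt(2 * math.log(t) / counts[j]))
        reward = pull(arm)
        counts[arm] += 1
        means[arm] += (reward - means[arm]) / counts[arm]  # incremental mean
    return counts, means

# Usage: two Bernoulli arms with (hypothetical) success probabilities 0.9 and 0.1;
# the better arm should accumulate the vast majority of the plays.
random.seed(0)
probs = [0.9, 0.1]
counts, means = ucb1(lambda j: 1.0 if random.random() < probs[j] else 0.0,
                     n_arms=2, horizon=1000)
```

The confidence radius sqrt(2 ln n / n_j) shrinks for frequently played arms, so suboptimal arms are sampled only about logarithmically often, which is the source of the logarithmic regret bound.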
References
- Agrawal, R. (1995). Sample mean based index policies with O(log n) regret for the multi-armed bandit problem. Advances in Applied Probability, 27, 1054–1078.
- Berry, D., & Fristedt, B. (1985). Bandit problems. London: Chapman and Hall.
- Burnetas, A., & Katehakis, M. (1996). Optimal adaptive policies for sequential allocation problems. Advances in Applied Mathematics, 17:2, 122–142.
- Duff, M. (1995). Q-learning for bandit problems. In Proceedings of the 12th International Conference on Machine Learning (pp. 209–217).
- Gittins, J. (1989). Multi-armed bandit allocation indices. Wiley-Interscience Series in Systems and Optimization. New York: John Wiley and Sons.
- Holland, J. (1992). Adaptation in natural and artificial systems. Cambridge: MIT Press/Bradford Books.
- Ishikida, T., & Varaiya, P. (1994). Multi-armed bandit problem revisited. Journal of Optimization Theory and Applications, 83:1, 113–154.
- Lai, T., & Robbins, H. (1985). Asymptotically efficient adaptive allocation rules. Advances in Applied Mathematics, 6, 4–22.
- Pollard, D. (1984). Convergence of stochastic processes. Berlin: Springer.
- Sutton, R., & Barto, A. (1998). Reinforcement learning: An introduction. Cambridge: MIT Press/Bradford Books.
- Wilks, S. (1962). Mathematical statistics. New York: John Wiley and Sons.
- Yakowitz, S., & Lowe, W. (1991). Nonparametric bandit methods. Annals of Operations Research, 28, 297–312.
Author information
Authors and Affiliations
- Peter Auer, University of Technology Graz, A-8010 Graz, Austria
- Nicolò Cesa-Bianchi, DTI, University of Milan, via Bramante 65, I-26013 Crema, Italy
- Paul Fischer, Lehrstuhl Informatik II, Universität Dortmund, D-44221 Dortmund, Germany
About this article
Cite this article
Auer, P., Cesa-Bianchi, N. & Fischer, P. Finite-time Analysis of the Multiarmed Bandit Problem. Machine Learning 47, 235–256 (2002). https://doi.org/10.1023/A:1013689704352
- Issue date: May 2002