Combining Off and On-Policy Training in Model-Based Reinforcement Learning

Mastering the game of Go with deep neural networks and tree search

All games of perfect information have an optimal value function, v*(s), which determines the outcome of the game, from every board position or state s, under perfect play by all players. These games may be solved by recursively computing the optimal value function in a search tree containing approximately b^d possible sequences of moves, where b is the game's breadth (number of legal moves per position) and d is its depth (game length). In large games, such as chess (b ≈ 35, d ≈ 80) [1] and especially Go (b ≈ 250, d ≈ 150) [1], exhaustive search is infeasible [2,3], but the effective search space can be reduced by two general principles. First, the depth of the search may be reduced by position evaluation: truncating the search tree at state s and replacing the subtree below s by an approximate value function v(s) ≈ v*(s) that predicts the outcome from state s. This approach has led to superhuman performance in chess [4], checkers [5] and Othello [6], but it was believed to be intractable in Go due to the complexity of the game [7]. Second, the breadth of the search may be reduced by sampling actions from a policy p(a|s) that is a probability distribution over possible moves a in position s. For example, Monte Carlo rollouts [8] search to maximum depth without branching at all, by sampling long sequences of actions for both players from a policy p. Averaging over such rollouts can provide an effective position evaluation, achieving superhuman performance in backgammon [8] and Scrabble [9], and weak amateur level play in Go [10]. Monte Carlo tree search (MCTS) [11,12] uses Monte Carlo rollouts to estimate the value of each state in a search tree. As more simulations are executed, the search tree grows larger and the relevant values become more accurate. The policy used to select actions during search is also improved over time, by selecting children with higher values.
Asymptotically, this policy converges to optimal play, and the evaluations converge to the optimal value function [12]. The strongest current Go programs are based on MCTS, enhanced by policies that are trained to predict human expert moves [13]. These policies are used to narrow the search to a beam of high-probability actions, and to sample actions during rollouts. This approach has achieved strong amateur play [13–15]. However, prior work has been limited to shallow policies [13–15] or value functions [16] based on a linear combination of input features. Recently, deep convolutional neural networks have achieved unprecedented performance in visual domains: for example, image classification [17], face recognition [18], and playing Atari games [19]. They use many layers of neurons, each arranged in overlapping tiles, to construct increasingly abstract, localized representations of an image [20]. We employ a similar architecture for the game of Go. We pass in the board position as a 19 × 19 image and use convolutional layers to construct a representation of the position. We use these neural networks to reduce the effective depth and breadth of the search tree: evaluating positions using a value network, and sampling actions using a policy network. We train the neural networks using a pipeline consisting of several stages of machine learning (Fig. 1). We begin by training a supervised learning (SL) policy network p_σ directly from expert human moves. This provides fast, efficient learning updates with immediate feedback and high-quality gradients. Similar to prior work [13,15], we also train a fast policy p_π that can rapidly sample actions during rollouts. Next, we train a reinforcement learning (RL) policy network p_ρ that improves the SL policy network by optimizing the final outcome of games of self-play. This adjusts the policy towards the correct goal of winning games, rather than maximizing predictive accuracy.
Finally, we train a value network v_θ that predicts the winner of games played by the RL policy network against itself. Our program AlphaGo efficiently combines the policy and value networks with MCTS. Supervised learning of policy networks. For the first stage of the training pipeline, we build on prior work on predicting expert moves in the game of Go using supervised learning [13,21–24]. The SL policy network p_σ(a|s) alternates between convolutional layers with weights σ, and rectifier nonlinearities. A final softmax layer outputs a probability distribution over all legal moves a. The input s to the policy network is a simple representation of the board state (see Extended Data Table 2). The policy network is trained on randomly ...

The game of Go has long been viewed as the most challenging of classic games for artificial intelligence owing to its enormous search space and the difficulty of evaluating board positions and moves. Here we introduce a new approach to computer Go that uses 'value networks' to evaluate board positions and 'policy networks' to select moves. These deep neural networks are trained by a novel combination of supervised learning from human expert games, and reinforcement learning from games of self-play. Without any lookahead search, the neural networks play Go at the level of state-of-the-art Monte Carlo tree search programs that simulate thousands of random games of self-play. We also introduce a new search algorithm that combines Monte Carlo simulation with value and policy networks. Using this search algorithm, our program AlphaGo achieved a 99.8% winning rate against other Go programs, and defeated the human European Go champion by 5 games to 0. This is the first time that a computer program has defeated a human professional player in the full-sized game of Go, a feat previously thought to be at least a decade away.
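The b^d estimate quoted in this excerpt can be checked with a few lines of arithmetic. The sketch below is only an illustration, not part of the paper: the helper name `log10_search_space` is hypothetical, and it works in log space because b^d itself is far too large to print usefully.

```python
import math


def log10_search_space(breadth: int, depth: int) -> float:
    """Return log10(breadth ** depth), i.e. the order of magnitude of b^d."""
    return depth * math.log10(breadth)


# Approximate breadth and depth figures quoted in the excerpt above.
games = {"chess": (35, 80), "Go": (250, 150)}

for name, (b, d) in games.items():
    exponent = log10_search_space(b, d)
    print(f"{name}: b^d is on the order of 10^{exponent:.1f}")
```

Under these figures the Go tree is more than two hundred orders of magnitude larger than the chess tree, which is why the excerpt argues for reducing both depth (value function) and breadth (policy).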

AlphaZero-Inspired Game Learning: Faster Training by Using MCTS Only at Test Time

IEEE transactions on games, 2022

Recently, the seminal algorithms AlphaGo and AlphaZero have started a new era in game learning and deep reinforcement learning. While the achievements of AlphaGo and AlphaZero (playing Go and other complex games at superhuman level) are truly impressive, these architectures have the drawback that they require high computational resources. Many researchers are looking for methods that are similar to AlphaZero, but have lower computational demands and are thus more easily reproducible. In this paper, we pick an important element of AlphaZero, the Monte Carlo Tree Search (MCTS) planning stage, and combine it with temporal difference (TD) learning agents. We wrap MCTS for the first time around TD n-tuple networks, and we use this wrapping only at test time to create versatile agents while keeping the computational demands low. We apply this new architecture to several complex games (Othello, ConnectFour, Rubik's Cube) and show the advantages achieved with this AlphaZero-inspired MCTS wrapper. In particular, we present results showing that this agent is the first one trained on standard hardware (no GPU or TPU) to beat the very strong Othello program Edax up to and including level 7 (where most other learning-from-scratch algorithms could only defeat Edax up to level 2).

Monte Carlo Tree Search for Policy Optimization

2019

Gradient-based methods are often used for policy optimization in deep reinforcement learning, despite being vulnerable to local optima and saddle points. Although gradient-free methods (e.g., genetic algorithms or evolution strategies) help mitigate these issues, poor initialization and local optima are still concerns in highly nonconvex spaces. This paper presents a method for policy optimization based on Monte-Carlo tree search and gradient-free optimization. Our method, called Monte-Carlo tree search for policy optimization (MCTSPO), provides a better exploration-exploitation trade-off through the use of the upper confidence bound heuristic. We demonstrate improved performance on reinforcement learning tasks with deceptive or sparse reward functions compared to popular gradient-based and deep genetic algorithm baselines.
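The upper confidence bound heuristic mentioned in this abstract can be illustrated with the standard UCB1 rule. The abstract does not say which variant MCTSPO uses, so the sketch below is an assumption: the function name `ucb1_select`, its arguments, and the default exploration constant are all hypothetical.

```python
import math


def ucb1_select(counts, reward_sums, c=math.sqrt(2)):
    """Pick the child index with the highest UCB1 score.

    counts[i] is how often child i was visited; reward_sums[i] is the
    total reward collected through it. Unvisited children are tried first.
    """
    for i, n in enumerate(counts):
        if n == 0:
            return i
    total = sum(counts)
    scores = [
        reward_sums[i] / counts[i]                      # exploitation: mean reward
        + c * math.sqrt(math.log(total) / counts[i])    # exploration bonus
        for i in range(len(counts))
    ]
    return max(range(len(scores)), key=scores.__getitem__)


# A rarely visited child can win the selection despite a lower mean reward.
print(ucb1_select([100, 1], [60.0, 0.0]))
```

The exploration bonus shrinks as a child is visited more often, so the rule gradually shifts from trying under-sampled branches to exploiting the best-looking one.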

Mastering Atari, Go, chess and shogi by planning with a learned model

Nature, 2020

Constructing agents with planning capabilities has long been one of the main challenges in the pursuit of artificial intelligence. Tree-based planning methods have enjoyed huge success in challenging domains, such as chess and Go, where a perfect simulator is available. However, in real-world problems the dynamics governing the environment are often complex and unknown. In this work we present the MuZero algorithm which, by combining a tree-based search with a learned model, achieves superhuman performance in a range of challenging and visually complex domains, without any knowledge of their underlying dynamics. MuZero learns a model that, when applied iteratively, predicts the quantities most directly relevant to planning: the reward, the action-selection policy, and the value function. When evaluated on 57 different Atari games (the canonical video game environment for testing AI techniques, in which model-based planning approaches have historically struggled), our new algorithm achieved a new state of the art. When evaluated on Go, chess and shogi, without any knowledge of the game rules, MuZero matched the superhuman performance of the AlphaZero algorithm that was supplied with the game rules.

CS771 - Machine Learning Techniques: Learning Atari Game Strategies Using Deep Reinforcement Learning. Final Project Report: Project Objectives and Motivation

Our aim was to create an AI agent that learns to play a number of Atari games well using the same set of hyperparameters. As time progressed, it transitioned into a more theory-oriented project, where we studied the different ways of using reinforcement learning methods for our task and implemented them on a single game, Pong, instead of a set of games. Our primary motivation for choosing this topic was the vast potential of deep reinforcement learning to train agents that can perform a specific set of tasks in vastly different environments, based only on the state provided and the rewards accumulated. Another important reason was that this topic was not covered in detail in class, so it gave us first-hand experience of exploring a topic on our own and of marveling at the beauty of machine learning when used in settings vastly different from those in class (in this case, no separate dataset is required, which differs substantially from the supervised and unsupervised settings that we are used to). Here is a video of our agent beating the AI in one of the episodes: https://youtu.be/CH-mqog6vZA.

Playing Atari with Deep Reinforcement Learning

We present the first deep learning model to successfully learn control policies directly from high-dimensional sensory input using reinforcement learning. The model is a convolutional neural network, trained with a variant of Q-learning, whose input is raw pixels and whose output is a value function estimating future rewards. We apply our method to seven Atari 2600 games from the Arcade Learning Environment, with no adjustment of the architecture or learning algorithm. We find that it outperforms all previous approaches on six of the games and surpasses a human expert on three of them.
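The abstract describes a convolutional network trained with a variant of Q-learning. As a rough illustration of the underlying update only, the sketch below uses a plain lookup table instead of a network. The function name `q_update`, the state and action labels, and the step-size and discount values are assumptions for illustration, not details from the paper.

```python
from collections import defaultdict


def q_update(q, state, action, reward, next_state, actions,
             alpha=0.1, gamma=0.99, done=False):
    """Apply one tabular Q-learning step and return the new Q(state, action).

    The target is reward + gamma * max_a' Q(next_state, a'), or just the
    reward when the episode has ended.
    """
    best_next = 0.0 if done else max(q[(next_state, a)] for a in actions)
    target = reward + gamma * best_next
    q[(state, action)] += alpha * (target - q[(state, action)])
    return q[(state, action)]


# Hypothetical two-action example with hand-picked values.
q = defaultdict(float)
q[("s1", "left")] = 1.0
new_value = q_update(q, "s0", "right", reward=0.5, next_state="s1",
                     actions=["left", "right"], alpha=0.5, gamma=0.9)
print(new_value)
```

In the deep variant described by the abstract, the table is replaced by a network that maps raw pixels to one value per action, and the same target is used as a regression label.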

Learning non-random moves for playing Othello: Improving Monte Carlo Tree Search

2011 IEEE Conference on Computational Intelligence and Games (CIG'11), 2011

Monte Carlo Tree Search (MCTS) with an appropriate tree policy may be used to approximate a minimax tree for games such as Go, where a state value function cannot be formulated easily: recent MCTS algorithms successfully combine Upper Confidence Bounds for Trees with Monte Carlo (MC) simulations to incrementally refine estimates on the game-theoretic values of the game's states. Although a game-specific value function is not required for this approach, significant improvements in performance may be achieved by derandomising the MC simulations using domain-specific knowledge. However, recent results suggest that the choice of a non-uniformly random default policy is non-trivial and may often lead to unexpected outcomes. In this paper we employ Temporal Difference Learning (TDL) as a general approach to the integration of domain-specific knowledge in MCTS and subsequently study its impact on the algorithm's performance. In particular, TDL is used to learn a linear function approximator that is used as an a priori bias to the move selection in the algorithm's default policy; the function approximator is also used to bias the values of the nodes in the tree directly. The goal of this work is to determine whether such a simplistic approach can be used to improve the performance of MCTS for the well-known board game Othello. The analysis of the results highlights the broader conclusions that may be drawn with respect to non-random default policies in general.

Improving generalization in reinforcement learning on Atari 2600 games

INTERNATIONAL JOURNAL OF ADVANCE RESEARCH, IDEAS AND INNOVATIONS IN TECHNOLOGY

Deep Reinforcement Learning (DRL) is poised to revolutionize the field of artificial intelligence (AI) and represents a crucial step towards building autonomous systems with a higher-level understanding of the world around them. In particular, deep reinforcement learning has changed the landscape of autonomous agents by achieving superhuman performance on the board game Go, a significant milestone in AI research. In this project, we attempt to train a deep RL network on Demon Attack, an Atari 2600 game, and test the model on different game environments to investigate the feasibility of applying transfer learning to environments with the same action space but a slightly different state space. We further extend the project to use established reinforcement learning techniques such as DQN, Dueling DQN and SARSA to examine whether RL agents can generalize to unfamiliar environments by fine-tuning the hyperparameters. Finally, we borrow classic regularization techniques such as L2 regularization and dropout from the world of supervised learning and probe whether these techniques, which have received very limited attention in the domain of reinforcement learning, are effective in reducing overfitting of deep RL networks. Deep networks are expensive to train, and complex models take weeks to train using expensive GPUs. We find that the use of the above techniques prevents the network from overfitting to the current environment and gives satisfactory results when tested on slightly different environments, thus enabling substantial savings in training time and resources.

Learning How to Play Bomberman with Deep Reinforcement and Imitation Learning

Lecture Notes in Computer Science, 2019

Making artificial agents that learn how to play is a longstanding goal in the area of Game AI. Recently, several successful cases have emerged driven by Reinforcement Learning (RL) and neural network-based approaches. However, in most of the cases, the results have been achieved by training directly from pixel frames with valuable computational resources. In this paper, we devise agents that learn how to play the popular game of Bomberman by relying on state representations and RL-based algorithms without looking at the pixel level. To that end, we designed five vector-based state representations and implemented Bomberman on top of the Unity game engine through the ML-agents toolkit. We enhance the ML-agents algorithms by developing an Imitation-based learner (IL) that improves its model with the Actor-Critic Proximal-Policy Optimization (PPO) method. We compared this approach with a PPO-only learner that uses either a Multi-Layer Perceptron or a Long Short-Term Memory network (LSTM). We conducted several training and tournament experiments by making the agents play against each other. The hybrid state representation and our IL followed by PPO learning algorithm achieve the best overall quantitative results, and we also observed that their agents learn a correct Bomberman behavior.

Model-Based Reinforcement Learning for Atari

2020

Model-free reinforcement learning (RL) can be used to learn effective policies for complex tasks, such as Atari games, even from image observations. However, this typically requires very large amounts of interaction, substantially more, in fact, than a human would need to learn the same games. How can people learn so quickly? Part of the answer may be that people can learn how the game works and predict which actions will lead to desirable outcomes. In this paper, we explore how video prediction models can similarly enable agents to solve Atari games with fewer interactions than model-free methods. We describe Simulated Policy Learning (SimPLe), a complete model-based deep RL algorithm based on video prediction models, and present a comparison of several model architectures, including a novel architecture that yields the best results in our setting. Our experiments evaluate SimPLe on a range of Atari games in a low-data regime of 100k interactions between the agent and the environmen...