Combining backpropagation with Equilibrium Propagation to improve an Actor-Critic reinforcement learning framework

Frontiers in Computational Neuroscience

Backpropagation (BP) has been used to train neural networks for many years, allowing them to solve a wide variety of tasks such as image classification, speech recognition, and reinforcement learning. However, the biological plausibility of BP as a mechanism of neural learning has been questioned. Equilibrium Propagation (EP) has been proposed as a more biologically plausible alternative and achieves comparable accuracy on the CIFAR-10 image classification task. This study proposes the first EP-based reinforcement learning architecture: an Actor-Critic architecture with the actor network trained by EP. We show that this model can solve the basic control tasks often used as benchmarks for BP-based models. Interestingly, our trained model demonstrates more consistent high-reward behavior than a comparable model trained exclusively by BP.

A connectionist actor-critic algorithm for faster learning and biological plausibility

We propose a novel, biologically plausible actor-critic algorithm that uses policy gradients to achieve practical, model-free reinforcement learning. It does not rely on backpropagation and is the first neural actor-critic relying only on locally available information. We show that it has an advantage over pure policy-gradient methods in motor learning performance on the pole-cart problem. We are also able to closely simulate the dopaminergic signaling patterns in rats when confronted with a two-cue problem, showing that local, connectionist models can effectively model the functioning of the intrinsic reward system.

Dynamic equilibrium through reinforcement learning

2011

Reinforcement Learning is an area of Machine Learning that deals with how an agent should take actions in an environment so as to maximize some notion of accumulated reward. This type of learning is inspired by the way humans learn and has led to the creation of various reinforcement learning algorithms. These algorithms focus on how an agent's behaviour can be improved, assuming independence from its surroundings. The current work studies the application of reinforcement learning methods to the inverted pendulum problem. The importance of the variability of the environment (factors that are external to the agent) for the performance of reinforcement learning agents is studied using a model that seeks to obtain equilibrium (stability) through dynamism: a Cart-Pole system, or inverted pendulum. We sought to improve the behaviour of the autonomous agents by changing the information passed to them while keeping the agent's internal parameters constant (learning rate, discount factors, decay rate, etc.), instead of following the classical approach of tuning those parameters. The influence of changes to the state set and the action set on an agent's capability to solve the Cart-Pole problem was studied. We studied the typical behaviour of reinforcement learning agents applied to the classic BOXES model, and propose a new way of characterizing the environment using the notion of convergence towards a reference value. We demonstrate the gain in performance of this new method applied to a Q-Learning agent.
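As a rough illustration of the setting described above, the following sketch discretizes a continuous Cart-Pole state into BOXES-style bins and applies one tabular Q-learning update. The bin edges, the function names `to_box` and `q_update`, and the hyperparameter values are assumptions made for this example; they are not taken from the paper, and the paper's proposed reference-value characterization of the environment is not reproduced here.

```python
import bisect
from collections import defaultdict

# Hypothetical bin edges for the four Cart-Pole state variables.
# These thresholds are illustrative only.
BIN_EDGES = [
    [-0.8, 0.8],       # cart position
    [-0.5, 0.5],       # cart velocity
    [-0.1, 0.0, 0.1],  # pole angle
    [-0.9, 0.9],       # pole angular velocity
]


def to_box(state):
    """Map a continuous state to a tuple of bin indices (one 'box')."""
    return tuple(
        bisect.bisect_right(edges, x) for edges, x in zip(BIN_EDGES, state)
    )


def q_update(q, box, action, reward, next_box,
             n_actions=2, alpha=0.1, gamma=0.99):
    """One tabular Q-learning step: Q(s,a) += alpha * (target - Q(s,a))."""
    best_next = max(q[(next_box, a)] for a in range(n_actions))
    target = reward + gamma * best_next
    q[(box, action)] += alpha * (target - q[(box, action)])
    return q[(box, action)]


# Unvisited (box, action) pairs default to a value of 0.0.
q_table = defaultdict(float)
s = to_box([0.0, 0.1, 0.02, -0.3])
s_next = to_box([0.01, 0.2, 0.03, -0.2])
new_value = q_update(q_table, s, action=1, reward=1.0, next_box=s_next)
```

Changing `BIN_EDGES` alters the information passed to the agent while leaving `alpha` and `gamma` untouched, which is the kind of intervention the abstract contrasts with parameter tuning.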

A simple actor-critic algorithm for continuous environments (submitted for publication; available at http://home.elka.pw.edu.pl/~pwawrzyn)

2003

In reference to methods analyzed recently by Sutton et al. and Konda & Tsitsiklis, we propose a modification of them called the Randomized Policy Optimizer (RPO). The algorithm has a modular structure and is based on the value function rather than on the action-value function. The modules include neural approximators and a parameterized distribution of control actions. The distribution must belong to a family of smoothly exploring distributions, which makes it possible to sample from the control action set in order to approximate a certain gradient. A pre-action-value function is introduced analogously to the action-value function, with the first action replaced by the first action-distribution parameter. The paper contains an experimental comparison of this approach to reinforcement learning with model-free Adaptive Critic Designs, specifically with the Action-Dependent Adaptive Heuristic Critic. The comparison is favorable for our algorithm.

A Survey of Actor-Critic Reinforcement Learning: Standard and Natural Policy Gradients

IEEE Transactions on Systems, Man, and Cybernetics, Part C: Applications and Reviews, 2015

Policy-gradient-based actor-critic algorithms are amongst the most popular algorithms in the reinforcement learning framework. Their ability to search for optimal policies using low-variance gradient estimates has made them useful in several real-life applications, such as robotics, power control, and finance. Although general surveys on reinforcement learning techniques already exist, no survey is dedicated specifically to actor-critic algorithms. This paper therefore describes the state of the art of actor-critic algorithms, with a focus on methods that can work in an online setting and use function approximation in order to deal with continuous state and action spaces. After starting with a discussion of the concepts of reinforcement learning and the origins of actor-critic algorithms, this paper describes the workings of the natural gradient, which has made its way into many actor-critic algorithms in the past few years. A review of several ...

Reinforcement Learning and Robotics

Introduction to Deep Learning Business Applications for Developers, 2018

Due to the recent achievements of deep learning [GBC16], which benefit from big data, powerful computation, and new algorithmic techniques, we have been witnessing a renaissance of reinforcement learning, especially the combination of reinforcement learning and deep neural networks, the so-called deep reinforcement learning (deep RL). Deep Q-networks (DQNs) have ignited the field of deep RL [MKS+15] by allowing machines to achieve superhuman performance in Atari games and the very hard board game of Go. It has long been known that RL is unstable when the action-value (Q) function is approximated with nonlinear functions, such as neural networks. However, DQNs made several contributions that improve the stability of learning:
• DQNs stabilized the training of the action-value (Q) function approximation using a CNN with replay.
• DQNs used an end-to-end RL approach, taking only raw pixels and the game score as inputs.
• DQNs used a flexible network with the same algorithm, network architecture, and hyperparameters to play different Atari games.

SARSA refines the policy greedily with respect to action values.
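The replay mechanism credited above with stabilizing DQN training can be sketched in a few lines. This is a minimal illustration, not the implementation from the book or from the DQN paper: the class name `ReplayBuffer`, the helper `td_targets`, and the callable `q_next_max` (standing in for the maximum of a Q-network's outputs over actions at the next state) are all assumptions made for this example.

```python
import random
from collections import deque


class ReplayBuffer:
    """Fixed-size store of transitions, sampled uniformly at random."""

    def __init__(self, capacity, seed=None):
        # A deque with maxlen silently drops the oldest item when full.
        self.buffer = deque(maxlen=capacity)
        self.rng = random.Random(seed)

    def push(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size):
        # Uniform sampling breaks the temporal correlation between
        # consecutive transitions.
        return self.rng.sample(list(self.buffer), batch_size)

    def __len__(self):
        return len(self.buffer)


def td_targets(batch, q_next_max, gamma=0.99):
    """Q-learning targets r + gamma * max_a' Q(s', a').

    q_next_max is a hypothetical callable returning max_a' Q(s', a');
    terminal transitions receive no bootstrap term.
    """
    return [
        r + (0.0 if done else gamma * q_next_max(s2))
        for (_, _, r, s2, done) in batch
    ]


buf = ReplayBuffer(capacity=3, seed=0)
for t in range(5):
    buf.push(state=t, action=0, reward=1.0, next_state=t + 1, done=(t == 4))
batch = buf.sample(2)
targets = td_targets(
    [(0, 0, 1.0, 1, False), (1, 1, 1.0, 2, True)],
    q_next_max=lambda s: 2.0,
)
```

With a capacity of 3, only the three most recent transitions remain after five pushes, so old experience ages out as training proceeds.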

Decision-Aware Actor-Critic with Function Approximation and Theoretical Guarantees

arXiv (Cornell University), 2023

Actor-critic (AC) methods are widely used in reinforcement learning (RL), and benefit from the flexibility of using any policy gradient method as the actor and value-based method as the critic. The critic is usually trained by minimizing the TD error, an objective that is potentially decorrelated with the true goal of achieving a high reward with the actor. We address this mismatch by designing a joint objective for training the actor and critic in a decision-aware fashion. We use the proposed objective to design a generic AC algorithm that can easily handle any function approximation. We explicitly characterize the conditions under which the resulting algorithm guarantees monotonic policy improvement, regardless of the choice of the policy and critic parameterization. Instantiating the generic algorithm results in an actor that involves maximizing a sequence of surrogate functions (similar to TRPO, PPO), and a critic that involves minimizing a closely connected objective. Using simple bandit examples, we provably establish the benefit of the proposed critic objective over the standard squared error. Finally, we empirically demonstrate the benefit of our decision-aware actor-critic framework on simple RL problems.

1 Introduction

Reinforcement learning (RL) is a framework for solving problems involving sequential decision-making under uncertainty, and has found applications in games [38, 50], robot manipulation tasks [55, 64] and clinical trials [45]. RL algorithms aim to learn a policy that maximizes the long-term return by interacting with the environment. Policy gradient (PG) methods [59, 54, 29, 25, 47] are an important class of algorithms that can easily handle function approximation and structured state-action spaces, making them widely used in practice. PG methods assume a differentiable parameterization of the policy and directly optimize the return with respect to the policy parameters.
Typically, a policy's return is estimated by using Monte-Carlo samples obtained via environment interactions [59]. Since the environment is stochastic, this approach results in high variance in the estimated return, leading to higher sample-complexity (number of environment interactions required to learn a good policy). Actor-critic (AC) methods [29, 43, 5] alleviate this issue by using value-based approaches [52, 58] in conjunction with PG methods, and have been empirically successful [20, 23]. In AC algorithms, a value-based method ("critic") is used to approximate a policy's estimated value, and a PG method ("actor") uses this estimate to improve the policy towards obtaining higher returns. Though AC methods have the flexibility of using any method to independently train the actor and critic, it is unclear how to train the two components jointly in order to learn good policies. For example, the critic is typically trained via temporal difference (TD) learning and its objective is to minimize the value estimation error across all states and actions. For large real-world Markov decision processes (MDPs), it is intractable to estimate the values across all states and actions, and ...

37th Conference on Neural Information Processing Systems (NeurIPS 2023).
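The conventional arrangement described above, in which a TD-trained critic supplies the signal for a policy-gradient actor, can be sketched as a one-step tabular update. This is the standard baseline that the paper contrasts itself with, not the decision-aware joint objective it proposes. The tabular softmax parameterization, the function names, and the learning rates are assumptions chosen for illustration.

```python
import math


def softmax(prefs):
    """Numerically stable softmax over a list of action preferences."""
    m = max(prefs)
    exps = [math.exp(p - m) for p in prefs]
    total = sum(exps)
    return [e / total for e in exps]


def actor_critic_step(values, prefs, s, a, r, s_next, done,
                      gamma=0.99, lr_critic=0.1, lr_actor=0.1):
    """One-step tabular actor-critic update; returns the TD error."""
    # Critic: the TD error measures how wrong the value estimate was.
    bootstrap = 0.0 if done else values[s_next]
    td_error = r + gamma * bootstrap - values[s]
    values[s] += lr_critic * td_error

    # Actor: for a softmax policy, d log pi(a|s) / d pref[b]
    # equals 1[b == a] - pi(b|s).
    probs = softmax(prefs[s])
    for b in range(len(prefs[s])):
        grad_log = (1.0 if b == a else 0.0) - probs[b]
        prefs[s][b] += lr_actor * td_error * grad_log
    return td_error


values = [0.0, 0.0]               # V(s) for two states
prefs = [[0.0, 0.0], [0.0, 0.0]]  # action preferences per state
delta = actor_critic_step(values, prefs, s=0, a=1, r=1.0, s_next=1, done=False)
```

Here the critic's objective (reducing the TD error) and the actor's objective (raising the probability of rewarded actions) are optimized separately, which is the mismatch the abstract refers to.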

Reinforcement learning in intelligent control : a biologically-inspired approach to the relearning problem

1998

The increasingly complex demands placed on control systems have resulted in a need for intelligent control, an approach that attempts to meet these demands by emulating the capabilities found in biological systems. The ability to exploit existing knowledge is a desirable feature of any intelligent control system, and this leads to the relearning problem. The problem arises when a control system is required to effectively learn new knowledge whilst exploiting still-useful knowledge from past experiences. This thesis describes the adaptive critic system using reinforcement learning, a computational framework that can effectively address many of the demands in intelligent control, but is less effective when it comes to addressing the relearning problem. The thesis argues that biological mechanisms of reinforcement learning (and relearning) may provide inspiration for developing artificial intelligent control mechanisms that can better address the relearning problem. A conceptual model of ...

The Actor-Dueling-Critic Method for Reinforcement Learning

Model-free reinforcement learning is a powerful and efficient machine-learning paradigm that has been widely used in the robotic control domain. In the reinforcement learning setting, the value function method learns policies by maximizing the state-action value (Q value), but it suffers from inaccurate Q estimation, which results in poor performance in stochastic environments. To mitigate this issue, we present an approach based on the actor-critic framework. In the critic branch, we modify the way the Q value is estimated by introducing the advantage function, as in the dueling network, which can estimate the action-advantage value. Because the action-advantage value is independent of state and environment noise, we use it as a fine-tuning factor for the estimated Q value. We refer to this approach as the actor-dueling-critic (ADC) network, since the framework is inspired by the dueling network. Furthermore, we redesign the dueling network part of the critic branch to adapt it to continuous action spaces. The method was tested on gym classic control environments and an obstacle avoidance environment, and we designed a noisy environment to test training stability. The results indicate that the ADC approach is more stable and converges faster than the DDPG method in noisy environments.
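For readers unfamiliar with the dueling network that inspires the critic above, the sketch below shows the usual dueling aggregation for a discrete action set, in which a state value and mean-centred advantages are combined into Q values. It is only an illustration of that general idea: the function name `dueling_q` is hypothetical, and the paper's redesigned critic for continuous action spaces is not reproduced here.

```python
def dueling_q(state_value, advantages):
    """Combine V(s) and A(s, a) as Q(s, a) = V(s) + A(s, a) - mean_a A(s, a).

    Subtracting the mean advantage makes the split between the value
    and advantage streams identifiable.
    """
    mean_adv = sum(advantages) / len(advantages)
    return [state_value + adv - mean_adv for adv in advantages]


# Example: one state with three discrete actions.
q_values = dueling_q(2.0, [1.0, 0.0, -1.0])
```

Because the advantages are centred, adding a constant to every advantage leaves the resulting Q values unchanged; only the relative preference between actions matters.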