Solving deep memory POMDPs with recurrent policy gradients

Recurrent Policy Gradients

Journal of Algorithms, 2009

Reinforcement learning for partially observable Markov decision problems (POMDPs) is a challenge as it requires policies with an internal state. Traditional approaches suffer significantly from this shortcoming and usually make strong assumptions about the problem domain, such as perfect system models, state estimators and a Markovian hidden system. Recurrent neural networks (RNNs) offer a natural framework for dealing with policy learning using hidden state and require only a few limiting assumptions. As they can be trained well using gradient descent, they are well suited for policy gradient approaches.
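
For concreteness, the following is a minimal sketch of the idea behind recurrent policy gradients: an LSTM policy maps the observation history to action probabilities and is trained by backpropagation through time on a REINFORCE-style objective. It assumes discrete actions, a PyTorch implementation, and an episode given as (observation, action, reward) tuples; the network sizes are illustrative, not those of the paper.

    import torch
    import torch.nn as nn

    class RecurrentPolicy(nn.Module):
        def __init__(self, obs_dim, n_actions, hidden=64):
            super().__init__()
            self.rnn = nn.LSTM(obs_dim, hidden, batch_first=True)
            self.head = nn.Linear(hidden, n_actions)

        def forward(self, obs_seq, state=None):
            # obs_seq: (batch, time, obs_dim); the hidden state carries the history
            out, state = self.rnn(obs_seq, state)
            return torch.log_softmax(self.head(out), dim=-1), state

    def reinforce_update(policy, optimizer, episode, gamma=0.99):
        """One gradient step on a single episode of (obs, action, reward) tuples."""
        obs = torch.stack([o for o, _, _ in episode]).unsqueeze(0)   # (1, T, obs_dim)
        acts = torch.tensor([a for _, a, _ in episode])              # (T,)
        rews = [r for _, _, r in episode]

        # Discounted returns-to-go weight the log-probabilities (no baseline here).
        returns, g = [], 0.0
        for r in reversed(rews):
            g = r + gamma * g
            returns.append(g)
        returns = torch.tensor(list(reversed(returns)))

        logp, _ = policy(obs)                                        # (1, T, n_actions)
        logp_a = logp[0].gather(1, acts.unsqueeze(1)).squeeze(1)     # (T,)
        loss = -(logp_a * returns).sum()

        optimizer.zero_grad()
        loss.backward()   # backpropagation through time over the whole episode
        optimizer.step()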

Recurrent Off-policy Baselines for Memory-based Continuous Control

2021

When the environment is partially observable (PO), a deep reinforcement learning (RL) agent must learn a suitable temporal representation of the entire history in addition to a control strategy. This problem is not novel, and both model-free and model-based algorithms have been proposed for it. However, inspired by recent success in model-free image-based RL, we noticed the absence of a model-free baseline for history-based RL that (1) uses the full history and (2) incorporates recent advances in off-policy continuous control. Therefore, in this work we implement recurrent versions of DDPG, TD3, and SAC (RDPG, RTD3, and RSAC), evaluate them on short-term and long-term PO domains, and investigate key design choices. Our experiments show that RDPG and RTD3 can surprisingly fail on some domains and that RSAC is the most reliable, reaching near-optimal performance on nearly all domains. However, one task that requires systematic exploration still proved to be difficult, even f...
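
The recurrent actors in such off-policy baselines typically feed the observation history through an LSTM before an action head. Below is a minimal sketch of a recurrent squashed-Gaussian actor of the kind a recurrent SAC variant might use; the layer sizes, clamping bounds, and PyTorch framing are assumptions made for illustration rather than the paper's exact architecture.

    import torch
    import torch.nn as nn

    class RecurrentGaussianActor(nn.Module):
        def __init__(self, obs_dim, act_dim, hidden=128):
            super().__init__()
            self.rnn = nn.LSTM(obs_dim, hidden, batch_first=True)
            self.mu = nn.Linear(hidden, act_dim)
            self.log_std = nn.Linear(hidden, act_dim)

        def forward(self, obs_seq, state=None):
            # obs_seq: (batch, time, obs_dim); the LSTM summarises the full history.
            h, state = self.rnn(obs_seq, state)
            mu = self.mu(h)
            log_std = self.log_std(h).clamp(-20, 2)
            return mu, log_std, state

        def sample(self, obs_seq, state=None):
            mu, log_std, state = self(obs_seq, state)
            dist = torch.distributions.Normal(mu, log_std.exp())
            pre_tanh = dist.rsample()              # reparameterised sample
            action = torch.tanh(pre_tanh)          # squash into [-1, 1]
            # log-probability with the tanh correction, summed over action dims
            logp = dist.log_prob(pre_tanh) - torch.log(1 - action.pow(2) + 1e-6)
            return action, logp.sum(-1), state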

Reinforcement Learning via Recurrent Convolutional Neural Networks

2016 23rd International Conference on Pattern Recognition (ICPR), 2016

Deep Reinforcement Learning has enabled the learning of policies for complex tasks in partially observable environments, without explicitly learning the underlying model of the tasks. While such model-free methods achieve considerable performance, they often ignore the structure of the task. We present a natural representation of Reinforcement Learning (RL) problems using Recurrent Convolutional Neural Networks (RCNNs), to better exploit this inherent structure. We define 3 such RCNNs, whose forward passes execute an efficient Value Iteration, propagate beliefs of state in partially observable environments, and choose optimal actions, respectively. Backpropagating gradients through these RCNNs allows the system to explicitly learn the Transition Model and Reward Function associated with the underlying MDP, serving as an elegant alternative to classical model-based RL. We evaluate the proposed algorithms in simulation, considering a robot planning problem. We demonstrate the capability of our framework to reduce the cost of re-planning, learn accurate MDP models, and finally re-plan with the learnt models to achieve near-optimal policies.
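
The value-iteration RCNN can be pictured as a convolution applied recurrently to a reward map: each pass of a per-action convolution followed by a max over actions corresponds to one Bellman backup. The sketch below illustrates this idea in PyTorch under assumed kernel sizes and iteration counts; it is not the paper's exact network.

    import torch
    import torch.nn as nn

    class ConvValueIteration(nn.Module):
        def __init__(self, n_actions=8, k_iters=20):
            super().__init__()
            self.k_iters = k_iters
            # One 3x3 kernel per action approximates P(s'|s,a) on the grid.
            self.q_conv = nn.Conv2d(2, n_actions, kernel_size=3, padding=1, bias=False)

        def forward(self, reward_map):
            # reward_map: (batch, 1, H, W) reward function over the grid
            value = torch.zeros_like(reward_map)
            for _ in range(self.k_iters):
                # Stack reward and current value; convolve to get Q(s, a) maps.
                q = self.q_conv(torch.cat([reward_map, value], dim=1))
                # Bellman backup: V(s) = max_a Q(s, a).
                value, _ = q.max(dim=1, keepdim=True)
            return value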

Non-Markovian Control with Gated End-to-End Memory Policy Networks

arXiv, 2017

Partially observable environments present an important open challenge in the domain of sequential control learning with delayed rewards. Despite numerous attempts during the last two decades, the majority of reinforcement learning algorithms and associated approximate models applied to this context still assume Markovian state transitions. In this paper, we explore the use of a recently proposed attention-based model, the Gated End-to-End Memory Network, for sequential control. We call the resulting model the Gated End-to-End Memory Policy Network. More precisely, we use a model-free value-based algorithm to learn policies for partially observed domains using this memory-enhanced neural network. This model is end-to-end learnable and it features unbounded memory. Indeed, because of its attention mechanism and associated non-parametric memory, the proposed model allows us, unlike recurrent models, to define an attention mechanism over the whole observation stream. We show encouraging resul...
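
The core mechanism is attention over a growing, non-parametric memory of past observations rather than a fixed-size recurrent state. A heavily simplified sketch of that read operation, feeding a Q-value head for a value-based learner, is given below; the gating and multi-hop reads of the Gated End-to-End Memory Network are omitted, and all dimensions are illustrative assumptions.

    import torch
    import torch.nn as nn

    class AttentionMemoryPolicy(nn.Module):
        def __init__(self, obs_dim, n_actions, embed=64):
            super().__init__()
            self.key = nn.Linear(obs_dim, embed)    # memory keys
            self.val = nn.Linear(obs_dim, embed)    # memory values
            self.query = nn.Linear(obs_dim, embed)  # query from the current observation
            self.q_head = nn.Linear(embed, n_actions)

        def forward(self, history, current_obs):
            # history: (batch, T, obs_dim) -- grows with the episode (unbounded memory)
            # current_obs: (batch, obs_dim)
            k = self.key(history)                           # (batch, T, embed)
            v = self.val(history)                           # (batch, T, embed)
            q = self.query(current_obs).unsqueeze(1)        # (batch, 1, embed)
            attn = torch.softmax((q * k).sum(-1), dim=-1)   # attention over the stream
            read = (attn.unsqueeze(-1) * v).sum(1)          # (batch, embed)
            return self.q_head(read)                        # Q-values for a value-based learner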

Deep Reinforcement Learning using Memory-based Approaches

This paper focuses on the problem of navigation in a space using dynamic reinforcement learning. We build on the work by Zhu et al. [1], and explore the performance of target-driven visual navigation with memory layers added to the network. We evaluate our models using simulated 3D indoor scenes rendered by the Thor framework [1], and we show that in many cases, adding memory results in small improvements in episode path lengths for targets not trained on earlier. We use an actor-critic model with the policy as a function of the goal as well as the current state, which allows for better generalization.
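
A minimal sketch of such a goal-conditioned actor-critic is shown below: the observation stream passes through a memory (here an LSTM, standing in for the added memory layers), is fused with a goal embedding, and feeds separate policy and value heads. The fusion scheme and all sizes are illustrative assumptions, not the authors' exact network.

    import torch
    import torch.nn as nn

    class GoalConditionedActorCritic(nn.Module):
        def __init__(self, obs_dim, goal_dim, n_actions, hidden=256):
            super().__init__()
            self.obs_enc = nn.Linear(obs_dim, hidden)
            self.goal_enc = nn.Linear(goal_dim, hidden)
            self.memory = nn.LSTM(2 * hidden, hidden, batch_first=True)
            self.policy = nn.Linear(hidden, n_actions)   # actor head (logits)
            self.value = nn.Linear(hidden, 1)            # critic head

        def forward(self, obs_seq, goal, state=None):
            # obs_seq: (batch, T, obs_dim); goal: (batch, goal_dim)
            o = torch.relu(self.obs_enc(obs_seq))
            g = torch.relu(self.goal_enc(goal)).unsqueeze(1).expand_as(o)
            h, state = self.memory(torch.cat([o, g], dim=-1), state)
            return self.policy(h), self.value(h), state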

Learning Adaptive Driving Behavior Using Recurrent Deterministic Policy Gradients

2019 IEEE International Conference on Robotics and Biomimetics (ROBIO), 2019

In this work, we propose adaptive driving behaviors for simulated cars using continuous-control deep reinforcement learning. Deep Deterministic Policy Gradient (DDPG) is known to give smooth driving maneuvers in simulated environments. Unfortunately, simple feedforward networks lack the capability to retain temporal information, hence we use its recurrent variant, called Recurrent Deterministic Policy Gradients. Our trained agent adapts itself to the velocity of the traffic. It is capable of slowing down in the presence of dense traffic to prevent collisions, as well as speeding up and changing lanes in order to overtake when the traffic is sparse. The reasons for the above behavior, as well as our main contributions, are: (1) the application of Recurrent Deterministic Policy Gradients; (2) a novel reward function formulation; and (3) a modified replay buffer, called Near and Far Replay Buffers, wherein we maintain two replay buffers and sample equally from both of them.
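
The third contribution is easy to make concrete: keep two replay buffers and draw half of every minibatch from each. A minimal sketch follows; the criterion used to route a transition to the "near" or "far" buffer (the is_near flag below) is a hypothetical placeholder, since the paper's exact rule is not reproduced here.

    import random
    from collections import deque

    class NearFarReplayBuffer:
        def __init__(self, capacity=100_000):
            self.near = deque(maxlen=capacity)
            self.far = deque(maxlen=capacity)

        def add(self, transition, is_near):
            # Route each transition to one of the two buffers.
            (self.near if is_near else self.far).append(transition)

        def sample(self, batch_size):
            half = batch_size // 2
            # Sample equally from both buffers so dense-traffic and sparse-traffic
            # experience are both represented in every minibatch.
            batch = random.sample(self.near, min(half, len(self.near)))
            batch += random.sample(self.far, min(batch_size - len(batch), len(self.far)))
            return batch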

Constrained representation learning for recurrent policy optimisation under uncertainty

Adaptive Behavior

Learning to make decisions in partially observable environments is a notorious problem that requires a complex representation of controllers. In most work, the controllers are designed as a non-linear mapping from a sequence of temporal observations to actions. These problems can, in principle, be formulated as a partially observable Markov decision process whose policy can be parameterised through the use of recurrent neural networks. In this paper, we propose an alternative framework that (a) uses the Long Short-Term Memory (LSTM) Encoder-Decoder framework to learn an internal state representation for historical observations and then (b) integrates it into existing recurrent policy models to improve task performance. The LSTM Encoder encodes a history of observations as input into a representation of internal states. The LSTM Decoder can perform two alternative decoding tasks: predicting the same input observation sequence or predicting future observation sequences. The f...
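
A minimal sketch of the encoder-decoder component is given below: an LSTM encoder compresses the observation history into an internal-state representation, and an LSTM decoder is trained (with teacher forcing, an assumption made here) either to reconstruct the same sequence or to predict future observations. The returned encoder state is what would be handed to a recurrent policy; all sizes and the reconstruction loss are illustrative.

    import torch
    import torch.nn as nn

    class HistoryEncoderDecoder(nn.Module):
        def __init__(self, obs_dim, hidden=128):
            super().__init__()
            self.encoder = nn.LSTM(obs_dim, hidden, batch_first=True)
            self.decoder = nn.LSTM(obs_dim, hidden, batch_first=True)
            self.out = nn.Linear(hidden, obs_dim)

        def forward(self, history, targets):
            # history: (batch, T, obs_dim); targets: (batch, T', obs_dim), either the
            # same sequence (reconstruction) or future observations (prediction).
            _, state = self.encoder(history)          # state = internal representation
            # Teacher forcing: feed the shifted target sequence to the decoder.
            dec_in = torch.cat([history[:, -1:], targets[:, :-1]], dim=1)
            dec_out, _ = self.decoder(dec_in, state)
            pred = self.out(dec_out)
            recon_loss = nn.functional.mse_loss(pred, targets)
            return state, recon_loss                  # state can feed a recurrent policy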

Memory Augmented Control Networks

arXiv, 2017

Planning problems in partially observable environments cannot be solved directly with convolutional networks and require some form of memory. However, even memory networks with sophisticated addressing schemes are unable to learn intelligent reasoning satisfactorily due to the complexity of simultaneously learning to access memory and to plan. To mitigate these challenges we propose the Memory Augmented Control Network (MACN). The network splits planning into a hierarchical process. At a lower level, it learns to plan in a locally observed space. At a higher level, it uses a collection of policies computed on locally observed spaces to learn an optimal plan in the global environment it is operating in. The performance of the network is evaluated on path-planning tasks in environments with simple and complex obstacles, and the network is additionally tested for its ability to generalize to new environments not seen in the training set.
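
Loosely interpreted, the hierarchy can be sketched as a low-level module that processes each locally observed patch and a higher-level recurrent controller that stitches those local results into a global plan. The sketch below is only a rough illustration of that split under assumed components; it does not reproduce the MACN memory architecture or its addressing scheme.

    import torch
    import torch.nn as nn

    class HierarchicalMemoryController(nn.Module):
        def __init__(self, patch_feat=64, mem_dim=64, n_actions=4):
            super().__init__()
            # Stands in for a low-level planner over the locally observed patch.
            self.local_planner = nn.Sequential(
                nn.Linear(patch_feat, mem_dim), nn.ReLU(), nn.Linear(mem_dim, mem_dim))
            # The recurrent state stands in for the memory that accumulates local plans.
            self.controller = nn.LSTM(mem_dim, mem_dim, batch_first=True)
            self.action_head = nn.Linear(mem_dim, n_actions)

        def forward(self, local_patches):
            # local_patches: (batch, T, patch_feat), one feature vector per step's
            # locally observed map region.
            local_plans = self.local_planner(local_patches)   # low-level planning
            h, _ = self.controller(local_plans)               # high-level aggregation
            return self.action_head(h)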

Steadily Learn to Drive with Virtual Memory

Proceedings of the 11th Asia-Pacific Regional Conference of the ISTVS

Reinforcement learning has shown great potential in developing high-level autonomous driving systems. However, for high-dimensional tasks, current RL methods suffer from low data efficiency and oscillation in the training process. This paper proposes an algorithm called Learn to drive with Virtual Memory (LVM) to overcome these problems. LVM compresses the high-dimensional information into compact latent states and learns a latent dynamic model to summarize the agent's experience. Various imagined latent trajectories are generated as virtual memory by the latent dynamic model. The policy is learned by propagating gradient through the learned latent model with the imagined latent trajectories and thus leads to high data efficiency. Furthermore, a double critic structure is designed to reduce the oscillation during the training process. The effectiveness of LVM is demonstrated by an image-input autonomous driving task, in which LVM outperforms the existing method in terms of data ...
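
The "virtual memory" idea amounts to rolling a policy forward inside a learned latent dynamics model and backing the imagined return up through a double critic (the minimum of two value estimates) to curb over-estimation and oscillation. A minimal sketch under assumed module shapes is given below; the paper's exact latent model and training losses are not reproduced.

    import torch
    import torch.nn as nn

    class LatentDynamics(nn.Module):
        def __init__(self, latent_dim=32, act_dim=2, hidden=128):
            super().__init__()
            self.step = nn.Sequential(
                nn.Linear(latent_dim + act_dim, hidden), nn.ELU(),
                nn.Linear(hidden, latent_dim))
            self.reward = nn.Linear(latent_dim, 1)

        def forward(self, z, a):
            # Predict the next latent state and its reward.
            z_next = self.step(torch.cat([z, a], dim=-1))
            return z_next, self.reward(z_next)

    def imagine(dynamics, policy, critic1, critic2, z0, horizon=15, gamma=0.99):
        """Roll the policy forward inside the latent model and return the imagined
        return, using the minimum of two critics at the horizon."""
        z, total, discount = z0, 0.0, 1.0
        for _ in range(horizon):
            a = policy(z)                              # action from the latent state
            z, r = dynamics(z, a)                      # imagined transition and reward
            total = total + discount * r
            discount *= gamma
        bootstrap = torch.min(critic1(z), critic2(z))  # double-critic bootstrap value
        return total + discount * bootstrap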

Two stochastic dynamic programming problems by model-free actor-critic recurrent-network learning in non-Markovian settings

2004

We describe two stochastic non-Markovian dynamic programming (DP) problems, showing how the posed problems can be attacked by using actor-critic reinforcement learning with recurrent neural networks (RNNs). We assume that the current state of a dynamical system is "completely observable," but that the rules, unknown to our decision-making agent, for the current reward and state transition depend not only on the current state and action, but possibly on the "entire history" of past states and actions. This should not be confused with "partially observable Markov decision processes (POMDPs)," where the current state can only be deduced from either the partial (observable) state alone or from error-corrupted observations [11]. Our actor-critic RNN agent is capable of finding an optimal policy, while learning neither the transition probabilities, the associated rewards, nor by how much the current state space must be augmented so that the Markov property holds. The RNN's recurrent connections or context units function as an "implicit" history memory (or internal state) to develop "sensitivity" to non-Markovian dependencies, rendering the process Markovian implicitly and automatically in a "totally model-free" fashion. In particular, using two small-scale longest-path problems in a stochastic non-Markovian setting, we discuss model-free learning features in comparison with the model-based approach of the classical DP algorithm.
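
A minimal sketch of such a model-free actor-critic RNN is shown below: the recurrent core's hidden state acts as the implicit history memory, and the shared features feed a policy head and a value head. Discrete actions, a GRU core, and all sizes are assumptions made for illustration; the original work uses its own recurrent formulation.

    import torch
    import torch.nn as nn

    class RecurrentActorCritic(nn.Module):
        def __init__(self, obs_dim, n_actions, hidden=64):
            super().__init__()
            # The recurrent layer's hidden state plays the role of context units
            # accumulating non-Markovian dependencies on past states and actions.
            self.core = nn.GRU(obs_dim, hidden, batch_first=True)
            self.actor = nn.Linear(hidden, n_actions)   # policy head
            self.critic = nn.Linear(hidden, 1)          # state-value head

        def forward(self, obs_seq, state=None):
            h, state = self.core(obs_seq, state)
            return torch.softmax(self.actor(h), dim=-1), self.critic(h), state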