Two stochastic dynamic programming problems by model-free actor-critic recurrent-network learning in non-Markovian settings

Recurrent Policy Gradients

Journal of Algorithms, 2009

Reinforcement learning for partially observable Markov decision problems (POMDPs) is a challenge as it requires policies with an internal state. Traditional approaches suffer significantly from this shortcoming and usually make strong assumptions about the problem domain, such as perfect system models, state estimators, and a Markovian hidden system. Recurrent neural networks (RNNs) offer a natural framework for policy learning with hidden state and require only a few limiting assumptions. As they can be trained well using gradient descent, they are well suited to policy gradient approaches.

Totally model-free reinforcement learning by actor-critic Elman networks in non-Markovian domains

1998

In this paper we describe how an actor-critic reinforcement learning agent in a non-Markovian domain finds an optimal sequence of actions in a totally model-free fashion; that is, the agent learns neither transitional probabilities and associated rewards, nor by how much the state space should be augmented so that the Markov property holds. In particular, we employ an Elman-type recurrent neural network to solve non-Markovian problems, since an Elman-type network is able to implicitly and automatically render the process Markovian.
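
The mechanism at the heart of this approach is simple to state: the Elman network's context units copy the previous hidden layer back in at the next step, so the hidden state accumulates a history of past observations that the actor and critic heads can exploit. A minimal sketch of such a forward pass follows; the sizes, nonlinearity, and heads are illustrative assumptions, not the paper's architecture.

```python
# Minimal sketch (not the paper's code): an Elman-style actor-critic forward pass.
# The context vector simply copies the previous hidden layer, giving the agent an
# implicit memory of past observations.
import numpy as np

rng = np.random.default_rng(0)
n_obs, n_hidden, n_actions = 4, 8, 3

W_in  = rng.normal(scale=0.1, size=(n_hidden, n_obs))     # observation -> hidden
W_ctx = rng.normal(scale=0.1, size=(n_hidden, n_hidden))  # context (previous hidden) -> hidden
W_pi  = rng.normal(scale=0.1, size=(n_actions, n_hidden)) # actor head
w_v   = rng.normal(scale=0.1, size=n_hidden)              # critic head

def step(obs, context):
    """One Elman step: new hidden state, action probabilities, state value."""
    h = np.tanh(W_in @ obs + W_ctx @ context)
    logits = W_pi @ h
    probs = np.exp(logits - logits.max()); probs /= probs.sum()
    value = w_v @ h
    return h, probs, value

context = np.zeros(n_hidden)
obs = rng.normal(size=n_obs)
context, probs, value = step(obs, context)   # the returned hidden state becomes the next context
print(probs, value)
```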

On using discretized Cohen-Grossberg node dynamics for model-free actor-critic neural learning in non-Markovian domains

2003

We describe how multi-stage non-Markovian decision problems can be solved using actor-critic reinforcement learning by assuming that a discrete version of Cohen-Grossberg node dynamics describes the node-activation computations of a neural network (NN). Our NN (i.e., agent) is capable of rendering the process Markovian implicitly and automatically in a totally model-free fashion without learning by how much the state space must be augmented so that the Markov property holds. This serves as an alternative to using Elman or Jordan-type recurrent neural networks, whose context units function as a history memory in order to develop sensitivity to non-Markovian dependencies. We shall demonstrate our concept using a small-scale non-Markovian deterministic path problem, in which our actor-critic NN finds an optimal sequence of actions (but learns neither transitional dynamics nor associated rewards), although it needs many iterations due to the nature of neural model-free learning. This is, in spirit, a neurodynamic programming approach.
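
For reference, the standard continuous-time Cohen-Grossberg node dynamics and a forward-Euler discretization of them take the form below; this is the textbook form, and the paper's exact choice of the amplification, self-signal, and coupling functions is not specified in the abstract:

$$\frac{dx_i}{dt} = a_i(x_i)\Big[\,b_i(x_i) - \sum_j c_{ij}\, d_j(x_j)\Big],
\qquad
x_i^{(t+1)} = x_i^{(t)} + \Delta t\, a_i\!\big(x_i^{(t)}\big)\Big[\,b_i\!\big(x_i^{(t)}\big) - \sum_j c_{ij}\, d_j\!\big(x_j^{(t)}\big)\Big],$$

where $a_i$ is a positive amplification function, $b_i$ a self-signal term, $c_{ij}$ the connection weights, and $d_j$ a monotone activation function.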

Non-Markovian Control with Gated End-to-End Memory Policy Networks

ArXiv, 2017

Partially observable environments present an important open challenge in the domain of sequential control learning with delayed rewards. Despite numerous attempts over the last two decades, the majority of reinforcement learning algorithms and associated approximate models applied in this context still assume Markovian state transitions. In this paper, we explore the use of a recently proposed attention-based model, the Gated End-to-End Memory Network, for sequential control. We call the resulting model the Gated End-to-End Memory Policy Network. More precisely, we use a model-free value-based algorithm to learn policies for partially observed domains using this memory-enhanced neural network. The model is end-to-end learnable and features unbounded memory: thanks to its attention mechanism over a non-parametric memory, it can attend directly to the observation stream, unlike recurrent models. We show encouraging resul...
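
For intuition, a single gated memory "hop" of this kind can be sketched as attention over a buffer of embedded past observations followed by a highway-style gate on the controller state. The snippet below is an illustrative sketch under those assumptions, not the paper's model; names and dimensions are invented.

```python
# Illustrative sketch: one gated memory hop over a stream of observation embeddings.
import numpy as np

rng = np.random.default_rng(0)
d = 16                                 # embedding size
memories = rng.normal(size=(10, d))    # embedded past observations (non-parametric memory)
u = rng.normal(size=d)                 # controller / query state
W_T, b_T = rng.normal(scale=0.1, size=(d, d)), np.zeros(d)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Attention over the observation stream.
scores = memories @ u
p = np.exp(scores - scores.max()); p /= p.sum()
o = p @ memories                       # retrieved memory summary

# Highway-style gate decides how much of the retrieved memory enters the state.
t = sigmoid(W_T @ u + b_T)
u_next = t * o + (1.0 - t) * u         # gated controller update

q_values = rng.normal(scale=0.1, size=(4, d)) @ u_next   # value-based head over actions
print(q_values.argmax())
```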

Solving deep memory POMDPs with recurrent policy gradients

2007

This paper presents Recurrent Policy Gradients, a model-free reinforcement learning (RL) method creating limited-memory stochastic policies for partially observable Markov decision problems (POMDPs) that require long-term memories of past observations. The approach involves approximating a policy gradient for a Recurrent Neural Network (RNN) by backpropagating return-weighted characteristic eligibilities through time. Using a “Long Short-Term Memory” architecture, we are able to outperform other RL methods on two important benchmark tasks. Furthermore, we show promising results on a complex car driving simulation task.
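
The core update can be sketched compactly: run an LSTM over the observation history, treat the per-step log-likelihoods of the chosen actions as characteristic eligibilities, weight them by the returns, and backpropagate through time. A minimal sketch assuming PyTorch, a discrete action space, and placeholder data; the paper's tasks and baselines are not reproduced here.

```python
# Minimal sketch of a recurrent policy gradient step (assumptions: PyTorch, discrete actions).
import torch
import torch.nn as nn

obs_dim, hidden, n_actions, T = 4, 32, 3, 20
lstm = nn.LSTM(obs_dim, hidden, batch_first=True)
policy_head = nn.Linear(hidden, n_actions)
opt = torch.optim.Adam(list(lstm.parameters()) + list(policy_head.parameters()), lr=1e-3)

# Dummy episode: observations, sampled actions, and per-step returns R_t.
obs = torch.randn(1, T, obs_dim)
h, _ = lstm(obs)                                   # hidden state summarizes the history
dist = torch.distributions.Categorical(logits=policy_head(h))
actions = dist.sample()
returns = torch.randn(1, T)                        # placeholder returns from the environment

# Return-weighted log-likelihood gradients, backpropagated through time.
loss = -(returns.detach() * dist.log_prob(actions)).sum()
opt.zero_grad()
loss.backward()
opt.step()
```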

Constrained representation learning for recurrent policy optimisation under uncertainty

Adaptive Behavior

Learning to make decisions in partially observable environments is a notorious problem that requires a complex representation of controllers. In most work, the controllers are designed as a non-linear mapping from a sequence of temporal observations to actions. These problems can, in principle, be formulated as a partially observable Markov decision process whose policy can be parameterised through the use of recurrent neural networks. In this paper, we propose an alternative framework that (a) uses the Long Short-Term Memory (LSTM) Encoder-Decoder framework to learn an internal state representation for historical observations and then (b) integrates it into existing recurrent policy models to improve the task performance. The LSTM Encoder encodes a history of observations as input into a representation of internal states. The LSTM Decoder can perform two alternative decoding tasks: predicting the same input observation sequence or predicting future observation sequences. The f...
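
A minimal sketch of the two decoding variants described above, assuming PyTorch and teacher forcing; names, sizes, and the mean-squared-error objective are illustrative, not the paper's configuration.

```python
# Sketch: encode a history of observations into an internal state, then either
# reconstruct that history or predict future observations from it.
import torch
import torch.nn as nn

obs_dim, hidden, T_past, T_future = 6, 32, 15, 5
encoder = nn.LSTM(obs_dim, hidden, batch_first=True)
decoder = nn.LSTM(obs_dim, hidden, batch_first=True)
readout = nn.Linear(hidden, obs_dim)

past = torch.randn(1, T_past, obs_dim)
future = torch.randn(1, T_future, obs_dim)

_, (h, c) = encoder(past)                  # internal state representation of the history

# (a) Reconstruction: decode the same input sequence from the encoded state.
rec, _ = decoder(past, (h, c))
loss_rec = ((readout(rec) - past) ** 2).mean()

# (b) Prediction: decode future observations from the encoded state (teacher-forced here).
pred, _ = decoder(future, (h, c))
loss_pred = ((readout(pred) - future) ** 2).mean()

# The encoded state (h, c) can then condition a recurrent policy network.
```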

Reinforcement Learning via Recurrent Convolutional Neural Networks

2016 23rd International Conference on Pattern Recognition (ICPR), 2016

Deep Reinforcement Learning has enabled the learning of policies for complex tasks in partially observable environments, without explicitly learning the underlying model of the tasks. While such model-free methods achieve considerable performance, they often ignore the structure of the task. We present a natural representation of Reinforcement Learning (RL) problems using Recurrent Convolutional Neural Networks (RCNNs) to better exploit this inherent structure. We define three such RCNNs, whose forward passes execute an efficient Value Iteration, propagate beliefs of state in partially observable environments, and choose optimal actions, respectively. Backpropagating gradients through these RCNNs allows the system to explicitly learn the Transition Model and Reward Function associated with the underlying MDP, serving as an elegant alternative to classical model-based RL. We evaluate the proposed algorithms in simulation, considering a robot planning problem. We demonstrate the capability of our framework to reduce the cost of re-planning, learn accurate MDP models, and finally re-plan with learnt models to achieve near-optimal policies.
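
To see why value iteration maps naturally onto a recurrent convolutional pass, consider a toy grid world in which each action corresponds to a fixed spatial shift; one recurrent iteration then computes Q as a shifted, discounted copy of V plus the reward map, followed by a max over action channels. The sketch below hard-codes the shifts and rewards (with wrap-around dynamics for brevity), whereas the paper's RCNN learns the transition and reward kernels by backpropagation.

```python
# Toy sketch: value iteration as a convolution-like recurrence on a grid.
import numpy as np

H = W = 8
gamma = 0.9
R = np.zeros((H, W)); R[6, 6] = 1.0         # single rewarding cell
V = np.zeros((H, W))

# Each "channel" of Q corresponds to one action (a move in one of four directions):
# Q_a = R + gamma * shift_a(V), followed by a max over the action channels.
shifts = [(-1, 0), (1, 0), (0, -1), (0, 1)]
for _ in range(50):                          # each loop iteration is one recurrent pass
    Q = np.stack([R + gamma * np.roll(V, s, axis=(0, 1)) for s in shifts])
    V = Q.max(axis=0)

policy = Q.argmax(axis=0)                    # greedy action per cell
print(V[6, 6], policy[5, 6])
```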

Reinforcement Learning with Non-Markovian Rewards

Proceedings of the AAAI Conference on Artificial Intelligence, 2020

The standard RL world model is that of a Markov Decision Process (MDP). A basic premise of MDPs is that the rewards depend on the last state and action only. Yet, many real-world rewards are non-Markovian. For example, a reward for bringing coffee only if it was requested earlier and not yet served is non-Markovian if the state records only current requests and deliveries. Past work considered the problem of modeling and solving MDPs with non-Markovian rewards (NMR), but we know of no principled approaches for RL with NMR. Here, we address the problem of policy learning from experience with such rewards. We describe and evaluate empirically four combinations of the classical RL algorithms Q-learning and R-max with automata learning algorithms, obtaining new RL algorithms for domains with NMR. We also prove that some of these variants converge to an optimal policy in the limit.
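
The key construction behind such combinations can be sketched as follows: once an automaton tracking the reward condition is available (learned or given), ordinary Q-learning is run over the product of the environment state and the automaton state, which restores the Markov property. The tiny coffee-style automaton below is an illustration under that assumption, not the paper's benchmark or its automata learning step.

```python
# Sketch: Q-learning over the product of environment state and automaton state.
from collections import defaultdict

# Automaton states: 0 = no request, 1 = coffee requested, 2 = coffee served.
def automaton_step(u, label):
    if u == 0 and label == "request":
        return 1
    if u == 1 and label == "serve":
        return 2
    return u

def reward(u, u_next):
    return 1.0 if (u == 1 and u_next == 2) else 0.0   # reward only for serving after a request

Q = defaultdict(float)                  # Q[(env_state, automaton_state, action)]
alpha, gamma, actions = 0.1, 0.95, ["move", "request", "serve"]

def q_update(s, u, a, s_next, label):
    """One Q-learning update on the product state; returns the next automaton state."""
    u_next = automaton_step(u, label)
    r = reward(u, u_next)
    best_next = max(Q[(s_next, u_next, b)] for b in actions)
    Q[(s, u, a)] += alpha * (r + gamma * best_next - Q[(s, u, a)])
    return u_next

u = q_update("kitchen", 0, "request", "kitchen", "request")   # example transition
```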

Guided Soft Actor Critic: A Guided Deep Reinforcement Learning Approach for Partially Observable Markov Decision Processes

IEEE Access

Most real-world problems are essentially partially observable, and the environmental model is unknown. Therefore, there is a significant need for reinforcement learning approaches to solve them, where the agent perceives the state of the environment partially and noisily. Guided reinforcement learning methods solve this issue by providing additional state knowledge to reinforcement learning algorithms during the learning process, allowing them to solve a partially observable Markov decision process (POMDP) more effectively. However, these guided approaches are relatively rare in the literature, and most existing approaches are model-based, meaning that they require learning an appropriate model of the environment first. In this paper, we propose a novel model-free approach that combines the soft actor-critic method and supervised learning concept to solve real-world problems, formulating them as POMDPs. In experiments performed on OpenAI Gym, an open-source simulation platform, our guided soft actor-critic approach outperformed other baseline algorithms, gaining 7∼20% more maximum average return on five partially observable tasks constructed based on continuous control problems and simulated in MuJoCo.
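
The abstract does not spell out how the supervised signal enters the objective; one plausible reading, sketched below purely for illustration, is to use the privileged (full) state as a training-time regression target for an auxiliary head on the policy encoder and add that term to the actor loss. All names, shapes, the placeholder actor objective, and the weighting are assumptions, not the paper's method.

```python
# Conceptual sketch of a "guided" objective combining an actor loss with a
# supervised term that exploits privileged state information during training.
import torch
import torch.nn as nn

obs_dim, state_dim, act_dim, hidden = 8, 12, 2, 64
encoder = nn.Sequential(nn.Linear(obs_dim, hidden), nn.ReLU())
actor_head = nn.Linear(hidden, act_dim)          # e.g., mean of a Gaussian policy
state_head = nn.Linear(hidden, state_dim)        # auxiliary head supervised by the full state

obs = torch.randn(32, obs_dim)                   # partial, noisy observations
full_state = torch.randn(32, state_dim)          # privileged state, available only in training

z = encoder(obs)
actor_loss = (actor_head(z) ** 2).mean()         # placeholder for the soft actor-critic objective
guide_loss = ((state_head(z) - full_state) ** 2).mean()
loss = actor_loss + 0.5 * guide_loss             # guided objective; the weight is arbitrary here
loss.backward()
```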

Learning Probabilistic Reward Machines from Non-Markovian Stochastic Reward Processes

2021

The success of reinforcement learning in typical settings is, in part, predicated on underlying Markovian assumptions on the reward signal by which an agent learns optimal policies. In recent years, the use of reward machines has relaxed this assumption by enabling a structured representation of non-Markovian rewards. In particular, such representations can be used to augment the state space of the underlying decision process, thereby facilitating non-Markovian reinforcement learning. However, these reward machines cannot capture the semantics of stochastic reward signals. In this paper, we make progress on this front by introducing probabilistic reward machines (PRMs) as a representation of non-Markovian stochastic rewards. We present an algorithm to learn PRMs from the underlying decision process as well as to learn the PRM representation of a given decision-making policy.
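
An illustrative data structure for such a machine: like an ordinary reward machine, transitions are driven by observed labels, but each transition carries a distribution over rewards rather than a single value. The structure, field names, and example below are assumptions based on the description above, not the paper's formalism.

```python
# Illustrative probabilistic reward machine (PRM) with stochastic per-transition rewards.
import random

class ProbabilisticRewardMachine:
    def __init__(self, transitions, initial_state):
        # transitions[(u, label)] = (u_next, [(reward, probability), ...])
        self.transitions = transitions
        self.state = initial_state

    def step(self, label):
        u_next, reward_dist = self.transitions[(self.state, label)]
        rewards, probs = zip(*reward_dist)
        r = random.choices(rewards, weights=probs)[0]   # stochastic reward draw
        self.state = u_next
        return r

# Tiny example: a request must precede a serve; serving then pays 1 with probability 0.9.
prm = ProbabilisticRewardMachine(
    transitions={
        (0, "request"): (1, [(0.0, 1.0)]),
        (0, "serve"):   (0, [(0.0, 1.0)]),
        (1, "serve"):   (0, [(1.0, 0.9), (0.0, 0.1)]),
        (1, "request"): (1, [(0.0, 1.0)]),
    },
    initial_state=0,
)
print(prm.step("request"), prm.step("serve"))
```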