META-Learning State-based λ for More Sample-Efficient Policy Evaluation

META-Learning State-based Eligibility Traces for More Sample-Efficient Policy Evaluation

2020

Temporal-Difference (TD) learning is a standard and very successful reinforcement learning approach, at the core both of algorithms that learn the value of a given policy and of algorithms that learn how to improve policies. TD-learning with eligibility traces provides a way to boost sample efficiency by temporal credit assignment, i.e., deciding which portion of a reward should be assigned to predecessor states that occurred at different previous times, controlled by a parameter λ. However, tuning this parameter can be time-consuming, and not tuning it can lead to inefficient learning. For better sample efficiency of TD-learning, we propose a meta-learning method for adjusting the eligibility trace parameter in a state-dependent manner. The adaptation is achieved with the help of auxiliary learners that learn distributional information about the update targets online, incurring roughly the same computational complexity per step as the usual value learner. Our approach...
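For context, the sketch below shows tabular TD(λ) in which the trace-decay parameter is a function of the current state rather than a single global constant. The environment interface (`num_states`, `reset`, `step`), the `lambda_fn` argument, and all constants are illustrative assumptions; the paper's contribution is the online meta-learning of this state-dependent λ via auxiliary learners, which is not reproduced here.

```python
import numpy as np

def td_lambda_state_based(env, policy, lambda_fn, num_episodes=100,
                          alpha=0.1, gamma=0.99):
    """Tabular TD(lambda) with a state-dependent trace-decay parameter.

    lambda_fn(s) supplies the trace decay associated with state s; in the
    paper this quantity is meta-learned online, here it is simply a
    user-provided function (an assumption for illustration).
    """
    V = np.zeros(env.num_states)
    for _ in range(num_episodes):
        e = np.zeros(env.num_states)      # accumulating eligibility trace
        s = env.reset()
        done = False
        while not done:
            a = policy(s)
            s_next, r, done = env.step(a)
            target = r + (0.0 if done else gamma * V[s_next])
            delta = target - V[s]
            e *= gamma * lambda_fn(s)     # decay all traces with lambda(s)
            e[s] += 1.0                   # bump the current state's trace
            V += alpha * delta * e        # credit recently visited states
            s = s_next
    return V
```

Setting `lambda_fn = lambda s: 0.9` recovers ordinary TD(λ) with a constant trace parameter.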

Faster and More Accurate Trace-based Policy Evaluation via Overall Target Error Meta-Optimization

2019

To improve the speed and accuracy of the trace-based policy evaluation method TD(λ), under appropriate assumptions, we derive and propose an off-policy compatible method for meta-learning state-based λ's online with efficient incremental updates. Furthermore, we prove that the derived bias-variance trade-off minimization method is, with slight adjustments, equivalent to minimizing the overall target error in terms of state-based λ's. In experiments, the method shows significantly better performance than the existing method and the baselines.
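For reference, the state-based λ-return targeted by this kind of method is usually written recursively, with λ evaluated at the successor state; this is the standard textbook form rather than a formula quoted from the paper:

```latex
G_t^{\lambda} \;=\; R_{t+1} \;+\; \gamma\Big[\big(1-\lambda(S_{t+1})\big)\,\hat{v}(S_{t+1}) \;+\; \lambda(S_{t+1})\,G_{t+1}^{\lambda}\Big]
```

Setting λ(·) ≡ 0 recovers the one-step TD target and λ(·) ≡ 1 recovers the Monte Carlo return, which is exactly the bias-variance trade-off the state-based meta-optimization operates on.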

Off-Policy Temporal Difference Learning with Function Approximation

2001

We introduce the first algorithm for off-policy temporal-difference learning that is stable with linear function approximation. Off-policy learning is of interest because it forms the basis for popular reinforcement learning methods such as Q-learning, which has been known to diverge with linear function approximation, and because it is critical to the practical utility of multi-scale, multi-goal learning frameworks such as options, HAMs, and MAXQ. Our new algorithm combines TD(λ) over state-action pairs with importance sampling ideas from our previous work. We prove that, given training under any ε-soft policy, the algorithm converges w.p.1 to a close approximation (as in Tsitsiklis and Van Roy, 1997) to the action-value function for an arbitrary target policy. Variations of the algorithm designed to reduce variance introduce additional bias but are also guaranteed convergent. We also illustrate our method empirically on a small policy evaluation problem. Our current results are limited to episodic tasks with episodes of bounded length.
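The core mechanism, combining TD(λ) over state-action pairs with importance-sampling corrections, can be sketched in a simplified per-decision form; the notation below is a generic reconstruction for intuition, and the 2001 algorithm's actual updates (including how it weights whole episodes) differ in detail:

```latex
\rho_t = \frac{\pi(A_t \mid S_t)}{b(A_t \mid S_t)}, \qquad
e_t = \rho_t\big(\gamma\lambda\, e_{t-1} + \phi(S_t, A_t)\big), \qquad
\theta_{t+1} = \theta_t + \alpha\,\delta_t\, e_t
```

Here b is the behavior policy, φ the state-action features, and δ_t the TD error; the ratio ρ_t reweights the trace so that expectations are taken under the target policy π.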

Adaptive Step-Size for Online Temporal Difference Learning

Twenty-Sixth AAAI Conference on Artificial Intelligence, 2012

The step-size, often denoted as α, is a key parameter for most incremental learning algorithms. Its importance is especially pronounced when performing online temporal difference (TD) learning with function approximation. Several methods have been developed to adapt the step-size online. These range from straightforward back-off strategies to adaptive algorithms based on gradient descent. We derive an adaptive upper bound on the step-size parameter to guarantee that online TD learning with linear function approximation will not diverge. We then empirically evaluate algorithms using this upper bound as a heuristic for adapting the step-size parameter online. We compare performance with related work including HL(λ) and Autostep. Our results show that this adaptive upper bound heuristic outperforms all existing methods without requiring any meta-parameters. This effectively eliminates the need to tune the learning rate of temporal difference learning with linear function approximation.

The classical convergence conditions require decaying step-sizes satisfying Σ_{t=0}^∞ α_t = ∞ and Σ_{t=0}^∞ α_t² < ∞, where α_t denotes the step-size at time step t. Meeting these requirements will result in consistent learning, but can make convergence slower. On the other hand, choosing a fixed step-size that is too large can lead to very fast convergence in practice, but also has a greater chance of causing the process to diverge. The unfortunate irony is that often the best fixed step-size is the largest value that does not cause divergence. Thus, on one side of this...
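As a concrete illustration of capping the step-size to prevent divergence, the sketch below applies one linear TD(0) update with the effective step-size limited so that a single update cannot overshoot its own target. The specific cap, 1 / |φ_tᵀ(φ_t − γφ_{t+1})|, is an assumption chosen to match that overshoot argument and is not claimed to be the exact bound derived in the paper.

```python
import numpy as np

def td0_update_with_alpha_cap(theta, phi_t, phi_next, reward,
                              alpha=0.5, gamma=0.99, done=False):
    """One linear TD(0) update with an adaptive upper bound on the step-size.

    The cap 1 / |phi_t @ (phi_t - gamma * phi_next)| is an illustrative
    stand-in (assumption): it is the step-size at which this single update
    would exactly zero the current TD error, so exceeding it overshoots.
    """
    phi_next_eff = np.zeros_like(phi_t) if done else phi_next
    delta = reward + gamma * (theta @ phi_next_eff) - theta @ phi_t
    curvature = abs(phi_t @ (phi_t - gamma * phi_next_eff))
    alpha_cap = 1.0 / curvature if curvature > 1e-12 else np.inf
    effective_alpha = min(alpha, alpha_cap)   # never step past the cap
    return theta + effective_alpha * delta * phi_t
```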

Accelerated gradient temporal difference learning algorithms

2014 IEEE Symposium on Adaptive Dynamic Programming and Reinforcement Learning (ADPRL), 2014

In this paper we study Temporal Difference (TD) learning with linear value function approximation. The classic TD algorithm is known to be unstable with linear function approximation and off-policy learning. Recently developed Gradient TD (GTD) algorithms have addressed this problem successfully. Despite their prominent properties of good scalability and convergence to correct solutions, they inherit the potential weakness of slow convergence, as they are stochastic gradient descent algorithms. Accelerated stochastic gradient descent algorithms have been developed to speed up convergence while still keeping computational complexity low. In this work, we develop an accelerated stochastic gradient descent method for minimizing the Mean Squared Projected Bellman Error (MSPBE), and derive a bound for the Lipschitz constant of the gradient of the MSPBE, which plays a critical role in our proposed accelerated GTD algorithms. Our comprehensive numerical experiments demonstrate promising performance in solving the policy evaluation problem, in comparison to the GTD algorithm family. In particular, accelerated TDC surpasses state-of-the-art algorithms.
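For reference, the MSPBE objective minimized by the GTD family has the standard quadratic form below, where Π is the projection onto the feature span under the state distribution D and T^π is the Bellman operator; this is the usual formulation from the GTD literature rather than notation quoted from this paper:

```latex
\mathrm{MSPBE}(\theta)
  = \big\lVert V_\theta - \Pi\, T^{\pi} V_\theta \big\rVert_D^2
  = (b - A\theta)^{\top} C^{-1} (b - A\theta),
\qquad
A = \mathbb{E}\big[\phi_t(\phi_t - \gamma\phi_{t+1})^{\top}\big],\quad
b = \mathbb{E}\big[r_{t+1}\phi_t\big],\quad
C = \mathbb{E}\big[\phi_t\phi_t^{\top}\big]
```

Since the objective is a quadratic in θ, the Lipschitz constant of its gradient is governed by the spectrum of AᵀC⁻¹A, which is why bounding it matters when setting the parameters of accelerated gradient schemes.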

Improving the Efficiency of Off-Policy Reinforcement Learning by Accounting for Past Decisions

arXiv, 2021

Off-policy learning from multistep returns is crucial for sample-efficient reinforcement learning, particularly in the experience replay setting now commonly used with deep neural networks. Classically, off-policy estimation bias is corrected in a per-decision manner: past temporal-difference errors are reweighted by the instantaneous Importance Sampling (IS) ratio (via eligibility traces) after each action. Many important off-policy algorithms such as Tree Backup and Retrace rely on this mechanism along with differing protocols for truncating (“cutting”) the ratios (“traces”) to counteract the excessive variance of the IS estimator. Unfortunately, cutting traces on a per-decision basis is not necessarily efficient; once a trace has been cut according to local information, the effect cannot be reversed later, potentially resulting in the premature truncation of estimated returns and slower learning. In the interest of motivating efficient off-policy algorithms, we propose a multiste...
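To make the per-decision "trace cutting" concrete: the algorithms named here all reweight past TD errors by a product of per-step coefficients c_t and differ only in how c_t is chosen from the instantaneous importance ratio ρ_t = π(a_t|s_t)/μ(a_t|s_t). The standard choices, stated for context rather than quoted from the paper, are:

```latex
\text{Importance Sampling:}\;\; c_t = \lambda\,\rho_t, \qquad
\text{Tree Backup:}\;\; c_t = \lambda\,\pi(a_t \mid s_t), \qquad
\text{Retrace:}\;\; c_t = \lambda\,\min(1, \rho_t)
```

Because the return estimate scales a past TD error by the product c_{t+1} c_{t+2} ⋯, any single coefficient near zero permanently "cuts" the trace, which is the irreversibility this abstract argues can slow learning.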

n-Step Temporal Difference Learning with Optimal n

arXiv, 2023

We consider the problem of finding the optimal value of n in the n-step temporal difference (TD) learning algorithm. We find the optimal n by resorting to a model-free optimization technique involving a one-simulation simultaneous perturbation stochastic approximation (SPSA) based procedure that we adapt to the discrete optimization setting by using a random projection approach. We prove the convergence of our proposed algorithm, SDPSA, using a differential inclusions approach and show that it finds the optimal value of n in n-step TD. Through experiments, we show that the optimal value of n is achieved with SDPSA for arbitrary initial values.

I. INTRODUCTION

Reinforcement learning (RL) algorithms are widely used for solving problems of sequential decision-making under uncertainty. An RL agent typically makes decisions based on data that it collects through interactions with the environment in order to maximize a certain long-term reward [1], [2]. Because of their model-free nature, RL algorithms have found extensive applications in various areas such as operations research, game theory, multi-agent systems, autonomous systems, communication networks, and adaptive signal processing [2]. Various classes of procedures, such as action-value methods, evolutionary algorithms, and policy gradient approaches, are available for finding solutions to RL problems [3]. A widely popular class of approaches is the action-value methods, which solve an RL problem by learning the action-value function under a given policy; the learned value function is then used to design a better policy. Dynamic programming, Monte Carlo (MC), and temporal-difference (TD) learning are three of the most popular action-value methods [2]. While dynamic programming is a model-based approach, MC and TD methods are purely model-free, data-driven approaches and, as a result, have been thoroughly studied in the literature. For example, the MC gradient method utilizing the sample paths is used to compute the policy gradient in [4].
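As background for treating n as a quantity to optimize, the sketch below applies plain n-step TD value updates along one stored episode. The (state, reward) trajectory format, function name, and constants are assumptions for illustration; the paper's SDPSA procedure for selecting n is not reproduced here.

```python
def n_step_td_episode(V, episode, n, alpha=0.1, gamma=0.99):
    """Apply n-step TD updates to the value table V along one episode.

    episode is a list of (state, reward) pairs, where reward[t] is received
    on the transition out of state[t] (illustrative format, an assumption).
    V can be any mutable mapping from states to values (dict, list, array).
    """
    states = [s for s, _ in episode]
    rewards = [r for _, r in episode]
    T = len(episode)
    for t in range(T):
        horizon = min(t + n, T)
        # discounted sum of up to n rewards ...
        G = sum(gamma ** (k - t) * rewards[k] for k in range(t, horizon))
        # ... plus a bootstrapped tail if the episode did not end first
        if horizon < T:
            G += gamma ** n * V[states[horizon]]
        V[states[t]] += alpha * (G - V[states[t]])
    return V
```

Small n gives heavily bootstrapped, lower-variance targets, while large n approaches the Monte Carlo return; the paper's point is that the best trade-off value of n can itself be found by stochastic approximation.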

Off-policy learning with recognizers

Advances in Neural Information Processing Systems, 2006

We introduce a new algorithm for off-policy temporal-difference learning with function approximation that has lower variance and requires less knowledge of the behavior policy than prior methods. We develop the notion of a recognizer, a filter on actions that distorts the behavior policy to produce a related target policy with low-variance importance-sampling corrections. We also consider target policies that are deviations from the state distribution of the behavior policy, such as potential temporally abstract options, which further reduces variance. This paper introduces recognizers and their potential advantages, then develops a full algorithm for linear function approximation and proves that its updates are in the same direction as on-policy TD updates, which implies asymptotic convergence. Even though our algorithm is based on importance sampling, we prove that it requires absolutely no knowledge of the behavior policy for the case of state-aggregation function approximators.
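Concretely, a recognizer is a function c(s, a) ∈ [0, 1] over the actions available in s. Under the behavior policy μ it induces a target policy and importance-sampling corrections roughly as follows; the notation is a reconstruction for illustration and may differ in detail from the paper's:

```latex
\mu(c \mid s) = \sum_{b} c(s, b)\,\mu(b \mid s), \qquad
\pi(a \mid s) = \frac{c(s, a)\,\mu(a \mid s)}{\mu(c \mid s)}, \qquad
\rho(s, a) = \frac{\pi(a \mid s)}{\mu(a \mid s)} = \frac{c(s, a)}{\mu(c \mid s)}
```

The last equality helps explain the reduced knowledge requirement: the correction depends on the behavior policy only through the recognition probability μ(c|s), a quantity that can itself be estimated from the stream of recognized actions.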