DOP: Deep Optimistic Planning with Approximate Value Function Evaluation

Q-CP: Learning Action Values for Cooperative Planning

2018 IEEE International Conference on Robotics and Automation (ICRA), 2018

Research on multi-robot systems has demonstrated promising results in manifold applications and domains. Still, efficiently learning effective robot behaviors is very difficult, due to unstructured scenarios, high uncertainties, and large state dimensionality (e.g., hyper-redundant robots and groups of robots). To alleviate this problem, we present Q-CP, a cooperative model-based reinforcement learning algorithm, which exploits action values to both (1) guide the exploration of the state space and (2) generate effective policies. Specifically, we exploit Q-learning to attack the curse of dimensionality in the iterations of a Monte-Carlo Tree Search. We implement and evaluate Q-CP on different stochastic cooperative (general-sum) games: (1) a simple cooperative navigation problem among 3 robots, (2) a cooperation scenario between a pair of KUKA YouBots performing handovers, and (3) a coordination task between two mobile robots entering a door. The obtained results show the effectiveness of Q-CP in the chosen applications, where action values drive the exploration and reduce the computational demand of the planning process while achieving good performance.
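The abstract above names Q-learning as the source of the action values but does not give the update itself. As a minimal sketch, the standard tabular Q-learning update is shown below; this is the textbook rule, not the Q-CP algorithm, and the function name, state labels, and the values of `alpha` and `gamma` are illustrative assumptions.

```python
from collections import defaultdict


def q_learning_update(q, state, action, reward, next_state, actions,
                      alpha=0.1, gamma=0.95):
    """Apply one tabular Q-learning update and return the new Q(state, action).

    q is a mapping from (state, action) pairs to values; alpha is the
    learning rate and gamma the discount factor (both assumed values).
    """
    # Greedy bootstrap: value of the best action available in the next state.
    best_next = max(q[(next_state, a)] for a in actions)
    td_target = reward + gamma * best_next
    # Move the current estimate a fraction alpha toward the TD target.
    q[(state, action)] += alpha * (td_target - q[(state, action)])
    return q[(state, action)]


# Hypothetical two-action problem with all values initialised to zero.
q = defaultdict(float)
actions = ["left", "right"]
new_value = q_learning_update(q, "s0", "right", 1.0, "s1", actions)
```

In a tree-search setting, values learned this way could be used to rank candidate actions before expanding a node, which is one plausible reading of "action values drive the exploration".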

HI-VAL: Iterative Learning of Hierarchical Value Functions for Policy Generation

Advances in Intelligent Systems and Computing, 2018

Task decomposition is effective in various applications where the global complexity of a problem makes planning and decision-making too demanding. This is true, for example, in high-dimensional robotics domains, where (1) unpredictabilities and modeling limitations typically prevent the manual specification of robust behaviors, and (2) learning an action policy is challenging due to the curse of dimensionality. In this work, we borrow the concept of Hierarchical Task Networks (HTNs) to decompose the learning procedure, and we exploit Upper Confidence Tree (UCT) search to introduce HI-VAL, a novel iterative algorithm for hierarchical optimistic planning with learned value functions. To obtain better generalization and generate policies, HI-VAL simultaneously learns and uses action values. These are used to formalize constraints within the search space and to reduce the dimensionality of the problem. We evaluate our algorithm both on a fetching task using a simulated 7-DOF KUKA lightweight arm and on a pick-and-delivery task with a Pioneer robot.
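UCT search, mentioned in the abstract above, chooses which child of a tree node to descend into by an upper confidence bound on each child's value. The sketch below shows the generic UCB1 selection rule only; it is not HI-VAL, and the function names, the `(mean_value, visits)` layout, and the exploration constant `c` are assumptions made for illustration.

```python
import math


def ucb1_score(mean_value, child_visits, parent_visits, c=math.sqrt(2)):
    """Upper confidence bound for one child of a search-tree node."""
    if child_visits == 0:
        # Unvisited children are tried first.
        return math.inf
    exploration = c * math.sqrt(math.log(parent_visits) / child_visits)
    return mean_value + exploration


def select_child(children, parent_visits, c=math.sqrt(2)):
    """Return the action whose child maximises the UCB1 score.

    children maps each action to a (mean_value, visits) pair.
    """
    return max(
        children,
        key=lambda a: ucb1_score(children[a][0], children[a][1],
                                 parent_visits, c),
    )


# Hypothetical node: "b" has a lower mean but has barely been explored.
children = {"a": (0.50, 100), "b": (0.45, 1)}
chosen = select_child(children, parent_visits=101)
```

With few visits the exploration term dominates, so rarely tried actions are preferred; as visit counts grow, selection converges toward the action with the highest mean value.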

Deep Reactive Planning in Dynamic Environments

arXiv (Cornell University), 2020

Figure 1: Our proposed agent learns an end-to-end reactive planning technique by combining traditional path planning algorithms, supervised learning (SL) and reinforcement learning (RL) algorithms in a synergistic way. A deep CNN is used to learn the sequence of waypoints obtained from a kinematic planning algorithm (e.g., a Bidirectional RRT*) given a depth image of the environment. The agent learns to follow arbitrary waypoints using path-conditioned RL, thus resulting in efficient exploration. We show that our trained agent can achieve good sample efficiency, as well as generalization to novel environments in simulation as well as real environments. The whole learning process is done in the simulator by learning a Real2Sim transfer function to make the training process efficient and suitable for robotic systems.

Path Planning for Intelligent Robots Based on Deep Q-learning With Experience Replay and Heuristic Knowledge

IEEE/CAA Journal of Automatica Sinica, 2020

Path planning and obstacle avoidance are two challenging problems in the study of intelligent robots. In this paper, we develop a new method to alleviate these problems based on deep Q-learning with experience replay and heuristic knowledge. In this method, a neural network is used to resolve the “curse of dimensionality” issue of the Q-table in reinforcement learning. When a robot is walking in an unknown environment, it collects experience data which is used for training a neural network; such a process is called experience replay. Heuristic knowledge helps the robot avoid blind exploration and provides more effective data for training the neural network. The simulation results show that, in comparison with existing methods, our method can converge to an optimal action strategy in less time and can explore a path in an unknown environment with fewer steps and a larger average reward.
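Experience replay, as described in the abstract above, stores the transitions a robot collects and reuses them for training. A minimal sketch of a fixed-capacity buffer follows; the class name, the transition layout `(state, action, reward, next_state, done)`, and the uniform sampling are assumptions, and the paper's heuristic-knowledge component is not modelled.

```python
import random
from collections import deque


class ReplayBuffer:
    """Fixed-capacity store of transitions with uniform random sampling."""

    def __init__(self, capacity, seed=None):
        # A bounded deque drops the oldest transition once capacity is reached.
        self._buffer = deque(maxlen=capacity)
        self._rng = random.Random(seed)

    def add(self, state, action, reward, next_state, done):
        self._buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size):
        """Return batch_size distinct transitions chosen uniformly at random."""
        return self._rng.sample(list(self._buffer), batch_size)

    def __len__(self):
        return len(self._buffer)


# Hypothetical usage: five transitions pushed into a buffer of size three.
buffer = ReplayBuffer(capacity=3, seed=0)
for step in range(5):
    buffer.add(step, 0, 1.0, step + 1, False)
batch = buffer.sample(2)
```

Sampling from the buffer rather than training on consecutive steps breaks the correlation between successive transitions, which is the usual motivation for replay in deep Q-learning.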

Improving Safety in Deep Reinforcement Learning using Unsupervised Action Planning

ArXiv, 2021

One of the key challenges to deep reinforcement learning (deep RL) is to ensure safety at both training and testing phases. In this work, we propose a novel technique of unsupervised action planning to improve the safety of on-policy reinforcement learning algorithms, such as trust region policy optimization (TRPO) or proximal policy optimization (PPO). We design our safety-aware reinforcement learning by storing all the history of “recovery” actions that rescue the agent from dangerous situations into a separate “safety” buffer and finding the best recovery action when the agent encounters similar states. Because this functionality requires the algorithm to query similar states, we implement the proposed safety mechanism using an unsupervised learning algorithm, k-means clustering. We evaluate the proposed algorithm on six robotic control tasks that cover navigation and manipulation. Our results show that the proposed safety RL algorithm can achieve higher rewards compared with mult...

Deep Reinforcement Learning-Based Path Planning with Dynamic Collision Probability for Mobile Robots

WRC Symposium on Advanced Robotics and Automation (WRC SARA), 2024

This study proposes a novel approach to mobile robot path planning and collision avoidance that uses Collision Probability (CP) along with the Soft Actor-Critic Lagrangian (SAC-L) framework. Our approach enables the mobile robot to deal dynamically with static and dynamic environments while ensuring safety and efficiency. The proposed SAC-L (CP) aims to minimize the total cost, which combines negative rewards and collision occurrences. This dual-focus strategy makes trajectory planning inherently safer and provides a robust solution for environments with complex dynamic obstacles. The framework’s efficiency is validated through extensive simulations on the Gazebo platform involving three increasingly difficult scenarios, demonstrating superior performance, adaptability and safety of our approach compared to traditional Deep Reinforcement Learning (DRL) methods. Our results showcase significant improvements in social and ego safety scores, contributing to the advancement of autonomous navigation in complex environments. This framework marks a step towards safer, more reliable mobile robot navigation and opens new avenues for future research in mobile robot path planning. A supplementary video further demonstrates the effectiveness of our framework.
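The abstract above does not state its update equations. In Lagrangian-constrained reinforcement learning generally, the trade-off between reward and safety cost is often controlled by a multiplier adjusted with projected dual ascent; the sketch below shows that generic step only. The function name, `cost_limit`, and the learning rate `lr` are hypothetical and are not taken from the paper.

```python
def update_multiplier(multiplier, average_cost, cost_limit, lr=0.05):
    """One projected dual-ascent step on a Lagrange multiplier.

    The multiplier grows when the measured cost exceeds the limit and
    shrinks otherwise; it is clipped at zero so it stays non-negative.
    """
    return max(0.0, multiplier + lr * (average_cost - cost_limit))


def penalised_objective(average_reward, average_cost, cost_limit, multiplier):
    """Reward minus the multiplier-weighted constraint violation."""
    return average_reward - multiplier * (average_cost - cost_limit)


# Hypothetical numbers: the observed collision cost is above the limit,
# so the multiplier increases and collisions are penalised more heavily.
multiplier = update_multiplier(0.0, average_cost=0.3, cost_limit=0.1)
```

The policy would then be trained to maximise `penalised_objective` with the multiplier held fixed, alternating with multiplier updates.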

Towards a common implementation of reinforcement learning for multiple robotic tasks

Expert Systems With Applications, 2018

Mobile robots are increasingly being employed for performing complex tasks in dynamic environments. Reinforcement learning (RL) methods are recognized to be promising for specifying such tasks in a relatively simple manner. However, the strong dependency between the learning method and the task to learn is a well-known problem that restricts practical implementations of RL in robotics, often requiring major modifications of parameters and adding other techniques for each particular task. In this paper we present a practical core implementation of RL which enables the learning process for multiple robotic tasks with minimal per-task tuning or none. Based on value iteration methods, this implementation includes a novel approach for action selection, called Q-biased softmax regression (QBIASSR), which avoids poor performance of the learning process when the robot reaches new unexplored states. Our approach takes advantage of the structure of the state space by attending to the physical variables involved (e.g., distances to obstacles, X, Y, θ pose, etc.), so that experienced sets of states may favor the decision-making process of unexplored or rarely explored states. This improvement has a relevant role in reducing the tuning of the algorithm for particular tasks. Experiments with real and simulated robots, performed with the software framework also introduced here, show that our implementation is effectively able to learn different robotic tasks without tuning the learning method. Results also suggest that the combination of true online SARSA(λ) (TOSL) with QBIASSR can outperform the existing RL core algorithms in low-dimensional robotic tasks.
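QBIASSR builds on softmax action selection, and the abstract above does not describe the bias term. For orientation, here is a sketch of plain softmax (Boltzmann) selection over Q-values, without any bias from related states; it is not QBIASSR itself, and the function names and the `temperature` parameter are illustrative assumptions.

```python
import math
import random


def softmax_probabilities(q_values, temperature=1.0):
    """Convert a list of Q-values into action-selection probabilities."""
    # Subtracting the maximum keeps the exponentials numerically stable.
    highest = max(q_values)
    weights = [math.exp((q - highest) / temperature) for q in q_values]
    total = sum(weights)
    return [w / total for w in weights]


def select_action(q_values, temperature=1.0, rng=random):
    """Sample an action index in proportion to its softmax probability."""
    probabilities = softmax_probabilities(q_values, temperature)
    return rng.choices(range(len(q_values)), weights=probabilities, k=1)[0]


# Hypothetical Q-values for three actions in one state.
probabilities = softmax_probabilities([1.0, 2.0, 3.0])
action = select_action([1.0, 2.0, 3.0], rng=random.Random(0))
```

In an unexplored state all Q-values are typically equal, so plain softmax reduces to uniform random choice; that is the situation the abstract says QBIASSR is designed to improve.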

Deep Reactive Policies for Planning in Stochastic Nonlinear Domains

Proceedings of the AAAI Conference on Artificial Intelligence, 2019

Recent advances in applying deep learning to planning have shown that Deep Reactive Policies (DRPs) can be powerful for fast decision-making in complex environments. However, an important limitation of current DRP-based approaches is either the need for optimal planners to be used as ground truth in a supervised learning setting or the sample complexity of high-variance policy gradient estimators, which are particularly troublesome in continuous state-action domains. In order to overcome those limitations, we introduce a framework for training DRPs in continuous stochastic spaces via gradient-based policy search. The general approach is to explicitly encode a parametric policy as a deep neural network, and to formulate the probabilistic planning problem as an optimization task in a stochastic computation graph by exploiting the re-parameterization of the transition probability densities; the optimization is then solved by leveraging gradient descent algorithms that are able to handle...

Reinforcement Learning in Robotics: A Survey

Reinforcement learning offers to robotics a framework and set of tools for the design of sophisticated and hard-to-engineer behaviors. Conversely, the challenges of robotic problems provide inspiration, impact, and validation for developments in reinforcement learning. The relationship between disciplines has sufficient promise to be likened to that between physics and mathematics. In this article, we attempt to strengthen the links between the two research communities by providing a survey of work in reinforcement learning for behavior generation in robots. We highlight both key challenges in robot reinforcement learning and notable successes. We discuss how contributions tamed the complexity of the domain and study the role of algorithms, representations, and prior knowledge in achieving these successes. As a result, a particular focus of our paper lies on the choice between model-based and model-free as well as between value function-based and policy search methods. By analyzing a simple problem in some detail we demonstrate how reinforcement learning approaches may be profitably applied, and we note throughout open questions and the tremendous potential for future research.

Flexible and Efficient Long-Range Planning Through Curious Exploration

2020

Identifying algorithms that flexibly and efficiently discover temporally-extended multi-phase plans is an essential step for the advancement of robotics and model-based reinforcement learning. The core problem of long-range planning is finding an efficient way to search through the tree of possible action sequences. Existing non-learned planning solutions from the Task and Motion Planning (TAMP) literature rely on the existence of logical descriptions for the effects and preconditions for actions. This constraint allows TAMP methods to efficiently reduce the tree search problem but limits their ability to generalize to unseen and complex physical environments. In contrast, deep reinforcement learning (DRL) methods use flexible neural-network-based function approximators to discover policies that generalize naturally to unseen circumstances. However, DRL methods struggle to handle the very sparse reward landscapes inherent to long-range multi-step planning situations. Here, we propos...