www.dlr.de · Antonin RAFFIN · Stable Baselines Tutorial · JNRR 2019 · 18.10.2019
Stable Baselines Tutorial
Reinforcement Learning Made Easy
Ashley HILL
CEA
Edward Beeching
INSA Lyon
Antonin RAFFIN
German Aerospace Center (DLR)
Examples of Reinforcement Learning for Robotics
Learning Agile and Dynamic Motor Skills for Legged Robots (1)
Dexterous Manipulation
Learning to toss
Learning to Drive in Minutes
Wave RL
- Active damping on a model of a vibrating bridge
- https://github.com/jaberkow/WaveRL
Deep Mimic
Bonus
Stable Baselines Library
Features
Algorithms
Active Community
Tutorial
GitHub repo: https://github.com/araffin/rl-tutorial-jnrr19
- Getting Started Colab Notebook (a minimal training sketch follows this list)
- Gym Wrappers, saving and loading models Colab Notebook
- Multiprocessing Colab Notebook
- Callbacks and hyperparameter tuning Colab Notebook
- Creating a custom gym environment Colab Notebook
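As a taste of the Getting Started notebook, here is a minimal sketch of training, saving, reloading, and running a PPO2 agent (environment and training budget are illustrative):

```python
import gym

from stable_baselines import PPO2

# Create the environment and the agent (MLP policy, default hyperparameters)
env = gym.make("CartPole-v1")
model = PPO2("MlpPolicy", env, verbose=1)

# Train the agent (illustrative budget)
model.learn(total_timesteps=10000)

# Save to disk, then reload the trained agent
model.save("ppo2_cartpole")
model = PPO2.load("ppo2_cartpole")

# Run the trained policy for one episode
obs = env.reset()
done = False
while not done:
    action, _states = model.predict(obs, deterministic=True)
    obs, reward, done, info = env.step(action)
```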
Your turn!
https://github.com/araffin/rl-tutorial-jnrr19
Source: Deep Mimic (Jason Peng)
Branches of Machine Learning
Source: Outsider Tour RL by Ben Recht
Many Faces of RL
Source: David Silver Course
RL 101
Source: Lilian Weng blog
Credit: L.M Tenkes
Notation
|        | Reinforcement Learning | Classical Control |
|--------|------------------------|-------------------|
| State  | $s_t$                  | $x_t$             |
| Action | $a_t$                  | $u_t$             |
| Reward | $r_t$                  | $-c_t$            |
Main Components of an RL algo
An RL algorithm may include one or more of these components (a minimal sketch of the agent-environment loop follows the list):
- Policy: agent's behavior function ($a_t = \pi(s_t)$)
- Value function: how good is each state and/or action ($V(s_t)$ or $Q(s_t, a_t)$)
- Model: agent's representation of the environment ($s_{t+1} = f(s_t, a_t)$ or $r_{t+1} = g(s_t, a_t)$)
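To make these components concrete, here is a minimal sketch of the agent-environment loop using the classic gym API, with a random policy standing in for $\pi$ (a value function or a model would be additional learned functions of the state and action):

```python
import gym

env = gym.make("Pendulum-v0")

def policy(obs):
    # Policy: a_t = pi(s_t) (random here, for illustration)
    return env.action_space.sample()

obs = env.reset()  # initial state s_0
done, episode_return = False, 0.0
while not done:
    action = policy(obs)
    # The environment returns the next state s_{t+1} and the reward r_{t+1}
    obs, reward, done, info = env.step(action)
    episode_return += reward

print("Episode return:", episode_return)
```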
Model Free vs Model Based
- Model Free: No explicit representation of the environment
- Model Based: Rely on a model of the environment
- known, given to the agent (ex: Chess)
- learned explicitly (from scratch, parameter identification)
Model Based RL
Source: BAIR blog
On-Policy vs Off-policy
- On-Policy: The trajectories must be generated by the most recent policy
- Off-Policy: The trajectories can be collected by any behavior policy (see the sketch after this list)
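In Stable Baselines this distinction mostly shows up in which algorithm you pick; a minimal sketch (the algorithms and environments below are illustrative choices):

```python
from stable_baselines import PPO2, SAC

# On-policy: PPO2 collects fresh rollouts with the current policy before every update
on_policy_model = PPO2("MlpPolicy", "CartPole-v1", verbose=1)

# Off-policy: SAC stores transitions in a replay buffer and reuses data from older policies
off_policy_model = SAC("MlpPolicy", "Pendulum-v0", verbose=1)

on_policy_model.learn(total_timesteps=10000)
off_policy_model.learn(total_timesteps=10000)
```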
Model Free RL Landscape
Exploration vs Exploitation Trade-Off (1)
Exploration: Try a new beer
Exploitation: Drink your favorite beer
Exploration vs Exploitation Trade-Off (2)
Exploration: gather more information about the environment
Exploitation: use the best known strategy to maximize reward (a minimal $\epsilon$-greedy sketch follows)
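A classic way to balance the two is $\epsilon$-greedy action selection (exploration strategies are listed later under topics not covered); a minimal sketch, assuming Q-values are available for each action:

```python
import numpy as np

def epsilon_greedy(q_values, epsilon=0.1):
    # With probability epsilon: explore (random action)
    # Otherwise: exploit (greedy action w.r.t. the current Q-values)
    if np.random.rand() < epsilon:
        return np.random.randint(len(q_values))
    return int(np.argmax(q_values))

# Example: Q-values of 3 actions in some state
action = epsilon_greedy(np.array([0.2, 0.5, 0.1]), epsilon=0.1)
```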
Common Assumptions
Markov: the next state depends only on the current state and action, not on the complete history
Fully Observable: the agent directly observes the environment state ($o_t = s_t$). Ex: Chess vs Poker
Recap
- RL 101: state, action, reward
- Policy, value function, model
- Model free vs model based
- On-policy vs off-policy
- Model free landscape
- Exploration vs exploitation
- Common assumptions
Current Challenges of RL
- Sample efficiency (millions of samples required)
- Reward engineering ("RewArt")
- Stability and reproducibility (improving)
- Jittering (action noise and oscillations)
Topics not covered
- Objective function (discounted and undiscounted return)
- Credit assignment problem
- Bias/variance trade-off
- Modern tricks (replay buffer, ...)
- How to explore ($\epsilon$-greedy, parameter space, ...)
- Successes in simulation/games (Mujoco, Atari, Go, Dota2, Starcraft)
Resources
- https://www.argmin.net/2018/06/25/outsider-rl/
- http://www0.cs.ucl.ac.uk/staff/d.silver/web/Teaching.html
- https://www.youtube.com/channel/UCLRpWDzTRLlQn7lMRwvK8Hg/videos
- http://pages.isir.upmc.fr/~sigaud/teach/
- https://spinningup.openai.com/en/latest/
- http://rail.eecs.berkeley.edu/deeprlcourse/
- https://lilianweng.github.io/lil-log/2018/02/19/a-long-peek-into-reinforcement-learning.html
- https://sites.google.com/view/deep-rl-bootcamp/lectures
- http://louiskirsch.com/maps/reinforcement-learning
RL Zoo: A collection of 120+ trained RL agents
- Provide a simple interface to train and enjoy RL agents
- Benchmark the different Reinforcement Learning algorithms
- Provide tuned hyperparameters for each environment and RL algorithm
- Have fun with the trained agents!
https://github.com/araffin/rl-baselines-zoo
RL Zoo: Training
```yaml
# TD3 hyperparameters for HalfCheetahBulletEnv-v0 (used by the commands below)
HalfCheetahBulletEnv-v0:
  env_wrapper: utils.wrappers.TimeFeatureWrapper  # add the remaining episode time to the observation
  n_timesteps: !!float 2e6          # total training budget
  policy: 'MlpPolicy'
  gamma: 0.99                       # discount factor
  buffer_size: 1000000              # replay buffer size
  noise_type: 'normal'              # Gaussian action noise for exploration
  noise_std: 0.1
  learning_starts: 10000            # random actions before learning starts
  batch_size: 100
  learning_rate: !!float 1e-3
  train_freq: 1000                  # do 1000 gradient steps every 1000 environment steps
  gradient_steps: 1000
  policy_kwargs: 'dict(layers=[400, 300])'  # size of the hidden layers
```
```bash
# Train TD3 on HalfCheetahBulletEnv-v0 (hyperparameters loaded from the yaml file above)
python train.py --algo td3 --env HalfCheetahBulletEnv-v0
# Visualize the trained agent
python enjoy.py --algo td3 --env HalfCheetahBulletEnv-v0
# Record a video of the trained agent (1000 steps)
python -m utils.record_video --algo td3 --env HalfCheetahBulletEnv-v0 -n 1000
```
RL Zoo: Hyperparameter Optimization
Easy to set up
Clean API
Good documentation
TPE, GP, CMA-ES, median pruner, ...
```bash
python train.py --algo ppo2 --env MountainCar-v0 \
  --optimize --n-trials 1000 --n-jobs 2 \
  --sampler tpe --pruner median
```
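Under the hood the zoo relies on Optuna for the optimization. Below is a minimal standalone sketch of the same idea, tuning only PPO2's learning rate on CartPole-v1 (the search space, budget, and evaluation loop are illustrative choices, not the zoo's actual setup):

```python
import gym
import optuna

from stable_baselines import PPO2

def evaluate(model, env, n_episodes=10):
    # Mean episodic reward of the trained model (simple manual evaluation)
    returns = []
    for _ in range(n_episodes):
        obs, done, total = env.reset(), False, 0.0
        while not done:
            action, _ = model.predict(obs, deterministic=True)
            obs, reward, done, _ = env.step(action)
            total += reward
        returns.append(total)
    return sum(returns) / n_episodes

def objective(trial):
    # Sample one hyperparameter (the learning rate) on a log scale
    learning_rate = trial.suggest_loguniform("learning_rate", 1e-5, 1e-2)
    model = PPO2("MlpPolicy", "CartPole-v1", learning_rate=learning_rate, verbose=0)
    model.learn(total_timesteps=20000)
    return evaluate(model, gym.make("CartPole-v1"))

study = optuna.create_study(
    direction="maximize",
    sampler=optuna.samplers.TPESampler(),
    # The pruner only kicks in if the objective reports intermediate values via trial.report()
    pruner=optuna.pruners.MedianPruner(),
)
study.optimize(objective, n_trials=20)
print("Best hyperparameters:", study.best_params)
```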