Roadmap to Stable-Baselines3 V1.0 · Issue #1 · DLR-RM/stable-baselines3
This issue will be updated over time; the list of changes below is not exhaustive.
Dear all,
Stable-Baselines3 beta is now out 🎉! This issue is meant to reference what is implemented and what is missing before a first major version.
As mentioned in the README, before v1.0, breaking changes may occur. I would like to encourage contributors (especially the maintainers) to make comments on how to improve the library before v1.0 (and maybe make some internal changes).
I will try to review the features mentioned in hill-a/stable-baselines#576 (and hill-a/stable-baselines#733)
and I will create issues soon to reference what is missing.
What is implemented?
- basic features (training/saving/loading/predict; see the usage sketch after this list)
- basic set of algorithms (A2C/PPO/SAC/TD3)
- basic pre-processing (Box and Discrete observation/action spaces are handled)
- callback support
- complete benchmark for the continuous action case
- basic RL zoo for training/evaluation/plotting (https://github.com/DLR-RM/rl-baselines3-zoo)
- consistent API
- basic tests and most type hints
- continuous integration (I'm in discussion with the organization admins for that)
- handle more observation/action spaces Add support for MultiDiscrete/MultiBinary observation spaces #4 and Add support for MultiDiscrete/MultiBinary action spaces #5 (thanks @rolandgvc)
- tensorboard integration Tensorboard integration #9 (thanks @rolandgvc)
- basic documentation and notebooks
- automatic build of the documentation
- Vanilla DQN Implement Vanilla DQN #6 (thanks @Artemis-Skade)
- Refactor off-policy critics to reduce code duplication Implement DDPG #3 (see Refactored ContinuousCritic for SAC/TD3 #78 )
- DDPG Implement DDPG #3
- do a complete benchmark for the discrete case Performance Check (Discrete actions) #49 (thanks @Miffyli !)
- performance check for continuous actions Performance check (Continuous Actions) #48 (results even better than in the gSDE paper)
- get/set parameters for the base class (Get/set parameters and review of saving and loading #138 )
- clean up type-hints in docs Custom parser for type hints #10 (cumbersome to read)
- documenting the migration between SB and SB3 Migration guide #11
- finish typing some methods Improve typing coverage #175
- HER Implement HER #8 (thanks @megan-klaiber)
- finish updating and cleaning the documentation Missing Documentation #166 (help is wanted)
- finish updating the notebooks and the tutorial Update colab notebooks #7 (I will do that, only the HER notebook is missing)
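To illustrate the basic features listed above (training/saving/loading/predict), here is a minimal usage sketch. It assumes the standard gym CartPole-v1 environment and the current beta API, so exact names may still change before v1.0:

```python
import gym

from stable_baselines3 import PPO

env = gym.make("CartPole-v1")

# Train a PPO agent with the default MLP policy
model = PPO("MlpPolicy", env, verbose=1)
model.learn(total_timesteps=10_000)

# Save the model and reload it (e.g. in another script)
model.save("ppo_cartpole")
model = PPO.load("ppo_cartpole")

# Use the trained policy for prediction
obs = env.reset()
for _ in range(100):
    action, _state = model.predict(obs, deterministic=True)
    obs, reward, done, info = env.step(action)
    if done:
        obs = env.reset()
```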
What are the new features?
- much cleaner base code (and no more warnings =D )
- independent saving/loading/predict for policies
- State-Dependent Exploration (SDE) for using RL directly on real robots (a unique feature and the starting point of SB3; I published a paper on it: https://arxiv.org/abs/2005.05719)
- proper evaluation (using a separate env) is included in the base class (using `EvalCallback`; see the sketch after this list)
- all environments are `VecEnv`
- better saving/loading (now can include the replay buffer and the optimizers)
- any number of critics are allowed for SAC/TD3
- custom actor/critic net arch for off-policy algos ([Feature request] Allow different network architectures for off-policy actor/critic #113 )
- QR-DQN in SB3-Contrib
- Truncated Quantile Critics (TQC) (see Implement Truncated Quantile Critics (TQC) #83 ) in SB3-Contrib
- @Miffyli suggested a "contrib" repo for experimental features (it is here)
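As a rough sketch of two items above, the built-in evaluation on a separate env via `EvalCallback` and the custom actor/critic `net_arch` for off-policy algorithms, here is what this looks like with SAC; the environment and hyperparameters are placeholders, not tuned values:

```python
import gym

from stable_baselines3 import SAC
from stable_baselines3.common.callbacks import EvalCallback

train_env = gym.make("Pendulum-v0")
eval_env = gym.make("Pendulum-v0")  # separate env, used only for evaluation

# Evaluate the agent every 5000 steps on 5 episodes and keep the best model
eval_callback = EvalCallback(
    eval_env,
    best_model_save_path="./logs/",
    eval_freq=5_000,
    n_eval_episodes=5,
    deterministic=True,
)

# Different network sizes for the actor (pi) and the critics (qf)
policy_kwargs = dict(net_arch=dict(pi=[64, 64], qf=[400, 300]))

model = SAC("MlpPolicy", train_env, policy_kwargs=policy_kwargs, verbose=1)
model.learn(total_timesteps=50_000, callback=eval_callback)

# Saving can now also include the replay buffer
model.save_replay_buffer("sac_pendulum_buffer")
```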
What is missing?
- syncing some files with Stable-Baselines to remain consistent (we may be good now, but this needs to be checked)
- finish code review of existing code Review of Existing Code #17
Checklist for v1.0 release
- Update Readme
- Prepare blog post
- Update doc: add links to the stable-baselines3 contrib
- Update docker image to use a newer Ubuntu version
- Populate RL zoo
What is next? (for V1.1+)
- basic dict/tuple support for observations (Dictionary Observations #243 )
- simple recurrent policies? (recurrent policy implementation in ppo [feature-request] #18)
- DQN extensions (double, PER, IQN) ([Feature Request] RAINBOW #622)
- Implement TRPO (Add TRPO Stable-Baselines-Team/stable-baselines3-contrib#40)
- multi-worker training for all algorithms ([Feature request] Adding multiprocessing support for off policy algorithms #179 )
- n-step returns for off-policy algorithms [feature-request] N-step returns for TD methods #47 (@partiallytyped )
- SAC discrete [Feature request] Implement SAC-Discrete #157 (needs to be discussed: benefit vs DQN + extensions?)
- Energy Based Prioritisation? (@RyanRizzo96)
- implement `action_proba` in the base class?
- test the doc snippets Sphinx doc tests support #14 (help is welcome)
- noisy networks (https://arxiv.org/abs/1706.10295) @partiallytyped ? exploration in parameter space? ([Feature Request] RAINBOW #622)
- Munchausen Reinforcement Learning (MDQN) (probably in the contrib first, e.g. [WIP] MDQN pfnet/pfrl#74)
Side note: should we change the default `start_method` to `fork`? (now that we no longer depend on TensorFlow; see the sketch below)
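A small sketch of what this looks like from the user side, assuming the existing `start_method` argument of `SubprocVecEnv` ("fork" is only available on Unix; the current default falls back to "forkserver" or "spawn"):

```python
import gym

from stable_baselines3 import PPO
from stable_baselines3.common.vec_env import SubprocVecEnv


def make_env():
    return gym.make("CartPole-v1")


if __name__ == "__main__":
    # Explicitly request "fork"; the question above is whether this
    # should become the default instead of "forkserver"/"spawn"
    env = SubprocVecEnv([make_env for _ in range(4)], start_method="fork")

    model = PPO("MlpPolicy", env, verbose=1)
    model.learn(total_timesteps=10_000)
```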