[Bug] Infinite horizon tasks are handled like episodic tasks · Issue #284 · DLR-RM/stable-baselines3
Hi,
I wonder how to correctly use SAC with infinite-horizon environments. I saw @araffin's answer to hill-a/stable-baselines#776, where he points out that the algorithms are step-based. Our environments could always return done = False, but then we would have to reset the environment manually. As a consequence, we would add transitions to the replay buffer that go from the last state of one episode to the initial state of the next, which is bad.
Is the only solution to include a time feature? That means messing with the observation_space size, handling dict spaces correctly, and explaining what this "time feature" is in papers. Let me know if I've missed a thread treating this issue already 😄
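For reference, here is a minimal sketch of the kind of time-feature wrapper I have in mind (old Gym step API and a flat Box observation space assumed; the class name and the max_steps argument are placeholders, and I believe the RL Zoo / sb3-contrib carry a more complete version of this idea):

```python
import gym
import numpy as np


class TimeFeatureWrapper(gym.Wrapper):
    """Append the remaining (normalized) time to the observation.

    Sketch only: assumes a flat Box observation space and a fixed max_steps.
    """

    def __init__(self, env, max_steps=1000):
        super().__init__(env)
        assert isinstance(env.observation_space, gym.spaces.Box)
        self._max_steps = max_steps
        self._current_step = 0
        low = np.append(env.observation_space.low, 0.0)
        high = np.append(env.observation_space.high, 1.0)
        self.observation_space = gym.spaces.Box(low=low, high=high, dtype=np.float32)

    def reset(self, **kwargs):
        self._current_step = 0
        return self._get_obs(self.env.reset(**kwargs))

    def step(self, action):
        self._current_step += 1
        obs, reward, done, info = self.env.step(action)
        return self._get_obs(obs), reward, done, info

    def _get_obs(self, obs):
        # 1.0 at the start of the episode, 0.0 when the time limit is reached
        time_feature = 1.0 - self._current_step / self._max_steps
        return np.append(obs, time_feature).astype(np.float32)
```

With something like this, the remaining time becomes part of the (augmented) state, so a done at the limit is no longer aliasing distinct states, at the cost of changing the observation_space as mentioned above.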
Greetings!
🐛 Bug / Background
My understanding is that SAC skips the bootstrap target when s' is a terminal state:
q_backup = replay_data.rewards + (1 - replay_data.dones) * self.gamma * target_q
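For illustration (made-up numbers, not SB3 code): whenever dones is 1, the whole bootstrap term is dropped, regardless of whether the "done" came from the MDP or from a time limit.

```python
import torch

rewards  = torch.tensor([[1.0], [1.0]])
dones    = torch.tensor([[0.0], [1.0]])   # second transition ends the episode
target_q = torch.tensor([[10.0], [10.0]]) # target-network value of s'
gamma    = 0.99

q_backup = rewards + (1 - dones) * gamma * target_q
print(q_backup)  # tensor([[10.9000], [ 1.0000]]) -> no bootstrapping for the "done" row
```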
In infinite horizon tasks, we wrap our env with gym.wrappers.TimeLimit, which sets done = True when the maximum episode length is reached. This stops the episode in SAC, and the transition is saved in the replay buffer for learning.
However, according to "Time Limits in Reinforcement Learning" (https://arxiv.org/abs/1712.00378), we should not treat that last state as a "terminal" state, since the termination has nothing to do with the MDP. If we ignore this, we introduce "state aliasing" and violate the Markov property.
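One way to implement the paper's fix (bootstrapping through time-limit terminations) without a time feature would be to also record whether the done came from the wrapper (gym's TimeLimit puts "TimeLimit.truncated" into the step info) and mask it out of the target. A minimal sketch, assuming a hypothetical timeouts entry stored alongside the dones; this is not what SB3 currently does:

```python
import torch


def compute_backup(rewards, dones, timeouts, target_q, gamma=0.99):
    # Treat time-limit terminations as non-terminal: keep bootstrapping through them.
    real_dones = dones * (1.0 - timeouts)
    return rewards + (1.0 - real_dones) * gamma * target_q


rewards  = torch.tensor([[1.0], [1.0]])
dones    = torch.tensor([[1.0], [1.0]])   # both transitions end an episode...
timeouts = torch.tensor([[1.0], [0.0]])   # ...but only the first one hit the time limit
target_q = torch.tensor([[10.0], [10.0]])

print(compute_backup(rewards, dones, timeouts, target_q))
# tensor([[10.9000], [ 1.0000]]) -> the timeout row still bootstraps, the true terminal does not
```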