
SARSA (State-Action-Reward-State-Action) in Reinforcement Learning

Last Updated : 11 May, 2026

SARSA (State-Action-Reward-State-Action) is an on-policy reinforcement learning (RL) algorithm that helps an agent to learn an optimal policy by interacting with its environment. The agent explores its environment, takes actions, receives feedback and continuously updates its behavior to maximize long-term rewards.

Figure: SARSA algorithm learning process

Unlike off-policy algorithms such as Q-learning, which learn from the best possible action in the next state, SARSA updates its estimates using the action the agent actually takes next. This makes it suitable for environments where the consequences of exploratory actions should be reflected in what the agent learns.

Components

The components of the SARSA algorithm are as follows:

  1. **State (S):** The current situation or position in the environment.
  2. **Action (A):** The decision or move the agent makes in a given state.
  3. **Reward (R):** The immediate feedback or outcome the agent receives after taking an action.
  4. **Next State (S'):** The state the agent transitions to after taking an action.
  5. **Next Action (A'):** The action the agent will take in the next state based on its current policy.
  6. **Discount Factor (γ):** Determines how much importance is given to future rewards compared to immediate rewards.

SARSA focuses on updating the agent's Q-values (a measure of the quality of a given state-action pair) based on both the immediate reward and the expected future rewards.
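A Q-table can be pictured as an array with one entry per state-action pair. The sketch below assumes a hypothetical environment with 3 states and 2 actions; the sizes and the value written into the table are illustrative only.

```python
import numpy as np

# Hypothetical sizes, chosen only for illustration.
n_states, n_actions = 3, 2

# One Q-value per (state, action) pair, initialised to zero.
Q = np.zeros((n_states, n_actions))

# Pretend the agent has learned that action 1 is good in state 0.
Q[0, 1] = 0.5

# The greedy action in a state is the one with the largest Q-value.
greedy_action = int(np.argmax(Q[0]))
print(Q.shape)        # (3, 2)
print(greedy_action)  # 1
```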

**How Does SARSA Update Q-values?**

SARSA updates the Q-value using the Bellman Equation for SARSA:

Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha \left[ r_{t+1} + \gamma Q(s_{t+1}, a_{t+1}) - Q(s_t, a_t) \right]

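Where:

  1. Q(s_t, a_t) is the current Q-value estimate for taking action a_t in state s_t.
  2. \alpha is the learning rate, which controls how far each update moves the estimate.
  3. r_{t+1} is the reward received after taking action a_t.
  4. \gamma is the discount factor applied to future rewards.
  5. Q(s_{t+1}, a_{t+1}) is the Q-value of the next state and the next action chosen by the current policy.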

Understanding the Update

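The term in brackets is the temporal-difference (TD) error: the gap between the new estimate r_{t+1} + \gamma Q(s_{t+1}, a_{t+1}) and the current estimate Q(s_t, a_t). Each update moves Q(s_t, a_t) a fraction \alpha of the way toward the new estimate, which helps the agent improve its decisions step by step.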
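As a worked instance of the update rule, the sketch below applies one SARSA update to a single Q-value. All numbers are made up for illustration, and the variable names are not from the article.

```python
# Illustrative numbers (assumptions, not taken from any training run).
alpha, gamma = 0.1, 0.99
q_sa = 2.0      # current estimate Q(s_t, a_t)
reward = -1.0   # r_{t+1}
q_next = 3.0    # Q(s_{t+1}, a_{t+1}) for the action actually chosen next

td_target = reward + gamma * q_next   # -1 + 0.99 * 3 = 1.97
td_error = td_target - q_sa           # 1.97 - 2.0 = -0.03
q_sa_new = q_sa + alpha * td_error    # 2.0 + 0.1 * (-0.03) = 1.997

print(round(td_target, 4), round(td_error, 4), round(q_sa_new, 4))
```

Because the target (1.97) is slightly below the current estimate (2.0), the Q-value is nudged down by a small amount.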

**SARSA Algorithm Steps**

Let's see how the SARSA algorithm works step by step:

**1. Initialize Q-values:** Begin by setting arbitrary values for the Q-table (one for each state-action pair).

**2. Choose Initial State:** Start the agent in an initial state s_0.

**3. Episode Loop:** For each episode (a complete run through the environment), set the initial state s_t and choose an action a_t using a policy such as ε-greedy.

**4. Step Loop:** For each step in the episode, take action a_t and observe the reward r_{t+1} and the next state s_{t+1}; choose the next action a_{t+1} from s_{t+1} using the same policy; update Q(s_t, a_t) with the SARSA update rule; then set s_t to s_{t+1} and a_t to a_{t+1}.

**5. End Condition:** Repeat until the episode ends, either because the agent reaches a terminal state or after a fixed number of steps.

Implementation

Let’s consider a practical example of implementing SARSA in a Grid World environment where the agent can move up, down, left or right to reach a goal.

**Step 1: Defining the Environment (GridWorld)**

The GridWorld environment simulates the agent's movement, applying the dynamics of state transitions and rewards.

The implementation uses the NumPy library together with Python's built-in random module.

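```python
import random

import numpy as np


class GridWorld:
    """Grid environment whose states are (x, y) tuples."""

    def __init__(self, width, height, start, goal, obstacles):
        self.width = width
        self.height = height
        self.start = start
        self.goal = goal
        self.obstacles = obstacles
        self.state = start

    def reset(self):
        """Return the agent to the start state."""
        self.state = self.start
        return self.state

    def step(self, action):
        """Apply an action and return (next_state, reward, done)."""
        x, y = self.state
        # Actions 0 and 1 move along the first coordinate, 2 and 3 along
        # the second. Moves are clipped so the agent stays inside the grid.
        if action == 0:
            x = max(x - 1, 0)
        elif action == 1:
            x = min(x + 1, self.height - 1)
        elif action == 2:
            y = max(y - 1, 0)
        elif action == 3:
            y = min(y + 1, self.width - 1)

        next_state = (x, y)

        if next_state in self.obstacles:
            # Hitting an obstacle is penalised and ends the episode.
            reward = -10
            done = True
        elif next_state == self.goal:
            reward = 10
            done = True
        else:
            # Small step cost encourages shorter paths.
            reward = -1
            done = False

        self.state = next_state
        return next_state, reward, done
```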

**Step 2: Defining the SARSA Algorithm**

The agent uses the SARSA algorithm to update its Q-values based on its interactions with the environment, adjusting its behavior over time to reach the goal.

```python
def sarsa(env, episodes, alpha, gamma, epsilon):
    # One Q-value per (x, y, action) triple.
    Q = np.zeros((env.height, env.width, 4))

    for episode in range(episodes):
        state = env.reset()
        action = epsilon_greedy_policy(Q, state, epsilon)
        done = False

        while not done:
            next_state, reward, done = env.step(action)
            # On-policy: the next action comes from the same policy.
            next_action = epsilon_greedy_policy(Q, next_state, epsilon)

            # SARSA update toward reward + gamma * Q(s', a').
            Q[state[0], state[1], action] += alpha * (
                reward
                + gamma * Q[next_state[0], next_state[1], next_action]
                - Q[state[0], state[1], action]
            )

            state = next_state
            action = next_action

    return Q
```

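**Step 3: Defining the Epsilon-Greedy Policy**

The epsilon-greedy policy balances exploration and exploitation: with probability ε the agent picks a random action (exploration), and otherwise it picks the action with the highest Q-value in the current state (exploitation).

To avoid bias when multiple actions have the same Q-value, ties are broken randomly among the best actions.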

```python
def epsilon_greedy_policy(Q, state, epsilon):
    if random.uniform(0, 1) < epsilon:
        # Explore: pick any of the four actions uniformly at random.
        return random.randint(0, 3)
    # Exploit: pick a best action, breaking ties at random.
    q_values = Q[state[0], state[1]]
    max_q = np.max(q_values)
    best_actions = np.where(q_values == max_q)[0]
    return np.random.choice(best_actions)
```
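To see why random tie-breaking matters, the sketch below compares plain `np.argmax` with random selection among tied actions on an untrained state where all four Q-values are zero. It is a standalone illustration that uses NumPy's `default_rng` generator with a fixed seed, which is an assumption made here for repeatability rather than something the policy above does.

```python
import numpy as np

rng = np.random.default_rng(0)  # seeded only to make the demo repeatable
q_values = np.zeros(4)          # untrained state: all four actions tie

# np.argmax returns the first maximal index, so it always picks action 0.
argmax_picks = {int(np.argmax(q_values)) for _ in range(1000)}

# Random tie-breaking spreads the picks across all tied actions.
best = np.flatnonzero(q_values == q_values.max())
random_picks = {int(rng.choice(best)) for _ in range(1000)}

print(argmax_picks)   # {0}
print(random_picks)
```

With plain `argmax`, an agent starting from an all-zero Q-table would always try action 0 first in every unexplored state, which can skew early exploration.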

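**Step 4: Setting Up the Environment and Running SARSA**

This step involves creating the grid environment, setting the hyperparameters (number of episodes, learning rate, discount factor and ε), running SARSA and printing the learned Q-values.

```python
if __name__ == "__main__":
    # 5x5 grid with two obstacle cells.
    width = 5
    height = 5
    start = (0, 0)
    goal = (4, 4)
    obstacles = [(2, 2), (3, 2)]
    env = GridWorld(width, height, start, goal, obstacles)

    # Hyperparameters.
    episodes = 1000
    alpha = 0.1
    gamma = 0.99
    epsilon = 0.1

    Q = sarsa(env, episodes, alpha, gamma, epsilon)

    print("Learned Q-values:")
    print(Q)
```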

**Output:**

Figure: Learned Q-values

After running the SARSA algorithm, the Q-values estimate the expected cumulative reward for each state-action pair. The agent uses these Q-values to make decisions in the environment. A higher Q-value indicates a better action for a given state.
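One way to read a learned Q-table is to extract the greedy action for every cell. The sketch below does this for a stand-in array with the same (height, width, 4) layout as the Q-table in the implementation; the values are random placeholders rather than real training output, and the arrow mapping assumes that the first state coordinate is the row index with row 0 at the top.

```python
import numpy as np

# Stand-in Q-table with shape (height, width, actions);
# random placeholder values, not the result of training.
rng = np.random.default_rng(0)
Q = rng.normal(size=(5, 5, 4))

# Assumed mapping of action indices: 0 = up, 1 = down, 2 = left, 3 = right.
arrows = np.array(["^", "v", "<", ">"])

greedy_actions = np.argmax(Q, axis=2)   # best action per cell, shape (5, 5)
policy_grid = arrows[greedy_actions]    # same shape, one arrow per cell

for row in policy_grid:
    print(" ".join(row))
```

Replacing the placeholder array with the `Q` returned by `sarsa` would show the greedy policy the agent actually learned.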


| Feature | **SARSA (On-Policy)** | **Q-Learning (Off-Policy)** |
|---|---|---|
| Policy Used for Learning | Learns from actions it actually takes | Learns from best possible actions (max Q) |
| Update Uses | Q(s', a') | max_a Q(s', a) |
| Exploration Effect | Included in updates | Ignored in updates |
| Behavior | Learns a safer policy because updates depend on exploration | Learns more aggressive policies |
| Convergence Speed | Slower | Faster |
| Best For | Environments where exploration affects outcomes | Environments where optimal actions are clear |
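The difference in the update target can be made concrete with a small numeric sketch. The Q-values and the explored action below are hypothetical, chosen so that the ε-greedy policy happens to pick a poor action in the next state.

```python
import numpy as np

gamma = 0.99
reward = -1.0
# Hypothetical Q-values for the four actions available in the next state s'.
q_next = np.array([1.0, 4.0, -2.0, 0.5])
next_action = 2   # suppose the policy explored and chose action 2

sarsa_target = reward + gamma * q_next[next_action]   # uses Q(s', a')
q_learning_target = reward + gamma * q_next.max()     # uses max_a Q(s', a)

print(round(float(sarsa_target), 2))       # -2.98
print(round(float(q_learning_target), 2))  # 2.96
```

The SARSA target reflects the cost of the exploratory move, while the Q-learning target ignores it and assumes the best action will be taken.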

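**Exploration Strategies in SARSA**

SARSA uses an exploration-exploitation strategy to choose actions. A common strategy is ε-greedy: with probability ε the agent chooses a random action (exploration), and with probability 1 - ε it chooses the action with the highest Q-value (exploitation).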

Over time, ε is often decayed to shift from exploration to exploitation as the agent gains more experience in the environment.
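A decay schedule can take many forms. The sketch below shows one common choice, exponential decay with a lower bound; the start value, floor, decay rate and the helper name `epsilon_at` are all assumptions made for illustration.

```python
# Illustrative schedule; the start value, floor and decay rate are assumptions.
epsilon_start, epsilon_min, decay_rate = 1.0, 0.05, 0.995


def epsilon_at(episode):
    """Exponentially decayed epsilon, clipped at a minimum value."""
    return max(epsilon_min, epsilon_start * decay_rate ** episode)


for episode in (0, 100, 500, 1000):
    print(episode, round(epsilon_at(episode), 3))
```

Keeping a small floor such as 0.05 means the agent never stops exploring entirely, which matters for SARSA because the learned values depend on the policy being followed.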

**Advantages**

**Limitations**