REINFORCE Algorithm

Last Updated : 9 Oct, 2025

REINFORCE is a policy gradient method used in reinforcement learning to improve how decisions are made. It learns by trying actions and then adjusting the probabilities of those actions based on the total reward received afterwards. Unlike value-based methods, which estimate how good each action is, REINFORCE learns the policy for choosing actions directly. This makes it useful for tasks with many possible actions or continuous action spaces, and for tasks where the value of each action is hard to estimate.

How REINFORCE Works

The REINFORCE algorithm works in the following steps:

**1. Collect Episodes:** The agent interacts with the environment for a fixed number of steps or until an episode is complete, following the current policy. This generates a trajectory consisting of states, actions and rewards.

**2. Calculate Returns:** For each time step t, calculate the return G_t, which is the total reward obtained from time t onwards. Typically, this is the discounted sum of rewards:

G_t = \sum_{k=t}^T \gamma^{k-t} R_k

Where \gamma is the discount factor, T is the final time step of the episode and R_k is the reward received at time step k.
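As a small worked example of the return defined above, the sketch below evaluates G_t for a three-step episode in plain Python. The reward values, the discount factor and the helper name `discounted_return` are illustrative assumptions, not part of the original tutorial.

```python
def discounted_return(rewards, gamma, t=0):
    """Return G_t = sum over k >= t of gamma^(k - t) * R_k for one episode."""
    return sum(gamma ** (k - t) * rewards[k] for k in range(t, len(rewards)))

rewards = [1.0, 1.0, 1.0]   # hypothetical rewards R_0, R_1, R_2
gamma = 0.9                 # hypothetical discount factor

g0 = discounted_return(rewards, gamma, t=0)  # 1 + 0.9 + 0.81
g1 = discounted_return(rewards, gamma, t=1)  # 1 + 0.9
g2 = discounted_return(rewards, gamma, t=2)  # 1
print(g0, g1, g2)
```

Earlier time steps accumulate more future reward, so their returns are larger; the last step's return is just its own reward.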

**3. Policy Gradient Update:** The policy parameters θ are updated using the following formula:

\theta_{t+1} = \theta_t + \alpha \nabla_{\theta} \log \pi_{\theta}(a_t | s_t) G_t

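Where \alpha is the learning rate, \pi_{\theta}(a_t | s_t) is the probability of taking action a_t in state s_t under the current policy and G_t is the return from time step t.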

**4. Repeat:** This process is repeated for several episodes, iteratively updating the policy in the direction of higher rewards.
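The four steps above can be sketched on a toy problem without any deep learning library. The example below is a minimal illustration, assuming a two-armed bandit (one-step episodes) and a softmax policy over two logits; the names `run_bandit_reinforce` and `arm_rewards` and all numeric settings are hypothetical choices made for this sketch.

```python
import math
import random

def softmax(logits):
    """Turn logits into action probabilities (max subtracted for stability)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def run_bandit_reinforce(num_episodes=500, alpha=0.1, seed=0):
    rng = random.Random(seed)
    arm_rewards = [0.0, 1.0]   # hypothetical: arm 1 always pays more
    theta = [0.0, 0.0]         # policy parameters (one logit per arm)
    for _ in range(num_episodes):
        probs = softmax(theta)
        # Step 1: collect a one-step episode by sampling from the policy
        action = rng.choices([0, 1], weights=probs)[0]
        # Step 2: the return is just the single reward
        G = arm_rewards[action]
        # Step 3: theta += alpha * grad(log pi(a)) * G, where for a softmax
        # policy d/d theta_i log pi(a) = 1[i == a] - pi_i
        for i in range(2):
            grad_log_pi = (1.0 if i == action else 0.0) - probs[i]
            theta[i] += alpha * grad_log_pi * G
        # Step 4: repeat for the next episode
    return softmax(theta)

final_probs = run_bandit_reinforce()
print(final_probs)
```

After training, most of the probability mass should sit on the better arm, which is the same behavior the full CartPole implementation aims for at a larger scale.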

Implementation

In this example we will train a policy network to solve a basic environment, CartPole from OpenAI's Gym. The aim is to use REINFORCE to optimize the policy directly, without a value function approximation.

Step 1: Set Up the Environment

The first step is to create the environment using OpenAI's Gym. For this example we use the CartPole-v1 environment where the agent's task is to balance a pole on a cart.

```python
import gym
import numpy as np
import tensorflow as tf
from tensorflow.keras import layers

# Create the environment and read the sizes of its observation and action spaces
env = gym.make('CartPole-v1')
obs_space = env.observation_space.shape[0]
act_space = env.action_space.n
```

Step 2: Define Hyperparameters

In this step we define hyperparameters for the algorithm like discount factor gamma, the learning rate, number of episodes and batch size. These hyperparameters control how the algorithm behaves during training.

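```python
gamma = 0.99          # discount factor
learning_rate = 0.01  # step size for the optimizer
num_episodes = 1000   # number of training episodes
batch_size = 64       # defined here; the loop below updates once per episode instead
```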

Step 3: Define the Policy Network (Actor)

We define the policy network as a simple neural network with two dense layers. The input to the network is the state and the output is a probability distribution over the actions (softmax output). The network learns the policy that maps states to action probabilities.

```python
class PolicyNetwork(tf.keras.Model):
    def __init__(self, hidden_units=128):
        super(PolicyNetwork, self).__init__()
        # Hidden layer followed by a softmax layer with one output per action
        self.dense1 = layers.Dense(hidden_units, activation='relu')
        self.dense2 = layers.Dense(env.action_space.n, activation='softmax')

    def call(self, state):
        x = self.dense1(state)
        return self.dense2(x)
```
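To get a feel for the size of this network, the sketch below counts its trainable parameters by hand. It assumes the CartPole-v1 dimensions (4 observation values, 2 actions) and the default of 128 hidden units; `dense_param_count` is a hypothetical helper, not a Keras function.

```python
def dense_param_count(n_inputs, n_units):
    """Weights plus biases of one fully connected layer."""
    return n_inputs * n_units + n_units

obs_dim = 4         # CartPole-v1: cart position, cart velocity, pole angle, pole angular velocity
n_actions = 2       # push left or push right
hidden_units = 128  # default used by PolicyNetwork

hidden_params = dense_param_count(obs_dim, hidden_units)    # 4 * 128 + 128
output_params = dense_param_count(hidden_units, n_actions)  # 128 * 2 + 2
total_params = hidden_params + output_params
print(total_params)
```

With fewer than a thousand parameters the policy is small, so each update is cheap and most of the training time goes into running episodes.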

Step 4: Initialize the Policy and Optimizer

Here, we initialize the policy network and the Adam optimizer. The optimizer is used to update the weights of the policy network during training.

```python
policy = PolicyNetwork()
optimizer = tf.keras.optimizers.Adam(learning_rate)
```

Step 5: Compute Returns

In reinforcement learning, the return G_t is the discounted sum of future rewards. This function computes the return for each time step t, based on the rewards collected during the episode.

```python
def compute_returns(rewards, gamma):
    returns = np.zeros_like(rewards, dtype=np.float32)
    running_return = 0
    # Walk backwards so each return reuses the one computed for the next step
    for t in reversed(range(len(rewards))):
        running_return = rewards[t] + gamma * running_return
        returns[t] = running_return
    return returns
```

Step 6: Define Training Step

The training step computes the gradients of the policy network using the log of action probabilities and the computed returns. The loss is the negative log-likelihood of the actions taken, weighted by the return. The optimizer updates the policy network’s parameters to maximize the expected return.

```python
def train_step(states, actions, returns):
    with tf.GradientTape() as tape:
        # Calculate the probability of each action for every visited state
        action_probs = policy(states)
        action_indices = np.array(actions, dtype=np.int32)

        # Gather the log-probabilities of the actions taken
        action_log_probs = tf.math.log(tf.reduce_sum(
            action_probs * tf.one_hot(action_indices, env.action_space.n), axis=1))

        # Calculate the loss (negative log likelihood * returns)
        loss = -tf.reduce_mean(action_log_probs * returns)

    # Backpropagate through the policy and apply the update
    grads = tape.gradient(loss, policy.trainable_variables)
    optimizer.apply_gradients(zip(grads, policy.trainable_variables))
```
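The loss used in the training step can be checked by hand on a tiny batch. The sketch below is a plain-Python illustration with made-up probabilities, actions and returns; it picks each taken action's probability by indexing, which gives the same values as the one-hot multiplication followed by a sum.

```python
import math

# Hypothetical batch of three time steps with two actions each
action_probs = [[0.7, 0.3], [0.4, 0.6], [0.9, 0.1]]
actions = [0, 1, 0]          # actions that were actually taken
returns = [1.0, 0.5, -1.5]   # (normalized) returns for those steps

# Log-probability of the action taken at each step
log_probs = [math.log(p[a]) for p, a in zip(action_probs, actions)]

# Negative mean of log-probability times return
loss = -sum(lp * g for lp, g in zip(log_probs, returns)) / len(actions)
print(loss)
```

Minimizing this quantity raises the log-probability of actions with positive returns and lowers it for actions with negative returns.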

Step 7: Training Loop

The training loop runs one episode at a time. During each episode we record the states, actions and rewards. When the episode ends, we compute the returns, normalize them and update the policy once, using all time steps of that episode as a single batch.

```python
for episode in range(num_episodes):
    state, _ = env.reset()
    done = False
    states, actions, rewards = [], [], []

    while not done:
        # Sample an action from the current policy
        state_input = np.array(state, dtype=np.float32).reshape(1, -1)
        probs = policy(state_input).numpy()[0]
        action = np.random.choice(act_space, p=probs)

        next_state, reward, terminated, truncated, _ = env.step(action)
        done = terminated or truncated

        states.append(state_input[0])
        actions.append(action)
        rewards.append(reward)
        state = next_state

    # After episode ends
    returns = compute_returns(rewards, gamma)
    returns = (returns - np.mean(returns)) / (np.std(returns) + 1e-9)

    states_batch = np.vstack(states)
    train_step(states_batch, actions, returns)

    if episode % 100 == 0:
        print(f"Episode {episode}/{num_episodes}")
```
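The loop above standardizes the returns before each update. The sketch below shows the effect of that step on a short, made-up list of returns, using only the standard library; `standardize` is a hypothetical helper, and it uses the population standard deviation, which matches the default behavior of `np.std`.

```python
import statistics

def standardize(returns, eps=1e-9):
    """Shift returns to zero mean and scale them to unit standard deviation."""
    mean = statistics.fmean(returns)
    std = statistics.pstdev(returns)   # population std, like np.std by default
    return [(g - mean) / (std + eps) for g in returns]

returns = [3.0, 2.0, 1.0]   # hypothetical returns for a three-step episode
normalized = standardize(returns)
print(normalized)
```

After standardization, steps with above-average returns get positive weights and steps with below-average returns get negative weights, which tends to make updates less sensitive to the raw reward scale.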

Step 8: Testing the Trained Agent

After training the agent, we evaluate its performance by letting it run in the environment without updating the policy. The agent chooses actions based on the highest probabilities (greedy behavior).

```python
state, _ = env.reset()
done = False
total_reward = 0

while not done:
    # Pick the most probable action instead of sampling
    state_input = np.array(state, dtype=np.float32).reshape(1, -1)
    probs = policy(state_input).numpy()[0]
    action = np.argmax(probs)

    next_state, reward, terminated, truncated, _ = env.step(action)
    done = terminated or truncated
    total_reward += reward
    state = next_state

print(f"Test Total Reward: {total_reward}")
```
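The difference between greedy action selection at test time and sampling during training can be seen on a single probability vector. This is a small plain-Python sketch with a made-up policy output; the fixed seed and the sample count are arbitrary choices for illustration.

```python
import random

probs = [0.7, 0.3]   # hypothetical policy output for one state

# Greedy selection (as in testing): always the most probable action
greedy_action = max(range(len(probs)), key=lambda a: probs[a])

# Stochastic selection (as in training): actions drawn in proportion to probs
rng = random.Random(0)
sampled = [rng.choices(range(len(probs)), weights=probs)[0] for _ in range(1000)]
frac_action_1 = sampled.count(1) / len(sampled)

print(greedy_action, frac_action_1)
```

Greedy selection never picks the less likely action, while sampling still picks it roughly in proportion to its probability, which is what provides exploration during training.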

**Output:**

Episode 0/1000
Episode 100/1000
Episode 200/1000
Episode 300/1000
Episode 400/1000
Episode 500/1000
Episode 600/1000
Episode 700/1000
Episode 800/1000
Episode 900/1000
Test Total Reward: 49.0

Variants of REINFORCE Algorithm

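Several modifications to the original REINFORCE algorithm have been proposed to address its high variance. A common one is REINFORCE with a baseline, which subtracts a baseline b_t from the return before the update.

The update rule becomes: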

\theta_{t+1} = \theta_t + \alpha \nabla_{\theta} \log \pi_{\theta}(a_t | s_t) (G_t - b_t)

Where b_t is the baseline, for example an estimate of the expected return from state s_t.
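The effect of subtracting a baseline can be checked exactly on a tiny example. The sketch below assumes a softmax policy that picks between two actions with equal probability and made-up returns of 10 and 11; it enumerates both actions instead of sampling, and compares the gradient estimate for the first logit with and without a constant baseline. All names and numbers are illustrative assumptions.

```python
probs = [0.5, 0.5]       # hypothetical policy: both actions equally likely
returns = [10.0, 11.0]   # hypothetical return for each action

def grad_samples(baseline):
    """Gradient sample for the logit of action 0, one entry per action taken.

    For a softmax policy, d/d theta_0 log pi(a) = 1[a == 0] - pi_0.
    """
    return [((1.0 if a == 0 else 0.0) - probs[0]) * (returns[a] - baseline)
            for a in range(2)]

def mean_and_variance(samples):
    """Exact mean and variance under the policy's action probabilities."""
    mean = sum(p * s for p, s in zip(probs, samples))
    var = sum(p * (s - mean) ** 2 for p, s in zip(probs, samples))
    return mean, var

mean_plain, var_plain = mean_and_variance(grad_samples(baseline=0.0))
mean_base, var_base = mean_and_variance(grad_samples(baseline=10.5))
print(mean_plain, var_plain, mean_base, var_base)
```

Both estimators have the same expected value, so the baseline does not change the direction of the update on average, but in this example it removes the variance entirely.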

Advantages

Challenges

Applications

REINFORCE has been applied in several domains: