Value Iteration vs. Policy Iteration

Last Updated : 9 Oct, 2025

Value Iteration and Policy Iteration are two popular dynamic programming techniques for solving Markov Decision Processes (MDPs). Both aim to find the best possible strategy, known as the optimal policy, for an agent to follow in a given environment. Understanding the differences, strengths and weaknesses of the two methods is important for choosing the right approach for a specific reinforcement learning (RL) problem.

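**What is Value Iteration?**

Value Iteration is a dynamic programming algorithm that computes the optimal value function by repeatedly applying the Bellman optimality equation as an update to every state until the values stop changing:

V^*(s) = \max_a \left[ R(s, a) + \gamma \sum_{s'} P(s'|s, a) V^*(s') \right]

Where:

- V^*(s) is the optimal value of state s
- R(s, a) is the immediate reward for taking action a in state s
- \gamma is the discount factor, between 0 and 1
- P(s'|s, a) is the probability of moving to state s' after taking action a in state s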

[Figure: Value iteration network]
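Read as an algorithm, the Bellman optimality equation becomes an update rule. One common way to write it is sketched below; the iteration index k and the stopping threshold \epsilon are notation assumed here for illustration, not symbols taken from the article.

```latex
V_{k+1}(s) = \max_a \left[ R(s, a) + \gamma \sum_{s'} P(s'|s, a) V_k(s') \right] \quad \text{for all } s

\text{stop when } \max_s \left| V_{k+1}(s) - V_k(s) \right| < \epsilon
```

Starting from any initial guess, such as V_0(s) = 0 for every state, the iterates V_k approach V^* as k grows, provided \gamma < 1.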

Once the value function converges, the optimal policy is obtained by selecting, in each state, the action a that maximizes the immediate reward plus the discounted expected value of the next state:

\pi^*(s) = \arg\max_a \left[ R(s, a) + \gamma \sum_{s'} P(s'|s, a) V^*(s') \right]
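The two equations above can be sketched in a few lines of NumPy. This is a minimal illustration rather than a reference implementation: the array layout (`P` with shape states × actions × next states, `R` with shape states × actions), the function name `value_iteration`, the tolerance `tol`, and the two-state toy MDP are all assumptions made for this example.

```python
import numpy as np

def value_iteration(P, R, gamma=0.9, tol=1e-8, max_sweeps=10_000):
    """Sketch of value iteration.

    P: array (S, A, S), P[s, a, s2] = probability of reaching s2 from s under a.
    R: array (S, A), R[s, a] = immediate reward.
    """
    n_states = R.shape[0]
    V = np.zeros(n_states)
    for sweep in range(1, max_sweeps + 1):
        # Q[s, a] = R(s, a) + gamma * sum_s' P(s'|s, a) V(s')
        Q = R + gamma * (P @ V)
        V_new = Q.max(axis=1)          # Bellman optimality backup
        delta = np.max(np.abs(V_new - V))
        V = V_new
        if delta < tol:                # values have stopped changing
            break
    # Derive the greedy policy from the converged values.
    policy = (R + gamma * (P @ V)).argmax(axis=1)
    return V, policy, sweep

# Hypothetical toy MDP: action 0 = "stay", action 1 = "move to the other state".
# Staying in state 1 pays a reward of 1; everything else pays 0.
P = np.array([
    [[1.0, 0.0], [0.0, 1.0]],   # from state 0: stay, move
    [[0.0, 1.0], [1.0, 0.0]],   # from state 1: stay, move
])
R = np.array([
    [0.0, 0.0],
    [1.0, 0.0],
])

V, policy, sweeps = value_iteration(P, R, gamma=0.9)
print("values:", np.round(V, 4))
print("policy:", policy)
print("sweeps:", sweeps)
```

For this toy problem the answer can be checked by hand: staying in state 1 forever is worth 1 / (1 - 0.9) = 10, and moving there from state 0 is worth 0.9 × 10 = 9, so the greedy policy is "move" in state 0 and "stay" in state 1.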

**What is Policy Iteration?**

Policy Iteration is another dynamic programming algorithm used to compute the optimal policy. It alternates between two steps:

[Figure: Policy Iteration]

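1. **Policy Evaluation:** Compute the value function V^\pi of the current policy \pi:

V^\pi(s) = R(s, \pi(s)) + \gamma \sum_{s'} P(s'|s, \pi(s)) V^\pi(s')

2. **Policy Improvement:** Update the policy by acting greedily with respect to V^\pi:

\pi'(s) = \arg\max_a \left[ R(s, a) + \gamma \sum_{s'} P(s'|s, a) V^\pi(s') \right]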

This process repeats until the policy converges, meaning it no longer changes between iterations.
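The evaluation and improvement steps can be sketched as follows. This is an illustrative sketch under assumptions of its own: the array layout (`P` with shape states × actions × next states, `R` with shape states × actions), the function name `policy_iteration`, and the toy MDP are not from the article, and policy evaluation is done here by solving the linear system (I - \gamma P_\pi) V = R_\pi directly, which is one of several possible choices.

```python
import numpy as np

def policy_iteration(P, R, gamma=0.9, max_iters=1_000):
    """Sketch of policy iteration with exact policy evaluation.

    P: array (S, A, S) of transition probabilities.
    R: array (S, A) of immediate rewards.
    """
    n_states = R.shape[0]
    states = np.arange(n_states)
    policy = np.zeros(n_states, dtype=int)   # arbitrary starting policy
    for iteration in range(1, max_iters + 1):
        # Policy evaluation: V = R_pi + gamma * P_pi V, solved as a linear system.
        P_pi = P[states, policy]             # (S, S) transitions under the policy
        R_pi = R[states, policy]             # (S,) rewards under the policy
        V = np.linalg.solve(np.eye(n_states) - gamma * P_pi, R_pi)
        # Policy improvement: act greedily with respect to V.
        new_policy = (R + gamma * (P @ V)).argmax(axis=1)
        if np.array_equal(new_policy, policy):
            break                            # policy is stable
        policy = new_policy
    return V, policy, iteration

# Hypothetical toy MDP: action 0 = "stay", action 1 = "move to the other state".
# Staying in state 1 pays a reward of 1; everything else pays 0.
P = np.array([
    [[1.0, 0.0], [0.0, 1.0]],   # from state 0: stay, move
    [[0.0, 1.0], [1.0, 0.0]],   # from state 1: stay, move
])
R = np.array([
    [0.0, 0.0],
    [1.0, 0.0],
])

V, policy, iterations = policy_iteration(P, R, gamma=0.9)
print("values:", np.round(V, 4))
print("policy:", policy)
print("iterations:", iterations)
```

On this toy MDP the loop stops on its second iteration: the first one improves the all-"stay" starting policy, and the second one confirms that nothing changes. Solving the linear system exactly is convenient for small problems; for larger ones, an iterative evaluation with a fixed number of sweeps is a common substitute.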

**Comparison Between Value Iteration and Policy Iteration**

| Feature | **Value Iteration** | **Policy Iteration** |
|---|---|---|
| **Approach** | Updates the value function iteratively until convergence. | Alternates between policy evaluation and policy improvement. |
| **Convergence** | Stops when the value function changes by less than a small threshold between iterations. | Stops when the policy no longer changes between iterations. |
| **Computational Cost** | Cheaper per iteration: each one is a single Bellman backup over all states and actions. | More expensive per iteration, because each one includes a full policy evaluation. |
| **Policy Output** | The policy is derived after the value function has converged. | The policy is updated during each iteration. |
| **Speed of Convergence** | May require many iterations to converge, especially in large state spaces or when \gamma is close to 1. | Tends to converge in fewer iterations in practice, especially when the policy improves significantly at each iteration. |
| **State Space** | Typically suited to smaller state spaces, where running many sweeps is affordable. | Can handle larger state spaces efficiently when few iterations are needed, as long as each policy evaluation remains affordable. |
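The difference in iteration counts can be observed with a small experiment. The sketch below counts value iteration sweeps and policy iteration rounds on a randomly generated MDP; the helper names, the problem size (20 states, 4 actions), the seed, and the tolerance are all arbitrary assumptions for illustration, and the counts it prints say nothing about wall-clock time, since one policy iteration round does more work than one sweep.

```python
import numpy as np

def random_mdp(n_states, n_actions, seed=0):
    """Hypothetical helper: a dense random MDP with rewards in [0, 1)."""
    rng = np.random.default_rng(seed)
    P = rng.random((n_states, n_actions, n_states))
    P /= P.sum(axis=2, keepdims=True)        # each P[s, a] sums to 1
    R = rng.random((n_states, n_actions))
    return P, R

def count_value_iteration(P, R, gamma, tol=1e-8):
    V = np.zeros(R.shape[0])
    sweeps = 0
    while True:
        V_new = (R + gamma * (P @ V)).max(axis=1)
        sweeps += 1
        converged = np.max(np.abs(V_new - V)) < tol
        V = V_new
        if converged:
            return V, sweeps

def count_policy_iteration(P, R, gamma, max_iters=1_000):
    n = R.shape[0]
    idx = np.arange(n)
    policy = np.zeros(n, dtype=int)
    for iterations in range(1, max_iters + 1):
        # Exact evaluation of the current policy.
        V = np.linalg.solve(np.eye(n) - gamma * P[idx, policy], R[idx, policy])
        new_policy = (R + gamma * (P @ V)).argmax(axis=1)
        if np.array_equal(new_policy, policy):
            break
        policy = new_policy
    return V, iterations

P, R = random_mdp(n_states=20, n_actions=4)
V_vi, vi_sweeps = count_value_iteration(P, R, gamma=0.95)
V_pi, pi_iterations = count_policy_iteration(P, R, gamma=0.95)

print("value iteration sweeps:     ", vi_sweeps)
print("policy iteration iterations:", pi_iterations)
print("largest value difference:   ", np.max(np.abs(V_vi - V_pi)))
```

Both methods should arrive at essentially the same value function, with policy iteration using far fewer (but individually heavier) iterations. Timing both functions on the problem sizes of interest is a more reliable guide than iteration counts alone.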

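**When to Use Value Iteration and Policy Iteration**

**Use Value Iteration:**

- When a simple, direct implementation is preferred.
- When the state space is small enough that running many sweeps is affordable.

**Use Policy Iteration:**

- When convergence in fewer iterations matters.
- When an explicit policy is needed at every iteration, not only at the end.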

Value Iteration is simpler and more direct, while Policy Iteration often converges in fewer iterations because it improves the policy explicitly at every step. The choice between the two depends largely on the problem's scale and the computational resources available. In many real-world applications Policy Iteration may be preferred for its faster convergence, including on problems with large state spaces, provided each policy evaluation step remains affordable.