Markov Decision Process

Last Updated : 2 May, 2026

A Markov Decision Process (MDP) is a framework for decision-making under uncertainty. It is formally defined by a tuple (S, A, P, R, γ): the states, actions, transition probabilities, rewards and discount factor. It helps us answer questions like:

What actions should the agent take?
What happens after an action?
Is the result good or bad?

In artificial intelligence, Markov Decision Processes (MDPs) are used to model situations where decisions are made one after another and the results of actions are uncertain. They help in designing agents that must work in environments where each action might lead to different outcomes.

Key Components of an MDP

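An MDP is described by the five parts below. The first four, together with the discount factor γ, define the problem; the policy is what the agent has to find: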

Components of Markov Decision Process

**1. States (S):** A state is a situation or condition the agent can be in, for example a position on a grid such as cell (1,1).

**2. Actions (A):** An action is something the agent can do, for example move UP, DOWN, LEFT or RIGHT. Each state can have one or more possible actions.

**3. Transition Model (T):** The transition model (the P in the tuple above) tells us what happens when an action is taken in a state. It’s like asking: “If I move RIGHT from here, where will I land?” The outcome is not always the same, and that is where the uncertainty comes in. For example:

80% chance of moving in the intended direction
10% chance of slipping to the left
10% chance of slipping to the right

This randomness is called a stochastic transition.
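As a minimal sketch of the 80/10/10 example above, the outcome of one intended move can be written as a probability table and sampled from. The names `LEFT_OF`, `RIGHT_OF`, `outcome_distribution` and `sample_outcome` are illustrative, and "left" and "right" are assumed to be relative to the direction the agent intends to move.

```python
import random

# Slip directions relative to the intended move (assumed compass layout).
LEFT_OF = {"UP": "LEFT", "LEFT": "DOWN", "DOWN": "RIGHT", "RIGHT": "UP"}
RIGHT_OF = {"UP": "RIGHT", "RIGHT": "DOWN", "DOWN": "LEFT", "LEFT": "UP"}

def outcome_distribution(intended):
    """Return {actual_direction: probability} for one intended move."""
    return {
        intended: 0.8,            # move as intended
        LEFT_OF[intended]: 0.1,   # slip to the left
        RIGHT_OF[intended]: 0.1,  # slip to the right
    }

def sample_outcome(intended, rng=random):
    """Draw one actual direction according to the distribution above."""
    dist = outcome_distribution(intended)
    directions = list(dist)
    weights = [dist[d] for d in directions]
    return rng.choices(directions, weights=weights, k=1)[0]

print(outcome_distribution("RIGHT"))  # {'RIGHT': 0.8, 'UP': 0.1, 'DOWN': 0.1}
```

Calling `sample_outcome("RIGHT")` many times should return RIGHT in roughly 80% of the draws.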

**4. Reward (R):** A reward is a number given to the agent after it takes an action. A positive reward means the result of the action was good; a negative reward means the outcome was bad or carried a penalty. Rewards help the agent learn what’s good or bad. Examples:

+1 for reaching the goal
-1 for stepping into fire
-0.1 for each step to encourage fewer moves
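To see how such rewards add up, here is a small sketch that sums the rewards of one episode, optionally weighting later rewards by the discount factor γ from the tuple definition. The function name `discounted_return`, the sample episode and the value γ = 0.9 are assumptions chosen for illustration; the episode also assumes the final step earns only the +1 goal reward.

```python
def discounted_return(rewards, gamma=1.0):
    """Sum rewards, weighting the reward at step t by gamma ** t."""
    total = 0.0
    for t, r in enumerate(rewards):
        total += (gamma ** t) * r
    return total

# Hypothetical episode: three ordinary steps, then the goal is reached.
episode = [-0.1, -0.1, -0.1, 1.0]

print(round(discounted_return(episode), 3))             # 0.7
print(round(discounted_return(episode, gamma=0.9), 3))  # 0.458
```

With γ below 1, the goal reward counts for less the later it arrives, which is another way of pushing the agent toward shorter paths.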

**5. Policy (π):** A policy is the agent’s plan. It tells the agent: “If you are in this state, take this action.” The goal is to find the best policy, the one that earns the agent the highest total reward over time.

Let’s consider a 3x4 grid world (3 rows, 4 columns), with cells written as (column, row). The agent starts at cell (1,1) and aims to reach the Blue Diamond at (4,3) while avoiding Fire at (4,2) and a Wall at (2,2). In each state the agent can take one of the following actions: UP, DOWN, LEFT or RIGHT.

Problem

1. Movement with Uncertainty (Transition Model)

The agent’s moves are stochastic (uncertain):

80% chance of going in the intended direction.
10% chance of going left of the intended direction.
10% chance of going right of the intended direction.
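The movement rule above can be sketched as a function that returns the possible next cells and their probabilities. This is an illustration under stated assumptions: cells are (column, row) with (1,1) at the bottom-left, UP increases the row, and a move into the wall or off the grid leaves the agent where it is. The names `move` and `next_state_distribution` are hypothetical.

```python
COLS, ROWS = 4, 3   # cells are (column, row), starting at (1, 1)
WALL = (2, 2)

MOVES = {"UP": (0, 1), "DOWN": (0, -1), "LEFT": (-1, 0), "RIGHT": (1, 0)}
LEFT_OF = {"UP": "LEFT", "LEFT": "DOWN", "DOWN": "RIGHT", "RIGHT": "UP"}
RIGHT_OF = {"UP": "RIGHT", "RIGHT": "DOWN", "DOWN": "LEFT", "LEFT": "UP"}

def move(state, direction):
    """Apply one deterministic move; blocked moves leave the state unchanged."""
    dx, dy = MOVES[direction]
    nxt = (state[0] + dx, state[1] + dy)
    inside = 1 <= nxt[0] <= COLS and 1 <= nxt[1] <= ROWS
    return nxt if inside and nxt != WALL else state

def next_state_distribution(state, action):
    """Return {next_state: probability} under the 80/10/10 slip model."""
    dist = {}
    outcomes = ((action, 0.8), (LEFT_OF[action], 0.1), (RIGHT_OF[action], 0.1))
    for direction, p in outcomes:
        nxt = move(state, direction)
        # Several outcomes can end in the same cell, so probabilities add up.
        dist[nxt] = dist.get(nxt, 0.0) + p
    return dist

print(next_state_distribution((1, 1), "UP"))
# {(1, 2): 0.8, (1, 1): 0.1, (2, 1): 0.1}
```

From the start cell, choosing UP usually reaches (1,2), but the left slip runs into the grid edge and the right slip lands in (2,1).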

2. Reward System

+1 for reaching the goal.
-1 for falling into fire.
-0.04 for each regular move (to encourage shorter paths).
0 for hitting a wall (no movement and no penalty).

3. Goal and Policy

The agent’s objective is to maximize total rewards.
It must find an optimal policy: the best action to take in each state to reach the goal quickly while avoiding danger.
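One standard way to compute such a policy is value iteration. The sketch below applies it to this grid under several assumptions that the example itself does not fix: a discount factor of 0.9, cells written as (column, row) with UP increasing the row, the goal and fire treated as terminal cells, and the grid edge treated like the wall. All function names are illustrative.

```python
GAMMA = 0.9         # assumed discount factor; the example does not fix one
COLS, ROWS = 4, 3   # cells are (column, row), starting at (1, 1)
WALL, GOAL, FIRE = (2, 2), (4, 3), (4, 2)
MOVES = {"UP": (0, 1), "DOWN": (0, -1), "LEFT": (-1, 0), "RIGHT": (1, 0)}
LEFT_OF = {"UP": "LEFT", "LEFT": "DOWN", "DOWN": "RIGHT", "RIGHT": "UP"}
RIGHT_OF = {"UP": "RIGHT", "RIGHT": "DOWN", "DOWN": "LEFT", "LEFT": "UP"}

# Non-terminal states: every cell except the wall, the goal and the fire.
STATES = [(c, r) for c in range(1, COLS + 1) for r in range(1, ROWS + 1)
          if (c, r) not in (WALL, GOAL, FIRE)]

def move(state, direction):
    """Apply one deterministic move; blocked moves leave the state unchanged."""
    dx, dy = MOVES[direction]
    nxt = (state[0] + dx, state[1] + dy)
    inside = 1 <= nxt[0] <= COLS and 1 <= nxt[1] <= ROWS
    return nxt if inside and nxt != WALL else state

def reward(state, nxt):
    """Reward for moving from state to nxt, following the reward system above."""
    if nxt == GOAL:
        return 1.0
    if nxt == FIRE:
        return -1.0
    if nxt == state:   # blocked by the wall (or, by assumption, the grid edge)
        return 0.0
    return -0.04

def q_value(state, action, values):
    """Expected reward plus discounted future value of taking action in state."""
    total = 0.0
    outcomes = ((action, 0.8), (LEFT_OF[action], 0.1), (RIGHT_OF[action], 0.1))
    for direction, p in outcomes:
        nxt = move(state, direction)
        # Goal and fire are terminal, so they contribute no future value.
        total += p * (reward(state, nxt) + GAMMA * values.get(nxt, 0.0))
    return total

def value_iteration(tol=1e-6):
    """Repeat the Bellman update until values change by less than tol."""
    values = {s: 0.0 for s in STATES}
    while True:
        new_values = {s: max(q_value(s, a, values) for a in MOVES) for s in STATES}
        delta = max(abs(new_values[s] - values[s]) for s in STATES)
        values = new_values
        if delta < tol:
            break
    policy = {s: max(MOVES, key=lambda a: q_value(s, a, values)) for s in STATES}
    return values, policy

values, policy = value_iteration()
# Print the policy with the top row first; '-' marks wall, goal and fire.
for row in range(ROWS, 0, -1):
    print(" ".join(f"{policy.get((col, row), '-'):>5}" for col in range(1, COLS + 1)))
```

The printed grid shows the chosen action for each cell. Changing `GAMMA` or the step penalty can change how much risk the agent accepts near the fire.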

4. Path Example

One possible optimal path is: UP → UP → RIGHT → RIGHT → RIGHT.
But because of randomness, the agent must plan carefully to avoid accidentally slipping into fire.
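A quick calculation shows how much the randomness matters for this path. The sketch below traces the five moves from (1,1) and computes the chance that every move goes in the intended direction. It assumes cells are (column, row) with UP increasing the row, and it ignores the less likely case of reaching the goal after one or more slips.

```python
MOVES = {"UP": (0, 1), "DOWN": (0, -1), "LEFT": (-1, 0), "RIGHT": (1, 0)}
path = ["UP", "UP", "RIGHT", "RIGHT", "RIGHT"]

# Trace the path as if no slip ever happened.
state = (1, 1)
for action in path:
    dx, dy = MOVES[action]
    state = (state[0] + dx, state[1] + dy)
print(state)  # (4, 3), the goal cell

# Each move goes as intended with probability 0.8, independently of the others.
p_no_slip = 0.8 ** len(path)
print(round(p_no_slip, 5))  # 0.32768
```

So the plan succeeds exactly as written only about one time in three, which is why the agent needs a policy for every state rather than a single fixed sequence of moves.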

Applications

Robots use MDPs to decide how to move safely and efficiently in places like factories or warehouses and avoid obstacles.
In board games or video games, MDPs help characters choose the best moves to win or complete tasks even when outcomes are not certain.
Doctors can use MDPs to plan treatments for patients, choosing actions that improve health while accounting for uncertain effects.
Self-driving cars or delivery vehicles use MDPs to find safe routes and avoid accidents on unpredictable roads.
Stores and warehouses use MDPs to decide when to order more stock so they don’t run out or keep too much even when demand changes.