Map Reduce in Hadoop

Last Updated : 1 Nov, 2025

MapReduce is the processing engine of Hadoop. While HDFS is responsible for storing massive amounts of data, MapReduce handles the actual computation and analysis. It provides a simple yet powerful programming model that allows developers to process large datasets in a distributed and parallel manner. It is a two-phase data processing model: the Map phase turns input records into intermediate (key, value) pairs, and the Reduce phase aggregates the values that share a key.

Map and Reduce interfaces

How MapReduce Works

**Example Input File:** sample.txt

Hello I am GeeksforGeeks
How can I help you
How can I assist you
Are you an engineer
Are you looking for coding
Are you looking for interview questions
what are you doing these days
what are your strengths

For processing, this file is divided into input splits (called first.txt, second.txt, etc. here for illustration); by default a split corresponds roughly to one HDFS block. Each split is assigned to one Mapper for processing.

Step 1: Input Splitting & Record Reader

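With the default text format, the RecordReader turns each line of a split into a (key, value) pair: the key is the byte offset at which the line starts, and the value is the text of the line.

**Example:**

(0, "Hello I am GeeksforGeeks")
(26, "How can I help you")

The first line is 24 bytes long, so the offset 26 assumes a two-byte line ending (\r\n); with a single \n the second key would be 25.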
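To make the origin of the keys concrete, the sketch below imitates a line-oriented record reader in plain Python. It is not Hadoop's RecordReader API: `read_records` and `sample` are hypothetical names, and the sample is assumed to use \r\n line endings.

```python
from typing import Iterator, Tuple


def read_records(data: bytes) -> Iterator[Tuple[int, str]]:
    """Yield (byte_offset, line_text) pairs, one per line of `data`."""
    offset = 0
    # keepends=True keeps the line terminator, so its bytes count
    # toward the offset of the following line.
    for raw_line in data.splitlines(keepends=True):
        yield offset, raw_line.rstrip(b"\r\n").decode("utf-8")
        offset += len(raw_line)


# Assumption: the file uses two-byte \r\n line endings.
sample = b"Hello I am GeeksforGeeks\r\nHow can I help you\r\n"
records = list(read_records(sample))
for record in records:
    print(record)
```

Swapping the terminators for a single \n moves the second key from 26 to 25, because each key is simply the running total of bytes consumed so far.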

**File Formats in Hadoop**

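The way a RecordReader converts text into (key, value) pairs depends on the input file format. Hadoop provides several built-in formats, including TextInputFormat (key: byte offset of the line, value: the line itself), KeyValueTextInputFormat (each line is split into key and value at a separator, a tab by default), and SequenceFileInputFormat (binary files of key-value pairs).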

By default, Hadoop uses TextInputFormat.
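As a rough illustration of how the chosen format changes the resulting pair, the sketch below contrasts a TextInputFormat-style record with a KeyValueTextInputFormat-style record. It is plain Python rather than Hadoop code; the function names and the sample line are hypothetical, and the tab separator is assumed to be the one in use.

```python
from typing import Tuple


def text_record(offset: int, line: str) -> Tuple[int, str]:
    """TextInputFormat-style record: key is the byte offset, value is the line."""
    return offset, line


def key_value_record(line: str, separator: str = "\t") -> Tuple[str, str]:
    """KeyValueTextInputFormat-style record: split at the first separator.

    If the separator is absent, the whole line becomes the key and the
    value is empty.
    """
    key, _, value = line.partition(separator)
    return key, value


line = "user42\tlogged in"  # hypothetical input line
print(text_record(0, line))
print(key_value_record(line))
```

The same line therefore reaches the Mapper either keyed by its position in the file or keyed by its own first field, depending only on the configured format.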

Step 2: Map Phase

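Each Mapper processes the (key, value) pairs of its split and generates intermediate (key, value) pairs. For word count, it ignores the offset key, splits the line into words, and emits (word, 1) for every word.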

For (0, "Hello I am GeeksforGeeks"):

(Hello, 1)
(I, 1)
(am, 1)
(GeeksforGeeks, 1)

For (26, "How can I help you"):

(How, 1)
(can, 1)
(I, 1)
(help, 1)
(you, 1)

All Mappers run in parallel, one per input split.
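The map step can be sketched in plain Python as follows. This is a simulation under stated assumptions, not the Hadoop Mapper interface: `map_line` and `run_mapper` are hypothetical names, each split is assumed to hold a single record, and a thread pool merely stands in for Mappers running on separate cluster nodes.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Iterator, List, Tuple

Record = Tuple[int, str]
Pair = Tuple[str, int]


def map_line(offset: int, line: str) -> Iterator[Pair]:
    """Word-count mapper: ignore the offset key, emit (word, 1) per word."""
    for word in line.split():
        yield word, 1


def run_mapper(split: List[Record]) -> List[Pair]:
    """Apply map_line to every record of one input split."""
    return [pair for offset, line in split for pair in map_line(offset, line)]


# Assumption: two splits, one record each.
splits = [
    [(0, "Hello I am GeeksforGeeks")],
    [(26, "How can I help you")],
]

# One task per split; pool.map returns results in split order.
with ThreadPoolExecutor() as pool:
    mapper_outputs = list(pool.map(run_mapper, splits))

for output in mapper_outputs:
    print(output)
```

No Mapper needs to see another Mapper's records, which is what lets the splits be processed independently.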

Step 3: Shuffling and Sorting

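The Mapper outputs are not final. Before reducing, the framework shuffles the intermediate pairs so that all values for the same key go to the same Reducer, and sorts them by key: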

(Are, [1,1,1])
(How, [1,1])
(I, [1,1,1])

This prepares data for the reducer.
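A minimal sketch of this grouping step, again in plain Python rather than Hadoop code: `shuffle_and_sort` is a hypothetical helper, and the input is assumed to be the Mapper output for just two of the sample lines ("How can I help you" and "How can I assist you").

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def shuffle_and_sort(pairs: Iterable[Tuple[str, int]]) -> List[Tuple[str, List[int]]]:
    """Group values by key, then return the groups ordered by key."""
    grouped: Dict[str, List[int]] = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    # Keys are unique, so sorting the items orders the groups by key.
    return sorted(grouped.items())


# Assumption: intermediate pairs produced for two sample lines only.
mapper_output = [
    ("How", 1), ("can", 1), ("I", 1), ("help", 1), ("you", 1),
    ("How", 1), ("can", 1), ("I", 1), ("assist", 1), ("you", 1),
]
grouped = shuffle_and_sort(mapper_output)
for key, values in grouped:
    print(key, values)
```

Python orders uppercase letters before lowercase ones, so "How" and "I" come before "assist" here. In a real cluster the grouped data is also partitioned across several Reducers, which this single-process sketch leaves out.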

Step 4: Reduce Phase

The Reducer aggregates values for each key:

(Are, [1,1,1]) → (Are, 3)
(How, [1,1]) → (How, 2)
(I, [1,1,1]) → (I, 3)
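The aggregation itself is small. The sketch below shows one way to write a word-count reducer in plain Python; `reduce_counts` is a hypothetical name, and the grouped input is assumed to arrive already shuffled and sorted.

```python
from typing import List, Tuple


def reduce_counts(key: str, values: List[int]) -> Tuple[str, int]:
    """Word-count reducer: sum all values collected for one key."""
    return key, sum(values)


# Assumption: grouped input as delivered by the shuffle-and-sort step.
grouped = [("Are", [1, 1, 1]), ("How", [1, 1])]
reduced = [reduce_counts(key, values) for key, values in grouped]
print(reduced)
```

The reducer is called once per key and sees only that key's values, so different keys can be reduced on different nodes.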

Final output (saved in result.output):

Hello - 1
I - 3
am - 1
GeeksforGeeks - 1
How - 2
Are - 3
are - 2
you - 6
what - 2
...
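To check the counts end to end, the whole word-count flow can be simulated on the contents of sample.txt in a few lines of plain Python. This is a single-process sketch, not a Hadoop job: `word_count` and `SAMPLE_TEXT` are hypothetical names, and sorting followed by `itertools.groupby` stands in for the shuffle-and-sort step.

```python
from itertools import groupby
from operator import itemgetter
from typing import Dict

SAMPLE_TEXT = """Hello I am GeeksforGeeks
How can I help you
How can I assist you
Are you an engineer
Are you looking for coding
Are you looking for interview questions
what are you doing these days
what are your strengths
"""


def word_count(text: str) -> Dict[str, int]:
    # Map: one (word, 1) pair per word on every line.
    pairs = [(word, 1) for line in text.splitlines() for word in line.split()]
    # Shuffle and sort: order by key so equal keys become adjacent.
    pairs.sort(key=itemgetter(0))
    # Reduce: sum the values in each run of equal keys.
    return {
        key: sum(value for _, value in group)
        for key, group in groupby(pairs, key=itemgetter(0))
    }


counts = word_count(SAMPLE_TEXT)
for word, count in counts.items():
    print(f"{word} - {count}")
```

Keys are compared case-sensitively, which is why "Are" and "are" are counted as separate words.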

Advantages of MapReduce in Hadoop