Map Reduce in Hadoop (original) (raw)

Last Updated : 1 Nov, 2025

MapReduce is the processing engine of Hadoop. While HDFS is responsible for storing massive amounts of data, MapReduce handles the actual computation and analysis. It provides a simple yet powerful programming model that allows developers to process large datasets in a distributed and parallel manner. It is a two-phase data processing model in Hadoop:

**Map Phase: Breaks the data into smaller chunks, processes them, and generates intermediate (key, value) pairs.
**Reduce Phase : Aggregates and merges the intermediate results to produce the final output.

Map and Reduce interfaces

How MapReduce Works

**Example Input File: sample.txt

Hello I am GeeksforGeeks
How can I help you
How can I assist you
Are you an engineer
Are you looking for coding
Are you looking for interview questions
what are you doing these days
what are your strengths

When stored in HDFS, this file is divided into input splits (e.g., first.txt, second.txt, etc.). Each split is assigned to a Mapper for processing.

Step 1: Input Splitting & Record Reader

The input file is divided into splits.
Each split is converted into (key, value) pairs using a RecordReader.
In TextInputFormat (default), the key is the byte offset of the line and the value is the line itself.

**Example:

(0, "Hello I am GeeksforGeeks")
(26, "How can I help you")

**File Formats in Hadoop

The way a RecordReader converts text into (key, value) pairs depends on the input file format. Hadoop provides several built‑in formats:

TextInputFormat (Default) : Each line becomes (byte offset, line).
KeyValueTextInputFormat : Splits each line into (key, value) pairs using a delimiter.
SequenceFileInputFormat : Stores data in a binary (key, value) format (efficient for data exchange between MapReduce jobs).
SequenceFileAsTextInputFormat : Reads binary sequence files but presents keys and values as text.

By default, Hadoop uses TextInputFormat.

Step 2: Map Phase

Each Mapper processes its assigned (key, value) pair and generates intermediate data.

For (0, "Hello I am GeeksforGeeks"):

(Hello, 1)
(I, 1)
(am, 1)
(GeeksforGeeks, 1)

For (26, "How can I help you"):

(How, 1)
(can, 1)
(I, 1)
(help, 1)
(you, 1)

All Mappers run in parallel, one per input split.

Step 3: Shuffling and Sorting

The Mapper outputs are not final. Before reducing:

Shuffling groups all values of identical keys:

(How, [1,1])
(Are, [1,1,1])
(I, [1,1])

Sorting arranges keys so that they can be processed sequentially.

This prepares data for the reducer.

Step 4: Reduce Phase

The Reducer aggregates values for each key:

(How, [1,1]) → (How, 2)
(Are, [1,1,1]) → (Are, 3)
(I, [1,1]) → (I, 2)

Final output (saved in result.output):

Hello - 1
I - 2
am - 1
GeeksforGeeks - 1
How - 2
Are - 3
are - 2
you - 3
what - 2
...

Advantages of MapReduce in Hadoop

**Scalable : Efficiently processes petabytes of data across thousands of nodes.
**Parallel : Executes tasks simultaneously on distributed data blocks.
**Fault-Tolerant : Recovers automatically by reassigning failed tasks.
**Simple : Developers only write Map and Reduce logic; Hadoop manages execution.

Hadoop Distributed File System (HDFS)

Different Modes of Hadoop Operation

Hadoop Schedulers and its types

Hadoop Ecosystem Components