Hadoop - Reducer in Map-Reduce

Last Updated : 3 Oct, 2025

**MapReduce** is a core programming model in the **Hadoop** ecosystem, designed to process large datasets in parallel across distributed machines (nodes). The execution flow is divided into two major phases: the Map phase and the Reduce phase.

Hadoop programs typically consist of three main components: a Mapper class, a Reducer class, and a Driver class that configures and submits the job.

The Reducer is the second stage of MapReduce. It takes the intermediate key-value pairs generated by the Mapper and produces the final consolidated output, which is then written to HDFS (Hadoop Distributed File System).

Workflow of Reducer in MapReduce

[Figure: Reducer in MapReduce]

**1. Intermediate Data (Mapper Output):** The Mapper produces output in the form of (key, value) pairs.

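**2. Shuffle & Sort:** Before passing the data to the Reducer, Hadoop automatically performs two operations: shuffling, which transfers each Mapper's output to the Reducer responsible for its keys, and sorting, which orders the pairs by key so that all values for the same key end up grouped together.

Shuffling and sorting overlap for efficiency: map outputs are merged and sorted while they are still being fetched, rather than in two strictly sequential steps.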

**3. Reduce Phase:** The Reducer receives (key, list of values) and applies user-defined computation logic such as aggregation, filtering, or summation. The output is then written back to HDFS.
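The three steps above can be sketched in plain Java. This is only a single-process simulation under stated assumptions, not Hadoop code: the class name `ReduceWorkflowSketch` and its methods are hypothetical, a `TreeMap` stands in for the framework's shuffle and sort, and summation is used as the example reduce logic.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Plain-Java simulation of shuffle, sort, and reduce; no Hadoop classes are used.
class ReduceWorkflowSketch {

    // Shuffle & sort: group values by key; TreeMap keeps the keys in sorted order.
    static Map<String, List<Integer>> shuffleAndSort(List<Map.Entry<String, Integer>> mapperOutput) {
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (Map.Entry<String, Integer> pair : mapperOutput) {
            grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>()).add(pair.getValue());
        }
        return grouped;
    }

    // Reduce: called once per key with all of that key's values; here it sums them.
    static int reduce(String key, List<Integer> values) {
        int sum = 0;
        for (int v : values) {
            sum += v;
        }
        return sum;
    }

    // Runs the grouped data through reduce and collects one result per key.
    static Map<String, Integer> run(List<Map.Entry<String, Integer>> mapperOutput) {
        Map<String, Integer> result = new TreeMap<>();
        shuffleAndSort(mapperOutput).forEach((key, values) -> result.put(key, reduce(key, values)));
        return result;
    }

    public static void main(String[] args) {
        // Hypothetical mapper output: unordered (key, value) pairs.
        List<Map.Entry<String, Integer>> mapperOutput = List.of(
            Map.entry("b", 2), Map.entry("a", 1), Map.entry("b", 3));
        System.out.println(shuffleAndSort(mapperOutput)); // {a=[1], b=[2, 3]}
        System.out.println(run(mapperOutput));            // {a=1, b=5}
    }
}
```

In a real job the grouping happens across machines and the reduce logic lives in a Reducer class; the sketch only shows the shape of the data at each step.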

Example – Faculty Salary Summation

Suppose we have faculty salary data stored in a CSV file. If we want to compute the total salary per department, we can have the Mapper emit the department name as the key and the salary as the value for each record.

The Reducer will aggregate all salary values for each department and produce the final result in the format:

Dept_Name Total_Salary
CSE 750000
ECE 620000
MECH 450000
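The salary summation can be sketched in plain Java as follows. This is an illustration rather than the Hadoop implementation: the class `FacultySalarySketch`, the assumed CSV layout (`name,department,salary`), and the sample rows are all hypothetical, chosen so that the totals match the table above.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Plain-Java sketch of the salary example; the CSV layout and rows are hypothetical.
class FacultySalarySketch {

    // Map step: parse "name,department,salary" and emit (department, salary).
    static Map.Entry<String, Long> map(String csvLine) {
        String[] fields = csvLine.split(",");
        return Map.entry(fields[1].trim(), Long.parseLong(fields[2].trim()));
    }

    // Reduce step, expressed as a running total per department key.
    static Map<String, Long> totalByDepartment(List<String> csvLines) {
        Map<String, Long> totals = new TreeMap<>();
        for (String line : csvLines) {
            Map.Entry<String, Long> pair = map(line);
            totals.merge(pair.getKey(), pair.getValue(), Long::sum);
        }
        return totals;
    }

    public static void main(String[] args) {
        // Hypothetical input rows (assumed layout: name,department,salary).
        List<String> rows = List.of(
            "Asha,CSE,400000", "Ravi,CSE,350000",
            "Meena,ECE,320000", "Karan,ECE,300000",
            "Vijay,MECH,450000");
        // Prints one "Dept_Name Total_Salary" line per department.
        totalByDepartment(rows).forEach((dept, total) -> System.out.println(dept + " " + total));
    }
}
```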

Characteristics of Reducer in MapReduce

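By default, Reducer output files are named part-r-00000, part-r-00001, and so on, one file per Reducer. The base name can be changed in the driver:

job.getConfiguration().set("mapreduce.output.basename", "GeeksForGeeks");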

Phases of Reducer

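The Reducer has three primary phases: shuffle, sort, and reduce.

**Note:** The output of the Reducer is not re-sorted; records are written in the order in which the Reducer emits them.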

Setting Number of Reducers in MapReduce

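Hadoop allows users to configure the number of Reducers in two ways.

As a configuration property, for example on the command line with the -D option (mapred.reduce.tasks is the legacy property name; current releases call it mapreduce.job.reduces):

mapred.reduce.tasks=<number_of_reducers>

In the driver code:

job.setNumReduceTasks(2);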

If set to 0, only the Map phase is executed (useful for Map-only jobs).
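Once the number of Reducers is set, each key has to be assigned to exactly one of them. The sketch below illustrates the hash-based assignment that Hadoop's default `HashPartitioner` performs: mask the sign bit of the key's hash code, then take the remainder modulo the number of reduce tasks. The class `PartitionSketch` is hypothetical, and it uses `String.hashCode()` for illustration; Hadoop key types such as `Text` compute their hash differently, so the actual partition numbers in a job will not match.

```java
// Sketch of hash partitioning; uses String.hashCode() purely for illustration.
class PartitionSketch {

    // Returns the index of the reducer (0 .. numReduceTasks - 1) that receives this key.
    static int partitionFor(String key, int numReduceTasks) {
        // Masking with Integer.MAX_VALUE clears the sign bit so the result is never negative.
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }

    public static void main(String[] args) {
        int numReduceTasks = 2; // same value as job.setNumReduceTasks(2)
        for (String dept : new String[] {"CSE", "ECE", "MECH"}) {
            System.out.println(dept + " -> reducer " + partitionFor(dept, numReduceTasks));
        }
    }
}
```

Because the assignment depends only on the key, every value for a given key reaches the same Reducer. It also means that a small set of keys can land unevenly across Reducers.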

Best Practices for Setting Reducer Count

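The number of Reducers significantly affects performance and resource utilization. Too few Reducers leave the cluster underused, while too many add scheduling overhead and produce many small output files. Ideally, the count should be tuned based on cluster size and workload.

**Recommended formula:**

NumReducers ≈ (0.95 or 1.75) × (Number of Nodes × Max Containers per Node)

With a factor of 0.95, all Reducers can launch immediately and start transferring map outputs as the Mappers finish. With 1.75, the faster nodes finish their first round of Reducers and launch a second wave, which improves load balancing.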
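As a worked example of the formula, the sketch below computes both suggestions for a hypothetical cluster. The class `ReducerCountSketch`, the cluster size of 10 nodes, and the limit of 8 containers per node are assumptions made for illustration; rounding to the nearest integer is also a choice of this sketch, since the formula itself is only approximate.

```java
// Worked example of the reducer-count rule of thumb; the cluster figures are hypothetical.
class ReducerCountSketch {

    // factor is 0.95 (single wave) or 1.75 (two waves, better load balancing).
    static int suggestedReducers(double factor, int nodes, int maxContainersPerNode) {
        return (int) Math.round(factor * (nodes * maxContainersPerNode));
    }

    public static void main(String[] args) {
        int nodes = 10;               // assumed cluster size
        int maxContainersPerNode = 8; // assumed per-node container limit
        System.out.println(suggestedReducers(0.95, nodes, maxContainersPerNode)); // 76
        System.out.println(suggestedReducers(1.75, nodes, maxContainersPerNode)); // 140
    }
}
```

The resulting value would then be passed to job.setNumReduceTasks(...) in the driver.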