Hadoop - Reducer in Map-Reduce
Last Updated : 3 Oct, 2025
**MapReduce** is a core programming model in the **Hadoop** ecosystem, designed to process large datasets in parallel across distributed machines (nodes). The execution flow is divided into two major phases: the Map phase and the Reduce phase.
Hadoop programs typically consist of three main components:
- **Mapper Class:** Processes input data and generates intermediate key-value pairs.
- **Reducer Class:** Aggregates and processes the intermediate results.
- **Driver Class:** Configures and manages the job execution.
The Reducer is the second stage of MapReduce. It takes the intermediate key-value pairs generated by the Mapper and produces the final consolidated output, which is then written to HDFS (Hadoop Distributed File System).
Workflow of Reducer in MapReduce

**1. Intermediate Data (Mapper Output):** The Mapper produces output in the form of (key, value) pairs.
**2. Shuffle & Sort:** Before passing the data to the Reducer, Hadoop automatically performs two operations:
- **Shuffling:** Transfers the relevant data from all Mappers to the appropriate Reducer.
- **Sorting:** Sorts the intermediate pairs by key, so that all values belonging to the same key are grouped and processed together.
Shuffling and sorting occur simultaneously: map outputs are merged while they are still being fetched.
**3. Reduce Phase:**
The Reducer receives (key, list of values) and applies user-defined computation logic such as aggregation, filtering, or summation. The output is then written back to HDFS.
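The grouping performed by shuffle and sort can be illustrated with a minimal plain-Java sketch. It does not use the Hadoop API: the class name `ShuffleSortSketch`, the method `shuffleAndSort`, and the sample pairs are hypothetical, and a `TreeMap` stands in for Hadoop's sorted merge.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class ShuffleSortSketch {

    // Collects (key, value) pairs from every mapper into key -> list of values.
    // The TreeMap keeps keys in sorted order, mimicking the sort step.
    public static Map<String, List<Integer>> shuffleAndSort(
            List<List<Map.Entry<String, Integer>>> mapperOutputs) {
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (List<Map.Entry<String, Integer>> output : mapperOutputs) {
            for (Map.Entry<String, Integer> pair : output) {
                grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>())
                       .add(pair.getValue());
            }
        }
        return grouped;
    }

    public static void main(String[] args) {
        // Hypothetical output of two mappers.
        List<Map.Entry<String, Integer>> mapper1 =
                List.of(Map.entry("b", 1), Map.entry("a", 1));
        List<Map.Entry<String, Integer>> mapper2 =
                List.of(Map.entry("a", 1), Map.entry("c", 1));

        Map<String, List<Integer>> grouped = shuffleAndSort(List.of(mapper1, mapper2));

        // Each entry corresponds to one reduce call: (key, list of values).
        grouped.forEach((key, values) -> System.out.println(key + " -> " + values));
    }
}
```

Each printed entry has the (key, list of values) shape that a single reduce call receives.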
Example – Faculty Salary Summation
Suppose we have faculty salary data stored in a CSV file. If we want to compute the total salary per department, we can:
- Use the department name as the key.
- Use the salary as the value.
The Reducer will aggregate all salary values for each department and produce the final result in the format:
| Dept_Name | Total_Salary |
|---|---|
| CSE | 750000 |
| ECE | 620000 |
| MECH | 450000 |
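As a rough illustration of this example, the plain-Java sketch below sums salaries per department without using the Hadoop API. The CSV column layout (name, department, salary), the sample rows, and the names `SalaryTotalSketch` and `totalSalaryByDept` are assumptions made for the sketch; the rows were chosen only so that the totals match the figures above.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class SalaryTotalSketch {

    // Assumed CSV layout (hypothetical): name,department,salary
    public static Map<String, Long> totalSalaryByDept(List<String> csvLines) {
        Map<String, Long> totals = new TreeMap<>();
        for (String line : csvLines) {
            String[] fields = line.split(",");
            String dept = fields[1].trim();                  // key: department name
            long salary = Long.parseLong(fields[2].trim());  // value: salary
            totals.merge(dept, salary, Long::sum);           // reduce: running sum per key
        }
        return totals;
    }

    public static void main(String[] args) {
        // Hypothetical input rows, not taken from a real dataset.
        List<String> rows = List.of(
                "A,CSE,400000",
                "B,CSE,350000",
                "C,ECE,620000",
                "D,MECH,200000",
                "E,MECH,250000");

        // Print one "department<TAB>total" line per key.
        totalSalaryByDept(rows).forEach(
                (dept, total) -> System.out.println(dept + "\t" + total));
    }
}
```

In a real job the parsing would live in the Mapper and the summation in the Reducer; the sketch folds both into one method for brevity.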
Characteristics of Reducer in MapReduce
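- **Default Reducer Count:** By default, Hadoop assigns 1 Reducer to a job. This can be configured as per requirements.
- **Key-to-Reducer Mapping:** Each unique key is assigned to exactly one Reducer, so all values for that key are processed in one place. A single Reducer may still handle many different keys.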
- The final output files are stored in HDFS under the job’s output directory, named as part-r-00000, part-r-00001, etc. according to the number of Reducers, along with a _SUCCESS file to indicate job completion.
- **Custom Output Filename:** By default, output files follow the pattern part-r-xxxxx. You can replace the part prefix in the driver code, which yields names such as GeeksForGeeks-r-00000:
job.getConfiguration().set("mapreduce.output.basename", "GeeksForGeeks");
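The key-to-Reducer assignment and the output file naming described above can be sketched in plain Java. Hadoop's default HashPartitioner uses the same hash-modulo idea, but this sketch applies it to `String.hashCode()` purely for illustration (real jobs use Writable key types with their own hash codes), and the names `PartitionSketch`, `partitionFor`, and `outputFileName` are hypothetical.

```java
import java.util.Locale;

public class PartitionSketch {

    // Picks a reducer index for a key: non-negative hash modulo reducer count.
    public static int partitionFor(String key, int numReducers) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReducers;
    }

    // Builds an output file name such as part-r-00001 for a reducer index.
    public static String outputFileName(String baseName, int partition) {
        return String.format(Locale.ROOT, "%s-r-%05d", baseName, partition);
    }

    public static void main(String[] args) {
        int numReducers = 3; // hypothetical reducer count
        for (String key : new String[] {"CSE", "ECE", "MECH", "CSE"}) {
            int p = partitionFor(key, numReducers);
            // The same key always lands on the same reducer index.
            System.out.println(key + " -> reducer " + p + " -> " + outputFileName("part", p));
        }
    }
}
```

Because the index depends only on the key and the reducer count, repeated occurrences of a key always map to the same Reducer and therefore to the same output file.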
Phases of Reducer
- **Shuffle:** Moves Mapper output to the appropriate Reducer via HTTP.
- **Sort:** Groups values belonging to the same key.
- **Reduce:** Performs the actual computation (sum, average, filter, etc.).
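**Note:** The output of the Reducer is not re-sorted. Each Reducer receives its keys in sorted order, but by default there is no global ordering across the output files of different Reducers.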
Setting Number of Reducers in MapReduce
Hadoop allows users to configure the number of Reducers:
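- Using the command line (passed as a generic -D option, which requires the driver to parse generic options, for example through ToolRunner; the older property name mapred.reduce.tasks is deprecated):
-D mapreduce.job.reduces=<number_of_reducers>
- Using the Job object in driver code: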
job.setNumReduceTasks(2);
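If set to 0, only the Map phase is executed (useful for Map-only jobs). In that case no shuffle or sort takes place, and the Mapper output is written directly to the output directory.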
Best Practices for Setting Reducer Count
The number of Reducers significantly affects performance and resource utilization. Ideally, it should be tuned based on cluster size and workload:
**Recommended formula:**
NumReducers ≈ (0.95 or 1.75) × (Number of Nodes × Max Containers per Node)
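- **0.95 factor:** Creates slightly fewer Reducers than available containers → all Reducers can launch immediately and start transferring map outputs as the maps finish.
- **1.75 factor:** Creates more Reducers than available containers → faster nodes finish their first round of Reducers and launch a second wave, which improves load balancing at the cost of some Reducers running sequentially.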
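The formula above can be worked through with a small plain-Java sketch. The cluster figures (10 nodes, 8 containers per node) and the names `ReducerCountSketch` and `estimateReducers` are hypothetical, and rounding to the nearest integer is a choice made for this sketch rather than something the formula prescribes.

```java
public class ReducerCountSketch {

    // Estimates the reducer count as factor x (nodes x max containers per node),
    // rounded to the nearest integer (rounding is an assumption of this sketch).
    public static int estimateReducers(double factor, int nodes, int maxContainersPerNode) {
        return (int) Math.round(factor * (nodes * maxContainersPerNode));
    }

    public static void main(String[] args) {
        int nodes = 10;      // hypothetical cluster size
        int containers = 8;  // hypothetical max containers per node

        // 0.95 x 80 = 76: all reducers fit in a single wave.
        System.out.println(estimateReducers(0.95, nodes, containers));
        // 1.75 x 80 = 140: reducers run in roughly two waves.
        System.out.println(estimateReducers(1.75, nodes, containers));
    }
}
```

The resulting value could then be passed to job.setNumReduceTasks(...) in the driver and adjusted after observing actual job behavior.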