Hadoop Reducer in MapReduce (original) (raw)

Hadoop - Reducer in Map-Reduce

Last Updated : 3 Oct, 2025

**MapReduce** is a core programming model in the **Hadoop** ecosystem, designed to process large datasets in parallel across distributed machines (nodes). Execution is divided into two major phases: the Map phase and the Reduce phase.

Hadoop programs typically consist of three main components:

**Mapper Class:** Processes input data and generates intermediate key-value pairs.
**Reducer Class:** Aggregates and processes the intermediate results.
**Driver Class:** Configures and manages the job execution.

The Reducer is the second stage of MapReduce. It takes the intermediate key-value pairs generated by the Mapper and produces the final consolidated output, which is then written to HDFS (Hadoop Distributed File System).

Workflow of Reducer in MapReduce

[Figure: Reducer in MapReduce]

**1. Intermediate Data (Mapper Output):** The Mapper produces output in the form of (key, value) pairs.

**2. Shuffle & Sort:** Before passing the data to the Reducer, Hadoop automatically performs two operations:

**Shuffling:** Transfers each Mapper's output for a given key to the Reducer responsible for that key.
**Sorting:** Merges and orders the fetched pairs by key, so that all values belonging to the same key are grouped and processed together.

Shuffling and sorting overlap: map outputs are merged while they are still being fetched, which saves time.

**3. Reduce Phase:**

The Reducer receives (key, list of values) and applies user-defined computation logic such as aggregation, filtering, or summation. The output is then written back to HDFS.
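The three steps above can be sketched in plain Java. This is a simulation only: it does not use the Hadoop API, the class and method names are hypothetical, and a `TreeMap` stands in for the framework's shuffle and sort.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Plain-Java simulation of shuffle, sort, and reduce (no Hadoop dependency).
class ShuffleSortSketch {

    // Shuffle + sort: collect every value under its key.
    // A TreeMap keeps the keys in sorted order, as the sort step would.
    static TreeMap<String, List<Integer>> shuffleAndSort(List<Map.Entry<String, Integer>> mapperOutput) {
        TreeMap<String, List<Integer>> grouped = new TreeMap<>();
        for (Map.Entry<String, Integer> pair : mapperOutput) {
            grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>()).add(pair.getValue());
        }
        return grouped;
    }

    // Reduce: called once per key with all of that key's values; here it sums them.
    static int reduce(String key, List<Integer> values) {
        int sum = 0;
        for (int value : values) {
            sum += value;
        }
        return sum;
    }

    public static void main(String[] args) {
        // Hypothetical mapper output: (key, value) pairs in arbitrary order.
        List<Map.Entry<String, Integer>> mapperOutput = List.of(
            Map.entry("b", 1), Map.entry("a", 2), Map.entry("b", 3));

        shuffleAndSort(mapperOutput).forEach(
            (key, values) -> System.out.println(key + "\t" + reduce(key, values)));
    }
}
```

In a real job the grouping happens across machines and the reduce logic lives in a Reducer class; the sketch only illustrates the data flow from (key, value) pairs to (key, list of values) to one result per key.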

Example – Faculty Salary Summation

Suppose we have faculty salary data stored in a CSV file. If we want to compute the total salary per department, we can:

Use the department name as the key.
Use the salary as the value.

The Reducer will aggregate all salary values for each department and produce the final result in the format:

Dept_Name Total_Salary
CSE 750000
ECE 620000
MECH 450000
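
The aggregation in this example can be sketched in plain Java, without the Hadoop API. The CSV layout (name, department, salary), the sample rows, and the class name are assumptions made for illustration; the rows were chosen so that the totals match the table above.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Plain-Java sketch of the salary example (no Hadoop dependency).
// Assumed CSV layout: name,department,salary
class SalaryByDepartment {

    static Map<String, Long> totalSalaryPerDepartment(List<String> csvLines) {
        Map<String, Long> totals = new TreeMap<>();
        for (String line : csvLines) {
            String[] fields = line.split(",");
            String department = fields[1].trim();            // key
            long salary = Long.parseLong(fields[2].trim());  // value
            totals.merge(department, salary, Long::sum);     // reducer-style sum per key
        }
        return totals;
    }

    public static void main(String[] args) {
        // Hypothetical input rows.
        List<String> csvLines = List.of(
            "Asha,CSE,400000",
            "Ravi,ECE,620000",
            "Meera,CSE,350000",
            "John,MECH,450000");

        totalSalaryPerDepartment(csvLines)
            .forEach((dept, total) -> System.out.println(dept + " " + total));
    }
}
```

In an actual MapReduce job, the parsing would sit in the Mapper and the summation in the Reducer; here both are folded into one method to keep the sketch short.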

Characteristics of Reducer in MapReduce

**Default Reducer Count:** By default, Hadoop runs 1 Reducer per job. This can be configured as required.
**One Reducer per Key:** Each unique key is assigned to exactly one Reducer, although a single Reducer usually handles many keys.
**Output Files:** The final output files are stored in HDFS under the job's output directory, named part-r-00000, part-r-00001, and so on (one per Reducer), along with a _SUCCESS file that indicates the job completed successfully.
**Custom Output Filename:** By default, output files follow the pattern part-r-xxxxx. You can replace the "part" prefix in the driver code; with the setting below, the files are named GeeksForGeeks-r-00000, GeeksForGeeks-r-00001, and so on:

job.getConfiguration().set("mapreduce.output.basename", "GeeksForGeeks");
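
How a key ends up at exactly one Reducer, and how that Reducer's output file is named, can be illustrated with a small plain-Java sketch. It mirrors the hash-and-modulo formula of Hadoop's default HashPartitioner but does not call the Hadoop API; the class and method names are hypothetical, and because it hashes a Java `String` rather than a Hadoop `Text` key, the actual partition numbers in a real job may differ.

```java
// Plain-Java sketch of key-to-reducer assignment and output file naming
// (no Hadoop dependency; names are illustrative).
class PartitionSketch {

    // The same key and reducer count always yield the same partition index,
    // which is why every value for a key reaches the same Reducer.
    static int partitionFor(String key, int numReduceTasks) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }

    // File name written by the Reducer handling the given partition,
    // e.g. part-r-00001 for base name "part" and partition 1.
    static String outputFileName(String baseName, int partition) {
        return String.format("%s-r-%05d", baseName, partition);
    }

    public static void main(String[] args) {
        int numReduceTasks = 2;
        for (String dept : new String[] {"CSE", "ECE", "MECH"}) {
            int partition = partitionFor(dept, numReduceTasks);
            System.out.println(dept + " -> " + outputFileName("part", partition));
        }
    }
}
```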

Phases of Reducer

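**Shuffle:** Copies each Mapper's relevant output partition to the appropriate Reducer over HTTP.
**Sort:** Merges the fetched data by key so that values belonging to the same key are grouped.
**Reduce:** Performs the actual computation (sum, average, filter, etc.).

**Note:** Hadoop does not re-sort what the Reducer emits. Records are written in the order the Reducer produces them, and by default there is no global ordering across the output files of different Reducers.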

Setting Number of Reducers in MapReduce

Hadoop allows users to configure the number of Reducers:

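Using Command Line (the driver must parse generic options, for example through ToolRunner; mapred.reduce.tasks is the older, deprecated name of this property):

-D mapreduce.job.reduces=<number_of_reducers>

Using Job (or JobConf in the older API) in Driver Code:

job.setNumReduceTasks(2);

If set to 0, only the Map phase is executed and the Mapper output is written directly to the output directory (useful for Map-only jobs).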

Best Practices for Setting Reducer Count

The number of Reducers significantly affects performance and resource utilization. Ideally, it should be tuned based on cluster size and workload:

**Recommended formula:**

NumReducers ≈ (0.95 or 1.75) × (Number of Nodes × Max Containers per Node)

**0.95 factor:** Creates slightly fewer Reducers than available containers, so all Reducers can launch immediately and run in a single wave.
**1.75 factor:** Creates more Reducers than available containers, so faster nodes finish their first round and start a second wave. This improves load balancing, although not all Reducers run at the same time.
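
As a worked example of the formula, the following plain-Java sketch computes both estimates. The cluster size (10 nodes, 8 containers per node) and the class name are assumptions chosen for illustration; treat the result as a starting point for tuning, not a fixed rule.

```java
// Plain-Java sketch of the reducer-count formula (no Hadoop dependency).
class ReducerCountEstimate {

    // factor is 0.95 (single wave) or 1.75 (two waves, better load balancing).
    static int estimate(int nodes, int maxContainersPerNode, double factor) {
        return (int) Math.round(factor * nodes * maxContainersPerNode);
    }

    public static void main(String[] args) {
        int nodes = 10;               // hypothetical cluster size
        int maxContainersPerNode = 8; // hypothetical per-node capacity

        // 10 x 8 = 80 containers in total.
        System.out.println(estimate(nodes, maxContainersPerNode, 0.95)); // prints 76
        System.out.println(estimate(nodes, maxContainersPerNode, 1.75)); // prints 140
    }
}
```

The chosen value could then be passed to job.setNumReduceTasks(...) in the driver.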

