Introduction to Hadoop (original) (raw)

Last Updated : 24 Jun, 2025

Hadoop is an open-source software framework that is used for storing and processing large amounts of data in a distributed computing environment. It is designed to handle big data and is based on the MapReduce programming model, which allows for the parallel processing of large datasets. Its framework is based on Java programming with some native code in C and shell scripts.

Hadoop is designed to process large volumes of data (Big Data) across many machines without relying on a single machine. It is built to be scalable, fault-tolerant and cost-effective. Instead of relying on expensive high-end hardware, Hadoop works by connecting many inexpensive computers (called nodes) in a cluster.

Hadoop Architecture

Hadoop has two main components:

Apart from the above-mentioned two core components, Hadoop framework also includes the following two modules **Hadoop Common which are Java libraries and utilities required by other Hadoop modules and **Hadoop YARN which is a framework for job scheduling and cluster resource management.

Hadoop

Hadoop Architecture

Hadoop Distributed File System (HDFS)

HDFS is the storage layer of Hadoop. It breaks large files into smaller blocks (usually 128 MB or 256 MB) and stores them across multiple DataNodes. Each block is replicated (usually 3 times) to ensure fault tolerance so even if a node fails, the data remains available.

**Key features of HDFS:

MapReduce

MapReduce is the computation layer in Hadoop. It works in two main phases:

  1. **Map Phase: Input data is divided into chunks and processed in parallel. Each mapper processes a chunk and produces key-value pairs.
  2. **Reduce Phase: These key-value pairs are then grouped and combined to generate final results.

This model is simple yet powerful, enabling massive parallelism and efficiency.

How Does Hadoop Work?

Here’s a overview of how Hadoop operates:

  1. Data is loaded into HDFS, where it's split into blocks and distributed across DataNodes.
  2. MapReduce jobs are submitted to the ResourceManager.
  3. The job is divided into map tasks, each working on a block of data.
  4. Map tasks produce intermediate results, which are shuffled and sorted.
  5. Reduce tasks aggregate results and generate final output.
  6. The results are stored back in HDFS or passed to other applications.

Thanks to this distributed architecture, Hadoop can process petabytes of data efficiently.

**Advantages and Disadvantages of Hadoop

**Advantages:

**Disadvantages:

Applications

Hadoop is used across a variety of industries:

  1. **Banking: Fraud detection, risk modeling.
  2. **Retail: Customer behavior analysis, inventory management.
  3. **Healthcare: Disease prediction, patient record analysis.
  4. **Telecom: Network performance monitoring.
  5. **Social Media: Trend analysis, user recommendation engines.