Introduction to Hadoop Distributed File System(HDFS) (original) (raw)

Last Updated : 11 Jul, 2025

With growing data velocity the data size easily outgrows the storage limit of a machine. A solution would be to store the data across a network of machines. Such filesystems are called **distributed filesystems. Since data is stored across a network all the complications of a network come in.
This is where Hadoop comes in. It provides one of the most reliable filesystems. **HDFS (Hadoop Distributed File System) is a unique design that provides storage for _extremely large files with streaming data access pattern, and it runs on _commodity hardware. Let's elaborate on the terms:

Nodes: Master-slave nodes typically form the **HDFS cluster.

  1. **NameNode(MasterNode):
    • Manages all the slave nodes and assigns work to them.
    • It executes filesystem namespace operations like opening, closing, and renaming files and directories.
    • It should be deployed on reliable hardware that has a high configuration. not on commodity hardware.
  2. **DataNode(SlaveNode):
    • Actual worker nodes do the actual work like reading, writing, processing, etc.
    • They also perform creation, deletion, and replication upon instruction from the master.
    • They can be deployed on commodity hardware.

**HDFS daemons: **Daemons are the processes running in the background.

**Data storage in HDFS: Now let's see how the data is stored in a distributed manner.

Lets assume that 100TB file is inserted, then masternode(namenode) will first _divide the file into blocks of 10TB (default size is _128 MB in Hadoop 2.x and above). Then these blocks are stored across different datanodes(slavenode). Datanodes(slavenode) _replicate the blocks among themselves and the information of what blocks they contain is sent to the master. Default replication factor is _3 means for each block 3 replicas are created (including itself). In hdfs.site.xml we can increase or decrease the replication factor i.e we can edit its configuration here.

**Note: MasterNode has the record of everything, it knows the location and info of each and every single data nodes and the blocks they contain, i.e. nothing is done without the permission of masternode.

**Why divide the file into blocks?

_Answer: Let's assume that we don't divide, now it's very difficult to store a 100 TB file on a single machine. Even if we store, then each read and write operation on that _whole file is going to take very high seek time. But if we have multiple blocks of size 128MB then its become easy to perform various read and write operations on it compared to doing it on a whole file at once. So we divide the file to have faster data access i.e. reduce seek time.

**Why replicate the blocks in data nodes while storing?

_Answer: Let's assume we don't replicate and only one yellow block is present on datanode D1. Now if the data node D1 crashes we will lose the block and which will make the overall data inconsistent and faulty. So we replicate the blocks to achieve _fault-tolerance.

**Terms related to HDFS:

**Note: No two replicas of the same block are present on the same datanode.

**Features:

**Limitations: Though HDFS provide many features there are some areas where it doesn't work well.