
Apache Hadoop — Background


My name is Pulkit, and I am a seasoned Data Engineer. Along with my expertise in Spark and Hadoop applications, I am deeply fond of the AWS Cloud. I love learning new technology and broadening my horizons every single day.

Technologies come and go, but data is here to grow!

Introduction

Data is the core of every modern application, driving insights, automation, and user experiences. Over time, technologies have evolved to collect, process, and analyze this data more efficiently.

Here’s a quick historical view of that evolution:

  • Excel

  • Oracle SQL, Microsoft SQL Server

  • Hadoop

  • Spark

Each stage emerged to overcome limitations of its predecessors — particularly in scale, performance, and flexibility.


The Big Data Problem

With the rise of the internet, smartphones, and IoT, data exploded in both volume and variety.

We began dealing with massive datasets in formats like JSON, XML, DOC, MP4, and Avro — data streaming in at high velocity from diverse sources.

Traditional RDBMS systems struggled because they weren’t designed for:

  • Volume – petabytes and beyond

  • Variety – structured + unstructured

  • Velocity – real-time and streaming workloads

These three challenges together form what we call the 3 Vs of Big Data.

Despite decades of improvements, traditional databases were not built to handle these scale problems efficiently — leading to the rise of distributed computing.


Distributed Computing

To solve challenges around scalability, availability, and cost, distributed systems like Apache Hadoop and Apache Spark emerged.

They distribute data across clusters of machines and process it in parallel, allowing:

  • High throughput

  • Fault tolerance

  • Linear scalability

Key Features of Distributed Systems

  1. Scalability — Handle growing data volumes easily.

  2. High Availability — Tolerate node failures without losing data.

  3. Cost Efficiency — Scale horizontally using commodity hardware.

Among these, Apache Spark later gained traction due to its in-memory computation and ease of use — but Hadoop laid the foundation.


Basic Background: Hadoop

At its core, Hadoop consists of three major layers:

YARN        - Cluster resource manager
HDFS        - Distributed storage layer
MapReduce   - Computation framework

[Figure: Hadoop architecture]
  • YARN acts as the operating system of the cluster, managing resources and scheduling jobs.

  • HDFS stores data in blocks across nodes, ensuring replication and fault tolerance.

  • MapReduce provides a computation framework for processing data in parallel using Java-based programs.

You can think of Hadoop as a distributed operating system for big data — simplifying how you work with clusters of machines.

Modern engines like Spark now build on this foundation while abstracting away much of the complexity.
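The map-then-reduce model that MapReduce popularized can be sketched in a few lines of plain Python. This is only a conceptual miniature (real Hadoop MapReduce jobs are written in Java and run distributed across a cluster), but the three phases — map, shuffle, reduce — are the same idea:

```python
from collections import defaultdict

# Conceptual sketch of the MapReduce model in plain Python.
# Real Hadoop MapReduce jobs are Java programs run across a cluster;
# the phases below (map -> shuffle -> reduce) show the same idea in miniature.

def map_phase(lines):
    # Map: emit a (word, 1) pair for every word in every input line.
    for line in lines:
        for word in line.split():
            yield word, 1

def shuffle(pairs):
    # Shuffle: group all values by key, as the framework does between phases.
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(grouped):
    # Reduce: sum the counts for each word.
    return {word: sum(counts) for word, counts in grouped.items()}

lines = ["big data big cluster", "data data cluster"]
counts = reduce_phase(shuffle(map_phase(lines)))
print(counts)  # {'big': 2, 'data': 3, 'cluster': 2}
```

In a real cluster, many mappers and reducers run in parallel on different nodes, and the shuffle moves data between them over the network — which is exactly the step Spark later optimized with in-memory computation.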


Hadoop Ecosystem

While Hadoop originally consisted of HDFS, YARN, and MapReduce, a broader ecosystem has evolved around it.

Component   Purpose
---------   -------
Hive        SQL-like data warehousing on Hadoop
HBase       NoSQL database for large datasets
Sqoop       Transfer data between RDBMS and Hadoop
Oozie       Workflow scheduler for Hadoop jobs

These tools extend Hadoop’s capabilities, turning it into a complete big data platform.


Hadoop Architecture

To understand Hadoop deeply, you need to know its three main subsystems:

  1. HDFS (Storage Layer)

  2. YARN (Resource Management Layer)

  3. MapReduce (Computation Layer)

Hadoop runs on clusters of commodity hardware. Each component runs as a Java daemon process — a long-running background service.

Let’s break this down starting with HDFS.


HDFS — Hadoop Distributed File System

HDFS was inspired by Google’s GFS (Google File System).
It allows reliable storage and access to very large datasets using a master-slave architecture.

[Figure: HDFS architecture]

Key Components

1. NameNode (Master)

  • Runs on the master node.

  • Stores metadata about files — directories, block locations, and replication info.

  • Think of it as the librarian that knows where every book (data block) is stored.

2. DataNode (Worker)

  • Stores the actual data blocks.

  • Each file is split into blocks (default size: 128 MB), which are replicated across nodes (default replication factor: 3) for reliability.


Example — Metadata in NameNode

When you store a 1 GB file, it is split into 8 blocks (at the default 128 MB block size). The NameNode keeps metadata like:

FILE: example.txt
DIRECTORY: /data
SIZE: 1GB
BLOCKS:
  - BLOCK1: ID=blk_001, LOCATION=datanode1:/disk1/data/block1
  - BLOCK2: ID=blk_002, LOCATION=datanode2:/disk1/data/block2
  ...
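The block count in that example is simple arithmetic; a quick back-of-the-envelope check (the variable names here are illustrative, not a real HDFS API):

```python
import math

# Back-of-the-envelope check of the example above: a 1 GB file
# stored with HDFS's default 128 MB block size.
BLOCK_SIZE_MB = 128
file_size_mb = 1024  # 1 GB

# Number of blocks: round up, since a final partial block still needs a block.
num_blocks = math.ceil(file_size_mb / BLOCK_SIZE_MB)
last_block_mb = file_size_mb - (num_blocks - 1) * BLOCK_SIZE_MB

print(num_blocks)     # 8
print(last_block_mb)  # 128 (1 GB divides evenly into full blocks)
```

Note that a file smaller than the block size does not waste a full block on disk — HDFS blocks only occupy as much physical space as the data they hold.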

You can view file metadata using:

hdfs fsck <file-path> -files -blocks -locations

How Reading from HDFS Works

  1. Client sends read request → NameNode

  2. NameNode returns block locations

  3. Client fetches blocks directly from DataNodes

  4. Client reassembles blocks into the full file
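The four read steps above can be sketched as a toy simulation. All names here are illustrative (a real client talks to the NameNode and DataNodes over RPC via the Java or libhdfs client libraries, not Python dictionaries), but the division of labor is the same:

```python
# Toy simulation of the HDFS read path (illustrative names only).

# NameNode metadata: file path -> ordered list of (block_id, datanode) entries.
namenode = {
    "/data/example.txt": [("blk_001", "datanode1"), ("blk_002", "datanode2")],
}

# DataNodes hold the actual block contents.
datanodes = {
    "datanode1": {"blk_001": b"hello "},
    "datanode2": {"blk_002": b"world"},
}

def read_file(path):
    # Steps 1-2: ask the NameNode for the block locations.
    locations = namenode[path]
    # Step 3: fetch each block directly from the DataNode that stores it.
    blocks = [datanodes[node][block_id] for block_id, node in locations]
    # Step 4: reassemble the blocks, in order, into the full file.
    return b"".join(blocks)

print(read_file("/data/example.txt"))  # b'hello world'
```

The key design point survives the simplification: the NameNode serves only metadata, while the (much larger) block data flows directly between client and DataNodes, keeping the master off the data path.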


How Writing to HDFS Works

  1. Client sends write request → NameNode

  2. NameNode responds with target DataNodes for each block

  3. Client splits file into blocks and writes sequentially

  4. DataNodes send acknowledgment back to NameNode

  5. NameNode updates metadata once all blocks are written
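The write path can be sketched the same way. This is a deliberately simplified model (illustrative names only; it ignores replication pipelines and acknowledgment details): the client splits the file into fixed-size blocks, writes each block to a NameNode-assigned DataNode, and the metadata is committed once all blocks are written.

```python
# Toy sketch of the HDFS write path (illustrative names only).

BLOCK_SIZE = 4  # tiny block size so the example stays readable

namenode = {}                                   # file path -> [(block_id, node)]
datanodes = {"datanode1": {}, "datanode2": {}}  # node -> {block_id: bytes}

def write_file(path, data):
    nodes = list(datanodes)
    metadata = []
    # Split the file into blocks; write each to a NameNode-assigned DataNode.
    for i in range(0, len(data), BLOCK_SIZE):
        block_num = i // BLOCK_SIZE
        block_id = f"blk_{block_num:03d}"
        node = nodes[block_num % len(nodes)]  # round-robin placement
        datanodes[node][block_id] = data[i:i + BLOCK_SIZE]
        metadata.append((block_id, node))
    # NameNode records the metadata only after all blocks are written.
    namenode[path] = metadata

write_file("/data/example.txt", b"hello world!")
print(namenode["/data/example.txt"])
# [('blk_000', 'datanode1'), ('blk_001', 'datanode2'), ('blk_002', 'datanode1')]
```

In real HDFS, each block is also pushed through a replication pipeline of several DataNodes, and acknowledgments flow back up that pipeline before the next block is sent.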


Common HDFS Commands

Action               Command
------               -------
List files           hdfs dfs -ls <path>
Copy to HDFS         hdfs dfs -copyFromLocal <local> <hdfs>
Change permissions   hdfs dfs -chmod 777 <path>
Rename/move file     hdfs dfs -mv <src> <dest>
View contents        hdfs dfs -cat <file>
Check disk usage     hdfs dfs -du -s -h /user/test

Hadoop YARN

Once the data is stored in HDFS, it needs to be processed.
This is where YARN (Yet Another Resource Negotiator) comes in — it manages resources across the cluster.

Like HDFS, YARN also follows a master-slave architecture:

Component         Role
---------         ----
ResourceManager   Manages overall cluster resources
NodeManager       Runs on each worker node, reporting usage and launching containers

YARN decouples resource management from computation, enabling multiple frameworks (e.g., MapReduce, Spark, Hive) to run simultaneously on the same cluster.


Summary

Hadoop revolutionized big data processing by introducing:

  • Distributed storage (HDFS)

  • Cluster resource management (YARN)

  • Parallel computation (MapReduce)

It laid the foundation for modern data platforms and continues to influence technologies like Spark, Flink, and Databricks.