
Apache Hadoop — Background


My name is Pulkit, and I am a seasoned Data Engineer. Along with my expertise in Spark and Hadoop applications, I am deeply fond of the AWS Cloud. I love learning new technology and broadening my horizons every single day.

Technologies come and go, but data is here to grow!

Introduction

Data is the core of every modern application, driving insights, automation, and user experiences. Over time, technologies have evolved to collect, process, and analyze this data more efficiently.

Here’s a quick historical view of that evolution:

  • Excel

  • Oracle SQL, Microsoft SQL Server

  • Hadoop

  • Spark

Each stage emerged to overcome limitations of its predecessors — particularly in scale, performance, and flexibility.


The Big Data Problem

With the rise of the internet, smartphones, and IoT, data exploded in both volume and variety.

We began dealing with massive datasets in formats like JSON, XML, DOC, MP4, and Avro — data streaming in at high velocity from diverse sources.

Traditional RDBMS systems struggled because they weren’t designed for:

  • Volume – petabytes and beyond

  • Variety – structured + unstructured

  • Velocity – real-time and streaming workloads

These three challenges together form what we call the 3 Vs of Big Data.

Despite decades of improvements, traditional databases were not built to handle these scale problems efficiently — leading to the rise of distributed computing.


Distributed Computing

To solve challenges around scalability, availability, and cost, distributed systems like Apache Hadoop and Apache Spark emerged.

They distribute data across clusters of machines and process it in parallel, allowing:

  • High throughput

  • Fault tolerance

  • Linear scalability

Key Features of Distributed Systems

  1. Scalability — Handle growing data volumes easily.

  2. High Availability — Tolerate node failures without losing data.

  3. Cost Efficiency — Scale horizontally using commodity hardware.

Among these, Apache Spark later gained traction due to its in-memory computation and ease of use — but Hadoop laid the foundation.


Basic Background: Hadoop

At its core, Hadoop consists of three major layers:

YARN        - Cluster resource manager
HDFS        - Distributed storage layer
MapReduce   - Computation framework

[Figure: Hadoop architecture]
  • YARN acts as the operating system of the cluster, managing resources and scheduling jobs.

  • HDFS stores data in blocks across nodes, ensuring replication and fault tolerance.

  • MapReduce provides a computation framework for processing data in parallel using Java-based programs.

You can think of Hadoop as a distributed operating system for big data — simplifying how you work with clusters of machines.

Modern engines like Spark now build on this foundation while abstracting away much of the complexity.
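The map-then-reduce model that MapReduce popularized can be sketched in a few lines of plain Python. This is only a conceptual miniature (real Hadoop MapReduce jobs are written in Java and run distributed across a cluster), but the three phases — map, shuffle, reduce — are the same idea:

```python
from collections import defaultdict

# Conceptual sketch of the MapReduce model in plain Python.
# Real Hadoop MapReduce jobs are Java programs run across a cluster;
# the phases below (map -> shuffle -> reduce) show the same idea in miniature.

def map_phase(lines):
    # Map: emit a (word, 1) pair for every word in every input line.
    for line in lines:
        for word in line.split():
            yield word, 1

def shuffle(pairs):
    # Shuffle: group all values by key, as the framework does between phases.
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(grouped):
    # Reduce: sum the counts for each word.
    return {word: sum(counts) for word, counts in grouped.items()}

lines = ["big data big cluster", "data data cluster"]
counts = reduce_phase(shuffle(map_phase(lines)))
print(counts)  # {'big': 2, 'data': 3, 'cluster': 2}
```

In a real cluster, many mappers and reducers run in parallel on different nodes, and the shuffle moves data between them over the network — which is exactly the step Spark later optimized with in-memory computation.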


Hadoop Ecosystem

While Hadoop originally consisted of HDFS, YARN, and MapReduce, a broader ecosystem has evolved around it.

Component   Purpose
---------   -------
Hive        SQL-like data warehousing on Hadoop
HBase       NoSQL database for large datasets
Sqoop       Transfer data between RDBMS and Hadoop
Oozie       Workflow scheduler for Hadoop jobs

These tools extend Hadoop’s capabilities, turning it into a complete big data platform.


Hadoop Architecture

To understand Hadoop deeply, you need to know its three main subsystems:

  1. HDFS (Storage Layer)

  2. YARN (Resource Management Layer)

  3. MapReduce (Computation Layer)

Hadoop runs on clusters of commodity hardware. Each component runs as a Java daemon process — a long-running background service.

Let’s break this down starting with HDFS.


HDFS — Hadoop Distributed File System

HDFS was inspired by Google’s GFS (Google File System).
It allows reliable storage and access to very large datasets using a master-slave architecture.

[Figure: HDFS architecture]

Key Components

1. NameNode (Master)

  • Runs on the master node.

  • Stores metadata about files — directories, block locations, and replication info.

  • Think of it as the librarian that knows where every book (data block) is stored.

2. DataNode (Worker)

  • Stores the actual data blocks.

  • Each file is split into blocks (default size: 128 MB), which are replicated across nodes (default replication factor: 3) for reliability.


Example — Metadata in NameNode

When you store a 1 GB file, it is split into 8 blocks (at the default 128 MB block size). The NameNode keeps metadata like:

FILE: example.txt
DIRECTORY: /data
SIZE: 1GB
BLOCKS:
  - BLOCK1: ID=blk_001, LOCATION=datanode1:/disk1/data/block1
  - BLOCK2: ID=blk_002, LOCATION=datanode2:/disk1/data/block2
  ...
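The block count in that example is simple arithmetic; a quick back-of-the-envelope check (the variable names here are illustrative, not a real HDFS API):

```python
import math

# Back-of-the-envelope check of the example above: a 1 GB file
# stored with HDFS's default 128 MB block size.
BLOCK_SIZE_MB = 128
file_size_mb = 1024  # 1 GB

# Number of blocks: round up, since a final partial block still needs a block.
num_blocks = math.ceil(file_size_mb / BLOCK_SIZE_MB)
last_block_mb = file_size_mb - (num_blocks - 1) * BLOCK_SIZE_MB

print(num_blocks)     # 8
print(last_block_mb)  # 128 (1 GB divides evenly into full blocks)
```

Note that a file smaller than the block size does not waste a full block on disk — HDFS blocks only occupy as much physical space as the data they hold.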

You can view file metadata using:

hdfs fsck <file-path> -files -blocks -locations

How Reading from HDFS Works

  1. Client sends read request → NameNode

  2. NameNode returns block locations

  3. Client fetches blocks directly from DataNodes

  4. Client reassembles blocks into the full file
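The four read steps above can be sketched as a toy simulation. All names here are illustrative (a real client talks to the NameNode and DataNodes over RPC via the Java or libhdfs client libraries, not Python dictionaries), but the division of labor is the same:

```python
# Toy simulation of the HDFS read path (illustrative names only).

# NameNode metadata: file path -> ordered list of (block_id, datanode) entries.
namenode = {
    "/data/example.txt": [("blk_001", "datanode1"), ("blk_002", "datanode2")],
}

# DataNodes hold the actual block contents.
datanodes = {
    "datanode1": {"blk_001": b"hello "},
    "datanode2": {"blk_002": b"world"},
}

def read_file(path):
    # Steps 1-2: ask the NameNode for the block locations.
    locations = namenode[path]
    # Step 3: fetch each block directly from the DataNode that stores it.
    blocks = [datanodes[node][block_id] for block_id, node in locations]
    # Step 4: reassemble the blocks, in order, into the full file.
    return b"".join(blocks)

print(read_file("/data/example.txt"))  # b'hello world'
```

The key design point survives the simplification: the NameNode serves only metadata, while the (much larger) block data flows directly between client and DataNodes, keeping the master off the data path.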


How Writing to HDFS Works

  1. Client sends write request → NameNode

  2. NameNode responds with target DataNodes for each block

  3. Client splits file into blocks and writes sequentially

  4. DataNodes send acknowledgment back to NameNode

  5. NameNode updates metadata once all blocks are written
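The write path can be sketched the same way. This is a deliberately simplified model (illustrative names only; it ignores replication pipelines and acknowledgment details): the client splits the file into fixed-size blocks, writes each block to a NameNode-assigned DataNode, and the metadata is committed once all blocks are written.

```python
# Toy sketch of the HDFS write path (illustrative names only).

BLOCK_SIZE = 4  # tiny block size so the example stays readable

namenode = {}                                   # file path -> [(block_id, node)]
datanodes = {"datanode1": {}, "datanode2": {}}  # node -> {block_id: bytes}

def write_file(path, data):
    nodes = list(datanodes)
    metadata = []
    # Split the file into blocks; write each to a NameNode-assigned DataNode.
    for i in range(0, len(data), BLOCK_SIZE):
        block_num = i // BLOCK_SIZE
        block_id = f"blk_{block_num:03d}"
        node = nodes[block_num % len(nodes)]  # round-robin placement
        datanodes[node][block_id] = data[i:i + BLOCK_SIZE]
        metadata.append((block_id, node))
    # NameNode records the metadata only after all blocks are written.
    namenode[path] = metadata

write_file("/data/example.txt", b"hello world!")
print(namenode["/data/example.txt"])
# [('blk_000', 'datanode1'), ('blk_001', 'datanode2'), ('blk_002', 'datanode1')]
```

In real HDFS, each block is also pushed through a replication pipeline of several DataNodes, and acknowledgments flow back up that pipeline before the next block is sent.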


Common HDFS Commands

Action               Command
------               -------
List files           hdfs dfs -ls <path>
Copy to HDFS         hdfs dfs -copyFromLocal <local> <hdfs>
Change permissions   hdfs dfs -chmod 777 <path>
Rename/move file     hdfs dfs -mv <src> <dest>
View contents        hdfs dfs -cat <file>
Check disk usage     hdfs dfs -du -s -h /user/test

Hadoop YARN

Once the data is stored in HDFS, it needs to be processed.
This is where YARN (Yet Another Resource Negotiator) comes in — it manages resources across the cluster.

Like HDFS, YARN also follows a master-slave architecture:

Component         Role
---------         ----
ResourceManager   Manages overall cluster resources
NodeManager       Runs on each worker node, reporting usage and launching containers

YARN decouples resource management from computation, enabling multiple frameworks (e.g., MapReduce, Spark, Hive) to run simultaneously on the same cluster.


Summary

Hadoop revolutionized big data processing by introducing:

  • Distributed storage (HDFS)

  • Cluster resource management (YARN)

  • Parallel computation (MapReduce)

It laid the foundation for modern data platforms and continues to influence technologies like Spark, Flink, and Databricks.