Learn about Apache Hadoop, its architecture, and how it enables distributed storage and processing of massive datasets for big data analytics.

Apache Hadoop: A Distributed Storage and Processing Framework

Apache Hadoop is an open-source framework for storing, processing, and analyzing massive datasets. It's designed to handle data too large to be processed by a single machine, distributing the workload across a cluster of commodity hardware. This makes Hadoop very useful for big data analytics.

Hadoop Modules

  • HDFS (Hadoop Distributed File System): Splits files into large blocks and stores replicated copies of them across multiple nodes for fault tolerance and scalability. It's inspired by Google's GFS (Google File System).
  • YARN (Yet Another Resource Negotiator): Manages cluster resources and schedules jobs more efficiently than the original MapReduce job tracker.
  • MapReduce: A programming model for parallel data processing using key-value pairs. MapReduce processes data in two stages: map (transform data into key-value pairs) and reduce (combine the values associated with each key); a minimal Java example follows this list.
  • Hadoop Common: Provides core utilities and libraries used by other Hadoop modules.
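
To make the map and reduce stages concrete, here is a minimal word-count sketch using Hadoop's MapReduce Java API. The class names (WordCountMapper, WordCountReducer) are illustrative; a separate driver class would normally configure and submit the job.

    import java.io.IOException;
    import java.util.StringTokenizer;

    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;

    // Map stage: emit (word, 1) for every word in an input line.
    public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable offset, Text line, Context context)
                throws IOException, InterruptedException {
            StringTokenizer tokens = new StringTokenizer(line.toString());
            while (tokens.hasMoreTokens()) {
                word.set(tokens.nextToken());
                context.write(word, ONE);
            }
        }
    }

    // Reduce stage: sum all counts emitted for the same word.
    class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text word, Iterable<IntWritable> counts, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable count : counts) {
                sum += count.get();
            }
            context.write(word, new IntWritable(sum));
        }
    }

In practice these classes are packaged into a jar and registered in the driver via Job.setMapperClass and Job.setReducerClass before the job is submitted to the cluster.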

Hadoop Architecture

A Hadoop cluster typically consists of a single master node and multiple slave (worker) nodes. The master node coordinates the cluster, while the slave nodes store data and perform the computations. The master runs the NameNode and, in Hadoop 1, the JobTracker; each slave node runs a DataNode and, in Hadoop 1, a TaskTracker. A short client-side sketch after the following list shows how these roles interact.

  • NameNode: Manages the file system metadata (file locations, etc.).
  • DataNode: Stores data blocks.
  • JobTracker (in Hadoop 1): Schedules and manages MapReduce jobs (replaced by YARN in Hadoop 2).
  • TaskTracker (in Hadoop 1): Executes MapReduce tasks (replaced by NodeManagers in YARN).
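
As a rough illustration of the NameNode/DataNode split, the sketch below uses Hadoop's Java FileSystem API: the client asks the NameNode for a file's block locations, and the returned hosts are the DataNodes that actually store the blocks. The class name and the path /data/sample.txt are hypothetical.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.BlockLocation;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class BlockLocationLookup {
        public static void main(String[] args) throws Exception {
            // The client contacts the NameNode for metadata; file contents live on DataNodes.
            Configuration conf = new Configuration();
            FileSystem fs = FileSystem.get(conf);

            Path file = new Path("/data/sample.txt");  // hypothetical path
            FileStatus status = fs.getFileStatus(file);

            // Each BlockLocation names the DataNodes that hold one block of the file.
            BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());
            for (BlockLocation block : blocks) {
                System.out.println("Block at offset " + block.getOffset()
                        + " stored on: " + String.join(", ", block.getHosts()));
            }
            fs.close();
        }
    }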

Advantages of Hadoop

  • High Performance: Distributing data and processing across multiple machines speeds up operations.
  • Scalability: Easily scales by adding more nodes to the cluster.
  • Cost-Effectiveness: Uses commodity hardware, reducing infrastructure costs.
  • Fault Tolerance: Data replication ensures high availability even when hardware fails (see the sketch after this list).
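
Fault tolerance in HDFS comes from block replication, and the replication factor is configurable (HDFS defaults to 3 copies). The short sketch below, with a hypothetical file path, shows how a client could control replication through the Java API.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class ReplicationExample {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            // Replication factor for new files written by this client (HDFS defaults to 3).
            conf.set("dfs.replication", "3");
            FileSystem fs = FileSystem.get(conf);

            // Raise the replication factor of an existing file (hypothetical path) to 5,
            // so it survives more simultaneous node failures.
            fs.setReplication(new Path("/data/hot-dataset.csv"), (short) 5);
            fs.close();
        }
    }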

A Brief History of Hadoop

Hadoop's origins trace back to 2002, when Doug Cutting and Mike Cafarella began work on the Apache Nutch web-search project; Google's papers on the Google File System and MapReduce then shaped its design. Key milestones include:

Year   Event
2003   Google publishes the Google File System (GFS) paper.
2004   Google publishes the MapReduce paper.
2006   The Hadoop project starts; Hadoop 0.1.0 is released.
2008   Hadoop sets a record for sorting 1 TB of data.
2011   Hadoop 1.0 is released.
2017   Hadoop 3.0 is released.