
Understanding Hadoop Distributed File System (HDFS): Distributed Storage for Big Data

Learn how the Hadoop Distributed File System (HDFS) provides scalable and fault-tolerant storage for massive datasets across a cluster of commodity hardware. Explore its key features and understand its role in big data processing.




What is HDFS?

HDFS (Hadoop Distributed File System) is a distributed storage system designed to hold very large datasets on a cluster of commodity hardware and serve them to distributed processing frameworks such as MapReduce. It splits files across multiple machines and replicates each block for fault tolerance and high availability, which makes it cost-effective compared to specialized, expensive storage hardware.
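
To see what a given cluster is configured to use for replication and block size, the hdfs getconf utility (available in Hadoop 2.x and later) prints individual configuration values. This is a minimal sketch: the property names assume a current Hadoop release, and the defaults mentioned in the comments are the stock values, not guarantees about your installation.

# Print the replication factor the cluster is configured to use
# (the stock default is 3 copies of each block)
$ hdfs getconf -confKey dfs.replication

# Print the configured block size in bytes
# (the stock default is 134217728, i.e. 128 MB)
$ hdfs getconf -confKey dfs.blocksize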

When to Use HDFS

HDFS is well-suited for applications dealing with:

  • Very large files: Files in the hundreds of megabytes, gigabytes, or terabytes in size.
  • Streaming data access: Applications where the time to read the entire dataset is more important than the latency to access the first few bytes. HDFS follows a "write-once, read-many-times" pattern.
  • Commodity hardware: HDFS is designed to run efficiently on inexpensive, standard hardware.

When NOT to Use HDFS

HDFS is less suitable for applications requiring:

  • Low-latency data access: Applications needing immediate access to the first data records. HDFS prioritizes the efficient processing of the whole dataset.
  • Many small files: Storing a large number of small files can overwhelm the NameNode's memory, as it stores metadata for all files.
  • Frequent modifications: HDFS follows the write-once, read-many-times model; files cannot be updated in place, so workloads that frequently rewrite the same data are a poor fit.

Key HDFS Concepts

  • Blocks: The basic unit of storage in HDFS. The default block size is 128 MB (configurable). Files are split into blocks that are stored independently across DataNodes. Unlike a local file system, a file smaller than a block (e.g., 5 MB) occupies only 5 MB of disk space, not a full block. (The fsck sketch after this list shows how a file maps to blocks.)
  • NameNode: The master node in HDFS. It manages the file system's metadata (file names, permissions, block locations). This metadata is stored in memory for fast access. Multiple clients can access the file system concurrently.
  • DataNodes: Worker nodes that store and retrieve blocks of data as instructed by the NameNode. They periodically report their block status to the NameNode.
  • Secondary NameNode: A helper node that periodically merges the NameNode's edit log into a new checkpoint of the file system image. It is not a standby NameNode and cannot take over automatically, but its checkpoints shorten NameNode restart time and limit how much metadata could be lost if the NameNode fails.
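
These pieces are easy to observe on a running cluster. The sketch below assumes HDFS is already up and that /user/test/data.txt is a placeholder for a file that actually exists in HDFS: hdfs fsck reports how the file is split into blocks and which DataNodes hold each replica, and hdfs dfsadmin -report summarizes the DataNodes the NameNode currently knows about.

# Show how a file maps to blocks and which DataNodes store each replica
# (/user/test/data.txt is a placeholder path)
$ hdfs fsck /user/test/data.txt -files -blocks -locations

# Ask the NameNode for a summary of the DataNodes it knows about
$ hdfs dfsadmin -report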

Starting HDFS

Before using HDFS for the first time, format the NameNode and then start the HDFS daemons. (Formatting is a one-time setup step; re-running it on an existing cluster erases the file system's metadata.) The commands are given below:

Commands to Format and Start HDFS

# Format the NameNode (one-time step; on Hadoop 2.x and later the
# equivalent command is: hdfs namenode -format)
$ hadoop namenode -format

# Start HDFS
$ start-dfs.sh
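
Once start-dfs.sh returns, it is worth confirming that the daemons actually came up. One quick check, assuming a standard JDK installation on a single-node cluster, is the jps command, which lists running Java processes; the NameNode, DataNode, and SecondaryNameNode should appear in its output. The NameNode also serves a status web UI (on port 9870 in Hadoop 3.x, 50070 in older releases) that shows the same information in a browser.

# List running Java processes; NameNode, DataNode, and SecondaryNameNode
# should be among them on a single-node setup
$ jps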

Basic HDFS File Operations

Here are some common HDFS commands (a short end-to-end session follows the list):

  • hadoop fs -mkdir /user/test: Creates a directory in HDFS.
  • hadoop fs -copyFromLocal /usr/home/Desktop/data.txt /user/test: Copies a file from the local file system to HDFS.
  • hadoop fs -ls /user/test: Lists files and directories in HDFS.
  • hadoop fs -copyToLocal /user/test/data.txt /usr/bin/data_copy.txt: Copies a file from HDFS to the local file system.
  • hadoop fs -rm -r /user/test: Recursively deletes a directory and its contents in HDFS (the older -rmr form is deprecated).

(Note: Replace paths with your specific locations.)
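
Put together, these commands make a simple round trip from the local file system into HDFS and back. This is a sketch only: it assumes data.txt exists in the current local directory and that you have permission to create /user/test; adjust both paths to your environment.

# Create a working directory in HDFS
$ hadoop fs -mkdir -p /user/test

# Upload a local file, then confirm it arrived
$ hadoop fs -copyFromLocal data.txt /user/test
$ hadoop fs -ls /user/test

# Read it back, either to the terminal or into a local copy
$ hadoop fs -cat /user/test/data.txt
$ hadoop fs -copyToLocal /user/test/data.txt ./data_copy.txt

# Clean up when finished (recursive delete)
$ hadoop fs -rm -r /user/test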

The following table summarizes additional HDFS commands (example invocations of a few of them follow the table):

Command                 Description
put                     Copies data from the local filesystem to HDFS.
copyFromLocal           Same as put.
moveFromLocal           Copies data to HDFS and then removes the local copy.
get [-crc]              Copies data from HDFS to the local filesystem.
cat                     Displays the contents of a file.
moveToLocal             Copies data to the local filesystem and then removes the HDFS copy.
setrep [-R] [-w] rep    Sets the replication factor for files.
touchz                  Creates an empty file.
test -[ezd]             Checks file or directory properties (-e exists, -z zero length, -d directory).
stat [format]           Prints file information, optionally using a format string.
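
Some of these commands take options that are easy to get wrong, so a few example invocations are shown below. The paths are placeholders, and the -stat format specifiers (%n, %b, %r) follow the documentation for recent Hadoop releases; older versions may support a smaller set.

# Set the replication factor of a file to 2 and wait for re-replication to finish
$ hadoop fs -setrep -w 2 /user/test/data.txt

# Create an empty (zero-length) file
$ hadoop fs -touchz /user/test/empty.txt

# Test whether a path exists; an exit status of 0 means it does
$ hadoop fs -test -e /user/test/data.txt && echo "exists"

# Print the name, size in bytes, and replication factor of a file
$ hadoop fs -stat "%n %b %r" /user/test/data.txt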