Top Hadoop Interview Questions and Answers

What is Hadoop?

Hadoop is an open-source framework for storing and processing massive amounts of data (big data) across clusters of computers. It's designed to handle data that's too large to be processed by a single machine. It leverages distributed computing to parallelize tasks, improving performance and scalability significantly.

Hadoop System Requirements

Hadoop requires Java 1.6 or higher (preferably from Oracle/OpenJDK) and a supported operating system (Linux distributions are commonly used). The hardware requirements depend on the size of your data and the complexity of your jobs (more RAM and storage are needed for larger datasets).

Recommended Hardware for Hadoop

While Hadoop can run on various hardware, it generally performs best on servers with multiple processors or cores, sufficient RAM (typically 4-8 GB or more), and ECC (Error Correcting Code) memory. The specific requirements will vary depending on the scale of your data and the complexity of your tasks.

Common Hadoop Input Formats

  • TextInputFormat (default; each line is a record).
  • KeyValueTextInputFormat (each line is split into a key and a value at the first tab character).
  • SequenceFileInputFormat (for reading SequenceFiles).
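
For example, the input format is selected on the Job object; a minimal sketch using the new (mapreduce) API:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.KeyValueTextInputFormat;

public class InputFormatConfig {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "input-format-demo");
        // TextInputFormat is the default, so no call is needed for it;
        // switch to tab-delimited key/value records like this:
        job.setInputFormatClass(KeyValueTextInputFormat.class);
    }
}
```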

Categorizing Big Data: The Four V's

  • Volume: The sheer amount of data.
  • Velocity: The speed at which data is generated and processed.
  • Variety: The different types of data (structured, semi-structured, unstructured).
  • Veracity: The trustworthiness and accuracy of data.

Using Bootstrap Classes for Media Objects, Panels, and Button Groups

Bootstrap provides CSS classes to style UI elements. These classes are applied to HTML elements to control layout and appearance.

  • .media: For creating media objects (images with text).
  • .panel: For creating panels (boxed content).
  • .btn-group: For grouping buttons together.
  • List types: Ordered lists (`<ol>`), unordered lists (`<ul>`), definition lists (`<dl>`).

Checking Hadoop Daemon Status

Use the jps command to view the Java processes running on your Hadoop nodes (including Hadoop daemons).
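
For example, on a single-node cluster you might see output like this (process IDs are illustrative, and the daemon list depends on what is running):

```
$ jps
2451 NameNode
2639 DataNode
2895 ResourceManager
3041 NodeManager
3187 Jps
```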

InputSplits in Hadoop

When a Hadoop job runs, large input files are divided into smaller logical units called InputSplits. Each InputSplit is assigned to a mapper for processing. The size of an InputSplit is configurable.
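
The split size can be tuned per job; a minimal sketch using the new (mapreduce) API, where the 64 MB cap is just an illustration:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

public class SplitSizeConfig {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "split-size-demo");
        // Cap each InputSplit at 64 MB; smaller splits mean more map tasks.
        FileInputFormat.setMaxInputSplitSize(job, 64L * 1024 * 1024);
        // The equivalent configuration key is
        // mapreduce.input.fileinputformat.split.maxsize.
    }
}
```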

TextInputFormat

TextInputFormat treats each line in a text file as a separate record. The key is the line's byte offset within the file (a LongWritable), and the value is the line's content (a Text).
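
A mapper consuming TextInputFormat records therefore declares LongWritable/Text as its input types; a minimal pass-through sketch:

```java
import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// With TextInputFormat, the framework delivers (byte offset, line) pairs.
public class LineMapper extends Mapper<LongWritable, Text, LongWritable, Text> {
    @Override
    protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        // Pass each record through unchanged (illustrative only).
        context.write(offset, line);
    }
}
```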

SequenceFileInputFormat

SequenceFileInputFormat reads SequenceFiles, a binary file format for storing key-value pairs efficiently. It's used to pass data between MapReduce jobs.
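
A common pattern is chaining: the first job writes SequenceFile output and the second reads it back without re-parsing text. A sketch (the intermediate path is hypothetical, and job setup is abbreviated):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat;

public class ChainedJobs {
    public static void main(String[] args) throws Exception {
        Path intermediate = new Path("/tmp/stage1-out"); // hypothetical

        // Stage 1 writes binary key/value output...
        Job first = Job.getInstance(new Configuration(), "stage-1");
        first.setOutputFormatClass(SequenceFileOutputFormat.class);
        SequenceFileOutputFormat.setOutputPath(first, intermediate);

        // ...and stage 2 reads it back directly.
        Job second = Job.getInstance(new Configuration(), "stage-2");
        second.setInputFormatClass(SequenceFileInputFormat.class);
        SequenceFileInputFormat.addInputPath(second, intermediate);
    }
}
```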

Number of InputSplits

The number of InputSplits depends on the total input size and the HDFS block size: by default, splits align with block boundaries, so the number of splits roughly equals the number of blocks. For example, a 1 GB file with a 128 MB block size yields eight blocks and therefore about eight splits (and eight map tasks).

RecordReader in Hadoop

RecordReader is responsible for reading data from an InputSplit and converting it into key-value pairs for the mapper to process.

JobTracker in Hadoop (Deprecated)

In older Hadoop versions, the JobTracker was the central service that managed the execution of MapReduce jobs. It has been replaced by YARN (Yet Another Resource Negotiator) in newer Hadoop versions.

WebDAV in Hadoop

WebDAV (Web Distributed Authoring and Versioning) is a set of extensions to HTTP for creating, editing, and managing files on remote servers. Exposing HDFS (Hadoop Distributed File System) over WebDAV lets you manage HDFS files with standard tools, and on many operating systems a WebDAV share can even be mounted as an ordinary file system.

Sqoop

Sqoop is a tool for transferring data between Hadoop HDFS and relational databases (RDBMS).
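
For example (the connection string, credentials, table names, and paths are all illustrative):

```
# Import a relational table into HDFS:
sqoop import \
  --connect jdbc:mysql://db.example.com/sales \
  --username reporter -P \
  --table orders \
  --target-dir /user/hadoop/orders

# Export HDFS data back into a relational table:
sqoop export \
  --connect jdbc:mysql://db.example.com/sales \
  --username reporter -P \
  --table order_summaries \
  --export-dir /user/hadoop/summaries
```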

JobTracker Functionalities

  • Receives jobs from clients.
  • Determines data locations (using the NameNode).
  • Assigns tasks to TaskTracker nodes.
  • Monitors job progress.

TaskTracker in Hadoop (Deprecated)

In older Hadoop versions, TaskTrackers were the worker nodes that executed individual map and reduce tasks. YARN (Yet Another Resource Negotiator) now manages tasks.

MapReduce Jobs

MapReduce is a programming model for processing large datasets in parallel. It consists of two main phases: map and reduce.

Mapper and Reducer in Hadoop

  • Mapper: Reads input data and produces key-value pairs.
  • Reducer: Processes the mapper's output, aggregating and summarizing data.
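
The classic word-count program illustrates both roles; a minimal sketch using the new (mapreduce) API:

```java
import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Mapper: emit (word, 1) for every token in each input line.
public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        StringTokenizer tokens = new StringTokenizer(line.toString());
        while (tokens.hasMoreTokens()) {
            word.set(tokens.nextToken());
            context.write(word, ONE);
        }
    }
}

// Reducer: sum the counts gathered for each word during the shuffle.
class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text word, Iterable<IntWritable> counts, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable count : counts) {
            sum += count.get();
        }
        context.write(word, new IntWritable(sum));
    }
}
```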

Shuffling in MapReduce

Shuffling is the process of sorting and grouping the mapper's output before sending it to the reducers. This ensures that reducers receive data with the same key together.

NameNode in Hadoop

The NameNode is the master node in HDFS that stores metadata about files and directories and manages the file system namespace. It doesn't store the data itself.

Heartbeats in HDFS

Heartbeats are periodic messages (every 3 seconds by default) that DataNodes send to the NameNode to report that they are alive and functioning. If the NameNode stops receiving heartbeats from a DataNode for long enough (about ten minutes by default), it marks that node as dead and re-replicates its blocks onto other nodes.

Data Locality in Hadoop

Data locality means processing data on the same node where it is stored. It minimizes network traffic and improves job performance.

File System Check (FSCK) in HDFS

fsck is a command-line utility that checks the integrity of the HDFS file system, reporting problems such as missing, corrupt, or under-replicated blocks. By default it only reports inconsistencies; it does not repair them.
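
For example:

```
# Check the whole namespace, with per-file block and location detail:
hdfs fsck / -files -blocks -locations

# List any blocks that belong to corrupt files:
hdfs fsck / -list-corruptfileblocks
```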

NAS (Network Attached Storage) vs. DAS (Direct Attached Storage)

  • NAS: Storage is accessed over a network.
  • DAS: Storage is directly connected to a computer.

Hadoop and Other Data Processing Tools

Hadoop excels at scaling horizontally: adding commodity nodes lets it store and process datasets far too large for a single machine. For smaller datasets or low-latency, interactive workloads, a traditional RDBMS or a single-machine tool is often the simpler and faster choice.

Distributed Cache in Hadoop

The distributed cache allows you to cache files on worker nodes before a job starts, improving performance by reducing the need for repeated network access.
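
A minimal sketch using the new (mapreduce) API (the lookup-file path is hypothetical):

```java
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class CacheFileConfig {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "cache-demo");
        // Ship a lookup file to every worker node before tasks start.
        job.addCacheFile(new URI("/user/hadoop/lookup/countries.txt"));
        // Inside a Mapper or Reducer, context.getCacheFiles() returns these
        // URIs, and the files are available on the task's local disk.
    }
}
```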

Listing and Killing Hadoop Jobs

Use hadoop job -list to see running jobs and hadoop job -kill <job_id> to terminate one; on YARN clusters, the equivalents are yarn application -list and yarn application -kill <application_id>.
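
For example (the job and application IDs are placeholders):

```
# Classic MapReduce interface:
hadoop job -list
hadoop job -kill job_201901011200_0001

# YARN equivalents on modern clusters:
yarn application -list
yarn application -kill application_1546300800000_0001
```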

Job and Task in Hadoop

A Hadoop job is the complete unit of work a client submits: the MapReduce program together with its input, output, and configuration. The framework breaks the job into tasks: map tasks (one per InputSplit) and reduce tasks that process the grouped map output.

Input Split vs. HDFS Block

An InputSplit is a logical division of the input used to schedule map tasks; an HDFS block is a physical division of the stored data (128 MB by default in Hadoop 2.x). Splits usually align with block boundaries, but they need not: a record that straddles two blocks still belongs to a single split.

RDBMS vs. Hadoop

  • RDBMS: Relational database; uses SQL; optimized for transactional operations.
  • Hadoop: Distributed storage and processing framework; handles large datasets; optimized for analytical processing.

HDFS vs. NAS

  • HDFS: Data is distributed across the nodes of the cluster.
  • NAS: Data is stored on a dedicated storage device.

Debugging Hadoop Code

Use the web UIs (ResourceManager and JobHistory server), the per-task logs, and job counters to debug Hadoop jobs.
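
Custom counters are a lightweight way to track how often a condition occurs across all tasks; a minimal sketch (the group and counter names are arbitrary):

```java
import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class CountingMapper extends Mapper<LongWritable, Text, LongWritable, Text> {
    @Override
    protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        if (line.getLength() == 0) {
            // Counter values are aggregated across all tasks and shown
            // in the job's web UI and final job report.
            context.getCounter("Debug", "EmptyLines").increment(1);
            return;
        }
        context.write(offset, line);
    }
}
```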

Multiple Inputs in Hadoop

Yes. FileInputFormat accepts multiple input paths for a single job, and the MultipleInputs class goes further, letting each path use its own input format and mapper.
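
A minimal sketch (the input paths are hypothetical):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

public class MultiInputConfig {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "multi-input-demo");
        // Both directories feed the same mapper:
        FileInputFormat.addInputPath(job, new Path("/data/2023"));
        FileInputFormat.addInputPath(job, new Path("/data/2024"));
        // To give each path its own InputFormat and Mapper, use
        // org.apache.hadoop.mapreduce.lib.input.MultipleInputs instead.
    }
}
```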

TaskTracker and JobTracker Interaction in Hadoop (Deprecated)

In older Hadoop versions (before YARN), TaskTrackers would report task failures to the JobTracker. The JobTracker would then decide whether to retry the failed task on another node or mark it as failed. Newer Hadoop versions use YARN for task management.

Task Scheduling in Hadoop (Deprecated)

In older Hadoop versions, TaskTrackers sent heartbeat messages to the JobTracker, informing it about their available resources (slots). The JobTracker used this information to schedule tasks efficiently. This mechanism has been replaced by YARN in newer Hadoop versions.

Using Non-Java Code with Hadoop

Hadoop doesn't require Java for all tasks. Hadoop Streaming allows using scripts (e.g., shell scripts, Python scripts) as mapper or reducer functions. This enables integrating code from other languages into your Hadoop workflows.
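
For example, a streaming job with Python scripts as the mapper and reducer (the jar location varies by distribution, and all paths are illustrative):

```
hadoop jar $HADOOP_HOME/share/hadoop/tools/lib/hadoop-streaming-*.jar \
  -input /user/hadoop/input \
  -output /user/hadoop/output \
  -mapper mapper.py \
  -reducer reducer.py \
  -file mapper.py \
  -file reducer.py
```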

Data Storage Component Used by Hadoop

HBase is a NoSQL, column-oriented database commonly used with Hadoop for storing and managing large datasets. Other NoSQL and SQL databases can also be integrated with Hadoop.