
Setting up Hadoop: A Step-by-Step Installation Guide

Learn how to install and configure Hadoop with this comprehensive step-by-step guide. Covering prerequisites like Java and SSH, this tutorial provides clear instructions for setting up a Hadoop environment on your system.




Prerequisites

Before installing Hadoop, you need Java and SSH. Hadoop is designed to run on Unix-like systems such as Linux, although tools like Cygwin make it possible to run Hadoop on Windows. Java 1.8 or higher is required to run MapReduce programs, and SSH is used to start and manage the Hadoop daemons across the cluster.
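If you are unsure whether these prerequisites are already in place, a quick check from the terminal will tell you:

    java -version    # should report version 1.8 or higher
    ssh -V           # should report the installed SSH client version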

1. Java Installation (Linux)

  1. Check Java Version: Open your terminal and run `java -version` to see if Java is already installed.
  2. Download JDK: If Java isn't installed, download the appropriate JDK (Java Development Kit) from Oracle's website. You will need to accept the license agreement.
  3. Extract JDK: Extract the downloaded archive (e.g., `tar -xvf jdk-8u341-linux-x64.tar.gz`). This creates a directory such as `jdk1.8.0_341`.
  4. Move and Set Path: Move the extracted JDK directory to `/usr/lib` (you'll likely need `sudo`):

    sudo mv jdk1.8.0_341 /usr/lib/jdk1.8.0_341

    Add the following to your `.bashrc` file (or equivalent):

    export JAVA_HOME=/usr/lib/jdk1.8.0_341
    export PATH=$PATH:$JAVA_HOME/bin
  5. Verify Installation: Run `java -version` again to confirm (see the note below on applying the `.bashrc` change).
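    Note that changes to `.bashrc` only apply to new shells. To pick them up in the current session and confirm the setup, something along these lines works (the path assumes the JDK directory used above):

    source ~/.bashrc
    echo $JAVA_HOME    # should print /usr/lib/jdk1.8.0_341
    java -version      # should now report version 1.8.0_341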

2. SSH Installation and Key Setup

  1. Create Hadoop User: Create a dedicated Hadoop user (with a home directory) on all machines (master and slaves):

    sudo useradd -m hadoop
    sudo passwd hadoop
  2. Configure Hosts File: Edit `/etc/hosts` on each machine to add entries mapping IP addresses to hostnames (e.g., `masterNode`, `slaveNode1`, `slaveNode2`); an example is shown after this list.
  3. Generate and Copy SSH Keys: Log in as the Hadoop user on each machine and generate an SSH key pair, then copy the public key to the `authorized_keys` file on each machine:

    ssh-keygen -t rsa
    ssh-copy-id hadoop@masterNode
    ssh-copy-id hadoop@slaveNode1
    ssh-copy-id hadoop@slaveNode2
    chmod 0600 ~/.ssh/authorized_keys
            

    (Replace hostnames with your actual hostnames.)
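    For illustration, the `/etc/hosts` entries might look like the following; the IP addresses here are placeholders for your own network:

    192.168.1.10    masterNode
    192.168.1.11    slaveNode1
    192.168.1.12    slaveNode2

    Once the keys have been copied, `ssh hadoop@slaveNode1` from the master should log in without prompting for a password; if it still asks, re-check the permissions on `~/.ssh` and `authorized_keys`.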

3. Hadoop Installation (Linux)

  1. Download Hadoop: Download a Hadoop release from the Apache Hadoop website. It comes as a compressed archive (e.g., `hadoop-3.3.5.tar.gz`).
  2. Extract Hadoop: Extract the archive to a location (e.g., `/usr/lib/hadoop`):

    sudo mkdir /usr/lib/hadoop
    sudo tar -xzvf hadoop-3.3.5.tar.gz -C /usr/lib/hadoop
  3. Set Ownership: Change the ownership of the Hadoop directory to the Hadoop user:

    sudo chown -R hadoop:hadoop /usr/lib/hadoop
  4. Configure Hadoop (core-site.xml, hdfs-site.xml, mapred-site.xml, yarn-site.xml): Edit Hadoop's configuration files in the `etc/hadoop` subdirectory of your Hadoop installation. These files set properties such as the NameNode address, HDFS storage directories, and the replication factor. Full configuration details are beyond the scope of this guide, but a minimal starting point is sketched after this list.
  5. Copy Hadoop to Slave Nodes: Copy the Hadoop installation directory to the slave nodes using `scp`.
  6. Configure Worker Nodes: Edit the `workers` file (named `slaves` in Hadoop 2.x) in `$HADOOP_HOME/etc/hadoop` and list the hostnames of your slave nodes, one per line.
  7. Format NameNode: Format the NameNode (run this once, as the Hadoop user, on the master node):

    hdfs namenode -format
  8. Start Hadoop Daemons: Start all Hadoop daemons (NameNode, DataNodes, Secondary NameNode, ResourceManager, NodeManagers) from the master node using the commands below (a quick verification check appears at the end of this guide):

    start-dfs.sh
    start-yarn.sh
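
Returning to step 4, a minimal starting configuration for a small cluster might look like the following sketch. The hostname (`masterNode`), port, and storage paths are placeholders for illustration; adjust them to your environment. It also helps to set `JAVA_HOME` in `etc/hadoop/hadoop-env.sh` and to export `HADOOP_HOME` in the Hadoop user's `.bashrc` so that commands such as `hdfs` and `start-dfs.sh` are on the PATH:

    # ~/.bashrc for the hadoop user (path assumes the extraction location above)
    export HADOOP_HOME=/usr/lib/hadoop/hadoop-3.3.5
    export PATH=$PATH:$HADOOP_HOME/bin:$HADOOP_HOME/sbin

    <!-- etc/hadoop/core-site.xml: where clients find HDFS -->
    <configuration>
      <property>
        <name>fs.defaultFS</name>
        <value>hdfs://masterNode:9000</value>
      </property>
    </configuration>

    <!-- etc/hadoop/hdfs-site.xml: replication and local storage directories -->
    <configuration>
      <property>
        <name>dfs.replication</name>
        <value>2</value>
      </property>
      <property>
        <name>dfs.namenode.name.dir</name>
        <value>/home/hadoop/hdfs/namenode</value>
      </property>
      <property>
        <name>dfs.datanode.data.dir</name>
        <value>/home/hadoop/hdfs/datanode</value>
      </property>
    </configuration>

The `workers` file from step 6 then simply lists `slaveNode1` and `slaveNode2`, one hostname per line.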
            

(Consider using a pre-configured Hadoop distribution like Cloudera for a simplified installation.)
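
Whichever route you choose, once the daemons are running you can do a quick sanity check from the master node as the Hadoop user:

    jps                     # the master should list NameNode, SecondaryNameNode and ResourceManager
    hdfs dfsadmin -report   # should show each DataNode that has joined the cluster

On the slave nodes, `jps` should show a DataNode and a NodeManager process.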