Setting up Hadoop: A Step-by-Step Installation Guide
Learn how to install and configure Hadoop with this comprehensive step-by-step guide. Covering prerequisites like Java and SSH, this tutorial provides clear instructions for setting up a Hadoop environment on your system.
Prerequisites
Before installing Hadoop, you need Java and SSH. Hadoop is designed to run on Unix-like systems such as Linux, although it is possible to run it on Windows with tools like Cygwin. You'll need Java 1.8 or higher to run MapReduce programs.
1. Java Installation (Linux)
- Check Java Version: Open your terminal and run `java -version` to see if Java is already installed.
- Download JDK: If Java isn't installed, download the appropriate JDK (Java Development Kit) from Oracle's website. You will need to accept the license agreement.
- Extract JDK: Extract the downloaded archive (e.g., `tar -xvf jdk-8u341-linux-x64.tar.gz`).
- Move and Set Path: Move the extracted JDK directory to `/usr/lib` (you'll likely need `sudo`):
sudo mv jdk-8u341-linux-x64 /usr/lib/jdk-8u341-linux-x64
Then add the following to your `.bashrc` file (or equivalent):
export JAVA_HOME=/usr/lib/jdk-8u341-linux-x64
export PATH=$PATH:$JAVA_HOME/bin
- Verify Installation: Run `java -version` again to confirm.
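To make sure the new environment variables are picked up, a quick sanity check (assuming the JDK was moved to the path shown above) might look like this:
# Reload the shell configuration so the new variables take effect
source ~/.bashrc
# Confirm JAVA_HOME points at the JDK and the java binary is found
echo $JAVA_HOME
java -version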
2. SSH Installation and Key Setup
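If the SSH server isn't already present on your machines, installing it is usually a single package install. The exact package name and commands depend on your distribution; on Debian/Ubuntu-style systems it typically looks like this:
# Install the OpenSSH server (use yum/dnf on RHEL-based systems)
sudo apt-get update
sudo apt-get install openssh-server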
- Create Hadoop User: Create a Hadoop user on all machines (master and slaves):
sudo useradd hadoop
sudo passwd hadoop
- Configure Hosts File: Edit `/etc/hosts` on each machine to add entries mapping IP addresses to hostnames (e.g., `masterNode`, `slaveNode1`, `slaveNode2`); an example follows this list.
- Generate and Copy SSH Keys: Log in as the Hadoop user on each machine and generate an SSH key pair, then copy the public key to the `authorized_keys` file on each machine:
ssh-keygen -t rsa
ssh-copy-id hadoop@masterNode
ssh-copy-id hadoop@slaveNode1
ssh-copy-id hadoop@slaveNode2
chmod 0600 ~/.ssh/authorized_keys
(Replace hostnames with your actual hostnames.)
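As an illustration, the `/etc/hosts` entries and a passwordless-login check might look like the following. The IP addresses and hostnames here are placeholders; substitute your cluster's actual values:
# Append example host entries (placeholder IPs; adjust to your network)
echo "192.168.1.10 masterNode" | sudo tee -a /etc/hosts
echo "192.168.1.11 slaveNode1" | sudo tee -a /etc/hosts
echo "192.168.1.12 slaveNode2" | sudo tee -a /etc/hosts
# After the keys are copied, confirm you can log in without a password
ssh hadoop@slaveNode1 hostname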
3. Hadoop Installation (Linux)
- Download Hadoop: Download Hadoop from the Apache Hadoop website. The file will be a compressed archive (e.g., a `.tar.gz` file).
- Extract Hadoop: Extract the archive to a location (e.g., `/usr/lib/hadoop`):
sudo mkdir /usr/lib/hadoop
sudo tar -xzvf hadoop-3.3.5.tar.gz -C /usr/lib/hadoop
- Set Ownership: Change the ownership of the Hadoop directory to the Hadoop user:
sudo chown -R hadoop:hadoop /usr/lib/hadoop
- Configure Hadoop (core-site.xml, hdfs-site.xml, mapred-site.xml): Edit Hadoop's configuration files located in the `etc/hadoop` subdirectory of your Hadoop installation. These files set properties such as the namenode address, data storage paths, and the replication factor. Full configuration details are beyond the scope of this simple guide, but a minimal example follows this list.
- Copy Hadoop to Slave Nodes: Copy the Hadoop installation directory to the slave nodes using `scp` (see the example after this list).
- Configure Master and Slaves: Edit the `masters` and `slaves` files in `$HADOOP_HOME/etc/hadoop` to list your cluster nodes (in Hadoop 3.x the worker list lives in the `workers` file).
- Format NameNode: Format the namenode (run this once, on the master, as the Hadoop user; it erases any existing HDFS metadata):
hdfs namenode -format
- Start Hadoop Daemons: Start all Hadoop daemons (namenode, datanodes, secondary namenode, resource manager, node managers) using:
start-dfs.sh
start-yarn.sh
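As a rough sketch of the configuration step: the paths, hostname, and port below are placeholders (they assume the archive extracted to `/usr/lib/hadoop/hadoop-3.3.5` and a master named `masterNode`), and a real cluster will need more properties than this. The first two lines belong in the Hadoop user's `.bashrc`; the `cat` commands write minimal config files:
# Point HADOOP_HOME at the extracted directory and put the Hadoop scripts on the PATH
export HADOOP_HOME=/usr/lib/hadoop/hadoop-3.3.5
export PATH=$PATH:$HADOOP_HOME/bin:$HADOOP_HOME/sbin

# Minimal core-site.xml: tell Hadoop where the namenode lives (overwrites the stock file)
cat > $HADOOP_HOME/etc/hadoop/core-site.xml <<'EOF'
<?xml version="1.0"?>
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://masterNode:9000</value>
  </property>
</configuration>
EOF

# Minimal hdfs-site.xml: replication factor and local storage paths (placeholder directories)
cat > $HADOOP_HOME/etc/hadoop/hdfs-site.xml <<'EOF'
<?xml version="1.0"?>
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>2</value>
  </property>
  <property>
    <name>dfs.namenode.name.dir</name>
    <value>/home/hadoop/hdfs/namenode</value>
  </property>
  <property>
    <name>dfs.datanode.data.dir</name>
    <value>/home/hadoop/hdfs/datanode</value>
  </property>
</configuration>
EOF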
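For copying the installation to the slaves and listing the worker nodes, a sketch (again assuming the paths and hostnames used above, and that the target directory on each slave exists and is writable by the hadoop user) could look like:
# Copy the Hadoop installation to each slave node
scp -r /usr/lib/hadoop/hadoop-3.3.5 hadoop@slaveNode1:/usr/lib/hadoop/
scp -r /usr/lib/hadoop/hadoop-3.3.5 hadoop@slaveNode2:/usr/lib/hadoop/

# List the worker (slave) hostnames, one per line (the workers file in Hadoop 3.x)
cat > $HADOOP_HOME/etc/hadoop/workers <<'EOF'
slaveNode1
slaveNode2
EOF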
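After starting the daemons, you can check that they are actually running with `jps` (shipped with the JDK). The exact set of processes depends on how your cluster is laid out, but roughly:
# On the master node
jps
# Typical output: NameNode, SecondaryNameNode, ResourceManager (plus Jps itself)

# On each slave node
jps
# Typical output: DataNode, NodeManager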
(Consider using a pre-configured Hadoop distribution like Cloudera for a simplified installation.)