Word Count Example using MapReduce in Hadoop: A Practical Tutorial (Legacy API)
Learn MapReduce fundamentals with a practical word count example using the legacy Hadoop MapReduce API (org.apache.hadoop.mapred). This tutorial provides a step-by-step guide, covering key concepts and setup instructions. For newer Hadoop versions, consider the newer org.apache.hadoop.mapreduce API.
This tutorial demonstrates a basic word count example using MapReduce in Hadoop. It illustrates the fundamental concepts of MapReduce: mapping input data to key-value pairs and then reducing those pairs to produce aggregate results. Before you begin, ensure Java and Hadoop are properly set up on your system, and that you are familiar with basic Java programming and MapReduce concepts. This example uses the older Hadoop MapReduce API (org.apache.hadoop.mapred); for newer Hadoop versions, consider the newer org.apache.hadoop.mapreduce API.
Prerequisites
- Java Development Kit (JDK) installed and configured correctly.
- Hadoop installed and running.
You can find instructions for installing Java and Hadoop at [link to Hadoop installation instructions] (replace with appropriate link).
Steps to Execute the MapReduce Word Count Example
- Create Input File: Create a text file (e.g., `data.txt`) containing the text you want to analyze (a sample is shown below).
- Create HDFS Directory: Create a directory in HDFS to store your input file:
hdfs dfs -mkdir /wordcount
- Upload Input File: Upload the input file to HDFS:
hdfs dfs -put /path/to/data.txt /wordcount
- Write the MapReduce Program: Create three Java files: `WC_Mapper.java`, `WC_Reducer.java`, and `WC_Runner.java` (code provided below).
- Compile and Create JAR: Compile your Java code and create a JAR file (e.g., `wordcountdemo.jar`).
- Run the MapReduce Job: Execute the JAR file using the Hadoop command given under Running the Word Count Job below.
- Retrieve Output: With the legacy API, the output is written to `/wordcount_output/part-00000`. View it using the command given under Viewing the Output below.
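The compile-and-package step can be sketched as follows. This is an illustrative sketch, not the only way to build the JAR: it assumes the three `.java` files are in the current directory and uses `hadoop classpath` to put the Hadoop libraries on the compile classpath; the `classes` directory name is arbitrary.

```shell
mkdir -p classes
javac -classpath "$(hadoop classpath)" -d classes WC_Mapper.java WC_Reducer.java WC_Runner.java
jar -cvf wordcountdemo.jar -C classes .
```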
Sample data.txt
This is a sample document for the word count example.
It contains several words, some repeated.
Running the Word Count Job
hadoop jar /path/to/wordcountdemo.jar com.javatpoint.WC_Runner /wordcount/data.txt /wordcount_output
Viewing the Output
hdfs dfs -cat /wordcount_output/part-00000
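For the sample `data.txt` above, the output should look roughly like the following (one tab-separated word/count pair per line, keys in sorted order). Note that `StringTokenizer` splits only on whitespace, so punctuation stays attached to words (e.g., `example.` and `words,`):

```
a	1
contains	1
count	1
document	1
example.	1
...
word	1
words,	1
```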
MapReduce Code (Java)
WC_Mapper.java
package com.javatpoint;

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapred.*;

public class WC_Mapper extends MapReduceBase implements Mapper<LongWritable, Text, Text, IntWritable> {
    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    public void map(LongWritable key, Text value, OutputCollector<Text, IntWritable> output,
                    Reporter reporter) throws IOException {
        String line = value.toString();
        StringTokenizer tokenizer = new StringTokenizer(line);
        while (tokenizer.hasMoreTokens()) {
            word.set(tokenizer.nextToken().toLowerCase()); // normalize case so "Word" and "word" count together
            output.collect(word, one); // emit (word, 1) for each token
        }
    }
}
WC_Reducer.java
package com.javatpoint;

import java.io.IOException;
import java.util.Iterator;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapred.*;

public class WC_Reducer extends MapReduceBase implements Reducer<Text, IntWritable, Text, IntWritable> {
    public void reduce(Text key, Iterator<IntWritable> values, OutputCollector<Text, IntWritable> output,
                       Reporter reporter) throws IOException {
        int sum = 0;
        while (values.hasNext()) {
            sum += values.next().get(); // add up the 1s emitted by the mapper for this word
        }
        output.collect(key, new IntWritable(sum));
    }
}
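To see what the mapper and reducer compute together, the whole pipeline can be simulated in a single process with plain Java (no Hadoop required). This is only an illustrative sketch: the class and method names are mine, the `TreeMap` stands in for the shuffle/sort phase that groups identical keys, and `merge` plays the role of the reducer's summation.

```java
import java.util.*;

// Local, single-process sketch of the MapReduce word count:
// "map" tokenizes and lowercases each line; the sorted map stands in
// for the shuffle/sort + reduce steps that sum counts per word.
public class LocalWordCount {
    static Map<String, Integer> wordCount(List<String> lines) {
        Map<String, Integer> counts = new TreeMap<>(); // sorted keys, like the job's output
        for (String line : lines) {
            StringTokenizer tokenizer = new StringTokenizer(line); // same tokenization as WC_Mapper
            while (tokenizer.hasMoreTokens()) {
                String word = tokenizer.nextToken().toLowerCase();
                counts.merge(word, 1, Integer::sum); // "reduce": sum the 1s per occurrence
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        Map<String, Integer> counts = wordCount(Arrays.asList(
            "This is a sample document for the word count example.",
            "It contains several words, some repeated."));
        counts.forEach((w, c) -> System.out.println(w + "\t" + c));
    }
}
```

Running this prints one tab-separated word/count pair per line, in the same shape as the job's HDFS output file.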
WC_Runner.java
package com.javatpoint;

import java.io.IOException;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapred.*;

public class WC_Runner {
    public static void main(String[] args) throws IOException {
        JobConf conf = new JobConf(WC_Runner.class);
        conf.setJobName("WordCount");
        conf.setOutputKeyClass(Text.class);
        conf.setOutputValueClass(IntWritable.class);
        conf.setMapperClass(WC_Mapper.class);
        conf.setCombinerClass(WC_Reducer.class); // combiner pre-aggregates counts on the map side
        conf.setReducerClass(WC_Reducer.class);
        FileInputFormat.setInputPaths(conf, new Path(args[0]));
        FileOutputFormat.setOutputPath(conf, new Path(args[1]));
        JobClient.runJob(conf);
    }
}