Apache Pig Example: Finding the Most Frequent Starting Letter
Learn how to use Apache Pig to analyze text data and find the most frequent starting letter. This practical example demonstrates key Pig concepts and provides a step-by-step guide for data processing and analysis.
Problem: Finding the Most Frequent Starting Letter
This example demonstrates how to use Apache Pig to find the most frequently occurring starting letter in a text file.
Solution: Step-by-Step Guide
We'll break down the solution into several steps:
- Load data: Load the data from the text file into a Pig relation named "lines". Each line of the file is stored as a chararray (string) field.
- Tokenize the data: Tokenize the text in "lines" to get individual words. Each word becomes a separate row.
- Extract the first letter: Extract the first letter of each word using the SUBSTRING function.
- Group by letter: Group the rows by letter so that each group holds every occurrence of one letter.
- Count occurrences: Count the number of rows in each group.
- Order by count: Sort the results in descending order of count.
- Limit to the top 1: Keep only the top result (the most frequent letter).
- Store the result: Save the result to HDFS.
Pig Script (Loading Data)
grunt> lines = LOAD '/user/Desktop/data.txt' AS (line: chararray);
Pig Script (Tokenization)
grunt> tokens = FOREACH lines GENERATE flatten(TOKENIZE(line)) AS token:chararray;
Pig Script (Extracting First Letter)
grunt> letters = FOREACH tokens GENERATE SUBSTRING(token, 0, 1) AS letter:chararray;
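SUBSTRING returns the first character exactly as it appears, so 'A' and 'a' are counted as different letters, and a token that starts with a digit or punctuation mark contributes that character. As an optional variation, the sketch below lowercases the first character and keeps only the letters a to z. The alias names lower_letters and alpha_letters are introduced here for illustration; if you use this variant, group alpha_letters instead of letters in the next step.

```pig
-- Optional variation (not part of the original steps):
-- lowercase the first character so that 'A' and 'a' fall into the same group.
lower_letters = FOREACH tokens GENERATE LOWER(SUBSTRING(token, 0, 1)) AS letter:chararray;

-- Keep only rows whose first character is a letter from a to z.
-- MATCHES tests the whole string against a Java regular expression.
alpha_letters = FILTER lower_letters BY letter MATCHES '[a-z]';
```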
Pig Script (Grouping Letters)
grunt> lettergrp = GROUP letters BY letter;
Pig Script (Counting Occurrences)
grunt> countletter = FOREACH lettergrp GENERATE group, COUNT(letters);
Pig Script (Ordering Results)
grunt> OrderCnt = ORDER countletter BY $1 DESC;
Pig Script (Limiting Results)
grunt> result = LIMIT OrderCnt 1;
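Before storing anything, it can help to inspect the relations from the Grunt shell. A minimal sketch, assuming the aliases defined in the steps above: DESCRIBE prints the schema of a relation, and DUMP runs the pipeline and prints the rows to the console.

```pig
-- Show the schema of the intermediate and final relations.
DESCRIBE tokens;
DESCRIBE countletter;

-- Execute the pipeline and print the single top row to the console.
DUMP result;
```

DUMP triggers a full execution of the pipeline, so on large inputs it is usually worth trying it on a small sample file first.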
Pig Script (Storing Results)
grunt> STORE result INTO 'home/sonoo/output';
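Run in order, these statements find the most frequent starting letter in the text file and write it, together with its count, to the output directory. The comparison is case-sensitive ('A' and 'a' are counted separately), and LIMIT 1 returns a single row even when several letters tie for the highest count.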
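The statements above can also be collected into a single script file and run in batch mode instead of being typed at the Grunt prompt. The following is a minimal sketch that reuses the paths from the steps above. The file name first_letter.pig and the field names letter and cnt are assumptions introduced here for readability.

```pig
-- first_letter.pig (hypothetical file name)

-- Load each line of the input file as a single chararray field.
lines = LOAD '/user/Desktop/data.txt' AS (line:chararray);

-- Split each line into words; FLATTEN yields one row per word.
tokens = FOREACH lines GENERATE FLATTEN(TOKENIZE(line)) AS token:chararray;

-- Keep only the first character of each word.
letters = FOREACH tokens GENERATE SUBSTRING(token, 0, 1) AS letter:chararray;

-- Group identical letters and count the rows in each group.
lettergrp = GROUP letters BY letter;
countletter = FOREACH lettergrp GENERATE group AS letter, COUNT(letters) AS cnt;

-- Sort by count, highest first, and keep the top row.
ordercnt = ORDER countletter BY cnt DESC;
result = LIMIT ordercnt 1;

-- Write the result to the output directory.
STORE result INTO 'home/sonoo/output';
```

The script can then be run with `pig first_letter.pig`, or with `pig -x local first_letter.pig` to test against the local file system. The default loader, PigStorage, splits each line on tab characters, so with this schema only the text before the first tab ends up in the line field. If the input may contain tabs, loading with `USING TextLoader()` keeps each line whole.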