TutorialsArena

Apache Pig Example: Finding the Most Frequent Starting Letter

Learn how to use Apache Pig to analyze text data and find the most frequent starting letter. This practical example demonstrates key Pig concepts and provides a step-by-step guide for data processing and analysis.




Problem: Finding the Most Frequent Starting Letter

This example demonstrates how to use Apache Pig to find the most frequently occurring starting letter in a text file.

Solution: Step-by-Step Guide

We'll break down the solution into several steps:

  1. Load data: Load the text file into a Pig relation named "lines". Each line of the file is stored as a single chararray (string) field. Pig Latin string literals, including file paths, must be written in single quotes:

    Pig Script (Loading Data)

    grunt> lines = LOAD '/user/Desktop/data.txt' AS (line:chararray);

  2. Tokenize the data: Split each line in "lines" into words with TOKENIZE, then FLATTEN the resulting bag so that each word becomes a separate row:

    Pig Script (Tokenization)

    grunt> tokens = FOREACH lines GENERATE FLATTEN(TOKENIZE(line)) AS token:chararray;

  3. Extract the first letter: Extract the first character of each word using the SUBSTRING function (start index 0, stop index 1):

    Pig Script (Extracting First Letter)

    grunt> letters = FOREACH tokens GENERATE SUBSTRING(token, 0, 1) AS letter:chararray;

  4. Group by letter: Group the rows by letter so that each group holds every occurrence of one letter:

    Pig Script (Grouping Letters)

    grunt> lettergrp = GROUP letters BY letter;

  5. Count occurrences: Count the rows in each group:

    Pig Script (Counting Occurrences)

    grunt> countletter = FOREACH lettergrp GENERATE group, COUNT(letters);

  6. Order by count: Sort the results in descending order by the count, which is the second field ($1):

    Pig Script (Ordering Results)

    grunt> OrderCnt = ORDER countletter BY $1 DESC;

  7. Limit to the top 1: Keep only the first row, which holds the most frequent letter:

    Pig Script (Limiting Results)

    grunt> result = LIMIT OrderCnt 1;

  8. Store the result: Save the result to HDFS:

    Pig Script (Storing Results)

    grunt> STORE result INTO 'home/sonoo/output';

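Run in sequence, these statements produce a single row containing the most frequent starting letter in the text file together with its count. Pig evaluates the statements lazily, so no job actually runs until the STORE statement is reached.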
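
For repeated runs, the statements above can be collected into a script file instead of being typed at the grunt prompt. The following is a minimal sketch, not a definitive version: the file name first_letter.pig, the field alias cnt, and the relation name ordercnt are assumptions introduced here for readability, and DUMP is used in place of STORE so the result is printed to the console. Paths are written in single quotes, as Pig Latin requires.

```pig
-- first_letter.pig (hypothetical file name): the tutorial's steps as one script.

-- Load each line of the input file as a single chararray field.
lines = LOAD '/user/Desktop/data.txt' AS (line:chararray);

-- TOKENIZE returns a bag of words per line; FLATTEN emits one row per word.
tokens = FOREACH lines GENERATE FLATTEN(TOKENIZE(line)) AS token:chararray;

-- Keep only the first character of each word.
letters = FOREACH tokens GENERATE SUBSTRING(token, 0, 1) AS letter:chararray;

-- Group identical letters, then count the rows in each group.
lettergrp = GROUP letters BY letter;
countletter = FOREACH lettergrp GENERATE group AS letter, COUNT(letters) AS cnt;

-- Sort by count, highest first, and keep the top row.
ordercnt = ORDER countletter BY cnt DESC;
result = LIMIT ordercnt 1;

-- Print the result; swap in the tutorial's STORE statement to write to HDFS instead.
DUMP result;
```

The script can be run with `pig first_letter.pig`, or with `pig -x local first_letter.pig` to read the input from the local file system rather than HDFS. As written, the comparison is case-sensitive, so "T" and "t" are counted as different letters; wrapping the SUBSTRING call in the built-in LOWER function would merge them.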