Spark Character Count Example (Scala)
Introduction
This tutorial demonstrates a simple character count using Apache Spark and Scala. We'll count the occurrences of each character in a text file. This example shows how Spark's distributed processing capabilities can be used for basic text analysis.
Steps
- Create Input File: Create a text file (e.g., `input.txt`) containing your sample text. For example:

  ```text
  This is a sample text file.
  It contains various characters.
  ```

- Upload to HDFS: Upload this file to your Hadoop Distributed File System (HDFS) (replace `/mydata` with your HDFS directory):

  ```bash
  hdfs dfs -mkdir /mydata
  hdfs dfs -put input.txt /mydata
  ```

- Start Spark Shell: Launch the Spark shell with the `spark-shell` command.
- Create an RDD: Load the file into a Resilient Distributed Dataset (RDD):

  ```scala
  val data = sc.textFile("/mydata/input.txt")
  ```

- Split into Characters: Split each line into individual characters:

  ```scala
  val splitData = data.flatMap(line => line.split(""))
  ```

- Map to (Character, 1): Assign a count of 1 to each character:

  ```scala
  val charCounts = splitData.map(c => (c, 1))
  ```

- Reduce by Key: Sum the counts for each character:

  ```scala
  val reducedCounts = charCounts.reduceByKey(_ + _)
  ```

- Display Results: Show the character counts:

  ```scala
  reducedCounts.collect.foreach(println)
  ```
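A note on the last step: `collect` pulls every (character, count) pair back to the driver, which is fine for a small sample file but can be expensive for large inputs. Two common alternatives from the standard RDD API are sketched below (the output directory `/mydata/charcounts` is just an illustrative path):

```scala
// Inspect only a handful of pairs instead of collecting everything:
reducedCounts.take(20).foreach(println)

// Or write the full result back to HDFS instead of printing it:
reducedCounts.saveAsTextFile("/mydata/charcounts")
```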
(Abridged example output. The exact pairs and counts depend on the input text, and the pairs are returned as (character, count) tuples in no particular order; note that spaces are counted like any other character.)
```text
(T,2)
(h,1)
(i,2)
(s,3)
( ,11)
(a,2)
(m,1)
(p,2)
(l,1)
(e,2)
(x,1)
(f,1)
(.,2)
```
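The shell session above keeps things minimal. If you would rather run the same pipeline as a self-contained application, here is a rough sketch; the object name `CharCount`, the `local[*]` master setting, and the file paths are assumptions to adapt for your setup. It also adds two optional refinements the shell version omits: dropping whitespace characters and sorting by descending count.

```scala
import org.apache.spark.sql.SparkSession

// A minimal standalone sketch of the same pipeline.
// Object name, master setting, and paths are placeholders.
object CharCount {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("CharCount")
      .master("local[*]") // drop this line when submitting to a cluster
      .getOrCreate()
    val sc = spark.sparkContext

    val counts = sc.textFile("/mydata/input.txt")
      .flatMap(line => line.split(""))                 // one-character strings
      .filter(ch => ch.trim.nonEmpty)                  // optional: ignore whitespace
      .map(ch => (ch, 1))                              // pair each character with 1
      .reduceByKey(_ + _)                              // sum the 1s per character
      .sortBy({ case (_, n) => n }, ascending = false) // most frequent first

    counts.collect().foreach(println)
    spark.stop()
  }
}
```

You would package this with sbt (or Maven) and launch it with `spark-submit`; the core transformations are identical to the shell steps above.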