Understanding RDD Transformations and Actions in Apache Spark
Learn about Resilient Distributed Datasets (RDDs) and how transformations and actions operate on them in Apache Spark. Explore the lazy evaluation model of transformations and how actions trigger computations on RDDs within a cluster.
RDD Transformations and Actions in Apache Spark
Resilient Distributed Datasets (RDDs) are the fundamental data abstraction in Apache Spark: immutable collections of elements partitioned across the nodes of a cluster so they can be processed in parallel. Operations on RDDs fall into two categories: transformations, which create new RDDs from existing ones, and actions, which trigger computation and return a result to the driver program (or write it to storage).
RDD Transformations
Transformations are lazy: Spark only records them as a lineage of operations and defers computation until an action is called. Each transformation creates a new RDD from an existing one.
| Transformation | Description |
|---|---|
| `map(func)` | Applies a function `func` to each element. |
| `filter(func)` | Selects elements for which `func` returns `true`. |
| `flatMap(func)` | Applies `func` to each element and flattens the result. |
| `mapPartitions(func)` | Applies `func` to each partition. |
| `mapPartitionsWithIndex(func)` | Similar to `mapPartitions`, but also provides the partition index. |
| `sample(withReplacement, fraction, seed)` | Creates a random sample of the RDD. |
| `union(otherRDD)` | Combines two RDDs. |
| `intersection(otherRDD)` | Returns the elements common to two RDDs. |
| `distinct([numPartitions])` | Returns only the unique elements. |
| `groupByKey([numPartitions])` | Groups elements by key. |
| `reduceByKey(func, [numPartitions])` | Combines the values for each key using a reduce function. |
| `aggregateByKey(zeroValue)(seqOp, combOp, [numPartitions])` | Aggregates the values for each key using combine functions and a zero value. |
| `sortByKey([ascending], [numPartitions])` | Sorts key-value pairs by key. |
| `join(otherRDD, [numPartitions])` | Performs a join with another key-value RDD. |
| `cogroup(otherRDD, [numPartitions])` | Groups the elements of two RDDs by key. |
| `cartesian(otherRDD)` | Creates the Cartesian product of two RDDs. |
| `pipe(command, [envVars])` | Pipes data through a shell command. |
| `coalesce(numPartitions)` | Reduces the number of partitions. |
| `repartition(numPartitions)` | Reshuffles the data into the specified number of partitions. |
| `repartitionAndSortWithinPartitions(partitioner)` | Repartitions the RDD and sorts records within each partition. |
RDD Actions
Actions trigger computation and return a result to the driver program (or write output to storage). Unlike transformations, they are not lazy: calling an action forces Spark to evaluate the RDD's lineage of transformations immediately.
| Action | Description |
|---|---|
| `reduce(func)` | Aggregates all elements using a reduce function `func`. |
| `collect()` | Returns all elements as an array to the driver. |
| `count()` | Returns the number of elements. |
| `first()` | Returns the first element. |
| `take(n)` | Returns the first `n` elements. |
| `takeSample(withReplacement, num, [seed])` | Returns a random sample of elements. |
| `takeOrdered(n, [ordering])` | Returns the first `n` elements in order. |
| `saveAsTextFile(path)` | Saves the RDD as a text file. |
| `saveAsSequenceFile(path)` | Saves the RDD as a Hadoop SequenceFile. |
| `saveAsObjectFile(path)` | Saves the RDD using Java serialization. |
| `countByKey()` | Counts the occurrences of each key (for key-value RDDs). |
| `foreach(func)` | Applies a function to each element (for side effects). |