
Understanding Apache Spark Architecture: RDDs and DAGs

Explore the master-slave architecture of Apache Spark, including driver programs, executors, and worker nodes. Learn about the core abstractions of Resilient Distributed Datasets (RDDs) and Directed Acyclic Graphs (DAGs), and how RDDs function as distributed data collections across a cluster.



Apache Spark Architecture

Introduction to Spark Architecture

Apache Spark uses a master-slave architecture, consisting of a single driver program (the master) and multiple executors (the slaves) running on worker nodes. Two core abstractions underpin Spark's architecture: Resilient Distributed Datasets (RDDs) and Directed Acyclic Graphs (DAGs).
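As a rough sketch of what the driver side looks like in practice, the driver is an ordinary program that creates a SparkContext; the application name and master URL below are illustrative assumptions, not part of this tutorial:

    import org.apache.spark.{SparkConf, SparkContext}

    object ArchitectureDemo {                          // hypothetical application name
      def main(args: Array[String]): Unit = {
        // The driver program starts here. The SparkContext negotiates resources
        // with the cluster manager and coordinates the executors.
        val conf = new SparkConf()
          .setAppName("ArchitectureDemo")
          .setMaster("local[*]")                       // assumption: run locally; on a cluster this would point at e.g. YARN
        val sc = new SparkContext(conf)

        // ... RDD transformations and actions go here ...

        sc.stop()                                      // release executors and shut down the application
      }
    }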

Key Abstractions in Spark

1. Resilient Distributed Datasets (RDDs)

RDDs are Spark's fundamental data abstraction: collections of data elements partitioned across the worker nodes of the cluster. The key characteristics of an RDD (illustrated in the sketch after this list) are:

  • Resilient: RDDs can recover from node failures. If a partition is lost, Spark recomputes it by replaying the lineage of transformations that produced it, rather than relying on data replication.
  • Distributed: Data is spread across multiple nodes for parallel processing.
  • Immutable: Once created, an RDD cannot be modified. New RDDs are created from transformations.
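A minimal sketch of these three properties, assuming a SparkContext named sc is already available (for example, from the driver sketch above):

    // Distributed: the collection is split into 8 partitions spread across executors.
    val numbers = sc.parallelize(1 to 1000, 8)

    // Immutable: map does not change `numbers`; it returns a brand-new RDD.
    val doubled = numbers.map(_ * 2)

    // Resilient: if a partition of `doubled` is lost, Spark recomputes it from
    // `numbers` using the recorded lineage instead of reading a replica.
    println(doubled.count())                           // an action triggers the actual computation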

2. Directed Acyclic Graphs (DAGs)

Spark represents a sequence of operations (transformations) on RDDs as a DAG. A DAG is a graph where:

  • Nodes (vertices) represent the RDDs produced at each step.
  • Edges represent the transformations that turn one RDD into the next.
  • There are no cycles in the graph (hence "acyclic").

The DAG scheduler in Spark splits this graph into stages of tasks at shuffle boundaries, pipelining narrow transformations within a stage so that work runs in parallel with as little data movement as possible.
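As a hedged illustration (reusing the hypothetical sc from earlier), each transformation below only adds to the DAG; nothing runs until an action is called, at which point the DAG scheduler plans stages and tasks. The input and output paths are placeholders:

    // Transformations are lazy: each call records a new node and edge in the DAG.
    val lines  = sc.textFile("hdfs:///data/input.txt")      // placeholder input path
    val words  = lines.flatMap(_.split("\\s+"))
    val pairs  = words.map(word => (word, 1))
    val counts = pairs.reduceByKey(_ + _)                    // shuffle boundary: the scheduler starts a new stage here

    // The lineage (the logical DAG) can be inspected before anything runs:
    println(counts.toDebugString)

    // This action submits the DAG to the DAG scheduler, which breaks it into stages and tasks.
    counts.saveAsTextFile("hdfs:///data/word-counts")        // placeholder output path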

Components of the Spark Architecture

  • Driver Program: The main program that initiates the Spark application. It creates a SparkContext object, which manages the application's execution on the cluster.
  • Cluster Manager: Allocates resources to Spark applications. Examples include Hadoop YARN, Apache Mesos, and Kubernetes.
  • Worker Node: A machine in the cluster that executes tasks.
  • Executor: A process launched on a worker node to run tasks for a specific Spark application. It manages data storage (in-memory or on disk) for the application and communicates with the driver program (see the configuration sketch after this list).
  • Task: A unit of work that's sent to an executor for execution.
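As a rough sketch of how these components relate, the driver's configuration asks the cluster manager for executors on the worker nodes; every key and value below is an illustrative assumption, not a recommendation from this tutorial:

    import org.apache.spark.{SparkConf, SparkContext}

    val clusterConf = new SparkConf()
      .setAppName("ArchitectureDemo")                  // driver program's application name (hypothetical)
      .setMaster("yarn")                               // cluster manager: Hadoop YARN (assumption)
      .set("spark.executor.instances", "4")            // executors to launch on worker nodes
      .set("spark.executor.cores", "2")                // tasks each executor can run in parallel
      .set("spark.executor.memory", "4g")              // memory each executor manages for this application
    val sc = new SparkContext(clusterConf)             // the driver now schedules tasks on those executors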
Spark Architecture Diagram