Apache Spark

Last Updated : 27 Feb, 2026

Apache Spark is an open-source distributed computing framework built for large-scale data processing and analytics.
It processes massive datasets across clusters of machines with high speed and reliability.

Core Architecture

Spark uses a **master-worker (driver-executor)** model, with modern enhancements like Spark Connect.

[Figure: Architecture of Apache Spark]

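Key abstractions: RDDs (Resilient Distributed Datasets), DataFrames, and Datasets.

Built-in libraries (unified engine): Spark SQL, Structured Streaming, MLlib, and GraphX.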

Working of Apache Spark

1. Application Submission & Driver Initialization

A Spark application begins when the user submits code written in PySpark, Scala, Java, or SQL. This code creates a SparkSession, which internally initializes the SparkContext.

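The Driver Program is the central coordinator of the application and runs the user’s main function. It is responsible for converting user code into jobs, stages, and tasks, requesting resources from the cluster manager, and scheduling tasks on executors.

The Driver contains critical internal components, including the DAG Scheduler and the Task Scheduler.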

Together, the Driver and SparkContext oversee the entire job execution lifecycle.

2. Logical Plan Creation (Lazy Evaluation)

Spark follows a lazy evaluation model. When transformations such as filter, map, or groupBy are defined, Spark does not execute them immediately.

Instead, it records each transformation in a logical plan and defers all computation.

Execution only begins when an action (e.g., show(), count(), write()) is called.
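The deferred-execution idea can be sketched in plain Python. This is a toy model, not Spark's API: the `LazyDataset` class and its internals are hypothetical names invented for illustration. Transformations only append to a recorded plan, and the recorded functions run only when an action is called.

```python
class LazyDataset:
    """Toy stand-in for a Spark dataset: records steps, runs nothing until an action."""

    def __init__(self, rows, plan=()):
        self._rows = rows
        self._plan = plan  # tuple of (kind, function) steps, in order

    # Transformations: return a new dataset with a longer plan, compute nothing.
    def filter(self, predicate):
        return LazyDataset(self._rows, self._plan + (("filter", predicate),))

    def map(self, fn):
        return LazyDataset(self._rows, self._plan + (("map", fn),))

    def _execute(self):
        # Replay the recorded plan over every input row.
        for row in self._rows:
            keep = True
            for kind, fn in self._plan:
                if kind == "filter":
                    if not fn(row):
                        keep = False
                        break
                else:
                    row = fn(row)
            if keep:
                yield row

    # Actions: trigger execution of the recorded plan.
    def count(self):
        return sum(1 for _ in self._execute())

    def collect(self):
        return list(self._execute())


calls = []  # records every time the predicate actually runs


def is_even(x):
    calls.append(x)
    return x % 2 == 0


ds = LazyDataset(range(10)).filter(is_even).map(lambda x: x * 10)
calls_before_action = len(calls)  # still 0: nothing has executed yet
result = ds.collect()             # the action runs the whole plan
print(calls_before_action, result)
```

Defining `ds` never invokes `is_even`; only `collect()` does, which mirrors how Spark waits for an action before doing any work.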

3. Query Optimization Using Catalyst Optimizer

Once an action is triggered, Spark hands the logical plan to the Catalyst Optimizer, which performs multiple optimization steps:

  1. **Analysis** – Resolves column names, data types, and references.
  2. **Logical Optimization** – Applies rule-based optimizations (predicate pushdown, projection pruning).
  3. **Cost-Based Optimization (CBO)** – Chooses optimal join strategies using statistics.
  4. **Physical Planning** – Converts optimized logic into executable physical operators.

Spark may also apply whole-stage code generation and columnar execution to further improve performance.
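To make the rule-based step more concrete, here is a minimal sketch of a single rewrite rule applied to a tiny plan tree. It is a toy model under stated assumptions: the `Scan`, `Project`, and `Filter` node classes and the `optimize` helper are hypothetical and far simpler than Catalyst's real tree and rule machinery. The rule moves a filter below a projection so rows are discarded earlier.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Scan:
    table: str


@dataclass(frozen=True)
class Project:
    columns: tuple
    child: object


@dataclass(frozen=True)
class Filter:
    column: str  # the column the predicate refers to
    child: object


def push_filter_below_project(plan):
    """Rewrite Filter(Project(x)) as Project(Filter(x)) when the filtered column is projected."""
    if isinstance(plan, Filter) and isinstance(plan.child, Project):
        project = plan.child
        if plan.column in project.columns:
            return Project(project.columns, Filter(plan.column, project.child))
    return plan


def optimize(plan, rules):
    """Apply rules bottom-up and repeat until no rule changes the plan."""
    if isinstance(plan, Project):
        plan = Project(plan.columns, optimize(plan.child, rules))
    elif isinstance(plan, Filter):
        plan = Filter(plan.column, optimize(plan.child, rules))
    for rule in rules:
        rewritten = rule(plan)
        if rewritten != plan:
            return optimize(rewritten, rules)
    return plan


logical_plan = Filter("age", Project(("name", "age"), Scan("users")))
optimized_plan = optimize(logical_plan, [push_filter_below_project])
print(optimized_plan)
```

After optimization the filter sits directly above the scan, which is the essence of predicate pushdown: less data flows through the rest of the plan.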

4. DAG Creation and Stage Breakdown

The optimized physical plan is translated into a Directed Acyclic Graph (DAG) of operations.

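The DAG Scheduler splits the DAG into stages at shuffle boundaries, which are introduced by wide transformations such as groupBy or join.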

Each stage consists of multiple tasks that can be executed in parallel.
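A simplified sketch of stage splitting follows. It assumes a linear chain of operations and a hypothetical `WIDE_OPERATIONS` set marking which steps require a shuffle; real Spark plans are graphs and the scheduler is considerably more involved. Each stage then yields one task per partition.

```python
# Operations assumed (for this sketch) to require a shuffle.
WIDE_OPERATIONS = {"groupBy", "orderBy"}


def split_into_stages(operations):
    """Cut a linear chain of operations into stages at each shuffle boundary."""
    stages = [[]]
    for op in operations:
        if op in WIDE_OPERATIONS:
            stages.append([])  # a wide operation starts a new stage
        stages[-1].append(op)
    return [stage for stage in stages if stage]


def tasks_for(stages, num_partitions):
    """One task per (stage, partition) pair."""
    return [
        (stage_id, partition)
        for stage_id in range(len(stages))
        for partition in range(num_partitions)
    ]


job = ["read", "filter", "map", "groupBy", "count", "orderBy", "write"]
stages = split_into_stages(job)
tasks = tasks_for(stages, num_partitions=4)
print(stages)
print(len(tasks))
```

Narrow operations such as `filter` and `map` stay in the same stage because each output partition depends on a single input partition, so no data needs to move between machines.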

5. Cluster Manager & Resource Allocation

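The Cluster Manager (Standalone, YARN, Kubernetes, or Mesos, which has been deprecated since Spark 3.2) is responsible for allocating cluster resources such as CPU and memory and for launching executors on worker nodes.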

The Spark Driver communicates with the Cluster Manager to request resources and schedule work.

6. Task Scheduling & Execution

The Task Scheduler assigns tasks to executors, which are long-lived processes running on worker nodes.

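Executors run the tasks assigned to them, hold cached data in memory or on disk, and report task status and results back to the Driver.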
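As a rough illustration of one task per partition running in parallel, the sketch below uses a local thread pool in place of executors. The `run_stage` helper and the `num_executor_slots` parameter are hypothetical names for this example; real executors are separate JVM processes on worker nodes, not threads in the driver.

```python
from concurrent.futures import ThreadPoolExecutor


def run_stage(partitions, task, num_executor_slots=2):
    """Run one task per partition on a fixed pool of worker slots."""
    with ThreadPoolExecutor(max_workers=num_executor_slots) as pool:
        # map() returns results in partition order, whichever slot ran them.
        return list(pool.map(task, partitions))


partitions = [[1, 2, 3], [4, 5], [6, 7, 8, 9]]
partial_sums = run_stage(partitions, sum)  # one partial result per partition
total = sum(partial_sums)                  # combined by the "driver"
print(partial_sums, total)
```

With three partitions and two slots, two tasks start immediately and the third waits for a free slot, which is similar in spirit to how tasks queue for executor cores.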

7. Data Storage, Caching, and Memory Management

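Spark supports in-memory computation, which is key to its performance advantage: intermediate data can be cached with cache() or persist() and reused across actions instead of being recomputed or re-read from disk.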

8. Fault Tolerance & Reliability

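Spark ensures fault tolerance through lineage: each dataset records the transformations used to build it, so a lost partition can be recomputed from its source data.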

This approach avoids costly data replication while maintaining reliability.
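Recovery by recomputation can be sketched as follows. This is a toy model: the `Partition` class is hypothetical, and "losing" a partition is simulated by clearing its in-memory data. The point is that only the durable source and the recorded list of transformations are needed to rebuild the result.

```python
class Partition:
    """Toy partition that can rebuild itself from its source and lineage."""

    def __init__(self, source, lineage):
        self.source = source    # durable input data (e.g. a file on disk)
        self.lineage = lineage  # ordered list of functions applied to the rows
        self.data = None        # materialized result held in memory

    def compute(self):
        rows = list(self.source)
        for fn in self.lineage:
            rows = fn(rows)
        self.data = rows
        return rows

    def get(self):
        # Recompute from lineage if the in-memory copy is missing.
        if self.data is None:
            return self.compute()
        return self.data


partition = Partition(
    source=[1, 2, 3, 4],
    lineage=[
        lambda rows: [r * 2 for r in rows],     # map step
        lambda rows: [r for r in rows if r > 4],  # filter step
    ],
)

first = partition.get()
partition.data = None        # simulate losing the executor that held the data
recovered = partition.get()  # rebuilt by replaying the lineage
print(first, recovered)
```

No second copy of the computed data was ever stored; the replayed lineage produces the same rows.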

9. Result Handling & Output

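After task execution, results are either returned to the Driver (for actions such as count() or show()) or written directly by the executors to external storage.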

Once execution completes, executors are released and the Spark application terminates.

Key Use Cases (Real-World in 2026)

Spark powers critical systems across industries.

Companies like Netflix (recommendations + log analytics), Amazon (order fulfillment), banks (fraud detection), and OTT platforms rely on Spark at massive scale.

End-to-End Production Workflow Example

A typical production ETL + ML pipeline (e.g., customer churn prediction or daily reporting):

**2. Processing (Transform)**

**3. Storage (Load)**

**4. Orchestration & Scheduling**

**5. Monitoring & Governance**

**6. Consumption**

Serve gold tables to BI (Power BI/Tableau), ML models (MLflow), or downstream apps via APIs.