Introduction to PySpark | Distributed Computing with Apache Spark

Last Updated : 18 Jul, 2025

As data grows rapidly from sources like social media and e-commerce, traditional single-machine systems struggle to keep up. Distributed computing, with tools like Apache Spark and PySpark, enables fast, scalable data processing. This article covers the basics, the main PySpark modules, and a hands-on PySpark example.

What is Distributed Computing?

**Distributed computing** is a computing model in which large computational tasks are divided and executed across multiple machines (nodes) that work in parallel. Think of it as breaking a huge job into smaller parts and assigning each part to a different worker, then combining the partial results.
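
The divide-and-combine idea can be sketched on a single machine. In the minimal example below, threads stand in for cluster nodes, which is a simplification: a real distributed system would send each chunk to a different machine. The helper names `split_into_chunks` and `distributed_sum` are hypothetical and exist only for this illustration.

```python
from concurrent.futures import ThreadPoolExecutor


def split_into_chunks(items, n_chunks):
    """Split a list into at most n_chunks contiguous, roughly equal parts."""
    size = max(1, -(-len(items) // n_chunks))  # ceiling division
    return [items[i:i + size] for i in range(0, len(items), size)]


def distributed_sum(numbers, workers=4):
    """Sum a list by giving each worker one chunk, then combining the partial sums."""
    chunks = split_into_chunks(numbers, workers)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # Each worker sums only its own chunk (the "smaller part" of the job).
        partial_sums = list(pool.map(sum, chunks))
    # Combine step: merge the partial results into the final answer.
    return sum(partial_sums)


total = distributed_sum(list(range(1, 101)))
print(total)  # 5050
```

Spark applies the same pattern, but it also handles moving data between machines, scheduling, and recovering from worker failures.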

What is Apache Spark?

Apache Spark is an open-source distributed computing engine, originally developed at UC Berkeley's AMPLab and now maintained by the Apache Software Foundation. It is designed to process large datasets quickly and efficiently across a cluster of machines.

What is PySpark?

**PySpark** is the Python API for Apache Spark. It lets Python developers use Spark’s distributed computing framework with familiar Python syntax, bridging the gap between Python’s ease of use and Spark’s processing power.

PySpark Modules

PySpark is built in a modular way, offering specialized libraries for different data processing tasks:

| Module | Description |
| --- | --- |
| `pyspark.sql` | Work with structured data using DataFrames and SQL queries. |
| `pyspark.ml` | Build machine learning pipelines (classification, regression, clustering, etc.). |
| `pyspark.streaming` | Process real-time data streams (e.g., Twitter feed, logs). |
| GraphX | Graph computations and social network analysis. GraphX is available in Scala/Java only, so there is no `pyspark.graphx` module; Python users typically rely on the separate GraphFrames package. |

How PySpark Works

When you run a PySpark application, it follows a structured workflow to process large datasets efficiently across a distributed cluster. Here’s a high-level overview:

  1. **Driver Program:** Your Python script, which initiates and controls the Spark job.
  2. **SparkContext:** Connects the driver to the Spark cluster and manages job configuration.
  3. **RDDs/DataFrames:** Data structures that are split into partitions, distributed across the cluster, and processed in parallel.
  4. **Cluster Manager:** Schedules jobs and allocates resources on the worker nodes (e.g., YARN, Mesos, Kubernetes).
  5. **Executors:** Processes on the worker nodes that run the actual tasks in parallel and return results to the driver.

Basic Example: Word Count with PySpark

Here’s a simple PySpark example that counts the frequency of each word in a short piece of text:

```python
from pyspark import SparkContext

# Start Spark in local mode with the application name "WordCount"
sc = SparkContext("local", "WordCount")

txt = "PySpark makes big data processing fast and easy with Python"
rdd = sc.parallelize([txt])

# Split into words, pair each word with 1, then sum the counts per word
counts = (
    rdd.flatMap(lambda x: x.split())
    .map(lambda word: (word, 1))
    .reduceByKey(lambda a, b: a + b)
)

print(counts.collect())
sc.stop()
```

**Output** (the order of the pairs may vary):

[('PySpark', 1), ('makes', 1), ('big', 1), ('data', 1), ('processing', 1), ('fast', 1), ('and', 1), ('easy', 1), ('with', 1), ('Python', 1)]

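**Explanation:**

- `SparkContext("local", "WordCount")` starts Spark in local mode on a single machine, with the application name `WordCount`.
- `sc.parallelize([txt])` turns the Python list into an RDD with one element, the sentence.
- `flatMap(lambda x: x.split())` splits the sentence on whitespace and flattens the result into an RDD of individual words.
- `map(lambda word: (word, 1))` pairs each word with the count 1.
- `reduceByKey(lambda a, b: a + b)` adds up the counts for each distinct word.
- `collect()` returns the results to the driver as a Python list, and `sc.stop()` shuts down the SparkContext.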
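
To see what the word count pipeline computes, it can help to write the same three steps in plain Python without Spark. The sketch below is only a single-machine analogy: Spark runs these steps across partitions and shuffles the pairs by key before reducing, whereas here everything happens in one process. The variable names are chosen for this illustration.

```python
from collections import defaultdict
from functools import reduce

text = "PySpark makes big data processing fast and easy with Python"
lines = [text]

# flatMap: split every line and flatten the results into one list of words.
words = [word for line in lines for word in line.split()]

# map: turn each word into a (word, 1) pair.
pairs = [(word, 1) for word in words]

# reduceByKey: group the values by key, then fold each group
# with the same function used in the PySpark example.
grouped = defaultdict(list)
for word, one in pairs:
    grouped[word].append(one)
counts = {word: reduce(lambda a, b: a + b, ones) for word, ones in grouped.items()}

print(counts)
```

Every word in the sample sentence is distinct, so each count is 1; repeating a word in `text` would raise its count accordingly.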