pyspark.sql.SparkSession — PySpark 3.5.5 documentation
The entry point to programming Spark with the Dataset and DataFrame API.
A SparkSession can be used to create DataFrames, register DataFrames as tables, execute SQL over tables, cache tables, and read Parquet files. To create a SparkSession, use the following builder pattern:
Changed in version 3.4.0: Supports Spark Connect.
Examples
Create a Spark session.
```python
>>> spark = (
...     SparkSession.builder
...     .master("local")
...     .appName("Word Count")
...     .config("spark.some.config.option", "some-value")
...     .getOrCreate()
... )
```
Create a Spark session with Spark Connect.
```python
>>> spark = (
...     SparkSession.builder
...     .remote("sc://localhost")
...     .appName("Word Count")
...     .config("spark.some.config.option", "some-value")
...     .getOrCreate()
... )
```
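Once a session exists, getOrCreate() returns the existing active session for the thread rather than building a new one. A minimal sketch of that behavior, assuming a local master:

```python
from pyspark.sql import SparkSession

# getOrCreate() reuses the active session for this thread if one exists
spark = SparkSession.builder.master("local").appName("Word Count").getOrCreate()

# getActiveSession() returns that same session (or None if none is active)
assert SparkSession.getActiveSession() is spark

spark.stop()
```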
Methods
| Method | Description |
|---|---|
| active() | Returns the active or default SparkSession for the current thread, returned by the builder. |
| addArtifact(*path[, pyfile, archive, file]) | Add artifact(s) to the client session. |
| addArtifacts(*path[, pyfile, archive, file]) | Add artifact(s) to the client session. |
| addTag(tag) | Add a tag to be assigned to all the operations started by this thread in this session. |
| clearTags() | Clear the current thread’s operation tags. |
| copyFromLocalToFs(local_path, dest_path) | Copy a file from the local file system to a cloud storage file system. |
| createDataFrame(data[, schema, …]) | Creates a DataFrame from an RDD, a list, a pandas.DataFrame, or a numpy.ndarray. |
| getActiveSession() | Returns the active SparkSession for the current thread, returned by the builder. |
| getTags() | Get the tags that are currently set to be assigned to all the operations started by this thread. |
| interruptAll() | Interrupt all operations of this session currently running on the connected server. |
| interruptOperation(op_id) | Interrupt an operation of this session with the given operation ID. |
| interruptTag(tag) | Interrupt all operations of this session with the given operation tag. |
| newSession() | Returns a new SparkSession as a new session, with separate SQLConf, registered temporary views, and UDFs, but a shared SparkContext and table cache. |
| range(start[, end, step, numPartitions]) | Create a DataFrame with a single pyspark.sql.types.LongType column named id, containing elements in a range from start to end (exclusive) with step value step. |
| removeTag(tag) | Remove a tag previously added to be assigned to all the operations started by this thread in this session. |
| sql(sqlQuery[, args]) | Returns a DataFrame representing the result of the given query. |
| stop() | Stop the underlying SparkContext. |
| table(tableName) | Returns the specified table as a DataFrame. |
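A minimal sketch tying several of these methods together, assuming a local session (the view name pairs is made up for the example):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local").appName("methods-demo").getOrCreate()

# createDataFrame: build a DataFrame from a list of tuples with column names
df = spark.createDataFrame([(1, "a"), (2, "b"), (3, "c")], schema=["id", "value"])

# range: a single LongType column named "id"; end is exclusive
nums = spark.range(0, 10, step=2)

# sql with a named parameter passed via args (available since 3.4)
df.createOrReplaceTempView("pairs")
spark.sql("SELECT * FROM pairs WHERE id > :min_id", args={"min_id": 1}).show()

# table: look the temporary view back up as a DataFrame
same = spark.table("pairs")

spark.stop()  # stops the underlying SparkContext
```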
Attributes
| Attribute | Description |
|---|---|
| builder | A class attribute having a Builder to construct SparkSession instances. |
| catalog | Interface through which the user may create, drop, alter, or query underlying databases, tables, functions, etc. |
| client | Gives access to the Spark Connect client. |
| conf | Runtime configuration interface for Spark. |
| read | Returns a DataFrameReader that can be used to read data in as a DataFrame. |
| readStream | Returns a DataStreamReader that can be used to read data streams as a streaming DataFrame. |
| sparkContext | Returns the underlying SparkContext. |
| streams | Returns a StreamingQueryManager that allows managing all the StreamingQuery instances active on this context. |
| udf | Returns a UDFRegistration for UDF registration. |
| udtf | Returns a UDTFRegistration for UDTF registration. |
| version | The version of Spark on which this application is running. |
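To make the attribute summaries concrete, a minimal sketch using a local session (the Parquet path is a placeholder, so that line is left commented):

```python
from pyspark.sql import SparkSession
from pyspark.sql.types import IntegerType

spark = SparkSession.builder.master("local").appName("attrs-demo").getOrCreate()

# conf: runtime configuration interface
spark.conf.set("spark.sql.shuffle.partitions", "4")

# read: a DataFrameReader for batch input (the path below is hypothetical)
# df = spark.read.parquet("/path/to/data.parquet")

# udf: register a Python UDF so SQL queries can call it
spark.udf.register("plus_one", lambda x: x + 1, IntegerType())
spark.sql("SELECT plus_one(41) AS answer").show()

# catalog: inspect databases, tables, and functions
print([db.name for db in spark.catalog.listDatabases()])

# sparkContext and version: the underlying context and the Spark version string
print(spark.sparkContext.applicationId, spark.version)

spark.stop()
```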