pyspark.sql.SparkSession — PySpark 3.5.5 documentation
The entry point to programming Spark with the Dataset and DataFrame API.
A SparkSession can be used to create DataFrames, register DataFrames as tables, execute SQL over tables, cache tables, and read Parquet files. To create a SparkSession, use the following builder pattern:
Changed in version 3.4.0: Supports Spark Connect.
Examples
Create a Spark session.
```python
>>> spark = (
...     SparkSession.builder
...     .master("local")
...     .appName("Word Count")
...     .config("spark.some.config.option", "some-value")
...     .getOrCreate()
... )
```
Create a Spark session with Spark Connect.
```python
>>> spark = (
...     SparkSession.builder
...     .remote("sc://localhost")
...     .appName("Word Count")
...     .config("spark.some.config.option", "some-value")
...     .getOrCreate()
... )
```
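Once a session exists, getOrCreate() returns the existing active session for the thread rather than building a new one. A minimal sketch of that behavior, assuming a local master:

```python
from pyspark.sql import SparkSession

# getOrCreate() reuses the active session for this thread if one exists
spark = SparkSession.builder.master("local").appName("Word Count").getOrCreate()

# getActiveSession() returns that same session (or None if none is active)
assert SparkSession.getActiveSession() is spark

spark.stop()
```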
Methods
| Method | Description |
|---|---|
| active() | Returns the active or default SparkSession for the current thread, returned by the builder. |
| addArtifact(*path[, pyfile, archive, file]) | Add artifact(s) to the client session. |
| addArtifacts(*path[, pyfile, archive, file]) | Add artifact(s) to the client session. |
| addTag(tag) | Add a tag to be assigned to all the operations started by this thread in this session. |
| clearTags() | Clear the current thread’s operation tags. |
| copyFromLocalToFs(local_path, dest_path) | Copy a file from the local file system to a cloud storage file system. |
| createDataFrame(data[, schema, …]) | Creates a DataFrame from an RDD, a list, a pandas.DataFrame, or a numpy.ndarray. |
| getActiveSession() | Returns the active SparkSession for the current thread, returned by the builder. |
| getTags() | Get the tags that are currently set to be assigned to all the operations started by this thread. |
| interruptAll() | Interrupt all operations of this session currently running on the connected server. |
| interruptOperation(op_id) | Interrupt an operation of this session with the given operation ID. |
| interruptTag(tag) | Interrupt all operations of this session with the given operation tag. |
| newSession() | Returns a new SparkSession as a new session, with separate SQLConf, registered temporary views, and UDFs, but a shared SparkContext and table cache. |
| range(start[, end, step, numPartitions]) | Create a DataFrame with a single pyspark.sql.types.LongType column named id, containing elements in a range from start to end (exclusive) with step value step. |
| removeTag(tag) | Remove a tag previously added to be assigned to all the operations started by this thread in this session. |
| sql(sqlQuery[, args]) | Returns a DataFrame representing the result of the given query. |
| stop() | Stop the underlying SparkContext. |
| table(tableName) | Returns the specified table as a DataFrame. |
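A minimal sketch tying several of these methods together, assuming a local session (the view name pairs is made up for the example):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local").appName("methods-demo").getOrCreate()

# createDataFrame: build a DataFrame from a list of tuples with column names
df = spark.createDataFrame([(1, "a"), (2, "b"), (3, "c")], schema=["id", "value"])

# range: a single LongType column named "id"; end is exclusive
nums = spark.range(0, 10, step=2)

# sql with a named parameter passed via args (available since 3.4)
df.createOrReplaceTempView("pairs")
spark.sql("SELECT * FROM pairs WHERE id > :min_id", args={"min_id": 1}).show()

# table: look the temporary view back up as a DataFrame
same = spark.table("pairs")

spark.stop()  # stops the underlying SparkContext
```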
Attributes
| Attribute | Description |
|---|---|
| builder | A class attribute having a Builder to construct SparkSession instances. |
| catalog | Interface through which the user may create, drop, alter, or query underlying databases, tables, functions, etc. |
| client | Gives access to the Spark Connect client. |
| conf | Runtime configuration interface for Spark. |
| read | Returns a DataFrameReader that can be used to read data in as a DataFrame. |
| readStream | Returns a DataStreamReader that can be used to read data streams as a streaming DataFrame. |
| sparkContext | Returns the underlying SparkContext. |
| streams | Returns a StreamingQueryManager that allows managing all the StreamingQuery instances active on this context. |
| udf | Returns a UDFRegistration for UDF registration. |
| udtf | Returns a UDTFRegistration for UDTF registration. |
| version | The version of Spark on which this application is running. |
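To make the attribute summaries concrete, a minimal sketch using a local session (the Parquet path is a placeholder, so that line is left commented):

```python
from pyspark.sql import SparkSession
from pyspark.sql.types import IntegerType

spark = SparkSession.builder.master("local").appName("attrs-demo").getOrCreate()

# conf: runtime configuration interface
spark.conf.set("spark.sql.shuffle.partitions", "4")

# read: a DataFrameReader for batch input (the path below is hypothetical)
# df = spark.read.parquet("/path/to/data.parquet")

# udf: register a Python UDF so SQL queries can call it
spark.udf.register("plus_one", lambda x: x + 1, IntegerType())
spark.sql("SELECT plus_one(41) AS answer").show()

# catalog: inspect databases, tables, and functions
print([db.name for db in spark.catalog.listDatabases()])

# sparkContext and version: the underlying context and the Spark version string
print(spark.sparkContext.applicationId, spark.version)

spark.stop()
```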