pyspark.sql.DataFrame.corr — PySpark 4.0.0 documentation

DataFrame.corr(col1, col2, method=None)

Calculates the correlation of two columns of a DataFrame as a double value. Currently only supports the Pearson Correlation Coefficient. DataFrame.corr() and DataFrameStatFunctions.corr() are aliases of each other.

New in version 1.4.0.

Changed in version 3.4.0: Supports Spark Connect.

Parameters

col1 : str

The name of the first column

col2 : str

The name of the second column

method : str, optional

The correlation method. Currently only “pearson” is supported.

Returns

float

Pearson Correlation Coefficient of two columns.

Examples

>>> df = spark.createDataFrame([(1, 12), (10, 1), (19, 8)], ["c1", "c2"])
>>> df.corr("c1", "c2")
-0.3592106040535498
>>> df = spark.createDataFrame([(11, 12), (10, 11), (9, 10)], ["small", "bigger"])
>>> df.corr("small", "bigger")
1.0
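As a cross-check on the values above, the Pearson coefficient can be computed by hand from the same rows. The following is a minimal plain-Python sketch of the textbook formula (covariance divided by the product of the standard deviations). The helper name `pearson` is hypothetical and not part of PySpark. It illustrates the quantity being reported, not how Spark computes it internally.

```python
import math


def pearson(xs, ys):
    """Pearson correlation of two equal-length numeric sequences.

    Hypothetical helper for illustration only; not a PySpark API.
    """
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Deviations of each value from its column mean.
    dx = [x - mean_x for x in xs]
    dy = [y - mean_y for y in ys]
    # The 1/n (or 1/(n-1)) factors cancel between numerator and denominator,
    # so plain sums of products are enough.
    cov = sum(a * b for a, b in zip(dx, dy))
    var_x = sum(a * a for a in dx)
    var_y = sum(b * b for b in dy)
    # Note: a constant column gives zero variance and a ZeroDivisionError here.
    return cov / math.sqrt(var_x * var_y)


# Same rows as the two DataFrames in the examples above.
r1 = pearson([1, 10, 19], [12, 1, 8])      # columns "c1", "c2"
r2 = pearson([11, 10, 9], [12, 11, 10])    # columns "small", "bigger"
print(r1, r2)
```

The first value should agree with the `df.corr("c1", "c2")` result to within floating-point rounding, and the second is a perfectly linear relationship, so it yields 1.0.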