pyspark.sql.DataFrame.corr — PySpark 4.0.0 documentation (original) (raw)
DataFrame.corr(col1, col2, method=None)[source]#
Calculates the correlation of two columns of a DataFrame as a double value. Currently only supports the Pearson Correlation Coefficient.DataFrame.corr() and DataFrameStatFunctions.corr() are aliases of each other.
New in version 1.4.0.
Changed in version 3.4.0: Supports Spark Connect.
Parameters
col1str
The name of the first column
col2str
The name of the second column
methodstr, optional
The correlation method. Currently only supports “pearson”
Returns
float
Pearson Correlation Coefficient of two columns.
Examples
df = spark.createDataFrame([(1, 12), (10, 1), (19, 8)], ["c1", "c2"]) df.corr("c1", "c2") -0.3592106040535498 df = spark.createDataFrame([(11, 12), (10, 11), (9, 10)], ["small", "bigger"]) df.corr("small", "bigger") 1.0