How to Calculate Correlation Between Two Columns in Pandas?

**Correlation** summarizes the strength and direction of the linear association between two quantitative variables. It is denoted by **r** and takes values between -1 and +1. A positive value of **r** indicates a positive association, and a negative value indicates a negative association. Let's explore several methods to calculate the correlation between columns in a pandas DataFrame.

Using Series.corr()

Series.corr() calculates the correlation coefficient (Pearson by default) between two individual columns (Series) of a pandas DataFrame. It’s simple and quick when you want to check the correlation between just two variables.

```python
import pandas as pd

# Exam scores for five students across four subjects
data = pd.DataFrame({
    'math_score': [85, 78, 92, 88, 76],
    'science_score': [89, 81, 94, 90, 80],
    'english_score': [78, 75, 85, 80, 72],
    'history_score': [70, 68, 80, 72, 65]
})

# Pearson correlation between two columns (Series)
corr = data['math_score'].corr(data['science_score'])
print(corr)
```

**Explanation:** This code computes the correlation coefficient between the math and science scores, a value between -1 and +1 that measures the strength and direction of their linear relationship. A value of +1 indicates a perfect positive correlation, -1 a perfect negative correlation, and 0 no linear correlation.
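To make the number returned by `corr()` less of a black box, the sketch below recomputes Pearson's r from its textbook definition (the sum of cross-products of deviations, divided by the square root of the product of the sums of squared deviations) and compares it with `Series.corr()`. The helper name `pearson_r` is hypothetical and introduced here only for illustration.

```python
import numpy as np
import pandas as pd


def pearson_r(x, y):
    """Pearson's r from the definition (hypothetical helper, for illustration)."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    dx = x - x.mean()  # deviations from the mean of x
    dy = y - y.mean()  # deviations from the mean of y
    # Sum of cross-products over the root of the product of sums of squares
    return (dx * dy).sum() / np.sqrt((dx ** 2).sum() * (dy ** 2).sum())


math_score = pd.Series([85, 78, 92, 88, 76])
science_score = pd.Series([89, 81, 94, 90, 80])

manual = pearson_r(math_score, science_score)
builtin = math_score.corr(science_score)

# The two values should agree up to floating-point error
print(manual, builtin)
```

In practice the built-in method is preferable; the manual version only shows what it computes.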

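Using DataFrame.corr()

DataFrame.corr() computes the correlation matrix for the columns of a DataFrame. It returns the pairwise correlation coefficients between all columns, making it easy to see relationships across multiple variables at once. Note that in pandas 2.0 and later, non-numeric columns are no longer dropped automatically; pass numeric_only=True if the DataFrame contains any.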

```python
import pandas as pd

data = pd.DataFrame({
    'math_score': [85, 78, 92, 88, 76],
    'science_score': [89, 81, 94, 90, 80],
    'english_score': [78, 75, 85, 80, 72],
    'history_score': [70, 68, 80, 72, 65]
})

# Pairwise correlation matrix for all columns
res = data.corr()
print(res)
```

**Explanation:** This code calculates the correlation matrix for the columns of the DataFrame (all numeric here), showing the pairwise correlation coefficient between each pair of subjects' scores. Each value ranges from -1 to +1, indicating the strength and direction of the linear relationship. The matrix is symmetric, and its diagonal entries are 1 because every column is perfectly correlated with itself.
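Real DataFrames often mix numeric and text columns. The following is a minimal sketch, assuming pandas 1.5 or later (where `DataFrame.corr()` accepts the `numeric_only` argument); the `student` column is made up for illustration and is not part of the original dataset.

```python
import pandas as pd

data = pd.DataFrame({
    'student': ['A', 'B', 'C', 'D', 'E'],  # hypothetical non-numeric column
    'math_score': [85, 78, 92, 88, 76],
    'science_score': [89, 81, 94, 90, 80],
})

# Restrict the correlation matrix to the numeric columns
res = data.corr(numeric_only=True)
print(res)

# Alternatively, select the columns of interest explicitly
pair = data[['math_score', 'science_score']].corr()
print(pair.loc['math_score', 'science_score'])
```

Selecting columns explicitly also works on older pandas versions and keeps the matrix small when only a few variables matter.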

Using numpy.corrcoef()

corrcoef() from the NumPy library returns the matrix of Pearson correlation coefficients for the arrays passed to it. It is useful when working directly with NumPy arrays or when pandas is not required.

```python
import numpy as np
import pandas as pd

data = pd.DataFrame({
    'math_score': [85, 78, 92, 88, 76],
    'science_score': [89, 81, 94, 90, 80],
    'english_score': [78, 75, 85, 80, 72],
    'history_score': [70, 68, 80, 72, 65]
})

# corrcoef returns a 2x2 matrix; [0, 1] picks the off-diagonal entry
corr = np.corrcoef(data['math_score'], data['english_score'])[0, 1]
print(corr)
```

**Explanation:** This code calculates the Pearson correlation coefficient between the math and English scores using NumPy’s corrcoef function. For two input arrays, corrcoef returns a 2×2 matrix, and the index [0, 1] selects the off-diagonal entry, which is the correlation between the two columns. The value lies between -1 and +1 and measures the strength and direction of their linear relationship.
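Because `np.corrcoef` returns a full matrix, it can also process more than two columns at once. In the sketch below, passing the DataFrame's values with `rowvar=False` tells NumPy to treat each column as a variable and each row as an observation; the result should match `DataFrame.corr()` up to floating-point error.

```python
import numpy as np
import pandas as pd

data = pd.DataFrame({
    'math_score': [85, 78, 92, 88, 76],
    'science_score': [89, 81, 94, 90, 80],
    'english_score': [78, 75, 85, 80, 72],
    'history_score': [70, 68, 80, 72, 65]
})

# Two inputs give a 2x2 matrix
pair = np.corrcoef(data['math_score'], data['english_score'])
print(pair.shape)  # (2, 2)

# rowvar=False: columns are variables, rows are observations
full = np.corrcoef(data.to_numpy(), rowvar=False)
print(full.shape)  # (4, 4)

# Compare with the pandas correlation matrix
print(np.allclose(full, data.corr().to_numpy()))
```

Without `rowvar=False`, NumPy would treat each of the five rows as a variable and return a 5×5 matrix instead.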

Using scipy.stats.pearsonr()

pearsonr() from SciPy calculates the Pearson correlation coefficient along with a p-value for testing the null hypothesis that the two variables are uncorrelated. It is helpful if you want to know both the strength of the correlation and its statistical significance.

```python
from scipy.stats import pearsonr
import pandas as pd

data = pd.DataFrame({
    'math_score': [85, 78, 92, 88, 76],
    'science_score': [89, 81, 94, 90, 80],
    'english_score': [78, 75, 85, 80, 72],
    'history_score': [70, 68, 80, 72, 65]
})

# pearsonr returns the coefficient and the two-sided p-value
corr, p_value = pearsonr(data['science_score'], data['history_score'])
print(corr)
print(p_value)
```
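One way to act on the p-value is to compare it with a chosen significance level. The sketch below assumes a threshold of `alpha = 0.05`, a common convention but not something the data dictates; the variable names `alpha` and `verdict` are introduced here only for illustration.

```python
import pandas as pd
from scipy.stats import pearsonr

science = pd.Series([89, 81, 94, 90, 80])
history = pd.Series([70, 68, 80, 72, 65])

alpha = 0.05  # assumed significance level; choose one suited to your study

corr, p_value = pearsonr(science, history)

# Reject the null hypothesis of no linear correlation when p < alpha
if p_value < alpha:
    verdict = "statistically significant"
else:
    verdict = "not statistically significant"

print(f"r = {corr:.3f}, p = {p_value:.3f} ({verdict} at alpha = {alpha})")
```

With only five observations per column, the p-value should be read cautiously: small samples can yield unstable estimates of both the coefficient and its significance.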