Correlation and Regression

Last Updated : 12 Sep, 2025

**Correlation and regression** are essential statistical tools used to analyze the relationship between variables. **Correlation** measures the strength and direction of a linear relationship between two variables, indicating how closely the two variables move together. **Regression**, on the other hand, goes a step further: it models this relationship and predicts the value of a dependent variable based on one or more independent variables.

Correlation

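Correlation quantifies how strongly, and in which direction, two variables are related. It is measured by the correlation coefficient, which ranges from -1 to +1. Values near +1 indicate a strong positive relationship, values near -1 a strong negative relationship, and values near 0 little or no linear relationship.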

Types of Correlation Coefficients

Some of the common correlation coefficients are Pearson, Spearman, and Kendall.

Regression

Regression analysis is a statistical tool used to identify the relationship between a dependent variable and one or more independent variables. It is useful for forecasting the value of the dependent variable given the values of the independent variables.

Types of Regression

Common types of regression are simple linear, multiple linear, logistic, and polynomial regression.

Correlation and Regression Formula

Formulas related to Correlation and Regression are:

**Correlation Formula:** The Pearson correlation coefficient (r) is calculated using the formula:

r = \frac{\sum (X - \bar{X})(Y - \bar{Y})}{\sqrt{\sum (X - \bar{X})^2 \sum (Y - \bar{Y})^2}}

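**Simple Linear Regression Formula:** The regression line is represented as:

Y = a + bX

Where Y is the dependent variable, X is the independent variable, a is the intercept, and b is the slope.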

Correlation vs Regression

Some of the common differences between correlation and regression are:

| Correlation | Regression |
|---|---|
| Measures the strength and direction of the linear relationship between two variables. | Models the relationship between a dependent variable and one or more independent variables. |
| Does not assume a causal relationship; measures association. | Implies a directional relationship from the independent to the dependent variable. |
| Correlation coefficient (r) ranges from -1 to 1. | Regression coefficients (slope and intercept) describe the relationship. |
| No equation involved. | Provides an equation (e.g., Y = a + bX in simple linear regression). |
| Does not predict values. | Used to predict the value of the dependent variable based on independent variables. |
| Pearson, Spearman, Kendall. | Simple linear, multiple linear, logistic, and polynomial. |
| Measures linear relationships (in Pearson correlation). | Can model linear and non-linear relationships (depending on the type of regression). |
| Requires two continuous variables (for Pearson correlation). | Requires one dependent and one or more independent variables. |
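The contrast between a symmetric measure of association and a directional model can be illustrated with a minimal sketch. The data below (`hours`, `scores`) and the helper `r_and_slope` are hypothetical and chosen only for illustration; the point is that swapping the two variables leaves r unchanged but changes the regression slope.

```python
import math

def r_and_slope(xs, ys):
    """Return (Pearson r, slope of the line that regresses ys on xs)."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy), sxy / sxx

# Hypothetical illustration data, not taken from the article
hours = [1, 2, 3, 4, 5]
scores = [2, 4, 5, 4, 5]

r_xy, slope_y_on_x = r_and_slope(hours, scores)
r_yx, slope_x_on_y = r_and_slope(scores, hours)

print(round(r_xy, 4), round(r_yx, 4))   # same value in both directions
print(slope_y_on_x, slope_x_on_y)       # slopes differ with the direction
```

Correlation treats the two variables interchangeably, while regression depends on which variable is chosen as the dependent one.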

Solved Problems on Correlation and Regression

**Problem 1:** Given two variables, X = [1, 2, 3, 4, 5] and Y = [2, 4, 6, 8, 10], calculate the Pearson correlation coefficient.

**Solution:**

r = \frac{\sum (X - \bar{X})(Y - \bar{Y})}{\sqrt{\sum (X - \bar{X})^2 \sum (Y - \bar{Y})^2}}

**Calculating the mean of X and Y:**

\bar{X} = 3, \quad \bar{Y} = 6

**Calculating the Pearson correlation coefficient:**

r = \frac{(1-3)(2-6) + (2-3)(4-6) + (3-3)(6-6) + (4-3)(8-6) + (5-3)(10-6)}{\sqrt{(1-3)^2 + (2-3)^2 + (3-3)^2 + (4-3)^2 + (5-3)^2} \sqrt{(2-6)^2 + (4-6)^2 + (6-6)^2 + (8-6)^2 + (10-6)^2}}

r = \frac{(-2)(-4) + (-1)(-2) + (0)(0) + (1)(2) + (2)(4)}{\sqrt{4 + 1 + 0 + 1 + 4} \sqrt{16 + 4 + 0 + 4 + 16}} = \frac{20}{\sqrt{10} \cdot \sqrt{40}} = \frac{20}{\sqrt{400}} = \frac{20}{20} = 1

So, the Pearson correlation coefficient is r = 1, indicating a perfect positive linear relationship between X and Y.
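The calculation above can be checked with a short sketch that applies the deviation formula directly. It uses the X and Y values that appear in the worked calculation; the function name `pearson_r` is only an illustrative choice.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient from the deviation formula."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length sequences with at least 2 points")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Numerator: sum of products of deviations from the means
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # Denominator: square root of the product of the summed squared deviations
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

X = [1, 2, 3, 4, 5]
Y = [2, 4, 6, 8, 10]
r = pearson_r(X, Y)
print(round(r, 6))  # 1.0
```

Note that this sketch divides by zero if either variable is constant, since the coefficient is undefined in that case.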

**Problem 2:** Calculate Spearman's rank correlation coefficient for the following data:

**Solution:**

**1. Rank the data points in X and Y:**

**2. Calculate the difference between ranks (d_i):**

Differences (d): [2, 2, -2, -2, 0]

**3. Calculate the square of differences (d_i^2):**

Squared differences (d^2): [4, 4, 4, 4, 0]

**4. Sum the squared differences:**

\sum d_i^2 = 4 + 4 + 4 + 4 + 0 = 16

**5. Use the Spearman's rank correlation formula:**

r_s = 1 - \frac{6 \sum d_i^2}{n(n^2 - 1)}

Here, n = 5:

r_s = 1 - \frac{6 \cdot 16}{5(5^2 - 1)} = 1 - \frac{96}{120} = 1 - 0.8 = 0.2

So, the Spearman's rank correlation coefficient is r_s = 0.2, indicating a weak positive rank correlation between X and Y.
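As a quick check of the arithmetic, the sketch below applies the same formula to the rank differences listed above. The function name `spearman_from_differences` is an illustrative choice, and this shortcut formula assumes there are no tied ranks.

```python
def spearman_from_differences(d):
    """Spearman's r_s from rank differences (assumes no tied ranks)."""
    n = len(d)
    sum_d2 = sum(di ** 2 for di in d)
    return 1 - (6 * sum_d2) / (n * (n ** 2 - 1))

d = [2, 2, -2, -2, 0]   # rank differences from step 2
r_s = spearman_from_differences(d)
print(round(r_s, 3))  # 0.2
```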

**Problem 3:** Given the data points X = [1, 2, 3, 4, 5] and Y = [2, 4, 6, 8, 10], predict the value of Y for X = 6 using the equation Y = a + bX.

**Solution:**

**Calculate the slope (b) and intercept (a):**

b = \frac{\sum (X - \bar{X})(Y - \bar{Y})}{\sum (X - \bar{X})^2}

\bar{X} = 3, \quad \bar{Y} = 6

b = \frac{(1-3)(2-6) + (2-3)(4-6) + (3-3)(6-6) + (4-3)(8-6) + (5-3)(10-6)}{(1-3)^2 + (2-3)^2 + (3-3)^2 + (4-3)^2 + (5-3)^2} = \frac{20}{10} = 2

a = \bar{Y} - b\bar{X} = 6 - 2 \cdot 3 = 0

Equation: Y = 0 + 2X

For X = 6:

Y = 2 × 6 = 12

So, the predicted value of Y for X = 6 is 12.
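The same fit can be reproduced with a minimal least-squares sketch. It uses the data points that appear in the slope calculation above; `fit_line` is a hypothetical helper name.

```python
def fit_line(xs, ys):
    """Least-squares intercept a and slope b for the line Y = a + bX."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    b = sxy / sxx               # slope
    a = mean_y - b * mean_x     # intercept
    return a, b

X = [1, 2, 3, 4, 5]
Y = [2, 4, 6, 8, 10]
a, b = fit_line(X, Y)
y_pred = a + b * 6
print(a, b, y_pred)  # 0.0 2.0 12.0
```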

**Problem 4:** Given the following data, predict the value of Y for X1 = 3 and X2 = 4:

**Data:**

**Solution:**

The multiple linear regression equation is Y = a + b1X1 + b2X2.

Using statistical software or a calculation tool (e.g., Excel, R), we can determine the coefficients a = 1.5, b1 = 0.5, and b2 = 0.5.

**So, the equation becomes:**

Y = 1.5 + 0.5X1 + 0.5X2

For X1 = 3 and X2 = 4:

Y = 1.5 + 0.5 × 3 + 0.5 × 4 = 1.5 + 1.5 + 2 = 5

So, the predicted value of Y for X1 = 3 and X2 = 4 is 5.
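The prediction step can be sketched as a small function that evaluates a fitted linear equation. The coefficients below are taken from the equation above, not re-estimated from data; `predict` is an illustrative name.

```python
def predict(intercept, coefs, features):
    """Evaluate intercept + sum(coef_i * feature_i) for a fitted linear model."""
    if len(coefs) != len(features):
        raise ValueError("coefs and features must have the same length")
    return intercept + sum(c * x for c, x in zip(coefs, features))

# Coefficients from the equation above: Y = 1.5 + 0.5*X1 + 0.5*X2
y_hat = predict(1.5, [0.5, 0.5], [3, 4])
print(y_hat)  # 5.0
```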

**Problem 5:** Given the following data, determine the probability of Y being 1 for X = 4:

**Data:**

**Solution:**

**The logistic regression model (using the natural logarithm) is:**

\log\left(\frac{p}{1-p}\right) = a + bX

**Using statistical software or a calculation tool, we can determine:** a = -3 and b = 1.1.

**So, the equation becomes:**

\log\left(\frac{p}{1-p}\right) = -3 + 1.1X

**For X = 4:**

\log\left(\frac{p}{1-p}\right) = -3 + 1.1 \times 4 = 1.4

**Solving for p:**

\frac{p}{1-p} = e^{1.4} \approx 4.055

p \approx \frac{4.055}{1 + 4.055} \approx 0.802

So, the probability of Y being 1 for X = 4 is approximately 0.802, or 80.2%.
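Converting log-odds to a probability can be sketched in a few lines. The coefficients are those stated in the equation above, not estimated here; `logistic_probability` is an illustrative name.

```python
import math

def logistic_probability(a, b, x):
    """Probability p such that log(p / (1 - p)) = a + b*x."""
    log_odds = a + b * x
    # Inverse of the logit: p = 1 / (1 + e^(-log_odds))
    return 1 / (1 + math.exp(-log_odds))

p = logistic_probability(-3, 1.1, 4)
print(round(p, 3))  # 0.802
```

The expression 1 / (1 + e^(-z)) is algebraically the same as e^z / (1 + e^z), which is the form used in the worked solution.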
