Correlation and Regression
Last Updated : 12 Sep, 2025
Correlation and regression are essential statistical tools used to analyze the relationship between variables. **Correlation** measures the strength and direction of a linear relationship between two variables, indicating how closely the two move together. **Regression** goes a step further: it models the relationship so that the value of a dependent variable can be predicted from one or more independent variables.
Correlation
Correlation quantifies how strongly, and in which direction, two variables are related. It is measured by a correlation coefficient, which ranges from -1 to +1.
- **Positive Correlation:** As the value of one variable goes up, the value of the other tends to go up as well.
- **Negative Correlation:** As the value of one variable goes up, the value of the other tends to go down, and vice versa.
- **No Correlation:** The variables show no consistent linear association, and the coefficient is close to 0.
Types of Correlation Coefficients
Some of the common correlation coefficients are:
- **Pearson Correlation Coefficient:** Measures the strength and direction of the linear (straight-line) relationship between two variables that are both on an interval or ratio scale.
- **Spearman's Rank Correlation:** Measures the strength of the monotonic association between two variables using their ranks.
- **Kendall's Tau:** Measures how consistently the ordering of one variable agrees with the ordering of the other, based on counts of concordant and discordant pairs.
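To make the pair-counting idea behind Kendall's Tau concrete, here is a minimal Python sketch of the tau-a variant, which assumes the data contain no ties. The function name `kendall_tau_a` and the sample values are illustrative assumptions made for this sketch, not part of any library.

```python
from itertools import combinations


def kendall_tau_a(x, y):
    """Kendall's tau-a: (concordant - discordant) / number of pairs.

    Assumes x and y have equal length (at least 2) and contain no ties.
    """
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must have the same length (>= 2)")
    concordant = discordant = 0
    # Compare every pair of observations exactly once.
    for (x1, y1), (x2, y2) in combinations(zip(x, y), 2):
        direction = (x2 - x1) * (y2 - y1)
        if direction > 0:
            concordant += 1  # both variables move the same way
        elif direction < 0:
            discordant += 1  # the variables move in opposite ways
    n_pairs = len(x) * (len(x) - 1) // 2
    return (concordant - discordant) / n_pairs


# Illustrative sample, made up for this sketch.
x = [10, 20, 30, 40, 50]
y = [30, 40, 10, 20, 50]
tau = kendall_tau_a(x, y)
print(tau)
```

A value near +1 means the two orderings mostly agree, a value near -1 means they mostly disagree, and a value near 0 means there is little consistent ordering.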
Regression
Regression analysis is a statistical tool used to model the relationship between a dependent variable and one or more independent variables. It is useful for forecasting the value of the dependent variable from the values of the independent variables.
Types of Regression
Common types of regression are:
- **Simple Linear Regression:** Models the relationship between one independent variable and one dependent variable with a linear equation.
- **Multiple Linear Regression:** Models the effect of two or more independent variables on one dependent variable.
- **Logistic Regression:** Applied when the dependent variable is categorical (most commonly binary).
- **Polynomial Regression:** Models the relationship between the dependent variable and the independent variable as an nth-degree polynomial.
Correlation and Regression Formula
Formulas related to Correlation and Regression are:
**Correlation Formula:** The Pearson correlation coefficient (r) is calculated using the formula:
r = \frac{\sum (X - \bar{X})(Y - \bar{Y})}{\sqrt{\sum (X - \bar{X})^2 \sum (Y - \bar{Y})^2}}
**Simple Linear Regression Formula:** The regression line is represented as:
Y = a + bX
Where:
- Y is the dependent variable,
- X is the independent variable,
- a is the intercept,
- b is the slope.
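As a short worked link between the two formulas, the least-squares slope and intercept can be written in the same deviation-from-mean notation as r, which shows that the slope is the correlation coefficient rescaled by the spread of the two variables. The symbols s_X and s_Y (sample standard deviations) and n (number of observations) are introduced here for illustration and do not appear in the formulas above.

```latex
b = \frac{\sum (X - \bar{X})(Y - \bar{Y})}{\sum (X - \bar{X})^2},
\qquad
a = \bar{Y} - b\bar{X}

b = r \cdot \frac{s_Y}{s_X},
\qquad
s_X = \sqrt{\frac{\sum (X - \bar{X})^2}{n - 1}},
\quad
s_Y = \sqrt{\frac{\sum (Y - \bar{Y})^2}{n - 1}}
```

The second line follows by dividing the expression for b by the expression for r. It also shows that b and r always share the same sign.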
Correlation vs Regression
Some of the common differences between correlation and regression are:
| Correlation | Regression |
|---|---|
| Measures the strength and direction of the linear relationship between two variables. | Models the relationship between a dependent variable and one or more independent variables. |
| Does not assume a causal relationship; measures association. | Implies a directional relationship from the independent to the dependent variable. |
| Correlation coefficient (r) ranges from -1 to 1. | Regression coefficients (slope and intercept) describe the relationship. |
| No equation involved. | Provides an equation (e.g., Y = a + bX in simple linear regression). |
| Does not predict values. | Used to predict the value of the dependent variable based on independent variables. |
| Common measures: Pearson, Spearman, Kendall. | Common types: simple linear, multiple linear, logistic, and polynomial. |
| Measures linear relationships (in Pearson correlation). | Can model linear and non-linear relationships (depending on the type of regression). |
| Requires two continuous variables (for Pearson correlation). | Requires one dependent and one or more independent variables. |
Solved Problems on Correlation and Regression
**Problem 1:** Given two variables, X and Y, calculate the Pearson correlation coefficient.
- X: [1, 2, 3, 4, 5]
- Y: [2, 4, 6, 8, 10]
**Solution:**
r = \frac{\sum (X - \bar{X})(Y - \bar{Y})}{\sqrt{\sum (X - \bar{X})^2 \sum (Y - \bar{Y})^2}}
**Calculating the mean of X and Y:**
- Mean of X (\bar{X}) = 3
- Mean of Y (\bar{Y}) = 6
**Calculating the Pearson correlation coefficient:**
r = \frac{(1-3)(2-6) + (2-3)(4-6) + (3-3)(6-6) + (4-3)(8-6) + (5-3)(10-6)}{\sqrt{(1-3)^2 + (2-3)^2 + (3-3)^2 + (4-3)^2 + (5-3)^2} \sqrt{(2-6)^2 + (4-6)^2 + (6-6)^2 + (8-6)^2 + (10-6)^2}}
= \frac{(-2)(-4) + (-1)(-2) + (0)(0) + (1)(2) + (2)(4)}{\sqrt{4 + 1 + 0 + 1 + 4} \sqrt{16 + 4 + 0 + 4 + 16}}
= \frac{20}{\sqrt{10} \cdot \sqrt{40}} = \frac{20}{\sqrt{400}} = \frac{20}{20} = 1
So, the Pearson correlation coefficient is r = 1, indicating a perfect positive linear relationship between X and Y.
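The hand calculation above can be cross-checked with a few lines of Python. This is a minimal sketch of the deviation-from-mean formula using only the standard library; the function name `pearson_r` is an illustrative choice, not a library API.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient via the deviation-from-mean formula."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must have the same length (>= 2)")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    dx = [xi - mean_x for xi in x]  # deviations of X from its mean
    dy = [yi - mean_y for yi in y]  # deviations of Y from its mean
    numerator = sum(a * b for a, b in zip(dx, dy))
    denominator = math.sqrt(sum(a * a for a in dx) * sum(b * b for b in dy))
    if denominator == 0:
        raise ValueError("r is undefined when either variable is constant")
    return numerator / denominator


# Data from Problem 1.
X = [1, 2, 3, 4, 5]
Y = [2, 4, 6, 8, 10]
r = pearson_r(X, Y)
print(r)
```

The guard on a zero denominator matters in practice: if either variable never changes, the coefficient is undefined rather than zero.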
**Problem 2:** Calculate Spearman's rank correlation coefficient for the following data:
- X: [10, 20, 30, 40, 50]
- Y: [30, 40, 10, 20, 50]
**Solution:**
**1. Rank the data points in X and Y:**
- Ranks of X: [1, 2, 3, 4, 5]
- Ranks of Y: [3, 4, 1, 2, 5]
**2. Calculate the difference between ranks (d_i), here taken as the rank of Y minus the rank of X:**
Differences (d_i): [2, 2, -2, -2, 0]
**3. Calculate the square of the differences (d_i^2):**
Squared differences (d_i^2): [4, 4, 4, 4, 0]
**4. Sum the squared differences:**
\sum d_i^2 = 4 + 4 + 4 + 4 + 0 = 16
**5. Apply Spearman's rank correlation formula (valid when there are no tied ranks):**
r_s = 1 - \frac{6 \sum d_i^2}{n(n^2 - 1)}
Here, n = 5:
r_s = 1 - \frac{6 \cdot 16}{5(5^2 - 1)} = 1 - \frac{96}{120} = 1 - 0.8 = 0.2
So, Spearman's rank correlation coefficient is r_s = 0.2, indicating a weak positive rank correlation between X and Y.
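The five steps above translate almost line for line into Python. The sketch below assumes there are no tied values, which is the case the shortcut formula covers; the helper names `ranks` and `spearman_rs` are illustrative.

```python
def ranks(values):
    """Rank values from 1 (smallest) to n. Assumes no ties."""
    ordered = sorted(values)
    return [ordered.index(v) + 1 for v in values]


def spearman_rs(x, y):
    """Spearman's r_s via 1 - 6 * sum(d^2) / (n * (n^2 - 1)).

    Only valid when neither variable contains tied values.
    """
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("x and y must have the same length (>= 2)")
    if len(set(x)) != n or len(set(y)) != n:
        raise ValueError("this shortcut formula assumes no ties")
    d_squared = [(rx - ry) ** 2 for rx, ry in zip(ranks(x), ranks(y))]
    return 1 - 6 * sum(d_squared) / (n * (n ** 2 - 1))


# Data from Problem 2.
X = [10, 20, 30, 40, 50]
Y = [30, 40, 10, 20, 50]
rs = spearman_rs(X, Y)
print(ranks(Y), round(rs, 3))
```

Because the differences are squared, taking them as rank of X minus rank of Y or the other way round gives the same result.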
**Problem 3:** Given the data points below, predict the value of Y for X = 6 using the equation Y = a + bX.
- **Data points:** (1, 2), (2, 4), (3, 6), (4, 8), (5, 10)
**Solution:**
**Calculate the slope (b) and intercept (a):**
b = \frac{\sum (X - \bar{X})(Y - \bar{Y})}{\sum (X - \bar{X})^2}
\bar{X} = 3, \quad \bar{Y} = 6
b = \frac{(1-3)(2-6) + (2-3)(4-6) + (3-3)(6-6) + (4-3)(8-6) + (5-3)(10-6)}{(1-3)^2 + (2-3)^2 + (3-3)^2 + (4-3)^2 + (5-3)^2} = \frac{20}{10} = 2
a = \bar{Y} - b\bar{X} = 6 - 2 \cdot 3 = 0
Equation: Y = 0 + 2X
For X = 6:
Y = 2 × 6 = 12
So, the predicted value of Y for X = 6 is 12.
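The same least-squares calculation can be sketched in Python to confirm the slope, intercept, and prediction. The function name `fit_line` is an illustrative assumption; the formulas are the ones used in the solution above.

```python
def fit_line(x, y):
    """Least-squares intercept a and slope b for Y = a + bX."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must have the same length (>= 2)")
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    if sxx == 0:
        raise ValueError("slope is undefined when all X values are equal")
    b = sxy / sxx            # slope
    a = mean_y - b * mean_x  # intercept
    return a, b


# Data from Problem 3.
X = [1, 2, 3, 4, 5]
Y = [2, 4, 6, 8, 10]
a, b = fit_line(X, Y)
prediction = a + b * 6
print(a, b, prediction)
```

Note that X = 6 lies outside the observed range of 1 to 5, so this prediction is an extrapolation and relies on the linear pattern continuing.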
**Problem 4:** Given the following data, predict the value of Y for X1 = 3 and X2 = 4:
**Data:**
- (X1, X2, Y)
- (1, 2, 3)
- (2, 3, 4)
- (3, 4, 5)
- (4, 5, 6)
- (5, 6, 7)
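**Solution:**
The multiple linear regression equation is Y = a + b1X1 + b2X2.
Because X2 = X1 + 1 in every row of this data, the two predictors are perfectly collinear and the coefficients are not unique: any values with b1 + b2 = 1 and a + b2 = 2 fit the data exactly. Using statistical software or a calculation tool (e.g., Excel, R), one such solution is:
- b1 = 0.5
- b2 = 0.5
- a = 1.5
**So, the equation becomes:**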
Y = 1.5 + 0.5X1 + 0.5X2
For X1 = 3 and X2 = 4:
Y = 1.5 + 0.5 × 3 + 0.5 × 4 = 1.5 + 1.5 + 2 = 5
So, the predicted value of Y for X1 = 3 and X2 = 4 is 5.
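One thing worth checking about this data set is that X2 = X1 + 1 in every row, so several different coefficient sets reproduce the data equally well. The sketch below verifies this directly for three candidate sets, including the one used above; the helper names and the two extra candidate triples are illustrative assumptions chosen to satisfy b1 + b2 = 1 and a + b2 = 2.

```python
# Rows of (X1, X2, Y) from Problem 4.
data = [(1, 2, 3), (2, 3, 4), (3, 4, 5), (4, 5, 6), (5, 6, 7)]


def predict(a, b1, b2, x1, x2):
    """Evaluate Y = a + b1*X1 + b2*X2."""
    return a + b1 * x1 + b2 * x2


def fits_exactly(a, b1, b2, rows, tol=1e-9):
    """True if the equation reproduces every observed Y within tol."""
    return all(abs(predict(a, b1, b2, x1, x2) - y) <= tol for x1, x2, y in rows)


# The predictors are perfectly collinear: X2 is always X1 + 1.
collinear = all(x2 == x1 + 1 for x1, x2, _ in data)

# Candidate (a, b1, b2) triples; the first is the one used in the solution.
candidates = [(1.5, 0.5, 0.5), (2.0, 1.0, 0.0), (1.0, 0.0, 1.0)]
results = [
    (fits_exactly(a, b1, b2, data), predict(a, b1, b2, 3, 4))
    for a, b1, b2 in candidates
]
print(collinear, results)
```

All three candidates fit the data and agree on the prediction at X1 = 3, X2 = 4, but they would disagree for an input that breaks the X2 = X1 + 1 pattern, so individual coefficients from this data should not be interpreted on their own.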
**Problem 5:** Given the following data, determine the probability of Y being 1 for X = 4:
**Data:**
- (X, Y)
- (1, 0)
- (2, 0)
- (3, 1)
- (4, 1)
- (5, 1)
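**Solution:**
**The logistic regression model is:**
\log\left(\frac{p}{1-p}\right) = a + bX
Here p is the probability that Y = 1 and \log is the natural logarithm. Note that this data set is perfectly separable (Y = 0 for X ≤ 2 and Y = 1 for X ≥ 3), so an unregularized maximum-likelihood fit has no finite solution; the coefficients below should be read as illustrative values for the calculation.
**Using statistical software or a calculation tool, suppose we obtain:**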
- b = 1.1
- a = -3
**So, the equation becomes:**
\log\left(\frac{p}{1-p}\right) = -3 + 1.1X
**For X = 4:**
\log\left(\frac{p}{1-p}\right) = -3 + 1.1 \times 4 = 1.4
**Solving for p:**
\frac{p}{1-p} = e^{1.4} \approx 4.055
p \approx \frac{4.055}{1 + 4.055} \approx 0.802
So, the probability of Y being 1 for X = 4 is approximately 0.802, or 80.2%.
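The conversion from log-odds to probability can be sketched in Python as follows. The coefficients a = -3 and b = 1.1 are taken from the solution above rather than fitted here, and the function name `logistic_probability` is an illustrative choice.

```python
import math


def logistic_probability(a, b, x):
    """P(Y = 1 | x) when the log-odds are a + b*x."""
    log_odds = a + b * x
    # Equivalent to e^log_odds / (1 + e^log_odds).
    return 1.0 / (1.0 + math.exp(-log_odds))


# Coefficients as given in Problem 5 (not re-estimated in this sketch).
p = logistic_probability(a=-3, b=1.1, x=4)
print(round(p, 3))
```

Solving the log-odds equation for p gives p = 1 / (1 + e^{-(a + bX)}), which is the form used in the function and is algebraically the same as the two-step calculation above.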