Statistics For Data Science

Last Updated : 15 Apr, 2026

Statistics is the science of collecting, analyzing, and interpreting data to uncover patterns and make decisions. In data science, it acts as the backbone for understanding data and building reliable models.

Types of Statistics

There are commonly two types of statistics, which are discussed below:

  1. **Descriptive Statistics:** Descriptive statistics summarizes and organizes large datasets so that they are easier to understand.
  2. **Inferential Statistics:** Inferential statistics uses a sample to draw conclusions about a larger group. It helps us make predictions and inferences about a population.

What is Data in Statistics?

Data is a collection of observations. It can take the form of numbers, words, measurements, or statements.

Types of Data

**1. Qualitative Data:** This data is descriptive. For example: "She is beautiful", "He is tall".

**2. Quantitative Data:** This is numerical information. For example: "A horse has four legs".

Basics of Statistics

The basic formulas of statistics are:

| Parameter | Definition | Formula |
|---|---|---|
| **Population Mean (μ)** | Average of the entire group | \mu = \frac{\sum x}{N} |
| **Sample Mean (x̄)** | Average of a subset of the population | \bar{x} = \frac{\sum x}{n} |
| **Population/Sample Standard Deviation** | Measures how spread out the data is from the mean | Population: \sigma = \sqrt{\frac{1}{N} \sum_{i=1}^{N} (x_i - \mu)^2}; Sample: s = \sqrt{\frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2} |
| **Population/Sample Variance** | Average squared distance of the values from the mean | Population: \sigma^2 = \frac{\sum (x - \mu)^2}{N}; Sample: s^2 = \frac{\sum (x - \bar{x})^2}{n-1} |
| **Class Interval (CI)** | Range of values in a group | CI = Upper Limit − Lower Limit |
| **Frequency (f)** | How often a value appears | Count of occurrences |
| **Range (R)** | Difference between the largest and smallest values | R = Max − Min |

Measure of Central Tendency

**1. Mean:** The mean is the sum of all values in the sample or population divided by the total number of values.

Formula: Mean (\mu) = \frac{Sum \, of \, Values}{Number \, of \, Values}

**2. Median:** The median is the middle value of a dataset once it has been sorted from lowest to highest (or highest to lowest). For an odd number of data points, the median is the middle value; for an even number, it is the average of the two middle values.

**3. Mode:** The mode is the most frequently occurring value in the sample or population.
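As a quick illustration of the three measures, the sketch below uses Python's standard `statistics` module on a small made-up list. The values are hypothetical and chosen only so the results are easy to check by hand.

```python
from statistics import mean, median, multimode

# Hypothetical sample values, chosen only for illustration.
values = [2, 3, 3, 5, 7, 10]

sample_mean = mean(values)        # sum of values / number of values
sample_median = median(values)    # even count: average of the two middle values
sample_modes = multimode(values)  # list of the most frequent value(s)

print(sample_mean, sample_median, sample_modes)
```

`multimode` is used instead of `mode` so that ties are reported as a list rather than silently resolved.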

Measure of Dispersion

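**1. Variance:** The average squared deviation of the values from the mean.

Formula: \sigma^2 = \frac{\sum (X-\mu)^2}{N}

**2. Standard Deviation:** The square root of the variance, expressed in the same units as the data.

Formula: \sigma = \sqrt{\sigma^2} = \sqrt{\frac{\sum (X-\mu)^2}{N}}

**3. Interquartile Range (IQR):** The spread of the middle 50% of the data.

Formula: IQR = Q_3 - Q_1

Q1 (First Quartile): Median of the lower 50% of the dataset (25th percentile).

Q2 (Second Quartile / Median): Median of the entire dataset (50th percentile).

Q3 (Third Quartile): Median of the upper 50% of the dataset (75th percentile).

**4. Mean Absolute Deviation:** The average absolute distance of the values from the mean.

Formula: Mean \, Absolute \, Deviation = \frac{\sum_{i=1}^{N}{|X_i - \mu|}}{N}

**5. Coefficient of Variation (CV):** The standard deviation expressed as a percentage of the mean.

Formula: CV = \frac{\sigma}{\mu} \times 100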
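The dispersion formulas above can be checked with a few lines of Python. This is a minimal sketch on hypothetical data; the quartiles follow the "median of each half" definition given above, and other quartile conventions (for example, interpolation-based ones) can give slightly different values.

```python
import math
from statistics import median

# Hypothetical data, chosen so the results are easy to verify by hand.
values = [2, 4, 4, 4, 5, 5, 7, 9]
N = len(values)
mu = sum(values) / N

variance = sum((x - mu) ** 2 for x in values) / N  # population variance
std_dev = math.sqrt(variance)                      # population standard deviation
mad = sum(abs(x - mu) for x in values) / N         # mean absolute deviation
cv = std_dev / mu * 100                            # coefficient of variation, in percent

# Quartiles as medians of the lower and upper halves of the sorted data.
# For an odd number of values the middle value is left out of both halves.
ordered = sorted(values)
half = N // 2
q1 = median(ordered[:half])
q3 = median(ordered[-half:])
iqr = q3 - q1

print(variance, std_dev, mad, cv, iqr)
```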

Measure of Shape

1. Skewness

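Skewness is a measure of the asymmetry of a probability distribution about its mean. A distribution is positively (right) skewed when its longer tail lies to the right of the mean, and negatively (left) skewed when its longer tail lies to the left.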

Figure: Types of skewed data

2. Kurtosis

Kurtosis measures the "tailedness" of a probability distribution, indicating whether it has heavier or lighter tails than a normal distribution. High kurtosis implies more extreme values (heavier tails), while low kurtosis implies fewer extreme values (lighter tails).

Figure: Types of kurtosis
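The text does not give formulas for these two shape measures. One common choice, assumed here, is the moment-based definition: skewness is the third central moment divided by the 1.5 power of the second, and excess kurtosis is the fourth central moment divided by the square of the second, minus 3 (so that a normal distribution scores 0). The sketch below uses these population-style formulas on hypothetical data; statistical packages often apply additional sample-size corrections.

```python
def central_moment(values, order):
    """Average of (x - mean) ** order over all values."""
    m = sum(values) / len(values)
    return sum((x - m) ** order for x in values) / len(values)

def skewness(values):
    # Third central moment, standardized by the variance to the power 1.5.
    return central_moment(values, 3) / central_moment(values, 2) ** 1.5

def excess_kurtosis(values):
    # Fourth central moment, standardized; subtracting 3 makes the normal distribution 0.
    return central_moment(values, 4) / central_moment(values, 2) ** 2 - 3

# Hypothetical datasets for illustration.
symmetric = [1, 2, 3, 4, 5]
right_skewed = [1, 1, 2, 2, 10]

print(skewness(symmetric), skewness(right_skewed))
print(excess_kurtosis(symmetric))
```

A symmetric dataset gives zero skewness, while the dataset with one large value on the right gives a positive value.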

Measure of Relationship

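**1. Covariance:** Measures how two variables vary together.

Formula: Cov(X,Y) = \frac{\sum(X_i-\overline{X})(Y_i - \overline{Y})}{n}

**2. Correlation:** The covariance scaled by the standard deviations of both variables, so that it always lies between -1 and 1.

Formula: \rho(X, Y) = \frac{Cov(X,Y)}{\sigma_X \sigma_Y}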
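A minimal sketch of both formulas, using the population form (division by n) shown above. The data is hypothetical and perfectly linear, so the correlation comes out at its maximum.

```python
import math

def covariance(xs, ys):
    """Population covariance: average product of deviations from the means."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    return sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / n

def correlation(xs, ys):
    """Covariance divided by the product of the population standard deviations."""
    sigma_x = math.sqrt(covariance(xs, xs))  # Cov(X, X) is the variance of X
    sigma_y = math.sqrt(covariance(ys, ys))
    return covariance(xs, ys) / (sigma_x * sigma_y)

# Hypothetical data where y is exactly 2 * x.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 6, 8, 10]

print(covariance(xs, ys), correlation(xs, ys))
```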

Probability Theory

Here are some basic concepts or terminologies used in probability:

| Term | Definition |
|---|---|
| Sample Space | The set of all possible outcomes in a probability experiment. |
| Event | A subset of the sample space. |
| Joint Probability (Intersection of Events) | Probability that events A and B both occur. Formula: P(A and B) = P(A) × P(B), valid when A and B are independent |
| Union of Events | Probability that event A or event B occurs. Formula: P(A or B) = P(A) + P(B) - P(A and B) |
| Conditional Probability | Probability that event A occurs given that event B has occurred. Formula: P(A \| B) = P(A and B)/P(B), with P(B) > 0 |
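These definitions can be made concrete by enumerating a small sample space. The sketch below assumes a single roll of a fair six-sided die, with event A "the roll is even" and event B "the roll is greater than 3"; both events are hypothetical choices for illustration.

```python
from fractions import Fraction

# Sample space for one roll of a fair die.
sample_space = {1, 2, 3, 4, 5, 6}
A = {x for x in sample_space if x % 2 == 0}  # even roll
B = {x for x in sample_space if x > 3}       # roll greater than 3

def prob(event):
    """Probability of an event when all outcomes are equally likely."""
    return Fraction(len(event), len(sample_space))

p_joint = prob(A & B)       # both A and B occur
p_union = prob(A | B)       # A or B occurs
p_cond = p_joint / prob(B)  # A given B

print(p_joint, p_union, p_cond)
# A and B are not independent here, so P(A and B) differs from P(A) * P(B).
print(prob(A) * prob(B))
```

Here the union rule can be verified directly, and the comparison in the last line shows that multiplying P(A) by P(B) only gives the joint probability when the events are independent.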

Bayes Theorem

Bayes' Theorem is a fundamental concept in probability theory that relates conditional probabilities. It is named after the Reverend Thomas Bayes, who first introduced the theorem. Bayes' Theorem is a mathematical formula that provides a way to update probabilities based on new evidence. The formula is as follows:

P(A|B) = \frac{P(B|A) \times P(A)}{P(B)}

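where P(A|B) is the posterior probability of A given that B has occurred, P(B|A) is the likelihood of B given A, P(A) is the prior probability of A, and P(B) is the overall probability of B, with P(B) > 0.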
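To see the formula at work, consider a diagnostic-test scenario with entirely hypothetical numbers: a condition with 1% prevalence, a test that detects it 95% of the time, and a 5% false positive rate. In the sketch below, P(B) is expanded with the law of total probability, which is an added step not shown in the formula above.

```python
def bayes_posterior(prior, likelihood, false_positive_rate):
    """P(A|B) for a binary hypothesis A and evidence B.

    prior               = P(A)
    likelihood          = P(B|A)
    false_positive_rate = P(B|not A)
    """
    # Law of total probability: P(B) = P(B|A)P(A) + P(B|not A)P(not A)
    evidence = likelihood * prior + false_positive_rate * (1 - prior)
    return likelihood * prior / evidence

# Hypothetical numbers, for illustration only.
posterior = bayes_posterior(prior=0.01, likelihood=0.95, false_positive_rate=0.05)
print(round(posterior, 3))
```

With these numbers the posterior is only about 16%, because the low prior means most positive results come from the much larger group without the condition.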

Types of Probability Functions

Probability Distribution Functions

1. Normal or Gaussian Distribution

The normal distribution is a continuous probability distribution characterized by its bell-shaped curve. It is fully described by its mean (μ) and standard deviation (σ).

**Formula:** f(X|\mu,\sigma)=\frac{e^{-0.5(\frac{X-\mu}{\sigma})^2}}{\sigma\sqrt{2\pi}}

**Empirical Rule (68-95-99.7 Rule):** ~68% of the data lies within 1σ of the mean, ~95% within 2σ, and ~99.7% within 3σ.

Figure: Normal distribution

**Use:** Detecting outliers, modeling natural phenomena.

**Central Limit Theorem:** The Central Limit Theorem (CLT) states that, regardless of the shape of the original population distribution (provided it has finite variance), the sampling distribution of the sample mean becomes approximately normal as the sample size grows large.
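The sketch below illustrates three of the points above: the density formula, the empirical rule, and the Central Limit Theorem. The empirical-rule probabilities are computed from the error function, and the CLT part is a small simulation that draws sample means from a uniform (clearly non-normal) population. The sample size of 30, the 2000 repetitions, and the seed are arbitrary choices made for illustration.

```python
import math
import random

def normal_pdf(x, mu=0.0, sigma=1.0):
    """Density of the normal distribution at x."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2 * math.pi))

def prob_within(k):
    """P(|X - mu| < k * sigma) for any normal distribution."""
    return math.erf(k / math.sqrt(2))

# Empirical rule: probability mass within 1, 2 and 3 standard deviations.
empirical_rule = [prob_within(k) for k in (1, 2, 3)]
print([round(p, 4) for p in empirical_rule])

# CLT simulation: means of n draws from Uniform(0, 1).
random.seed(0)
n, trials = 30, 2000
sample_means = [sum(random.random() for _ in range(n)) / n for _ in range(trials)]

mean_of_means = sum(sample_means) / trials
spread = math.sqrt(sum((m - mean_of_means) ** 2 for m in sample_means) / trials)

# Theory: the sample means centre on 0.5 with standard deviation sqrt((1/12) / n).
print(mean_of_means, spread, math.sqrt(1 / 12 / n))
```

The simulated spread of the sample means should sit close to the theoretical value in the last line, even though the underlying population is flat rather than bell-shaped.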

2. Student t-distribution

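The t-distribution, also known as Student's t-distribution, is a bell-shaped probability distribution with heavier tails than the normal distribution. It is used when estimating the mean of a normally distributed population from a small sample whose population standard deviation is unknown. Its shape is controlled by the degrees of freedom (df).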

f(t) =\frac{\Gamma\left(\frac{df+1}{2}\right)}{\sqrt{df\pi} \, \Gamma\left(\frac{df}{2}\right)} \left(1 + \frac{t^2}{df}\right)^{-\frac{df+1}{2}}

3. Chi-square Distribution

The chi-squared distribution, denoted \chi^2, is the distribution of the sum of the squares of k independent standard normal variables, where k is the degrees of freedom.

f(x; k) = \frac{1}{2^{k/2}\Gamma(k/2)} x^{\frac{k}{2}-1} e^{-\frac{x}{2}}, \quad x > 0
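Both the t and chi-squared densities can be evaluated directly with the gamma function from Python's `math` module. This is a minimal sketch of the two formulas above; the function names are illustrative, and for very large degrees of freedom `math.gamma` overflows, so a log-gamma formulation would be needed in practice.

```python
import math

def t_pdf(t, df):
    """Density of Student's t-distribution with df degrees of freedom."""
    coeff = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    return coeff * (1 + t * t / df) ** (-(df + 1) / 2)

def chi2_pdf(x, k):
    """Density of the chi-squared distribution with k degrees of freedom, for x > 0."""
    if x <= 0:
        return 0.0
    return x ** (k / 2 - 1) * math.exp(-x / 2) / (2 ** (k / 2) * math.gamma(k / 2))

print(t_pdf(0.0, df=1))   # with df = 1 this equals 1 / pi
print(chi2_pdf(2.0, k=2)) # with k = 2 this equals 0.5 * exp(-1)
```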

4. Binomial Distribution

The binomial distribution models the number of successes in a fixed number of independent Bernoulli trials, where each trial has the same probability of success (p).

**Formula:** P(X=k)=\binom{n}{k}p^k(1-p)^{n-k}

5. Poisson Distribution

The Poisson distribution models the number of events that occur in a fixed interval of time or space. It is characterized by a single parameter (λ), the average rate of occurrence.

**Formula:** P(X=k)=\frac{e^{-\lambda}\lambda^k}{k!}

6. Uniform Distribution

The continuous uniform distribution assigns a constant probability density to all values in a given range [a, b].

**Formula:** f(X)=\frac{1}{b-a}, \quad a \le X \le b
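The binomial, Poisson and uniform formulas translate almost line for line into Python. The sketch below uses only the standard library; the function names and the example parameters are illustrative.

```python
import math

def binomial_pmf(k, n, p):
    """P(X = k) for n independent trials with success probability p."""
    return math.comb(n, k) * p ** k * (1 - p) ** (n - k)

def poisson_pmf(k, lam):
    """P(X = k) for a Poisson distribution with average rate lam."""
    return math.exp(-lam) * lam ** k / math.factorial(k)

def uniform_pdf(x, a, b):
    """Density of the continuous uniform distribution on [a, b]."""
    return 1 / (b - a) if a <= x <= b else 0.0

print(binomial_pmf(2, n=4, p=0.5))  # 6 of the 16 equally likely outcomes: 0.375
print(poisson_pmf(0, lam=2.0))      # exp(-2)
print(uniform_pdf(3.0, a=2.0, b=6.0))
```

As a sanity check, summing `binomial_pmf(k, n, p)` over k from 0 to n gives 1.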

Parameter estimation for Statistical Inference

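**Bias:** The bias of an estimator \widehat{\theta} is the difference between its expected value and the true value of the parameter \theta.

Formula: Bias(\widehat{\theta}) = E(\widehat{\theta}) - \theta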
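A classic illustration of bias is the variance estimator that divides by n instead of n-1. The sketch below takes a tiny hypothetical population, enumerates every possible sample of size 2 drawn with replacement, and computes the expected value of each estimator exactly using fractions. The population and sample size are assumptions chosen to keep the enumeration small.

```python
from fractions import Fraction
from itertools import product

# Hypothetical population; its true variance is the parameter theta.
population = [1, 2, 3]
mu = Fraction(sum(population), len(population))
true_variance = sum((x - mu) ** 2 for x in population) / len(population)

def variance_estimate(sample, ddof):
    """Sum of squared deviations divided by (n - ddof)."""
    n = len(sample)
    m = Fraction(sum(sample), n)
    return sum((x - m) ** 2 for x in sample) / (n - ddof)

# Every equally likely sample of size 2, drawn with replacement.
samples = list(product(population, repeat=2))

def expected_value(ddof):
    return sum(variance_estimate(s, ddof) for s in samples) / len(samples)

bias_divide_by_n = expected_value(0) - true_variance          # biased estimator
bias_divide_by_n_minus_1 = expected_value(1) - true_variance  # unbiased estimator

print(true_variance, bias_divide_by_n, bias_divide_by_n_minus_1)
```

Dividing by n underestimates the variance on average (negative bias), while dividing by n-1 has zero bias in this setting.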

Hypothesis Testing

Hypothesis testing makes inferences about a population parameter based on a sample statistic.

**1. Null Hypothesis (H₀):** There is no significant difference or effect.

**2. Alternative Hypothesis (H₁):** There is a significant difference or effect, i.e. the null hypothesis is false.

**3. Degrees of freedom:** Degrees of freedom (df) represent the number of values in the final calculation of a statistic that are free to vary. For many one-sample statistics, df equals the sample size minus one (n-1).

**4. Level of Significance (\alpha):** This is the threshold used to determine statistical significance. It is the probability of rejecting H₀ when H₀ is actually true. Common values are 0.05, 0.01, or 0.10.

**5. p-value:** The p-value is the probability of observing results at least as extreme as those in the sample, assuming H₀ is true.

**6. Type I Error and Type II Error:** A Type I error is rejecting H₀ when it is true (a false positive). A Type II error is failing to reject H₀ when it is false (a false negative).

**7. Confidence Intervals:** A confidence interval is a range of values that is used to estimate the true value of a population parameter with a certain level of confidence. It provides a measure of the uncertainty or margin of error associated with a sample statistic, such as the sample mean or proportion.

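**Example of Hypothesis Testing (Website Redesign)**

An e-commerce company wants to know if a website redesign affects average user session time.

**Hypotheses:** H₀: the redesign does not change the average session time. H₁: the redesign changes the average session time.

**Significance Level:** α = 0.05
**Test:** Difference in means -> calculate p-value

**Interpretation:** If the p-value is at most 0.05, reject H₀ and conclude that the redesign affects average session time. Otherwise, fail to reject H₀.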
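A minimal sketch of this workflow is shown below. The session times are simulated, not real data, and the group means, spread, sample sizes and seed are all assumptions. The test statistic is the difference in means divided by its standard error, and the two-sided p-value uses a normal approximation, which is reasonable for samples of this size; with small samples a t-distribution would be the more careful choice.

```python
import math
import random
from statistics import mean, variance

def two_sample_test(a, b):
    """Difference-in-means statistic and two-sided p-value (normal approximation)."""
    se = math.sqrt(variance(a) / len(a) + variance(b) / len(b))
    stat = (mean(a) - mean(b)) / se
    p_value = math.erfc(abs(stat) / math.sqrt(2))  # equals 2 * (1 - Phi(|stat|))
    return stat, p_value

# Simulated session times in minutes (hypothetical, for illustration only).
random.seed(42)
old_design = [random.gauss(5.0, 1.5) for _ in range(200)]
new_design = [random.gauss(5.4, 1.5) for _ in range(200)]

alpha = 0.05
stat, p_value = two_sample_test(new_design, old_design)
decision = "reject H0" if p_value <= alpha else "fail to reject H0"
print(stat, p_value, decision)
```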

Statistical Tests

Parametric tests are statistical methods that assume the data follows a known distribution, typically the normal distribution.

| | Z-test | t-test | F-test |
|---|---|---|---|
| Purpose | Tests if a sample mean differs from a known population mean. | Compares means when the population standard deviation is unknown. | Compares variances of two or more groups. |
| When to use | Population standard deviation is known and sample size is large. | Small samples or unknown population standard deviation. | To test if group variances are significantly different. |
| Formula | One-sample: Z = \frac{\overline{X}-\mu}{\frac{\sigma}{\sqrt{n}}}; Two-sample: Z = \frac{\overline{X_1} -\overline{X_2}}{\sqrt{\frac{\sigma_{1}^{2}}{n_1} + \frac{\sigma_{2}^{2}}{n_2}}} | One-sample: t = \frac{\overline{X}- \mu}{\frac{s}{\sqrt{n}}}; Two-sample: t = \frac{\overline{X_1} - \overline{X_2}}{\sqrt{\frac{s_{1}^{2}}{n_1} + \frac{s_{2}^{2}}{n_2}}}; Paired: t = \frac{\overline{d}}{\frac{s_d}{\sqrt{n}}}, where d is the difference within each pair | F = \frac{s_{1}^{2}}{s_{2}^{2}} |
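The test statistics in the table can be computed with a few small helper functions. The sketch below covers the one-sample Z, one-sample t, paired t and F statistics only; turning a statistic into a p-value requires the corresponding distribution, which is left out here. The function names and inputs are illustrative.

```python
import math
from statistics import mean, stdev, variance

def one_sample_z(sample_mean, mu, sigma, n):
    """Z statistic when the population standard deviation sigma is known."""
    return (sample_mean - mu) / (sigma / math.sqrt(n))

def one_sample_t(sample, mu):
    """t statistic using the sample standard deviation s."""
    return (mean(sample) - mu) / (stdev(sample) / math.sqrt(len(sample)))

def paired_t(before, after):
    """t statistic on the per-pair differences d = after - before."""
    d = [a - b for a, b in zip(after, before)]
    return mean(d) / (stdev(d) / math.sqrt(len(d)))

def f_ratio(sample1, sample2):
    """Ratio of the two sample variances."""
    return variance(sample1) / variance(sample2)

# Hypothetical inputs, chosen so the results are easy to verify by hand.
print(one_sample_z(sample_mean=52, mu=50, sigma=10, n=100))  # (52 - 50) / (10 / 10)
print(one_sample_t([2, 4, 6], mu=4))
print(paired_t(before=[10, 12, 14], after=[11, 14, 17]))
print(f_ratio([1, 2, 3], [2, 4, 6]))
```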

ANOVA (Analysis Of Variance)

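| Source of Variation | Sum of Squares | Degrees of Freedom | Mean Squares | F-Value |
|---|---|---|---|---|
| Between Groups | SSB = \sum_{i} n_i(\bar{x}_i - \bar{x})^2 | df1 = k-1 | MSB = SSB/(k-1) | F = MSB/MSE |
| Error | SSE = \sum_{i}\sum_{j} (x_{ij} - \bar{x}_i)^2 | df2 = N-k | MSE = SSE/(N-k) | |
| Total | SST = SSB+SSE | df3 = N-1 | | |

Here k is the number of groups, N is the total number of observations, n_i is the size of group i, \bar{x}_i is the mean of group i, and \bar{x} is the grand mean.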

There are mainly **two types** of ANOVA:

1. One-way ANOVA: Compares the means of three or more groups defined by a single categorical variable.

2. Two-way ANOVA: Tests the impact of two categorical variables and their interaction.
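The one-way ANOVA quantities can be computed step by step as sketched below. The error sum of squares is taken as the squared deviation of each observation from its own group mean, with N - k error degrees of freedom, which is the standard definition. The three groups are hypothetical data picked so the arithmetic is easy to follow.

```python
def one_way_anova(groups):
    """Return SSB, SSE, MSB, MSE and F for a list of groups of observations."""
    k = len(groups)                    # number of groups
    N = sum(len(g) for g in groups)    # total number of observations
    grand_mean = sum(sum(g) for g in groups) / N
    group_means = [sum(g) / len(g) for g in groups]

    # Between-group variation: group means around the grand mean.
    ssb = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, group_means))
    # Within-group (error) variation: observations around their own group mean.
    sse = sum((x - m) ** 2 for g, m in zip(groups, group_means) for x in g)

    msb = ssb / (k - 1)
    mse = sse / (N - k)
    return {"SSB": ssb, "SSE": sse, "MSB": msb, "MSE": mse, "F": msb / mse}

# Hypothetical data: three groups of three observations each.
result = one_way_anova([[1, 2, 3], [2, 3, 4], [3, 4, 5]])
print(result)
```

For this data SSB and SSE are both 6, giving MSB = 3, MSE = 1 and F = 3. The F value would then be compared against an F distribution with (k-1, N-k) degrees of freedom.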

Chi-Squared Test

The chi-squared test is a statistical test used to determine if there is a significant association between two categorical variables. It compares the observed frequencies (O) in a contingency table with the frequencies expected (E) if the two variables were independent.

**Formula:** \chi^2=\sum{\frac{(O_{ij}-E_{ij})^2}{E_{ij}}}

The test relies on a large-sample approximation, so it is best suited to datasets with many observations. A common rule of thumb is an expected count of at least 5 in each cell.
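The sketch below computes the chi-squared statistic for a contingency table, taking the expected count of each cell as (row total × column total) / grand total, which is the count implied by independence. The 2×2 table is hypothetical.

```python
def chi_square_statistic(observed):
    """Chi-squared statistic and degrees of freedom for a contingency table."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    total = sum(row_totals)

    stat = 0.0
    for i, row in enumerate(observed):
        for j, o in enumerate(row):
            e = row_totals[i] * col_totals[j] / total  # expected count under independence
            stat += (o - e) ** 2 / e

    dof = (len(observed) - 1) * (len(observed[0]) - 1)
    return stat, dof

# Hypothetical 2x2 table of observed counts.
stat, dof = chi_square_statistic([[10, 20], [20, 10]])
print(stat, dof)
```

Every expected count here is 15, so the statistic is 4 × 25/15 ≈ 6.67 with 1 degree of freedom.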

Non-Parametric Test

Non-parametric tests do not assume that the data follows a specific distribution. They are useful when data does not meet the assumptions required for parametric tests.

A/B Testing or Split Testing

A/B testing, also known as split testing, is a method used to compare two versions (A and B) of a webpage, app, or marketing asset to determine which one performs better.

**Example:** A product manager changes a website's "Shop Now" button color from green to blue to improve the click-through rate (CTR). After formulating null and alternative hypotheses, users are divided into groups A and B and their CTRs are recorded. A statistical test such as the chi-square test or t-test is applied at a 5% significance level (95% confidence level). If the p-value is below 0.05, the manager may conclude that changing the button color significantly affects CTR, informing the decision on permanent implementation.
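One way to run this comparison is a two-proportion z-test, sketched below. For a 2×2 table this is equivalent to the chi-square test mentioned above (the squared z statistic equals the chi-squared statistic). The click and visitor counts are hypothetical.

```python
import math

def two_proportion_z_test(clicks_a, n_a, clicks_b, n_b):
    """z statistic and two-sided p-value for a difference in click-through rates."""
    p_a, p_b = clicks_a / n_a, clicks_b / n_b
    pooled = (clicks_a + clicks_b) / (n_a + n_b)  # CTR under H0 (no difference)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    p_value = math.erfc(abs(z) / math.sqrt(2))    # two-sided, normal distribution
    return z, p_value

# Hypothetical results: green button (A) vs blue button (B).
z, p_value = two_proportion_z_test(clicks_a=100, n_a=1000, clicks_b=130, n_b=1000)
print(round(z, 2), round(p_value, 3))
print("significant at 5%" if p_value < 0.05 else "not significant at 5%")
```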

Regression

Regression is a statistical technique used to model the relationship between a dependent variable and one or more independent variables.

The equation for simple linear regression: y=\alpha+ \beta x

Where,

y is the dependent variable, x is the independent variable, \alpha is the intercept, and \beta is the regression coefficient (slope).

The regression coefficient measures the strength and direction of the relationship between the predictor variable (independent variable) and the response variable (dependent variable):

\beta = \frac{\sum(X_i-\overline{X})(Y_i - \overline{Y})}{\sum(X_i-\overline{X})^2}
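The coefficient formula can be applied directly to fit a line. In the sketch below the intercept is computed as \overline{Y} - \beta \overline{X}, the standard least-squares result, which is not stated in the text above. The data points are hypothetical and lie exactly on the line y = 1 + 2x.

```python
def fit_line(xs, ys):
    """Least-squares estimates (alpha, beta) for y = alpha + beta * x."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    beta = (
        sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
        / sum((x - x_bar) ** 2 for x in xs)
    )
    alpha = y_bar - beta * x_bar  # the fitted line passes through (x_bar, y_bar)
    return alpha, beta

# Hypothetical data generated from y = 1 + 2x with no noise.
xs = [1, 2, 3, 4]
ys = [3, 5, 7, 9]

alpha, beta = fit_line(xs, ys)
print(alpha, beta)

# Predict y for a new x using the fitted line.
prediction = alpha + beta * 5
print(prediction)
```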