Non Parametric Methods in Statistics (original) (raw)
Non-parametric methods in statistics are techniques that do not assume a specific probability distribution for the data. Unlike parametric methods, which rely on fixed parameters (e.g., mean, variance), non-parametric methods are more flexible and useful when dealing with unknown or complex distributions. These methods are widely applied in hypothesis testing, regression, density estimation and classification.
Common Non-Parametric Statistical Tests
Wilcoxon Rank-Sum Test (Mann-Whitney U Test)
Used to compare two independent groups when normality assumptions do not hold.
U = n_1 n_2 + \frac{n_1 (n_1 + 1)}{2} - R_1
**where:
- U is the Mann-Whitney statistic,
- n1, n2 are the sample sizes,
- R1 is the sum of ranks for group 1. Python `
from scipy.stats import mannwhitneyu x = [3, 5, 7, 9] y = [2, 4, 6, 8] stat, p = mannwhitneyu(x, y) print("Mann-Whitney U test statistic:", stat, "p-value:", p)
`
**Output
Mann-Whitney U test statistic: 10.0 p-value: 0.6857142857142857
Kruskal-Wallis Test
A non-parametric alternative to ANOVA for comparing more than two groups.
H = \frac{12}{N(N+1)} \sum \frac{R_i^2}{n_i} - 3(N+1)
where:
- H is the Kruskal-Wallis statistic,
- Ri is the rank sum for group i,
- ni is the sample size of group i,
- N is the total sample size. Python `
from scipy.stats import kruskal stat, p = kruskal([1, 2, 3], [4, 5, 6], [7, 8, 9]) print("Kruskal-Wallis test statistic:", stat, "p-value:", p)
`
**Output
Kruskal-Wallis test statistic: 7.200000000000003 p-value: 0.02732372244729252
Non-Parametric Regression
1. Kernel Density Estimation (KDE)
KDE is a technique to estimate the probability density function (PDF) of a dataset.
\hat{f}(x) = \frac{1}{nh} \sum_{i=1}^{n} K \left( \frac{x - x_i}{h} \right)
**where:
- K(.) is the kernel function (e.g., Gaussian kernel),
- h is the bandwidth parameter,
- xi are sample points. Python `
import numpy as np import seaborn as sns import matplotlib.pyplot as plt
data = np.random.randn(100) sns.kdeplot(data, bw_adjust=0.5) plt.show()
`
**Output

2. k-Nearest Neighbors (k-NN) Regression
k-NN is a simple, non-parametric regression method that predicts the target variable based on the mean (or median) of the nearest k neighbors.
\hat{y} = \frac{1}{k} \sum_{i=1}^{k} y_i
where yi are the values of the k nearest neighbors.
Implementation of K-Nearest Neighbors Regression
Python `
from sklearn.neighbors import KNeighborsRegressor X = np.array([[1], [2], [3], [4], [5]]) y = np.array([2, 4, 6, 8, 10]) knn = KNeighborsRegressor(n_neighbors=2) knn.fit(X, y) print(knn.predict([[3.5]]))
`
**Output
[7.]
3. Bootstrap Methods
Bootstrap methods are resampling techniques used to estimate the sampling distribution of a statistic.
Algorithm:
- Randomly sample with replacement from the original dataset.
- Compute the statistic of interest (e.g., mean, median) on the resampled dataset.
- Repeat this process many times (e.g., 1000 iterations).
- Use the empirical distribution of the computed statistic for inference. Python `
from sklearn.utils import resample import numpy as np
sample = np.array([3, 5, 7, 9, 11]) bootstrap_samples = [resample(sample, replace=True, n_samples=len(sample)) for _ in range(1000)] bootstrap_means = [np.mean(s) for s in bootstrap_samples] print("Bootstrap Mean Estimate:", np.mean(bootstrap_means))
`
**Output
Bootstrap Mean Estimate: 6.9883999999999995
Advantages
- No need for strict assumptions about data distribution.
- More flexible in handling real-world data.
- Useful for small datasets where parametric assumptions fail.
Disadvantages
- Less efficient for large datasets compared to parametric methods.
- Higher computational cost due to resampling or rank calculations.
- May require larger sample sizes to achieve reliable results.