Non-Parametric Methods in Statistics

Non-parametric methods in statistics are techniques that do not assume a specific probability distribution for the data. Unlike parametric methods, which assume a distribution family described by a fixed, finite set of parameters (e.g., the mean and variance of a normal distribution), non-parametric methods let the data determine the shape of the model. This makes them useful when the underlying distribution is unknown or complex. These methods are widely applied in hypothesis testing, regression, density estimation and classification.

Common Non-Parametric Statistical Tests

Wilcoxon Rank-Sum Test (Mann-Whitney U Test)

Used to compare two independent groups when normality assumptions do not hold.

U = n_1 n_2 + \frac{n_1 (n_1 + 1)}{2} - R_1

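where:

U is the Mann-Whitney statistic,
n_1, n_2 are the sample sizes,
R_1 is the sum of ranks for group 1.

The rank sum of either group can be used; the two resulting U values always sum to n_1 n_2. Recent versions of SciPy report the value for the first sample, R_1 - \frac{n_1 (n_1 + 1)}{2}, so the printed statistic can differ from the formula above while the p-value is unaffected.

```python
from scipy.stats import mannwhitneyu

x = [3, 5, 7, 9]
y = [2, 4, 6, 8]

# Compare the two independent samples
stat, p = mannwhitneyu(x, y)
print("Mann-Whitney U test statistic:", stat, "p-value:", p)
```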

Output:

Mann-Whitney U test statistic: 10.0 p-value: 0.6857142857142857
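As a cross-check, the statistic can be computed by hand from the pooled ranks. The sketch below is illustrative only: it assumes `scipy.stats.rankdata` for the ranking (ties, absent here, would get average ranks), and the variable names are made up for this example. It evaluates the formula as written and its complement n_1 n_2 - U, which is the value SciPy printed above.

```python
import numpy as np
from scipy.stats import rankdata

x = [3, 5, 7, 9]
y = [2, 4, 6, 8]
n1, n2 = len(x), len(y)

# Rank the pooled sample, then sum the ranks belonging to group 1.
ranks = rankdata(np.concatenate([x, y]))
R1 = ranks[:n1].sum()

U_formula = n1 * n2 + n1 * (n1 + 1) / 2 - R1  # formula from the text
U_first = R1 - n1 * (n1 + 1) / 2              # U for the first sample

print("R1:", R1)
print("U (formula):", U_formula, "U (first sample):", U_first)
print("Sum equals n1 * n2:", U_formula + U_first == n1 * n2)
```

Either value leads to the same two-sided p-value, since the test only depends on how far U lies from its expected value n_1 n_2 / 2.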

Kruskal-Wallis Test

A non-parametric alternative to ANOVA for comparing more than two groups.

H = \frac{12}{N(N+1)} \sum_{i=1}^{k} \frac{R_i^2}{n_i} - 3(N+1)

where:

H is the Kruskal-Wallis statistic,
k is the number of groups,
R_i is the rank sum for group i,
n_i is the sample size of group i,
N is the total sample size.

```python
from scipy.stats import kruskal

# Three independent groups with no overlapping values
stat, p = kruskal([1, 2, 3], [4, 5, 6], [7, 8, 9])
print("Kruskal-Wallis test statistic:", stat, "p-value:", p)
```

Output:

Kruskal-Wallis test statistic: 7.200000000000003 p-value: 0.02732372244729252
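The H formula above can be checked against this output with a few lines of NumPy. This is a minimal sketch under two assumptions: the data contain no ties (SciPy applies a tie correction that this sketch omits), and the p-value comes from the chi-square approximation with k - 1 degrees of freedom. Variable names are illustrative.

```python
import numpy as np
from scipy.stats import rankdata, chi2

groups = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
sizes = [len(g) for g in groups]
N = sum(sizes)

# Rank the pooled data, then split the ranks back into their groups.
ranks = rankdata(np.concatenate(groups))
rank_sums = [r.sum() for r in np.split(ranks, np.cumsum(sizes)[:-1])]

# H = 12 / (N (N + 1)) * sum(R_i^2 / n_i) - 3 (N + 1)
H = 12 / (N * (N + 1)) * sum(R**2 / n for R, n in zip(rank_sums, sizes)) - 3 * (N + 1)

# Chi-square approximation with k - 1 degrees of freedom
p = chi2.sf(H, df=len(groups) - 1)
print("H:", H, "p-value:", p)
```

With rank sums of 6, 15 and 24, the formula gives H of about 7.2, in line with the SciPy result.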

Non-Parametric Estimation and Regression

1. Kernel Density Estimation (KDE)

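KDE estimates the probability density function (PDF) of a dataset by centering a kernel (a smooth, symmetric bump) on each observation and averaging them. The bandwidth controls the smoothness: small values give a wiggly estimate, while large values can smooth away real features.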

\hat{f}(x) = \frac{1}{nh} \sum_{i=1}^{n} K \left( \frac{x - x_i}{h} \right)

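where:

K(.) is the kernel function (e.g., Gaussian kernel),
h is the bandwidth parameter,
n is the number of sample points,
x_i are sample points.

```python
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

# 100 draws from a standard normal distribution
data = np.random.randn(100)

# bw_adjust < 1 narrows the default bandwidth, giving a less smooth curve
sns.kdeplot(data, bw_adjust=0.5)
plt.show()
```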

Output: a plot of the estimated density curve (y-axis: Density).
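To make the KDE formula concrete, the estimator can also be evaluated directly in NumPy. This is a sketch under stated assumptions: a Gaussian kernel, a hand-picked bandwidth h = 0.4, and an arbitrary random seed, none of which come from the seaborn example. The function name `gaussian_kde_manual` is hypothetical.

```python
import numpy as np

def gaussian_kde_manual(x_grid, samples, h):
    """Evaluate f_hat(x) = 1/(n h) * sum_i K((x - x_i) / h) with a Gaussian K."""
    u = (x_grid[:, None] - samples[None, :]) / h   # shape (grid points, n)
    K = np.exp(-0.5 * u**2) / np.sqrt(2 * np.pi)   # standard normal pdf
    return K.sum(axis=1) / (len(samples) * h)

rng = np.random.default_rng(0)   # arbitrary seed, for reproducibility only
samples = rng.standard_normal(100)
grid = np.linspace(-6, 6, 1201)

# h = 0.4 is an illustrative bandwidth, not a recommended default
density = gaussian_kde_manual(grid, samples, h=0.4)

# A valid density estimate is non-negative and integrates to about 1.
area = density.sum() * (grid[1] - grid[0])   # Riemann-sum approximation
print("Approximate area under the curve:", round(area, 3))
```

Because `bw_adjust` in seaborn rescales an automatically chosen bandwidth while h here is set directly, the two curves will be similar in shape but not identical.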

2. k-Nearest Neighbors (k-NN) Regression

k-NN is a simple, non-parametric regression method that predicts the target variable based on the mean (or median) of the nearest k neighbors.

\hat{y} = \frac{1}{k} \sum_{i=1}^{k} y_i

where y_i are the target values of the k nearest neighbors.

Implementation of K-Nearest Neighbors Regression

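```python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor

# Training data following y = 2x
X = np.array([[1], [2], [3], [4], [5]])
y = np.array([2, 4, 6, 8, 10])

# Predict by averaging the targets of the 2 nearest neighbors
knn = KNeighborsRegressor(n_neighbors=2)
knn.fit(X, y)
print(knn.predict([[3.5]]))
```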

Output:

[7.]
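The same prediction can be reproduced without scikit-learn, which shows that k-NN regression is just the averaging formula applied to the closest training points. The helper `knn_predict` below is a hypothetical name for this sketch, and it assumes Euclidean distance with uniform weights.

```python
import numpy as np

def knn_predict(X_train, y_train, x_query, k):
    """Average the targets of the k training points closest to x_query."""
    distances = np.linalg.norm(X_train - x_query, axis=1)
    nearest = np.argsort(distances)[:k]   # indices of the k smallest distances
    return y_train[nearest].mean()

X = np.array([[1], [2], [3], [4], [5]], dtype=float)
y = np.array([2, 4, 6, 8, 10], dtype=float)

# The nearest neighbors of 3.5 are x = 3 and x = 4, so the result is (6 + 8) / 2.
print(knn_predict(X, y, np.array([3.5]), k=2))
```

A brute-force search like this costs one distance computation per training point for every query; library implementations typically offer tree-based indexes for larger datasets.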

3. Bootstrap Methods

Bootstrap methods are resampling techniques used to estimate the sampling distribution of a statistic.

Algorithm:

Randomly sample with replacement from the original dataset.
Compute the statistic of interest (e.g., mean, median) on the resampled dataset.
Repeat this process many times (e.g., 1000 iterations).
Use the empirical distribution of the computed statistic for inference.

```python
import numpy as np
from sklearn.utils import resample

sample = np.array([3, 5, 7, 9, 11])

# Draw 1000 resamples of the same size as the original, with replacement
bootstrap_samples = [
    resample(sample, replace=True, n_samples=len(sample))
    for _ in range(1000)
]

# Compute the statistic of interest (the mean) on each resample
bootstrap_means = [np.mean(s) for s in bootstrap_samples]
print("Bootstrap Mean Estimate:", np.mean(bootstrap_means))
```

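Output:

Bootstrap Mean Estimate: 6.9883999999999995

The exact value changes from run to run because the resampling is random and no seed is set; it should stay close to the sample mean of 7.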
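The last step of the algorithm, using the empirical distribution for inference, is where the bootstrap earns its keep. One common choice is a percentile confidence interval, sketched below with NumPy only. The seed, the 95% level, and the use of the percentile method (rather than alternatives such as BCa) are assumptions made for this illustration.

```python
import numpy as np

rng = np.random.default_rng(42)   # arbitrary seed, for reproducibility only
sample = np.array([3, 5, 7, 9, 11])

# Draw 1000 bootstrap resamples (with replacement) and record each mean.
boot_means = np.array([
    rng.choice(sample, size=len(sample), replace=True).mean()
    for _ in range(1000)
])

# Percentile method: the middle 95% of the bootstrap distribution.
lower, upper = np.percentile(boot_means, [2.5, 97.5])

# The spread of the bootstrap means estimates the standard error of the mean.
std_error = boot_means.std(ddof=1)

print(f"95% CI for the mean: [{lower:.2f}, {upper:.2f}]")
print(f"Bootstrap standard error: {std_error:.2f}")
```

With only five observations the interval is rough, so it is best read as an illustration of the mechanics rather than a reliable estimate.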

Advantages

No need for strict assumptions about data distribution.
More flexible in handling real-world data.
Useful for small datasets where parametric assumptions fail.

Disadvantages

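Less statistically efficient (lower power) than parametric methods when the parametric assumptions actually hold.
Higher computational cost due to resampling or rank calculations.
As a consequence of the lower efficiency, may require larger sample sizes to achieve comparably reliable results.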