Last Minute Notes (LMNs): Machine Learning

Last Updated : 23 Jul, 2025

Machine Learning (ML) is a branch of artificial intelligence where computers learn from data to make decisions or predictions without being explicitly programmed.

There are two main types of learning in ML:

**Supervised Learning:** The model is trained on labeled data, meaning it learns from input-output pairs to predict outcomes for new data.
**Unsupervised Learning:** The model is given data without labels and must find patterns or groupings on its own.

Supervised Learning

There are two main types of tasks in supervised learning:

**Regression:** The model predicts a continuous value. For example, predicting house prices based on features like location and size. Algorithms: simple linear regression, multiple linear regression.
**Classification:** The model predicts a category or class. For example, spam vs. non-spam emails. Algorithms: logistic regression, k-nearest neighbors, naïve Bayes, linear discriminant analysis, support vector machine, decision tree.

1. Simple Linear Regression

Simple Linear Regression models the relationship between two variables by fitting a linear equation to the observed data. It predicts the value of a dependent variable based on the value of an independent variable. The relationship is represented as:

Y=β_0+β_1X+ϵ

Key metrics used to evaluate the fit are Mean Squared Error (MSE) and R^2 (the coefficient of determination).
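As a minimal sketch of the fit and the two metrics, the least-squares estimates of β_0 and β_1 can be computed in closed form (slope = covariance of X and Y divided by variance of X). The function names and the toy data below are illustrative assumptions, not part of the original notes.

```python
def fit_simple_linear_regression(xs, ys):
    """Return (beta_0, beta_1) minimizing squared error for y = beta_0 + beta_1 * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope = S_xy / S_xx; the intercept then follows from the two means.
    s_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    s_xx = sum((x - mean_x) ** 2 for x in xs)
    beta_1 = s_xy / s_xx
    beta_0 = mean_y - beta_1 * mean_x
    return beta_0, beta_1


def mse(ys, preds):
    """Mean of the squared residuals."""
    return sum((y - p) ** 2 for y, p in zip(ys, preds)) / len(ys)


def r_squared(ys, preds):
    """1 - (residual sum of squares / total sum of squares)."""
    mean_y = sum(ys) / len(ys)
    ss_res = sum((y - p) ** 2 for y, p in zip(ys, preds))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return 1 - ss_res / ss_tot


# Hypothetical toy data, roughly following y = 2x
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.2, 4.1, 6.2, 7.9, 10.1]
beta_0, beta_1 = fit_simple_linear_regression(xs, ys)
preds = [beta_0 + beta_1 * x for x in xs]
print(beta_0, beta_1, mse(ys, preds), r_squared(ys, preds))
```

An R^2 close to 1 indicates that the line explains most of the variation in Y; MSE is in squared units of Y, so it is mainly useful for comparing models on the same data.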

**Assumptions of Linear Regression:**

Linearity
Normality of residuals
Homoscedasticity (constant variance of residuals)
Independence of observations

2. Multiple Linear Regression

A multiple linear regression model predicts a continuous output from multiple input features. It extends simple linear regression by using more than one independent variable to model the relationship.

Y=β_0 +β_1X_1+β_2X_2+…+β_pX_p+ϵ

**Assumptions of Multiple Linear Regression:**

Same as simple linear regression.
No multicollinearity (check using VIF).

2. Logistic Regression

Logistic Regression is used for binary classification problems, where the goal is to predict the probability of an outcome belonging to one of two classes. Unlike linear regression, which predicts unbounded continuous values, logistic regression outputs a probability between 0 and 1.

\text{Probability of Class 1}=\frac{1}{1+e^{-(b_0+b_1X_1+b_2X_2+...+b_nX_n)}}

where b_0, b_1, ... , b_n are the coefficients, and X_1, X_2, ... , X_n are the input features.

If \text{Probability of Class 1} > 0.5 , classify as 1, else 0.

The model is trained by minimizing the log-loss (binary cross-entropy), which penalizes predicted probabilities that are far from the true labels.
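The prediction rule and the loss above can be sketched in a few lines. This is only an illustration: the coefficient values are hypothetical, and the small `eps` clipping is an added assumption to avoid taking log(0).

```python
import math


def predict_proba(coeffs, intercept, features):
    """Probability of class 1: sigmoid of b_0 + b_1*X_1 + ... + b_n*X_n."""
    z = intercept + sum(b * x for b, x in zip(coeffs, features))
    return 1.0 / (1.0 + math.exp(-z))


def classify(prob, threshold=0.5):
    """Class 1 if the probability exceeds the threshold, else class 0."""
    return 1 if prob > threshold else 0


def log_loss(y_true, probs, eps=1e-15):
    """Average binary cross-entropy over all samples."""
    total = 0.0
    for y, p in zip(y_true, probs):
        p = min(max(p, eps), 1 - eps)  # clip so log() never sees 0 (assumption)
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(y_true)


# Hypothetical coefficients and two samples
coeffs, intercept = [0.8, -0.4], -0.5
samples = [[2.0, 1.0], [0.5, 3.0]]
labels = [1, 0]
probs = [predict_proba(coeffs, intercept, s) for s in samples]
print(probs, [classify(p) for p in probs], log_loss(labels, probs))
```

Confident but wrong predictions (for example, probability 0.99 for a sample whose true label is 0) contribute a large term to the loss, which is why log-loss is preferred over plain accuracy for training.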

**Assumptions of Logistic Regression:**

Linearity between predictors and log-odds.
Independent observations.
No multicollinearity.

3. K-Nearest Neighbors (KNN)

K-Nearest Neighbors (KNN) makes predictions by finding the k closest data points (neighbors) in the training dataset; for classification, the predicted class is typically the majority class among those neighbors. The distance between points is typically measured using Euclidean, Manhattan, or Minkowski distance.

Choosing k:

A small value of k (e.g., k=1) makes the model sensitive to noise (overfitting).
A large value of k smooths the decision boundary but may underfit (the model becomes too simplistic).
Typically, odd values for k are chosen in binary classification to avoid ties.

KNN is a lazy learning algorithm, meaning it does not learn a model during training but makes predictions by comparing test points to the entire training dataset.
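A minimal sketch of KNN classification, assuming Euclidean distance and a simple majority vote. The function names and the toy points are hypothetical; note that there is no training step, only a lookup over the stored data at prediction time.

```python
import math
from collections import Counter


def euclidean(a, b):
    """Euclidean distance between two points of equal dimension."""
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))


def knn_predict(train_points, train_labels, query, k=3):
    """Predict the label of `query` by majority vote among its k nearest neighbors."""
    # Sort all training points by distance to the query and keep the k closest.
    neighbors = sorted(
        zip(train_points, train_labels),
        key=lambda pair: euclidean(pair[0], query),
    )[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]


# Hypothetical toy data: two well-separated groups
train_points = [(0, 0), (0, 1), (1, 0), (5, 5), (6, 5), (5, 6)]
train_labels = ["a", "a", "a", "b", "b", "b"]
print(knn_predict(train_points, train_labels, (0.5, 0.5), k=3))
print(knn_predict(train_points, train_labels, (5.5, 5.5), k=3))
```

Because every prediction scans the whole training set, the cost per query grows with the dataset size, which is the practical price of lazy learning.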

4. Naïve Bayes Classifier

The Naïve Bayes classifier classifies data using Bayes' theorem, assuming that all features are conditionally independent given the class label. This simplifies the computation of probabilities by breaking down the joint probability of the features into the product of the individual probabilities for each feature.

P(Y|X_1, X_2, ..., X_n) = \frac{P(Y) \prod_{i=1}^n P(X_i|Y)}{P(X_1, X_2, ..., X_n)}

P(Y): Prior probability of class Y.
P(X_i|Y): Likelihood of feature X_i given class Y.
P(X_1, X_2, ..., X_n): Evidence (normalization constant).

**Pros:** Simple, fast, and effective for text data.
**Cons:** Assumes independence; struggles with correlated features.
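The formula above can be sketched for categorical features by estimating P(Y) and P(X_i|Y) from simple frequency counts. This is an illustrative sketch only: the feature meanings and data are hypothetical, and no smoothing is applied, so a feature value never seen with a class gives that class a probability of zero (and an input unseen in every class would make the evidence zero).

```python
from collections import Counter, defaultdict


def train_naive_bayes(rows, labels):
    """Count class frequencies and per-class feature-value frequencies."""
    class_counts = Counter(labels)
    # feature_counts[(i, label)][value] = times feature i equals `value` in class `label`
    feature_counts = defaultdict(Counter)
    for row, label in zip(rows, labels):
        for i, value in enumerate(row):
            feature_counts[(i, label)][value] += 1
    return class_counts, feature_counts


def posterior(model, row):
    """Return P(Y | X_1..X_n) for every class Y."""
    class_counts, feature_counts = model
    total = sum(class_counts.values())
    scores = {}
    for label, count in class_counts.items():
        score = count / total  # prior P(Y)
        for i, value in enumerate(row):
            score *= feature_counts[(i, label)][value] / count  # likelihood P(X_i | Y)
        scores[label] = score
    evidence = sum(scores.values())  # P(X_1..X_n), the normalization constant
    return {label: s / evidence for label, s in scores.items()}


# Hypothetical features: (contains "offer"?, contains "meeting"?)
rows = [("yes", "no"), ("yes", "no"), ("yes", "yes"),
        ("no", "yes"), ("no", "no"), ("yes", "yes")]
labels = ["spam", "spam", "spam", "ham", "ham", "ham"]
model = train_naive_bayes(rows, labels)
post = posterior(model, ("yes", "no"))
print(post)
```

For the query ("yes", "no"), the unnormalized scores are 1/2 · 1 · 2/3 = 1/3 for spam and 1/2 · 1/3 · 1/3 = 1/18 for ham, so the posterior for spam is 6/7.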

5. Linear Discriminant Analysis (LDA)

Linear Discriminant Analysis projects high-dimensional data into a lower-dimensional space while maximizing class separability. The goal is to find a linear combination of features that best separates the classes by maximizing the distance between class means while minimizing the variance within each class.

LDA calculates two scatter matrices:

**1. Within-Class Scatter Matrix (S_W):** Measures the spread of data points within each class.

S_W = \sum_{c=1}^k \sum_{x \in C_c} (x - \mu_c)(x - \mu_c)^T

**2. Between-Class Scatter Matrix (S_B):** Measures the spread of the class means relative to the overall mean.

S_B = \sum_{c=1}^k N_c (\mu_c - \mu)(\mu_c - \mu)^T

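In these formulas, \mu_c is the mean of class c, \mu is the overall mean, and N_c is the number of points in class c. The algorithm maximizes the ratio of between-class variance to within-class variance by solving the eigenvalue problem S_W^{-1} S_B w = \lambda w.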
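For the special case of two classes, S_B has rank one and the leading eigenvector of S_W^{-1} S_B is proportional to S_W^{-1}(\mu_2 - \mu_1), so the projection direction can be computed without an eigen-solver. The sketch below assumes 2-D points and two classes; the helper names and the data are hypothetical.

```python
def mean_vector(points):
    """Mean of a list of 2-D points."""
    n = len(points)
    return [sum(p[0] for p in points) / n, sum(p[1] for p in points) / n]


def within_class_scatter(classes):
    """S_W: sum over classes and points of (x - mu_c)(x - mu_c)^T, for 2-D data."""
    s = [[0.0, 0.0], [0.0, 0.0]]
    for points in classes:
        mu = mean_vector(points)
        for x in points:
            d = [x[0] - mu[0], x[1] - mu[1]]
            for i in range(2):
                for j in range(2):
                    s[i][j] += d[i] * d[j]
    return s


def fisher_direction(class_a, class_b):
    """Two-class LDA direction w = S_W^{-1} (mu_b - mu_a)."""
    s_w = within_class_scatter([class_a, class_b])
    det = s_w[0][0] * s_w[1][1] - s_w[0][1] * s_w[1][0]
    # Explicit inverse of a 2x2 matrix
    inv = [[s_w[1][1] / det, -s_w[0][1] / det],
           [-s_w[1][0] / det, s_w[0][0] / det]]
    mu_a, mu_b = mean_vector(class_a), mean_vector(class_b)
    diff = [mu_b[0] - mu_a[0], mu_b[1] - mu_a[1]]
    return [inv[0][0] * diff[0] + inv[0][1] * diff[1],
            inv[1][0] * diff[0] + inv[1][1] * diff[1]]


# Hypothetical toy data
class_a = [(1, 2), (2, 3), (3, 3)]
class_b = [(6, 5), (7, 8), (8, 7)]
w = fisher_direction(class_a, class_b)
proj_a = [w[0] * x + w[1] * y for x, y in class_a]
proj_b = [w[0] * x + w[1] * y for x, y in class_b]
print(w, proj_a, proj_b)
```

Projecting each point onto w reduces the data to one dimension; on this toy data all projections of the first class fall below those of the second, which is the separation LDA aims for.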

6. Support Vector Machine (SVM)

A Support Vector Machine finds the optimal hyperplane that maximizes the margin between the classes. The margin is the distance between the hyperplane and the nearest data points from each class (the support vectors).

For linearly separable data, it finds a straight line (or, in higher dimensions, a flat hyperplane) that separates the classes.
For non-linear data, SVM uses kernel functions to implicitly map the data into a higher-dimensional space, where it may become linearly separable.

**1. Objective Function (Linear SVM):**

\text{Minimize: } \frac{1}{2} \|w\|^2 + C \sum_{i=1}^n \xi_i

Subject to:

y_i (w^T x_i + b) \geq 1 - \xi_i \quad \text{and} \quad \xi_i \geq 0

Here, w is the weight vector that defines the hyperplane, b is the bias term, and \xi_i are slack variables that allow some misclassification. The constant C > 0 controls the balance between margin maximization and error tolerance. y_i denotes the class label (+1 or -1) for each data point.

**2. Kernel Trick (Non-Linear SVM):**

Kernel functions compute inner products in a higher-dimensional space without explicitly mapping the data into it. Common kernel functions are:

**Linear Kernel:** x \cdot x'
**Polynomial Kernel:** (\gamma x \cdot x' + r)^d
**Radial Basis Function (RBF):** \exp(-\gamma \|x - x'\|^2)
**Sigmoid Kernel:** \tanh(\gamma x \cdot x' + r)
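The four kernels listed above translate directly into code. In this sketch the default values chosen for gamma, r, and d are arbitrary assumptions for illustration; in practice they are hyperparameters to tune.

```python
import math


def dot(x, y):
    """Inner product x . x'."""
    return sum(a * b for a, b in zip(x, y))


def linear_kernel(x, y):
    return dot(x, y)


def polynomial_kernel(x, y, gamma=1.0, r=1.0, d=2):
    return (gamma * dot(x, y) + r) ** d


def rbf_kernel(x, y, gamma=1.0):
    # Squared Euclidean distance ||x - x'||^2
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * sq_dist)


def sigmoid_kernel(x, y, gamma=1.0, r=0.0):
    return math.tanh(gamma * dot(x, y) + r)


x, y = [1.0, 2.0], [3.0, 4.0]
print(linear_kernel(x, y), polynomial_kernel(x, y), rbf_kernel(x, y), sigmoid_kernel(x, y))
```

The RBF kernel equals 1 when the two points coincide and decays toward 0 as they move apart, so it behaves like a similarity score controlled by gamma.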

7. Decision Trees

A decision tree splits the dataset into subsets based on the most significant attribute at each step. It consists of:

**Root Node:** The topmost decision node.
**Internal Nodes:** Decision nodes that split the data.
**Leaf Nodes:** Terminal nodes that represent the final output (class label or value).

The splitting criteria used in decision trees:

**Gini Index:** Measures the impurity of a node.
Gini = 1 - \sum_{i=1}^k p_i^2
Where p_i is the proportion of class i in the node.
**Entropy (Information Gain):** Entropy measures the uncertainty in a node; information gain is the reduction in entropy achieved by a split.
Entropy = - \sum_{i=1}^k p_i \log_2(p_i)
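Both criteria can be computed from the class labels in a node. The sketch below also includes information gain as the parent's entropy minus the size-weighted entropy of the child nodes; the function names and labels are illustrative assumptions.

```python
import math
from collections import Counter


def class_proportions(labels):
    """Proportion p_i of each class present in the node."""
    counts = Counter(labels)
    return [c / len(labels) for c in counts.values()]


def gini(labels):
    return 1 - sum(p ** 2 for p in class_proportions(labels))


def entropy(labels):
    # Only classes that actually occur are included, so log2(0) never arises.
    return -sum(p * math.log2(p) for p in class_proportions(labels))


def information_gain(parent, children):
    """Entropy of the parent minus the size-weighted entropy of the child nodes."""
    n = len(parent)
    weighted = sum(len(child) / n * entropy(child) for child in children)
    return entropy(parent) - weighted


parent = ["a", "a", "b", "b"]
print(gini(parent), entropy(parent))
print(information_gain(parent, [["a", "a"], ["b", "b"]]))
```

A perfectly mixed two-class node has Gini 0.5 and entropy 1 bit, while a pure node scores 0 on both; a split that produces pure children therefore yields the maximum information gain.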

8. Feedforward Neural Network

A feedforward neural network is a type of artificial neural network in which connections between nodes do not form cycles. Data flows in one direction, from the input layer to the output layer, without any feedback loops.

**Input Layer:** Receives the input features.
**Hidden Layers (optional):** Intermediate layers between input and output.
**Output Layer:** Produces the final output.

9. Multi-Layer Perceptron

A multi-layer perceptron (MLP) is a type of feedforward neural network composed of multiple layers of interconnected neurons, where each neuron processes its inputs and passes its output to the next layer.

**Steps in Training MLP:**

**Forward Pass:** Compute the outputs of each layer, one layer at a time, using:
z = W \cdot x + b
Where W is the weight matrix, x is the input, and b is the bias. An activation function is then applied to z to produce the layer's output.
**Loss Calculation:** Compute the loss using the output from the forward pass and the target labels.
**Backpropagation:** Calculate gradients of the loss with respect to weights and biases using the chain rule.
**Weight Update:** Adjust weights and biases using an optimization algorithm such as gradient descent:
W = W - \eta \cdot \frac{\partial \text{Loss}}{\partial W}
Where \eta is the learning rate.
**Repeat:** Iterate through forward pass, loss calculation, backpropagation, and weight update for multiple epochs.
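The training steps above can be sketched for a tiny network with one hidden layer. Several choices here are assumptions made for illustration and are not specified in the notes: sigmoid activations, squared-error loss, per-sample (stochastic) updates, the hidden layer size, and the toy OR dataset.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def train_mlp(data, hidden=2, eta=0.5, epochs=2000, seed=0):
    """Train a one-hidden-layer MLP; return the average loss recorded in each epoch."""
    rng = random.Random(seed)
    n_in = len(data[0][0])
    # W1[j][i]: weight from input i to hidden neuron j; W2[j]: hidden neuron j to output
    W1 = [[rng.uniform(-1, 1) for _ in range(n_in)] for _ in range(hidden)]
    b1 = [0.0] * hidden
    W2 = [rng.uniform(-1, 1) for _ in range(hidden)]
    b2 = 0.0
    losses = []
    for _ in range(epochs):
        total = 0.0
        for x, y in data:
            # Forward pass: z = W . x + b, followed by the activation
            h = [sigmoid(sum(w * xi for w, xi in zip(W1[j], x)) + b1[j])
                 for j in range(hidden)]
            out = sigmoid(sum(w * hj for w, hj in zip(W2, h)) + b2)
            # Loss calculation (squared error)
            total += 0.5 * (out - y) ** 2
            # Backpropagation: chain rule through the output and hidden sigmoids
            delta_out = (out - y) * out * (1 - out)
            delta_h = [delta_out * W2[j] * h[j] * (1 - h[j]) for j in range(hidden)]
            # Weight update: W = W - eta * dLoss/dW
            for j in range(hidden):
                W2[j] -= eta * delta_out * h[j]
                b1[j] -= eta * delta_h[j]
                for i in range(n_in):
                    W1[j][i] -= eta * delta_h[j] * x[i]
            b2 -= eta * delta_out
        losses.append(total / len(data))
    return losses


# Hypothetical toy task: the logical OR function
data = [([0, 0], 0), ([0, 1], 1), ([1, 0], 1), ([1, 1], 1)]
losses = train_mlp(data)
print(losses[0], losses[-1])
```

Comparing the first and last recorded loss gives a quick check that the updates are moving in the right direction; real implementations would typically use a library with automatic differentiation rather than hand-written gradients.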

Unsupervised Learning

Unsupervised learning finds hidden patterns or structures in unlabeled data. The main goals of unsupervised learning algorithms are:

**Clustering:** Group similar data points based on their features.
**Dimensionality Reduction:** Reduce the number of features while retaining important information.

1. K-Means Clustering

K-means clustering groups data points into k clusters by minimizing the variance within each cluster. The goal is to partition the data into k clusters such that:

Data points in the same cluster are as similar as possible (minimize intra-cluster variance).
Data points in different clusters are as different as possible (maximize inter-cluster variance).

It is an iterative process involving the following steps:

**Choose Initial Centroids:** Randomly select k initial centroids.
**Assign Data Points:** Assign each data point to the nearest centroid based on a distance metric.
**Update Centroids:** Recalculate each centroid as the mean of all points assigned to it:
c_j = \frac{1}{n_j} \sum_{i=1}^{n_j} x_i
Where c_j is the centroid of cluster j, n_j is the number of points in the cluster, and x_i is a data point in that cluster.
**Repeat:** Repeat the assignment and update steps until convergence, where centroids no longer change significantly.
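The four steps can be sketched as follows, assuming Euclidean distance and initial centroids drawn at random from the data points. As a simplification, this sketch stops when the assignments stop changing (at which point the centroids no longer move either); the function name, the seed, and the toy points are hypothetical.

```python
import math
import random


def kmeans(points, k, max_iters=100, seed=0):
    """Return (centroids, labels) after running k-means on a list of points."""
    rng = random.Random(seed)
    # Step 1: choose k distinct data points as the initial centroids.
    centroids = [list(p) for p in rng.sample(points, k)]
    labels = None
    for _ in range(max_iters):
        # Step 2: assign each point to its nearest centroid.
        new_labels = [
            min(range(k), key=lambda j: math.dist(p, centroids[j]))
            for p in points
        ]
        # Step 4: stop once the assignments no longer change.
        if new_labels == labels:
            break
        labels = new_labels
        # Step 3: move each centroid to the mean of its assigned points.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # keep the old centroid if a cluster becomes empty
                centroids[j] = [sum(c) / len(members) for c in zip(*members)]
    return centroids, labels


# Hypothetical toy data: two well-separated groups
points = [(1, 1), (1.5, 2), (1, 0.5), (8, 8), (9, 9), (8, 9.5)]
centroids, labels = kmeans(points, 2)
print(centroids, labels)
```

K-means can converge to different solutions depending on the initial centroids, so it is common to run it several times with different seeds and keep the result with the lowest within-cluster variance.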

2. K-Medoids Clustering

K-Medoids clustering selects **actual data points** (medoids) as cluster centers, minimizing the sum of dissimilarities between points and their assigned medoid.

The goal is to partition the data into k clusters such that data points within the same cluster are as similar as possible (minimize intra-cluster dissimilarity). Dissimilarity is measured by metrics like **Manhattan distance** or **Euclidean distance**.

3. Hierarchical Clustering

A hierarchical clustering algorithm creates a tree-like structure called a **dendrogram** to represent nested groupings of data. It builds a hierarchy of clusters in which similar data points are grouped together at each level, from individual points up to a single cluster containing all the data.

**Types of Hierarchical Clustering**

**1. Agglomerative (Bottom-Up):**

Starts with each data point as its own cluster.
Iteratively merges the closest clusters until one cluster remains.

**2. Divisive (Top-Down):**

Starts with all data points in a single cluster.
Recursively splits clusters until each data point forms its own cluster.

**Steps in Agglomerative Hierarchical Clustering**

**Calculate Distance:** Compute the distance (dissimilarity) between all data points using metrics like Euclidean distance, Manhattan distance, or cosine distance.
**Merge Closest Clusters:** Find the two closest clusters (using linkage criteria) and merge them.
**Update Distance Matrix:** Recalculate distances between the new cluster and remaining clusters based on the linkage method.
**Repeat:** Continue merging clusters until one cluster remains or the desired number of clusters is reached.

**Linkage Methods (Distance Between Clusters):**

**Single Linkage:** Distance between the closest points in two clusters.

d(A, B) = \min \{d(x, y): x \in A, y \in B\}

**Complete Linkage:** Distance between the farthest points in two clusters.

d(A, B) = \max \{d(x, y): x \in A, y \in B\}

**Average Linkage:** Average distance between all pairs of points, one from each cluster.

d(A, B) = \frac{1}{|A| |B|} \sum_{x \in A} \sum_{y \in B} d(x, y)

**Centroid Linkage:** Distance between the centroids of two clusters.
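The four linkage methods differ only in how they summarize the pairwise distances between two clusters. A minimal sketch, assuming Euclidean distance for d(x, y) and using hypothetical cluster data:

```python
import math


def single_linkage(a, b):
    """Smallest pairwise distance between the two clusters."""
    return min(math.dist(x, y) for x in a for y in b)


def complete_linkage(a, b):
    """Largest pairwise distance between the two clusters."""
    return max(math.dist(x, y) for x in a for y in b)


def average_linkage(a, b):
    """Mean of all |A| * |B| pairwise distances."""
    return sum(math.dist(x, y) for x in a for y in b) / (len(a) * len(b))


def centroid(cluster):
    """Coordinate-wise mean of the points in a cluster."""
    return [sum(coords) / len(cluster) for coords in zip(*cluster)]


def centroid_linkage(a, b):
    """Distance between the two cluster centroids."""
    return math.dist(centroid(a), centroid(b))


# Hypothetical clusters on a line, so the distances are easy to check by hand
A = [(0, 0), (1, 0)]
B = [(4, 0), (6, 0)]
print(single_linkage(A, B), complete_linkage(A, B),
      average_linkage(A, B), centroid_linkage(A, B))
```

Here the pairwise distances are 4, 6, 3, and 5, giving 3 for single linkage, 6 for complete linkage, and 4.5 for both average and centroid linkage. Single linkage tends to produce elongated, chained clusters, while complete linkage favors compact ones.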

4. Principal Component Analysis (PCA)

Principal component analysis transforms high-dimensional data into a lower-dimensional space by finding the most important directions (principal components) that capture the maximum variance in the data.

**Steps in PCA:**

**1. Standardize the Data:** Center the data by subtracting the mean and scale each feature to unit variance (z-score normalization).

**2. Compute Covariance Matrix:** Calculate the covariance matrix of the standardized data to measure relationships between features.

C = \frac{1}{n-1} X^T X

Here, X is the standardized data matrix with one row per sample, and n is the number of samples.

**3. Calculate Eigenvalues and Eigenvectors:** Find the eigenvalues (variance explained) and eigenvectors (directions of the principal components) of the covariance matrix.

\text{Explained Variance Ratio} = \frac{\text{Eigenvalue of a Principal Component}}{\text{Sum of All Eigenvalues}}

**4. Select Principal Components:** Choose the top k eigenvectors corresponding to the largest eigenvalues.

**5. Project Data:** Project the standardized data onto the selected principal components.

Z = XW

Where W is the matrix of selected eigenvectors, and Z is the transformed data.
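The five steps can be sketched without external libraries for the case k = 1. Instead of a full eigendecomposition, this sketch uses power iteration, which is a swapped-in technique that converges to the eigenvector with the largest eigenvalue; the starting vector, the iteration count, and the toy data are assumptions.

```python
import math


def standardize(rows):
    """Z-score each column using the sample standard deviation (n - 1)."""
    n = len(rows)
    columns = list(zip(*rows))
    means = [sum(col) / n for col in columns]
    stds = [math.sqrt(sum((v - m) ** 2 for v in col) / (n - 1))
            for col, m in zip(columns, means)]
    return [[(v - m) / s for v, m, s in zip(row, means, stds)] for row in rows]


def covariance_matrix(X):
    """C = X^T X / (n - 1) for already-centered data X."""
    n, d = len(X), len(X[0])
    return [[sum(row[i] * row[j] for row in X) / (n - 1) for j in range(d)]
            for i in range(d)]


def top_principal_component(C, iterations=200):
    """Power iteration: returns (largest eigenvalue, unit eigenvector)."""
    d = len(C)
    w = [1.0 / (i + 1) for i in range(d)]  # arbitrary non-zero starting vector
    for _ in range(iterations):
        v = [sum(C[i][j] * w[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(x * x for x in v))
        w = [x / norm for x in v]
    eigenvalue = sum(w[i] * C[i][j] * w[j] for i in range(d) for j in range(d))
    return eigenvalue, w


# Hypothetical toy data with two strongly correlated features
rows = [(2.5, 2.4), (0.5, 0.7), (2.2, 2.9), (1.9, 2.2), (3.1, 3.0)]
X = standardize(rows)
C = covariance_matrix(X)
eigenvalue, w = top_principal_component(C)
explained_ratio = eigenvalue / sum(C[i][i] for i in range(len(C)))
Z = [sum(x * wi for x, wi in zip(row, w)) for row in X]  # Z = XW with k = 1
print(w, explained_ratio, Z)
```

The sum of all eigenvalues equals the trace of C, which is why the explained variance ratio can be computed here without finding the remaining eigenvectors. For more than one component, a numerical library's symmetric eigen-solver is the usual choice.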

Model Evaluation and Selection Techniques

1. Bias-Variance Tradeoff

Bias-variance tradeoff is a tradeoff between two sources of error in machine learning models:

**1. Bias:** Error due to overly simplistic assumptions in the model.

High bias → Model is too simple (e.g., linear model for non-linear data).
Leads to **underfitting** (poor performance on both training and test data).

**2. Variance:** Error due to the model's sensitivity to small fluctuations in the training set.

High variance → Model is too complex (e.g., a very deep decision tree that fits noise in the data).
Leads to **overfitting** (good performance on training data but poor on test data).

**3. Total Error:**

\text{Total Error} =\text{Bias}^2 +\text{Variance}+\text{Irreducible Error}

**Irreducible Error:** Noise in the data that cannot be reduced by any model.

To reduce **bias**, increase model complexity (e.g., add features, use more sophisticated algorithms). To reduce **variance**, use regularization, simplify the model, or collect more data.

2. Cross-Validation Techniques

Cross-validation is a technique for evaluating the performance of a model by partitioning the data into subsets, training on some subsets, and validating on the remaining subsets.

The purposes of cross-validation are to:

Estimate how well a model will generalize to unseen data.
Reduce the risk of overfitting.

**1. k-Fold Cross-Validation**

Steps to perform k-Fold Cross-Validation:

Split the dataset into k equal-sized folds.
Train the model on k-1 folds and validate on the remaining fold.
Repeat this process k times, each time using a different fold as the validation set.
Average the performance metrics (e.g., accuracy, MSE) across all k folds.

k-fold cross-validation gives a lower-variance performance estimate than a single train-test split, but it is computationally expensive for large datasets because the model is trained k times.
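The steps above can be sketched with plain index arithmetic. This is an illustration under stated assumptions: folds are taken in order without shuffling (shuffling first is common in practice), MSE is the metric, and the "model" is a hypothetical baseline that always predicts the mean of the training targets.

```python
def k_fold_indices(n, k):
    """Yield (train_indices, validation_indices) for each of the k folds."""
    # Fold sizes differ by at most one when n is not divisible by k.
    fold_sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        val = list(range(start, start + size))
        train = [i for i in range(n) if i < start or i >= start + size]
        yield train, val
        start += size


def cross_validate(xs, ys, k, fit, predict):
    """Average validation MSE across the k folds."""
    scores = []
    for train, val in k_fold_indices(len(xs), k):
        model = fit([xs[i] for i in train], [ys[i] for i in train])
        errors = [(ys[i] - predict(model, xs[i])) ** 2 for i in val]
        scores.append(sum(errors) / len(errors))
    return sum(scores) / len(scores)


# Hypothetical baseline model: always predict the mean of the training targets
def fit_mean(train_x, train_y):
    return sum(train_y) / len(train_y)


def predict_mean(model, x):
    return model


xs = [1, 2, 3, 4, 5, 6]
ys = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
score = cross_validate(xs, ys, 3, fit_mean, predict_mean)
print(score)
```

With these values the three fold errors are 9.25, 0.25, and 9.25, so the averaged MSE is 6.25. Any model can be plugged in by supplying its own `fit` and `predict` functions.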

**2. Leave-One-Out Cross-Validation (LOOCV)**

Steps to perform LOOCV:

Use a single data point as the validation set and the remaining n-1 points as the training set.
Repeat this process n times, each time using a different data point as the validation set.
Average the performance metrics across all n iterations.

LOOCV uses the maximum possible amount of data for training in each iteration, but its estimate can have high variance because each validation set contains a single point, and it requires training the model n times.
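LOOCV is the special case of k-fold cross-validation with k = n. A minimal sketch, assuming MSE as the metric and a hypothetical baseline model that predicts the mean of the training targets:

```python
def loocv_mse(xs, ys, fit, predict):
    """Leave-one-out cross-validation: average squared error over n held-out points."""
    n = len(xs)
    errors = []
    for i in range(n):
        # Hold out point i and train on the remaining n - 1 points.
        train_x = xs[:i] + xs[i + 1:]
        train_y = ys[:i] + ys[i + 1:]
        model = fit(train_x, train_y)
        errors.append((ys[i] - predict(model, xs[i])) ** 2)
    return sum(errors) / n


# Hypothetical baseline model: always predict the mean of the training targets
def fit_mean(train_x, train_y):
    return sum(train_y) / len(train_y)


def predict_mean(model, x):
    return model


xs = [1, 2, 3, 4]
ys = [1.0, 2.0, 3.0, 4.0]
loocv_score = loocv_mse(xs, ys, fit_mean, predict_mean)
print(loocv_score)
```

For this data the four held-out errors are 4, 4/9, 4/9, and 4, so the result is 20/9 (about 2.22). The loop trains the model n times, which is what makes LOOCV costly on large datasets.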