Last Minute Notes (LMNs): Machine Learning

Last Updated : 23 Jul, 2025

Machine Learning (ML) is a branch of artificial intelligence where computers learn from data to make decisions or predictions without being explicitly programmed.

There are two main types of learning in ML:

Supervised Learning

There are two main types of tasks in supervised learning:

  1. **Regression:** The model predicts a continuous value, for example house prices based on features like location and size. **Algorithms:** simple linear regression, multiple linear regression.
  2. **Classification:** The model predicts a category or class, for example spam vs. non-spam emails. **Algorithms:** logistic regression, k-nearest neighbors, naïve Bayes, linear discriminant analysis, support vector machine, decision tree.

**1. Simple Linear Regression**

Simple Linear Regression models the relationship between two variables by fitting a linear equation to the observed data. It predicts the value of a dependent variable based on the value of an independent variable. The relationship is represented as:

Y=β_0+β_1X+ϵ

Key metrics used to evaluate the model are Mean Squared Error (MSE) and R^2.
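As a minimal sketch of the model and metrics above, the snippet below fits β_0 and β_1 by ordinary least squares and computes MSE and R^2. The function names and the toy data are assumptions made for illustration only.

```python
import numpy as np

def fit_simple_linear_regression(x, y):
    """Return (beta_0, beta_1) that minimize the sum of squared residuals."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    x_mean, y_mean = x.mean(), y.mean()
    # Closed-form least-squares slope and intercept
    beta_1 = np.sum((x - x_mean) * (y - y_mean)) / np.sum((x - x_mean) ** 2)
    beta_0 = y_mean - beta_1 * x_mean
    return beta_0, beta_1

def mse(y_true, y_pred):
    """Mean of the squared differences between targets and predictions."""
    return float(np.mean((np.asarray(y_true) - np.asarray(y_pred)) ** 2))

def r_squared(y_true, y_pred):
    """1 - (residual sum of squares / total sum of squares)."""
    y_true = np.asarray(y_true, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return float(1.0 - ss_res / ss_tot)

# Made-up toy data lying exactly on the line y = 1 + 2x
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = np.array([1.0, 3.0, 5.0, 7.0, 9.0])

beta_0, beta_1 = fit_simple_linear_regression(x, y)
y_hat = beta_0 + beta_1 * x
print(beta_0, beta_1, mse(y, y_hat), r_squared(y, y_hat))
```

Because the toy data has no noise, the fit recovers the line and R^2 is 1; with real data the error term ϵ keeps MSE above zero.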

**Assumptions of Linear Regression:**

**2. Multiple Linear Regression**

A multiple linear regression model predicts a continuous output based on multiple input features. It extends simple linear regression by using more than one independent variable to model the relationship.

Y=β_0 +β_1X_1+β_2X_2+…+β_pX_p+ϵ
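One way to estimate the coefficients in this equation is a least-squares solve on a design matrix with a leading column of ones for β_0. The sketch below uses `np.linalg.lstsq`; the helper name and the toy data are assumptions for illustration.

```python
import numpy as np

def fit_multiple_linear_regression(X, y):
    """Return [beta_0, beta_1, ..., beta_p] via least squares."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    # Prepend a column of ones so the first coefficient is the intercept
    X_design = np.column_stack([np.ones(len(X)), X])
    beta, *_ = np.linalg.lstsq(X_design, y, rcond=None)
    return beta

# Made-up data generated from y = 0.5 + 2*x1 - 1*x2 (no noise)
X = np.array([[1.0, 2.0],
              [2.0, 0.0],
              [3.0, 1.0],
              [4.0, 3.0],
              [5.0, 5.0]])
y = 0.5 + 2.0 * X[:, 0] - 1.0 * X[:, 1]

beta = fit_multiple_linear_regression(X, y)
print(beta)
```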

**Assumptions of Multiple Linear Regression:**

**2. Logistic Regression**

Logistic Regression is used for binary classification problems, where the goal is to predict the probability of an outcome belonging to one of two classes. Unlike linear regression, which predicts continuous values, logistic regression predicts values between 0 and 1.

\text{Probability of Class 1}=\frac{1}{1+e^{-(b_0+b_1X_1+b_2X_2+...+b_nX_n)}}

where b_0, b_1, ... , b_n are the coefficients, and X_1, X_2, ... , X_n are the input features.

If \text{Probability of Class 1} > 0.5 , classify as 1, else 0.

The model is trained by minimizing the log-loss (or binary cross-entropy), which penalizes predicted probabilities that are far from the true labels.
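The prediction rule and the loss can be sketched in a few lines. The coefficients below are made-up values rather than trained ones, and the function names are assumptions for illustration.

```python
import numpy as np

def sigmoid(z):
    """Logistic function mapping any real number into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def predict_proba(X, b0, b):
    """Probability of class 1 for each row of X."""
    return sigmoid(b0 + X @ b)

def predict_class(X, b0, b, threshold=0.5):
    """Classify as 1 when the probability exceeds the threshold, else 0."""
    return (predict_proba(X, b0, b) > threshold).astype(int)

def log_loss(y_true, p, eps=1e-12):
    """Binary cross-entropy averaged over samples (clipped for stability)."""
    p = np.clip(p, eps, 1.0 - eps)
    return float(-np.mean(y_true * np.log(p) + (1 - y_true) * np.log(1 - p)))

# Toy one-feature data with made-up (not fitted) coefficients
X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = np.array([0, 0, 1, 1])
b0, b = -3.0, np.array([2.0])

p = predict_proba(X, b0, b)
labels = predict_class(X, b0, b)
print(p, labels, log_loss(y, p))
```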

**Assumptions of Logistic Regression:**

**3. K-Nearest Neighbors (KNN)**

K-Nearest Neighbors (KNN) makes predictions by finding the k closest data points (neighbors) in the training dataset. The distance between points is typically measured using Euclidean, Manhattan, or Minkowski distance.

Choosing k:

KNN is a lazy learning algorithm, meaning it does not learn a model during training but makes predictions by comparing test points to the entire training dataset.
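A minimal sketch of KNN classification by majority vote follows. It uses the Minkowski distance of order p (p=2 is Euclidean, p=1 is Manhattan); the function name, the toy data, and the choice k=3 are assumptions for illustration.

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_query, k=3, p=2):
    """Predict the label of x_query by majority vote among its k nearest neighbors."""
    # Minkowski distance from the query to every training point
    dists = np.sum(np.abs(X_train - x_query) ** p, axis=1) ** (1.0 / p)
    nearest = np.argsort(dists)[:k]
    votes = Counter(y_train[nearest].tolist())
    return votes.most_common(1)[0][0]

# Made-up training set: two well-separated groups
X_train = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
                    [5.0, 5.0], [5.0, 6.0], [6.0, 5.0]])
y_train = np.array([0, 0, 0, 1, 1, 1])

print(knn_predict(X_train, y_train, np.array([0.5, 0.5]), k=3))
print(knn_predict(X_train, y_train, np.array([5.5, 5.5]), k=3, p=1))
```

Note that no fitting step exists: all work happens at prediction time, which is what "lazy learning" refers to.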

**4. Naïve Bayes Classifier**

The Naïve Bayes classifier classifies data using Bayes' theorem, assuming that all features are conditionally independent given the class label. This simplifies the computation of probabilities by breaking down the joint probability of features into the product of individual probabilities for each feature.

P(Y|X_1, X_2, ..., X_n) = \frac{P(Y) \prod_{i=1}^n P(X_i|Y)}{P(X_1, X_2, ..., X_n)}

**Pros:** Simple, fast, and effective for text data.
**Cons:** Assumes independence; struggles with correlated features.
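The posterior formula can be evaluated directly once the priors P(Y) and the per-feature likelihoods P(X_i|Y) are known. The sketch below works in log space to avoid underflow when many features are multiplied; the function name and all probabilities are made-up values for illustration.

```python
import numpy as np

def naive_bayes_posterior(priors, likelihoods):
    """
    priors:      shape (n_classes,), P(Y)
    likelihoods: shape (n_classes, n_features), P(X_i | Y) for the observed feature values
    Returns P(Y | X_1, ..., X_n) for every class.
    """
    # log P(Y) + sum_i log P(X_i | Y): the numerator of Bayes' theorem in log space
    log_joint = np.log(priors) + np.sum(np.log(likelihoods), axis=1)
    log_joint -= log_joint.max()          # shift for numerical stability
    joint = np.exp(log_joint)
    # Dividing by the sum plays the role of the evidence P(X_1, ..., X_n)
    return joint / joint.sum()

# Made-up spam example: classes are [spam, not spam], two observed words
priors = np.array([0.4, 0.6])
likelihoods = np.array([[0.8, 0.5],    # P(word_1 | spam), P(word_2 | spam)
                        [0.1, 0.2]])   # P(word_1 | not spam), P(word_2 | not spam)

posterior = naive_bayes_posterior(priors, likelihoods)
print(posterior)
```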

**5. Linear Discriminant Analysis (LDA)**

Linear Discriminant Analysis projects high-dimensional data into a lower-dimensional space while maximizing class separability. The goal is to find a linear combination of features that best separates classes by maximizing the distance between class means while minimizing the variance within each class.

LDA calculates two scatter matrices:

**1. Within-Class Scatter Matrix (S_W):** Measures spread of data points within each class.

S_W = \sum_{c=1}^k \sum_{x \in C_c} (x - \mu_c)(x - \mu_c)^T

**2. Between-Class Scatter Matrix (S_B):** Measures spread of class means relative to the overall mean.

S_B = \sum_{c=1}^k N_c (\mu_c - \mu)(\mu_c - \mu)^T

The algorithm optimizes the ratio of between-class variance to within-class variance by solving the eigenvalue problem S_W^{-1} S_B w = \lambda w.
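The two scatter matrices and the eigenvalue problem can be sketched with NumPy as follows. The function name and the toy data are assumptions, and the sketch assumes S_W is invertible.

```python
import numpy as np

def lda_directions(X, y):
    """Return eigenvalues and eigenvectors of S_W^{-1} S_B (largest first), plus S_W and S_B."""
    n_features = X.shape[1]
    mu = X.mean(axis=0)                       # overall mean
    S_W = np.zeros((n_features, n_features))
    S_B = np.zeros((n_features, n_features))
    for c in np.unique(y):
        X_c = X[y == c]
        mu_c = X_c.mean(axis=0)               # class mean
        centered = X_c - mu_c
        S_W += centered.T @ centered          # sum of (x - mu_c)(x - mu_c)^T
        diff = (mu_c - mu).reshape(-1, 1)
        S_B += len(X_c) * (diff @ diff.T)     # N_c (mu_c - mu)(mu_c - mu)^T
    # np.linalg.solve(S_W, S_B) computes S_W^{-1} S_B without forming the inverse
    eigvals, eigvecs = np.linalg.eig(np.linalg.solve(S_W, S_B))
    order = np.argsort(eigvals.real)[::-1]
    return eigvals.real[order], eigvecs.real[:, order], S_W, S_B

# Made-up two-class data
X = np.array([[1.0, 2.0], [2.0, 3.0], [3.0, 3.0],
              [6.0, 2.0], [7.0, 3.0], [8.0, 3.0]])
y = np.array([0, 0, 0, 1, 1, 1])

eigvals, eigvecs, S_W, S_B = lda_directions(X, y)
w = eigvecs[:, 0]          # direction with the largest eigenvalue
projected = X @ w          # 1-D projection of every point
print(eigvals, w, projected)
```

The sign of an eigenvector is arbitrary, so the projected classes may appear in either order along the new axis.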

**6. Support Vector Machine (SVM)**

A Support Vector Machine finds the optimal hyperplane that maximizes the margin between the classes. The margin is the distance between the hyperplane and the nearest data points from each class (support vectors).

**1. Objective Function (Linear SVM):**

\text{Minimize: } \frac{1}{2} \|w\|^2 + C \sum_{i=1}^{n} \xi_i

Subject to:

y_i (w^T x_i + b) \geq 1 - \xi_i \quad \text{and} \quad \xi_i \geq 0

Here, w is the weight vector that defines the hyperplane, b is the bias term, and \xi_i represents slack variables that allow for some misclassification. The constant C > 0 controls the balance between margin maximization and error tolerance. y_i denotes the class label (+1 or -1) for each data point.
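To make the role of the slack variables concrete, the sketch below evaluates the common soft-margin form of the objective, which adds a penalty C Σ ξ_i to (1/2)‖w‖². For a fixed w and b, the smallest slack that satisfies the constraints is ξ_i = max(0, 1 - y_i(w^T x_i + b)). The function name, the value of C, the hyperplane, and the data are assumptions for illustration; this evaluates the objective and does not train an SVM.

```python
import numpy as np

def soft_margin_objective(w, b, X, y, C=1.0):
    """Return (objective value, slack vector) for a given hyperplane (w, b)."""
    margins = y * (X @ w + b)
    # Smallest xi_i >= 0 satisfying y_i (w^T x_i + b) >= 1 - xi_i
    slack = np.maximum(0.0, 1.0 - margins)
    objective = 0.5 * np.dot(w, w) + C * np.sum(slack)
    return float(objective), slack

# Made-up data: the last point lies inside the margin
X = np.array([[2.0, 2.0], [3.0, 3.0], [-2.0, -2.0], [-3.0, -3.0], [0.2, 0.2]])
y = np.array([1, 1, -1, -1, 1])
w, b = np.array([0.5, 0.5]), 0.0

objective, slack = soft_margin_objective(w, b, X, y, C=1.0)
print(objective, slack)
```

Only the point inside the margin gets a non-zero slack; points on the correct side with margin at least 1 contribute nothing to the penalty.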

**2. Kernel Trick (Non-Linear SVM):**

The kernel trick implicitly maps data into a higher-dimensional space using kernel functions. Common kernel functions are:

**7. Decision Trees**

A decision tree splits the dataset into subsets based on the most significant attribute at each step. It consists of:

The splitting criteria used in decision tree:

8. Feedforward Neural Network

A feedforward neural network is a type of artificial neural network in which connections between nodes do not form cycles. Data flows in one direction: from the input layer to the output layer, without any feedback loops.

9. Multi-Layer Perceptron

A multi-layer perceptron (MLP) is a type of feedforward neural network composed of multiple layers of interconnected neurons, where each neuron processes its inputs and passes its output to the next layer.

**Steps in Training an MLP:**

  1. **Forward Pass:** Compute the outputs of each neuron layer by layer using:
    z = W \cdot x + b
    Where W is the weight matrix, x is the input, and b is the bias. An activation function is then applied to z to produce the layer's output.
  2. **Loss Calculation:** Compute the loss using the output from the forward pass and the target labels.
  3. **Backpropagation:** Calculate gradients of the loss with respect to weights and biases using the chain rule.
  4. **Weight Update:** Adjust weights and biases using an optimization algorithm such as gradient descent:
    W = W - \eta \cdot \frac{\partial \text{Loss}}{\partial W}
    Where \eta is the learning rate.
  5. **Repeat:** Iterate through forward pass, loss calculation, backpropagation, and weight update for multiple epochs.
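The five steps above can be sketched as a small NumPy training loop. The architecture (one tanh hidden layer with 4 units, a sigmoid output), the mean squared error loss, the XOR toy data, and the hyperparameters are all assumptions chosen for brevity. Samples are stored as rows, so the forward pass is written as `X @ W + b` rather than W · x + b.

```python
import numpy as np

rng = np.random.default_rng(0)

# XOR toy data: 4 samples, 2 features, 1 target each
X = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0]])
y = np.array([[0.0], [1.0], [1.0], [0.0]])

# Randomly initialized weights, zero biases
W1, b1 = rng.normal(0.0, 1.0, size=(2, 4)), np.zeros((1, 4))
W2, b2 = rng.normal(0.0, 1.0, size=(4, 1)), np.zeros((1, 1))
eta = 0.5  # learning rate

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

losses = []
for epoch in range(2000):
    # 1. Forward pass, layer by layer
    a1 = np.tanh(X @ W1 + b1)
    y_hat = sigmoid(a1 @ W2 + b2)

    # 2. Loss calculation (mean squared error)
    losses.append(float(np.mean((y_hat - y) ** 2)))

    # 3. Backpropagation via the chain rule
    d_z2 = 2.0 * (y_hat - y) / len(X) * y_hat * (1.0 - y_hat)
    d_W2 = a1.T @ d_z2
    d_b2 = d_z2.sum(axis=0, keepdims=True)
    d_z1 = (d_z2 @ W2.T) * (1.0 - a1 ** 2)
    d_W1 = X.T @ d_z1
    d_b1 = d_z1.sum(axis=0, keepdims=True)

    # 4. Weight update: W = W - eta * dLoss/dW (same rule for biases)
    W1 -= eta * d_W1
    b1 -= eta * d_b1
    W2 -= eta * d_W2
    b2 -= eta * d_b2
    # 5. The loop repeats these steps for every epoch

print(losses[0], losses[-1])
```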

**Unsupervised Learning**

Unsupervised learning finds hidden patterns or structures in unlabeled data. The goal of unsupervised learning algorithms:

1. K-Means Clustering

K-means clustering groups data points into k clusters by minimizing the variance within each cluster. The goal is to partition the data into k clusters such that:

It is an iterative process involving the following steps:

  1. **Choose Initial Centroids:** Randomly select k initial centroids.
  2. **Assign Data Points:** Assign each data point to the nearest centroid based on a distance metric.
  3. **Update Centroids:** Recalculate each centroid as the mean of all points assigned to it:
    c_j = \frac{1}{n_j} \sum_{i=1}^{n_j} x_i
    Where c_j is the centroid of cluster j, n_j is the number of points in the cluster, and x_i is a data point in that cluster.
  4. **Repeat:** Repeat the assignment and update steps until convergence, where centroids no longer change significantly.
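The iteration above can be sketched as follows. The function name, the Euclidean distance, the convergence tolerance, and the toy data are assumptions for illustration; initial centroids are drawn at random from the data points.

```python
import numpy as np

def k_means(X, k, n_iters=100, tol=1e-6, seed=0):
    """Return (centroids, labels) after iterating assignment and update steps."""
    rng = np.random.default_rng(seed)
    # Step 1: pick k distinct data points as initial centroids
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iters):
        # Step 2: assign each point to its nearest centroid (Euclidean distance)
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = np.argmin(dists, axis=1)
        # Step 3: move each centroid to the mean of its assigned points
        # (an empty cluster keeps its previous centroid)
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        # Step 4: stop once the centroids no longer change significantly
        shift = np.linalg.norm(new_centroids - centroids)
        centroids = new_centroids
        if shift < tol:
            break
    return centroids, labels

# Made-up data: two well-separated groups
X = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
              [10.0, 10.0], [10.0, 11.0], [11.0, 10.0]])
centroids, labels = k_means(X, k=2)
print(centroids, labels)
```

The cluster indices themselves are arbitrary: which group is labeled 0 depends on the random initialization.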

2. K-Medoids Clustering

K-Medoids clustering selects **actual data points** (medoids) as cluster centers, minimizing the sum of dissimilarities between points and their assigned medoid.

Partition the data into k clusters such that:

3. Hierarchical Clustering

A hierarchical clustering algorithm creates a tree-like structure called a **dendrogram** to represent nested groupings of data. The goal is to create a hierarchy of clusters in which similar data points are grouped together at each level, from individual points up to a single cluster containing all the data.

**Types of Hierarchical Clustering**

**1. Agglomerative (Bottom-Up):**

**2. Divisive (Top-Down):**

**Steps in Agglomerative Hierarchical Clustering**

  1. **Calculate Distance:** Compute the distance (dissimilarity) between all data points using metrics like Euclidean distance, Manhattan distance, or cosine distance.
  2. **Merge Closest Clusters:** Find the two closest clusters (using linkage criteria) and merge them.
  3. **Update Distance Matrix:** Recalculate distances between the new cluster and remaining clusters based on the linkage method.
  4. **Repeat:** Continue merging clusters until one cluster remains or the desired number of clusters is reached.

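**Linkage Methods (Distance Between Clusters):**

Single linkage (minimum pairwise distance):

d(A, B) = \min \{d(x, y): x \in A, y \in B\}

Complete linkage (maximum pairwise distance):

d(A, B) = \max \{d(x, y): x \in A, y \in B\}

Average linkage (mean pairwise distance):

d(A, B) = \frac{1}{|A| |B|} \sum_{x \in A} \sum_{y \in B} d(x, y)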
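The three cluster distances above (minimum, maximum, and mean of the pairwise distances, commonly called single, complete, and average linkage) can be computed from one pairwise distance matrix. The function names and the toy clusters are assumptions, and Euclidean distance is used for d(x, y).

```python
import numpy as np

def pairwise_distances(A, B):
    """Matrix of Euclidean distances d(x, y) for every x in A and y in B."""
    return np.linalg.norm(A[:, None, :] - B[None, :, :], axis=2)

def single_linkage(A, B):
    return float(pairwise_distances(A, B).min())

def complete_linkage(A, B):
    return float(pairwise_distances(A, B).max())

def average_linkage(A, B):
    # mean over all |A| * |B| pairs
    return float(pairwise_distances(A, B).mean())

# Made-up clusters on a line, so the distances are easy to check by hand
A = np.array([[0.0, 0.0], [1.0, 0.0]])
B = np.array([[3.0, 0.0], [5.0, 0.0]])

print(single_linkage(A, B), complete_linkage(A, B), average_linkage(A, B))
```

Here the pairwise distances are 3, 5, 2, and 4, so the three linkages give 2, 5, and 3.5 respectively.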

4. Principal Component Analysis (PCA)

Principal component analysis transforms high-dimensional data into a lower-dimensional space by finding the most important directions (principal components) that capture the maximum variance in the data.

**Steps in PCA:**

**1. Standardize the Data:** Center the data by subtracting the mean and scale it to unit variance (z-score normalization).

**2. Compute Covariance Matrix:** Calculate the covariance matrix of the standardized data to measure relationships between features.

C = \frac{1}{n-1} X^T X

Here, X is the standardized data matrix.

**3. Calculate Eigenvalues and Eigenvectors:** Find the eigenvalues (variance explained) and eigenvectors (directions of principal components) of the covariance matrix.

\text{Explained Variance Ratio} = \frac{\text{Eigenvalue of a Principal Component}}{\text{Sum of All Eigenvalues}}

**4. Select Principal Components:** Choose the top k eigenvectors corresponding to the largest eigenvalues.

**5. Project Data:** Project the standardized data onto the selected principal components.

Z = XW

Where W is the matrix of selected eigenvectors, and Z is the transformed data.
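The PCA steps can be sketched end to end with NumPy. The function name and the toy data are assumptions; `np.linalg.eigh` is used because the covariance matrix is symmetric, and the sample standard deviation (ddof=1) is used so that it matches the 1/(n-1) factor in C.

```python
import numpy as np

def pca(X_raw, k):
    """Return (Z, W, explained_variance_ratio) for the top-k principal components."""
    # Step 1: standardize each feature (zero mean, unit sample variance)
    X = (X_raw - X_raw.mean(axis=0)) / X_raw.std(axis=0, ddof=1)
    n = X.shape[0]
    # Step 2: covariance matrix C = X^T X / (n - 1)
    C = (X.T @ X) / (n - 1)
    # Step 3: eigen-decomposition, sorted by decreasing eigenvalue
    eigvals, eigvecs = np.linalg.eigh(C)
    order = np.argsort(eigvals)[::-1]
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    explained_ratio = eigvals / eigvals.sum()
    # Step 4: keep the top-k eigenvectors as columns of W
    W = eigvecs[:, :k]
    # Step 5: project the standardized data, Z = XW
    Z = X @ W
    return Z, W, explained_ratio

# Made-up 2-D data with strongly correlated features
X_raw = np.array([[2.5, 2.4], [0.5, 0.7], [2.2, 2.9], [1.9, 2.2], [3.1, 3.0],
                  [2.3, 2.7], [2.0, 1.6], [1.0, 1.1], [1.5, 1.6], [1.1, 0.9]])

Z, W, explained_ratio = pca(X_raw, k=1)
print(Z.shape, explained_ratio)
```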

**Model Evaluation and Selection Techniques**

1. Bias-Variance Tradeoff

The bias-variance tradeoff describes the balance between two sources of error in machine learning models:

**1. Bias:** Error due to overly simplistic assumptions in the model (underfitting).

**2. Variance:** Error due to the model's sensitivity to small fluctuations in the training set (overfitting).

**3. Total Error:**

\text{Total Error} =\text{Bias}^2 +\text{Variance}+\text{Irreducible Error}

To reduce **bias**, increase model complexity (e.g., add features or use a more expressive algorithm). To reduce **variance**, use regularization, simplify the model, or collect more data.

2. Cross-Validation Techniques

Cross-validation is a technique for evaluating the performance of a model by partitioning the data into subsets, training on some subsets, and validating on the remaining subsets.

The purpose of the cross-validation techniques:

**1. k-Fold Cross-Validation**

Steps to perform k-Fold Cross-Validation:

  1. Split the dataset into k roughly equal-sized folds.
  2. Train the model on k-1 folds and validate on the remaining fold.
  3. Repeat this process k times, each time using a different fold as the validation set.
  4. Average the performance metrics (e.g., accuracy, MSE) across all k folds.

k-fold cross-validation reduces the variance of the performance estimate compared to a single train-test split, but it is computationally expensive for large datasets because the model is trained k times.
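The k-fold procedure can be sketched without any ML library. The helper names, the shuffling seed, the least-squares linear model used as the learner, and the toy data are assumptions for illustration.

```python
import numpy as np

def k_fold_indices(n_samples, k, seed=0):
    """Yield (train_idx, val_idx) pairs, one per fold."""
    rng = np.random.default_rng(seed)
    indices = rng.permutation(n_samples)      # shuffle before splitting
    folds = np.array_split(indices, k)        # k roughly equal-sized folds
    for i in range(k):
        val_idx = folds[i]
        train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
        yield train_idx, val_idx

def cross_val_mse(X, y, k=5):
    """Average validation MSE of a least-squares linear model over k folds."""
    scores = []
    for train_idx, val_idx in k_fold_indices(len(X), k):
        # Train on k-1 folds
        X_tr = np.column_stack([np.ones(len(train_idx)), X[train_idx]])
        beta, *_ = np.linalg.lstsq(X_tr, y[train_idx], rcond=None)
        # Validate on the held-out fold
        X_val = np.column_stack([np.ones(len(val_idx)), X[val_idx]])
        scores.append(np.mean((y[val_idx] - X_val @ beta) ** 2))
    return float(np.mean(scores))

# Made-up noiseless data on the line y = 3 + 2x
X = np.arange(10, dtype=float).reshape(-1, 1)
y = 3.0 + 2.0 * X[:, 0]

cv_mse = cross_val_mse(X, y, k=5)
print(cv_mse)
```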

**2. Leave-One-Out Cross-Validation (LOOCV)**

Steps to perform LOOCV:

  1. Use a single data point as the validation set and the remaining n-1 points as the training set.
  2. Repeat this process n times, each time using a different data point as the validation set.
  3. Average the performance metrics across all n iterations.

LOOCV uses the maximum possible amount of data for training in each iteration, but the resulting estimate can have high variance because each validation set is a single point, and it requires training the model n times.
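A minimal sketch of LOOCV follows, using a deliberately simple "model" that predicts the mean of the training targets so that the loop structure stays visible. The function name, the choice of model, and the data are assumptions for illustration.

```python
import numpy as np

def loocv_mse_mean_predictor(y):
    """LOOCV estimate of the MSE of a model that predicts the training mean."""
    y = np.asarray(y, dtype=float)
    n = len(y)
    errors = []
    for i in range(n):
        train = np.delete(y, i)        # the remaining n-1 points
        prediction = train.mean()      # "fit" the model on the training set
        errors.append((y[i] - prediction) ** 2)   # validate on the held-out point
    return float(np.mean(errors))      # average over all n iterations

# Made-up targets
y = np.array([1.0, 2.0, 3.0, 4.0])
print(loocv_mse_mean_predictor(y))
```

For these four values the held-out squared errors are 4, 4/9, 4/9, and 4, so the LOOCV estimate is 20/9 (about 2.22).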