VGG-16 | CNN model

Last Updated : 12 May, 2026

VGG-16 is a convolutional neural network (CNN) designed for image classification tasks, known for its simple and uniform architecture that delivers strong performance on visual recognition problems.

This model achieves 92.7% top-5 test accuracy on ImageNet. The full ImageNet dataset contains over 14 million images; the ILSVRC classification benchmark used here is a subset covering 1000 classes.

VGG-16 Architecture

VGG-16 Model Objective

VGG-16 takes a fixed-size 224×224 RGB image as input, so each ImageNet image is rescaled and cropped to form an input tensor of shape (224, 224, 3). The model processes this input and outputs a vector of 1000 values:

\hat{y} = \begin{bmatrix} \hat{y}_{0} \\ \hat{y}_{1} \\ \hat{y}_{2} \\ \vdots \\ \hat{y}_{999} \end{bmatrix}

This vector represents the classification probabilities for each class. For example, if the model assigns different probabilities to classes such as 0, 1, 2, 3, 780, and 999, with all others being 0, the classification vector can be written as:

\hat{y} = \begin{bmatrix} \hat{y}_{0} = 0.1 \\ \hat{y}_{1} = 0.05 \\ \hat{y}_{2} = 0.05 \\ \hat{y}_{3} = 0.03 \\ \vdots \\ \hat{y}_{780} = 0.72 \\ \vdots \\ \hat{y}_{999} = 0.05 \end{bmatrix}

To make sure these probabilities add up to 1, the final layer applies the softmax function to the raw class scores (logits) z_0, ..., z_999. The softmax function is defined as follows:

\hat{y}_i = \frac{e^{z_i}}{\sum_{j=0}^{999} e^{z_j}}
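The softmax step can be sketched in a few lines of plain Python. This is a minimal illustration rather than the code used to train VGG-16: the function name `softmax` and the four toy logits are assumptions made for the example, and the max-subtraction trick is a common numerical safeguard that the formula itself does not require.

```python
import math

def softmax(logits):
    """Convert raw scores (logits) into probabilities that sum to 1."""
    # Subtracting the largest logit does not change the result,
    # but it keeps exp() from overflowing on large scores.
    shift = max(logits)
    exps = [math.exp(z - shift) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical logits for a 4-class toy problem (not real VGG-16 outputs).
probs = softmax([2.0, 1.0, 0.1, -1.0])
print(probs)       # largest logit gets the largest probability
print(sum(probs))  # equals 1 up to floating-point rounding
```

In VGG-16 the same operation would be applied to a vector of 1000 logits, one per class.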

We then collect the 5 most probable classes, ordered by decreasing probability, into a candidate vector:

C =\begin{bmatrix} 780\\ 0\\ 1\\ 2\\ 999 \end{bmatrix}
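As a rough sketch, the candidate vector can be obtained by sorting class indices by probability. The helper name `top_k` and the tie-breaking rule (lower class index first) are assumptions for this example; the probabilities are the ones from the example vector above.

```python
def top_k(probs, k=5):
    """Return the indices of the k largest probabilities, most probable first."""
    # Sort by descending probability; ties go to the lower class index.
    order = sorted(range(len(probs)), key=lambda i: (-probs[i], i))
    return order[:k]

# Rebuild the 1000-class example vector: all zeros except six classes.
y_hat = [0.0] * 1000
example = {0: 0.1, 1: 0.05, 2: 0.05, 3: 0.03, 780: 0.72, 999: 0.05}
for cls, p in example.items():
    y_hat[cls] = p

C = top_k(y_hat)
print(C)  # [780, 0, 1, 2, 999]
```

Class 3 (probability 0.03) is the sixth most probable class, so it falls outside the top 5.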

Suppose the ground-truth vector for this image is:

G = \begin{bmatrix} G_{0}\\ G_{1}\\ G_{2} \end{bmatrix}=\begin{bmatrix} 780\\ 2\\ 999 \end{bmatrix}

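Then we define the error function as follows:

E = \frac{1}{n}\sum_{k}\min_{i} d(c_{i}, G_{k})

For each ground-truth class G_k, it takes the minimum distance to any of the predicted candidates c_i, then averages over the n ground-truth classes. The distance function d is defined as:

d(c_{i}, G_{k}) = \begin{cases} 0 & \text{if } c_{i} = G_{k} \\ 1 & \text{otherwise} \end{cases}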

So, the error for this example is:

\begin{aligned}E &=\frac{1}{3}\left( \min_{i}d(c_{i}, G_{0}) + \min_{i}d(c_{i}, G_{1}) + \min_{i}d(c_{i}, G_{2}) \right)\\&=0\end{aligned}

Since every category in the ground truth appears in the predicted top-5 vector, the error is 0.
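The error computation can be checked with a short sketch. It assumes the ILSVRC convention that d is 0 when a candidate matches the ground-truth class and 1 otherwise; the function names `d` and `top_k_error`, and the second ground-truth list `[780, 3]`, are hypothetical and used only for illustration.

```python
def d(c, g):
    """Distance between a predicted class and a ground-truth class (0 on a match)."""
    return 0 if c == g else 1

def top_k_error(candidates, ground_truth):
    """Average, over ground-truth labels, of the minimum distance to any candidate."""
    total = sum(min(d(c, g) for c in candidates) for g in ground_truth)
    return total / len(ground_truth)

C = [780, 0, 1, 2, 999]  # predicted top-5 classes from the example
G = [780, 2, 999]        # ground-truth classes from the example

print(top_k_error(C, G))         # 0.0, every label is among the candidates
print(top_k_error(C, [780, 3]))  # 0.5, class 3 is missing from the top 5
```

Each ground-truth label that is absent from the candidate list contributes 1 to the sum, so the error is the fraction of labels the top-5 prediction missed.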

VGG Architecture

The VGG-16 architecture consists of 16 weight layers (13 convolutional + 3 fully connected) arranged in blocks: each block stacks several convolutional layers and ends with a max-pooling layer for downsampling. Pooling layers have no weights, so they are not counted among the 16.
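To make the layer layout concrete, the sketch below walks a 224×224×3 input through the blocks and counts layers and parameters. The specifics are not given in the text above and are taken from the widely published VGG-16 configuration, so treat them as assumptions: 3×3 convolutions that preserve spatial size, 2×2 max-pooling that halves it, channel widths of 64, 128, 256, 512 and 512, and fully connected layers of 4096, 4096 and 1000 units. The names `VGG16_CONFIG`, `FC_SIZES` and `summarize` are made up for this example.

```python
# Integers are conv output channels; "M" marks a 2x2 max-pooling layer.
VGG16_CONFIG = [64, 64, "M", 128, 128, "M", 256, 256, 256, "M",
                512, 512, 512, "M", 512, 512, 512, "M"]
FC_SIZES = [4096, 4096, 1000]  # assumed fully connected layer widths

def summarize(height=224, width=224, channels=3):
    """Return (conv layer count, final feature-map shape, total parameters)."""
    params = 0
    conv_layers = 0
    for item in VGG16_CONFIG:
        if item == "M":
            # Pooling halves the spatial size and has no weights.
            height //= 2
            width //= 2
        else:
            # 3x3 kernel weights per input channel, plus one bias per filter.
            params += (3 * 3 * channels + 1) * item
            channels = item
            conv_layers += 1
    shape = (height, width, channels)
    features = height * width * channels  # flattened input to the first FC layer
    for size in FC_SIZES:
        params += (features + 1) * size   # weights plus biases
        features = size
    return conv_layers, shape, params

conv_layers, shape, params = summarize()
print(conv_layers)  # 13
print(shape)        # (7, 7, 512)
print(params)       # 138357544
```

Under these assumptions, most of the roughly 138 million parameters sit in the first fully connected layer, which maps the flattened 7×7×512 feature map to 4096 units.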

VGG-16 Architecture Map

VGG-16 for Object Localization

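For object localization, instead of predicting only class scores, the model predicts bounding box coordinates. A bounding box is represented by a 4D vector: (x, y, height, width). There are two approaches: the bounding box can be shared across all classes, so the output is a single 4D vector, or it can be class-specific, so the output is a 4000D vector (4 values for each of the 1000 classes).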

Since this is a regression task, the loss function is changed from classification loss to regression loss (e.g., MSE), which measures the difference between predicted and actual bounding box values.
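A regression loss of this kind can be sketched as follows. This is only an illustration of mean squared error on a single box, not the training code behind VGG-16: the function name `mse_loss`, the normalized coordinate values, and the (x, y, height, width) ordering of the tuples are assumptions made for the example.

```python
def mse_loss(predicted, target):
    """Mean squared error between two equally sized coordinate vectors."""
    if len(predicted) != len(target):
        raise ValueError("predicted and target must have the same length")
    squared = [(p - t) ** 2 for p, t in zip(predicted, target)]
    return sum(squared) / len(squared)

# Hypothetical boxes as (x, y, height, width), normalized to [0, 1].
pred_box = (0.50, 0.40, 0.30, 0.20)
true_box = (0.50, 0.50, 0.30, 0.40)

print(mse_loss(pred_box, true_box))  # about 0.0125
print(mse_loss(true_box, true_box))  # 0.0 for a perfect prediction
```

The loss grows with the squared distance between predicted and actual coordinates, so larger misplacements of the box are penalized more heavily than small ones.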

**Result:** VGG-16 performed strongly in the ILSVRC 2014 competition. It achieved a top-5 classification error of 7.32%, finishing second in classification (after GoogLeNet with 6.66%). It also won the localization task with a 25.32% error rate.

Advantages

Limitations