Attention Mechanism in ML

Last Updated : 11 May, 2026

The attention mechanism allows models to focus on the most important parts of input data by assigning different weights to different elements. This helps prioritize relevant information instead of treating everything equally and forms the core of models like Transformers and BERT.

Types of Attention Mechanisms

To know more about the types of attention mechanism, refer to: Types of Attention Mechanism.

Working

The working of the attention mechanism can be broken down into the following key steps:

**Step 1: Input Encoding:** The input sequence is first encoded using an encoder such as an RNN, LSTM, GRU or Transformer to generate hidden states that represent the input context.

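**Step 2: Query, Key and Value Vectors:** Each input is transformed into three vectors:

  * Query (Q): represents what the current position is looking for.
  * Key (K): represents what each input position offers for matching against a query.
  * Value (V): carries the content that is passed on when a position is attended to.

These are learned linear transformations of the input embeddings.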
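As a minimal sketch of these linear transformations, the snippet below projects one embedding into a query, a key and a value using plain Python lists. The embedding and the matrices `W_q`, `W_k` and `W_v` are hand-picked, hypothetical values chosen for illustration; in a real model they are learned during training.

```python
def matvec(matrix, vector):
    """Multiply a matrix (given as a list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]

# Hypothetical 2-dimensional embedding and hand-picked 2x2 projection matrices.
# In a trained model these matrices are learned, not chosen by hand.
embedding = [1.0, 2.0]
W_q = [[1.0, 0.0], [0.0, 1.0]]  # identity: the query equals the embedding
W_k = [[0.0, 1.0], [1.0, 0.0]]  # swaps the two components
W_v = [[2.0, 0.0], [0.0, 0.5]]  # rescales each component

query = matvec(W_q, embedding)
key = matvec(W_k, embedding)
value = matvec(W_v, embedding)
print(query, key, value)
```

The same embedding yields three different vectors because each projection matrix is different, which is what lets the model separate "what to look for" from "what to return".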

**Step 3: Similarity Computation:** The model computes the similarity between the query and each key to determine relevance. Common scoring functions are:

\text{Score}(s,i) = \begin{cases}h_s \cdot y_i & \text{(Dot Product)} \\h_s^T W y_i & \text{(General)} \\v^T \tanh(W[h_s; y_i]) & \text{(Concat)}\end{cases}

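Where:

  * h_s is the query vector.
  * y_i is the i-th key vector.
  * W is a learnable weight matrix and v is a learnable weight vector.
  * [h_s; y_i] denotes the concatenation of the two vectors.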
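The three scoring functions can be sketched in plain Python as shown below. This is an illustrative sketch, not a reference implementation: the vectors, the matrices `W_general` and `W_concat`, and the vector `v` are hypothetical values, and the helper names are invented for this example.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def matvec(matrix, vector):
    return [dot(row, vector) for row in matrix]

def score_dot(h_s, y_i):
    # Dot product: h_s . y_i
    return dot(h_s, y_i)

def score_general(h_s, y_i, W):
    # General: h_s^T W y_i
    return dot(h_s, matvec(W, y_i))

def score_concat(h_s, y_i, W, v):
    # Concat: v^T tanh(W [h_s; y_i]), where h_s + y_i concatenates the lists
    hidden = [math.tanh(z) for z in matvec(W, h_s + y_i)]
    return dot(v, hidden)

# Hypothetical inputs and parameters (assumed values for illustration only)
h_s = [1.0, 0.0]
y_i = [0.5, 2.0]
W_general = [[1.0, 0.0], [0.0, 1.0]]   # identity, so this reduces to the dot product
W_concat = [[1.0, 0.0, 0.0, 0.0],
            [0.0, 0.0, 0.0, 1.0]]      # maps the 4-dim concatenation to 2 dims
v = [1.0, 1.0]

s_dot = score_dot(h_s, y_i)
s_general = score_general(h_s, y_i, W_general)
s_concat = score_concat(h_s, y_i, W_concat, v)
print(s_dot, s_general, s_concat)
```

The dot product needs no parameters but requires the query and key to have the same dimension, while the general and concat forms add learnable parameters that can relate vectors of different sizes.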

**Step 4: Attention Weights Calculation:** The similarity scores are passed through a softmax function, taken over all keys i, to convert them into attention weights that are positive and sum to 1:

\alpha(s,i) = \text{softmax}(\text{Score}(s,i))
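A minimal softmax sketch in plain Python is given below. The scores are made-up numbers, and subtracting the maximum before exponentiating is a common numerical-stability choice assumed here rather than something the formula requires.

```python
import math

def softmax(scores):
    # Subtracting the maximum leaves the result unchanged but avoids overflow
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

scores = [2.0, 1.0, 0.1]  # hypothetical similarity scores for three keys
weights = softmax(scores)
print(weights)
```

Higher scores receive larger weights, and the weights always sum to 1, so they can be read as a distribution over the keys.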

**Step 5: Weighted Sum:** The attention weights are used to compute a weighted sum of the value vectors:

c_s = \sum_{i=1}^{T_s} \alpha(s,i) V_i

Here, T_s is the total number of key-value pairs and c_s is the result for query s.

**Step 6: Context Vector:** The context vector c_s summarizes the most relevant information from the input sequence and is fed to the decoder.
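The weighted sum that produces the context vector can be sketched as follows. The weights and value vectors are hypothetical numbers chosen so the result is easy to check by hand.

```python
def weighted_sum(weights, values):
    # Component-wise sum of the value vectors, each scaled by its attention weight
    dim = len(values[0])
    return [sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)]

weights = [0.7, 0.2, 0.1]                      # assumed attention weights (sum to 1)
values = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # assumed value vectors
context = weighted_sum(weights, values)
print(context)  # approximately [0.8, 0.3]

# With all weight on one position, the context is exactly that value vector
print(weighted_sum([0.0, 1.0, 0.0], values))
```

Because the weights sum to 1, the context vector is a weighted average of the value vectors, pulled towards the positions with the largest weights.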

**Step 7: Integration:** The decoder uses both its own hidden state and the context vector to generate the next output token.

Attention Mechanism Architecture

Attention is a mechanism used within architectures like encoder-decoder models to improve how information is processed. It works alongside components such as the encoder and decoder by helping the model focus on the most relevant parts of the input.

Encoder-Decoder with Attention

1. Encoder

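The encoder processes the input sequence, such as a sentence, and converts it into a series of hidden states that represent contextual information about each token, for example:

h_0, h_1, h_2, h_3

Each hidden state is computed from the previous hidden state and the current input token x_t:

h_t = f(h_{t-1}, x_t)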
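The recurrence above can be sketched with a simple tanh RNN cell. This is only one possible choice for f (an LSTM or GRU would use a more elaborate update), and the weight matrices `W_h` and `W_x`, the zero initial state and the input vectors are assumptions made for illustration.

```python
import math

def rnn_step(h_prev, x_t, W_h, W_x):
    # h_t = tanh(W_h h_{t-1} + W_x x_t), one simple choice for f
    return [
        math.tanh(sum(w * h for w, h in zip(W_h[j], h_prev)) +
                  sum(w * x for w, x in zip(W_x[j], x_t)))
        for j in range(len(h_prev))
    ]

def encode(inputs, W_h, W_x, hidden_dim):
    h = [0.0] * hidden_dim  # assumed all-zero initial state
    states = []
    for x_t in inputs:
        h = rnn_step(h, x_t, W_h, W_x)
        states.append(h)  # keep every hidden state for the attention layer
    return states

# Hypothetical 2-dimensional token embeddings and hand-picked weights
inputs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
W_h = [[0.5, 0.0], [0.0, 0.5]]
W_x = [[1.0, 0.0], [0.0, 1.0]]
states = encode(inputs, W_h, W_x, hidden_dim=2)
print(states)
```

Note that the encoder returns all hidden states rather than only the last one, since the attention layer needs one state per input token.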

2. Attention Mechanism

The Attention component determines how much importance should be given to each encoder hidden state when generating a particular word in the output. Its main goal is to create a context vector C_t, which captures the most relevant information from the encoder outputs for the current decoding step.

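**Step 1: Feed-Forward Alignment Function:** The decoder's current hidden state S_t and each encoder hidden state h_i are combined to compute alignment scores e_{t,i}:

e_{t,i} = g(S_t, h_i)

Here, g is a small feed-forward network that is trained jointly with the rest of the model, S_t is the decoder hidden state at step t and h_i is the encoder hidden state at position i. Typically, g uses a non-linear activation such as tanh, ReLU or sigmoid.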
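One common form of g is a single tanh layer followed by a dot product with a vector, applied to the concatenation of S_t and h_i. The sketch below assumes that form and uses hand-picked, hypothetical values for `W`, `v` and the states; it computes one score per encoder position for a single decoding step.

```python
import math

def alignment_score(S_t, h_i, W, v):
    # e_{t,i} = v^T tanh(W [S_t; h_i]); S_t + h_i concatenates the two lists
    concat = S_t + h_i
    hidden = [math.tanh(sum(w * c for w, c in zip(row, concat))) for row in W]
    return sum(v_j * z_j for v_j, z_j in zip(v, hidden))

# Hypothetical decoder state, encoder states and parameters
S_t = [0.5, -0.5]
encoder_states = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
W = [[1.0, 0.0, 1.0, 0.0],
     [0.0, 1.0, 0.0, 1.0]]  # adds S_t and h_i component-wise
v = [1.0, 1.0]

scores = [alignment_score(S_t, h_i, W, v) for h_i in encoder_states]
print(scores)
```

The scores are unnormalized and can be negative; they only become comparable importance values after the softmax step.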

**Step 2: Softmax Normalization:** The alignment scores are normalized using a softmax function to produce attention weights \alpha_{t,i}, which act like probabilities indicating the importance of each encoder hidden state:

\alpha_{t,i} = \frac{\exp(e_{t,i})}{\sum_{k=1}^{T_s} \exp(e_{t,k})}

Here, T_s is the number of encoder hidden states (the length of the source sequence), so the weights \alpha_{t,i} are positive and sum to 1 over i.

**Step 3: Context Vector Generation:** Once the attention weights are obtained, they are used to compute a weighted sum of the encoder hidden states, forming the context vector C_t:

C_t = \sum_{i=1}^{T_s} \alpha_{t,i} \, h_i

Here, \alpha_{t,i} is the attention weight on encoder hidden state h_i at decoding step t. This vector represents the most relevant information from the input sentence needed to predict the next output word.
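As a small worked example with made-up numbers (not taken from any trained model), suppose there are three encoder states h_1 = (1, 0), h_2 = (0, 1) and h_3 = (1, 1), and the alignment scores at step t are e_{t,1} = 2, e_{t,2} = 1 and e_{t,3} = 0. The softmax and the weighted sum then give approximately:

```latex
\begin{aligned}
\sum_{k=1}^{3} \exp(e_{t,k}) &= e^{2} + e^{1} + e^{0} \approx 7.389 + 2.718 + 1 = 11.107 \\
\alpha_{t,1} &\approx 7.389 / 11.107 \approx 0.665 \\
\alpha_{t,2} &\approx 2.718 / 11.107 \approx 0.245 \\
\alpha_{t,3} &\approx 1 / 11.107 \approx 0.090 \\
C_t &\approx 0.665\,(1,0) + 0.245\,(0,1) + 0.090\,(1,1) = (0.755,\ 0.335)
\end{aligned}
```

The context vector lies closest to h_1 because that state received the highest alignment score.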

3. Decoder

The Decoder uses both the context vector C_t from the attention layer and its own current hidden state S_t to generate the next output word.

At each decoding step:

  1. The decoder receives C_t and the previous predicted word.
  2. It produces a new hidden state S_{t+1} and predicts the next token.
  3. This process repeats for each word in the target sequence.

Mathematically:

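y_t = \text{Decoder}(y_{t-1}, S_t, C_t)

Here, y_{t-1} is the previously predicted token, S_t is the decoder hidden state and C_t is the context vector at step t.

This combination enables the model to generate contextually accurate translations by focusing on the most relevant parts of the source sequence for each predicted word.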
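The decoding loop described above can be sketched with a toy decoder. Everything here is a hypothetical stand-in chosen for illustration: `toy_decoder_step` is an invented update rule rather than a trained network, the scores use a plain dot product, and the 3-token vocabulary, start token and zero initial state are assumptions.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def softmax(xs):
    peak = max(xs)
    exps = [math.exp(x - peak) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attend(S_t, encoder_states):
    # Dot-product scores -> softmax weights -> weighted-sum context C_t
    weights = softmax([dot(S_t, h_i) for h_i in encoder_states])
    dim = len(encoder_states[0])
    C_t = [sum(w * h[d] for w, h in zip(weights, encoder_states)) for d in range(dim)]
    return C_t, weights

def toy_decoder_step(y_prev, S_t, C_t, output_embeddings):
    # Hypothetical update: mix the old state, the context and the previous token
    prev_emb = output_embeddings[y_prev]
    S_next = [math.tanh(s + c + e) for s, c, e in zip(S_t, C_t, prev_emb)]
    # Predict the token whose embedding best matches the new state
    logits = [dot(S_next, emb) for emb in output_embeddings]
    y_t = max(range(len(logits)), key=logits.__getitem__)
    return y_t, S_next

encoder_states = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
output_embeddings = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]  # toy 3-token vocabulary
S_t, y_prev = [0.0, 0.0], 0  # assumed initial state and start token
outputs = []
for _ in range(4):
    C_t, weights = attend(S_t, encoder_states)  # fresh context at every step
    y_prev, S_t = toy_decoder_step(y_prev, S_t, C_t, output_embeddings)
    outputs.append(y_prev)
print(outputs)
```

The key point is that the context vector is recomputed inside the loop, so each output token can attend to a different part of the input.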

Improvement Using Attention Mechanism

Traditional deep learning models like RNNs, LSTMs and CNNs have limitations when handling long or complex dependencies. The attention mechanism improves on them by letting each output step weight every input position directly by its relevance.

Implementation

Let's see a Python implementation of the attention mechanism using PyTorch.

Step 1: Define the Attention Class

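```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class Attention(nn.Module):
    def __init__(self, hidden_dim):
        super(Attention, self).__init__()
        # Projects [decoder hidden; encoder output] down to hidden_dim
        self.attn = nn.Linear(hidden_dim * 2, hidden_dim)
        # Learnable vector that reduces each energy vector to a scalar score
        self.v = nn.Parameter(torch.rand(hidden_dim))

    def forward(self, hidden, encoder_outputs):
        # hidden: (batch, hidden_dim)
        # encoder_outputs: (batch, seq_len, hidden_dim)
        batch_size = encoder_outputs.shape[0]
        seq_len = encoder_outputs.shape[1]
        # Repeat the decoder state once per source position
        hidden = hidden.unsqueeze(1).repeat(1, seq_len, 1)
        energy = torch.tanh(
            self.attn(torch.cat((hidden, encoder_outputs), dim=2)))
        energy = energy.permute(0, 2, 1)  # (batch, hidden_dim, seq_len)
        v = self.v.repeat(batch_size, 1).unsqueeze(1)  # (batch, 1, hidden_dim)
        attention_scores = torch.bmm(v, energy).squeeze(1)  # (batch, seq_len)
        attention_weights = F.softmax(attention_scores, dim=1)
        # Weighted sum of encoder outputs: (batch, 1, hidden_dim)
        context = torch.bmm(attention_weights.unsqueeze(1), encoder_outputs)
        return context, attention_weights
```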

Step 2: Create Sample Input

```python
torch.manual_seed(0)  # fixed seed so the random tensors are reproducible

batch_size = 1
seq_len = 4
hidden_dim = 8
encoder_outputs = torch.randn(batch_size, seq_len, hidden_dim)
decoder_hidden = torch.randn(batch_size, hidden_dim)
```

Step 3: Initialize and Run Attention

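**1. Attention(hidden_dim):** constructs the attention module with the chosen hidden dimension.

**2. Calling the module returns:** the context vector of shape (batch_size, 1, hidden_dim) and the attention weights of shape (batch_size, seq_len).

```python
attention = Attention(hidden_dim)
context, attn_weights = attention(decoder_hidden, encoder_outputs)
```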

Step 4: Inspect Result

```python
print("Encoder Outputs:\n", encoder_outputs)
print("\nDecoder Hidden State:\n", decoder_hidden)
print("\nAttention Weights:\n", attn_weights)
print("\nContext Vector:\n", context)
```

**Output:**

Result

You can download source code from here.

Applications

Advantages

Limitations