
Self-Attention in NLP

Last Updated : 23 Aug, 2025

In Transformer models, self-attention lets the model look at all words in a sentence at once, but on its own it has no sense of word order, and word order matters in language. To solve this, Transformers use positional embeddings: extra information added to each word's representation that tells the model where the word appears in the sentence. The model can then use both the meaning of each word and its position to process sentences more effectively.

Attention in NLP

  • The goal of the self-attention mechanism is to improve on traditional encoder decoder models used with RNNs (Recurrent Neural Networks).
  • In a traditional encoder decoder model, the input sequence is compressed into a single fixed-length vector which is then used to generate the output.
  • This works well for short sequences but struggles with long ones because important information can be lost when everything is squeezed into a single vector.
  • The self-attention mechanism was introduced to overcome this problem.

Encoder Decoder Model

An encoder decoder model is used in machine learning tasks that involve sequences, such as translating sentences, generating text or creating captions for images. Here's how it works:

  • Encoder: It takes the input sequence, such as a sentence, and processes it, converting the input into a fixed-size summary called a latent vector or context vector. This vector holds all the important information from the input sequence.
  • Decoder: It then uses this summary to generate an output sequence, such as a translated sentence, reconstructing the desired output from the encoded information.
Figure: Encoder Decoder Model
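
To make this bottleneck concrete, here is a minimal NumPy sketch (not a trained model): a toy "encoder" compresses a sequence of word vectors into a single context vector by averaging them, and a toy "decoder" generates its first step from that summary alone. The embedding size and weight matrix are arbitrary stand-ins for illustration.

Python
import numpy as np

# Toy sequence of 5 word embeddings, each of dimension 8 (values are illustrative)
np.random.seed(0)
inputs = np.random.rand(5, 8)

# "Encoder": compress the whole sequence into ONE fixed-size context vector.
# Real encoders use RNNs/Transformers; mean-pooling only illustrates the bottleneck.
context_vector = inputs.mean(axis=0)          # shape (8,)

# "Decoder": generate each output step from that single summary vector.
W_dec = np.random.rand(8, 8)                  # hypothetical decoder weights
first_output_step = np.tanh(context_vector @ W_dec)

print(context_vector.shape)   # (8,) -- everything the decoder sees about the input

No matter how long the input is, the decoder only ever sees the single 8-dimensional summary, which is exactly the limitation self-attention removes.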

Attention Layer in Transformer

  1. Input Embedding: The input text, such as a sentence, is first converted into embeddings. These are vector representations of words in a continuous space.
  2. Positional Encoding: Since the Transformer does not process words sequentially like RNNs, positional encodings are added to the input embeddings to encode the position of each word in the sentence (see the sketch after this list).
  3. Multi Head Attention: Multiple attention heads are applied in parallel to process different parts of the sequence simultaneously. Each head computes attention scores from queries (Q), keys (K) and values (V) and gathers information from different parts of the input.
  4. Add and Norm: This layer applies residual connections and layer normalization, which helps avoid vanishing gradient problems and keeps training stable.
  5. Feed Forward: The attention output is passed through a feed forward neural network for further transformation.
  6. Masked Multi Head Attention for the Decoder: This is used in the decoder and ensures that each word can only attend to previous words in the sequence, not future ones.
  7. Output Embedding: Finally, the transformed output is mapped to the output space and passed through a softmax function to generate output probabilities.
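
As a rough illustration of step 2, the sketch below builds the sinusoidal positional encodings proposed in the original Transformer paper and adds them to a batch of token embeddings; the sequence length and model dimension here are chosen arbitrarily.

Python
import numpy as np

def sinusoidal_positional_encoding(seq_len, d_model):
    # positions 0..seq_len-1 and the even feature indices 0, 2, 4, ...
    positions = np.arange(seq_len)[:, np.newaxis]                      # (seq_len, 1)
    div_term = np.exp(np.arange(0, d_model, 2) * (-np.log(10000.0) / d_model))
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(positions * div_term)                         # even dims: sine
    pe[:, 1::2] = np.cos(positions * div_term)                         # odd dims: cosine
    return pe

embeddings = np.random.rand(6, 16)                                     # 6 tokens, d_model = 16
encoded = embeddings + sinusoidal_positional_encoding(6, 16)
print(encoded.shape)  # (6, 16)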

Self Attention Mechanism

This mechanism captures long-range dependencies by calculating attention between all words in the sequence, letting the model look at the entire sequence at once. Unlike traditional models that process words one by one, it lets the model find which words are most relevant to each other, which is helpful for tasks like translation or text generation.

Here’s how the self attention mechanism works:

  1. Input Vectors and Weight Matrices: Each encoder input vector is multiplied by three trained weight matrices (W(Q), W(K), W(V)) to generate the query, key and value vectors.
  2. Query Key Interaction: The query vector of the current input is multiplied by the key vectors of all inputs to calculate the attention scores.
  3. Scaling Scores: The attention scores are divided by the square root of the key vector's dimension d_k (typically d_k = 64, giving a scaling factor of 8) to prevent the values from becoming too large and making training unstable.
  4. Softmax Function: The softmax function is applied to the scaled scores to normalize them into probabilities.
  5. Weighted Value Vectors: The softmax scores are multiplied by the corresponding value vectors.
  6. Summing Weighted Vectors: The weighted value vectors are summed to produce the self-attention output for that input.

The above procedure is applied to every position in the input sequence. In matrix form, self-attention for the input matrices (Q, K, V) is calculated as:

\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^{T}}{\sqrt{d_{k}}}\right)V

where Q, K and V are the matrices whose rows are the query, key and value vectors of all positions in the sequence.
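
As a quick numeric check of this formula, the snippet below applies it to tiny hand-picked Q, K and V matrices (2 tokens, d_k = 2). The numbers are arbitrary; the point is that each row of the softmax output sums to 1 and the result is a weighted average of the value vectors.

Python
import numpy as np
from scipy.special import softmax

Q = np.array([[1.0, 0.0], [0.0, 1.0]])   # 2 query vectors, d_k = 2
K = np.array([[1.0, 0.0], [1.0, 1.0]])   # 2 key vectors
V = np.array([[0.0, 2.0], [4.0, 6.0]])   # 2 value vectors

scores = Q @ K.T / np.sqrt(K.shape[1])    # QK^T / sqrt(d_k)
weights = softmax(scores, axis=-1)        # each row sums to 1
output = weights @ V                      # weighted average of the value vectors

print(weights)   # attention probabilities
print(output)    # self-attention output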

Multi Head Attention

In the multi-head attention mechanism, multiple attention heads run in parallel, allowing the model to focus on different parts of the input sequence simultaneously. This increases the model's ability to capture various relationships between words in the sequence.

\text{MultiHead}(Q, K, V) = \text{concat}\left(head_{1}, head_{2}, \dots, head_{h}\right)W_{O}

Here’s a step by step breakdown of how multi headed attention works:

Figure: Multi-Head Attention
  1. Generate Embeddings: For each word in the input sentence, generate its embedding representation.
  2. Create Multiple Attention Heads: Create h attention heads (e.g. h = 8), each with its own weight matrices W(Q), W(K), W(V).
  3. Matrix Multiplication: Multiply the input matrix by each head's weight matrices W(Q), W(K), W(V) to produce that head's query, key and value matrices.
  4. Apply Attention: Apply the attention mechanism to the query, key and value matrices of each head, producing an output matrix per head.
  5. Concatenate and Transform: Concatenate the output matrices from all attention heads and multiply the result by the output weight matrix W_{O} to generate the final output of the multi-head attention layer.
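
The NumPy sketch below mirrors these steps with h = 2 heads on a toy input; the random matrices are stand-ins for the learned parameters W(Q), W(K), W(V) and W_O.

Python
import numpy as np
from scipy.special import softmax

np.random.seed(1)
seq_len, d_model, h = 3, 8, 2
d_k = d_model // h                       # per-head dimension
X = np.random.rand(seq_len, d_model)     # toy input embeddings

W_o = np.random.randn(h * d_k, d_model)  # output projection W_O

head_outputs = []
for _ in range(h):
    # Each head has its own (random stand-in) projection matrices
    W_q = np.random.randn(d_model, d_k)
    W_k = np.random.randn(d_model, d_k)
    W_v = np.random.randn(d_model, d_k)
    Q, K, V = X @ W_q, X @ W_k, X @ W_v
    weights = softmax(Q @ K.T / np.sqrt(d_k), axis=-1)
    head_outputs.append(weights @ V)     # (seq_len, d_k)

# Concatenate the heads and apply the output projection
multi_head_output = np.concatenate(head_outputs, axis=-1) @ W_o
print(multi_head_output.shape)           # (3, 8)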

Use in Transformer Architecture

  • Encoder Decoder Attention: In this layer queries come from the previous decoder layer while the keys and values come from the encoder’s output. This allows each position in the decoder to focus on all positions in the input sequence.
  • Encoder Self Attention: This layer receives queries, keys and values from the output of the previous encoder layer. Each position in the encoder looks at all positions from the previous layer to calculate attention scores.
Figure: Encoder Self-Attention
  • Decoder Self Attention: Similar to the encoder's self-attention, but here the queries, keys and values come from the previous decoder layer. Each position can attend to the current and previous positions, while future positions are masked to prevent the model from looking ahead when generating the output; this is called masked self-attention (a minimal sketch of the masking follows below).
Figure: Decoder Self-Attention
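
Here is a minimal sketch of the masking idea: entries above the diagonal of the score matrix are set to -inf before the softmax, so every token's attention weights over future positions become exactly zero. The scores themselves are random placeholders.

Python
import numpy as np
from scipy.special import softmax

seq_len = 4
scores = np.random.rand(seq_len, seq_len)          # placeholder attention scores

# Causal mask: position i may only attend to positions <= i
mask = np.triu(np.ones((seq_len, seq_len), dtype=bool), k=1)
scores[mask] = -np.inf                             # block future positions

weights = softmax(scores, axis=-1)
print(np.round(weights, 2))   # the upper triangle is exactly 0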

Implementation

Step 1: Import Necessary Libraries

These lines import numpy for matrix operations and softmax from scipy.special to convert attention scores into probability distributions.

Python
import numpy as np
from scipy.special import softmax

Step 2: Extract Dimensions

This function starts by extracting the input shape: batch size, sequence length and model dimension. It sets d_k, the dimension of keys and queries equal to the model dimension for simplicity.

Python
def self_attention(X):
    batch_size, seq_len, d_model = X.shape
    d_k = d_model

Step 3: Initialize Weight Matrices

These lines initialize random weight matrices for queries (W_q), keys (W_k) and values (W_v). In real models these are learnable parameters used to project the input into Q, K, and V representations.

Python
    # Random projection matrices (learnable parameters in a real model)
    W_q = np.random.randn(d_model, d_k)
    W_k = np.random.randn(d_model, d_k)
    W_v = np.random.randn(d_model, d_k)

Step 4: Compute Q, K, V matrices

These lines project the input X into query (Q), key (K) and value (V) matrices by multiplying with their respective weights. This transforms the input into different views used for computing attention.

Python
    # Project the input into query, key and value representations
    Q = X @ W_q
    K = X @ W_k
    V = X @ W_v

Step 5: Compute Attention scores and weights

These lines compute the attention scores as the scaled dot product of the queries and keys, apply softmax to turn them into attention weights, then produce the final output by weighting the values (V) with those weights, aggregating relevant information from the sequence. The function returns both the attention output and the attention weights for further use or analysis.

Python
    scores = Q @ K.transpose(0, 2, 1) / np.sqrt(d_k)  # scaled dot-product scores
    attention_weights = softmax(scores, axis=-1)      # normalize into probabilities
    output = attention_weights @ V                    # weighted sum of value vectors
    return output, attention_weights

Step 6: Example Usage

This sets a random seed for reproducibility, creates a sample input tensor with shape (1, 3, 4), runs the self-attention function on it and then prints the resulting output and attention weights.

Python
np.random.seed(42)
X = np.random.rand(1, 3, 4)
output, weights = self_attention(X)

print("Output:\n", output)
print("\nAttention Weights:\n", weights)

Output:

The program prints the attention output of shape (1, 3, 4) and the attention weight matrix of shape (1, 3, 3), whose rows sum to 1.

Advantages

  1. Parallelization: Unlike sequential models, it allows full parallel processing, which speeds up training.
  2. Long Range Dependencies: It provides direct access to distant elements making it easier to model complex structures and relationships across long sequences.
  3. Contextual Understanding: Each token’s representation is influenced by the entire sequence which integrates global context and improves accuracy.
  4. Interpretable Weights: Attention maps can show which parts of the input were most influential in making decisions.
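
As a small illustration of point 4, the sketch below plots a self-attention weight matrix as a heatmap with matplotlib; the weight values and token labels are made up for the example (in practice the weights would come from a model, such as the self_attention function above).

Python
import numpy as np
import matplotlib.pyplot as plt

# Example attention weights (rows sum to 1); in a real setting these come from a model
weights = np.array([[0.70, 0.20, 0.10],
                    [0.15, 0.75, 0.10],
                    [0.05, 0.25, 0.70]])
tokens = ["the", "cat", "sat"]   # hypothetical token labels for illustration

plt.imshow(weights, cmap="viridis")
plt.xticks(range(3), tokens)
plt.yticks(range(3), tokens)
plt.colorbar(label="attention weight")
plt.title("Self-attention weights")
plt.show()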

Challenges

  1. Computational Cost: Self-attention requires computing pairwise interactions between all input tokens, giving a time and memory complexity of O(n^2), where n is the sequence length. This becomes inefficient for long sequences.
  2. Memory Usage: The large number of pairwise calculations uses a lot of memory when working with very long sequences or large batch sizes.
  3. Lack of Local Context: It focuses on global dependencies across all tokens and may not effectively capture local patterns, which can be inefficient when local context matters more than global context.
  4. Overfitting: Because of its ability to model complex relationships, it can overfit when trained on small datasets.
