Self-Attention in NLP
In Transformer models, self-attention allows the model to look at all words in a sentence at once, but it does not naturally understand the order of those words. This is a problem because word order matters in language. To solve this, Transformers use positional embeddings: extra information added to each word's embedding that tells the model where the word appears in the sentence. This helps the model understand both the meaning of each word and its position, so it can process sentences more effectively.
Attention in NLP
- The goal of the self-attention mechanism is to improve on traditional models such as the encoder-decoder models used with RNNs (Recurrent Neural Networks).
- In traditional encoder-decoder models, the input sequence is compressed into a single fixed-length vector which is then used to generate the output.
- This works well for short sequences but struggles with long ones because important information can be lost when compressed into a single vector.
- The self-attention mechanism was introduced to overcome this problem.
Encoder-Decoder Model
An encoder-decoder model is used in machine learning tasks that involve sequences, like translating sentences, generating text or creating captions for images. Here's how it works:
- Encoder: It takes the input sequence, such as a sentence, and processes it. It converts the input into a fixed-size summary called a latent vector or context vector. This vector holds all the important information from the input sequence.
- Decoder: It then uses this summary to generate an output sequence, such as a translated sentence. It tries to reconstruct the desired output based on the encoded information, as illustrated by the sketch below.
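To make the idea concrete, here is a minimal toy sketch (not the article's implementation) in which an encoder squashes a sequence into one context vector and a decoder generates every output step from that single vector; the names toy_encoder and toy_decoder and the random weights are purely illustrative.

import numpy as np

def toy_encoder(inputs, W_enc):
    # inputs: (seq_len, d_in) word vectors, W_enc: (d_in, d_ctx)
    hidden = inputs @ W_enc            # project every word
    return hidden.mean(axis=0)         # compress the whole sequence into one context vector

def toy_decoder(context, W_dec, steps=3):
    # every output step is generated from the same fixed context vector,
    # which is why very long inputs tend to lose information in this setup
    return np.stack([context @ W_dec for _ in range(steps)])

rng = np.random.default_rng(0)
X = rng.random((5, 4))                 # 5 input words, 4-dimensional embeddings
context = toy_encoder(X, rng.random((4, 8)))
Y = toy_decoder(context, rng.random((8, 4)))
print(context.shape, Y.shape)          # (8,) (3, 4)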

Attention Layer in Transformer
- Input Embedding: Input text, such as a sentence, is first converted into embeddings. These are vector representations of words in a continuous space.
- Positional Encoding: Since the Transformer doesn't process words sequentially like RNNs, positional encodings are added to the input embeddings to encode the position of each word in the sentence (a small sketch follows this list).
- Multi-Head Attention: Multiple attention heads are applied in parallel to process different parts of the sequence simultaneously. Each head computes attention scores from queries (Q), keys (K) and values (V) and gathers information from different parts of the input.
- Add and Norm: This layer applies residual connections and layer normalization. This helps avoid vanishing gradient problems and ensures stable training.
- Feed Forward: The attention output is passed through a feed-forward neural network for further transformation.
- Masked Multi-Head Attention for the Decoder: This is used in the decoder and ensures that each word can only attend to previous words in the sequence, not future ones.
- Output Embedding: Finally, the transformed output is mapped to the final output space and processed by a softmax function to generate output probabilities.
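The positional encoding step can be sketched with the sinusoidal scheme from the original Transformer paper. The following is a minimal illustration; the function name sinusoidal_positional_encoding and its arguments are chosen here for clarity and are not part of the code later in this article.

import numpy as np

def sinusoidal_positional_encoding(seq_len, d_model):
    # one row per position: sine on even channels, cosine on odd channels
    positions = np.arange(seq_len)[:, np.newaxis]                     # (seq_len, 1)
    div_term = np.exp(np.arange(0, d_model, 2) * (-np.log(10000.0) / d_model))
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(positions * div_term)
    pe[:, 1::2] = np.cos(positions * div_term)
    return pe                                                          # added to the word embeddings

# Positional encodings for a 3-word sentence with 4-dimensional embeddings
print(sinusoidal_positional_encoding(3, 4))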
Self-Attention Mechanism
This mechanism captures long-range dependencies by calculating attention between all words in the sequence, letting the model look at the entire sequence at once. Unlike traditional models that process words one by one, it lets the model find which words are most relevant to each other, which is helpful for tasks like translation or text generation.
Here's how the self-attention mechanism works:
- Input Vectors and Weight Matrices: Each encoder input vector is multiplied by three trained weight matrices (W_Q, W_K, W_V) to generate the query, key and value vectors.
- Query-Key Interaction: Multiply the query vector of the current input by the key vectors of all inputs to calculate the attention scores.
- Scaling Scores: Attention scores are divided by the square root of the key vector's dimension (d_k), usually 64, to prevent the values from becoming too large and making calculations unstable.
- Softmax Function: Apply the softmax function to the scaled attention scores to normalize them into probabilities.
- Weighted Value Vectors: Multiply the softmax scores by the corresponding value vectors.
- Summing Weighted Vectors: Sum the weighted value vectors to produce the self-attention output for the input.
The above procedure is applied to all input sequences. Mathematically, self-attention for the input matrices (Q, K, V) is calculated as:
Attention\left ( Q, K, V \right ) = softmax\left ( \frac{QK^{T}}{\sqrt{d_{k}}} \right )V
where Q, K and V are the query, key and value matrices and d_k is the dimension of the key vectors.
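As a quick numeric illustration of this formula, the snippet below applies it to tiny, arbitrary Q, K and V matrices for a two-token sequence; the values are made up purely to show the computation.

import numpy as np
from scipy.special import softmax

Q = np.array([[1.0, 0.0], [0.0, 1.0]])   # two queries, d_k = 2
K = np.array([[1.0, 0.0], [1.0, 1.0]])   # two keys
V = np.array([[1.0, 2.0], [3.0, 4.0]])   # two values
d_k = K.shape[-1]

scores = Q @ K.T / np.sqrt(d_k)          # scaled dot-product scores
weights = softmax(scores, axis=-1)       # each row sums to 1
print(weights @ V)                        # Attention(Q, K, V)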
Multi-Head Attention
In the multi-head attention mechanism, multiple attention heads are used in parallel, which allows the model to focus on different parts of the input sequence simultaneously. This increases the model's ability to capture various relationships between words in the sequence.
MultiHead\left ( Q, K, V \right ) = concat\left ( head_{1}, head_{2}, ..., head_{n} \right )W_{O}
Here's a step-by-step breakdown of how multi-head attention works:

- Generate Embeddings: For each word in the input sentence, generate its embedding representation.
- Create Multiple Attention Heads: Create h (e.g. h = 8) attention heads, each with its own weight matrices W_Q, W_K, W_V.
- Matrix Multiplication: Multiply the input matrix by the weight matrices W_Q, W_K, W_V of each attention head to produce its query, key and value matrices.
- Apply Attention: Apply the attention mechanism to the query, key and value matrices of each attention head, producing an output matrix from each head.
- Concatenate and Transform: Concatenate the output matrices from all attention heads and multiply the result by the weight matrix W_O to generate the final output of the multi-head attention layer (a small sketch follows this list).
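The steps above can be sketched in a few lines of numpy, reusing the scaled dot-product formula from the previous section. This is only an illustration: multi_head_attention is a made-up helper and the weight matrices are random rather than learned.

import numpy as np
from scipy.special import softmax

def multi_head_attention(X, num_heads=2):
    # X: (seq_len, d_model); each head gets its own W_Q, W_K, W_V
    seq_len, d_model = X.shape
    d_k = d_model // num_heads
    heads = []
    for _ in range(num_heads):
        W_q, W_k, W_v = (np.random.randn(d_model, d_k) for _ in range(3))
        Q, K, V = X @ W_q, X @ W_k, X @ W_v
        weights = softmax(Q @ K.T / np.sqrt(d_k), axis=-1)
        heads.append(weights @ V)                    # (seq_len, d_k) per head
    W_o = np.random.randn(num_heads * d_k, d_model)  # output projection
    return np.concatenate(heads, axis=-1) @ W_o      # (seq_len, d_model)

np.random.seed(0)
print(multi_head_attention(np.random.rand(3, 4)).shape)   # (3, 4)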
Use in Transformer Architecture
- Encoder Decoder Attention: In this layer queries come from the previous decoder layer while the keys and values come from the encoder’s output. This allows each position in the decoder to focus on all positions in the input sequence.
- Encoder Self Attention: This layer receives queries, keys and values from the output of the previous encoder layer. Each position in the encoder looks at all positions from the previous layer to calculate attention scores.
- Decoder Self Attention: Similar to the encoder's self attention, but here the queries, keys and values come from the previous decoder layer. Each position can attend to the current and previous positions, while future positions are masked to prevent the model from looking ahead when generating the output; this is called masked self attention (see the mask sketch below).
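Masking can be sketched by filling the upper triangle of the score matrix with -inf before the softmax, so every token gets zero weight on future positions. The score matrix below is random and merely stands in for QK^T / sqrt(d_k).

import numpy as np
from scipy.special import softmax

seq_len = 4
scores = np.random.rand(seq_len, seq_len)                        # stand-in attention scores
mask = np.triu(np.ones((seq_len, seq_len), dtype=bool), k=1)     # True above the diagonal
scores = np.where(mask, -np.inf, scores)                         # block future positions
weights = softmax(scores, axis=-1)
print(np.round(weights, 2))                                       # upper triangle is all zeros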
Implementation
Step 1: Import the Necessary Libraries
These lines import numpy for matrix operations and softmax from scipy.special to convert attention scores into probability distributions.
import numpy as np
from scipy.special import softmax
Step 2: Extract Dimensions
This function starts by extracting the input shape: batch size, sequence length and model dimension. It sets d_k, the dimension of keys and queries, equal to the model dimension for simplicity.
def self_attention(X):
    batch_size, seq_len, d_model = X.shape
    d_k = d_model
Step 3: Initialize Weight Matrices
These lines initialize random weight matrices for queries (W_q), keys (W_k) and values (W_v). In real models these are learnable parameters used to project the input into Q, K, and V representations.
    W_q = np.random.randn(d_model, d_k)
    W_k = np.random.randn(d_model, d_k)
    W_v = np.random.randn(d_model, d_k)
Step 4: Compute Q, K, V matrices
These lines project the input X into query (Q), key (K) and value (V) matrices by multiplying with their respective weights. This transforms the input into different views used for computing attention.
    Q = X @ W_q
    K = X @ W_k
    V = X @ W_v
Step 5: Compute Attention scores and weights
These lines compute the raw attention scores as the scaled dot product of Q and K, turn them into attention weights with softmax, and then weight the values (V) with those weights to aggregate relevant information from the sequence. The function returns both the attention output and the attention weights for further use or analysis.
    scores = Q @ K.transpose(0, 2, 1) / np.sqrt(d_k)   # scaled dot-product scores
    attention_weights = softmax(scores, axis=-1)        # normalize scores for each query
    output = attention_weights @ V
    return output, attention_weights
Step 6: Example Usage
This sets a random seed for reproducibility, creates a sample input tensor with shape (1, 3, 4), runs the self-attention function on it and then prints the resulting output and attention weights.
np.random.seed(42)
X = np.random.rand(1, 3, 4)
output, weights = self_attention(X)
print("Output:\n", output)
print("\nAttention Weights:\n", weights)
Output:

Advantages
- Parallelization: Unlike sequential models, it allows full parallel processing, which speeds up training.
- Long-Range Dependencies: It provides direct access to distant elements, making it easier to model complex structures and relationships across long sequences.
- Contextual Understanding: Each token's representation is influenced by the entire sequence, which integrates global context and improves accuracy.
- Interpretable Weights: Attention maps can show which parts of the input were most influential in making decisions.
Challenges
- Computational Cost: Self-attention requires computing pairwise interactions between all input tokens, which leads to a time and memory complexity of O(n^2), where n is the sequence length. This becomes inefficient for long sequences.
- Memory Usage: The large number of pairwise calculations in self-attention uses a lot of memory when working with very long sequences or large batch sizes.
- Lack of Local Context: It focuses on global dependencies across all tokens but may not capture local patterns effectively. This can cause inefficiencies when local context matters more than global context.
- Overfitting: Because of its ability to model complex relationships, it can overfit when trained on small datasets.