ML - Attention mechanism

Last Updated : 02 May, 2025

The attention mechanism helps models focus on the most important parts of the input data, much like humans prioritize certain information in a complex environment. It improves a model's ability to perform tasks like language translation, image recognition and speech processing. In this article, we will see the attention mechanism in detail: how it works and its applications.

Understanding Attention Mechanism

The attention mechanism is a neural network component that helps a model focus on specific parts of the input data. It does this by assigning weights to the different elements of the input, which lets the model decide which parts of the information are most important. This makes the model better at understanding complex relationships and dependencies in data, helps it manage long-term dependencies and improves its ability to focus on important features.

For example, a recurrent neural network (RNN) with attention can pick different parts of an image to explore over time. This method can work better than traditional convolutional neural networks (CNNs) for classification tasks and can be used in applications like robotics, where it helps robots decide how to act based on feedback from previous actions. This makes the system more flexible and able to learn and adapt over time.

How Does the Attention Mechanism Work?

In a neural network model, attention works through the following steps:

1. Input Encoding: Input data is transformed into a format that the model can process, creating representations of the data.

2. Query Generation: A query vector is generated based on the current state or context of the model. This query tells the model what it is looking for in the input data.

3. Key-Value Pair Creation: The input is split into key-value pairs:

  • Keys represent the information used to measure how relevant each part of the data is.
  • Values hold the actual data.

4. Similarity Computation: The model calculates the similarity between the query vector and each key. This shows how relevant each part of the input is. Various methods can be used to calculate this similarity, such as dot products or cosine similarity.

\text{score}(s,i) = \begin{cases} h_s \cdot y_i & \text{dot product} \\ h_s^{T} W y_i & \text{general} \\ v^{T}\tanh\left(W\left[h_s ; y_i\right]\right) & \text{concat} \end{cases}

where

  • h_s: encoder (source) hidden state at position s
  • y_i: decoder (target) hidden state at position i
  • W: weight matrix
  • v: weight vector
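As a concrete illustration, here is a minimal NumPy sketch of the three scoring functions above. The variable names mirror the notation in the formula, while the dimensions and random values are illustrative assumptions, not anything prescribed by the mechanism itself.

import numpy as np

d = 4                                  # hidden-state size (assumed)
rng = np.random.default_rng(0)
h_s = rng.standard_normal(d)           # encoder (source) hidden state
y_i = rng.standard_normal(d)           # decoder (target) hidden state
W = rng.standard_normal((d, d))        # weight matrix for the general score
W_c = rng.standard_normal((d, 2 * d))  # weight matrix for the concat score
v = rng.standard_normal(d)             # weight vector for the concat score

dot_score = h_s @ y_i                                         # dot product
general_score = h_s @ W @ y_i                                 # general
concat_score = v @ np.tanh(W_c @ np.concatenate([h_s, y_i]))  # concat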

5. Attention Weights Calculation: The similarity scores are passed through a softmax function to obtain attention weights. These weights indicate the importance of each key-value pair.

\text{Attention Weight: } \alpha(s,i) = \text{softmax}\left(\text{score}(s,i)\right)

6. Weighted Sum: The attention weights are applied to the corresponding values to generate a weighted sum. This step aggregates the relevant information from the input according to the importance calculated by the attention mechanism.

c_i = \sum_{s=1}^{T_s}\alpha(s,i)\,h_s

Here

  • Ts:  Total number of key-value pairs (source hidden states) in the encoder.

7. Context Vector: The weighted sum acts as a context vector which represents the attended information from the input. It captures the relevant context for the current task (the sketch after this list ties steps 4 to 7 together).

8. Integration with the Model: The context vector is combined with the model's current state or hidden representation, providing additional information for the following steps of the model.

9. Repeat: Steps 2 to 8 are repeated for each step of the model, which allows the attention mechanism to focus on different parts of the input sequence or data.
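The following NumPy sketch ties steps 4 to 7 together using a simple dot-product score. The function and variable names, shapes and values are illustrative assumptions rather than part of any fixed API.

import numpy as np

def attention(query, keys, values):
    # query: (d,); keys, values: (T_s, d). Returns context vector and weights.
    scores = keys @ query                            # step 4: dot-product scores
    scores = scores - scores.max()                   # subtract max for numerical stability
    weights = np.exp(scores) / np.exp(scores).sum()  # step 5: softmax -> alpha
    context = weights @ values                       # steps 6-7: weighted sum
    return context, weights

rng = np.random.default_rng(0)
states = rng.standard_normal((5, 4))  # encoder hidden states (keys = values here)
query = rng.standard_normal(4)        # query from the decoder side
context, alpha = attention(query, states, states)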

Attention Mechanism Architecture for Machine Translation

The attention mechanism in machine translation has three main components: Encoder, Attention and Decoder. Here's how each of these components works:

[Figure: Encoder-Decoder with Attention]

1. Encoder: 

The encoder processes the input sequence, which is usually a sentence, and generates hidden states. It uses recurrent neural networks (RNNs), LSTMs, GRUs or transformer-based models to process the input step by step. At each step the encoder produces a hidden state that combines the information from the previous hidden state and the current input token. This allows the encoder to capture the relationships between the tokens in the input sequence. As the input sequence is processed, the encoder generates a series of hidden states which together represent the entire input sequence. These hidden states are passed on to the attention mechanism for further processing.

The encoder is important because it creates the representation of the input sequence that the attention mechanism and decoder will later use to generate an accurate output sequence.

Encoder

It contains an RNN layer (which can be an LSTM or a GRU):

  • Let's say there is a 4-word sentence; then the inputs will be: x_{0}, x_{1}, x_{2}, x_{3}
  • Each input first goes through an embedding layer and is then processed by the recurrent layer (RNN, LSTM or GRU).
  • Each of the inputs generates a hidden representation.
  • This generates the outputs of the encoder: h_{0}, h_{1}, h_{2}, h_{3} (see the sketch below)
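A minimal PyTorch sketch of such an encoder is shown below; the vocabulary size, dimensions and token ids are illustrative assumptions.

import torch
import torch.nn as nn

vocab_size, embed_dim, hidden_dim = 100, 32, 64    # assumed sizes
embedding = nn.Embedding(vocab_size, embed_dim)    # token -> vector
encoder_rnn = nn.GRU(embed_dim, hidden_dim, batch_first=True)

tokens = torch.tensor([[5, 12, 7, 3]])             # x_0..x_3 for one sentence
hidden_states, _ = encoder_rnn(embedding(tokens))  # (1, 4, hidden_dim): h_0..h_3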

2. Attention: 

The attention component finds the importance of each encoder hidden state with respect to the current target hidden state. It generates a context vector that captures the relevant information from the encoder's hidden states. Its mechanism can be represented mathematically as follows:

  • Our goal is to generate the context vectors.
  • For example, context vector C_{1} tells us how much importance (attention) should be given to the inputs: x_{0}, x_{1}, x_{2}, x_{3}.
  • This layer in turn contains 3 subparts:
    1. Feed Forward Network
    2. Softmax Calculation
    3. Context vector generation
[Figure: Attention]

Feed Forward Network: 

It transforms the target hidden state into a form that is compatible with the attention mechanism. This transformation is done using a linear transformation followed by a non-linear activation function like ReLU or sigmoid. This step helps in finding the alignment between the target and the encoder's hidden states.

[Figure: Feed-Forward Network]

Each A_{00}, A_{01}, A_{02}, A_{03} is a simple feed-forward neural network with one hidden layer. The inputs for this feed-forward network are:

  • the previous decoder state, and
  • the outputs of the encoder states.

Each unit generates an output e_{00}, e_{01}, e_{02}, e_{03}, i.e. e_{0i} = g(S_{0}, h_{i}). Here g can be any activation function such as sigmoid, tanh or ReLU.
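A PyTorch sketch of one such alignment unit might look as follows; the module structure (a linear layer, a tanh and a scalar projection) and all sizes are illustrative assumptions, not the only way to build g.

import torch
import torch.nn as nn

hidden_dim, attn_dim = 64, 32                         # assumed sizes

class AlignmentUnit(nn.Module):
    def __init__(self):
        super().__init__()
        self.W = nn.Linear(2 * hidden_dim, attn_dim)  # linear transformation
        self.v = nn.Linear(attn_dim, 1, bias=False)   # project to scalar e_{0i}

    def forward(self, s_prev, h_i):
        # concatenate decoder state S_0 and encoder state h_i, then score
        return self.v(torch.tanh(self.W(torch.cat([s_prev, h_i], dim=-1))))

unit = AlignmentUnit()
s_0 = torch.randn(1, hidden_dim)  # previous decoder state
h_i = torch.randn(1, hidden_dim)  # one encoder hidden state
e_0i = unit(s_0, h_i)             # raw alignment score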

Softmax Calculation: 

Once the feed-forward network generates the relevance scores, these scores are passed through a softmax function. It converts the scores into probability-like attention weights. These weights indicate the relative importance of each encoder hidden state in generating the current target token. For example, a higher attention weight shows that a particular encoder hidden state is more important for the current step in the decoding process.

[Figure: Softmax calculation]

E_{0i} = \frac{\exp(e_{0i})}{\sum_{k=0}^{3}\exp(e_{0k})}

These E_{00}, E_{01}, E_{02}, E_{03} are called the attention weights. They decide how much importance should be given to the inputs x_{0}, x_{1}, x_{2}, x_{3}.
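In code this is a plain softmax over the four scores; the score values below are made up purely for illustration.

import numpy as np

e = np.array([2.0, 0.5, -1.0, 0.1])  # e_{00}..e_{03} (made-up scores)
E = np.exp(e) / np.exp(e).sum()      # E_{00}..E_{03}: non-negative, sum to 1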

Context Vector Generation: 

Once the attention weights are calculated, they are applied to the encoder's hidden states to generate a weighted sum. This weighted sum forms the context vector, which gathers the most relevant information from the encoder's hidden states based on their attention weights.

[Figure: Context vector generation]

C_{0} = E_{00} \ast h_{0} + E_{01} \ast h_{1} + E_{02} \ast h_{2} + E_{03} \ast h_{3}

We find C_{1}, C_{2}, C_{3} in the same way and feed them to the corresponding RNN units of the decoder layer. So this final vector is the product of the probability distribution (the attention weights) and the encoder's outputs, which is nothing but the attention paid to the input words.
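The same weighted sum in NumPy, with placeholder values standing in for the real hidden states and weights:

import numpy as np

rng = np.random.default_rng(2)
h = rng.standard_normal((4, 64))      # encoder states h_0..h_3 (placeholders)
E = np.array([0.7, 0.2, 0.05, 0.05])  # attention weights E_{00}..E_{03}
C_0 = E @ h                           # E_{00}*h_0 + E_{01}*h_1 + E_{02}*h_2 + E_{03}*h_3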

3. Decoder: 

The decoder receives the context vector from the attention layer, which contains the relevant information from the encoder's hidden states, along with its own current hidden state. Using this combined information, the decoder predicts the next token in the output sequence. The decoder's hidden state is then updated based on the predicted token, and this process repeats for each token in the output sequence. This helps the model generate the entire output sequence, such as a translation, word by word, with each step focusing on the most important parts of the input.
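A minimal PyTorch sketch of one decoder step under these assumptions; the module names and sizes are illustrative, and real systems add details such as teacher forcing and beam search.

import torch
import torch.nn as nn

vocab_size, embed_dim, hidden_dim = 100, 32, 64
embedding = nn.Embedding(vocab_size, embed_dim)
decoder_cell = nn.GRUCell(embed_dim + hidden_dim, hidden_dim)
output_proj = nn.Linear(hidden_dim, vocab_size)

prev_token = torch.tensor([1])        # last predicted token
s_prev = torch.zeros(1, hidden_dim)   # previous decoder hidden state
context = torch.randn(1, hidden_dim)  # context vector from the attention layer

rnn_input = torch.cat([embedding(prev_token), context], dim=-1)
s_next = decoder_cell(rnn_input, s_prev)  # updated decoder state
logits = output_proj(s_next)              # scores over the vocabulary
next_token = logits.argmax(dim=-1)        # greedy next-token prediction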

Applications of Attention Mechanisms

  1. Machine Translation: It allows the model to focus on different parts of the source sentence when generating each word in the target sentence, which improves translation accuracy.
  2. Sentiment Analysis & Named Entity Recognition: In tasks like sentiment analysis, question answering and named entity recognition, it helps models focus on the key words, such as those that identify an entity.
  3. Text Summarization: It helps in selecting important information, letting the model create short and accurate summaries of large texts.
  4. Image Captioning: It helps image captioning models focus on specific parts of an image while generating captions, which improves the detail of the descriptions.

Attention mechanisms will keep getting better, making models more efficient and useful and helping them perform well on more tasks and handle even more complex problems.

