What is Layer Normalization?
Layer Normalization stabilizes and accelerates the training process in deep learning. In typical neural networks, the activations of each layer can vary drastically, which leads to issues such as exploding or vanishing gradients that slow down training. Layer Normalization addresses this by normalizing the output of each layer, helping to keep the activations within a stable range.
It works by normalizing the inputs to each neuron so that the mean activation becomes 0 and the variance becomes 1. Unlike Batch Normalization, which normalizes over the batch (i.e. across all samples in the batch), Layer Normalization normalizes over the features of each individual data point.
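To make the distinction concrete, here is a minimal sketch (assuming a toy input of shape (batch, features) and a small constant of 1e-5) showing that Layer Normalization takes its statistics along the feature dimension of each sample, while Batch Normalization takes them along the batch dimension of each feature:

import torch

x = torch.randn(4, 8)  # (batch, features)

# Layer Normalization: statistics per sample, over the feature dimension
ln_mean = x.mean(dim=1, keepdim=True)
ln_var = x.var(dim=1, unbiased=False, keepdim=True)
x_ln = (x - ln_mean) / torch.sqrt(ln_var + 1e-5)

# Batch Normalization (inference-style, no learnable parameters): statistics per feature, over the batch
bn_mean = x.mean(dim=0, keepdim=True)
bn_var = x.var(dim=0, unbiased=False, keepdim=True)
x_bn = (x - bn_mean) / torch.sqrt(bn_var + 1e-5)

print(x_ln.mean(dim=1))  # ~0 for every sample
print(x_bn.mean(dim=0))  # ~0 for every feature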
Working of Layer Normalization
Let's consider an example where we have three vectors:
- x_1 = [3.0, 5.0, 2.0, 8.0]
- x_2 = [1.0, 3.0, 5.0, 8.0]
- x_3 = [3.0, 2.0, 7.0, 9.0]
For each input vector x_i, Layer Normalization is applied in three steps.
1. Compute Mean and Variance for Each Feature
Mean and variance are calculated for each input, but instead of across the batch, they are computed over the features (i.e. per data point):
\mu_i = \frac{1}{H} \sum_{j=1}^{H} x_{i,j}
\sigma_i^2 = \frac{1}{H} \sum_{j=1}^{H} (x_{i,j} - \mu_i)^2
where H is the number of features in each input vector (here H = 4), \mu_i is the mean and \sigma_i^2 is the variance of input x_i.
Now, let's compute the mean and variance for each data point. For x_1 = [3.0, 5.0, 2.0, 8.0]:
- Mean ( \mu_1 ): \mu_1 = \frac{1}{4} (3.0 + 5.0 + 2.0 + 8.0) = \frac{18.0}{4} = 4.5
- Variance ( \sigma_1^2 ): \sigma_1^2 = \frac{1}{4} \left[ (3.0 - 4.5)^2 + (5.0 - 4.5)^2 + (2.0 - 4.5)^2 + (8.0 - 4.5)^2 \right] = \frac{21.0}{4} = 5.25
Similarly, for x_2 we get \mu_2 = 4.25 and \sigma_2^2 = 6.6875, and for x_3 we get \mu_3 = 5.25 and \sigma_3^2 = 8.1875.
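As a quick check, this small PyTorch sketch reproduces the statistics above (using the biased variance, i.e. dividing by H rather than H - 1, to match the formula):

import torch

# The three example vectors, one per row
x = torch.tensor([[3.0, 5.0, 2.0, 8.0],
                  [1.0, 3.0, 5.0, 8.0],
                  [3.0, 2.0, 7.0, 9.0]])

# Per-row mean and biased variance over the feature dimension
mu = x.mean(dim=1)
var = x.var(dim=1, unbiased=False)
print(mu)   # tensor([4.5000, 4.2500, 5.2500])
print(var)  # tensor([5.2500, 6.6875, 8.1875])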
2. Normalize the Input
Each feature is then normalized using the formula:
\hat{x}_{i,j} = \frac{x_{i,j} - \mu_i}{\sqrt{\sigma_i^2 + \epsilon}}
Here \epsilon is a small constant (e.g. 10^{-5}) added for numerical stability.
Now, we normalize each feature in each vector by subtracting the mean and dividing by the standard deviation (the square root of the variance plus the small constant \epsilon).
For x_1 (with \mu_1 = 4.5, \sigma_1^2 = 5.25), for example \hat{x}_{1,1} = \frac{3.0 - 4.5}{\sqrt{5.25 + 10^{-5}}} \approx -0.6547, giving:
\hat{x}_1 = [-0.6547, 0.2182, -1.0911, 1.5275]
For x_2 (with \mu_2 = 4.25, \sigma_2^2 = 6.6875):
\hat{x}_2 = [-1.2568, -0.4834, 0.2900, 1.4501]
For x_3 (with \mu_3 = 5.25, \sigma_3^2 = 8.1875):
\hat{x}_3 = [-0.7863, -1.1358, 0.6116, 1.3106]
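The same normalization can be checked in PyTorch. This standalone sketch repeats the step 1 statistics and then applies the formula above, assuming \epsilon = 10^{-5}:

import torch

# Example vectors and their per-row statistics (repeated so this snippet runs standalone)
x = torch.tensor([[3.0, 5.0, 2.0, 8.0],
                  [1.0, 3.0, 5.0, 8.0],
                  [3.0, 2.0, 7.0, 9.0]])
mu = x.mean(dim=1, keepdim=True)
var = x.var(dim=1, unbiased=False, keepdim=True)

# Step 2: subtract the mean and divide by sqrt(variance + eps)
eps = 1e-5
x_hat = (x - mu) / torch.sqrt(var + eps)
print(x_hat)
# Approximately:
# [[-0.6547,  0.2182, -1.0911,  1.5275],
#  [-1.2568, -0.4834,  0.2900,  1.4501],
#  [-0.7863, -1.1358,  0.6116,  1.3106]]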
3. Apply Scaling and Shifting
To ensure that the normalized activations can still represent a wide range of values, learnable parameters \gamma (scale) and \beta (shift) are applied to each normalized feature:
y_{i,j} = \gamma \hat{x}_{i,j} + \beta
This allows the network to scale and shift the normalized activations during training.
Here let's assume \gamma = 1.5 and \beta = 0.5 for every feature:
- For x_1 : y_1 = [-0.4820, 0.8273, -1.1366, 2.7913]
- For x_2 : y_2 = [-1.3851, -0.2250, 0.9350, 2.6751]
- For x_3 : y_3 = [-0.6795, -1.2037, 1.4174, 2.4658]
These are the final outputs after applying Layer Normalization to the three example vectors.
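These results can be reproduced with PyTorch's nn.LayerNorm. The sketch below sets the layer's learnable weight and bias to the assumed \gamma = 1.5 and \beta = 0.5 (by default they are initialized to 1 and 0) and applies it to the three example vectors:

import torch
import torch.nn as nn

x = torch.tensor([[3.0, 5.0, 2.0, 8.0],
                  [1.0, 3.0, 5.0, 8.0],
                  [3.0, 2.0, 7.0, 9.0]])

layer_norm = nn.LayerNorm(4)          # normalize over the last dimension of size 4
with torch.no_grad():
    layer_norm.weight.fill_(1.5)      # gamma
    layer_norm.bias.fill_(0.5)        # beta

y = layer_norm(x)
print(y)
# Approximately:
# [[-0.4820,  0.8273, -1.1366,  2.7913],
#  [-1.3851, -0.2250,  0.9350,  2.6751],
#  [-0.6795, -1.2037,  1.4174,  2.4658]]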
Implementation of Layer Normalization in a Simple Neural Network with PyTorch
We will use the PyTorch library for the implementation. The key building blocks in the code below are:
- nn.Linear(input_size, output_size): Creates a fully connected layer with the specified input and output dimensions.
- nn.LayerNorm(128): Applies Layer Normalization on the input of size 128.
- forward(self, x): Defines forward pass for the model by applying transformations to the input x step by step.
- torch.randn(10, 64): Generates a tensor of size (10, 64) filled with random values from a normal distribution.
- torch.relu(x): Applies ReLU (Rectified Linear Unit) activation function element-wise to x.
import torch
import torch.nn as nn

class SimpleNN(nn.Module):
    def __init__(self, input_size, output_size):
        super(SimpleNN, self).__init__()
        self.fc1 = nn.Linear(input_size, 128)
        self.layer_norm = nn.LayerNorm(128)
        self.fc2 = nn.Linear(128, output_size)

    def forward(self, x):
        x = self.fc1(x)
        x = self.layer_norm(x)  # normalize the 128 activations of each sample
        x = torch.relu(x)
        x = self.fc2(x)
        return x

input_data = torch.randn(10, 64)  # batch of 10 samples with 64 features each
model = SimpleNN(64, 10)
output = model(input_data)
print(output)
Output: a tensor of shape (10, 10) holding the model's raw outputs for the 10 input samples; the exact values change on every run because the input is generated randomly.
Advantages of Layer Normalization
- Works with Small Batches: It does not depend on batch statistics, which makes it ideal for small batches or for reinforcement learning settings where each input is processed individually.
- Stabilizes Learning: It normalizes activations within a layer, helping to prevent issues like exploding and vanishing gradients and ensuring smoother, more efficient training.
- Independent of Batch Size: Unlike Batch Normalization, it is calculated for each individual data point, which makes it more flexible and suitable for situations with variable batch sizes (see the sketch after this list).
- Works Well with RNNs: It is useful in Recurrent Neural Networks to maintain stable gradient flow, enhancing performance on sequential tasks.
- Faster Convergence: By normalizing activations it helps the model converge faster, allowing quicker training without extensive tuning of the batch size or learning rate.
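As a quick illustration of the batch-size independence mentioned above, the sketch below runs the same sample through nn.LayerNorm on its own and as part of a larger batch; the per-sample result is identical in both cases:

import torch
import torch.nn as nn

layer_norm = nn.LayerNorm(8)

sample = torch.randn(1, 8)                      # a single sample
batch = torch.cat([sample, torch.randn(5, 8)])  # the same sample inside a batch of 6

out_single = layer_norm(sample)
out_batched = layer_norm(batch)[:1]

# The first row of the batched output matches the single-sample output
print(torch.allclose(out_single, out_batched, atol=1e-6))  # True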
Applications of Layer Normalization
Layer Normalization is commonly used in various deep learning architectures:
- Recurrent Neural Networks (RNNs): In RNNs, LSTMs and GRUs it is used to stabilize training as Batch Normalization struggles with sequential data. It normalizes activations at each time step to maintain stable gradient flow.
- Transformers: Models like BERT and GPT apply Layer Normalization around the attention and feed-forward sub-layers (after each sub-layer in the original post-norm design, before it in pre-norm variants) to normalize activations, which improves the stability and efficiency of training; a simplified example is sketched after this list.
- Generative Models: In Generative Adversarial Networks it stabilizes the training of both the generator and discriminator networks.
- Speech Recognition: It is commonly applied in speech recognition systems to boost performance by normalizing the activations of the model's recurrent layers.
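As a simplified sketch of the Transformer usage mentioned above (a post-norm residual sub-layer, not the exact BERT or GPT implementation; d_model and n_heads are illustrative values):

import torch
import torch.nn as nn

class PostNormSelfAttentionBlock(nn.Module):
    """Simplified Transformer sub-layer: self-attention, residual connection, then LayerNorm."""
    def __init__(self, d_model=64, n_heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm = nn.LayerNorm(d_model)

    def forward(self, x):
        attn_out, _ = self.attn(x, x, x)      # self-attention over the sequence
        return self.norm(x + attn_out)        # add & norm (post-norm style)

x = torch.randn(2, 10, 64)                    # (batch, sequence, d_model)
print(PostNormSelfAttentionBlock()(x).shape)  # torch.Size([2, 10, 64])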
Layer Normalization is effective in scenarios where Batch Normalization is not practical, such as small batch sizes or sequential models like RNNs. It helps ensure a smoother and faster training process, leading to better performance across a wide range of applications.