What are Diffusion Models?
Diffusion models are a type of generative AI that creates new data, such as images, audio or even video, by starting from random noise and gradually transforming it into something meaningful. They work by simulating a diffusion process: during training, data is slowly corrupted by noise, and the model learns to reverse this corruption step by step. By doing so, the model learns to generate high-quality samples from scratch.
Understanding Diffusion Models

- Diffusion models are generative models that learn to reverse a diffusion process to generate data. The diffusion process involves gradually adding noise to data until it becomes pure noise.
- Through this process a simple distribution is transformed into a complex data distribution in a series of small incremental steps.
- Essentially these models operate as a reverse diffusion phenomenon where noise is introduced to the data in a forward manner and removed in a reverse manner to generate new data samples.
- By learning to reverse this process diffusion models start from noise and gradually denoise it to produce data that closely resembles the training examples.
Key Components
- Forward Diffusion Process: This process involves adding noise to the data in a series of small steps. Each step slightly increases the noise, making the data progressively more random until it resembles pure noise.
- Reverse Diffusion Process: The model learns to reverse the noise-adding steps. Starting from pure noise, the model iteratively removes the noise, generating data that matches the training distribution.
- Score Function: This function estimates the gradient of the log data density with respect to the data. It helps guide the reverse diffusion process to produce realistic samples.
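For intuition, the score has a simple closed form for a zero-mean Gaussian density, which is the kind of expression score-based methods learn to approximate for real data distributions:

```latex
\nabla_x \log \mathcal{N}(x; 0, \sigma^2 I) = -\frac{x}{\sigma^2}
```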
Architecture of Diffusion Models
The architecture of diffusion models typically involves two main components:
- Forward Diffusion Process
- Reverse Diffusion Process
1. Forward Diffusion Process
In this process noise is incrementally added to the data over a series of steps. This is akin to a Markov chain where each step slightly degrades the data by adding Gaussian noise.

Mathematically, this can be represented as:
q(x_t | x_{t-1}) = \mathcal{N}(x_t; \sqrt{\alpha_t} x_{t-1}, (1 - \alpha_t)I)
where,
- x_t is the noisy data at step t
- \alpha_t controls the amount of noise added at step t
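To make this concrete, here is a minimal sketch of one forward step, using an assumed value of \alpha_t and a toy all-zero tensor (not values from the article) so the injected noise is easy to see:

```python
import torch

# One forward-diffusion step: x_t = sqrt(alpha_t) * x_{t-1} + sqrt(1 - alpha_t) * eps
alpha_t = torch.tensor(0.98)   # assumed noise-retention coefficient for this step
x_prev = torch.zeros(4, 4)     # toy "image" of zeros for clarity
eps = torch.randn(4, 4)        # Gaussian noise

x_t = torch.sqrt(alpha_t) * x_prev + torch.sqrt(1 - alpha_t) * eps
print(x_t.shape)  # torch.Size([4, 4])
```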
2. Reverse Diffusion Process
The reverse process aims to reconstruct the original data by denoising the noisy data in a series of steps reversing the forward diffusion.

This is typically modelled using a neural network that predicts the noise added at each step:
p_\theta(x_{t-1} | x_t) = \mathcal{N}(x_{t-1}; \mu_\theta(x_t, t), \sigma_\theta(x_t, t))
where,
- \mu_\theta and \sigma_\theta are the mean and variance produced by the learned network with parameters \theta.
Working Principle of Diffusion Models
During training the model learns to predict the noise added at each step of the forward process. This is done by minimizing a loss function that measures the difference between the predicted and actual noise.
Forward Process (Diffusion)
- The forward process gradually corrupts the data x_0 with Gaussian noise over a sequence of time steps. Let x_t represent the noisy data at time step t. The process is defined as:
x_t = \sqrt{1 - \beta_t} x_{t-1} + \sqrt{\beta_t} \epsilon
- where \beta_t is the noise schedule that controls the amount of noise added at each step and \epsilon \sim \mathcal{N}(0, I) is Gaussian noise.
- As t increases, x_t becomes noisier until it approximates a standard Gaussian distribution.
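Unrolling the one-step update gives a closed-form expression for x_t directly in terms of x_0, which is the form most implementations (including the code in this article) actually sample from:

```latex
x_t = \sqrt{\bar\alpha_t}\, x_0 + \sqrt{1 - \bar\alpha_t}\, \epsilon,
\qquad \bar\alpha_t = \prod_{s=1}^{t} (1 - \beta_s), \qquad \epsilon \sim \mathcal{N}(0, I)
```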
Reverse Process (Denoising)
- The reverse process aims to reconstruct the original data x_0 from the noisy data x_T at the final time step T.
- This process is modelled using a neural network that approximates the conditional probability p_\theta(x_{t-1} | x_t).
- The reverse process can be formulated as:
x_{t-1} = \frac{1}{\sqrt{1 - \beta_t}} \left( x_t - \frac{\beta_t}{\sqrt{1 - \bar\alpha_t}} \epsilon_\theta(x_t, t) \right)
- where \epsilon_\theta is a neural network parameterized by \theta that predicts the noise and \bar\alpha_t = \prod_{s=1}^{t} (1 - \beta_s) is the cumulative product of the noise-retention factors.
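As an illustration, a single denoising update might look like the sketch below. The network \epsilon_\theta is replaced by a hypothetical placeholder that returns zeros, and the schedule values are assumed, not trained:

```python
import torch

beta_t = torch.tensor(0.02)       # assumed noise-schedule value at this step
alpha_bar_t = torch.tensor(0.5)   # assumed cumulative product of (1 - beta) up to t

def epsilon_theta(x, t):
    # placeholder standing in for a trained noise-prediction network
    return torch.zeros_like(x)

x_t = torch.randn(1, 1, 28, 28)   # current noisy sample
x_prev = (1 / torch.sqrt(1 - beta_t)) * (
    x_t - (beta_t / torch.sqrt(1 - alpha_bar_t)) * epsilon_theta(x_t, 100)
)
print(x_prev.shape)  # torch.Size([1, 1, 28, 28])
```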
Training Diffusion Models
- The training objective minimizes the difference between the true noise \epsilon added in the forward process and the noise predicted by the neural network \epsilon_\theta.
- The score function, which estimates the gradient of the log data density, plays an important role in guiding the reverse process.
- The loss function is typically the mean squared error (MSE) between these two quantities:
L(\theta) = \mathbb{E}_{x_0, \epsilon, t} \left[ \| \epsilon - \epsilon_\theta(x_t, t) \|^2 \right]
- This encourages the model to accurately predict the noise and, consequently, to denoise effectively during the reverse process.
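A toy computation of this loss, with a made-up "prediction" that equals the true noise plus a small error (so the MSE is just the variance of that error):

```python
import torch
import torch.nn.functional as F

true_noise = torch.randn(8, 1, 28, 28)
pred_noise = true_noise + 0.1 * torch.randn_like(true_noise)  # imperfect prediction

loss = F.mse_loss(pred_noise, true_noise)
print(round(loss.item(), 3))  # close to 0.01, the variance of the added error
```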
Implementation
Step 1: Import Necessary Libraries
import torch
import torch.nn as nn
import torch.nn.functional as F
import torchvision
from torchvision import transforms
from torchvision.datasets import MNIST
from torch.utils.data import DataLoader
import matplotlib.pyplot as plt
import numpy as np
Step 2: Beta Schedule and Noise Schedule
Defines how much noise is added at each time step. The schedule increases linearly over time, starting from a small value and ending at a larger one.
def linear_beta_schedule(timesteps):
    beta_start = 0.0001
    beta_end = 0.02
    return torch.linspace(beta_start, beta_end, timesteps)

T = 200
betas = linear_beta_schedule(T)
alphas = 1. - betas
alphas_cumprod = torch.cumprod(alphas, dim=0)
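A quick sanity check (not part of the original code) shows how this schedule behaves: the cumulative product \bar\alpha_t starts near 1 and decays monotonically, so late timesteps retain almost no signal:

```python
import torch

betas = torch.linspace(0.0001, 0.02, 200)   # same linear schedule as above
alphas_cumprod = torch.cumprod(1. - betas, dim=0)

print(alphas_cumprod[0].item())    # ~0.9999: almost no noise at the first step
print(alphas_cumprod[-1].item())   # ~0.13: mostly noise by the last step
```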
Step 3: Forward Diffusion Process
Gradually adds Gaussian noise to the original image across many steps. The image becomes more and more noisy with each step.
def forward_diffusion_sample(x_0, t, noise=None):
    """
    Add noise to the image x_0 at timestep t using the closed-form forward process.
    """
    if noise is None:
        noise = torch.randn_like(x_0)
    # move the precomputed schedule to the same device as the data before indexing,
    # so this works when x_0 and t live on the GPU
    ac = alphas_cumprod.to(x_0.device)[t]
    sqrt_alphas_cumprod = torch.sqrt(ac)[:, None, None, None]
    sqrt_one_minus_alphas_cumprod = torch.sqrt(1 - ac)[:, None, None, None]
    return sqrt_alphas_cumprod * x_0 + sqrt_one_minus_alphas_cumprod * noise, noise
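A self-contained check of this closed-form noising (the schedule is redefined locally so the snippet runs on its own): at t = 0 the output is essentially the clean image, while at t = T - 1 it is close to pure unit-variance noise.

```python
import torch

T = 200
betas = torch.linspace(0.0001, 0.02, T)
alphas_cumprod = torch.cumprod(1. - betas, dim=0)

def forward_diffusion_sample(x_0, t, noise=None):
    if noise is None:
        noise = torch.randn_like(x_0)
    ac = alphas_cumprod.to(x_0.device)[t]
    sqrt_ac = torch.sqrt(ac)[:, None, None, None]
    sqrt_one_minus_ac = torch.sqrt(1 - ac)[:, None, None, None]
    return sqrt_ac * x_0 + sqrt_one_minus_ac * noise, noise

x_0 = torch.zeros(2, 1, 28, 28)   # toy all-zero "images"
t = torch.tensor([0, T - 1])      # earliest and latest timesteps
x_noisy, noise = forward_diffusion_sample(x_0, t)

print(x_noisy[0].abs().max().item())  # tiny: barely any noise mixed in at t = 0
print(x_noisy[1].std().item())        # near 1: almost pure noise at t = T - 1
```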
Step 4: Neural Network (U-Net or Simple CNN)
A simple convolutional neural network that takes the noisy image and the time step and learns to predict the noise that was added.
class SimpleModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(1, 32, 3, padding=1), nn.ReLU(),
            nn.Conv2d(32, 32, 3, padding=1), nn.ReLU(),
            nn.Conv2d(32, 1, 3, padding=1),
        )

    def forward(self, x, t):
        # Note: this simplified model ignores the timestep t;
        # a real U-Net would condition on t via a timestep embedding.
        return self.net(x)
Step 5: Training Loop
The model is trained using many noisy images to minimize the difference between the actual and predicted noise (mean squared error).
def get_data():
    transform = transforms.Compose([
        transforms.ToTensor(),
        lambda x: x * 2 - 1  # scale pixels from [0, 1] to [-1, 1]
    ])
    dataset = MNIST(root="./data", train=True, download=True, transform=transform)
    dataloader = DataLoader(dataset, batch_size=128, shuffle=True)
    return dataloader

def train(model, dataloader, optimizer, epochs=5):
    for epoch in range(epochs):
        for step, (x, _) in enumerate(dataloader):
            x = x.to(device)
            # sample a random timestep for each image in the batch
            t = torch.randint(0, T, (x.shape[0],), device=device).long()
            x_noisy, noise = forward_diffusion_sample(x, t)
            noise_pred = model(x_noisy, t)
            loss = F.mse_loss(noise_pred, noise)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
        print(f"Epoch {epoch}: Loss {loss.item():.4f}")
Step 6: Sampling (Reverse Process)
Starts from pure noise and uses the model to remove noise step by step generating a clean image at the end.
@torch.no_grad()
def sample(model, image_size, num_samples):
    model.eval()
    x = torch.randn((num_samples, 1, image_size, image_size), device=device)
    for t in reversed(range(T)):
        t_tensor = torch.full((num_samples,), t, device=device, dtype=torch.long)
        pred_noise = model(x, t_tensor)
        alpha = alphas[t]
        alpha_bar = alphas_cumprod[t]
        beta = betas[t]
        # add fresh noise at every step except the final one
        if t > 0:
            noise = torch.randn_like(x)
        else:
            noise = torch.zeros_like(x)
        x = (1 / torch.sqrt(alpha)) * (
            x - (beta / torch.sqrt(1 - alpha_bar)) * pred_noise
        ) + torch.sqrt(beta) * noise
    return x
Step 7: Running the Model
This sets up the device, initializes the model and optimizer, loads the data and starts training. After training, it generates 16 sample images from noise and displays them in a 4×4 grid.
device = "cuda" if torch.cuda.is_available() else "cpu"
model = SimpleModel().to(device)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
dataloader = get_data()
train(model, dataloader, optimizer)
samples = sample(model, 28, 16)
grid = torchvision.utils.make_grid(samples.cpu(), nrow=4, normalize=True)
plt.imshow(grid.permute(1, 2, 0))
plt.axis("off")
plt.show()
Output: a 4×4 grid of generated MNIST-style digit samples.
Applications
- Image Generation: Diffusion models are widely used to generate realistic images from random noise. This application is especially popular in fields like art, gaming, advertising and graphic design where high quality visuals are essential.
- Image Editing and Inpainting: They enable advanced editing by filling in missing or damaged parts of an image. This is useful in photo restoration, object removal or editing specific regions without affecting the whole image.
- Text to Image Generation: By converting written prompts into images, diffusion models allow creators to bring their ideas to life visually. This is used in storytelling, concept design, marketing and more.
- Super Resolution: Diffusion models can improve the quality of low resolution images by enhancing details. This application benefits medical imaging, satellite photos and surveillance footage.
Advantages
- Flexibility: They can model complex data distributions without requiring explicit likelihood estimation.
- High Quality Generation: Diffusion models generate high-quality samples, often surpassing other generative models such as GANs.
- Stable Training: Unlike GANs, diffusion models avoid issues like mode collapse and unstable training dynamics.
- Theoretical Foundations: Based on well understood principles from stochastic processes and statistical mechanics.
Disadvantages
- Slow Sampling: Generating samples can be slow because of the many steps needed for the reverse diffusion process.
- Complexity: The architecture and training process can be complex making them challenging to implement and understand.
- Memory Usage: High memory consumption during training due to the need to store multiple intermediate steps.
- Fine Tuning: Requires careful tuning of noise schedules and other hyperparameters to achieve optimal performance.