How Prompt Tuning Works
Prompt tuning is a technique for adapting pre-trained language models to downstream tasks without modifying the entire model. Instead of fine-tuning all of the model's parameters, prompt tuning optimizes a small set of learnable tokens. In this article we will look at how it works, the mathematics behind it and a simple implementation.
How Does Prompt Tuning Work?
Let's understand it step by step:
Step 1: Pre-trained Language Model
The foundation of prompt tuning is a pre-trained language model. These models are trained on vast amounts of text data and encode general linguistic knowledge. Examples include GPT-3, BERT and T5.
Step 2: Soft Prompts
- Instead of directly feeding raw input text into the model, prompt tuning introduces a set of learnable embeddings called soft prompts. These embeddings are initialized randomly and are optimized during training to guide the model toward the desired task.
- For example, if the task is sentiment classification, the soft prompt might encode information about the sentiment labels like positive, negative, neutral.
- The rest of the model remains frozen, preserving the general knowledge it acquired during pre-training (see the sketch below).
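A minimal sketch of this setup, using TensorFlow and made-up shapes (the tiny network is a stand-in for a real pre-trained LLM):
import tensorflow as tf
# A tiny stand-in for a pre-trained model; in practice this is a real LLM.
frozen_model = tf.keras.Sequential([
    tf.keras.layers.Dense(8, activation='relu'),
    tf.keras.layers.Dense(2),
])
frozen_model.trainable = False  # all pre-trained weights stay fixed
# Soft prompt: a small matrix of learnable embeddings (3 tokens of dim 4 here),
# randomly initialized; these are the only parameters that will be trained.
num_prompt_tokens, embed_dim = 3, 4
soft_prompt = tf.Variable(tf.random.normal([num_prompt_tokens, embed_dim]), trainable=True)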
Step 3: Concatenation with Input
The soft prompts are concatenated with the actual input text before being passed to the model. This creates a composite input sequence where the soft prompts serve as a task-specific prefix. For example:
- Soft Prompt: [P1, P2, P3]
- Input Text: "This movie was fantastic!"
- Composite Input: [P1, P2, P3, "This", "movie", "was", "fantastic", "!"]
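In code, this concatenation happens in embedding space rather than on raw strings. A minimal sketch, with made-up token IDs, vocabulary size and dimensions:
import tensorflow as tf
vocab_size, embed_dim = 1000, 4
embedding = tf.keras.layers.Embedding(vocab_size, embed_dim)  # frozen in a real setup
token_ids = tf.constant([[12, 47, 9, 305, 8]])  # stands in for "This movie was fantastic !"
token_embeds = embedding(token_ids)             # shape: (1, 5, embed_dim)
soft_prompt = tf.Variable(tf.random.normal([1, 3, embed_dim]))  # [P1, P2, P3]
composite = tf.concat([soft_prompt, token_embeds], axis=1)      # shape: (1, 8, embed_dim)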
Step 4: Optimization
During training, the model's output is compared to the ground truth using a loss function such as cross-entropy for classification tasks. Gradients are then backpropagated only through the soft prompts, leaving the rest of the model's parameters unchanged.
Step 5: Inference
Once the soft prompts are optimized, they can be reused for inference on new inputs for the same task. The frozen model generates predictions based on the learned soft prompts, effectively adapting to the task without requiring full fine-tuning.
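Because the task-specific knowledge lives entirely in the prompt, deployment can be as simple as saving and reloading that small tensor. A sketch continuing the Step 3 example above (the file name is arbitrary):
import numpy as np
np.save("sentiment_prompt.npy", soft_prompt.numpy())  # persist only the tiny learned prompt
# Later, at inference time: reload the prompt and prepend it to any new input.
loaded_prompt = tf.constant(np.load("sentiment_prompt.npy"))
new_embeds = embedding(tf.constant([[77, 301, 4]]))  # embeddings of a new sentence
prediction_input = tf.concat([loaded_prompt, new_embeds], axis=1)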
Mathematical Explanation of Prompt Tuning
To better understand prompt tuning, let's break it down mathematically.
1. Pre-trained Language Model
Consider a pre-trained language model f(x; θ), where:
- x is the input text (as a sequence of token embeddings).
- θ represents the fixed parameters (weights and biases) of the large language model.
- f(x; θ) outputs a probability distribution over the next word or generates a continuation of the input text.
To understand the benefits of prompt tuning, it helps to compare it with fine-tuning: fine-tuning updates the parameters θ themselves, while prompt tuning keeps θ frozen and learns only the small prompt p introduced below.
2. Learnable Prompts
Instead of directly feeding the input text x into the model, we prepend a learnable prompt p = [p1, p2, ..., pk]. These embeddings are initialized randomly and optimized during training to guide the model toward the desired task. The final input x′ to the model becomes:
x′ = [p; x]
Here [p; x] denotes concatenation, k is the number of prompt tokens and each pi has the same dimensionality as the model's token embeddings.
3. Model Operation
The LLM processes the concatenated input x′ to produce an output:
ŷ = f(x′; θ) = f([p; x]; θ)
Note that θ stays fixed throughout; only p will be updated.
4. Loss Function
The model's output ŷ is compared against the ground-truth label y using a loss function, typically cross-entropy for classification:
L(p) = −Σ y log ŷ (summed over the output classes)
For example, if y = 1 (positive sentiment) and the model assigns probability ŷ = 0.9 to the positive class, the loss is −log(0.9) ≈ 0.105; a less confident prediction such as ŷ = 0.5 gives a larger loss of −log(0.5) ≈ 0.693.
5. Gradient Descent and Updating the Prompt
The optimization process uses gradient descent to adjust the learnable prompt p. We update the prompt embeddings based on the gradient of the loss function with respect to p:
p ← p − η ∇pL
Where:
- η is the learning rate (a small step size that controls how much we adjust the prompt).
- ∇pL is the gradient of the loss with respect to the prompt p.
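As a quick worked example with made-up numbers: suppose a one-vector prompt p = [0.10, 0.50], gradient ∇pL = [0.20, −0.40] and learning rate η = 0.1. The update gives p ← [0.10 − 0.1·0.20, 0.50 − 0.1·(−0.40)] = [0.08, 0.54], nudging the prompt in the direction that lowers the loss.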
6. Convergence
After enough iterations, the prompt p converges to a set of embeddings that guide the model to classify the sentiment of the input text more accurately.
Implementation of Prompt Tuning
Let's use a simple Python example where we optimize a learnable prompt p to guide a model for sentiment classification. The goal is to classify whether a sentence has a positive or negative sentiment.
Step 1: Import Necessary Libraries
First, we import the necessary libraries: numpy, tensorflow and matplotlib.
import numpy as np
import tensorflow as tf
from tensorflow.keras import layers, models
import matplotlib.pyplot as plt
Step 2: Define the Simple Model
Next, we define a simple neural network model using TensorFlow's Keras API. This model will mimic the behavior of a large language model (LLM) for our example.
- The model has two layers: one hidden layer with ReLU activation and an output layer that predicts the probability of two classes (positive or negative).
- For simplicity, this is a small neural network, but it represents the core idea of how an LLM processes inputs.
class SimpleModel(tf.keras.Model):
    def __init__(self, input_size, hidden_size):
        super(SimpleModel, self).__init__()
        self.fc1 = layers.Dense(hidden_size, activation='relu', input_shape=(input_size,))
        self.fc2 = layers.Dense(2)  # Binary classification: positive or negative

    def call(self, x):
        x = self.fc1(x)
        x = self.fc2(x)
        return x
Step 3: Prepare Input Data and Learnable Prompt
Now, we prepare the input data and define the learnable prompt (p). The prompt will be optimized during training to guide the model.
- sentence_embedding: Represents the input text as numerical vectors (embeddings).
- prompt_embedding: A set of learnable tokens that will be adjusted during training.
- target: The true label for the input text.
# Input data (embedding for the sentence "The food is delicious")
sentence_embedding = tf.constant([[0.2, 0.8], [0.5, 0.4], [0.9, 0.1], [0.6, 0.7]], dtype=tf.float32)
# Learnable prompt embeddings (p), we will optimize this
prompt_embedding = tf.Variable([[0.1, 0.5], [-0.4, 0.9]], dtype=tf.float32, trainable=True)
# Target label (1 for positive sentiment, 0 for negative)
target = tf.constant([1], dtype=tf.int32) # Positive sentiment
Step 4: Set Up the Model and Training Components
We initialize the model, define the loss function and set up the optimizer for training.
- SparseCategoricalCrossentropy: Measures how well the model’s predictions match the true labels.
- Adam optimizer: Updates the learnable prompt (p) based on gradients.
# Model parameters
input_size = 2 # Embedding size
hidden_size = 4
# Initialize the model
model = SimpleModel(input_size, hidden_size)
# Define the loss function
loss_function = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)
# Define the optimizer
optimizer = tf.keras.optimizers.Adam(learning_rate=0.01)
# Store the loss values for plotting
loss_values = []
Step 5: Train the Model
We train the model by iteratively adjusting the learnable prompt (p) to minimize the loss.
- Concatenation: The learnable prompt (p) is prepended to the input text embeddings.
- Forward Pass: The model processes the combined input and produces an output.
- Loss Calculation: The difference between the model's prediction and the true label is computed.
- Gradient Descent: The prompt is updated to reduce the loss.
# Training loop to optimize the prompt
for epoch in range(5000):  # Run for 5000 epochs
    with tf.GradientTape() as tape:
        # Concatenate prompt with sentence embedding to form the model input
        model_input = tf.concat([prompt_embedding, sentence_embedding], axis=0)
        # Forward pass: simplified by averaging the input embeddings
        output = model(tf.reduce_mean(model_input, axis=0, keepdims=True))
        # Compute the loss
        loss = loss_function(target, output)
    # Backward pass and optimization
    gradients = tape.gradient(loss, [prompt_embedding])  # Compute gradients only for the prompt
    optimizer.apply_gradients(zip(gradients, [prompt_embedding]))  # Update the prompt embeddings
    # Store the loss value
    loss_values.append(loss.numpy())
    # Print progress every 10 epochs
    if epoch % 10 == 0:
        print(f'Epoch {epoch}, Loss: {loss.numpy()}, Prompt: {prompt_embedding.numpy()}')
Step 6: Visualize the Results
After training, we plot the loss values to see how the model improved over time and print the optimized prompt.
# Plot the loss values
plt.plot(loss_values)
plt.title('Loss Function Over Epochs')
plt.xlabel('Epoch')
plt.ylabel('Loss')
plt.grid(True)
plt.show()
# Print the optimized prompt
print("Optimized Prompt:", prompt_embedding.numpy())
Output: a plot of the training loss decreasing over the epochs, followed by the optimized prompt embeddings printed to the console.
By following these steps you can implement prompt tuning in Python and adapt a pre-trained model to a specific task like sentiment analysis. This approach is lightweight, efficient and preserves the general knowledge of the original model.
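For real LLMs you would not hand-roll this loop; libraries such as Hugging Face PEFT ship a prompt-tuning implementation. A minimal sketch, assuming the transformers and peft packages are installed (argument names may vary between versions, so check the current peft documentation):
from transformers import AutoModelForCausalLM
from peft import PromptTuningConfig, TaskType, get_peft_model
base_model = AutoModelForCausalLM.from_pretrained("gpt2")  # any causal LM
config = PromptTuningConfig(
    task_type=TaskType.CAUSAL_LM,
    num_virtual_tokens=8,  # number of soft-prompt tokens to learn
)
model = get_peft_model(base_model, config)
model.print_trainable_parameters()  # only the prompt embeddings are trainable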
Advantages of Prompt Tuning
Prompt tuning offers several key benefits over traditional fine-tuning:
- Parameter Efficiency: Unlike full fine-tuning, which updates all parameters of the model, prompt tuning modifies only a small subset of parameters. This drastically reduces memory and computational requirements (see the quick calculation after this list).
- Task-Specific Adaptation : Soft prompts can be tailored to specific tasks, enabling the same pre-trained model to handle multiple tasks simultaneously without interference.
- Scalability: Prompt tuning scales well with larger models. As models grow in size, the relative overhead of managing soft prompts remains minimal.
- Preservation of General Knowledge: By keeping the majority of the model frozen, prompt tuning ensures that the general knowledge acquired during pre-training is preserved, reducing the risk of catastrophic forgetting.
- Faster Deployment : Since only the soft prompts need to be stored and distributed, prompt tuning simplifies the deployment of LLMs across different tasks and environments.
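To make the parameter-efficiency point concrete, here is a quick back-of-the-envelope calculation; the model size, prompt length and embedding dimension are illustrative assumptions:
model_params = 11_000_000_000        # e.g. an 11B-parameter LLM (assumed)
prompt_tokens, embed_dim = 20, 4096  # assumed soft-prompt length and embedding size
prompt_params = prompt_tokens * embed_dim
print(prompt_params)                 # 81920 trainable values
print(prompt_params / model_params)  # ~7.4e-06, a tiny fraction of the model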
Limitations of Prompt Tuning
While prompt tuning offers several advantages, it is not without limitations:
- Task Complexity: It may struggle with highly complex tasks that require extensive modifications to the model's behavior. In such cases, full fine-tuning might still be necessary.
- Initialization Sensitivity: The performance of prompt tuning can be sensitive to the initialization of the soft prompts. Poor initialization may lead to suboptimal results.
- Limited Interpretability: Unlike discrete textual prompts, soft prompts are not human-readable, making it difficult to interpret what the model has learned.
As NLP models continue to grow in size and complexity, techniques like prompt tuning play an important role in making these models accessible and practical for real-world applications.