👁️ Vision Fine-tuning
Learn how to fine-tune vision/multimodal LLMs with Unsloth
Disabling Vision / Text-only fine-tuning
To fine-tune only the language layers (text-only), set `finetune_vision_layers = False` when calling `get_peft_model`:
```python
model = FastVisionModel.get_peft_model(
    model,
    finetune_vision_layers     = True,  # False if not finetuning vision layers
    finetune_language_layers   = True,  # False if not finetuning language layers
    finetune_attention_modules = True,  # False if not finetuning attention layers
    finetune_mlp_modules       = True,  # False if not finetuning MLP layers

    r = 16,           # The larger, the higher the accuracy, but might overfit
    lora_alpha = 16,  # Recommended alpha == r at least
    lora_dropout = 0,
    bias = "none",
    random_state = 3407,
    use_rslora = False,   # We support rank stabilized LoRA
    loftq_config = None,  # And LoftQ
    target_modules = "all-linear",  # Optional now! Can specify a list if needed
    modules_to_save = [
        "lm_head",
        "embed_tokens",
    ],
)
```
Vision Data Collator
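A minimal sketch of wiring the vision collator into a training loop, following the pattern used in Unsloth's public notebooks (the `UnslothVisionDataCollator` import path and the TRL `SFTConfig` parameter names are assumptions; check your installed Unsloth and TRL versions):

```python
from unsloth.trainer import UnslothVisionDataCollator
from trl import SFTTrainer, SFTConfig

# The collator handles image preprocessing and padding for vision models,
# so the dataset can stay in the raw {"messages": [...]} conversation format.
trainer = SFTTrainer(
    model = model,
    tokenizer = tokenizer,
    data_collator = UnslothVisionDataCollator(model, tokenizer),
    train_dataset = converted_dataset,  # conversation-format dataset
    args = SFTConfig(
        per_device_train_batch_size = 2,
        gradient_accumulation_steps = 4,
        max_steps = 30,
        learning_rate = 2e-4,
        output_dir = "outputs",
        # Settings typically required for vision fine-tuning,
        # since the collator prepares the batches itself:
        remove_unused_columns = False,
        dataset_text_field = "",
        dataset_kwargs = {"skip_prepare_dataset": True},
        max_seq_length = 2048,
    ),
)
```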
Vision Fine-tuning Dataset
A vision fine-tuning dataset typically pairs each image with text, for example a table with `Image` and `Caption` columns.
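Such image/caption pairs must be converted into the multimodal conversation format before training. A minimal sketch, assuming one image and one caption per sample (the `convert_to_conversation` helper name, column names, and instruction text are illustrative):

```python
instruction = "Describe this image."  # illustrative prompt

def convert_to_conversation(sample):
    # Each message's content is a list of parts; images and text are
    # separate parts distinguished by a "type" field.
    conversation = [
        {"role": "user", "content": [
            {"type": "text",  "text": instruction},
            {"type": "image", "image": sample["Image"]},
        ]},
        {"role": "assistant", "content": [
            {"type": "text", "text": sample["Caption"]},
        ]},
    ]
    return {"messages": conversation}

# Apply to every sample, e.g.:
# converted_dataset = [convert_to_conversation(s) for s in dataset]
```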
Multi-image training
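To train on several images per example, include multiple image parts in a single user turn. A sketch using the same conversation format as above (the helper name is illustrative):

```python
def multi_image_message(images, question):
    # One user turn containing several images followed by the question text;
    # each image becomes its own content part.
    content = [{"type": "image", "image": img} for img in images]
    content.append({"type": "text", "text": question})
    return {"role": "user", "content": content}
```

Keep in mind that every extra image adds image tokens to the sequence, so multi-image examples may need a larger `max_seq_length`.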
🔎 Training on assistant responses only for vision models (VLMs)
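Unsloth's `train_on_responses_only` masks the loss on everything except the assistant's reply, so the model is not trained to reproduce the user prompt. A sketch applied to an existing trainer; the chat-template marker strings below are Llama-3-style assumptions and must match your model's actual template:

```python
from unsloth.chat_templates import train_on_responses_only

# instruction_part / response_part must exactly match the markers your
# model's chat template emits around user and assistant turns.
trainer = train_on_responses_only(
    trainer,
    instruction_part = "<|start_header_id|>user<|end_header_id|>\n\n",
    response_part = "<|start_header_id|>assistant<|end_header_id|>\n\n",
)
```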