🔎 Fine-tuning Embedding Models with Unsloth Guide

Learn how to easily fine-tune embedding models with Unsloth.

Fine-tuning embedding models can significantly improve retrieval and RAG performance on specific tasks. It aligns the model's vectors with your domain and the kind of 'similarity' that matters for your use case, which improves search, RAG, clustering, and recommendations on your data.

Example: The headlines “Google launches Pixel 10” and “Qwen releases Qwen3” should be embedded as similar if you’re labeling topics (both are 'Tech'), but as dissimilar if you’re doing semantic search, because they’re about different things. Fine-tuning teaches the model the 'right' kind of similarity for your use case, reducing errors and improving results.
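
To make this concrete, here is a minimal sketch of how that similarity is scored, using the stock all-MiniLM-L6-v2 model (the exact score is illustrative, not from this guide):

```python
# Minimal sketch with a stock model; the resulting score is illustrative.
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")
embeddings = model.encode([
    "Google launches Pixel 10",
    "Qwen releases Qwen3",
])

# Cosine similarity between the two headlines. A model tuned for topic
# labeling should score this pair high (both are tech news); one tuned
# for semantic search should score it low (different products/companies).
print(model.similarity(embeddings[0], embeddings[1]))
```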

Unsloth now supports training embedding, classifier, BERT, and reranker models ~1.8-3.3x faster with 20% less memory and 2x longer context than other Flash Attention 2 implementations, with no accuracy degradation. EmbeddingGemma-300M works on just 3GB VRAM. You can use your trained model anywhere: transformers, LangChain, Ollama, vLLM, llama.cpp etc.

Unsloth uses SentenceTransformers to support compatible models like Qwen3-Embedding, BERT, and more. Even if a model has no dedicated notebook or upload, it’s still supported.

We created free fine-tuning notebooks covering 3 main use-cases:

  • All-MiniLM-L6-v2: produce compact, domain-specific sentence embeddings for semantic search, retrieval, and clustering, tuned on your own data.

  • tomaarsen/miriad-4.4M-split: embed medical questions and biomedical papers for high-quality medical semantic search and RAG.

  • electroglyph/technical: better capture meaning and semantic similarity in technical text (docs, specs, and engineering discussions).

You can view the rest of our uploaded models in our collection here.

A huge thanks to Unsloth contributor electroglyph, whose work was instrumental in supporting this. You can check out electroglyph’s custom models on Hugging Face here.

🦥 Unsloth Features

  • LoRA/QLoRA or full fine-tuning for embeddings, without needing to rewrite your pipeline

  • Best support for encoder-only SentenceTransformer models (with a modules.json)

  • Cross-encoder models are confirmed to train properly even under the fallback path

  • This release also supports transformers v5

There is limited support for models without modules.json (we’ll auto-assign default SentenceTransformers pooling modules). If you’re doing something custom (custom heads, nonstandard pooling), double-check outputs such as the pooled embedding behavior, as in the sanity check below.
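
One rough way to verify the pooled output is to compare it against manual mean pooling of the token embeddings; the checkpoint below is only an example:

```python
# Sanity check: does the pipeline's pooled embedding match plain mean
# pooling? The checkpoint is an example; use the model you are training.
import torch
from sentence_transformers import SentenceTransformer
from transformers import AutoModel, AutoTokenizer

name = "sentence-transformers/all-MiniLM-L6-v2"  # example checkpoint
st_model = SentenceTransformer(name)
tokenizer = AutoTokenizer.from_pretrained(name)
encoder = AutoModel.from_pretrained(name)

text = ["gradient checkpointing for encoders"]
st_emb = torch.tensor(st_model.encode(text))

inputs = tokenizer(text, return_tensors="pt")
with torch.no_grad():
    token_emb = encoder(**inputs).last_hidden_state
mask = inputs["attention_mask"].unsqueeze(-1)
manual = (token_emb * mask).sum(1) / mask.sum(1)  # manual mean pooling

# Cosine ~1.0 means the assigned pooling is plain mean pooling
# (cosine is unaffected by any final L2-normalization module).
print(torch.nn.functional.cosine_similarity(st_emb, manual))
```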

Some models, such as MPNet and DistilBERT, needed custom additions; these were enabled by patching gradient checkpointing into their transformers implementations.

🛠️ Fine-tuning Workflow

The new fine-tuning flow is centered around FastSentenceTransformer.
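
Here is a minimal end-to-end sketch of that flow. The FastSentenceTransformer.from_pretrained arguments are assumed to mirror Unsloth's other Fast* classes, and the dataset id is a placeholder, so check the official notebooks for exact signatures:

```python
# Hedged sketch: argument names follow Unsloth's other Fast* classes and
# may differ slightly; "your/pairs-dataset" is a placeholder dataset id.
from datasets import load_dataset
from sentence_transformers import SentenceTransformerTrainer
from sentence_transformers.losses import MultipleNegativesRankingLoss
from unsloth import FastSentenceTransformer

model = FastSentenceTransformer.from_pretrained(
    "sentence-transformers/all-MiniLM-L6-v2",  # any supported encoder
    max_seq_length=512,
    load_in_4bit=True,  # QLoRA; set False for 16-bit LoRA
)

# Standard SentenceTransformers v3 training loop: (anchor, positive)
# pairs with an in-batch negatives loss.
train_dataset = load_dataset("your/pairs-dataset", split="train")
loss = MultipleNegativesRankingLoss(model)
trainer = SentenceTransformerTrainer(
    model=model,
    train_dataset=train_dataset,
    loss=loss,
)
trainer.train()
```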

Main save/push methods (usage sketch after this list):

  • save_pretrained(): saves LoRA adapters to a local folder

  • save_pretrained_merged(): saves the merged model to a local folder

  • push_to_hub(): pushes LoRA adapters to Hugging Face

  • push_to_hub_merged(): pushes the merged model to Hugging Face
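
A quick usage sketch of the four methods (folder names and repo ids are placeholders):

```python
# Folder names and Hugging Face repo ids below are placeholders.
model.save_pretrained("my-embedding-lora")            # LoRA adapters, local
model.save_pretrained_merged("my-embedding-merged")   # merged model, local

model.push_to_hub("your-username/my-embedding-lora")           # adapters to HF
model.push_to_hub_merged("your-username/my-embedding-merged")  # merged to HF
```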

And one very important detail: Inference loading requires for_inference=True

from_pretrained() is similar to Unsloth’s other fast classes, with one exception:

  • To load a model for inference using FastSentenceTransformer, you must pass: for_inference=True

So loading for inference should look like this:
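
```python
# Repo id is a placeholder; a local folder path also works.
from unsloth import FastSentenceTransformer

model = FastSentenceTransformer.from_pretrained(
    "your-username/my-embedding-merged",
    for_inference=True,  # required when loading for inference
)
embeddings = model.encode(["a query to embed"])
```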

For Hugging Face authorization, if you run:
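
```bash
# Standard Hugging Face CLI login (assumed; the original command was omitted)
huggingface-cli login
```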

inside the same virtualenv before calling the hub methods, then:

  • push_to_hub() and push_to_hub_merged() don’t require a token argument.

✅ Inference and Deploy Anywhere!

Your fine-tuned Unsloth model can be used and deployed with all major tools: transformers, sentence-transformers, LangChain, Weaviate, Text Embeddings Inference (TEI), vLLM, llama.cpp, custom embedding APIs, pgvector, FAISS and other vector databases, and any RAG framework.

There is no lock-in: the fine-tuned model can be downloaded locally to your own device at any time.
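
For example, the merged model loads with plain sentence-transformers, with no Unsloth needed at inference time (the repo id is a placeholder):

```python
# Loading the merged model with plain sentence-transformers.
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("your-username/my-embedding-merged")
docs = ["Unsloth speeds up embedding fine-tuning.", "Bananas are yellow."]
query_emb = model.encode("How do I fine-tune embeddings faster?")
doc_emb = model.encode(docs)
print(model.similarity(query_emb, doc_emb))  # rank docs by similarity
```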

📊 Unsloth Benchmarks

Unsloth's advantages include speed for embedding fine-tuning: it is consistently 1.8 to 3.3x faster across a wide variety of embedding models and sequence lengths from 128 to 2048 and beyond.

EmbeddingGemma-300M QLoRA works on just 3GB VRAM and LoRA works on 6GB VRAM.

Below are our Unsloth benchmarks in a heatmap vs. SentenceTransformers + Flash Attention 2 (FA2) for 4bit QLoRA. For 4bit QLoRA, Unsloth is 1.8x to 2.6x faster:

Below are our Unsloth benchmarks in a heatmap vs. SentenceTransformers + Flash Attention 2 (FA2) for 16bit LoRA. For 16bit LoRA, Unsloth is 1.2x to 3.3x faster:

🔮 Model Support

Unsloth supports popular embedding models such as EmbeddingGemma, Qwen3-Embedding, BERT, MPNet, DistilBERT, and all-MiniLM (not all supported models are listed here).

Most common models are already supported. If there’s an encoder-only model you’d like that isn’t, feel free to open a GitHub issue requesting it.
