🔎 Fine-tuning Embedding Models with Unsloth Guide
Learn how to easily fine-tune embedding models with Unsloth.
Fine-tuning embedding models can significantly improve retrieval and RAG performance on specific tasks. It aligns the model's vectors with your domain and with the kind of 'similarity' that matters for your use case, which improves search, RAG, clustering, and recommendations on your data.
Example: The headlines “Google launches Pixel 10” and “Qwen releases Qwen3” might be embedded as similar if you’re just labeling both as 'Tech,' but not similar if you’re doing semantic search, because they’re about different things. Fine-tuning helps the model capture the 'right' kind of similarity for your use case, reducing errors and improving results.
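As a quick illustration with the plain sentence-transformers library (the model name here is just an example base model), you can check what similarity an untuned model assigns to those two headlines:

```python
from sentence_transformers import SentenceTransformer

# Example base model; swap in whichever embedding model you plan to fine-tune.
model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")

headlines = ["Google launches Pixel 10", "Qwen releases Qwen3"]
embeddings = model.encode(headlines)

# Cosine similarity between the two headline embeddings.
score = model.similarity(embeddings[0], embeddings[1])
print(score)  # Check whether this matches the notion of similarity your task needs
```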
Unsloth now supports training embedding, classifier, BERT, and reranker models ~1.8-3.3x faster with 20% less memory and 2x longer context than other Flash Attention 2 implementations, with no accuracy degradation. EmbeddingGemma-300M works on just 3GB VRAM. You can use your trained model anywhere: transformers, LangChain, Ollama, vLLM, llama.cpp, etc.
Unsloth uses SentenceTransformers to support compatible models like Qwen3-Embedding, BERT, and more. Even if a model has no dedicated notebook or Unsloth upload, it is still supported.
We created free fine-tuning notebooks, with 3 main use-cases:
- All-MiniLM-L6-v2: produce compact, domain-specific sentence embeddings for semantic search, retrieval, and clustering, tuned on your own data.
- tomaarsen/miriad-4.4M-split: embed medical questions and biomedical papers for high-quality medical semantic search and RAG.
- electroglyph/technical: better capture meaning and semantic similarity in technical text (docs, specs, and engineering discussions).
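As a rough sketch of what the data for the medical use-case looks like, you could peek at the MIRIAD dataset with the datasets library (the exact column names are whatever the dataset card specifies; nothing here is Unsloth-specific):

```python
from datasets import load_dataset

# Medical question/passage pairs used in the MIRIAD use-case.
dataset = load_dataset("tomaarsen/miriad-4.4M-split", split="train")

print(dataset)     # Inspect the columns (e.g. a question and its source passage)
print(dataset[0])  # One (anchor, positive) style pair for contrastive fine-tuning
```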
You can view the rest of our uploaded models in our collection here.
A huge thanks to Unsloth contributor electroglyph, whose work was instrumental in supporting this. You can check out electroglyph’s custom models on Hugging Face here.
🦥 Unsloth Features
- LoRA/QLoRA or full fine-tuning for embeddings, without needing to rewrite your pipeline
- Best support for encoder-only SentenceTransformer models (with a modules.json)
- Cross-encoder models are confirmed to train properly even under the fallback path
- This release also supports transformers v5
There is limited support for models without a modules.json (we’ll auto-assign default SentenceTransformers pooling modules). If you’re doing something custom (custom heads, nonstandard pooling), double-check outputs such as the pooled embedding behavior, as in the sketch below.
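A quick way to eyeball which pooling modules were assembled, using plain sentence-transformers (the model name is just an example):

```python
from sentence_transformers import SentenceTransformer

# Example checkpoint; replace with the model you're fine-tuning.
model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")

# Print the module pipeline (Transformer -> Pooling -> Normalize, ...) so you can
# confirm which pooling was assigned, especially for models without a modules.json.
print(model)

emb = model.encode("A quick pooling sanity check.")
print(emb.shape)  # Should match the model's hidden size (e.g. 384 for MiniLM)
```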
Some models that needed custom additions, such as MPNet or DistilBERT, were enabled by patching gradient checkpointing into their transformers implementations.
🛠️ Fine-tuning Workflow
The new fine-tuning flow is centered around FastSentenceTransformer.
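A minimal training sketch, assuming the loaded model plugs into the standard SentenceTransformers training API (which Unsloth builds on); the dataset and hyperparameters here are placeholders, and the exact LoRA/QLoRA setup follows the official notebooks:

```python
from unsloth import FastSentenceTransformer
from sentence_transformers import SentenceTransformerTrainer, SentenceTransformerTrainingArguments
from sentence_transformers.losses import MultipleNegativesRankingLoss
from datasets import load_dataset

# Load the base embedding model. LoRA/QLoRA options are configured as in the
# official Unsloth notebooks; only the model name is shown here.
model = FastSentenceTransformer.from_pretrained("sentence-transformers/all-MiniLM-L6-v2")

# Any (anchor, positive) pair dataset works with MultipleNegativesRankingLoss;
# this public NLI pair set is just an illustrative stand-in for your own data.
train_dataset = load_dataset("sentence-transformers/all-nli", "pair", split="train[:10000]")

args = SentenceTransformerTrainingArguments(
    output_dir="outputs",
    per_device_train_batch_size=32,
    num_train_epochs=1,
    learning_rate=2e-5,
)

trainer = SentenceTransformerTrainer(
    model=model,
    args=args,
    train_dataset=train_dataset,
    loss=MultipleNegativesRankingLoss(model),
)
trainer.train()
```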
Main save/push methods:
- save_pretrained(): saves LoRA adapters to a local folder
- save_pretrained_merged(): saves the merged model to a local folder
- push_to_hub(): pushes LoRA adapters to Hugging Face
- push_to_hub_merged(): pushes the merged model to Hugging Face
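For example (repository names are placeholders, and optional arguments beyond the path may differ from this sketch):

```python
# Save LoRA adapters locally, or merge them into the base weights first.
model.save_pretrained("my-embedding-lora")
model.save_pretrained_merged("my-embedding-merged")

# Push either variant to the Hugging Face Hub ("your-username/..." is a placeholder).
model.push_to_hub("your-username/my-embedding-lora")
model.push_to_hub_merged("your-username/my-embedding-merged")
```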
And one very important detail: Inference loading requires for_inference=True
from_pretrained() is similar to Unsloth’s other fast classes, with one exception:
To load a model for inference using FastSentenceTransformer, you must pass for_inference=True.
So your inference loads should look roughly like this (other arguments follow the notebooks):
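```python
from unsloth import FastSentenceTransformer

# for_inference=True is required when loading for inference.
# The model path can be a local folder or a Hugging Face repo (placeholder below).
model = FastSentenceTransformer.from_pretrained(
    "your-username/my-embedding-merged",
    for_inference=True,
)

# encode() as with a regular SentenceTransformer.
embeddings = model.encode(["fine-tuned embeddings, ready to query"])
```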
For Hugging Face authorization: if you log in to the Hub inside the same virtualenv before calling the hub methods (a minimal sketch follows), then push_to_hub() and push_to_hub_merged() don’t require a token argument.
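Either of the standard Hugging Face login paths works; this sketch assumes you already have a Hub token:

```python
# One-time authentication for the current environment.
# Option 1: from a shell, run `huggingface-cli login`.
# Option 2: from Python:
from huggingface_hub import login

login()  # prompts for your Hugging Face token and stores it for later hub calls
```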
✅ Inference and Deploy Anywhere!
Your fine-tuned Unsloth model can be used and deployed with all major tools: transformers, sentence-transformers, LangChain, Weaviate, Text Embeddings Inference (TEI), vLLM, llama.cpp, custom embedding APIs, pgvector, FAISS and other vector databases, and any RAG framework.
There is no lock-in, as the fine-tuned model can be downloaded locally to your own device.
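For example, a merged checkpoint can be loaded with plain sentence-transformers on any machine, with no Unsloth dependency (the repository name below is a placeholder):

```python
from sentence_transformers import SentenceTransformer

# Load the merged, fine-tuned model directly from the Hub or a local folder.
model = SentenceTransformer("your-username/my-embedding-merged")

docs = ["Unsloth fine-tunes embedding models quickly.", "Pixel 10 launch coverage."]
query_embedding = model.encode("How do I fine-tune an embedding model?")
doc_embeddings = model.encode(docs)

# Rank documents by cosine similarity to the query.
scores = model.similarity(query_embedding, doc_embeddings)
print(scores)
```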
📊 Unsloth Benchmarks
Speed is one of Unsloth's biggest advantages for embedding fine-tuning: we are consistently 1.8x to 3.3x faster across a wide variety of embedding models and sequence lengths from 128 to 2048 and longer.
EmbeddingGemma-300M QLoRA works on just 3GB VRAM and LoRA works on 6GB VRAM.
Below are our Unsloth benchmarks in a heatmap vs. SentenceTransformers + Flash Attention 2 (FA2) for 4bit QLoRA. For 4bit QLoRA, Unsloth is 1.8x to 2.6x faster:

Below are our Unsloth benchmarks in a heatmap vs. SentenceTransformers + Flash Attention 2 (FA2) for 16bit LoRA. For 16bit LoRA, Unsloth is 1.2x to 3.3x faster:

🔮 Model Support
Here are some popular embedding models Unsloth supports (not all models are listed here):
Most common models are already supported. If there’s an encoder-only model you’d like that isn’t, feel free to open a GitHub issue requesting it.