
Conversation

@electroglyph
Contributor

electroglyph commented Dec 12, 2025

Supersedes #3718.

Example training code:

from unsloth import FastSentenceTransformer
from sentence_transformers import (
    SentenceTransformerTrainer,
    SentenceTransformerTrainingArguments,
)
from datasets import Dataset
import torch

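# Load the base embedding model with Unsloth's optimized 4-bit loader.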
model_name = "Snowflake/snowflake-arctic-embed-m-v1.5"
model = FastSentenceTransformer.from_pretrained(
    model_name,
    load_in_4bit=True,
    device_map="cuda",
)

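# Attach LoRA adapters to the underlying transformer for memory-efficient finetuning.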
model = FastSentenceTransformer.get_peft_model(
    model,
    r=16,
    target_modules=["query", "key", "value", "dense"],
    lora_alpha=16,
    lora_dropout=0,
    bias="none",
    # task_type="FEATURE_EXTRACTION",
)

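# Tiny toy dataset of sentence pairs with float similarity labels.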
train_dataset = Dataset.from_dict(
    {
        "sentence_A": [
            "The cat sits outside",
            "A man is playing guitar",
            "I love pasta",
        ],
        "sentence_B": [
            "A man is playing guitar",
            "The woman loves that cat",
            "Do you like pizza?",
        ],
        "label": [0.0, 0.5, 1.0],
    }
)

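# CoSENTLoss trains the model so that pairs with higher labels get higher cosine similarity.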
from sentence_transformers.losses import CoSENTLoss

loss = CoSENTLoss(model)

args = SentenceTransformerTrainingArguments(
    output_dir="test_trainer_output",
    num_train_epochs=1,
    per_device_train_batch_size=2,
    learning_rate=2e-4,
    fp16=not torch.cuda.is_bf16_supported(),
    bf16=torch.cuda.is_bf16_supported(),
    logging_steps=1,
    save_strategy="no",
    report_to="none",
)

trainer = SentenceTransformerTrainer(
    model=model,
    args=args,
    train_dataset=train_dataset,
    loss=loss,
)

trainer.train()

print("Training finished successfully!")
@gemini-code-assist
Contributor

Summary of Changes

Hello @electroglyph, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request significantly expands Unsloth's capabilities by introducing FastSentenceTransformer, a new class designed to streamline the finetuning of SentenceTransformer models. By integrating Unsloth's optimized model loading and PEFT (LoRA) application, users can now efficiently train embedding models with reduced memory footprint and improved speed, broadening the range of models that can benefit from Unsloth's performance enhancements.

Highlights

  • New FastSentenceTransformer Class: Introduces FastSentenceTransformer to enable easy and optimized finetuning of SentenceTransformer models within the Unsloth framework.
  • Optimized Model Loading and PEFT Integration: Leverages Unsloth's FastModel for efficient loading of base models and seamlessly integrates PEFT (LoRA) for memory-efficient finetuning of the underlying transformer.
  • Automatic Pooling Mode Detection: Includes logic to automatically detect the pooling mode (e.g., 'cls', 'mean', 'max') from the SentenceTransformer model's configuration, ensuring correct setup.
  • Seamless SentenceTransformer Wrapper: Wraps the loaded base model into a sentence_transformers.SentenceTransformer instance, complete with Transformer, Pooling, and Normalize modules, making it compatible with the sentence-transformers ecosystem.
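As a rough illustration of the last two highlights, wrapping a base encoder into a SentenceTransformer with explicit modules might look like the following sketch (not the PR's code; the pooling mode is hard-coded here, whereas the PR auto-detects it from the model's configuration):

from sentence_transformers import SentenceTransformer, models

# Build the wrapped model from explicit Transformer / Pooling / Normalize modules.
base = models.Transformer("Snowflake/snowflake-arctic-embed-m-v1.5")
# The PR auto-detects the pooling mode from the model's configuration;
# "cls" is assumed here purely for illustration.
pooling = models.Pooling(base.get_word_embedding_dimension(), pooling_mode="cls")
normalize = models.Normalize()
wrapped = SentenceTransformer(modules=[base, pooling, normalize])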

@electroglyph
Contributor Author

Just getting this started; I'll get to the docs (and any suggestions) tomorrow.

Contributor

@gemini-code-assist bot left a comment


Code Review

This pull request introduces FastSentenceTransformer, a new class for easily finetuning SentenceTransformer models with Unsloth's optimizations. The implementation is solid, providing from_pretrained and get_peft_model methods that correctly integrate with the existing FastModel framework. The code handles loading quantized models, applying PEFT, and constructing a SentenceTransformer object. My review includes a couple of suggestions to improve code style and maintainability, such as moving imports to the top level and refactoring a conditional block to be more concise.

@shimmyshimmer
Collaborator

Thank you, amazing!! Please let us know if you'd like to collab on a blog as well :)

@electroglyph
Contributor Author

electroglyph commented Dec 13, 2025

Thank you, amazing!! Please let us know if you'd like to collab on a blog as well :)

Absolutely, that would be great!

unslothai/unsloth-zoo#383 will add XLMRobertaModel support

@Datta0
Collaborator

Datta0 commented Dec 15, 2025

Ok, I tested your notebook and it seems to work fine. I'll review it in the morning.

Collaborator

@Datta0 left a comment


Great work! I have some queries and comments :)

@electroglyph marked this pull request as ready for review December 16, 2025 11:08

@chatgpt-codex-connector bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.


@electroglyph
Contributor Author

electroglyph commented Dec 17, 2025

Edited Dec 20: removed a few incompatible types (like model2vec) from the list.

Here's the current compatibility status. I tested training the top 100 encoder embedding models (by download count), and 87 out of 100 can be trained right now: https://0x0.st/Ps1q.txt

Training, inference, and saving work on all supported models, with an average memory reduction of around 85%.
from_pretrained() now takes a for_inference bool to load models in inference mode.

I haven't yet figured out a generic way to switch all supported models from training mode to inference mode in memory.
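
For example (a minimal sketch of the for_inference flag mentioned above; the model name and other arguments simply mirror the training example, and encode() is the standard sentence-transformers API on the returned model):

from unsloth import FastSentenceTransformer

# Load in inference mode rather than training mode, per the for_inference flag above.
model = FastSentenceTransformer.from_pretrained(
    "Snowflake/snowflake-arctic-embed-m-v1.5",
    load_in_4bit=True,
    for_inference=True,
)

embeddings = model.encode(["The cat sits outside", "A man is playing guitar"])
print(embeddings.shape)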

… + SDPA

The original implementation was 31% slower than naive SentenceTransformer due to
conflicting decorators from Unsloth's auto-compiler (@torch.compile on attention
modules but @torch.compiler.disable on sub-modules).

Changes:
- Add fast encoder path that bypasses Unsloth patching for encoder models
- Use native torch.compile with mode="reduce-overhead" for 6x speedup
- Auto-detect and enable SDPA for models that support it (BERT, RoBERTa, etc.)
- Change defaults: load_in_16bit=True, load_in_4bit=False (16-bit is optimal)
- Change default: use_gradient_checkpointing=False (conflicts with torch.compile)
- Add UNSLOTH_COMPILE_DISABLE=1 env var to fall back to old path if needed

Supported encoder types: mpnet, bert, distilbert, roberta, xlm-roberta, albert, electra

Benchmark results (BS=32, seq_len=128):
- Naive 16-bit LoRA:     13-50ms per iter
- Unsloth 16-bit LoRA:   2-9ms per iter (5.4x-6.7x faster)
- Memory usage:          61MB-1.3GB (even largest model fits easily)

Note: 4-bit + torch.compile has a PyTorch bug (pytorch/pytorch#90665).
4-bit is also 1.7-1.9x slower than 16-bit due to dequantization overhead,
so 16-bit is recommended for these small encoder models anyway.
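
Roughly, the fast path amounts to loading the encoder with SDPA attention and compiling it natively; the sketch below uses plain sentence-transformers and PyTorch APIs rather than the PR's internal wiring:

import torch
from sentence_transformers import SentenceTransformer

# Escape hatch from the commit message: set UNSLOTH_COMPILE_DISABLE=1 in the
# environment to fall back to the old Unsloth-patched path instead of this one.

# Illustrative equivalent of the fast encoder path: SDPA attention + native torch.compile.
st = SentenceTransformer(
    "sentence-transformers/all-MiniLM-L6-v2",
    model_kwargs={"attn_implementation": "sdpa", "torch_dtype": torch.float16},
    device="cuda",
)
st[0].auto_model = torch.compile(st[0].auto_model, mode="reduce-overhead")
embeddings = st.encode(["The cat sits outside", "A man is playing guitar"])
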
@danielhanchen
Contributor

Added performance fix for FastSentenceTransformer.

The original implementation was 31% slower than naive SentenceTransformer due to conflicting decorators from Unsloth's auto-compiler. Fixed by adding a fast encoder path that uses native torch.compile + SDPA.

Changes:

  • Fast encoder path bypasses Unsloth patching for encoder models (mpnet, bert, distilbert, roberta, xlm-roberta, albert, electra)
  • Uses torch.compile with mode="reduce-overhead" for 6x speedup
  • Auto-detects and enables SDPA for models that support it
  • Changed defaults: load_in_16bit=True, load_in_4bit=False, use_gradient_checkpointing=False
  • Added UNSLOTH_COMPILE_DISABLE=1 env var to fall back to old path if needed

Benchmark results (BS=32, seq_len=128):

Model        Naive      Unsloth    Speedup
MiniLM-L6    13.69ms    2.24ms     6.10x
MPNet-base   17.51ms    2.62ms     6.69x
BGE-large    49.48ms    9.15ms     5.41x

Memory usage (16-bit LoRA + torch.compile):

Model        BS=8      BS=32     BS=128
MiniLM-L6    61 MB     77 MB     143 MB
MPNet-base   237 MB    273 MB    417 MB
BGE-large    717 MB    837 MB    1320 MB

Note: 4-bit + torch.compile has a PyTorch bug (pytorch/pytorch#90665). 4-bit is also 1.7-1.9x slower than 16-bit due to dequantization overhead, so 16-bit is recommended for these small encoder models.
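
With the new defaults, a plain from_pretrained() call loads in 16-bit; a minimal sketch, assuming the load_in_16bit / load_in_4bit keyword names from the changes above:

from unsloth import FastSentenceTransformer

# New defaults: 16-bit load, no 4-bit quantization, no gradient checkpointing.
model = FastSentenceTransformer.from_pretrained("sentence-transformers/all-mpnet-base-v2")

# 4-bit can still be requested explicitly, but per the note above it is slower for these
# small encoders and currently hits a torch.compile bug (pytorch/pytorch#90665).
model_4bit = FastSentenceTransformer.from_pretrained(
    "sentence-transformers/all-mpnet-base-v2",
    load_in_16bit=False,
    load_in_4bit=True,
)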

pre-commit-ci bot and others added 3 commits January 5, 2026 06:56
Changed from peft.prepare_model_for_kbit_training to
unsloth.models._utils.prepare_model_for_kbit_training.

Unsloth's version provides:
- Float32 mixed precision upcasting for LoRA layers
- Better numerical stability
- Consistency with the rest of the Unsloth codebase
- Changed absolute import to relative: from ._utils import prepare_model_for_kbit_training
- Added SUPPORTS_BFLOAT16 import for proper dtype detection
- Handle devices that don't support bfloat16 by falling back to float16
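
The bfloat16 fallback in the last point boils down to a small check; a sketch using plain PyTorch (Unsloth imports SUPPORTS_BFLOAT16 from its own utilities, so this is only illustrative):

import torch

# Prefer bfloat16 when the current GPU supports it, otherwise fall back to float16.
SUPPORTS_BFLOAT16 = torch.cuda.is_available() and torch.cuda.is_bf16_supported()
dtype = torch.bfloat16 if SUPPORTS_BFLOAT16 else torch.float16
print(f"Using dtype: {dtype}")
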
pre-commit-ci bot and others added 6 commits January 5, 2026 10:23
…nalysis

Changes:
- Change default compile_mode from "reduce-overhead" to "default" since CUDA
  Graphs (used by reduce-overhead) is incompatible with PEFT/LoRA
- Add _estimate_compile_threshold() to calculate minimum steps needed for
  torch.compile to be beneficial based on model parameter count
- Add _apply_torch_compile() helper with accelerate unwrap_model bug workaround
- Defer torch.compile application to trainer initialization time so we can
  check max_steps against the breakeven threshold
- Patch SentenceTransformerTrainer to auto-apply compile when max_steps
  exceeds the calculated threshold

Breakeven thresholds (with 1.2x safety margin):
- 22M params (MiniLM): ~1388 steps
- 110M params (mpnet): ~242 steps
- 335M params (snowflake): ~203 steps

This ensures torch.compile warmup cost is only paid when training is long
enough to benefit from the speedup.
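
The breakeven idea can be sketched as follows (a hypothetical helper, not the PR's _estimate_compile_threshold, which derives its inputs from the parameter count rather than taking the costs directly):

import math

def estimate_compile_threshold(warmup_seconds: float,
                               per_step_saving_seconds: float,
                               safety_margin: float = 1.2) -> int:
    # Compile only pays off once the accumulated per-step savings exceed the one-time
    # warmup cost; the safety margin biases the decision toward skipping compile.
    return math.ceil(safety_margin * warmup_seconds / per_step_saving_seconds)

# e.g. a 30 s warmup and ~25 ms saved per step break even after ~1440 steps.
print(estimate_compile_threshold(30.0, 0.025))
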
@danielhanchen
Contributor

Added auto-compile feature that intelligently decides whether to apply torch.compile based on a breakeven analysis.

Changes

  1. Fixed CUDA Graphs incompatibility: Changed default compile mode from reduce-overhead to default. The reduce-overhead mode uses CUDA Graphs which causes errors with PEFT/LoRA:

    RuntimeError: accessing tensor output of CUDAGraphs that has been overwritten
    
  2. Added breakeven estimation: New _estimate_compile_threshold() calculates the minimum training steps needed for torch.compile warmup to pay off, based on model parameter count:

    • 22M params (MiniLM): ~1388 steps threshold
    • 110M params (mpnet): ~242 steps threshold
    • 335M params (snowflake): ~203 steps threshold
  3. Deferred compilation: torch.compile is now applied at trainer initialization time (not model creation time), allowing us to check max_steps against the threshold.

  4. Added accelerate workaround: Fixed the unwrap_model bug by setting model.__dict__["_orig_mod"] = model after compilation.

How it works

When creating a trainer:

trainer = SentenceTransformerTrainer(model=model, args=args, ...)

The patched trainer checks:

  • If max_steps >= threshold: applies torch.compile (prints "Auto-compiling model")
  • If max_steps < threshold: skips compile (prints "Skipping torch.compile")

This ensures torch.compile warmup cost (~15-40 seconds) is only paid when training is long enough to benefit from the 1.5-2x speedup.
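
The decision itself reduces to a simple comparison; a hypothetical sketch of what the patched trainer does internally (the module and numbers below are stand-ins for illustration):

import torch
import torch.nn as nn

def should_compile(max_steps: int, breakeven_steps: int) -> bool:
    # Compile only when the run is long enough to amortize the warmup cost.
    return max_steps >= breakeven_steps

model = nn.Linear(8, 8)                # stand-in module for illustration
max_steps, breakeven_steps = 500, 242  # e.g. an mpnet-sized model per the thresholds above

if should_compile(max_steps, breakeven_steps):
    print("Auto-compiling model")
    model = torch.compile(model, mode="default")  # "default" mode avoids the CUDA Graphs issue
else:
    print("Skipping torch.compile")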

@danielhanchen
Contributor

/gemini review

@danielhanchen merged commit 5011442 into unslothai:main Jan 22, 2026
1 check passed
