🌀7x Longer Context Reinforcement Learning GRPO
Learn how Unsloth enables ultra long context RL fine-tuning.
The biggest challenge in reinforcement learning (RL) is supporting long reasoning traces. We're introducing new batching algorithms to enable ~7x longer context (can be more than 12x) RL training with no accuracy or speed degradation vs. other optimized setups that use FA3, kernels & chunked losses.
Unsloth now trains gpt-oss QLoRA with 380K context on a single 192GB NVIDIA B200 GPU
On 24GB VRAM, gpt-oss reaches 20K context, and Qwen3-VL-8B QLoRA reaches 32K
Unsloth GRPO RL runs with Llama, Gemma & all other models automatically support longer contexts
Our new data-movement and batching kernels and algorithms unlock more context by:
Dynamic flattened sequence chunking to avoid materializing massive logit tensors and
Offloading log softmax activations which prevents silent memory growth over time.
You can combine all features in Unsloth together:
Unsloth's weight-sharing feature with vLLM and our Standby Feature in Memory Efficient RL
Unsloth's Flex Attention for long context gpt-oss and our 500K Context Training
Float8 training in FP8 RL and Unsloth's async gradient checkpointing and much more
🎉Getting started
To get started, you can use any existing GRPO notebooks (or update Unsloth if local):
Adopting Unsloth for your RL tasks gives you a robust framework for efficiently training large-scale models. To make the most of Unsloth's enhancements:
Hardware Recommendations: Use an NVIDIA H100 or equivalent GPU for optimal VRAM utilization.
Configuration Tips: Ensure batch_size and gradient_accumulation_steps settings align with your computational resources for best performance.
Update Unsloth to the latest PyPI release to get the latest updates:
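For a local install, a standard PyPI upgrade looks like this (the exact recommended flags may differ; see the installation docs):

```bash
pip install --upgrade --no-cache-dir unsloth unsloth_zoo
```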
Our benchmarks highlight the memory savings achieved in comparison to earlier versions for gpt-oss and Qwen3-8B. Both plots below (without standby) were run with batch_size = 4 and gradient_accumulation_steps = 2, since standby by design uses all VRAM.
For our benchmarks, we compare BF16 GRPO to Hugging Face with all optimizations enabled (all kernels in the kernels library, Flash Attention 3, chunked loss kernels, etc.):
🔢Flattened sequence length chunking
Previously, Unsloth reduced memory usage of RL by avoiding the full materialization of the logits tensor through chunking over the batch dimension. A rough estimate of the VRAM required to materialize logits during the forward pass is shown in Equation (1).
Using this formulation, a configuration with batch_size = 4, context_length = 8192, and vocab_dim = 128,000 would require approximately 3.3 GB of VRAM to store the logits tensor.
In our Long Context gpt-oss release last year, we introduced a fused loss approach for GRPO. This approach processes only a single batch sample at a time, significantly reducing peak memory usage. Under the same configuration, VRAM usage drops to approximately 0.83 GB, as reflected in Equation (2).
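Roughly speaking (the exact constants depend on the dtype the logits are materialized in), the full tensor scales with every dimension, while the fused per-sample loss removes the batch_size factor from the peak:

$$\text{VRAM}_{\text{logits}} \;\propto\; \texttt{batch\_size} \times \texttt{context\_length} \times \texttt{vocab\_dim}$$

$$\text{VRAM}_{\text{logits, fused}} \;\propto\; \texttt{context\_length} \times \texttt{vocab\_dim}$$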


In this update, we extend the same idea further by introducing chunking across the sequence dimension as well. Instead of materializing logits for the entire (batch_size × context_length) space at once, we flatten these dimensions and process them in smaller chunks using a configurable multiplier. This allows Unsloth to support substantially longer contexts without increasing peak memory usage.
In Figure 5 below, we use a multiplier of max(4, context_length // 4096), though any multiplier can be specified depending on the desired memory–performance tradeoff. With this setting, the same example configuration (batch_size = 4, context_length = 8192, vocab_dim = 128,000) now requires only 0.207 GB of VRAM for logits materialization.
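As a quick illustration of how the default multiplier splits the example configuration (the variable names here are only for the arithmetic, not Unsloth internals):

```python
# Default sequence-chunking multiplier for the example configuration above.
context_length = 8192
multiplier = max(4, context_length // 4096)          # max(4, 2) = 4 chunks
positions_per_chunk = context_length // multiplier   # 8192 // 4 = 2048 positions per chunk
```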




This update is implemented in the compiled chunked_hidden_states_selective_log_softmax function, which now supports chunking across both the batch and sequence dimensions. The logits tensor ([batch_size, context_length, vocab_dim]) is always chunked across the batch dimension; additional sequence chunking is controlled via unsloth_logit_chunk_multiplier in the GRPO configuration, which defaults to max(4, context_length // 4096) if unset. Within that function, input_ids_chunk[0] corresponds to the size of the hidden-state mini-batches in optimization 2.
We utilize torch.compile with custom compile options to reduce VRAM and increase speed.
All chunked logits are upcasted in float32 to preserve accuracy.
We support logit softcapping, temperature scaling and all other features.
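For intuition, here is a minimal PyTorch sketch of the idea behind chunked_hidden_states_selective_log_softmax. It is not Unsloth's compiled kernel: it omits prompt masking, softcapping, temperature scaling, and the usual one-token shift between logits and labels, and the function and argument names are illustrative only.

```python
import torch

def chunked_selective_log_softmax(hidden_states, lm_head, input_ids, chunk_multiplier=None):
    """Sketch: per-token log-probabilities without materializing the full logits tensor."""
    batch, seq_len, hidden_dim = hidden_states.shape
    if chunk_multiplier is None:
        chunk_multiplier = max(4, seq_len // 4096)      # default described above

    logprobs = torch.empty(batch, seq_len, dtype=torch.float32, device=hidden_states.device)
    for i in range(batch):                              # always chunk across the batch dimension
        for hidden_chunk, ids_chunk, out_chunk in zip(
            hidden_states[i].chunk(chunk_multiplier),   # [chunk_len, hidden_dim]
            input_ids[i].chunk(chunk_multiplier),       # [chunk_len]
            logprobs[i].chunk(chunk_multiplier),        # output views we write into
        ):
            # Only this chunk's logits ever exist; upcast to float32 to preserve accuracy.
            logits = (hidden_chunk @ lm_head.T).float() # [chunk_len, vocab_dim]
            out_chunk.copy_(
                torch.log_softmax(logits, dim=-1)
                .gather(-1, ids_chunk.unsqueeze(-1))
                .squeeze(-1)
            )
    return logprobs                                     # [batch, seq_len]
```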
👻Hidden States Chunking
We also observed that at longer context lengths, hidden states can become a significant contributor to memory usage. For demonstration, we will assume hidden_states_dim=4096. The corresponding memory usage follows a similar formulation to the logits case, shown below.
With batch_size = 8 and context_length = 64000, this results in roughly 2 GB of VRAM. In this release, we introduce optional chunking over the batch dimension for the hidden states tensor during log-probability computation, which divides the VRAM usage by the batch size, in this case down to about 0.244 GB. This reduces the peak VRAM required to materialize hidden states.
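As with the logits case, the peak scales with every dimension, and chunking over the batch dimension removes the batch_size factor (the exact constant again depends on the dtype):

$$\text{VRAM}_{\text{hidden}} \;\propto\; \texttt{batch\_size} \times \texttt{context\_length} \times \texttt{hidden\_states\_dim}$$

$$\text{VRAM}_{\text{hidden, chunked}} \;\propto\; \texttt{context\_length} \times \texttt{hidden\_states\_dim}$$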
Similar to the cross entropy loss in our 500K Context Training release, the new implementation automatically tunes hidden-state batching. Users can also control this behavior via unsloth_grpo_mini_batch. Note that increasing unsloth_grpo_mini_batch beyond the optimal value can cause a slight speedup or slowdown (usually a speedup) compared to the previous loss function.
However, during a gpt-oss run (context_length = 8192, batch_size = 4, gradient_accumulation_steps = 2), setting unsloth_grpo_mini_batch = 1 and unsloth_logit_chunk_multiplier = 4 results in little to no speed degradation while reducing VRAM usage by approximately 5 GB compared to older versions of Unsloth.

Note: In Figures 3 and 4, we use the maximum effective batch size, which is 8 in this setup. The effective batch size is computed as batch_size × gradient_accumulation_steps, giving 4 × 2 = 8. For a deeper explanation of how effective batch sizes work in RL, see our advanced RL documentation.
🌵Offloading activations for log softmax
During the development of this release, we discovered that when tiling across the batch dimension for hidden states, the activations were not being offloaded after the fused logits and logprobs computation. Because logits are computed one batch sample at a time using hidden_states[i] @ lm_head, the existing activation offloading and gradient checkpointing logic, designed to operate within the model's forward pass, did not apply in this case.
To address this, we added explicit logic to offload these activations outside the model's forward pass.
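The snippet below is a simplified stand-in for that logic rather than Unsloth's actual code: it uses PyTorch's torch.autograd.graph.save_on_cpu hook to keep the activations saved for backward in pinned CPU memory while the per-sample fused logits/log-prob computation runs on the GPU.

```python
import torch

def per_sample_logprobs_with_offload(hidden_states, lm_head, input_ids):
    """Sketch only: compute selective log-probs one batch sample at a time and
    offload the activations saved for backward to CPU instead of keeping them in VRAM."""
    all_logprobs = []
    for i in range(hidden_states.shape[0]):            # one batch sample at a time
        # Everything saved for backward inside this context (e.g. the hidden-state
        # slice and the log-softmax output) is moved to pinned CPU memory and copied
        # back to the GPU only when the backward pass needs it.
        with torch.autograd.graph.save_on_cpu(pin_memory=True):
            logits = (hidden_states[i] @ lm_head.T).float()   # [seq_len, vocab_dim]
            logprobs = torch.log_softmax(logits, dim=-1)
            selected = logprobs.gather(-1, input_ids[i].unsqueeze(-1)).squeeze(-1)
        all_logprobs.append(selected)
    return torch.stack(all_logprobs)                    # [batch, seq_len]
```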
Note: This feature is only effective when chunking across the batch dimension or when unsloth_grpo_mini_batch > 1. If all hidden states are materialized at once during the forward pass (i.e., unsloth_grpo_mini_batch = 1), the backward pass requires the same amount of memory in the GPU regardless of whether activations are offloaded. Since activation offloading introduces a slight performance slowdown without reducing memory usage in this case, it provides no benefit.
✨Configuring parameters
If you do not configure unsloth_grpo_mini_batch and unsloth_logit_chunk_multiplier, we will automatically tune both parameters for you based on your available VRAM and your context length. You can, however, also set them explicitly in your GRPO run.
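A minimal sketch of what that could look like, assuming both knobs are passed through the GRPO configuration as described above (the exact placement may differ; check the GRPO notebooks for the up-to-date usage):

```python
import unsloth                    # import first so Unsloth's GRPO patches are applied
from trl import GRPOConfig

training_args = GRPOConfig(
    per_device_train_batch_size = 4,
    gradient_accumulation_steps = 2,
    unsloth_grpo_mini_batch = 1,          # hidden-state mini-batching (1 = processed in a single pass)
    unsloth_logit_chunk_multiplier = 4,   # sequence chunks per sample when materializing logits
    # Leave both unset to let Unsloth auto-tune them from your VRAM and context length.
)
```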
A visualization of the optimizations and unsloth_grpo_mini_batch and unsloth_logit_chunk_multiplier can be seen in the diagram below.

The 3 matrices represent the overall larger batch, or unsloth_grpo_mini_batch (indicated by the number of black brackets), and the rows of each matrix represent the context length that unsloth_logit_chunk_multiplier chunks the sequence length into (indicated by the number of red brackets).
📼vLLM for RL
For RL workflows, the inference/generation phase is the main bottleneck. To address this, we utilize vLLM, which has accelerated generation by up to 11x compared to normal generation. Since GRPO was popularized last year, vLLM has been a core component of most RL frameworks including Unsloth. We want to extend our gratitude to the vLLM team and all its contributors for their work as they play a pivotal role in making Unsloth’s RL better!
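In Unsloth, vLLM-backed generation is enabled when loading the model; a minimal sketch (the model name and memory fraction are placeholders, not recommendations):

```python
from unsloth import FastLanguageModel

# Load a model with vLLM-backed fast generation for GRPO rollouts.
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "unsloth/Qwen3-8B",
    max_seq_length = 8192,
    load_in_4bit = True,            # QLoRA base weights
    fast_inference = True,          # use vLLM for the generation phase
    gpu_memory_utilization = 0.7,   # fraction of VRAM reserved for vLLM
)
```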
To try longer context RL, you can use any existing GRPO notebooks (or update Unsloth if local):
Acknowledgements: A huge thank you to the Hugging Face team and libraries for powering Unsloth and making this possible.