Accelerating GPU Performance
GPU-only pipelines can outperform hybrid CPU/GPU approaches, even for small matrix sizes, when host synchronization is eliminated using CUDAGraphs.
By default, GPU operations are asynchronous: the CPU queues kernel launches and continues executing. Host synchronization, however, forces the CPU (host) to wait for outstanding GPU operations to complete before proceeding. The result is a blocking operation that eliminates the natural asynchronous parallelism between CPU and GPU.
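A minimal sketch of the common operations that trigger a host sync (the op names are standard PyTorch; the tensor sizes are illustrative, and the code falls back to CPU on machines without a GPU):

```python
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"  # CPU fallback for illustration
x = torch.randn(512, 512, device=device)

y = x @ x          # on CUDA, this kernel launch returns to the host immediately
z = torch.relu(y)  # queued behind the matmul; still no host wait

# Each of these blocks the host until the GPU work above has finished:
total = z.sum().item()        # reading a tensor value back on the host
z_cpu = z.to("cpu")           # device-to-host data movement
if device == "cuda":
    torch.cuda.synchronize()  # explicit synchronization call
```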
Host synchronization impairs performance in GPU computing. Any approach with frequent host sync points (like hybrid CPU/GPU strategies) will severely degrade performance, regardless of how fast individual operators are. By extrapolation, end-to-end pipeline design and CUDAGraph compatibility provide better performance metrics than isolated operator benchmarks.
Host synchronization can occur for several reasons, including reading GPU tensor values on the host, moving data to the CPU, explicit synchronization calls, and memory operations across devices. The benchmarks showed the dramatic impact: isolated GPU operations took 1.053 ms, while the same work with realistic host syncs took 9.867 ms (9.4× slower). Hybrid CPU/GPU approaches were also problematic: constant device switching added a 14.9% overhead, making them 1.44× slower than the GPU-only pipeline.
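The figures above come from the author's benchmark scripts; a minimal sketch of how such a comparison can be timed (workload, sizes, and iteration count are illustrative, with a CPU fallback so the code runs anywhere):

```python
import time
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"
x = torch.randn(256, 256, device=device)

def run(iters, sync_every_step):
    """Time the same GPU work with per-iteration syncs vs. one final sync."""
    start = time.perf_counter()
    for _ in range(iters):
        y = torch.relu(x @ x)
        if sync_every_step:
            _ = y.sum().item()    # forces a host sync on every iteration
    if device == "cuda":
        torch.cuda.synchronize()  # single sync at the end
    return time.perf_counter() - start

t_sync = run(100, sync_every_step=True)
t_async = run(100, sync_every_step=False)
print(f"per-step sync: {t_sync:.4f}s, single final sync: {t_async:.4f}s")
```

On a GPU, the per-step variant pays the round-trip latency on every iteration; the single-sync variant lets the kernels queue up and overlap with host execution.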
CUDAGraphs capture the entire computation graph and eliminate per-operation host synchronization. Furthermore, torch.compile(mode="reduce-overhead") automatically reduces host synchronization through:
* Batching kernel launches with fewer sync points
* Kernel fusion where multiple operations become a single GPU call
* Memory pool pre-allocation that eliminates allocation syncs
* Graph optimization that minimizes host-device communication
With torch.compile, element-wise operations ran 2.78× faster and the number of host synchronization points dropped.
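Capturing a graph manually with torch.cuda.CUDAGraph works along these lines (a sketch of the standard capture/replay pattern, with illustrative shapes and workload; it requires a GPU, so it degrades to a message on CPU-only machines):

```python
import torch

captured = torch.cuda.is_available()
if captured:
    device = torch.device("cuda")
    static_x = torch.randn(1024, 1024, device=device)
    static_out = torch.empty_like(static_x)

    # Warm up on a side stream before capture, as graph capture requires
    s = torch.cuda.Stream()
    s.wait_stream(torch.cuda.current_stream())
    with torch.cuda.stream(s):
        for _ in range(3):
            static_out.copy_(torch.relu(static_x @ static_x))
    torch.cuda.current_stream().wait_stream(s)

    g = torch.cuda.CUDAGraph()
    with torch.cuda.graph(g):
        static_out.copy_(torch.relu(static_x @ static_x))

    # Replay: one host call launches the whole captured graph,
    # with no per-operation kernel-launch overhead or sync points.
    static_x.copy_(torch.randn(1024, 1024, device=device))
    g.replay()
    torch.cuda.synchronize()
else:
    print("CUDA not available; CUDAGraph capture requires a GPU")
```

Note that inputs must be written into the same static tensors between replays, since the graph records fixed memory addresses.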
Conclusion:
Host synchronization is the hidden performance killer in GPU computing. Any approach that introduces frequent host sync points (like hybrid CPU/GPU strategies) severely degrades performance, regardless of individual operator speed. End-to-end pipeline design and CUDAGraph compatibility therefore matter more than isolated operator benchmarks. With CUDAGraphs and torch.compile(mode="reduce-overhead"), GPU-only computing can outperform hybrid CPU/GPU approaches.
Source Code References
The comprehensive analysis was conducted using these benchmark scripts (now in the pytorch-testing-scripts repository):
https://lnkd.in/gjbV3AbJ - Complete benchmarking suite with CUDAGraph and torch.compile support
e2e_benchmark_clean.py - End-to-end analysis demonstrating host synchronization impact
Repository: https://lnkd.in/gqUuiv2G