GRPO: a new method for training LLMs on a single GPU


GRPO, simplified. Training reasoning-capable LLMs needs GPUs, a lot of them. But a new method from DeepSeek, Group Relative Policy Optimization (GRPO), is changing the game. Unlike PPO-based RLHF, which requires four large models and massive compute, GRPO reduces the model count, avoids separate reward/value networks, and enables training on a single GPU, even for complex reasoning tasks. Link: https://lnkd.in/gycxZjej
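For readers who want the core trick concretely: below is a minimal sketch (not DeepSeek's implementation) of the group-relative advantage that lets GRPO drop the learned value network. The function name and toy rewards are illustrative; the idea is simply to normalize each sampled response's reward against the other responses drawn for the same prompt.

import numpy as np

def group_relative_advantages(rewards, eps=1e-8):
    # GRPO's baseline: instead of a learned value network, each response
    # is compared against the other responses sampled for the same prompt.
    rewards = np.asarray(rewards, dtype=np.float64)
    return (rewards - rewards.mean()) / (rewards.std() + eps)

# Example: 4 sampled answers to one prompt, scored 1 if the final answer
# is correct (a rule-based reward; a learned reward model also works).
print(group_relative_advantages([1.0, 0.0, 0.0, 1.0]))
# -> roughly [+1, -1, -1, +1]

These advantages then plug into the usual PPO-style clipped objective, so the only large networks in memory are the policy, a frozen reference copy for the KL term, and whatever scores the responses.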


Thanks for sharing! We have a question on deep-ml where you could try to implement GRPO from scratch to see if you truly understand it: https://www.deep-ml.com/problems/101

A standard PPO implementation requires a policy and a value model (for estimating advantages); however, models like InstructGPT (the RLHF-tuned GPT-3) were trained with two key models, a policy (the SFT model) and a reward model, plus a reference model (a frozen copy of the policy from an earlier step) for the KL penalty. The reward model is also typically much smaller than the policy. Many implementations skip the value network altogether (if I'm not mistaken), so the compute requirement is not as high as four large models. In practice, then, both GRPO and RLHF end up needing three models.
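To make the model-count argument concrete, here is a minimal sketch of where the three models show up in PPO-style RLHF: the policy and reference models supply per-token log-probs for the KL penalty, and the reward model contributes a scalar score on the final token. The function name and beta value are illustrative, not taken from any particular codebase.

import torch

def kl_shaped_rewards(reward_score, logp_policy, logp_reference, beta=0.1):
    # Per-token rewards in PPO-style RLHF: a KL penalty against the frozen
    # reference (SFT) policy at every token, plus the reward-model score
    # added on the final token of the response.
    kl = logp_policy - logp_reference          # per-token log-ratio
    rewards = -beta * kl
    rewards[-1] = rewards[-1] + reward_score
    return rewards

# Toy per-token log-probs from the policy and the reference model
logp_policy = torch.tensor([-1.2, -0.8, -2.0])
logp_ref    = torch.tensor([-1.0, -0.9, -1.5])
print(kl_shaped_rewards(reward_score=0.7, logp_policy=logp_policy, logp_reference=logp_ref))

The value network, when present, only enters afterwards, turning these rewards into advantages (e.g., via GAE); GRPO replaces exactly that step with the group baseline sketched above.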


Thanks for sharing! Removing the reward/value networks helps smaller teams experiment without breaking the bank.
