Fine-tuning LLMs: a high-stakes game with side effects


🤔 Could a fine-tuned LLM actually be worse at finding the right answer than the raw base model? The shortcomings of fine-tuning aren't discussed much, but even the most modern methods come with tradeoffs.

Here's what recent research by Yang Yue et al. (2025) reveals:

🔹 Base models often match – or even beat – RL-fine-tuned models on tough tasks when given enough samples (i.e., more generations), because they explore the solution space more broadly.

🔹 Even world-class models like DeepSeek AI's R1 can become so efficient at sampling a narrow set of reasoning paths that, given a large enough sampling budget, their base models overtake them at arriving at the correct answer.

🔹 The implication: modern reasoning models are, in effect, hyper-efficient samplers – nothing more. Fine-tuning is more about efficiency than capability expansion.

At Flow AI, we've spent countless hours fine-tuning models and pushing them to their limits. One thing has become clear: tweaking how a model samples is a high-stakes game. You chase gains, but side effects often emerge silently. Hallucinations, bias, even catastrophic forgetting can surface long after deployment.

Have you run any fine-tuning experiments you'd be willing to share?

(Link to the paper in the comments)

#artificialintelligence #LLMs #machinelearning
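To make the "given enough samples" comparison concrete, here is a minimal sketch of how base vs. RL-tuned models could be compared with the standard unbiased pass@k estimator. The `model.generate` and `problem.is_correct` helpers are hypothetical placeholders for whatever inference and grading harness you use; this is not the paper's actual evaluation code.

```python
# Minimal sketch (assumption: `model.generate` and `problem.is_correct`
# are hypothetical stand-ins for your own inference and grading code).
import math

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k: probability that at least one of k samples,
    drawn without replacement from n generations of which c are
    correct, solves the task."""
    if n - c < k:
        return 1.0
    return 1.0 - math.comb(n - c, k) / math.comb(n, k)

def mean_pass_at_k(model, problems, n=64, k=32, temperature=0.8):
    """Draw n candidate answers per problem, grade them, average pass@k."""
    scores = []
    for problem in problems:
        answers = [model.generate(problem.prompt, temperature=temperature)
                   for _ in range(n)]
        c = sum(problem.is_correct(a) for a in answers)
        scores.append(pass_at_k(n, c, k))
    return sum(scores) / len(scores)

# The effect described above shows up when you sweep k:
# at k=1 the RL-tuned model usually wins (efficient sampling),
# but at large k the base model can match or overtake it.
# for k in (1, 8, 64, 256):
#     print(k, mean_pass_at_k(base_model, benchmark, n=256, k=k),
#              mean_pass_at_k(rl_model, benchmark, n=256, k=k))
```

Sweeping k rather than reporting a single accuracy number is what surfaces the crossover described above.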


Good topic, Aaro! We have done some fine-tuning experiments as well and can definitely confirm what you wrote here: for our use case the side effects were just too great (mainly hallucination and bias).

so basically RL = cheat code for one-shotting, but base model = more likely to solve if looped (sampled many times)?
