Fine-tuning LLMs: a high-stakes game with side effects


🤔 Could a fine-tuned LLM actually be worse at finding the right answer than the raw base model? The shortcomings of fine-tuning aren't discussed much, but even the most modern methods come with tradeoffs.

Here's what recent research by Yang Yue et al. (2025) reveals:

🔹 Base models often match – or even beat – RL-fine-tuned models on tough tasks when given enough samples (i.e., more generations), because they explore the solution space more broadly.

🔹 Even world-class models like DeepSeek AI's R1 can become so efficient at sampling a narrow set of reasoning paths that, given a large enough sampling budget, their base models overtake them at arriving at the correct answer.

🔹 The implication: modern reasoning models are, in effect, hyper-efficient samplers – nothing more. Fine-tuning is more about efficiency than capability expansion.

At Flow AI, we've spent countless hours fine-tuning models and pushing them to their limits. One thing has become clear: tweaking how a model samples is a high-stakes game. You chase gains, but side effects often emerge silently. Hallucinations, bias, even catastrophic forgetting can surface long after deployment.

Have you run any fine-tuning experiments you'd be willing to share?

(Link to the paper in the comments)

#artificialintelligence #LLMs #machinelearning
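To make the "given enough samples" comparison concrete, here is a minimal sketch of how base vs. RL-tuned models could be compared with the standard unbiased pass@k estimator. The `model.generate` and `problem.is_correct` helpers are hypothetical placeholders for whatever inference and grading harness you use; this is not the paper's actual evaluation code.

```python
# Minimal sketch (assumption: `model.generate` and `problem.is_correct`
# are hypothetical stand-ins for your own inference and grading code).
import math

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k: probability that at least one of k samples,
    drawn without replacement from n generations of which c are
    correct, solves the task."""
    if n - c < k:
        return 1.0
    return 1.0 - math.comb(n - c, k) / math.comb(n, k)

def mean_pass_at_k(model, problems, n=64, k=32, temperature=0.8):
    """Draw n candidate answers per problem, grade them, average pass@k."""
    scores = []
    for problem in problems:
        answers = [model.generate(problem.prompt, temperature=temperature)
                   for _ in range(n)]
        c = sum(problem.is_correct(a) for a in answers)
        scores.append(pass_at_k(n, c, k))
    return sum(scores) / len(scores)

# The effect described above shows up when you sweep k:
# at k=1 the RL-tuned model usually wins (efficient sampling),
# but at large k the base model can match or overtake it.
# for k in (1, 8, 64, 256):
#     print(k, mean_pass_at_k(base_model, benchmark, n=256, k=k),
#              mean_pass_at_k(rl_model, benchmark, n=256, k=k))
```

Sweeping k rather than reporting a single accuracy number is what surfaces the crossover described above.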


Good topic, Aaro! We have done some fine-tuning experiments as well and can definitely confirm what you wrote here: for our use case the side effects were just too great (mainly hallucination and bias).

so basically RL = cheat code for one-shotting, but base model = more likely to solve if looped (sampled many times)?
