LLM Performance in Solving Complex Puzzles


Summary

LLM performance in solving complex puzzles refers to how well large language models handle tasks that require multi-step reasoning, planning, and logic, such as brain teasers and challenging mathematical scenarios. Recent research suggests that reasoning models do best at medium complexity: standard models match or beat them on very simple problems, and both kinds fail on highly difficult ones, revealing important limits in their reasoning abilities.

  • Assess task complexity: Before choosing a language model for your project, consider whether your challenge is simple, moderately complex, or extremely difficult, as different models shine at different levels.
  • Decompose complex problems: Break down larger reasoning tasks into smaller, manageable steps, which helps models produce more accurate and cost-efficient solutions.
  • Monitor reasoning process: Keep an eye on how the model approaches each step—if it starts to rush through tough questions or produces unnecessarily long explanations, it might signal a need to adjust your workflow or prompts.
Summarized by AI based on LinkedIn member posts
  • View profile for Viktor Kyosev

    CPO at Docquity | Building for 500K doctors across 9 markets

    16,054 followers

    This paper has gained popularity in tech circles lately, and for good reason. It goes against the narrative that LLMs keep getting smarter with scale. Published by Apple researchers, it hits on something many have suspected: most benchmarks might be gamed, and we need a better way to measure actual intelligence.

    The study investigates Large Reasoning Models (LRMs), the models that generate long Chain-of-Thought (CoT) outputs and appear to “think.” They perform well on popular benchmarks, but crack under real complexity.

    The paper asked:
    - Do “thinking” models actually reason, or are they just verbose?
    - How does their reasoning evolve with problem difficulty?
    - Do they use their token budget wisely?
    - Can they generalize logic, or are they still just pattern matchers?

    How they tested it:
    - Instead of math problems (which often suffer from data contamination), they used puzzle environments like the Tower of Hanoi and River Crossing, where complexity can be precisely controlled and evaluated.

    They compared:
    - “Thinking” models (Claude 3.7 Thinking, DeepSeek-R1) vs. standard, non-thinking LLMs.

    And crucially, they didn’t just look at final answers; they analyzed every reasoning step.

    Key findings:

    1. Three degrees of complexity
    - Low complexity: Non-thinking models outperform. They’re faster and more accurate.
    - Medium complexity: LRMs show their strength; more reasoning helps.
    - High complexity: Both collapse. Even with ample token budgets, no correct answers emerge.

    2. Reasoning effort doesn’t scale with difficulty
    - As problems get harder, models initially increase reasoning effort. But beyond a threshold, they start thinking less, even though they have tokens to spare. This suggests a fundamental scaling limit in current reasoning architectures.

    3. Overthinking
    - On simple problems, LRMs often find the right answer early, then keep “thinking” and talk themselves out of it.

    Even when given a correct algorithm, they still fail at execution as the complexity increases. This isn’t reasoning. It’s a failure to follow steps.

    Why this matters: The current generation of LRMs with CoT and self-reflection are not general reasoners. Their performance holds only in narrow complexity bands, and more tokens or better prompts don’t fix it. To move forward, we likely need new architectures, not just more scale.

    My key takeaways as someone building with AI:
    - Use lightweight LLMs for simple tasks. Don’t pay the “thinking tax” unless there's a clear benefit. Non-CoT models are often faster and more accurate for straightforward problems.
    - Test across a difficulty spectrum. Don’t just benchmark your product on average cases; include edge cases at both ends. Know where your model breaks and set guardrails accordingly.
    - Monitor reasoning effort. If your agent starts giving quicker answers to harder questions, that’s a red flag. Explore ways to enforce more consistent “thinking” through RL or prompt design.
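
Checking every step rather than only the final answer is straightforward for a puzzle like Tower of Hanoi, because each move is either legal or not. The following is a minimal sketch of such a checker, not the paper's evaluation code: the function name `check_solution` and the (source peg, target peg) move encoding are assumptions made for illustration.

```python
from typing import List, Tuple

Move = Tuple[int, int]  # (source peg, target peg); pegs are numbered 0, 1, 2 (assumed encoding)


def check_solution(n: int, moves: List[Move]) -> Tuple[int, bool]:
    """Replay moves for an n-disk puzzle.

    Returns (number of legal moves before the first illegal one, solved?).
    """
    pegs = [list(range(n, 0, -1)), [], []]  # peg 0 holds all disks, largest at the bottom
    for i, (src, dst) in enumerate(moves):
        legal = (
            src in (0, 1, 2)
            and dst in (0, 1, 2)
            and src != dst
            and bool(pegs[src])  # there must be a disk to move
            and (not pegs[dst] or pegs[dst][-1] > pegs[src][-1])  # no larger disk on a smaller one
        )
        if not legal:
            return i, False
        pegs[dst].append(pegs[src].pop())
    # Solved only if every disk ended up on the last peg, in order.
    return len(moves), pegs[2] == list(range(n, 0, -1))


good = check_solution(2, [(0, 1), (0, 2), (1, 2)])
bad = check_solution(3, [(0, 2), (0, 2)])  # second move puts disk 2 on top of disk 1
```

The first return value gives the length of the valid prefix, which is the kind of per-step signal that a final-answer-only benchmark would miss.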

  • View profile for Sohrab Rahimi

    Director, AI/ML Lead @ Google

    23,829 followers

    One of the biggest challenges in deploying LLMs in real workflows is reasoning. Not answering trivia, but actual structured thinking: planning, breaking problems into steps, updating based on intermediate results. LLMs sometimes produce convincing responses that collapse when you inspect the logic. This shows up in math, code generation, question answering, and increasingly, in agentic use cases. Projects fail when models cannot reason reliably over multiple steps.

    This failure happens because most LLMs are trained to predict the next token, not to reason through a process. They pick the most likely next word, based on patterns they have seen. They are not optimizing for whether each step is logically valid or whether the final result is correct. Even fine-tuned models often reproduce patterns without deeply validating the steps in between. That said, agents have solved this problem to a great extent.

    A new paper from Stanford and Ceramic AI proposes a surprisingly effective solution: Think, Prune, Train. The model generates multiple reasoning paths. Only the ones that lead to correct answers are kept. Then the model is fine-tuned on those filtered traces. This loop is repeated. Over time, the model improves its ability to generate correct, logically coherent solutions, entirely from its own outputs. There is no need for external labels, teacher models, or human ranking.

    This is a very effective method because pruning to correct final answers is a simple form of reward. Mathematically, the paper shows that this kind of filtered fine-tuning is equivalent to reinforcement learning with a binary reward signal. It avoids the complexity and instability of full RL pipelines but delivers the same benefits. The model learns to prefer better thinking, not just better phrasing.

    The results are significant. Gemma-2B improves from 41.9 to 57.6 percent accuracy on GSM8K. Gemma-9B reaches 82 percent, outperforming LLaMA-70B. Even LLaMA-70B improves from 78 to 91 percent, surpassing GPT-4o. All without new data or external feedback. Just smart filtering and training.

    If you’re working on reasoning-heavy tasks, you can try this with open models and modest compute. The steps are simple:
    1. Generate several reasoning paths per example
    2. Keep only the ones that lead to a correct final answer
    3. Fine-tune the model on those filtered examples
    4. Repeat with the new model to improve further

    It works best on tasks with verifiable outcomes like math, code, or structured QA. You need a base model that can already reason somewhat, and a way to check correctness. But you do not need GPT-4, and you do not need human labels.

    This method pushes us toward a future where models do not just produce good outputs, but learn to produce better reasoning. It is simple, scalable, and grounded in solid learning theory. And it is something teams can start applying today.
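
The generate-and-filter part of the loop described above can be sketched in a few lines. This is an illustration under stated assumptions, not the paper's code: `prune_traces`, the `Sampler` type, and `make_toy_sampler` are hypothetical names, the sampler is a stand-in for a real model call, and the fine-tuning step is left out.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical interface: question -> (reasoning trace, final answer)
Sampler = Callable[[str], Tuple[str, str]]


def prune_traces(
    problems: List[Dict[str, str]], sample: Sampler, k: int = 4
) -> List[Dict[str, str]]:
    """Sample k traces per problem and keep only those with a correct final answer."""
    kept = []
    for problem in problems:
        for _ in range(k):
            trace, answer = sample(problem["question"])
            if answer.strip() == problem["answer"].strip():
                kept.append({"question": problem["question"], "trace": trace})
    return kept


def make_toy_sampler() -> Sampler:
    """Stand-in for a model: alternates between a wrong and a right answer."""
    calls = {"n": 0}

    def sample(question: str) -> Tuple[str, str]:
        calls["n"] += 1
        if calls["n"] % 2 == 0:
            return ("2 + 2 is 4", "4")
        return ("2 + 2 is 5", "5")

    return sample


problems = [{"question": "What is 2 + 2?", "answer": "4"}]
train_set = prune_traces(problems, make_toy_sampler(), k=4)
# train_set would feed a fine-tuning step (not shown); the loop then repeats
# with the updated model as the sampler.
```

The correctness check here is a plain string comparison, which is why the approach suits tasks with verifiable outcomes.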

  • View profile for Maxime Labonne

    Head of Post-Training @ Liquid AI

    69,693 followers

    🔍 LLM reasoning is doing fine, actually. (Sorry Apple)

    Last week, I shared the "Illusion of Thinking" paper from Apple, with bold claims about the limitations of reasoning models. Since then, there has been a lot of online discussion and even a new paper, called "The Illusion of the Illusion of Thinking". I wanted to share the main points raised in critiques of this study:

    → Models actually hit token limits, not reasoning limits. When models "failed" on Tower of Hanoi puzzles, they were explicitly saying "I'll stop here to avoid making this too long," not struggling with the logic.

    → Some test puzzles were literally impossible to solve. The River Crossing experiments included mathematically unsolvable instances (N≥6 with boat capacity 3), then scored models as failures for not solving these impossible problems.

    → Evaluation systems can't tell the difference between "can't solve" and "won't enumerate". Automated scoring missed that models understood the solutions but chose not to write out thousands of moves due to practical constraints.

    → When freed from exhaustive output requirements, models nail complex problems. Asked to write generating functions instead of full move sequences, models solved 15-disk Tower of Hanoi problems with high accuracy in under 5,000 tokens.

    → Solution length is a terrible proxy for reasoning difficulty. Tower of Hanoi needs exponentially many moves but has trivial decision-making per step, while shorter River Crossing problems require complex constraint satisfaction.

    The original study's conclusions were pretty damning for AI reasoning capabilities, but this response shows how easy it is to mistake experimental artifacts for fundamental limitations. This is a very common problem in LLM evaluation, a field that keeps being underrated in my opinion.
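
To see why a generating function is so much cheaper to write than the moves it produces, consider the sketch below. It is an illustrative example, not taken from either paper, and the name `hanoi_moves` and the peg labels are assumptions.

```python
from typing import Iterator, Tuple


def hanoi_moves(
    n: int, src: str = "A", dst: str = "C", aux: str = "B"
) -> Iterator[Tuple[str, str]]:
    """Lazily yield the optimal Tower of Hanoi moves for n disks."""
    if n == 0:
        return
    yield from hanoi_moves(n - 1, src, aux, dst)  # park n-1 disks on the spare peg
    yield (src, dst)                              # move the largest disk
    yield from hanoi_moves(n - 1, aux, dst, src)  # bring the n-1 disks back on top


move_count = sum(1 for _ in hanoi_moves(15))
print(move_count)  # 32767, i.e. 2**15 - 1
```

The function body is the same size for any number of disks, while the enumerated answer doubles with each extra disk.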

  • View profile for Babak Hodjat

    Chief AI Officer at Cognizant

    19,991 followers

    Apple’s machine learning team just released a paper that takes aim at one of the core assumptions behind Chain-of-Thought (CoT) prompting, a technique used to help large language models (LLMs) “think out loud” to solve complex problems.

    What they found? Many CoT-based models collapse when applied to complex reasoning tasks like the advanced levels in Tower of Hanoi (e.g., with more than 8 disks to place), despite performing well on traditional benchmarks. Why? Because these tasks go well beyond the narrow prompting examples used during fine-tuning and require longer sequences of precise reasoning than a CoT model can handle.

    An interesting observation from the paper is that, for the simple cases, the raw LLMs actually perform slightly better than LRMs, though LRMs significantly outperform raw LLMs in medium-level cases. This indicates that if we can decompose a long or difficult reasoning task into several medium-level tasks, we can still make the best use of existing LRMs, and if we can decompose them further into many simple-level tasks, a standard LLM would be even better than LRMs. Considering that the responses of LRMs are usually much longer than those of standard LLMs (LRMs need to generate their reasoning process explicitly), we are not only solving the problem better, but also at a lower cost.

    What does this mean for users? If you’ve been relying on a single model to handle multi-step reasoning (like planning, logic puzzles, or simulations), this paper suggests you might want to rethink your approach.

    Here’s my take:
    - While I’ve always been skeptical of CoT-style large reasoning models (LRMs), I don’t think we should write them off completely. They’re specialists, and they can outperform on tough tasks like coding or niche benchmarks. But they are constrained by their inherent imprecision that emerges as tasks scale.
    - For broader, more general-purpose use cases, LLMs paired with multi-agent systems are a more robust path forward. Instead of pushing a single model to its limits, we can distribute reasoning across agents, each focused and each efficient, working together to scale intelligence more reliably.

    Worth a read: Apple’s study via The Guardian: https://lnkd.in/gEq2hYhK

    Cognizant, Xin Qiu, Elliot Meyerson
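
The decomposition idea above amounts to routing each subtask to the cheapest model type that handles its complexity band. The sketch below is a hypothetical illustration only: the `Subtask` type, the `route` function, the step estimates, and the thresholds are all assumptions, and real thresholds would have to be found by testing a given model across difficulty levels.

```python
from dataclasses import dataclass


@dataclass
class Subtask:
    name: str
    est_steps: int  # rough estimate of sequential reasoning steps (assumed metric)


def route(task: Subtask, simple_max: int = 5, medium_max: int = 50) -> str:
    """Pick a handler by complexity band; both thresholds are placeholders."""
    if task.est_steps <= simple_max:
        return "standard-llm"       # simple band: skip the extra reasoning tokens
    if task.est_steps <= medium_max:
        return "reasoning-model"    # medium band: where extra reasoning helps
    return "decompose-further"      # high band: split before sending to any model


plan = [
    Subtask("parse input", 2),
    Subtask("schedule 20 jobs", 30),
    Subtask("full simulation", 400),
]
assignments = {task.name: route(task) for task in plan}
```

Anything landing in the last band is split again rather than sent to a single model, which mirrors the multi-agent argument in the post.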

  • View profile for Jayeeta Putatunda

    Director - AI CoE @ Fitch Ratings | NVIDIA NEPA Advisor | HearstLab VC Scout | Global Keynote Speaker & Mentor | AI100 Awardee | Women in AI NY State Ambassador | ASFAI

    10,230 followers

    I have been in the NLP space for almost 10 years now, and I know first-hand the challenges of building text-based models in the pre-GPT era! So, I am a pro-Large Language Model (LLM) enthusiast, but I don’t believe they will replace humans or solve all our problems, especially when it comes to highly complex reasoning in industries like Finance.

    I spent this weekend reading two compelling papers, and I’m convinced we’re bumping into real reasoning ceilings:

    I> "The Illusion of Thinking: Understanding the Strengths and Limitations of Reasoning Models via the Lens of Problem Complexity" (Apple)

    Apple researchers rigorously tested Large Reasoning Models (LRMs), LLMs that explicitly generate chain-of-thought reasoning, using controlled puzzles like Tower of Hanoi and River Crossing.

    Key insights:
    1. Three reasoning regimes:
    ▪️Low complexity: standard LLMs outperform LRMs
    ▪️Medium complexity: LRMs excel
    ▪️High complexity: both collapse, accuracy plummets
    2. Fascinating observation: LRMs “give up” as puzzle complexity increases; their reasoning effort declines rapidly, even with enough tokens
    3. Even when provided an exact algorithm (e.g., Tower of Hanoi strategy), the models still failed to generalize and mostly produced output based on data patterns observed in training

    II> "Reasoning or Overthinking: Evaluating Large Language Models on Financial Sentiment Analysis" (Dimitris Vamvourellis & Dhagash Mehta, Ph.D., BlackRock)

    This study tested major LLMs (GPT-4o, GPT-4.1, o3-mini, FinBERT variants) on financial sentiment classification using:
    - "System 1" (fast/intuitive) prompting
    - "System 2" (slow/deliberate) prompting

    Key takeaways:
    ▪️Reasoning prompts did not improve performance
    ▪️Surprisingly, straightforward, intuitive prompts with GPT-4o (no chain-of-thought) outperformed all others
    ▪️More reasoning led to overthinking, reducing alignment with human-labeled sentiments

    💡 Why it matters for builders and researchers in Finance and every industry:
    ❎ Bigger models + more “thinking” = better outcomes. Sometimes it’s actively worse
    ❎ We’re not seeing a soft plateau; these are hard ceilings in reasoning capacity
    ❎ For real-world systems, agents, and financial tools: design for reasoning economy, not just reasoning depth.

    #LLMs #ReasoningLimits #LLMChainofthought #LLMReasoningDecline

  • View profile for William Marcellino, Ph.D.

    Senior Behavioral Scientist at RAND Corporation

    2,935 followers

    Apple's recent paper on "The Illusion of Thinking" (https://lnkd.in/e3juNhyU) shows how large reasoning models exhibit three regimes of performance on a range of logical puzzles: inefficient but successful on low complexity, efficient and successful on medium complexity, and a collapse to zero efficacy in high complexity.

    The paper has garnered a lot of attention, including methodological critiques that center on the Tower of Hanoi problem: because the number of moves grows exponentially, the model failure at N=15 (32k+ moves) happens to be well outside the context length of the models. However, this critique doesn't make sense for the other puzzles, which are quadratic and linear, and require only a few hundred tokens to solve, well within context limits.

    🔍 Cross-domain inconsistency is real: models that execute 100+ correct moves in Tower of Hanoi fail after just 5 moves in the River Crossing puzzles, despite River Crossing requiring far fewer tokens and similar logical reasoning.

    🧠 This isn't about context windows or compute; it's about how LLMs actually "think."

    This supports the idea that reasoning models don't execute algorithms; they apply learned heuristics from training data. When researchers provided the exact Tower of Hanoi algorithm, performance didn't change. LRMs can't follow explicit step-by-step procedures, because they are pattern-matching statistical approximations of what good reasoning looks like. This is expected behavior for neural networks, which optimize for shallow, effective but unfaithful solutions. LRMs generate convincing reasoning traces, but underneath it's statistical pattern completion, not systematic logical execution (https://lnkd.in/enZa8_ji).

    #LLMs and #LRMs are sophisticated heuristic systems that excel at many tasks, but they do not reason: they do not follow algorithmic solutions. This has implications for #AGI, and for immediate #diffusion and #adoption: we need to carefully use these systems for applications they are well suited for. #AI #AIResearch
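
The context-length critique mentioned above can be made concrete with a little arithmetic. The move count follows from the standard Tower of Hanoi recurrence; the per-move token cost $t = 10$ and the output budget $B = 64{,}000$ tokens are assumed values chosen only for illustration, not figures from the paper.

```latex
\begin{align*}
T(n) &= 2\,T(n-1) + 1,\quad T(0) = 0 \;\Longrightarrow\; T(n) = 2^{n} - 1,\\
T(15) &= 2^{15} - 1 = 32{,}767 \ \text{moves},\\
t \cdot T(n) \le B &\iff n \le \log_2\!\left(\frac{B}{t} + 1\right),\\
n_{\max} &= \left\lfloor \log_2\!\left(\frac{64{,}000}{10} + 1\right) \right\rfloor
          = \lfloor \log_2 6401 \rfloor = 12 .
\end{align*}
```

Under these assumed numbers a full move list stops fitting at 13 disks. A puzzle whose solution length grows linearly or quadratically stays within the same budget for far larger instances, which is why the critique does not carry over to the other puzzles.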

  • View profile for Ivan Novikov

    Founder @ Wallarm | Leading API Security Solution for Enterprises

    39,541 followers

    🚨 New Paper Drop from Apple AI Team: “The Illusion of Thinking” 🧠💭

    Imagine a model that thinks hard… and still flunks the test. Apple’s latest research dismantles the hype around Large Reasoning Models (LRMs), the ones loaded with “Chain of Thought” magic and self-reflection sauce. Turns out, that thinking might just be… theater.

    🔍 Key Takeaways:
    1. “Thinking” ≠ Reasoning: LRMs write long reasoning traces before answering. But when problem complexity spikes, their performance doesn’t just drop: it collapses.
    2. 3 Performance Zones:
      • 🟢 Low Complexity: Non-thinking models outperform LRMs (faster, more accurate).
      • 🟡 Medium Complexity: LRMs get their moment: extra thinking helps.
      • 🔴 High Complexity: Both fail. LRMs think less, even though they could think more. Weird, right?
    3. Scaling Fail: LRMs don’t scale with problem difficulty. Their token usage for thinking shrinks as tasks get harder. Like giving up mid-test despite having time left.
    4. No “General” Reasoning: Even when you hand-feed them an algorithm, LRMs stumble. Execution fails. This isn’t just about logic: they don’t “understand” steps.
    5. Overthinking is real: On simple tasks, LRMs find the correct path early… and then spiral into nonsense. Compute wasted. Accuracy? Derailed.

    🧩 Apple tested this using controllable puzzles (Tower of Hanoi, River Crossing, etc.), way better than leaky math benchmarks like MATH500. Why? No data contamination. Full control over complexity. True reasoning stress-tests.

    📊 In experiments, Claude 3.7 Sonnet Thinking, DeepSeek-R1, and o3-mini all showed the same pattern: More complexity → less effective thinking → total collapse.

    🤯 One shocker: On some puzzles (like River Crossing), models can’t even make it past 4 valid moves. On others (Hanoi), they can go 100+ steps. Not because of reasoning, but likely because some problems appear more in their training data.

    💡 Bottom line: Today’s reasoning models look smart, but under pressure, they break. They don’t execute logic. They don’t scale thought. They simulate coherence, but the illusion of thinking fades as things get hard. The era of CoT hype needs a hard reset. We need models that don’t just “act” intelligent; they need to be intelligent.

    🔗 Full paper: Apple ML Research - The Illusion of Thinking

    #AI #LLM #Reasoning #AppleResearch #MachineLearning #ChainOfThought #AIoverhype #Claude #DeepSeek #OpenAI #ThinkingModels #FailFast #LLMRealTalk
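
The "token usage shrinks as tasks get harder" pattern is something a team can watch for in its own logs. The sketch below assumes you already record (difficulty level, reasoning tokens used) pairs; the function name `effort_drops`, the 20% tolerance, and the sample numbers are made up for illustration and are not measurements from the paper.

```python
from typing import List, Tuple


def effort_drops(records: List[Tuple[int, float]], tolerance: float = 0.2) -> List[int]:
    """Return difficulty levels where reasoning tokens fell noticeably.

    A level is flagged when its token count is more than `tolerance` below
    the count at the previous (easier) level.
    """
    ordered = sorted(records)  # sort by difficulty level
    flagged = []
    for (_, prev_tokens), (level, tokens) in zip(ordered, ordered[1:]):
        if tokens < prev_tokens * (1 - tolerance):
            flagged.append(level)
    return flagged


# Illustrative numbers only: effort rises through level 3, then falls at level 4.
observed = [(1, 300), (2, 900), (3, 2600), (4, 1100)]
flagged = effort_drops(observed)
```

A flagged level is a cue to inspect those traces by hand, not proof of a reasoning failure on its own.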

  • View profile for Mark Worrall

    Helping humans learn, adapt, and create in the age of AI | Founder @ nodeledge.ai

    4,587 followers

    Apple just dropped a new paper on LRMs and it’s not pretty. 🙈 Paper title: The Illusion of Thinking: Understanding the Strengths and Limitations of Reasoning Models via the Lens of Problem Complexity. Some highlights: 🧩 Using controlled puzzle environments (e.g. Tower of Hanoi), the authors show that frontier LRMs like Claude 3.7 and DeepSeek-R1 𝗰𝗼𝗹𝗹𝗮𝗽𝘀𝗲 𝘁𝗼 𝟬% 𝗮𝗰𝗰𝘂𝗿𝗮𝗰𝘆 beyond a certain complexity threshold. ⚠️ Surprisingly, giving these models the 𝗲𝘅𝗮𝗰𝘁 𝘀𝗼𝗹𝘂𝘁𝗶𝗼𝗻 𝗮𝗹𝗴𝗼𝗿𝗶𝘁𝗵𝗺 doesn't help. They still fail to execute logical steps correctly - exposing severe limits in symbolic reasoning and step-by-step verification. 📉 Even worse, as problems get harder, 𝘁𝗵𝗲𝘀𝗲 𝗺𝗼𝗱𝗲𝗹𝘀 𝘀𝘁𝗮𝗿𝘁 𝘁𝗵𝗶𝗻𝗸𝗶𝗻𝗴 𝙡𝙚𝙨𝙨 - using fewer tokens despite having available budget. This suggests a compute scaling limit 𝘯𝘰𝘵 caused by external constraints, but internal brittleness. 🤖 They exhibit “overthinking”: generating correct answers early, then undermining them by exploring incorrect paths. And as complexity increases, the models stop finding correct answers altogether. 𝗜𝗺𝗽𝗹𝗶𝗰𝗮𝘁𝗶𝗼𝗻𝘀? To me all this is consistent with the fact that current LRMs don’t “reason” in any robust or generalisable sense. They simulate reasoning through pattern-matching - and once the patterns break down so does their ability to reason. Mistaking examples of reasoning for actual reasoning is a logical fallacy. Don't fall for it.
