Recent LLM Breakthroughs in Complex Reasoning


Summary

Recent breakthroughs in large language model (LLM) reasoning are redefining what AI can accomplish, showing models that not only predict text but also solve complex problems, generate creative ideas, and adapt their approaches. Complex reasoning in LLMs means the ability to logically plan, make decisions, and understand multifaceted scenarios—similar to how humans work through challenging tasks.

  • Embrace model collaboration: New training methods let multiple LLMs work together, like teams in a company, to tackle tough mathematical and logical problems with greater accuracy.
  • Explore parallel reasoning: Modern LLM architectures can now examine multiple possible solutions at once, improving speed and reliability when handling challenging questions or planning steps.
  • Prioritize adaptive learning: By allowing models to self-correct and revise their reasoning paths during tasks, researchers are seeing improved performance and more grounded, reliable answers.
Summarized by AI based on LinkedIn member posts
  • Ross Dawson

    Futurist | Board advisor | Global keynote speaker | Founder: AHT Group - Informivity - Bondi Innovation | Humans + AI Leader | Bestselling author | Podcaster | LinkedIn Top Voice

    36,157 followers

    Chain-of-Thought has been a fundamental technique driving LLM performance. Now 'Chain of Continuous Thought' (Coconut) significantly improves reasoning performance by working in latent space rather than language space. This paper from Meta's AI research group lays out the logic and results:

    💡 Continuous Reasoning Unlocks Efficiency: Large Language Models (LLMs) traditionally reason in "language space," where reasoning steps are expressed as explicit tokens, leading to inefficiencies. The Coconut (Chain of Continuous Thought) paradigm instead reasons in a continuous latent space by feeding the model’s hidden state back as input. This reduces reliance on explicit tokens and improves reasoning efficiency, especially for complex tasks requiring backtracking.

    📊 Higher Accuracy in Complex Reasoning Tasks: Coconut achieves significant accuracy improvements on complex tasks requiring planning and logic. In ProsQA, a reasoning-intensive task, Coconut attains 97.0% accuracy, far exceeding Chain-of-Thought (CoT) at 77.5%. Similarly, in logical reasoning tasks like ProntoQA, it achieves near-perfect performance at 99.8% accuracy, outperforming or matching other baselines while demonstrating superior planning capabilities.

    ⚡ Greater Efficiency with Fewer Tokens: Coconut sharply reduces the number of generated tokens, although on math this comes at some cost in accuracy. For example, in GSM8k (math reasoning), Coconut achieves 34.1% accuracy using just 8.2 tokens on average, compared to CoT's 42.9% accuracy, which requires 25 tokens. That is roughly a third of the tokens for about 80% of CoT's accuracy, which indicates that reasoning in latent space lets the model take far fewer explicit steps for a modest loss on this benchmark.

    🌟 Parallel Reasoning Explores Multiple Alternative Steps: Coconut enables LLMs to simultaneously explore multiple reasoning paths by encoding alternative next steps in the continuous latent space. This parallel reasoning behavior mimics breadth-first search (BFS), allowing the model to avoid premature decisions and progressively narrow down the correct solution.

    🔄 Multi-Stage Training Accelerates Learning: Coconut leverages a curriculum-based training strategy, where the reasoning chain is gradually replaced with latent thoughts. This phased approach facilitates model learning, improving performance on math problems (GSM8k) and logical tasks, outperforming baselines like No-CoT and iCoT.

    🔍 Latent Reasoning Improves Planning and Focus: By reasoning in latent space, the model avoids premature decisions and progressively narrows down possibilities. Coconut shows reduced hallucinations and improved accuracy compared to CoT on planning tasks, demonstrating its ability to prioritize promising reasoning paths while pruning irrelevant ones.

    New model architectures are consistently improving LLM performance and efficiency. Even without more training data or progress in the underlying models, we are seeing consistent advances. Link to paper in comments.
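
The curriculum idea described above (reasoning steps gradually replaced with latent thoughts) can be sketched as a small data-preparation helper. This is only an illustration under stated assumptions: the marker strings `<bot>`, `<eot>` and `<latent>`, the function name `build_stage_example`, and the `thoughts_per_step` parameter are hypothetical names chosen for the sketch, not taken from the post.

```python
from typing import List

# Placeholder markers (assumed names for this sketch).
BOT, EOT, LATENT = "<bot>", "<eot>", "<latent>"


def build_stage_example(question: str, steps: List[str], answer: str,
                        stage: int, thoughts_per_step: int = 2) -> List[str]:
    """Build the training sequence for one curriculum stage.

    The first `stage` written reasoning steps are replaced by latent-thought
    slots; any remaining steps stay as ordinary text.
    """
    replaced = min(stage, len(steps))
    if replaced == 0:
        # Stage 0 is plain chain-of-thought: nothing is replaced yet.
        return [question, *steps, answer]
    latent_slots = [LATENT] * (replaced * thoughts_per_step)
    return [question, BOT, *latent_slots, EOT, *steps[replaced:], answer]


if __name__ == "__main__":
    steps = ["step 1: 3 * 4 = 12", "step 2: 12 + 5 = 17"]
    for stage in range(3):
        print(stage, build_stage_example("Q: 3*4+5?", steps, "A: 17", stage))
```

Each later stage hides one more written step behind latent slots, so the model is weaned off explicit text gradually rather than all at once.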

  • Sahar Mor

    I help researchers and builders make sense of AI | ex-Stripe | aitidbits.ai | Angel Investor

    42,073 followers

    Researchers from Oxford University just achieved a 14% performance boost in mathematical reasoning by making LLMs work together like specialists in a company.

    In their new MALT (Multi-Agent LLM Training) paper, they introduced a novel approach where three specialized LLMs (a generator, a verifier, and a refinement model) collaborate to solve complex problems, similar to how a programmer, tester, and supervisor work together.

    The breakthrough lies in their training method:
    (1) Tree-based exploration - generating thousands of reasoning trajectories by having models interact
    (2) Credit attribution - identifying which model is responsible for successes or failures
    (3) Specialized training - using both correct and incorrect examples to train each model for its specific role

    Using this approach on 8B parameter models, MALT achieved relative improvements of 14% on the MATH dataset, 9% on CommonsenseQA, and 7% on GSM8K.

    This represents a significant step toward more efficient and capable AI systems, showing that well-coordinated smaller models can match the performance of much larger ones.

    Paper https://lnkd.in/g6ag9rP4

    Join thousands of world-class researchers and engineers from Google, Stanford, OpenAI, and Meta staying ahead on AI http://aitidbits.ai
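
To make the three training steps more concrete, here is a minimal sketch of tree-based exploration with a simple form of credit attribution. Everything in it is an assumption for illustration: the three stub functions stand in for LLM calls, and scoring a node by the average correctness of the final answers beneath it (then thresholding at 0.5) is one plausible reading of "credit attribution", not a confirmed detail of the paper.

```python
from statistics import mean
from typing import Dict, Tuple


# Hypothetical stand-ins for the three role models (real ones would be LLM calls).
def generate(question: str, sample: int) -> str:
    return "5" if sample == 0 else "6"            # sample 1 makes a mistake


def verify(question: str, draft: str, sample: int) -> str:
    if sample == 0:
        return "ok" if draft == "5" else "wrong"  # careful verifier
    return "ok"                                   # careless verifier


def refine(question: str, draft: str, critique: str, sample: int) -> str:
    return "5" if critique == "wrong" else draft  # only fixes flagged drafts


def explore(question: str, truth: str, n: int = 2):
    """Branch n ways at each role and score every node by its descendants."""
    gen_value: Dict[int, float] = {}
    ver_value: Dict[Tuple[int, int], float] = {}
    for i in range(n):
        draft = generate(question, i)
        for j in range(n):
            critique = verify(question, draft, j)
            finals = [refine(question, draft, critique, k) for k in range(n)]
            # A verifier output is worth the share of correct answers below it.
            ver_value[(i, j)] = mean(float(f == truth) for f in finals)
        # A generator output is worth the average of its verifier children.
        gen_value[i] = mean(ver_value[(i, j)] for j in range(n))
    return gen_value, ver_value


def label(values, threshold: float = 0.5):
    """Turn node values into positive/negative training examples per role."""
    return {node: ("positive" if v > threshold else "negative")
            for node, v in values.items()}


if __name__ == "__main__":
    gen_value, ver_value = explore("2 + 3 = ?", truth="5")
    print(label(gen_value))
    print(label(ver_value))
```

In this toy tree, the verifier that flags the wrong draft ends up with a high value while the careless one does not, which is the kind of role-specific signal the post describes using for specialized training.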

  • Andreas Sjostrom

    LinkedIn Top Voice | AI Agents | Robotics | Vice President at Capgemini’s Applied Innovation Exchange | Author | Speaker | San Francisco | Palo Alto

    14,815 followers

    AI models are reasoning, creating, and evolving. The evidence is no longer theoretical; it's peer-reviewed, measurable, and, in some domains, superhuman. In the last 18 months, we’ve seen LLMs move far beyond next-token prediction. They’re beginning to demonstrate real reasoning, hypothesis generation, long-horizon planning, and even scientific creativity. Here are six breakthroughs that redefine what these models can do:

    Superhuman Clinical Reasoning (Nature Medicine, 2025): In a rigorous test across 12 specialties, GPT-4 scored 89% on the NEJM Knowledge+ medical reasoning exam, outperforming the average physician score of 74%. This wasn’t just Q&A; it involved multi-hop reasoning, risk evaluation, and treatment planning. That’s structured decision-making in high-stakes domains.

    Creative Research Ideation (Zhou et al., 2024 – arXiv:2412.10849): Across 10 fields from physics to economics, GPT-4 and Claude generated research questions rated more creative than human-generated ones in 53% of cases. This wasn’t trivia; domain experts blindly compared ideas from AI and researchers. In over half the cases, the AI won.

    Falsifiable Hypotheses from Raw Data (Nemati et al., 2024): GPT-4o was fed raw experimental tables from biology and materials science and asked to propose novel hypotheses. 46% of the hypotheses were judged publishable by experts, outperforming PhD students (29%) on the same task. That’s not pattern matching, that’s creative scientific reasoning from scratch.

    Self-Evolving Agents (2024): LLM agents that reflect, revise memory, and re-prompt themselves improved their performance on coding benchmarks from 21% → 34% in just four self-corrective cycles, without retraining. This is meta-cognition in action: learning from failure, iterating, and adapting over time.

    Long-Term Agent Memory (A-MEM, 2025): Agents equipped with dynamic long-term memory (inspired by Zettelkasten) achieved 2× higher success on complex web tasks, planning across multiple steps with context continuity.

    Emergent Social Reasoning (AgentSociety, 2025): In a simulation of 1,000 LLM-driven agents, researchers observed emergent social behaviors: rumor spreading, collaborative planning, and even economic trade. No hardcoding. Just distributed reasoning, goal propagation, and learning-by-interaction.

    These findings span healthcare, science, software engineering, and multi-agent simulations. They reveal systems that generate, reason, and coordinate, not just predict. So when some argue that “AI is only simulating thought,” we should ask: Are the tests capturing how real reasoning happens? The Tower of Hanoi isn’t where science, medicine, or innovation happens. The real test is:
    1. Can a model make a novel discovery?
    2. Can it self-correct across steps?
    3. Can it outperform domain experts in structured judgment?

    And increasingly, the answer is: yes. Let’s not confuse symbolic puzzles with intelligence. Reasoning is already here, and it’s evolving.

  • Kuldeep Singh Sidhu

    Senior Data Scientist @ Walmart | BITS Pilani

    16,489 followers

    Breaking: RAG-R1 Framework Revolutionizes How LLMs Handle External Knowledge

    Researchers from AWorld Team and Inclusion AI have just released RAG-R1, a groundbreaking training framework that fundamentally changes how Large Language Models interact with external knowledge sources during reasoning.

    The Core Innovation
    Traditional RAG systems suffer from a critical bottleneck: they generate only a single search query each time external retrieval is needed, leading to substantial inference time and limited knowledge acquisition. RAG-R1 solves this with multi-query parallelism - enabling models to generate up to three parallel search queries simultaneously.

    Under the Hood Architecture
    The framework operates through a sophisticated two-stage training process:

    Stage 1: Format Learning SFT - The system generates samples integrating reasoning and search, segmented into four distinct categories. Models learn to respond in a "think-then-search" format using special tokens like <think>, <search>, and <answer> to structure their reasoning process.

    Stage 2: Retrieval-Augmented RL - Employs Proximal Policy Optimization with outcome-based rewards to enhance reasoning capabilities. The system implements retrieval masked loss to prevent retrieved tokens from interfering with the model's inherent reasoning abilities.

    Technical Breakthrough
    The multi-query parallelism returns results in JSON format, clearly aligning search queries with retrieved documents. This approach reduces retrieval rounds by 11.1% while maintaining comparable time per retrieval operation.

    Performance Impact
    Testing on seven question-answering benchmarks using Qwen2.5-7B-Instruct as the backbone model showed remarkable results:
    - Up to 13.2% improvement over strongest baselines
    - Significant performance gains across both general QA and multi-hop reasoning tasks
    - Excellent generalization across out-of-domain datasets

    The framework addresses the fundamental challenge of LLMs generating hallucinated or outdated responses by enabling adaptive leverage of both internal and external knowledge during the reasoning process. This represents a significant step forward in making AI systems more reliable and grounded in real-world knowledge.
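
The multi-query step described above can be sketched roughly as follows. This is a toy under explicit assumptions: the in-memory `CORPUS`, the `retrieve` stub, the use of `;` to separate queries inside a `<search>` tag, and the JSON field names `query` and `documents` are all hypothetical choices for the sketch; the post only states that up to three queries run in parallel and that results come back as JSON aligned with the queries.

```python
import json
import re
from concurrent.futures import ThreadPoolExecutor
from typing import List

MAX_QUERIES = 3  # the post mentions up to three parallel queries

# Hypothetical in-memory corpus standing in for a real retriever.
CORPUS = {
    "france": "Paris is the capital of France.",
    "paris": "Placeholder document about the population of Paris.",
}


def retrieve(query: str) -> List[str]:
    """Toy retriever: return documents whose key appears in the query."""
    return [doc for key, doc in CORPUS.items() if key in query.lower()]


def parse_queries(model_output: str) -> List[str]:
    """Extract queries from a <search>...</search> span (';'-separated here)."""
    match = re.search(r"<search>(.*?)</search>", model_output, re.DOTALL)
    if match is None:
        return []
    queries = [q.strip() for q in match.group(1).split(";") if q.strip()]
    return queries[:MAX_QUERIES]


def parallel_search(model_output: str) -> str:
    """Run all parsed queries concurrently and return query-aligned JSON."""
    queries = parse_queries(model_output)
    if not queries:
        return json.dumps([])
    with ThreadPoolExecutor(max_workers=len(queries)) as pool:
        results = list(pool.map(retrieve, queries))  # map keeps input order
    return json.dumps(
        [{"query": q, "documents": docs} for q, docs in zip(queries, results)]
    )


if __name__ == "__main__":
    output = ("<think>I need two facts.</think>"
              "<search>capital of france; population of paris</search>")
    print(parallel_search(output))
```

Because the queries are issued together, one retrieval round can cover what would otherwise take several sequential rounds, which is where the reported reduction in retrieval rounds would come from.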

  • Abhiram Ravikumar

    Data Science & AI @ Publicis Sapient | Author | LinkedIn Instructor | NLP/LLM/MLOps | Ex-SAP Labs

    3,912 followers

    s1: A Powerful Approach to Test-Time Scaling ⏩️💡

    Every once in a while, a research paper comes along that makes you stop and appreciate the elegance of a simple yet impactful idea. The newly released s1: Simple Test-Time Scaling by researchers at Stanford University, Contextual AI does precisely that.

    Right off the bat, what impressed me most was the clarity in attribution and collaboration. The very first page explicitly details who did what, ensuring transparency, a level of openness that's rare but inspiring in AI research.

    Why This Paper Matters
    The paper tackles test-time scaling, a powerful concept that allows LLMs to improve their reasoning without additional training, simply by allocating more compute during inference. OpenAI’s o1 model hinted at this capability but didn’t disclose the methodology, prompting various replication efforts. The authors of s1 took on this challenge and asked: "What is the simplest way to achieve strong test-time scaling?" Their answer?
    1. Curating a dataset of just 1,000 high-quality reasoning traces (s1K), a stark contrast to the massive datasets typically used.
    2. Introducing "Budget Forcing", a novel yet simple method that controls test-time compute by explicitly terminating or extending the model's reasoning process.

    Breakthrough Results
    s1-32B, their fine-tuned model, outperforms OpenAI’s o1-preview on math benchmarks by up to 27%.🚀
    More impressively, it achieves this with just 1,000 reasoning examples, making it one of the most sample-efficient reasoning models to date.💯
    Budget Forcing allows the model to self-correct by prompting it to “Wait” and reevaluate its reasoning, leading to improved accuracy.💡

    The Magic of Self-Correction
    One of the most fascinating aspects is the model’s ability to recognize and fix its own mistakes. The authors include a screenshot where the model initially miscounts the number of 'r's in "raspberry," but on being prompted to wait, it double-checks and corrects itself: a glimpse of emergent self-reflection.

    A Step Towards More Efficient AI
    This research challenges the prevailing notion that bigger is always better. Instead, it demonstrates that smarter fine-tuning and compute-efficient strategies can rival even the most powerful closed-source models.

    The full paper is available here: 📄 https://lnkd.in/g9gyhZRx

    #s1 #SML #AbhiWrites #reasoning
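
Budget forcing, as described above, comes down to two interventions on the decoding loop: suppress an early stop by appending "Wait", and force a stop once a cap is reached. The sketch below illustrates that control flow only. The `</think>` delimiter, the function names, and the scripted `toy_model` are assumptions made for the example; a real setup would call an actual language model for each token.

```python
from typing import Callable, List

END_THINK = "</think>"  # assumed end-of-thinking delimiter for this sketch
WAIT = "Wait"


def budget_forced_thinking(next_token: Callable[[List[str]], str],
                           min_tokens: int, max_tokens: int) -> List[str]:
    """Generate a reasoning trace whose length is kept within a token budget."""
    trace: List[str] = []
    while len(trace) < max_tokens:
        token = next_token(trace)
        if token == END_THINK:
            if len(trace) >= min_tokens:
                break              # the model stopped and the minimum is met
            trace.append(WAIT)     # too early: suppress the stop, nudge it on
            continue
        trace.append(token)
    trace.append(END_THINK)        # the model stopped, or the cap forced it
    return trace


def toy_model(trace: List[str]) -> str:
    """Scripted stand-in for an LLM: miscounts first, recounts after 'Wait'."""
    if WAIT not in trace:
        return ["count letters", "r=2", END_THINK][min(len(trace), 2)]
    since_wait = len(trace) - trace.index(WAIT) - 1
    return ["recount", "r=3", END_THINK][min(since_wait, 2)]


if __name__ == "__main__":
    print(budget_forced_thinking(toy_model, min_tokens=0, max_tokens=10))
    print(budget_forced_thinking(toy_model, min_tokens=4, max_tokens=10))
```

With no minimum budget the toy model stops at its first (wrong) count; raising the minimum inserts a "Wait" and gives it room to recount, mirroring the "raspberry" example from the post.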

  • Dimitris Papadopoulos

    CAIO @ EXUS | PhD in NLP | Builder of AI systems that hold up in reality

    9,235 followers

    Researchers from Meta recently introduced COCONUT (Chain of Continuous Thought): a new reasoning approach that uses the last hidden state of the LLM as a representation of the reasoning state (termed “continuous thought”). COCONUT shifts LLMs from language-bound reasoning to a continuous latent space, unlocking advanced problem-solving efficiency and accuracy.

    🤔 The problem: Traditional reasoning methods rely heavily on language-based reasoning chains, the familiar CoT and its variants. While effective for some tasks, these methods face inherent limitations:
    ➖ They prioritize fluency over reasoning, wasting computational effort by spelling out the intermediate logical steps in free text.
    ➖ When faced with complex tasks that demand planning or backtracking, CoT often struggles to map out all possibilities effectively.
    ➖ In general, these approaches mimic human communication patterns rather than the deeper cognitive processes involved in reasoning.

    💡 The solution: COCONUT introduces latent reasoning, a method where reasoning steps are represented as continuous states instead of explicit language tokens. This shift enables models to operate more effectively by:
    🥥 Exploring multiple paths: By encoding possibilities simultaneously, the model can evaluate alternatives, akin to a Breadth-First Search approach.
    🥥 Backtracking: Latent reasoning supports revisiting earlier steps, crucial for tasks requiring complex planning.
    🥥 Token efficiency: Continuous reasoning uses fewer tokens, reducing computational overhead.

    🛠️ How COCONUT works:
    🔹 Encoding reasoning: Continuous thoughts, derived from the model’s hidden states, represent the reasoning process. These states are looped back as input rather than being decoded into language.
    🔹 Switching modes: The model alternates between two operational modes:
    1. Language mode: Used for handling input questions and producing final answers.
    2. Latent mode: Processes reasoning steps in the latent space, bypassing language generation.

    📈 Performance: COCONUT outperforms CoT in various benchmarks, especially those involving planning-intensive tasks. More importantly, it does so with fewer tokens and while handling complex logical structures with ease.

    I find the main motivation really interesting: the authors claim that, according to neuroimaging studies, reasoning in humans does not rely on the brain's language network, which primarily handles comprehension and communication. Instead, reasoning processes often involve distinct neural pathways optimized for logic and planning, independent of language structures. This insight is what motivated the development of 🥥COCONUT, aiming to separate the computational reasoning of LLMs from the constraints of language-based reasoning.

    Paper in comments.
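
The two modes described above can be illustrated with a deliberately tiny toy. None of this is the paper's implementation: the two-dimensional `EMBED` table, the weight matrix `W`, and the functions `forward`, `decode` and `answer` are hypothetical stand-ins chosen so the loop structure is visible. The only point is the data flow: in language mode the input is a token embedding, while in latent mode the previous hidden state is fed straight back in without being decoded.

```python
import math
from typing import List

Vector = List[float]

# Toy token embeddings and weights (assumed values, stand-ins for a real model).
EMBED = {"q": [1.0, 0.0], "a": [0.0, 1.0]}
W = [[0.6, -0.4], [0.4, 0.6]]


def forward(x: Vector) -> Vector:
    """One toy 'model pass': map an input vector to a hidden state."""
    return [math.tanh(sum(w * xi for w, xi in zip(row, x))) for row in W]


def decode(hidden: Vector) -> str:
    """Language-mode output: the token whose embedding best matches `hidden`."""
    return max(EMBED, key=lambda tok: sum(e * h for e, h in zip(EMBED[tok], hidden)))


def answer(question_tokens: List[str], latent_steps: int) -> str:
    hidden = [0.0, 0.0]
    for tok in question_tokens:
        # Language mode: the input is a token embedding (plus carried state).
        hidden = forward([e + h for e, h in zip(EMBED[tok], hidden)])
    for _ in range(latent_steps):
        # Latent mode: the hidden state itself is the next input; no token
        # is decoded in between.
        hidden = forward(hidden)
    return decode(hidden)  # back to language mode for the final answer


if __name__ == "__main__":
    for steps in (0, 1, 2):
        print(steps, answer(["q"], latent_steps=steps))
```

The latent loop never touches the vocabulary, which is why the approach spends no generated tokens on intermediate reasoning.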

  • David Sauerwein

    AI/ML at AWS | PhD in Quantum Physics

    33,716 followers

    A new energy-based modeling approach enables reasoning entirely from unsupervised learning. This is an exciting push to break free from major constraints of today's reasoning models, with their narrow scope and reliance on external rewards, toward more data-efficient and generalizable models.

    Human thinking is classified into System 1 (intuitive, fast) and System 2 (slow, deliberate reasoning). Current transformers excel at System 1 but struggle with System 2. Recent advances using reinforcement learning or test-time computation are impressive but are still restricted to domains with easily verifiable rewards (math, programming). To create systems that truly think independently, we need approaches that ideally rely entirely on unsupervised learning for System 2 thinking. In particular, they should address these three facets of human thinking that current LLMs lack:
    1. Dynamic compute allocation: Adjusting computational effort to problem complexity. For example, humans contemplate career transitions much longer than lunch decisions.
    2. Modeling uncertainty: Humans weigh uncertainty before committing to decisions. Quantifying this uncertainty is central to complex reasoning.
    3. Verification of predictions: Verification is central to 1. and 2. above. Moreover, verifying solutions is typically also easier than generating them. So, learning a verifier could be more data efficient and robust. However, current LLMs don’t naturally integrate verifiers, and creating them for domains that are hard to quantify (e.g. rating a conversation) remains challenging.

    Researchers have now proposed a new paradigm to address these challenges (link in comments). They propose viewing thinking as optimization with learned verifiers that evaluate input-output compatibility. More precisely, they train energy-based transformers (EBTs) to learn energy landscapes where lower energy indicates higher compatibility. Thinking then starts from random predictions and refines them through energy minimization until convergence. Since the optimization duration depends on problem complexity, this enables dynamic compute allocation (facet 1). The energy values quantify uncertainty (facet 2) and serve as verifiers (facet 3).

    Training energy-based models is notoriously hard to scale, but the researchers show how transformer properties (scalability, robustness, parallelizability) transfer to EBTs. The results show EBTs achieve up to 35% higher scaling rates (across e.g. data, parameters) and 29% improved reasoning performance versus vanilla transformers. Their superior scaling can probably be traced back to the fact that the EBT has also learned to verify, not only predict.

    Of course, many questions remain before declaring this a new architectural breakthrough. EBTs are more complex to train, and scaling beyond 800M parameters is unclear. But this is truly exciting work. I'm keen to see how people push this approach forward.

    #ai #genai #agi
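
"Thinking as energy minimization" can be illustrated with a one-dimensional toy. This is a sketch under strong assumptions: a real EBT would use a learned transformer for the energy and automatic differentiation for its gradient, whereas here the energy is a hand-written quadratic, and `make_energy`, `think`, `curvature` and the step size are hypothetical names and values chosen for the example.

```python
from typing import Callable, Tuple


def make_energy(target: float, curvature: float):
    """Toy energy E(y) = curvature * (y - target)^2 and its gradient.

    Lower energy means the candidate y is more compatible with the input.
    A flatter landscape (small curvature) plays the role of a harder problem.
    """
    def energy(y: float) -> float:
        return curvature * (y - target) ** 2

    def grad(y: float) -> float:
        return 2.0 * curvature * (y - target)

    return energy, grad


def think(grad: Callable[[float], float], y0: float, step_size: float = 0.1,
          tol: float = 1e-4, max_steps: int = 10_000) -> Tuple[float, int]:
    """Refine a prediction by gradient descent on the energy until it settles.

    Returns the final prediction and the number of update steps taken.
    """
    y = y0
    for step in range(1, max_steps + 1):
        g = grad(y)
        if abs(g) < tol:
            return y, step - 1   # converged before making this update
        y -= step_size * g
    return y, max_steps


if __name__ == "__main__":
    for name, curvature in (("easy", 1.0), ("hard", 0.05)):
        energy, grad = make_energy(target=3.0, curvature=curvature)
        y, steps = think(grad, y0=0.0)
        print(f"{name}: y={y:.4f} steps={steps} final_energy={energy(y):.2e}")
```

The number of steps until convergence differs between the two landscapes, which is the sense in which compute is allocated dynamically, and the final energy value is available as a self-assessed score for the prediction.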

  • Elvis S.

    Founder at DAIR.AI | Angel Investor | Advisor | Prev: Meta AI, Galactica LLM, Elastic, Ph.D. | Serving 7M+ learners around the world

    86,478 followers

    Say goodbye to token-based reasoning! Say hello to reasoning in continuous latent space!

    On a serious note, this is a paper worth reading as a lot of research efforts continue to explore efficient reasoning methods. Summary below:

    This work introduces a latent recurrent-depth transformer, a model that scales test-time reasoning without relying on additional token generation. Instead of increasing the context window or fine-tuning for Chain-of-Thought (CoT), this approach enables iterative latent space reasoning at inference, achieving improvements comparable to a 50B parameter model despite having only 3.5B parameters.

    Key insights include:

    Recurrent test-time computation
    The model unrolls a recurrent block at inference, running for an arbitrary number of steps, allowing more computational depth without modifying the input sequence. Unlike standard CoT methods, which externalize reasoning via tokens, this technique keeps reasoning in latent space, making it more efficient.

    No need for CoT-specific training
    Unlike CoT prompting or fine-tuning, this method doesn’t require specialized datasets. It works with standard pretraining corpora and generalizes across reasoning tasks.

    Improved memory & compute efficiency
    Latent reasoning allows the model to scale without increasing parameter count, requiring less memory than long-context transformers. Additionally, this method supports per-token adaptive compute, speculative decoding, and KV-cache sharing, making it highly efficient.

    Scales like a 50B parameter model
    Benchmarks show that with sufficient test-time recurrence, the model matches or surpasses much larger LLMs on complex reasoning tasks (ARC, GSM8K, OpenBookQA).

    Emergent behaviors in latent space
    Analysis reveals self-organizing computation patterns, such as latent-space orbits for numerical tasks and context-dependent “deliberation” on difficult queries, suggesting the model learns non-verbal cognitive strategies.

    Why does it matter?
    This work suggests that future models may reason in continuous latent space, or use other efficient reasoning strategies, rather than relying solely on token-based reasoning, potentially unlocking new frontiers in reasoning efficiency.
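
The idea of unrolling a recurrent block at inference can be shown with a scalar toy. This is not the paper's architecture: `prelude`, `core` and `coda` are hypothetical placeholder functions, the state starts at zero for simplicity, and the contraction `0.5 * state + embedded` is chosen only so that more iterations visibly refine the result. The sketch shows two things from the summary above: depth is chosen at inference time, and an early-exit rule gives a simple form of adaptive compute.

```python
from typing import Tuple


def prelude(x: float) -> float:
    """Embed the input (identity here, as a placeholder)."""
    return x


def core(state: float, embedded: float) -> float:
    """Recurrent block: the same weights are reused at every step."""
    return 0.5 * state + embedded


def coda(state: float) -> float:
    """Read the output off the final latent state (identity placeholder)."""
    return state


def forward(x: float, num_steps: int) -> float:
    """Run the shared block a caller-chosen number of times."""
    embedded, state = prelude(x), 0.0
    for _ in range(num_steps):
        state = core(state, embedded)
    return coda(state)


def forward_adaptive(x: float, tol: float = 1e-3,
                     max_steps: int = 64) -> Tuple[float, int]:
    """Stop iterating once the latent state has stopped changing much."""
    embedded, state = prelude(x), 0.0
    for step in range(1, max_steps + 1):
        new_state = core(state, embedded)
        if abs(new_state - state) < tol:
            return coda(new_state), step
        state = new_state
    return coda(state), max_steps


if __name__ == "__main__":
    for steps in (1, 4, 16):
        print(steps, forward(1.0, steps))
    print(forward_adaptive(1.0), forward_adaptive(100.0))
```

No extra tokens and no extra parameters are involved in going deeper: the only thing that changes between a shallow and a deep run is the loop count.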

  • Pascal Biese

    AI Lead at PwC </> Daily AI highlights for 80k+ experts 📲🤗

    85,463 followers

    Is this the next level of LLM "thinking"?

    We've been trying to make AI better at reasoning by forcing it to show more work - longer chains of thought, more attempts, more compute. But what if there was another way?

    New research from CMU and Stanford introduces something fascinating: teaching LLMs to first generate their own "reasoning abstractions" - essentially, creating their own cheat sheets before solving problems. Think of it like a student who, before tackling a math problem, first writes down the key principles and potential pitfalls they should watch for.

    The approach is surprisingly simple. They train two models that work together: one generates helpful hints and strategies (the abstraction generator), and another uses these hints to solve problems (the solution generator). The abstraction generator gets rewarded when its hints actually help solve problems, creating a virtuous cycle of improvement.

    The results? 44% improvement over previous state-of-the-art methods on challenging math competitions. One interesting thing to note: when given more compute budget, it's actually more effective to generate diverse strategies than to just try solving the problem more times. This suggests our models might have been stuck in local reasoning patterns, and abstractions help them explore genuinely different approaches. The same technique improved performance by 30% across legal reasoning, medical diagnosis, and other domains.

    Are we watching LLMs develop something that looks increasingly like metacognition? The ability to think about how to think. That's a capability that could fundamentally change how we deploy these systems in education, research, and problem-solving.

    ↓ Want to keep up? Join my newsletter with 50k+ readers and be the first to learn about the latest AI research: llmwatch.com 💡
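
The reward loop between the two models can be sketched in a few lines. This is an illustration under assumptions rather than the paper's method: the `solve` stub stands in for the solution generator, the problem and hints are made up, and defining the reward as "solve rate with the hint minus solve rate without it" is one plausible way to express "rewarded when its hints actually help".

```python
from statistics import mean
from typing import Optional

# Made-up example problem for the sketch.
PROBLEM = {"text": "What is the sum of the first 10 odd numbers?", "answer": 100}


def solve(problem: dict, hint: Optional[str], attempt: int) -> int:
    """Hypothetical solution generator (a real one would be an LLM call)."""
    if hint is not None and "n squared" in hint:
        return 10 ** 2                       # the hint points at the closed form
    return 100 if attempt % 4 == 0 else 99   # unaided: right only sometimes


def solve_rate(problem: dict, hint: Optional[str], attempts: int = 8) -> float:
    """Fraction of attempts that reach the reference answer."""
    return mean(
        float(solve(problem, hint, a) == problem["answer"])
        for a in range(attempts)
    )


def hint_reward(problem: dict, hint: str, attempts: int = 8) -> float:
    """Reward for the abstraction generator: how much the hint helps."""
    return solve_rate(problem, hint, attempts) - solve_rate(problem, None, attempts)


if __name__ == "__main__":
    for hint in ("The sum of the first n odd numbers is n squared.",
                 "Be careful with arithmetic."):
        print(f"{hint_reward(PROBLEM, hint):+.2f}  {hint}")
```

A generic hint earns nothing under this scoring, while a hint that changes the solver's strategy earns a positive reward, which is the pressure that would push the abstraction generator toward useful, problem-specific guidance.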

  • Sohrab Rahimi

    Director, AI/ML Lead @ Google

    23,829 followers

    One of the most significant papers last month came from Meta, introducing Large Concept Models (LCMs). While LLMs have dominated AI, their token-level focus limits their reasoning capabilities. LCMs present a new paradigm, offering a structural, hierarchical approach that enables AI to reason and organize information more like humans.

    LLMs process text at the token level, using word embeddings to model relationships between individual words or subwords. This granular approach excels at tasks like answering questions or generating detailed text but struggles with maintaining coherence across long-form content or synthesizing high-level abstractions. LCMs address this limitation by operating on sentence embeddings, which represent entire ideas or concepts in a high-dimensional, language-agnostic semantic space called SONAR. This enables LCMs to reason hierarchically, organizing and integrating information conceptually rather than sequentially.

    If we think of the AI brain as having distinct functional components, LLMs are like the sensory cortex, processing fine-grained details and detecting patterns at a local level. LCMs, on the other hand, function like the prefrontal cortex, responsible for organizing, reasoning, and planning. The prefrontal cortex doesn’t just process information; it integrates and prioritizes it to solve complex problems. The absence of this “prefrontal” functionality has been a significant limitation in AI systems until now. Adding this missing piece allows systems to reason and act with far greater depth and purpose.

    In my opinion, the combination of LLMs and LCMs can be incredibly powerful. This idea is similar to multiscale modeling, a method used in mathematics to solve problems by addressing both the big picture and the small details simultaneously. For example, in traffic flow modeling, the global level focuses on citywide patterns to reduce congestion, while the local level ensures individual vehicles move smoothly. Similarly, LCMs handle the “big picture,” organizing concepts and structuring tasks, while LLMs focus on the finer details, like generating precise text.

    Here is a practical example: Imagine analyzing hundreds of legal documents for a corporate merger. An LCM would identify key themes such as liabilities, intellectual property, and financial obligations, organizing them into a clear structure. Afterward, an LLM would generate detailed summaries for each section to ensure the final report is both precise and coherent. By working together, they streamline the process and combine high-level reasoning with detailed execution.

    In your opinion, what other complex, high-stakes tasks could benefit from combining LLMs and LCMs?

    🔗: https://lnkd.in/e_rRgNH8
