One of the biggest barriers to deploying LLM-based agents in real workflows is their poor performance on long-horizon reasoning. Agents often generate coherent short responses but struggle when a task requires planning, tool use, or multi-step decision-making. The issue is not just accuracy at the end, but the inability to reason through the middle. Without knowing which intermediate steps helped or hurt, agents cannot learn to improve. This makes long-horizon reasoning one of the hardest and most unsolved problems for LLM generalization.

It is relatively easy for a model to retrieve a document, answer a factual question, or summarize a short email. It is much harder to solve a billing dispute that requires searching, interpreting policy rules, applying edge cases, and adjusting the recommendation based on prior steps. Today’s agents can generate answers, but they often fail to reflect, backtrack, or reconsider earlier assumptions.

A new paper from Google DeepMind and Stanford addresses this gap with a method called SWiRL: Step-Wise Reinforcement Learning. Rather than training a model to get the final answer right, SWiRL trains the model to improve each step in a reasoning chain. It does this by generating synthetic multi-step problem-solving traces, scoring every individual step using a reward model (Gemini 1.5 Pro), and fine-tuning the base model to favor higher-quality intermediate steps.

This approach fundamentally changes the way we train reasoning agents. Instead of optimizing for final outcomes, the model is updated based on how good each reasoning step was in context. For example, if the model generates a search query or a math step that is useful, even if the final answer is wrong, that step is rewarded and reinforced. Over time, the agent learns not just to answer, but to reason more reliably. This is a major departure from standard RLHF, which only gives feedback at the end.
SWiRL improves performance by 9.2 percent on HotPotQA, 16.9 percent on GSM8K when trained on HotPotQA, and 11 to 15 percent on other multi-hop and math datasets like MuSiQue, BeerQA, and CofCA. It generalizes across domains, works without golden labels, and outperforms both supervised fine-tuning and single-step RL methods. The implications are substantial: we can now train models to reason better by scoring and optimizing their intermediate steps. Better reward models, iterative reflection, tool-assisted reasoning, and trajectory-level training will lead to more robust performance in multi-step tasks. This is not about mere performance improvement. It shows how we can begin to train agents not to mimic outputs, but to improve the quality of their thought process. That’s essential if we want to build agents that work through problems, adapt to new tasks, and operate autonomously in open-ended environments.
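The stepwise-scoring idea is easy to picture in code. Below is a minimal sketch of the concept as described above, not the paper's actual pipeline: a synthetic trace is judged step by step, and steps that clear a quality bar become fine-tuning targets even when the final answer is wrong. `score_step` is a toy stand-in for the reward model (the paper used Gemini 1.5 Pro); `build_stepwise_dataset` and its threshold are illustrative names, not from the paper.

```python
def score_step(context: str, step: str) -> float:
    """Placeholder judge: rewards steps that reference the context.
    A real system would query an LLM reward model here."""
    overlap = set(context.lower().split()) & set(step.lower().split())
    return len(overlap) / max(len(step.split()), 1)

def build_stepwise_dataset(question, trace, threshold=0.2):
    """Keep (context, step) pairs whose judged quality clears the threshold,
    regardless of whether the final answer was correct."""
    dataset, context = [], question
    for step in trace:
        if score_step(context, step) >= threshold:
            dataset.append({"context": context, "target": step})
        context = context + "\n" + step  # each step conditions the next
    return dataset

trace = [
    "Search: population of France 2020",
    "The retrieved page says France had about 67 million people in 2020.",
    "Answer: about 67 million",
]
data = build_stepwise_dataset("What was the population of France in 2020?", trace)
print(len(data), "steps kept for fine-tuning")
```

The key contrast with outcome-only RLHF is visible in the loop: each step is scored in its own context, so a good search query survives into the training set even if a later step derails the answer.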
Problem-Solving Strategies in Large Language Models
Summary
Problem-solving strategies in large language models (LLMs) are methods and approaches used to help these AI systems tackle complex, multi-step tasks and reasoning challenges. Unlike simple question answering, these strategies enable LLMs to plan, reflect, use external tools, and execute multi-step reasoning processes similar to human thinking.
- Train for stepwise reasoning: Encourage models to improve each step in their reasoning chains, rather than just aiming for the correct final answer, so they learn to navigate complex tasks more reliably.
- Experiment with directionality: Try both left-to-right and right-to-left reasoning approaches, as some tasks benefit from evaluating answers before considering the question, which can improve accuracy and insight.
- Integrate planning and tools: Design LLM-based agents to store information, plan actions, and use external tools or databases, which allows them to adapt and solve problems that require more than just generating a response.
Most large language models are trained to predict the next token in a left-to-right (L2R) manner. However, Apple researchers discovered that right-to-left (R2L) models can significantly outperform L2R models on specific multiple-choice question (MCQ) tasks! I just read this new paper "Reversal Blessing: Thinking Backward May Outpace Thinking Forward in Multi-choice Questions" that challenges our assumptions about how language models process information. This "reverse thinking" approach uses Bayesian inference to evaluate answer choices based on their likelihood of generating the question, rather than the traditional approach of evaluating questions to predict answers.

Surprising Results: The researchers trained both L2R and R2L models with identical data and computational resources across different model sizes (2B-8B parameters).
- R2L models consistently outperformed L2R on logical reasoning tasks (LogiQA)
- R2L excelled at commonsense understanding tasks (OpenbookQA, CommonsenseQA)
- R2L showed dramatic improvement on truthfulness assessment (TruthfulQA - 51% better!)
What's fascinating is that these improvements held across different model sizes, datasets, and random seeds, suggesting this isn't just statistical noise.

Why Does This Work? The researchers explored three hypotheses for why R2L performs better on certain tasks:
1. Calibration - R2L naturally "auto-normalizes" different answer choices, avoiding the "surface competition" issue where semantically similar answers (like "dog" and "puppy") split probability mass
2. Computability - Different directional factorizations have varying computational complexity
3. Conditional Entropy - The optimal reasoning direction corresponds to lower conditional entropy
Through controlled simulation studies with arithmetic tasks, they found strong evidence supporting the conditional entropy hypothesis - the direction with lower conditional entropy tends to perform better.
Implications
This research suggests exciting possibilities for future language model development:
- We might benefit from models that can reason in multiple directions
- Alternative factorizations beyond L2R and R2L could further enhance LLM capabilities
- Task-specific reasoning directions could boost performance on targeted applications
The study suggests that our default assumptions about "forward thinking" might not always be optimal.
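To make the "reverse thinking" scoring concrete, here is a toy sketch of the two scoring directions. `log_prob` is a placeholder for a real model's conditional log-likelihood, not an actual model call; with this symmetric word-overlap stub both directions agree, and the point is only which quantity gets maximized.

```python
import math

def log_prob(sequence: str, given: str) -> float:
    """Placeholder: a real model would return the log-likelihood of
    `sequence` conditioned on `given`; here we just count shared words."""
    shared = set(sequence.lower().split()) & set(given.lower().split())
    return math.log1p(len(shared))

def pick_forward(question, choices):
    """L2R scoring: argmax over P(choice | question)."""
    return max(choices, key=lambda c: log_prob(c, question))

def pick_reverse(question, choices):
    """R2L scoring: argmax over P(question | choice) -- the Bayesian
    'thinking backward' direction (assuming a uniform prior on choices)."""
    return max(choices, key=lambda c: log_prob(question, c))

q = "A young dog is called a what?"
choices = ["a puppy is a young dog", "a kitten", "a foal"]
print(pick_reverse(q, choices))
```

With a real pair of L2R and R2L models, the two `pick_*` functions can disagree, and the paper's claim is that the reverse direction wins when its conditional entropy is lower.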
-
Exciting New Research: LLM-Based Agents for Question Answering! I just came across a fascinating survey paper on Large Language Model (LLM)-based agents for question answering systems. This comprehensive review from George Mason University explores how LLM agents are revolutionizing QA systems by addressing limitations in traditional approaches.

>> What makes LLM agents special?
Traditional QA systems relied on fixed pipelines with separate modules for query understanding, information retrieval, and answer generation. LLM-based agents transform this approach by using LLMs as their core reasoning engine, enabling dynamic interaction with external environments. The paper breaks down the LLM agent architecture into three key components:
- Memory (M): Stores all information, including the question and retrieved data
- Planning module (π_p): Determines the next action by consulting the LLM with planning prompts
- Inner-thinking module (π_t): Executes internal reasoning processes

>> Technical Implementation Details
The agent's planning process is mathematically represented as A_t = π_p(S_t), where π_p is the planning policy function and A_t is the action at time t. The agent's observation is obtained through O_t = E(A_t), where E is the environment feedback function. The LLM agent QA process follows a sophisticated algorithm where memory is initialized with the question, then for each time step:
1. The planning module selects an action based on memory
2. If the action interacts with external environments, the agent obtains observations and updates memory
3. If it's an internal thinking action, the agent processes the action internally

>> Key Areas of Innovation
The survey organizes LLM agent QA systems into several critical components:
Planning: Both prompting-based approaches (like ReAct, Think on Graph) and tuning-based methods (FireAct, Learning from Failure) that enable agents to formulate action sequences.
Question Understanding: Techniques for identifying slots, query expansion (HyQE, Query2CoT), and query reformulation to enhance comprehension. Information Retrieval: Methods for retrieving, ranking, and compressing relevant information from external sources. Answer Generation: Tool-augmented generation (Program-of-Thought) and prompt-enhanced techniques (Chain-of-Thought) that improve response quality. This research highlights how LLM agents are addressing hallucination issues and knowledge limitations by enabling interaction with external tools, databases, and other models.
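The surveyed loop (memory, planning policy, environment feedback) can be sketched in a few lines. Everything below is a hand-written stand-in: `planning_policy` replaces the LLM-backed π_p, and `environment` replaces the real feedback function E.

```python
def planning_policy(memory):
    """Stand-in for pi_p: search once, then answer.
    A real policy would consult the LLM with a planning prompt."""
    if not any(m.startswith("obs:") for m in memory):
        return ("search", memory[0])   # external action
    return ("answer", memory[-1])      # internal thinking action

def environment(action):
    """Stand-in for E: returns a canned observation for searches."""
    kind, payload = action
    return f"obs: retrieved passage for '{payload}'"

def agent_qa(question, max_steps=5):
    memory = [question]                # memory initialized with the question
    for _ in range(max_steps):
        action = planning_policy(memory)      # A_t = pi_p(S_t)
        if action[0] == "search":             # interacts with the environment
            memory.append(environment(action))  # O_t = E(A_t), memory updated
        else:                                 # internal action: terminate
            return action[1]
    return memory[-1]

print(agent_qa("Who wrote Dune?"))
```

The loop mirrors the survey's algorithm: external actions append observations to memory, internal actions are processed without touching the environment.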
-
Prompt engineering is a rapidly-evolving topic in AI, but recent research can be grouped into four categories...

(1) Reasoning: Simple prompting techniques are effective for many problems, but more sophisticated strategies are required to solve multi-step reasoning problems.
- [1] uses zero-shot CoT prompting to automatically generate problem-solving rationales to use for standard CoT prompting.
- [2] selects CoT exemplars based on their complexity (exemplars that have the maximum number of reasoning steps are selected first).
- [3] improves CoT prompting by asking the LLM to progressively refine the generated rationale.
- [4] decomposes complex tasks into several sub-tasks that can be solved via independent prompts and later aggregated into a final answer.

(2) Tool Usage: LLMs are powerful, but they have notable limitations. We can solve many of these limitations by teaching the LLM how to leverage external, specialized tools.
- [5, 6] finetune a language model to teach it how to leverage a fixed, simple set of text-based APIs when answering questions.
- [7] uses a central LLM-based controller to generate a program—written in natural language—that composes several tools to solve a complex reasoning task.
- [8] uses a retrieval-based finetuning technique to teach an LLM to adaptively make calls to APIs based on their documentation when solving a problem.
- [9] uses an LLM as a central controller for leveraging a variety of tools in the form of deep learning model APIs.
- [10, 11] integrate code-capable LLMs with a sandboxed Python environment to execute programs when solving problems.

(3) Context Window: Given the emphasis of recent LLMs on long contexts for RAG / few-shot learning, the properties of context windows and in-context learning have been studied in depth.
- [12] shows that including irrelevant context in the LLM’s prompt can drastically deteriorate performance.
- [13] finds that LLMs pay the most attention to information at the beginning/end of the prompt, while information placed in the middle of a long context is forgotten.
- [14] proposes a theoretically-grounded strategy for optimally selecting few-shot exemplars.

(4) Better Writing: One of the most popular use-cases of LLMs is for improving human writing, and prompt engineering can be used to make more effective writing tools with LLMs.
- [15] improves the writing abilities of an LLM by first generating an outline and then filling in each component of the outline one-by-one.
- [16] uses a smaller LLM to generate a “directional stimulus” (i.e., a textual hint) that can be used as extra context to improve an LLM’s writing ability on a given task.
- [17] improves the quality of LLM-generated summaries by iteratively prompting the LLM to increase the information density of the summary.
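The decomposition pattern from [4] — independent sub-prompts, then an aggregation prompt — can be sketched in a few lines. `llm` below is a placeholder completion function, not a real API, and the task/sub-task strings are made up for illustration.

```python
def llm(prompt: str) -> str:
    """Placeholder completion function standing in for a real model call."""
    return f"[answer to: {prompt}]"

def decompose_and_solve(task, subtasks):
    # Solve each sub-task with its own independent prompt...
    partials = [llm(f"{task}\nSub-task: {s}") for s in subtasks]
    # ...then aggregate the partial answers with a final prompt.
    joined = "\n".join(partials)
    return llm(f"Combine these partial results to answer '{task}':\n{joined}")

result = decompose_and_solve(
    "Plan a data pipeline",
    ["choose storage", "define schema", "schedule jobs"],
)
```

Because the sub-prompts are independent, they can also be run in parallel — one reason decomposition often beats a single long prompt on multi-step tasks.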
-
🧐 Demystifying Long Chain-of-Thought (CoT) Reasoning in LLMs
Authors: Edward Yeo, Yuxuan Tong, Morry Niu, Graham Neubig, and Xiang Yue
🔗 https://lnkd.in/gEp7CNu6

🔹 This paper caught my attention, especially after all the buzz around DeepSeek AI's R1. This paper takes a deep dive into how large language models (LLMs) develop long Chain-of-Thought (CoT) reasoning—the kind of structured problem-solving that allows for backtracking, self-correction, and exploring multiple solutions. The authors systematically analyze what drives models to generate longer, more sophisticated reasoning chains and provide some key insights into training strategies.

🔹 One of the biggest takeaways is that Supervised Fine-Tuning (SFT) on long CoT data significantly improves RL training. While not strictly necessary, it makes reinforcement learning (RL) much more effective, allowing models to scale their reasoning abilities. Short CoTs tend to plateau early, while long CoTs continue improving with more data.

🔹 However, simply applying RL to extend CoT length doesn’t always work. The authors show that RL training for long CoT is often unstable, with models either failing to extend reasoning length or artificially inflating responses without meaningful reasoning. To address this, they introduce a Cosine Length-Scaling Reward, which stabilizes length growth while preserving reasoning quality. They also find that reward hacking is a major issue—models will start repeating phrases to maximize length-based rewards rather than improving their reasoning. This is mitigated using an n-gram repetition penalty, ensuring that added steps actually contribute to problem-solving.

🔹 Another key challenge is the need for high-quality verifiable reward signals. Traditional RL rewards rely on ground-truth answers, but such data is scarce. The authors explore using noisy, web-extracted (“silver”) supervision data and find that, with proper filtering, it significantly improves out-of-distribution generalization.
🔹 One of the most interesting findings is that long CoT reasoning isn’t entirely emergent—many core abilities, like error correction and branching, are already present in base models. RL doesn’t create these from scratch but guides the model toward using them more effectively. Written in collab with Aman Chadha, let us know what you'd like to see next!
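Here's my reading of the two reward-shaping ideas above, sketched as code. The constants and default values are illustrative, not the paper's exact settings: a cosine schedule over CoT length, plus an n-gram repetition penalty so the model can't game a length-based reward by looping phrases.

```python
import math

def cosine_length_reward(correct, length, max_length=1000,
                         r_correct=(2.0, 1.0), r_wrong=(-10.0, 0.0)):
    """Cosine-interpolate reward over CoT length: correct answers earn a bit
    less when longer (discouraging padding), while wrong answers are punished
    less when longer (encouraging continued exploration)."""
    r0, r_max = r_correct if correct else r_wrong  # reward at length 0 / max
    t = min(length, max_length) / max_length       # normalized length in [0, 1]
    return r_max + 0.5 * (r0 - r_max) * (1 + math.cos(math.pi * t))

def ngram_repetition_penalty(text, n=4, weight=0.05):
    """Subtract reward in proportion to the fraction of repeated n-grams."""
    words = text.split()
    grams = [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]
    if not grams:
        return 0.0
    repeated = len(grams) - len(set(grams))
    return -weight * repeated / len(grams)
```

A short correct answer gets the top of the cosine schedule, a max-length correct answer gets the lower bound, and any looped phrasing drags the total down via the penalty.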
-
𝗜 𝗵𝗮𝘃𝗲 𝗯𝗲𝗲𝗻 𝗶𝗻 𝘁𝗵𝗲 𝗡𝗟𝗣 𝘀𝗽𝗮𝗰𝗲 𝗳𝗼𝗿 𝗮𝗹𝗺𝗼𝘀𝘁 𝟭𝟬 𝘆𝗲𝗮𝗿𝘀 𝗻𝗼𝘄, and I know the first-hand challenges of building text-based models in the pre-GPT era! So, I am a 𝗽𝗿𝗼-𝗟𝗮𝗿𝗴𝗲 𝗟𝗮𝗻𝗴𝘂𝗮𝗴𝗲 𝗠𝗼𝗱𝗲𝗹 (𝗟𝗟𝗠) 𝗲𝗻𝘁𝗵𝘂𝘀𝗶𝗮𝘀𝘁, but I don’t believe they will replace humans or solve all our problems, especially when it comes to highly complex reasoning in industries like Finance.

𝗧𝗵𝗶𝘀 𝘄𝗲𝗲𝗸𝗲𝗻𝗱, I read two compelling papers, and I’m convinced we’re bumping into real reasoning ceilings:

𝗜> "𝗧𝗵𝗲 𝗜𝗹𝗹𝘂𝘀𝗶𝗼𝗻 𝗼𝗳 𝗧𝗵𝗶𝗻𝗸𝗶𝗻𝗴: 𝗨𝗻𝗱𝗲𝗿𝘀𝘁𝗮𝗻𝗱𝗶𝗻𝗴 𝘁𝗵𝗲 𝗦𝘁𝗿𝗲𝗻𝗴𝘁𝗵𝘀 𝗮𝗻𝗱 𝗟𝗶𝗺𝗶𝘁𝗮𝘁𝗶𝗼𝗻𝘀 𝗼𝗳 𝗥𝗲𝗮𝘀𝗼𝗻𝗶𝗻𝗴 𝗠𝗼𝗱𝗲𝗹𝘀 𝘃𝗶𝗮 𝘁𝗵𝗲 𝗟𝗲𝗻𝘀 𝗼𝗳 𝗣𝗿𝗼𝗯𝗹𝗲𝗺 𝗖𝗼𝗺𝗽𝗹𝗲𝘅𝗶𝘁𝘆" (Apple)
Apple researchers rigorously tested 𝗟𝗮𝗿𝗴𝗲 𝗥𝗲𝗮𝘀𝗼𝗻𝗶𝗻𝗴 𝗠𝗼𝗱𝗲𝗹𝘀 (𝗟𝗥𝗠𝘀), LLMs that explicitly generate chain-of-thought reasoning, using controlled puzzles like Tower of Hanoi and River Crossing.
Key insights:
1. 𝗧𝗵𝗿𝗲𝗲 𝗿𝗲𝗮𝘀𝗼𝗻𝗶𝗻𝗴 𝗿𝗲𝗴𝗶𝗺𝗲𝘀:
▪️Low complexity: standard LLMs outperform LRMs
▪️Medium complexity: LRMs excel
▪️High complexity: 𝗯𝗼𝘁𝗵 𝗰𝗼𝗹𝗹𝗮𝗽𝘀𝗲, accuracy plummets
2. Fascinating observation: 𝗟𝗥𝗠𝘀 “𝗴𝗶𝘃𝗲 𝘂𝗽” as puzzle complexity increases; their reasoning effort declines rapidly, even with enough tokens
3. Even when provided an exact algorithm (e.g., the Tower of Hanoi strategy), the 𝗺𝗼𝗱𝗲𝗹𝘀 𝘀𝘁𝗶𝗹𝗹 𝗳𝗮𝗶𝗹𝗲𝗱 𝘁𝗼 𝗴𝗲𝗻𝗲𝗿𝗮𝗹𝗶𝘇𝗲, mostly producing outputs that match patterns observed in their training data

𝗜𝗜> "𝗥𝗲𝗮𝘀𝗼𝗻𝗶𝗻𝗴 𝗼𝗿 𝗢𝘃𝗲𝗿𝘁𝗵𝗶𝗻𝗸𝗶𝗻𝗴: 𝗘𝘃𝗮𝗹𝘂𝗮𝘁𝗶𝗻𝗴 𝗟𝗮𝗿𝗴𝗲 𝗟𝗮𝗻𝗴𝘂𝗮𝗴𝗲 𝗠𝗼𝗱𝗲𝗹𝘀 𝗼𝗻 𝗙𝗶𝗻𝗮𝗻𝗰𝗶𝗮𝗹 𝗦𝗲𝗻𝘁𝗶𝗺𝗲𝗻𝘁 𝗔𝗻𝗮𝗹𝘆𝘀𝗶𝘀" (Dimitris Vamvourellis & Dhagash Mehta, Ph.D., BlackRock)
This study tested major 𝗟𝗟𝗠𝘀 (𝗚𝗣𝗧‐𝟰𝗼, 𝗚𝗣𝗧‐𝟰.𝟭, 𝗼𝟯‐𝗺𝗶𝗻𝗶, 𝗙𝗶𝗻𝗕𝗘𝗥𝗧 𝘃𝗮𝗿𝗶𝗮𝗻𝘁𝘀) on financial sentiment classification using:
- "𝗦𝘆𝘀𝘁𝗲𝗺 𝟭" (𝗳𝗮𝘀𝘁/𝗶𝗻𝘁𝘂𝗶𝘁𝗶𝘃𝗲) prompting
- "𝗦𝘆𝘀𝘁𝗲𝗺 𝟮" (𝘀𝗹𝗼𝘄/𝗱𝗲𝗹𝗶𝗯𝗲𝗿𝗮𝘁𝗲) 𝗽𝗿𝗼𝗺𝗽𝘁𝗶𝗻𝗴
Key takeaways:
▪️𝗥𝗲𝗮𝘀𝗼𝗻𝗶𝗻𝗴 𝗽𝗿𝗼𝗺𝗽𝘁𝘀 𝗱𝗶𝗱 𝗻𝗼𝘁 𝗶𝗺𝗽𝗿𝗼𝘃𝗲 𝗽𝗲𝗿𝗳𝗼𝗿𝗺𝗮𝗻𝗰𝗲
▪️Surprisingly, straightforward, intuitive prompts with GPT-4o (no chain-of-thought) outperformed all others
▪️More reasoning led to overthinking, reducing alignment with human-labeled sentiments

💡 Why it matters for builders and researchers in Finance and every industry:
❎ 𝗕𝗶𝗴𝗴𝗲𝗿 𝗺𝗼𝗱𝗲𝗹𝘀 + 𝗺𝗼𝗿𝗲 “𝘁𝗵𝗶𝗻𝗸𝗶𝗻𝗴” = 𝗯𝗲𝘁𝘁𝗲𝗿 𝗼𝘂𝘁𝗰𝗼𝗺𝗲𝘀. Sometimes it’s actively worse.
❎ We’re not seeing a soft plateau — these are 𝗵𝗮𝗿𝗱 𝗰𝗲𝗶𝗹𝗶𝗻𝗴𝘀 𝗶𝗻 𝗿𝗲𝗮𝘀𝗼𝗻𝗶𝗻𝗴 𝗰𝗮𝗽𝗮𝗰𝗶𝘁𝘆.
❎ For real-world systems, agents, and financial tools: design for 𝗿𝗲𝗮𝘀𝗼𝗻𝗶𝗻𝗴 𝗲𝗰𝗼𝗻𝗼𝗺𝘆, not just reasoning depth.
#LLMs #ReasoningLimits #LLMChainofthought #LLMReasoningDecline
-
Do Large Reasoning Models (#LRMs) actually learn to reason, or do they learn to approximate reasoning? That really matters for high-stakes applications where squinting at math or engineering problems and giving a "probably" right answer isn't good enough. A great study from Subbarao Kambhampati on reinforcement learning with verifiable rewards (#RLVR) empirically shows that when we fine-tune large language models to show tidy, step-by-step “reasoning,” we may only be teaching them to look more rational, not to actually reason. LRMs don’t solve problems by following symbolic steps the way humans or algorithms do. They use gradient descent to adjust billions of internal weights to minimize error. During fine-tuning, models learn to produce text ("intermediate tokens") on the way to the final answer that sounds like reasoning (“First, let’s find X…”). What the authors found was that more coherent, plausible-sounding intermediate steps don't correspond with global problem validity and accuracy. The model had learned a linguistic style, not how to do step-by-step reasoning. That distinction matters for #AGI. If we want machines that can truly reason—plan, check their own logic, and generalize across domains—we’ll need architectures that go beyond loss minimization: systems that can represent and manipulate knowledge causally, not just statistically. That might be external scratchpads that persist and can be audited, hybrid #neurosymbolic architectures, etc. But there's more work to be done if we want to get to real reasoning. https://lnkd.in/eMFBGN9N
-
Up until now, much of domain-specific knowledge injection for LLMs has answered the question: "How do we get the right context INTO the window?" But with the success of coding agents and recursive language models, that question has changed to: "How do we let the model NAVIGATE context itself?"

Large language models have a limited context window: a maximum number of tokens they can take as input at once. This is a hard constraint resulting from the transformer architecture itself, and while modern models have pushed context windows into the hundreds of thousands (even millions) of tokens, more context doesn't always mean better results. Research has shown that model performance actually degrades as input length increases, a phenomenon known as context rot, where models struggle to reliably use information buried deep in long sequences, especially when surrounded by similar but irrelevant content.

The solution up until now has been Retrieval-Augmented Generation (RAG): chunking and embedding documents into vector databases, then retrieving the most relevant pieces via semantic similarity. This works, but it frames context management purely as a search problem, and scaling it starts to feel more like building a search engine than an AI system.

What coding agents like Claude Code, Cursor, and Codex stumbled into was a different approach entirely: give the LLM a terminal and let it explore. Filesystem-based context navigation lets models directly explore, preview, and selectively load content using tools they already understand. Instead of engineering a pipeline to deliver the right context, the model finds it itself.

Recursive Language Models (RLMs) formalize this further, with a slight distinction: in a coding agent, opening a file or running a tool dumps results back into the context window. RLMs instead store the prompt and all sub-call results as variables in a code environment, only interacting with them programmatically.
Recursion happens during code execution, meaning the model can spawn arbitrarily many sub-LLM calls without polluting its own context, orchestrating understanding of 10M+ tokens without ever having to look at all of it at once.

This gives us two differently motivated options: RAG gives you fast, narrow retrieval, great for latency-sensitive apps like chatbots; RLM-style frameworks trade speed for deeper exploration, better suited when thorough analysis matters more than response time.

To learn more about context rot, how coding agents changed context delivery, and how recursive language models are formalizing it all, check out my latest video here: https://lnkd.in/ehszSKV7
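The "prompt as a variable" idea can be sketched like this. `sub_llm` is a placeholder for a recursive model call, and the chunking strategy is a naive slice, just to show the data flow: the orchestrator never reads the full prompt, only programmatic slices and small partial results.

```python
def sub_llm(instruction: str, chunk: str) -> str:
    """Stand-in for a recursive sub-LLM call; returns a fake summary."""
    return f"summary({len(chunk)} chars)"

def answer_over_long_context(prompt: str, question: str, chunk_size=10_000):
    # The full prompt lives here as a Python variable; it never enters
    # the orchestrator's own context window.
    chunks = [prompt[i:i + chunk_size]
              for i in range(0, len(prompt), chunk_size)]
    # Each slice is handled by an independent sub-call...
    partials = [sub_llm(question, c) for c in chunks]
    # ...and only the small partial results are combined for the final answer.
    return sub_llm(question, "\n".join(partials))
```

Contrast this with a coding agent, where each tool result would be dumped back into the context window; here the intermediate material stays in variables and only distilled summaries propagate upward.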
From Retrieval to Navigation: The New RAG Paradigm
-
🎯 Large Language Models are *not* the Language of Science (in the pure form)

A growing number of my colleagues are exploring how LLMs and reasoning agents can support scientific research — from literature mining to hypothesis generation. It's an exciting direction. But the real question isn’t if LLMs can be used for science — it’s how, and what the primary bottlenecks are.

For me, the crux is this: LLMs are trained on human language, not the language of science. Human language once served as the foundation for structured reasoning — think Ancient Greece, where philosophy and logic evolved through dialogue. But over time, science moved to its own language: mathematics. From calculus to quantum mechanics, we developed formalisms where definitions are sharp, operations well-defined, and ambiguity minimized. In fact, quantum mechanics is a striking example of how mathematical structure began to lead physical understanding — to the discomfort of many early 20th-century physicists.

So now we face a similar moment. If we want LLMs to be part of real-world research — not just summarization, but actual scientific reasoning — we need to bridge natural language and mathematical abstraction. And that’s hard. Even formalized human language (e.g., ontologies, knowledge graphs) lacks the structural precision and uniqueness of mathematics. A word like “force” can carry dozens of meanings in natural language; in physics, it has exactly one. Semantics in natural language is contextual and often underspecified — the opposite of what math requires.

The challenge — and opportunity — lies in that murky space between how we talk and how we compute. How do we connect LLM-driven abstraction with the formalisms of logic, optimization, and symbolic reasoning? That, to me, is where the real innovation will happen.
In addition to channeling LLMs through structured abstractions like causal graphs or workflows, several other strategies are emerging to bridge LLMs with science:
- Prompt engineering with formal constraints to guide structured outputs.
- Chain-of-thought reasoning that mimics scientific logic step-by-step.
- LLM-as-orchestrator, calling external tools for math, code, or simulations.
- Retrieval-augmented generation grounded in structured databases.
- Code generation as an interface between language and formal logic.
- Embedding alignment between language models and scientific representations.
- Training on synthetic formal corpora derived from math and physics.
- Formal verification loops that validate outputs with theorem provers or symbolic math.
Simply from the number of these approaches, it is clear that there is no universal answer (yet?). It would be super interesting to see which one will actually work when operationalized on self-driving labs and autonomous discovery tools!
-
Day 16/30 of LLMs/SLMs - Retrieval-Augmented Generation (RAG)

Large Language Models are powerful, but they have a fixed memory. They cannot know anything that happened after their training cut-off, and they struggle with facts that were never part of their dataset. When they lack the right information, they guess. The result is fluent but unreliable text — the hallmark of hallucination. Retrieval-Augmented Generation (RAG) fixes that by giving models a way to look up information before they answer. RAG is best understood as a three-stage pipeline, and LangChain has become the de facto standard framework for building each stage efficiently.

𝐈𝐧𝐠𝐞𝐬𝐭𝐢𝐨𝐧 𝐚𝐧𝐝 𝐈𝐧𝐝𝐞𝐱𝐢𝐧𝐠
You start by collecting and preparing your documents. LangChain’s loaders handle PDFs, web pages, CSVs, and APIs. These documents are then split into smaller, semantically meaningful chunks and converted into embeddings using models like OpenAI’s text-embedding-3-small, SentenceTransformers, or InstructorXL. Those embeddings are stored in a vector database such as FAISS, Pinecone, Weaviate, or Chroma, which lets you perform similarity search later.

𝐑𝐞𝐭𝐫𝐢𝐞𝐯𝐚𝐥
When a query arrives, LangChain converts it into an embedding and searches the vector store for the most relevant documents. Retrieval strategies vary — basic similarity search, maximal marginal relevance (MMR) to diversify context, or hybrid retrieval that mixes semantic and keyword search. The retrieved text chunks are then added to the prompt as contextual grounding.

𝐆𝐞𝐧𝐞𝐫𝐚𝐭𝐢𝐨𝐧
The LLM receives the augmented prompt containing both the user query and retrieved passages. It synthesizes an answer based on that external knowledge. LangChain manages prompt templates, context formatting, and memory across queries, making the process modular and repeatable.

𝐖𝐡𝐲 𝐑𝐀𝐆 𝐌𝐚𝐭𝐭𝐞𝐫𝐬
RAG fundamentally improves factual accuracy and trust.
On benchmarks such as Natural Questions and TriviaQA, a base model like LLaMA 2-13B might achieve 45 F1, while RAG-augmented versions reach 65–70 F1, matching much larger and costlier models.

𝐆𝐞𝐭𝐭𝐢𝐧𝐠 𝐒𝐭𝐚𝐫𝐭𝐞𝐝 𝐰𝐢𝐭𝐡 𝐋𝐚𝐧𝐠𝐂𝐡𝐚𝐢𝐧 𝐑𝐀𝐆
If you want to experiment, LangChain makes it approachable. A minimal prototype takes fewer than 20 lines of code. Here’s a good progression:
👉 Start with the LangChain tutorial: https://lnkd.in/gUpHpkKT
👉 Add a vector store: try Chroma for local experiments or Pinecone for scalable hosting.
👉 Experiment with retrieval methods: compare similarity search vs. MMR.
👉 Integrate your own data: ingest PDFs, database exports, or web content.
👉 Deploy a chain: connect your retriever, model, and prompt template into a single workflow.

Tune in tomorrow for more SLM/LLMs deep dives.
--
🚶➡️ To learn more about LLMs/SLMs, follow me - Karun!
♻️ Share so others can learn, and you can build your LinkedIn presence!
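For intuition before reaching for LangChain, the three stages can be shown framework-free. This toy sketch uses word overlap in place of real embeddings and a placeholder in place of the LLM call; the document strings and function names are made up for illustration.

```python
def embed(text):
    return set(text.lower().split())                 # toy "embedding": a word set

def ingest(documents):
    """Ingestion & indexing: one entry per document in a toy vector store."""
    return [(doc, embed(doc)) for doc in documents]

def retrieve(store, query, k=2):
    """Retrieval: rank stored docs by overlap with the query 'embedding'."""
    q = embed(query)
    ranked = sorted(store, key=lambda d: len(q & d[1]), reverse=True)
    return [doc for doc, _ in ranked[:k]]

def generate(query, context):
    # Placeholder for the LLM call on the augmented prompt.
    return f"Answer to '{query}' grounded in {len(context)} passages."

store = ingest([
    "FAISS is a library for vector similarity search.",
    "Chroma is a local-first vector database.",
    "LangChain chains loaders, retrievers, and prompts.",
])
passages = retrieve(store, "Which vector database runs locally?")
print(generate("Which vector database runs locally?", passages))
```

Swapping `embed` for a real embedding model and `generate` for an actual LLM call turns this skeleton into the same pipeline LangChain assembles for you.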