Exciting Research Alert: Improving RAG with Self-Generated Demonstrations

I just came across a fascinating paper that addresses a critical challenge in Retrieval-Augmented Generation (RAG) systems. The research team from USC and Meta has developed a novel approach called Self-Demo Retrieval-Augmented Instruction Tuning (SD-RA-IT) that significantly improves how Large Language Models (LLMs) handle retrieved information.

>> The Problem They're Solving

When fine-tuning LLMs for RAG, we typically use human-written responses that weren't created with the retrievals in mind. This creates two major issues:
- Misalignment between the retrievals and the responses
- Training on out-of-distribution (OOD) text that the model wouldn't naturally generate

These issues lead to hallucinations and poor performance in RAG systems.

>> Their Innovative Solution

The researchers propose a simple yet effective approach:
1. Generate multiple response candidates from the LLM itself, using the instruction and the retrievals
2. Filter these self-generated responses for correctness against gold answers
3. Train the model on these in-distribution self-demos instead of human-written responses
4. When no good candidate is found, train the model to generate a refusal

This method ensures the training data matches the model's own distribution while still providing accurate supervision.

>> Technical Implementation Details

Their implementation uses the Llama-3-8B-Instruct and Llama-3-70B-Instruct models, with fairseq2 for training and vLLM for inference. They employ automatic prompt optimization to generate diverse response candidates and use a tournament-style filtering process to select the best responses.
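The four-step recipe above can be sketched in a few lines. This is only an illustrative outline, not the paper's code: `generate_candidates` and `matches_gold` are hypothetical stubs standing in for real LLM sampling and answer checking.

```python
# Hypothetical sketch of the SD-RA-IT data-construction loop.
# Stubs replace the real LLM sampler and answer checker.

REFUSAL = "I cannot answer this based on the retrieved documents."

def generate_candidates(instruction, retrievals, n=4):
    # Stub: a real system would sample n responses from the LLM itself,
    # conditioned on the instruction and the retrieved passages.
    return [f"candidate {i} for: {instruction}" for i in range(n)]

def matches_gold(response, gold_answer):
    # Stub: a real system would use exact match or an LLM judge.
    return gold_answer.lower() in response.lower()

def build_self_demo(instruction, retrievals, gold_answer):
    """Return one (prompt, target) pair of in-distribution training text."""
    for cand in generate_candidates(instruction, retrievals):
        if matches_gold(cand, gold_answer):
            return (instruction, cand)   # keep the model's own wording
    return (instruction, REFUSAL)        # no correct candidate: train a refusal
```

The key design point is the fallback: when filtering discards every candidate, the example still contributes supervision, teaching the model to refuse rather than hallucinate.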
The team evaluated two training objectives:
- Supervised fine-tuning (SFT) with cross-entropy loss
- Direct preference optimization (DPO), using the rejected responses as negative examples

>> Impressive Results

The results speak for themselves:
- Higher precision (accuracy on attempted questions)
- Higher recall (successful attempts on answerable questions)
- Lower counterfactual accuracy (better at refusing questions it would get wrong)
- Minimal degradation in non-RAG settings
- Superior performance across different numbers of retrievals

Most importantly, SD-RA-IT models are significantly better at avoiding hallucinations: they refuse to answer questions they're likely to get wrong, while still correctly extracting answers from relevant retrievals.

This research provides valuable insights for anyone working with RAG systems. By training on self-generated demonstrations, we can create more reliable and accurate RAG-enabled language models.
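For readers unfamiliar with the second objective, the standard DPO loss for one (chosen, rejected) pair can be written out numerically. This is the generic DPO formula, not code from the paper; inputs are summed log-probabilities of each response under the policy and the frozen reference model.

```python
import math

def dpo_loss(pi_chosen, pi_rejected, ref_chosen, ref_rejected, beta=0.1):
    """Standard DPO loss for one preference pair.

    pi_*  : log-prob of the response under the policy being trained
    ref_* : log-prob of the same response under the frozen reference model
    """
    margin = beta * ((pi_chosen - ref_chosen) - (pi_rejected - ref_rejected))
    # -log(sigmoid(margin)): small when the policy prefers the chosen response
    return -math.log(1.0 / (1.0 + math.exp(-margin)))
```

In the SD-RA-IT setting, the filtered self-demos would play the "chosen" role and the discarded candidates the "rejected" role.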
Improving Large Language Model Accuracy with Diverse Questions
Explore top LinkedIn content from expert professionals.
Summary
Improving large language model accuracy with diverse questions means training artificial intelligence systems to better understand and answer a wide range of queries, boosting their reliability and helping them avoid mistakes. By introducing varied and creative questions during training, these models learn to handle complex scenarios and provide more accurate responses to real-world problems.
- Encourage varied questioning: Include questions of different complexity and subject matter to expose models to a broad set of challenges and reasoning pathways.
- Use iterative feedback: Allow models to review and refine their answers through self-critique or multiagent debate, improving the quality and correctness of responses.
- Adapt training strategies: Select appropriate methods to match the difficulty of incoming questions, balancing speed and precision for both simple and challenging tasks.
🚀 Large language models understand facts. But can they cook up a plan when the recipe is brand-new? Welcome "Analogy-Augmented Generation (AAG)":

👉 Retrieval is good – retrieval plus analogy is better. AAG pairs a procedural memory of past examples with an "analogy engine" that rewrites the user query into 4 focused sub-questions, grabs similar procedures, then lets the LLM remix them. On an unseen LangChain dataset (LCStep), AAG beat classic RAG, few-shot and zero-shot prompts in 70–98% of pairwise comparisons, even trouncing a ReAct agent 98% of the time.

👉 Self-critique closes the loop. After drafting an answer, the same model plays critic for up to three cycles, suggesting edits and rewriting itself. This iterative pass lifts quality another 7–10%, showing that "judge-and-fix" can trump ever-larger prompts.

👉 It scales from code to cooking, and even maths. Besides LCStep (276 LangChain tutorials), the team tests on 10k RecipeNLG meals and 270 CHAMP math problems. Despite very different vocabularies, AAG was still preferred in 56% vs 16% of blind human ratings for recipes, and edged out RAG on competition-level maths.

Why it matters 🤔
* 🔧 Domain adaptation: Need instructions for a library released after your model's cutoff? Just drop its docs into procedural memory.
* 🏗️ Autonomous agents: Clearer, step-level plans mean safer code execution and robotics.
* 🏥 Industry verticals: Think troubleshooting guides for new medical devices or compliance workflows that change monthly.

🤚 Limitations & open questions
AAG stores procedures as linear step chains. Real-world tasks often branch and loop; the authors plan to add explicit dependency graphs next. And five retrieval-and-rewrite passes mean ~7x the latency of vanilla RAG.

Will the next wave of LLM tooling come from teaching models to reason by analogy rather than just memorize? Full paper link in the comments.

#AIResearch #MachineLearning #RetrievalAugmentedGeneration #LangChain #LLM
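The "judge-and-fix" loop described above is easy to express in code. A minimal sketch, with hypothetical `draft`, `critique`, and `rewrite` callables standing in for the LLM calls; the early-exit when the critic is satisfied is my assumption, the paper simply caps the loop at three cycles.

```python
# Illustrative AAG-style self-critique loop: the same model drafts,
# critiques, and rewrites its answer for up to `max_cycles` passes.

def critique_and_refine(question, procedures, draft, critique, rewrite,
                        max_cycles=3):
    answer = draft(question, procedures)       # initial draft from retrieved procedures
    for _ in range(max_cycles):
        feedback = critique(question, answer)  # model plays critic on its own output
        if feedback is None:                   # critic has no edits: stop early
            return answer
        answer = rewrite(answer, feedback)     # apply the suggested edits
    return answer
```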
Adaptive-RAG: Enhancing Large Language Models by Question-Answering Systems with Dynamic Strategy Selection for Query Complexity

Quick read: https://lnkd.in/g5WYjRaj

Researchers from the School of Computing and the Graduate School of AI at the Korea Advanced Institute of Science and Technology propose Adaptive-RAG, a novel adaptive QA framework designed to bridge the gap between simple and complex queries. Adaptive-RAG uses a classifier to predict the complexity level of each incoming query, allowing the model to select the most apt strategy for information retrieval and integration. This adaptability streamlines the process for simpler questions, eliminating undue computational overhead, while ensuring that complex queries receive the meticulous attention they require. The classifier, trained on a dataset with automatically assigned complexity labels, is the linchpin of the approach.

Adaptive-RAG's efficacy was validated on open-domain QA datasets spanning a wide range of query complexities, demonstrating a notable enhancement in both the efficiency and the accuracy of QA systems across the board. For instance, in benchmarks with the FLAN-T5 series of models, Adaptive-RAG struck a balance between computational efficiency and response accuracy, outperforming traditional methods by reducing the time per query by up to 27.18 seconds for the most complex queries while maintaining high accuracy across simple, single-step, and multi-step questions.

Paper: https://lnkd.in/gui_trCA

#artificialintelligence #ai #datascience
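The routing idea can be sketched in a few lines. This is a toy illustration, not the paper's system: the keyword heuristic below is a hypothetical stand-in for the trained complexity classifier, and the strategy functions are placeholders for the real no-retrieval, single-step, and multi-step pipelines.

```python
# Toy sketch of Adaptive-RAG routing: classify a query's complexity,
# then dispatch to the matching retrieval strategy.

def classify_complexity(query):
    # Stub heuristic; the paper trains a classifier on auto-labelled data.
    q = query.lower()
    if " and " in q or "compare" in q:
        return "multi-step"
    if any(w in q for w in ("who", "when", "where")):
        return "single-step"
    return "simple"

STRATEGIES = {
    "simple": lambda q: f"answer directly: {q}",            # no retrieval
    "single-step": lambda q: f"retrieve once, then answer: {q}",
    "multi-step": lambda q: f"iterative retrieve-and-reason: {q}",
}

def adaptive_rag(query):
    return STRATEGIES[classify_complexity(query)](query)
```

The design point is that the classifier's cost is tiny compared with an unnecessary multi-step retrieval pass, which is where the per-query time savings come from.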
MultiAgent Finetuning of LLMs: Self-Improvement with Diverse Reasoning Chains

This paper proposes a new approach to self-improvement in LLMs that mitigates the diminishing performance gains seen after multiple rounds of fine-tuning. By employing agents with distinct roles, the method improves the feedback mechanism and overall output quality, mitigating limitations inherent in single-agent self-improvement methods.

𝗞𝗲𝘆 𝗰𝗼𝗻𝘁𝗿𝗶𝗯𝘂𝘁𝗶𝗼𝗻𝘀:
- leverages multiagent interaction as an approach to self-improvement with language models
- specializes models with distinct roles to enable detailed feedback between agents and to improve final output quality
- demonstrates that finetuned agents can generalize across different datasets in a zero-shot manner

𝗢𝘃𝗲𝗿𝘃𝗶𝗲𝘄 𝗼𝗳 𝗠𝘂𝗹𝘁𝗶𝗮𝗴𝗲𝗻𝘁 𝗙𝗶𝗻𝗲𝘁𝘂𝗻𝗶𝗻𝗴
i) First use multiagent debate and majority voting to create finetuning datasets.
ii) These datasets are then used to finetune the generation and critic agents.
iii) When finetuning generation models, use the majority-voted result (the "correct" output) to select first-round responses from each agent.
iv) Finetune critic models using responses from the final round, based on whether the responses match the majority-voted result (a mix of "correct" and "incorrect" outputs).
v) The finetuned models are combined through multiagent debate to generate more accurate answers.
vi) Applying multiple rounds of finetuning iterations can significantly boost performance.

𝗖𝗼𝗺𝗽𝗼𝗻𝗲𝗻𝘁𝘀
i) Multiagent Debate
- involves a series of N language model agents, each tasked with generating a response to a given problem
- after the initial responses, a debate round is initiated among the agents
- each agent concatenates and summarizes the responses from the other agents
- the final result is determined by majority vote over the outputs of the last debate round

ii) Finetuning Generation Models
- these models rely on diverse reasoning chains to promote diversity
- constructed from the N generation models, each of which generates a response to a given input

iii) Finetuning Critic Models
- constructed from the critic models, which evaluate the outputs from all generation agents and then select or synthesize the best responses

iv) Multiple Iterations of Finetuning
- the finetuned generation and critic agents are used to gather datasets for the next iteration through multiagent debate

v) Inference
- multiagent debate among the finetuned generation and critic agents, where each generation agent participates in the first round of debate, followed by each critic agent in subsequent rounds

𝗥𝗲𝘀𝘂𝗹𝘁𝘀
- the method outperforms all baselines (majority vote, debate, the STaR approach, and majority voting with finetuning)
- with multiple iterations of finetuning, Multiagent FT consistently improves performance over time, e.g. accuracy rises from 58.8% to 66.0% for Phi-3 and from 22.5% to 28.2% for Mistral

𝗣𝗮𝗽𝗲𝗿: https://lnkd.in/e6Zaqm5A
𝗖𝗼𝗱𝗲: https://lnkd.in/eTK_w6Pq
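The debate-plus-majority-vote step that produces the finetuning datasets can be sketched compactly. This is a toy illustration under my own simplifications: `agents` is a list of plain answer functions standing in for the N language-model agents, and the context string stands in for the concatenate-and-summarize step.

```python
from collections import Counter

# Toy sketch of multiagent debate with majority voting.

def debate_round(agents, problem, previous=None):
    # Each agent sees the problem plus (after round one) the other answers.
    context = problem if previous is None else f"{problem} | others said: {previous}"
    return [agent(context) for agent in agents]

def majority_vote(answers):
    # Most common answer across agents wins.
    return Counter(answers).most_common(1)[0][0]

def multiagent_debate(agents, problem, rounds=2):
    answers = None
    for _ in range(rounds):
        answers = debate_round(agents, problem, answers)
    return majority_vote(answers)
```

In the paper's pipeline, the majority-voted result then labels first-round responses for finetuning the generation agents and final-round responses for finetuning the critics.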