Improving Predictive Accuracy


  • Andrew Ng

    DeepLearning.AI, AI Fund and AI Aspire

    2,509,618 followers

    Last week, I described four design patterns for AI agentic workflows that I believe will drive significant progress: Reflection, Tool use, Planning and Multi-agent collaboration. Instead of having an LLM generate its final output directly, an agentic workflow prompts the LLM multiple times, giving it opportunities to build step by step to higher-quality output. Here, I'd like to discuss Reflection. It's relatively quick to implement, and I've seen it lead to surprising performance gains. You may have had the experience of prompting ChatGPT/Claude/Gemini, receiving unsatisfactory output, delivering critical feedback to help the LLM improve its response, and then getting a better response. What if you automate the step of delivering critical feedback, so the model automatically criticizes its own output and improves its response? This is the crux of Reflection. Take the task of asking an LLM to write code. We can prompt it to generate the desired code directly to carry out some task X. Then, we can prompt it to reflect on its own output, perhaps as follows: Here’s code intended for task X: [previously generated code] Check the code carefully for correctness, style, and efficiency, and give constructive criticism for how to improve it. Sometimes this causes the LLM to spot problems and come up with constructive suggestions. Next, we can prompt the LLM with context including (i) the previously generated code and (ii) the constructive feedback, and ask it to use the feedback to rewrite the code. This can lead to a better response. Repeating the criticism/rewrite process might yield further improvements. This self-reflection process allows the LLM to spot gaps and improve its output on a variety of tasks including producing code, writing text, and answering questions. 
And we can go beyond self-reflection by giving the LLM tools that help evaluate its output; for example, running its code through a few unit tests to check whether it generates correct results on test cases or searching the web to double-check text output. Then it can reflect on any errors it found and come up with ideas for improvement. Further, we can implement Reflection using a multi-agent framework. I've found it convenient to create two agents, one prompted to generate good outputs and the other prompted to give constructive criticism of the first agent's output. The resulting discussion between the two agents leads to improved responses. Reflection is a relatively basic type of agentic workflow, but I've been delighted by how much it improved my applications’ results. If you’re interested in learning more about reflection, I recommend: - Self-Refine: Iterative Refinement with Self-Feedback, by Madaan et al. (2023) - Reflexion: Language Agents with Verbal Reinforcement Learning, by Shinn et al. (2023) - CRITIC: Large Language Models Can Self-Correct with Tool-Interactive Critiquing, by Gou et al. (2024) [Original text: https://lnkd.in/g4bTuWtU ]
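The generate, critique, rewrite loop described above can be sketched in a few lines. This is a minimal illustration, not the author's implementation: `reflect_and_rewrite`, the `llm` callable, and the `fake_llm` stand-in are hypothetical names, and the initial and rewrite prompt wordings are assumptions. Only the critique prompt follows the post.

```python
from typing import Callable


def reflect_and_rewrite(task: str, llm: Callable[[str], str], rounds: int = 2) -> str:
    """Generate a draft, then alternate critique and rewrite steps."""
    # Step 1: ask for the code directly (prompt wording is an assumption).
    draft = llm(f"Write code to carry out this task: {task}")
    for _ in range(rounds):
        # Step 2: ask the model to criticize its own output.
        critique = llm(
            f"Here's code intended for task {task}:\n{draft}\n"
            "Check the code carefully for correctness, style, and efficiency, "
            "and give constructive criticism for how to improve it."
        )
        # Step 3: feed back (i) the previous code and (ii) the feedback.
        draft = llm(
            f"Task: {task}\nPrevious code:\n{draft}\nFeedback:\n{critique}\n"
            "Use the feedback to rewrite the code."
        )
    return draft


def fake_llm(prompt: str) -> str:
    """Offline stand-in for a real model client (hypothetical)."""
    if "constructive criticism" in prompt:
        return "Add a docstring."
    if "Use the feedback" in prompt:
        return 'def add(a, b):\n    """Return a + b."""\n    return a + b'
    return "def add(a, b):\n    return a + b"


if __name__ == "__main__":
    print(reflect_and_rewrite("add two numbers", fake_llm, rounds=1))
```

Swapping `fake_llm` for a real client keeps the loop unchanged. The tool-assisted variant mentioned above would add unit-test results to the critique prompt.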

  • Sebastian Raschka, PhD

    ML/AI research engineer. Author of Build a Large Language Model From Scratch (amzn.to/4fqvn0D) and Ahead of AI (magazine.sebastianraschka.com), on how LLMs work and the latest developments in the field.

    239,489 followers

    Training LLMs for spam classification: I added 14 experiments comparing different approaches: https://lnkd.in/gTNVvGcj
    - which token to train
    - which layers to train
    - different model sizes
    - LoRA
    - unmasking
    - and more!
    Any additional experiments you'd like to see? And here are the takeaways for the table shown in the picture:
    1. Training the Last vs. First Output Token (Row 1 vs. 2): Training the last output token results in substantially better performance compared to the first. This improvement is expected due to the causal self-attention mask.
    2. Training the Last Transformer Block vs. Last Layer (Row 1 vs. 3): Training the entire last transformer block also yields substantially better results than training only the last layer.
    3. Training All Layers vs. Last Transformer Block (Row 1 vs. 4): Training all layers shows a modest improvement of ~2% over just training the last transformer block, but it takes almost three times as long to train.
    4. Using Larger Pretrained Models (Row 1 vs. 5, and Row 1 vs. 6 and 7): Employing a 3x larger pretrained model leads to worse results. However, using a 5x larger model improves performance compared to the initial model, as was anticipated. Similarly, the 12x larger model improves the predictive performance even further. (The medium model was perhaps not well pretrained, or the particular finetuning configuration does not work as well for this model.)
    5. Using a Model with Random Weights vs. Pretrained Weights (Row 1 vs. 8): Utilizing a model with random weights yields results that are only slightly worse (by 1.3%) compared to using pretrained weights.
    6. Using LoRA (Low-Rank Adaptation) vs. Training All Layers (Row 9 vs. 4): Keeping the model frozen and adding trainable LoRA layers (see Appendix E for details) is a viable alternative to training all model parameters and even improves the performance by 1 percentage point. As can be seen from the 1% smaller gap between the training and validation accuracy when using LoRA, this is likely due to less overfitting.
    7. Padding Input to Full Context Length vs. Longest Training Example (Row 1 vs. 10): Padding the input to the full supported context length yields significantly worse results.
    8. Padding vs. No Padding (Row 1 vs. 11 and 12): The `--no_padding` option disables the padding in the dataset, which requires training the model with a batch size of 1 since the inputs have variable lengths. This results in a better test accuracy but takes longer to train. In row 12, we additionally enable gradient accumulation with 8 steps to achieve the same batch size as in the other experiments.
    9. Disabling the Causal Attention Mask (Row 1 vs. 13): Disables the causal attention mask used in the multi-head attention module. This means all tokens can attend to all other tokens. The model accuracy is slightly improved compared to the GPT model with causal mask.
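Two of the points above are easier to see with a small sketch. The first function builds a causal mask and shows why only the last token sees the whole input, which is the reason the post gives for training on the last output token. The second counts the trainable parameters LoRA adds for a single weight matrix. The function names, the 768-dimensional layer, and the rank of 8 are illustrative assumptions, not values from the experiments.

```python
def causal_mask(num_tokens: int) -> list[list[int]]:
    """Return mask[i][j] = 1 if token i may attend to token j (only j <= i)."""
    return [[1 if j <= i else 0 for j in range(num_tokens)]
            for i in range(num_tokens)]


def visible_token_counts(mask: list[list[int]]) -> list[int]:
    """Number of positions each token can attend to under the mask."""
    return [sum(row) for row in mask]


def lora_trainable_params(d_in: int, d_out: int, rank: int) -> int:
    """Parameters in the low-rank factors A (rank x d_in) and B (d_out x rank)."""
    return rank * (d_in + d_out)


if __name__ == "__main__":
    # Only the final position has a full view of a 5-token input.
    print(visible_token_counts(causal_mask(5)))
    # Full matrix vs. LoRA factors for one assumed 768 x 768 layer at rank 8.
    print(768 * 768, lora_trainable_params(768, 768, rank=8))
```

Disabling the mask (row 13) corresponds to a matrix of all ones, in which case every position sees the full input.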

  • Carl Seidman, CSP, CPA

    Premier FP&A, Modeling + Excel education you can immediately use | 325,000+ LinkedIn Learning | Professor in Data Analytics @ Rice University | Microsoft MVP | Join newsletter for Excel, FP&A + financial modeling tips👇

    92,062 followers

    I recently demoed 4 FP&A platforms that claim to effectively forecast 13-week cash flows using AI. Three of the companies are dedicated planning tools. One company is a financial reporting tool. Despite them being leaders in the FP&A space, seeing their 13-week cash flow tools left me unconvinced.
    -----------
    What the FP&A tools got right?
    (1) Cash flow forecasts were generated in a flash
    It was remarkable to see how quickly these tools can create a direct format 13-week cash flow. It took seconds. When you need to update a cash flow model, taking days or weeks to refresh a rolling forecast isn't an option.
    (2) Cash flow forecasts were traceable
    Many company cash flow models are driven by lots of data. Auditing Excel formulas isn't a great use of time for Treasurers or FP&As. These tools make it easy to vouch back to the root data and explore the detail.
    (3) Cash flows are good enough for companies that don't have to worry
    Some FP&As struggle to accept that top-down forecasts may be good enough for most companies that don't have to worry much about cash flow. That's because they're flush with liquidity, have a line of credit, and aren't laser-focused or hands-on with cash. A decent forecast that isn't remarkably accurate isn't always a liability. It can be an asset since it's a reasonable-enough snapshot in time.
    -----------
    What the FP&A tools get wrong?
    (4) Forecasts use past data and trends for almost all assumptions about the future
    If you're managing cash flow for a business that's seasonal, volatile, or has cash flow issues, relying on past data and trends can be reckless and lazy. When it comes to cash flow management, relying too much on historical trends can lead to really poor assumptions. If decisions are based on those bad assumptions, you get bad decisions too.
    (5) Forecasts were mostly observational, not prescriptive
    Unless you're working with a large corporation, where operations are steady and bank accounts are full, cash flow forecasts should enable thoughtful choices. That means the model should reveal operational drivers, opportunities, and scenarios. These tools don't really allow for these basic features. They're mostly just reports and data extrapolations.
    (6) Forecasts didn't capture nuance
    In the example I show here, my cash flow model can quickly and easily incorporate actuals, weekly and monthly forecast periods. I'm able to hold back 20% of accounts payable. I can pay back the A/P at any rate and timing that I want. I can be aggressive with catch-up payments and early-payment discounts. It's what a company needs to be able to see, whether it's doing $20 million or $200 million in revenue.
    It's not that AI can't do cash flow forecasting and modeling. It's that it can't do it as well as you'd hope. And that's the problem. Cash flows are full of nuance. AI-driven cash flow forecasts aren't great at understanding nuance.
    You can learn cash flows with me live: https://lnkd.in/grQVkeyE

  • Brij Kishore Pandey

    AI Architect & AI Engineer | Building Agentic Systems & Scalable AI Solutions

    727,429 followers

    RAG stands for Retrieval-Augmented Generation. It’s a technique that combines the power of LLMs with real-time access to external information sources. Instead of relying solely on what an AI model learned during training (which can quickly become outdated), RAG enables the model to retrieve relevant data from external databases, documents, or APIs, and then use that information to generate more accurate, context-aware responses.
    How does RAG work?
    𝗥𝗲𝘁𝗿𝗶𝗲𝘃𝗲: The system searches for the most relevant documents or data based on your query, using advanced search methods like semantic or vector search.
    𝗔𝘂𝗴𝗺𝗲𝗻𝘁𝗮𝘁𝗶𝗼𝗻: Instead of just using the original question, RAG 𝗮𝘂𝗴𝗺𝗲𝗻𝘁𝘀 (enriches) the prompt by adding the retrieved information directly into the input for the AI model. This means the model doesn’t just rely on what it “remembers” from training: it now sees your question 𝘱𝘭𝘶𝘴 the latest, domain-specific context.
    𝗚𝗲𝗻𝗲𝗿𝗮𝘁𝗲: The LLM takes the retrieved information and crafts a well-informed, natural language response.
    𝗪𝗵𝘆 𝗱𝗼𝗲𝘀 𝗥𝗔𝗚 𝗺𝗮𝘁𝘁𝗲𝗿?
    Improves accuracy: By referencing up-to-date or proprietary data, RAG reduces outdated or incorrect answers.
    Context-aware: Responses are tailored using the latest information, not just what the model “remembers.”
    Reduces hallucinations: RAG helps prevent AI from making up facts by grounding answers in real sources.
    Example: Imagine asking an AI assistant, “What are the latest trends in renewable energy?” A traditional LLM might give you a general answer based on old data. With RAG, the model first searches for the most recent articles and reports, then synthesizes a response grounded in that up-to-date information.
    Illustration by Deepak Bhardwaj
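The retrieve and augment steps can be sketched with nothing but the standard library. This is a toy under stated assumptions: word-count vectors stand in for a real embedding model, the three documents are invented, and `embed`, `retrieve`, and `augment` are hypothetical names. The generate step is left as a comment because it depends on whichever LLM client is in use.

```python
import math
import re
from collections import Counter


def embed(text: str) -> Counter:
    """Toy embedding: lowercase word counts (stand-in for a vector model)."""
    return Counter(re.findall(r"\w+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[word] * b[word] for word in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query: str, documents: list[str], k: int = 1) -> list[str]:
    """Retrieve: rank documents by similarity to the query, keep the top k."""
    q = embed(query)
    return sorted(documents, key=lambda d: cosine(q, embed(d)), reverse=True)[:k]


def augment(query: str, context: list[str]) -> str:
    """Augment: place the retrieved passages into the prompt with the question."""
    joined = "\n".join(f"- {passage}" for passage in context)
    return f"Answer using only the context below.\nContext:\n{joined}\nQuestion: {query}"


# Invented example documents.
DOCS = [
    "Solar panel installations grew quickly last year.",
    "The cafeteria menu changes every Monday.",
    "Offshore wind capacity is expanding in Europe.",
]

if __name__ == "__main__":
    question = "What is happening with solar panel installations?"
    prompt = augment(question, retrieve(question, DOCS, k=1))
    print(prompt)  # Generate: this prompt would be sent to the LLM.
```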

  • Rahul Agarwal

    Staff ML Engineer | Meta, Roku, Walmart | 1:1 @ topmate.io/MLwhiz

    45,843 followers

    A Few Lessons from Deploying and Using LLMs in Production
    Deploying LLMs can feel like hiring a hyperactive genius intern: they dazzle users while potentially draining your API budget. Here are some insights I’ve gathered:
    1. “Cheap” is a Lie You Tell Yourself: Cloud costs per call may seem low, but the overall expense of an LLM-based system can skyrocket. Fixes:
    - Cache repetitive queries: Users ask the same thing at least 100x/day.
    - Gatekeep: Use cheap classifiers (BERT) to filter “easy” requests. Let LLMs handle only the complex 10% and your current systems handle the remaining 90%.
    - Quantize your models: Shrink LLMs to run on cheaper hardware without massive accuracy drops.
    - Asynchronously build your caches: Pre-generate common responses before they’re requested, or gracefully fail the first time a query comes in and cache for the next time.
    2. Guard Against Model Hallucinations: Sometimes, models express answers with such confidence that distinguishing fact from fiction becomes challenging, even for human reviewers. Fixes:
    - Use RAG: Just a fancy way of saying you provide your model the knowledge it requires in the prompt itself, by querying some database based on semantic matches with the query.
    - Guardrails: Validate outputs using regex or cross-encoders to establish a clear decision boundary between the query and the LLM’s response.
    3. The best LLM is often a discriminative model: You don’t always need a full LLM. Consider knowledge distillation: use a large LLM to label your data and then train a smaller, discriminative model that performs similarly at a much lower cost.
    4. It's not about the model, it is about the data on which it is trained: A smaller LLM might struggle with specialized domain data, and that’s normal. Fine-tune your model on your specific data set by starting with parameter-efficient methods (like LoRA or Adapters) and using synthetic data generation to bootstrap training.
    5. Prompts are the new Features: Prompts are the new features in your system. Version them, run A/B tests, and continuously refine using online experiments. Consider bandit algorithms to automatically promote the best-performing variants.
    What do you think? Have I missed anything? I’d love to hear your “I survived LLM prod” stories in the comments!
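The caching and gatekeeping fixes from the first lesson can be sketched as follows. Everything here is a stand-in: `FAQ` plays the role of "your current systems", `is_easy` is a dictionary lookup where the post suggests a cheap classifier such as BERT, and `expensive_llm` is a placeholder for a paid API call.

```python
from functools import lru_cache

# Hypothetical FAQ table standing in for "your current systems".
FAQ = {"what are your opening hours?": "We are open 9am to 5pm."}

# Counter used only to show how many paid calls were made.
LLM_CALLS = {"count": 0}


def expensive_llm(query: str) -> str:
    """Placeholder for a paid LLM API call."""
    LLM_CALLS["count"] += 1
    return f"LLM answer to: {query}"


def is_easy(query: str) -> bool:
    """Placeholder gate; a real system might use a small classifier here."""
    return query in FAQ


@lru_cache(maxsize=1024)
def _answer_normalized(query: str) -> str:
    """Cached path: repeated queries never reach the LLM twice."""
    if is_easy(query):
        return FAQ[query]
    return expensive_llm(query)


def answer(query: str) -> str:
    """Normalize first so trivially different spellings share a cache entry."""
    return _answer_normalized(query.strip().lower())
```

Calling `answer` twice with the same question, even with different casing or surrounding whitespace, triggers at most one `expensive_llm` call, and FAQ-style questions trigger none. An in-process `lru_cache` is the simplest choice for a sketch. A shared cache would be needed across multiple servers.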

  • Sahar Mor

    I help researchers and builders make sense of AI | ex-Stripe | aitidbits.ai | Angel Investor

    42,074 followers

    In the last three months alone, over ten papers outlining novel prompting techniques were published, boosting LLMs’ performance by a substantial margin. Two weeks ago, a groundbreaking paper from Microsoft demonstrated how a well-prompted GPT-4 outperforms Google’s Med-PaLM 2, a specialized medical model, solely through sophisticated prompting techniques. Yet, while our X and LinkedIn feeds buzz with ‘secret prompting tips’, a definitive, research-backed guide aggregating these advanced prompting strategies is hard to come by. This gap prevents LLM developers and everyday users from harnessing these novel frameworks to enhance performance and achieve more accurate results. https://lnkd.in/g7_6eP6y In this AI Tidbits Deep Dive, I outline six of the best and recent prompting methods: (1) EmotionPrompt - inspired by human psychology, this method utilizes emotional stimuli in prompts to gain performance enhancements (2) Optimization by PROmpting (OPRO) - a DeepMind innovation that refines prompts automatically, surpassing human-crafted ones. This paper discovered the “Take a deep breath” instruction that improved LLMs’ performance by 9%. (3) Chain-of-Verification (CoVe) - Meta's novel four-step prompting process that drastically reduces hallucinations and improves factual accuracy (4) System 2 Attention (S2A) - also from Meta, a prompting method that filters out irrelevant details prior to querying the LLM (5) Step-Back Prompting - encouraging LLMs to abstract queries for enhanced reasoning (6) Rephrase and Respond (RaR) - UCLA's method that lets LLMs rephrase queries for better comprehension and response accuracy Understanding the spectrum of available prompting strategies and how to apply them in your app can mean the difference between a production-ready app and a nascent project with untapped potential. Full blog post https://lnkd.in/g7_6eP6y

  • Anders Liu-Lindberg

    Leading advisor to senior Finance and FP&A leaders on creating impact through business partnering | Interim | VP Finance | Business Finance

    455,282 followers

    𝗠𝗰𝗞𝗶𝗻𝘀𝗲𝘆 𝗼𝘂𝘁𝗹𝗶𝗻𝗲𝗱 𝟲 𝗮𝗱𝘃𝗮𝗻𝗰𝗲𝗱 𝗙𝗣&𝗔 𝗽𝗿𝗮𝗰𝘁𝗶𝗰𝗲𝘀 𝗳𝗼𝗿 𝗯𝗲𝘁𝘁𝗲𝗿 𝗳𝗼𝗿𝗲𝗰𝗮𝘀𝘁𝗶𝗻𝗴. Most finance teams know them. Few actually implement them consistently. Why? Because doing it right has always been painfully manual. 𝗛𝗲𝗿𝗲'𝘀 𝘄𝗵𝗮𝘁 𝘀𝘁𝗿𝘂𝗰𝗸 𝗺𝗲: AI is changing this. Fast. The six practices McKinsey recommends are now achievable at scale: • 𝗣𝗿𝗼𝗯𝗮𝗯𝗶𝗹𝗶𝘁𝘆-𝘄𝗲𝗶𝗴𝗵𝘁𝗲𝗱 𝘀𝗰𝗲𝗻𝗮𝗿𝗶𝗼𝘀 – AI can run hundreds of scenarios and assign P values automatically, not just the three you had time to build manually. • 𝗧𝗿𝘂𝗲 𝗺𝗼𝗺𝗲𝗻𝘁𝘂𝗺 𝗰𝗮𝘀𝗲𝘀 – AI separates baseline trends from management initiatives without the spreadsheet gymnastics. • 𝗕𝗲𝗮𝗿 𝗰𝗮𝘀𝗲 𝗺𝗼𝗱𝗲𝗹𝗶𝗻𝗴 – AI identifies downside risks and models them before you're blindsided. • 𝗖𝗼𝗻𝘀𝗶𝘀𝘁𝗲𝗻𝘁 𝗺𝗮𝗰𝗿𝗼 𝗮𝘀𝘀𝘂𝗺𝗽𝘁𝗶𝗼𝗻𝘀 – AI flags when one business unit uses different GDP assumptions than another. • 𝗗𝗶𝘀𝗮𝗴𝗴𝗿𝗲𝗴𝗮𝘁𝗲𝗱 𝗶𝗻𝗳𝗹𝗮𝘁𝗶𝗼𝗻 – AI tracks the specific components that actually affect your business, not just CPI averages. • 𝗖𝗼𝗻𝘁𝗶𝗻𝘂𝗼𝘂𝘀 𝗯𝗮𝗰𝗸 𝘁𝗲𝘀𝘁𝗶𝗻𝗴 – AI compares forecasts to actuals weekly and learns from variances automatically. 𝗧𝗵𝗲 𝗯𝗿𝘂𝘁𝗮𝗹 𝘁𝗿𝘂𝘁𝗵: Human bias has always been the weak link in forecasting. Optimism creeps in. Assumptions go unchallenged. P-values are applied inconsistently across business units. AI doesn't have a political agenda. It doesn't inflate projections to look good in front of the board. It just processes data. The result? Faster forecasts. More accurate projections. And decisions based on reality, not hope. 𝗠𝘆 𝗮𝗱𝘃𝗶𝗰𝗲? 𝟭. 𝗦𝘁𝗮𝗿𝘁 𝘄𝗶𝘁𝗵 𝗯𝗮𝗰𝗸 𝘁𝗲𝘀𝘁𝗶𝗻𝗴 Use AI to compare your forecasts over the last 12 months with actuals. Find where bias lives in your models. 𝟮. 𝗔𝘂𝘁𝗼𝗺𝗮𝘁𝗲 𝘀𝗰𝗲𝗻𝗮𝗿𝗶𝗼 𝗴𝗲𝗻𝗲𝗿𝗮𝘁𝗶𝗼𝗻 Stop building three scenarios manually. Let AI generate probability-weighted ranges based on actual data patterns. 𝟯. 𝗘𝗻𝗳𝗼𝗿𝗰𝗲 𝗮𝘀𝘀𝘂𝗺𝗽𝘁𝗶𝗼𝗻 𝗰𝗼𝗻𝘀𝗶𝘀𝘁𝗲𝗻𝗰𝘆 Use AI to flag when macro assumptions differ across business units. Inconsistency kills forecast accuracy. 
Because here's what separates finance teams that drive decisions from those that just report numbers: They use AI to remove bias and deliver forecasts that leadership can actually trust. 𝗦𝗼 𝗯𝗲 𝗵𝗼𝗻𝗲𝘀𝘁: Which of these six practices is your biggest gap right now? ---------- 🧑💼 I'm a partner at Business Partnering Institute 🤝 We help increase the influence of your finance team 🔔 To see more content, hit the bell on my profile 📘 Order our new book now: https://bit.ly/4h2P9AA 🧑🎓 Enroll in our LinkedIn course: https://bit.ly/4a5fB9l 📻 #FinanceMaster podcast: https://bit.ly/3NLSt73 📺 Follow us on YouTube: https://bit.ly/4bSBut6 📢 Join our WhatsApp channel: https://bit.ly/3WWGOrc 📄 Check out all our templates and cheat sheets here: https://lnkd.in/eC_zuCU4
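Two of the practices above reduce to simple arithmetic, sketched here without any AI involved. `forecast_bias` is one common way to measure systematic over- or under-forecasting in a back test, and `probability_weighted` is the expected value across scenarios. The function names and the sample figures are assumptions for illustration only.

```python
def forecast_bias(forecasts: list[float], actuals: list[float]) -> float:
    """Mean signed error as a fraction of actuals; positive means over-forecasting."""
    if len(forecasts) != len(actuals) or not forecasts:
        raise ValueError("forecasts and actuals must be non-empty and equal length")
    errors = [(f - a) / a for f, a in zip(forecasts, actuals)]
    return sum(errors) / len(errors)


def probability_weighted(scenarios: list[tuple[float, float]]) -> float:
    """Expected value across (probability, value) scenarios."""
    total_probability = sum(p for p, _ in scenarios)
    if abs(total_probability - 1.0) > 1e-9:
        raise ValueError("scenario probabilities must sum to 1")
    return sum(p * value for p, value in scenarios)


if __name__ == "__main__":
    # Invented figures: three periods of forecasts against actuals.
    print(forecast_bias([110.0, 105.0, 120.0], [100.0, 100.0, 100.0]))
    # Invented bear, base, and bull cases with their probabilities.
    print(probability_weighted([(0.2, 80.0), (0.5, 100.0), (0.3, 130.0)]))
```

A persistently positive bias for one business unit is the kind of pattern the back testing advice is meant to surface.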

  • Arvind Jain
    79,781 followers

    Really enjoyed the new InfoDeepSeek paper on agentic search. It puts data behind a critical question: Should we scale LLMs to brute-force better answers, or focus on making the inputs smarter? The paper makes a strong case for the latter. Even top-tier models struggle when fed irrelevant or noisy context. Gemini-2.5-Pro, for example, achieved just 22% answer accuracy on complex, open-ended queries. But when a weaker model like DeepSeek-V3 used Google instead of DuckDuckGo for retrieval, its performance nearly tripled—from 9% to 29%. At Glean, we’ve always believed that great enterprise search depends on context. That’s why we’ve prioritized making retrieval as relevant and precise as possible. Our customers count on Glean for answers they can trust. Smarter retrieval doesn’t just reduce latency and cost—it improves quality. And especially in high-stakes industries like healthcare, financial services, and cybersecurity, even a small accuracy gain can be the difference between confidence and risk. The takeaway: Scaling LLMs is expensive. Improving retrieval is scalable. If you’re building AI for real work, focus on giving your models better inputs, not just making them bigger. https://lnkd.in/g5meeQS7

  • Sarthak Rastogi

    AI engineer | Posts on agents + advanced RAG | Experienced in LLM research, ML engineering, Software Engineering

    26,666 followers

    5 steps that Amazon Finance took to improve their RAG pipeline's accuracy from 49% to 86% 📈
    - They started by fixing document chunking problems. They saw that the original fixed-size chunks were causing inaccuracies because they didn’t capture complete context. By using the QUILL Editor, they turned unstructured text into HTML, and then identified logical structures based on HTML tags. Just chunking the docs differently raised the accuracy from 49% to 64%. 😦
    - Next, prompt engineering. They aimed to: 1. stop hallucinations when there wasn’t relevant context, 2. support both concise and detailed answers, and 3. give citations. They also worked on implementing chain-of-thought reasoning to improve how the LLM structured its answers. This got the accuracy to 76%.
    - Finally they optimised their embedding models. They tested different first-party and third-party models and found that models like bge-base-en-v1.5 offered better performance on their dataset. Ultimately, they settled on Amazon Titan Embeddings G1. Better retrieval finally got them a better accuracy of 86%.
    These were targeted improvements to the RAG pipeline, and they all added up.
    Link to the article from AWS: https://lnkd.in/gFDBfhJm #AI #LLMs #RAG
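To make the chunking step concrete, here is a minimal sketch of structure-aware chunking using Python's built-in `html.parser`. It is not the pipeline from the AWS article: starting a new chunk at each `h1` to `h3` heading is an assumption about what "logical structures based on HTML tags" could mean, and `HeadingChunker` and `chunk_html` are hypothetical names.

```python
from html.parser import HTMLParser


class HeadingChunker(HTMLParser):
    """Start a new chunk at every heading so each chunk keeps its own context."""

    HEADINGS = {"h1", "h2", "h3"}

    def __init__(self) -> None:
        super().__init__()
        self.chunks: list[dict[str, str]] = []
        self._in_heading = False

    def handle_starttag(self, tag, attrs):
        if tag in self.HEADINGS:
            self.chunks.append({"heading": "", "text": ""})
            self._in_heading = True

    def handle_endtag(self, tag):
        if tag in self.HEADINGS:
            self._in_heading = False

    def handle_data(self, data):
        text = data.strip()
        if not text:
            return
        if not self.chunks:
            # Text that appears before any heading gets its own chunk.
            self.chunks.append({"heading": "", "text": ""})
        key = "heading" if self._in_heading else "text"
        current = self.chunks[-1]
        current[key] = (current[key] + " " + text).strip()


def chunk_html(html: str) -> list[dict[str, str]]:
    """Split an HTML document into heading-delimited chunks."""
    parser = HeadingChunker()
    parser.feed(html)
    parser.close()
    return parser.chunks
```

Unlike fixed-size chunks, each chunk here ends where the section ends, so a sentence is never cut off from the heading that explains it.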

  • Aishwarya Srinivasan
    633,658 followers

    If you’re building LLM applications today, reasoning is where the real leverage lies. And yet, I see a lot of engineers still treating LLM outputs as a single-shot black box. LLMs can reason, but only if you give them the right scaffolding and the right post-training. Here’s a mental model I’ve been using to think about LLM reasoning methods (see chart below): ✅ Inference-time reasoning methods: These are techniques that can be applied at inference time, without needing to retrain your model: → Tree of Thoughts (ToT), search through reasoning paths → Chain of Thought (CoT) prompting, prompt models to generate intermediate reasoning steps → Reasoning + Acting, use tools or function calls during reasoning → Self-feedback, prompt the model to critique and refine its own output → Episodic Memory Agents, maintain a memory buffer to improve multi-step reasoning → Self-consistency, sample multiple reasoning paths and select the most consistent answer ✅ Training-time enhancements: Where things get really powerful is when you post-train your model to improve reasoning, using human annotation or policy optimization: → Use Preference pairs and Reward Models to tune for better reasoning (RFT, Proximal PO, KL Regularization) → Apply RLHF, PPO + KL, Rejection Sampling + SFT, Advantage Estimation, and other advanced techniques to guide the model’s policy → Leverage multiple paths, offline trajectories, and expert demonstrations to expose the model to rich reasoning signals during training Here are my 2 cents 🫰 If you want production-grade LLM reasoning, you’ll need both, → Smart inference-time scaffolds to boost reasoning without slowing latency too much → Carefully tuned post-training loops to align the model’s policy with high-quality reasoning patterns → We’re also seeing increasing use of Direct Preference Optimization (DPO) and reference-free grading to further improve reasoning quality and stability. 
I’m seeing more and more teams combine both strategies, and the gap between "vanilla prompting" and "optimized reasoning loops" is only getting wider. 〰️〰️〰️ Follow me (Aishwarya Srinivasan) for more AI insight and subscribe to my Substack to find more in-depth blogs and weekly updates in AI: https://lnkd.in/dpBNr6Jg
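Of the inference-time methods listed above, self-consistency is the simplest to sketch: sample several reasoning paths and keep the most common final answer. The code below assumes each sample has already been reduced to its final answer string. `self_consistent_answer` and `make_fake_sampler` are hypothetical names, and the fake sampler replaces a real model call made with a non-zero temperature.

```python
from collections import Counter
from itertools import cycle
from typing import Callable


def self_consistent_answer(
    question: str, sample: Callable[[str], str], n: int = 5
) -> tuple[str, float]:
    """Sample n answers and return the majority answer with its vote share."""
    answers = [sample(question) for _ in range(n)]
    winner, votes = Counter(answers).most_common(1)[0]
    return winner, votes / n


def make_fake_sampler(final_answers: list[str]) -> Callable[[str], str]:
    """Offline stand-in that replays a fixed list of final answers."""
    stream = cycle(final_answers)
    return lambda question: next(stream)


if __name__ == "__main__":
    sampler = make_fake_sampler(["42", "41", "42", "42", "40"])
    print(self_consistent_answer("What is 6 * 7?", sampler, n=5))
```

The vote share doubles as a rough confidence signal: a low share suggests the reasoning paths disagree and the answer may need review.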
