Improving Prediction Stability in Large Language Models


Summary

Improving prediction stability in large language models means making sure these AI systems give more reliable and consistent answers, even when prompts or data change slightly. This involves reducing unexpected or random responses and preventing models from producing incorrect information, which is important for building trustworthy AI tools.

  • Test prompt variations: Try out different wording or formats for your prompts to spot which versions give the most stable and accurate results.
  • Use structured examples: Add sample questions and answers within prompts to help the model stay consistent, especially with complex tasks.
  • Monitor confidence and citations: Track the model’s confidence in its answers and check that it provides clear evidence or sources to reduce mistakes and boost reliability.
Summarized by AI based on LinkedIn member posts
  • Ross Dawson

    Futurist | Board advisor | Global keynote speaker | Founder: AHT Group - Informivity - Bondi Innovation | Humans + AI Leader | Bestselling author | Podcaster | LinkedIn Top Voice


    Small variations in prompts can lead to very different LLM responses. Research that measures LLM prompt sensitivity uncovers what matters, and the strategies to get the best outcomes. A new framework for prompt sensitivity, ProSA, shows that response robustness increases with factors including higher model confidence, few-shot examples, and larger model size. Some strategies you should consider given these findings:

    💡 Understand Prompt Sensitivity and Test Variability: LLMs can produce different responses with minor rephrasings of the same prompt. Testing multiple prompt versions is essential, as even small wording adjustments can significantly impact the outcome. Organizations may benefit from creating a library of proven prompts, noting which styles perform best for different types of queries.

    🧩 Integrate Few-Shot Examples for Consistency: Including few-shot examples (demonstrative samples within prompts) enhances the stability of responses, especially in larger models. For complex or high-priority tasks, adding a few-shot structure can reduce prompt sensitivity. Standardizing few-shot examples in key prompts across the organization helps ensure consistent output.

    🧠 Match Prompt Style to Task Complexity: Different tasks benefit from different prompt strategies. Knowledge-based tasks like basic Q&A are generally less sensitive to prompt variations than complex, reasoning-heavy tasks, such as coding or creative requests. For these complex tasks, using structured, example-rich prompts can improve response reliability.

    📈 Use Decoding Confidence as a Quality Check: High decoding confidence—the model’s level of certainty in its responses—indicates robustness against prompt variations. Organizations can track confidence scores to flag low-confidence responses and identify prompts that might need adjustment, enhancing the overall quality of outputs.

    📜 Standardize Prompt Templates for Reliability: Simple, standardized templates reduce prompt sensitivity across users and tasks. For frequent or critical applications, well-designed, straightforward prompt templates minimize variability in responses. Organizations should consider a “best-practices” prompt set that can be shared across teams to ensure reliable outcomes.

    🔄 Regularly Review and Optimize Prompts: As LLMs evolve, so may prompt performance. Routine prompt evaluations help organizations adapt to model changes and maintain high-quality, reliable responses over time. Regularly revisiting and refining key prompts ensures they stay aligned with the latest LLM behavior.

    Link to paper in comments.
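    The prompt-variability testing described above can be sketched in a few lines: send several rephrasings of the same question, collect the answers, and score how often they agree. The `query_llm` stub below is a hypothetical placeholder for a real model call, and majority agreement is just one simple stability metric, not the ProSA framework itself.

```python
from collections import Counter

def stability_score(answers):
    """Fraction of responses that agree with the most common answer.
    1.0 means the model answered identically across all prompt variants."""
    counts = Counter(answers)
    return counts.most_common(1)[0][1] / len(answers)

# Hypothetical stand-in for a real LLM call; it simulates a model that
# drifts when the word "capital" is missing from the prompt.
def query_llm(prompt):
    return "Paris" if "capital" in prompt.lower() else "France"

variants = [
    "What is the capital of France?",
    "Which city is the seat of government in France?",
    "France's capital is which city?",
]
answers = [query_llm(p) for p in variants]
print(stability_score(answers))  # majority answer "Paris" appears in 2 of 3
```

    A prompt library, as suggested above, then amounts to keeping the variants whose stability score stays near 1.0 across repeated runs.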

  • Leon Chlon, PhD

    Oxford Visiting Fellow [Torr Vision Group] · Author, Information Geometry for GenAI · Built Strawberry (1.6k GitHub stars, 100+ enterprise clients) · Cambridge PhD · MIT | HMS Postdoc · Ex - Uber, Meta, McKinsey, TikTok


    Achieving Near-Zero Hallucination in AI: A Practical Approach to Trustworthy Language Models 🎯

    Excited to share our latest work on making AI systems more reliable and factual! We've developed a framework that achieves a 0% hallucination rate on our benchmark, a critical step toward trustworthy AI deployment.

    The Challenge: Large language models often generate plausible-sounding but incorrect information, making them risky for production use where accuracy matters.

    Our Solution: We trained models to:
    ✅ Provide evidence-grounded answers with explicit citations
    ✅ Express calibrated confidence levels (0-1 scale)
    ✅ Know when to say "I don't know" when evidence is insufficient

    Key Results:
    📈 54% improvement in accuracy (80.5% exact match vs 52.3% baseline)
    🎯 0% hallucination rate through calibrated refusal
    🔍 82% citation correctness (models show their work)
    🛡️ 24% refusal rate when evidence is lacking (better safe than sorry!)

    What Makes This Different: Instead of hiding uncertainty in fluent prose, we enforce structured JSON outputs that create accountability. When the model isn't sure, it explicitly refuses rather than making things up.

    Interesting Finding: Under noisy/cluttered contexts, the model maintains answer quality but sometimes cites the wrong sources, identifying the next challenge to solve!

    We've open-sourced everything:
    https://lnkd.in/ejUtBYJX – 1,198 preference pairs for reproduction
    https://lnkd.in/ewvwDJ2G – DeBERTa reward model (97.4% accuracy)
    Complete evaluation framework
    Technical report: https://lnkd.in/eEDVgfJb

    This work represents a practical step toward AI systems that are not just powerful, but genuinely trustworthy for real-world applications where factual accuracy is non-negotiable. What strategies is your team using to improve AI reliability? Would love to hear about different approaches to this critical challenge! #AI #MachineLearning #ResponsibleAI #NLP #TechInnovation #OpenSource
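    The structured-output idea in the post can be illustrated with a small validator: the model is asked to emit JSON with an answer, a 0-1 confidence, and citations, and anything malformed or under-confident is converted into an explicit refusal. The schema and the `min_confidence` threshold below are assumptions for illustration, not the authors' exact format.

```python
import json

REFUSAL = {"answer": None, "refused": True}

def validate_response(raw, min_confidence=0.6):
    """Parse a structured model output and enforce calibrated refusal.
    Assumed schema: {"answer": ..., "confidence": float in [0, 1],
    "citations": non-empty list}. Anything else becomes a refusal."""
    try:
        obj = json.loads(raw)
    except json.JSONDecodeError:
        return REFUSAL
    if not isinstance(obj, dict):
        return REFUSAL
    if not (0.0 <= obj.get("confidence", -1.0) <= 1.0):
        return REFUSAL
    if obj["confidence"] < min_confidence or not obj.get("citations"):
        return REFUSAL
    return {"answer": obj["answer"], "refused": False}

ok = validate_response('{"answer": "42", "confidence": 0.9, "citations": ["doc1"]}')
low = validate_response('{"answer": "42", "confidence": 0.3, "citations": ["doc1"]}')
# ok passes through; low is converted into an explicit refusal
```

    Refusing instead of answering is exactly the trade-off behind the 24% refusal rate above: coverage drops, but hallucinated answers are filtered out before they reach the user.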

  • Vik Pant, PhD

    Applied AI and Quantum Information @ PwC, Synthetic Intelligence Forum, University of Toronto


    Thank you to the University of Toronto Machine Intelligence Student Team for inviting me to present a keynote on augmenting human-labeled datasets using Large Language Models (LLMs).

    Human-labeled data is crucial for testing, tuning, customizing, and validating LLMs in organizations, because it provides the ground truth for developing trustworthy #GenerativeAI applications and #AgenticAI systems. Yet acquiring sufficient human-labeled data is often a bottleneck: subject matter experts and domain specialists typically have limited time for labeling tasks due to competing professional demands, making large-scale manual labeling difficult to sustain.

    My talk focused on how LLMs can be used not to substitute human labels, but to systematically augment them—extending the utility of existing human-labeled data and improving model robustness without proportionally increasing manual labeling effort. I described practical methods for implementing two augmentation techniques with strong empirical grounding:

    • Negative Reinforcement with Counterfactual Examples – This technique analyzes labeled examples to generate counterfactual examples—outputs that are intentionally incorrect or undesirable—and uses them to teach the model what not to generate. Guided by these negative samples, the model learns sharper decision boundaries, increasing robustness against hallucinations and confabulations.

    • Contrastive Learning with Controlled Perturbations – This technique creates diverse, label-preserving variants of human-labeled examples by introducing controlled modifications to the prompts and/or completions. These perturbations maintain core semantic meaning while varying surface-level features such as syntax, phrasing, or structure, encouraging the model to generalize beyond shallow lexical or syntactic cues.

    These techniques have been shown to drive measurable improvements in model behavior:
    • Lower Perplexity → More predictable completions and improved alignment with ground-truth targets.
    • Reduced Token Entropy → More focused and efficient completions, reducing inference complexity.
    • Higher Self-Consistency → More stable completions across repeated generations of the same prompt—a key requirement for dependable downstream use.

    These are not theoretical constructs—they are practical techniques for overcoming constraints in human-labeled data availability and scaling #LLM applications with greater efficiency and rigor.

    Appreciate the University of Toronto Machine Intelligence Student Team (UTMIST) for a well-curated conference, and the UofT AI group for their initiatives in the space. Grateful to my research partner, Olga, for her contributions in collaboratively developing content for this presentation. Kudos to my PwC Canada teammates including Michelle B, Annie, Chris M, Michelle G, Chris D, Brenda, Bahar, Danielle, and Abhinav for their partnership on our PwC #AI portfolio.
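    The contrastive-perturbation idea above can be shown with a toy example: apply small, meaning-preserving surface rewrites to a labeled text while carrying the label over unchanged. A real pipeline would use an LLM or a paraphrase model for the rewrites; the rule-based `perturb` function below is only a stand-in to show the label-preserving structure.

```python
import random

def perturb(text, rng):
    """Create a label-preserving surface variant of a labeled example by
    applying a simple, meaning-preserving rewrite (illustrative only)."""
    rewrites = [
        lambda s: s.replace("do not", "don't"),   # contraction
        lambda s: s.replace(".", "!"),            # punctuation change
        lambda s: "Please note: " + s,            # framing change
    ]
    return rng.choice(rewrites)(text)

rng = random.Random(0)  # fixed seed so augmentation is reproducible
example = {"text": "Customers do not like delays.", "label": "negative"}

# Each variant keeps the human label but varies the surface form.
variants = [
    {"text": perturb(example["text"], rng), "label": example["label"]}
    for _ in range(3)
]
```

    Training on `example` plus `variants` is what stretches one expert-provided label across several surface forms, which is where the manual-labeling savings come from.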

  • Cameron R. Wolfe, Ph.D.

    Research @ Netflix


    Mixture-of-Experts (MoE) LLMs are more prone to training instability than standard LLMs. Here’s why this is the case and how we can fix it…

    Where do instabilities come from? There are two main issues that occur when training an MoE:
    1. Routing collapse: the model converges to using the same expert(s) over and over.
    2. Numerical instability: the MoE experiences round-off errors, especially in the router.
    These issues lead to training instability, meaning that the model’s loss may simply diverge (i.e., go up instead of down) during the training process.

    Avoiding routing collapse: We need to add auxiliary losses to our training objective that encourage the model to use experts uniformly. The most common auxiliary loss for MoEs is the load balancing auxiliary loss [1], which is minimized when the MoE i) assigns probability uniformly to experts and ii) routes an equal number of tokens to each expert within a batch.

    Avoiding numerical instability: The biggest source of numerical instability is the MoE’s router, because the router includes an (exponential) softmax function. To avoid numerical instabilities in this layer, we can add an auxiliary loss that encourages the values going into the softmax function to not be too large–this is called the router z-loss [2]. Although many LLMs are trained in lower (bfloat16) precision, we should avoid using low precision within the router. Mixed / low precision training greatly improves training efficiency, but it can also make round-off errors more frequent within the router!

    Weight initialization: Traditionally, we made the training of large, deep neural networks more stable by discovering better weight initialization (e.g., He or Glorot init) and normalization (e.g., batch normalization) techniques. Similarly, we can improve MoE training stability by using a weight initialization strategy that’s more tailored to MoEs. As proposed in [1], we can sample from a truncated normal distribution with a mean of zero (µ = 0) and standard deviation given by σ = sqrt(s/n), where s (0.1 by default) is a scale hyperparameter and n is the size of the input to the layer being initialized.

    Putting everything together: I’ve tried out each of these techniques within nanoMoE, a simple and functional MoE pretraining implementation that I recently released. We can see that each of these tricks improves the MoE’s training stability. When we use them all together, nanoMoE is able to fully complete pretraining without any instabilities!
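    The three stabilizers above can be sketched in NumPy: the load-balancing auxiliary loss [1], the router z-loss [2], and the scaled truncated-normal init with σ = sqrt(s/n). Function names and the clipping-based truncation are illustrative choices, not nanoMoE's exact code.

```python
import numpy as np

def load_balancing_loss(router_probs, expert_index):
    """Load balancing auxiliary loss [1]: num_experts * sum_i f_i * P_i,
    where f_i is the fraction of tokens routed to expert i and P_i is the
    mean router probability for expert i. Minimized (value 1.0) when both
    the routing assignments and the probabilities are uniform."""
    num_tokens, num_experts = router_probs.shape
    f = np.bincount(expert_index, minlength=num_experts) / num_tokens
    p = router_probs.mean(axis=0)
    return num_experts * np.sum(f * p)

def router_z_loss(router_logits):
    """Router z-loss [2]: mean of logsumexp(logits)^2 over tokens, which
    penalizes large pre-softmax values and so curbs round-off error."""
    lse = np.log(np.exp(router_logits).sum(axis=-1))
    return np.mean(lse ** 2)

def moe_init(shape, s=0.1, rng=None):
    """MoE-tailored init [1]: zero-mean normal with sigma = sqrt(s / n),
    n being the layer's input size; truncation approximated by clipping."""
    if rng is None:
        rng = np.random.default_rng(0)
    sigma = np.sqrt(s / shape[0])
    w = rng.normal(0.0, sigma, size=shape)
    return np.clip(w, -2 * sigma, 2 * sigma)

# Perfectly uniform routing over 4 experts drives the aux loss to 1.0.
probs = np.full((8, 4), 0.25)
assignments = np.array([0, 1, 2, 3, 0, 1, 2, 3])
print(load_balancing_loss(probs, assignments))  # 1.0
```

    In training, both auxiliary terms would be added to the language-modeling loss with small coefficients so they steer the router without dominating the main objective.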

  • Sachin Kumar

    Senior Data Scientist III at LexisNexis | Experienced Agentic AI and Generative AI Expert


    Methods for improving LLM Training Stability

    Training stability of large language models (LLMs) is an ongoing and important research topic. One source of training instability is the growth of logits in attention layers. This paper proposes training a small language model with a high learning rate to force the model to diverge early and simplify the analysis of training stability.

    𝗠𝗲𝘁𝗵𝗼𝗱𝘀 𝘁𝗼 𝗶𝗺𝗽𝗿𝗼𝘃𝗲 𝗟𝗟𝗠 𝗧𝗿𝗮𝗶𝗻𝗶𝗻𝗴 𝘀𝘁𝗮𝗯𝗶𝗹𝗶𝘁𝘆
    i) σReparam – reparameterizes the weights of a linear layer; it influences the magnitude of the linear layer weights and successfully prevents entropy collapse in the attention layers, promoting more stable training
    ii) SoftMax Temperature (soft temp) – helps control logits magnitude, improving model training stability at a high learning rate
    iii) SoftMax Capping (soft cap) – can be interpreted as an adaptive method of softmax temperature control
    iv) SoftMax Clipping (soft clip) – whenever softmax values are clipped they do not give a gradient, preventing the outliers from growing
    v) LayerScale – adds learnable feature scaling after every residual block; it is a method of feature scaling that does per-channel multiplication of features with learnable parameters
    vi) QK Layer Normalization (QK norm) – layer normalization applied to the queries and keys before the dot-product attention computation
    vii) Combination of QK Layer Normalization with SoftMax Capping (QK norm cap) – QK layer normalization controls the magnitudes of the Q and K features before the dot product, while softmax capping controls the softmax temperature, which also helps reduce softmax sensitivity to large-magnitude input logits; the two can complement each other and further improve model stability
    viii) Layer Normalization with QKV Layers (QKV norm) – given that the authors observe magnitude explosion in the output of the QKV layer, they hypothesize that layer normalization after the QKV layer should address the issue, with no need to apply layer normalization before the QKV layer
    ix) Layer Normalization after QK, Proj and FC2 layers (QK FC norm) – the authors observe that the magnitude of all linear layers of the diverging model is much higher than in the converging model, particularly in the QKV, Proj and FC2 layers; Gemma2 has a similar topology

    𝗥𝗲𝘀𝘂𝗹𝘁𝘀
    • The most stable baseline models are soft cap and QK norm, which diverge at a learning rate of 60e-3
    • Layer normalization after the QK, Proj and FC2 layers (QK FC norm) does not improve model stability in comparison to QK norm and soft cap
    • Two methods, QKV norm and QK norm cap, allow increasing the learning rate by 1.5x (without model divergence) in comparison to a method based on QK layer normalization
    • Significant perplexity improvements with the QKV norm, QK norm cap, QK norm, and QK FC norm models in comparison to the bf16 baseline model

    𝗣𝗮𝗽𝗲𝗿: https://lnkd.in/ex6jHU7v
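    Two of the listed methods are simple enough to sketch directly: softmax capping, which bounds attention logits with cap * tanh(x / cap) (the scheme Gemma2-style models use), and QK layer normalization applied to queries and keys before the dot product. The NumPy functions below are illustrative re-implementations, not the paper's code, and the cap value of 50 is an assumed hyperparameter.

```python
import numpy as np

def soft_cap(logits, cap=50.0):
    """Softmax capping: squash logits smoothly into (-cap, cap) via
    cap * tanh(x / cap), bounding the softmax input regardless of how
    large the raw attention logits grow."""
    return cap * np.tanh(logits / cap)

def qk_norm(x, eps=1e-6):
    """Layer normalization (without learnable affine parameters) applied
    to query/key features over the last axis, keeping Q.K magnitudes
    controlled before the dot-product attention computation."""
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

# Exploding logits stay bounded after capping and normalized after QK norm.
q = np.array([[100.0, -100.0, 5.0, -5.0]])
capped = soft_cap(q)   # every entry now lies strictly inside (-50, 50)
normed = qk_norm(q)    # zero mean, ~unit variance per row
```

    Both transforms attack the same failure mode the paper studies, unbounded logit growth in attention, one at the softmax input and one at the Q/K features feeding it.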
