One of the biggest constraints on the value of LLMs is that they sound equally confident irrespective of their underlying uncertainty. A new sampling approach, Entropix, proposes using different strategies for selecting the next token depending on the nature of the model's uncertainty. A great piece by Thariq Shihipar lays out the logic.

The starting point is distinguishing between entropy and varentropy. Entropy measures how concentrated or diffuse the distribution over the next token is: low entropy means the model assigns very high probability to one next token, while high entropy means a number of candidate tokens have similar probability. Varentropy measures how much the probabilities differ across those candidates: consistent (low) or varied (high). Each of the four combinations yields a different strategy for next-token selection:

⬇️⬇️ Low Entropy, Low Varentropy: Model is very confident → Choose the highest-probability option
⬇️⬆️ Low Entropy, High Varentropy: A few strong competing options → Consider branching to explore different paths
⬆️⬇️ High Entropy, Low Varentropy: Model is uncertain → Use "thinking tokens" to prompt more consideration
⬆️⬆️ High Entropy, High Varentropy: Many scattered options → Use random selection or branching

These are still early days for assessing model uncertainty and adjusting to improve output validity (including reducing hallucinations). However, progress here will greatly improve the value of LLMs. Another critical aspect of this research is Humans + AI work. Humans currently have to make their own assessments of LLM outputs of highly varying quality. Decision quality could improve massively if LLMs could offer valid confidence assessments as input into complex human-first decisions.
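Under these definitions, entropy and varentropy fall straight out of the next-token distribution. A minimal sketch in Python; the thresholds here are illustrative placeholders, not Entropix's actual values:

```python
import math

def entropy(probs):
    """Shannon entropy of a next-token distribution, in nats."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def varentropy(probs):
    """Variance of the surprisal -log p around the entropy."""
    h = entropy(probs)
    return sum(p * (-math.log(p) - h) ** 2 for p in probs if p > 0)

def pick_strategy(probs, ent_thresh=1.0, vent_thresh=0.3):
    """Map the four entropy/varentropy quadrants to sampling strategies."""
    h, v = entropy(probs), varentropy(probs)
    if h < ent_thresh and v < vent_thresh:
        return "argmax"   # confident: take the top token
    if h < ent_thresh:
        return "branch"   # few strong competitors: explore paths
    if v < vent_thresh:
        return "think"    # diffuse uncertainty: insert thinking tokens
    return "sample"       # scattered options: random selection / branching
```

A near-deterministic distribution lands in the "argmax" quadrant, while a flat distribution over many tokens has high entropy but zero varentropy and lands in "think".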
Uncertainty Metrics for Large Language Models
Summary
Uncertainty metrics for large language models are tools and methods used to measure how confident an AI model is in its responses, helping users determine when the model’s output might be unreliable or prone to errors like hallucinations. These metrics play a crucial role in making AI systems safer and more trustworthy by quantifying and communicating the model's confidence and the types of uncertainty present.
- Ask for evidence: Encourage your AI system to cite sources or explain its reasoning whenever it provides an answer, so you can better judge its trustworthiness.
- Check response consistency: Phrase the same question in several different ways to see if the model gives stable answers, which can reveal underlying uncertainty.
- Use specialized metrics: Select uncertainty measurement methods tailored to your specific task, such as calibrated confidence scores or out-of-distribution detection, for more reliable results.
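The consistency tip above can be automated: ask the same question phrased several ways and measure agreement. A minimal sketch, where `ask` stands in for any chat-completion call (hypothetical):

```python
from collections import Counter

def consistency_score(ask, paraphrases):
    """Ask the same question phrased several ways; return the share of
    answers agreeing with the modal answer, plus that answer.
    `ask` is a hypothetical stand-in for any chat-completion call."""
    answers = [ask(p).strip().lower() for p in paraphrases]
    top, count = Counter(answers).most_common(1)[0]
    return count / len(answers), top
```

A score near 1.0 suggests stable answers; a low score is a cue that the model's uncertainty on this question is high.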
To Believe or Not to Believe Your LLM We explore uncertainty quantification in large language models (LLMs), with the goal of identifying when uncertainty in the response to a given query is large. We simultaneously consider both epistemic and aleatoric uncertainty: the former comes from lack of knowledge about the ground truth (such as facts or the language), while the latter comes from irreducible randomness (such as multiple possible answers). In particular, we derive an information-theoretic metric that reliably detects when only epistemic uncertainty is large, in which case the output of the model is unreliable. This condition can be computed based solely on the output of the model, obtained by special iterative prompting based on the previous responses. Such quantification, for instance, makes it possible to detect hallucinations (cases where epistemic uncertainty is high) in both single- and multi-answer responses. This is in contrast to many standard uncertainty quantification strategies (such as thresholding the log-likelihood of a response), which cannot detect hallucinations in the multi-answer case. We conduct a series of experiments demonstrating the advantage of our formulation. Further, our investigations shed some light on how the probabilities assigned to a given output by an LLM can be amplified by iterative prompting, which might be of independent interest.
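The paper derives a precise information-theoretic criterion; purely as an illustration of the iterative-prompting intuition (not the authors' metric), one can test whether independently sampled answers disagree and whether feeding a previous answer back into the prompt sways the model toward repeating it. `sample(prompt)` is a hypothetical stochastic generation call:

```python
from collections import Counter

def epistemic_flag(sample, prompt, k=5, threshold=0.5):
    """Crude proxy: flag high epistemic uncertainty when independent
    samples disagree AND the model is easily swayed toward repeating
    an answer it has already seen (probability amplification)."""
    answers = [sample(prompt) for _ in range(k)]
    agreement = Counter(answers).most_common(1)[0][1] / k
    if agreement >= threshold:
        return False  # samples agree: output looks reliable
    # Amplification probe: condition on a previous answer and resample.
    probe = f"{prompt}\nOne possible answer is: {answers[0]}.\n{prompt}"
    echoed = sum(sample(probe) == answers[0] for _ in range(k)) / k
    return echoed > agreement  # easily amplified -> epistemic, unreliable
```

Note this is only a sketch of the intuition; the paper's actual condition is computed from the model's output probabilities under iterative prompting.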
-
I’m jealous of AI. Because with a model you can measure confidence. Imagine if you could do that as a human: measure how close or far off you are? Here's how to measure it, for technical and non-technical teams.

For business teams:
- Run a "known answers" test. Give the model questions or tasks where you already know the answer. Think of it like a QA test for logic. If it can't pass here, it's not ready to run wild in your stack.
- Ask for confidence directly. Prompt it: "How sure are you about that answer on a scale of 1-10?" Then: "Why might this be wrong?" You'll surface uncertainty the model won't reveal unless asked.
- Check consistency. Phrase the same request five different ways. Is it giving stable answers? If not, revisit the product strategy for the LLM.
- Force reasoning. Use prompts like "Show step-by-step how you got this result." This lets you audit the logic, not just the output. Great for strategy, legal, and product decisions.

For technical teams:
- Use the softmax output to get predicted probabilities. Example: the model says "fraud" with 92% probability.
- Use entropy to spot uncertainty. High entropy = low confidence. (Shannon entropy: −∑p log p)
- For language models, extract token-level log-likelihoods if you have API or model access. These give you the probability of each token generated.
- Use sequence likelihood to rank alternate responses. Common in RAG and search-ranking setups.

For uncertainty estimates, try:
- Monte Carlo Dropout: Run the same input multiple times with dropout on. Compare outputs. High variance = low confidence.
- Ensemble models: Aggregate predictions from several models to smooth confidence.
- Calibration testing: Use a reliability diagram to check whether predicted probabilities match actual outcomes, with Expected Calibration Error (ECE) as the metric. A well-calibrated model should show that 80% confident ≈ 80% correct.
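Of the techniques above, ECE is the easiest to drop into a test suite. A minimal sketch, assuming you have predicted confidences and binary correctness labels:

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """ECE: bin predictions by confidence, compare each bin's mean
    confidence to its accuracy, and average the gaps weighted by bin size."""
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, ok))
    n = len(confidences)
    ece = 0.0
    for b in bins:
        if not b:
            continue
        avg_conf = sum(c for c, _ in b) / len(b)
        accuracy = sum(ok for _, ok in b) / len(b)
        ece += (len(b) / n) * abs(avg_conf - accuracy)
    return ece
```

A model that is 80% confident and 80% correct contributes zero; a model that is 90% confident but only 50% correct contributes a 0.4 gap.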
How to improve confidence (and make it trustworthy):
- Label smoothing during training. Prevents overconfident predictions and improves generalization.
- Temperature tuning (post-hoc). Adjusts the softmax sharpness to better align confidence and accuracy. Temperature < 1 → sharper, more confident; temperature > 1 → more cautious, less spiky predictions.
- Fine-tuning on domain-specific data. Shrinks uncertainty and reduces hedging in model output. Especially effective for LLMs that need to be assertive in narrow domains (legal, medicine, strategy).
- Focal loss for noisy or imbalanced datasets. It down-weights easy examples and forces the model to pay attention to harder cases, which tightens confidence on the edge cases.
- Reinforcement learning from human feedback (RLHF). Aligns the model's reward with correct and confident reasoning.

Bottom line: A confident model isn't just better - it's safer, cheaper, and easier to debug. If you're building workflows or products that rely on AI but you're not measuring model confidence, you're guessing. #AI #ML #LLM #MachineLearning #AIConfidence #RLHF #ModelCalibration
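Temperature tuning is a one-liner on the logits. A sketch of the mechanics (fitting T on a held-out validation set to minimize NLL or ECE is the post-hoc calibration step; this shows only the transform):

```python
import math

def softmax_with_temperature(logits, T=1.0):
    """Softmax over logits / T.  T < 1 sharpens the distribution
    (more confident); T > 1 flattens it (more cautious)."""
    scaled = [z / T for z in logits]
    m = max(scaled)                       # subtract max for stability
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]
```

The same logits become more peaked at T = 0.5 and flatter at T = 2.0, which is exactly the confidence/caution dial described above.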
-
Achieving Near-Zero Hallucination in AI: A Practical Approach to Trustworthy Language Models 🎯

Excited to share our latest work on making AI systems more reliable and factual! We've developed a framework that achieves a 0% hallucination rate on our benchmark, a critical step toward trustworthy AI deployment.

The Challenge: Large language models often generate plausible-sounding but incorrect information, making them risky for production use where accuracy matters.

Our Solution: We trained models to:
✅ Provide evidence-grounded answers with explicit citations
✅ Express calibrated confidence levels (0-1 scale)
✅ Know when to say "I don't know" when evidence is insufficient

Key Results:
📈 54% improvement in accuracy (80.5% exact match vs 52.3% baseline)
🎯 0% hallucination rate through calibrated refusal
🔍 82% citation correctness (models show their work)
🛡️ 24% refusal rate when evidence is lacking (better safe than sorry!)

What Makes This Different: Instead of hiding uncertainty in fluent prose, we enforce structured JSON outputs that create accountability. When the model isn't sure, it explicitly refuses rather than making things up.

Interesting Finding: Under noisy/cluttered contexts, the model maintains answer quality but sometimes cites the wrong sources, identifying the next challenge to solve!

We've open-sourced everything:
https://lnkd.in/ejUtBYJX - 1,198 preference pairs for reproduction
https://lnkd.in/ewvwDJ2G - DeBERTa reward model (97.4% accuracy)
Complete evaluation framework
Technical report: https://lnkd.in/eEDVgfJb

This work represents a practical step toward AI systems that are not just powerful, but genuinely trustworthy for real-world applications where factual accuracy is non-negotiable. What strategies is your team using to improve AI reliability? Would love to hear about different approaches to this critical challenge! #AI #MachineLearning #ResponsibleAI #NLP #TechInnovation #OpenSource
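The post does not publish its schema, but the structured-refusal idea can be sketched with a hypothetical JSON contract: the model must either answer with citations and a confidence score, or explicitly refuse. All field names below are illustrative assumptions:

```python
import json

def validate_answer(raw, min_confidence=0.5):
    """Parse a structured model response; fall back to refusal when the
    reported confidence is too low or citations are missing.
    The JSON schema (answer / confidence / citations / refused) is a
    hypothetical example, not the authors' actual format."""
    obj = json.loads(raw)
    if obj.get("refused"):
        return {"refused": True, "reason": obj.get("reason", "insufficient evidence")}
    if obj.get("confidence", 0.0) < min_confidence or not obj.get("citations"):
        return {"refused": True, "reason": "low confidence or no citations"}
    return {"refused": False, "answer": obj["answer"], "citations": obj["citations"]}
```

The design point is accountability: a refusal is an explicit, machine-checkable state rather than uncertainty buried in fluent prose.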
-
Proud to announce our NeurIPS spotlight, which has been in the works for over a year now :) We dig into why decomposing aleatoric and epistemic uncertainty is hard, and what this means for the future of uncertainty quantification. 📖 https://lnkd.in/eMfBaHxS

The easy solution would of course be to use a decomposition formula like "predictive uncertainty = aleatoric uncertainty + epistemic uncertainty". However, we find that this does not work. In practice, the two components are literally the same; see the scatterplot. And this happens for every second-order distribution we've implemented: from evidential deep learning to Laplace approximations to deep ensembles, the rank correlations between the two components are always between 0.8 and 0.999 (!)

So how could we estimate epistemic and aleatoric uncertainty? For epistemic, we find that an uncertainty estimator trained explicitly on OOD data is the best at OOD detection. So maybe we should move away from general uncertainty methods, toward specialized ones. For aleatoric uncertainty, no clear winner exists yet. This seems to be a more challenging task, and one that also depends on how exactly the aleatoric ground truth is collected for each dataset.

We also test multiple predictive uncertainty metrics: AUROC, AUAC, rAULC, ECE. The best method always depends on what exactly the metric tests, and thus on what _kind_ of predictive uncertainty you need a solution for in your practical task.

So what did we learn in the past year of research? We need to rethink the aleatoric/epistemic dichotomy. Uncertainty estimation is a very rich and nuanced field with a whole spectrum of uncertainty types. If you can give a precise description of the task you want to solve with an uncertainty estimator, then you can build an estimator specialized for exactly that. So we urge the community to stop thinking in fixed categories, and to start thinking in specialized tasks.
This paper comprises 190 figures and tests a whopping 19 different uncertainty estimators across 13 tasks. And despite the number of experiments (or maybe because of it), we made the repo super easy to extend, with optimized implementations of all uncertainty methods. So please feel free to use our implementations in your research: https://lnkd.in/etGUJ36x We're also open-sourcing all WandB logs and metrics. This wouldn't have been possible without the relentless and detailed work of Bálint Mucsányi and Seong Joon Oh!
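For readers who want to reproduce the headline rank-correlation check on their own estimators, here is a dependency-free Spearman sketch (no tie handling; this is not the repo's implementation):

```python
def spearman_rank_corr(xs, ys):
    """Spearman's rho: Pearson correlation computed on the ranks.
    Assumes no ties; use a tie-aware implementation for real data."""
    def ranks(values):
        order = sorted(range(len(values)), key=lambda i: values[i])
        r = [0] * len(values)
        for rank, i in enumerate(order):
            r[i] = rank
        return r
    rx, ry = ranks(xs), ranks(ys)
    n = len(xs)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sum((a - mx) ** 2 for a in rx) ** 0.5
    sy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (sx * sy)
```

Feeding in per-example aleatoric and epistemic estimates from the same second-order method is exactly the check behind the paper's 0.8-0.999 finding.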
-
You're in a Senior ML Interview at OpenAI. The interviewer points to a Transformer diagram and sets the trap: "How do we use the Attention mechanism's weights to measure the model's uncertainty?"

90% of candidates walk right into the trap: "Easy. The Attention scores pass through a Softmax. They sum to 1.0. Therefore, they represent a probability distribution. If the attention is peaked on one token, the model is confident. If the distribution is flat (high entropy), the model is uncertain."

This answer sounds intuitive. It is also mathematically invalid. They just confused a 𝐌𝐢𝐱𝐢𝐧𝐠 𝐖𝐞𝐢𝐠𝐡𝐭 with a 𝐑𝐚𝐧𝐝𝐨𝐦 𝐕𝐚𝐫𝐢𝐚𝐛𝐥𝐞.

In a standard Transformer (during inference), the attention mechanism is 100% deterministic.
- You do not sample from it.
- You do not roll dice.
- You strictly calculate a weighted average of Value vectors.

Because the process is deterministic, the "entropy" of the attention weights is a measure of information dispersal, not probabilistic confidence. A model can have extremely "peaked" attention (looking at one specific token) and still be completely "hallucinating" or wrong about the output. Relying on this for safety-critical uncertainty estimation is a recipe for silent failure.

-----

𝐓𝐡𝐞 𝐒𝐨𝐥𝐮𝐭𝐢𝐨𝐧: To pass the interview, you need to identify 𝐓𝐡𝐞 𝐒𝐭𝐨𝐜𝐡𝐚𝐬𝐭𝐢𝐜 𝐆𝐚𝐩. Real uncertainty requires a source of randomness (stochasticity) to measure the variance of outcomes. Since standard attention is fixed, you must introduce external noise to measure confidence. You have two production-grade options:
- 𝐌𝐨𝐧𝐭𝐞 𝐂𝐚𝐫𝐥𝐨 𝐃𝐫𝐨𝐩𝐨𝐮𝐭: Keep Dropout turned on during inference. Run the forward pass 10 times. Measure the variance in the outputs. That variance is your uncertainty.
- 𝐋𝐚𝐭𝐞𝐧𝐭 𝐕𝐚𝐫𝐢𝐚𝐛𝐥𝐞 𝐌𝐨𝐝𝐞𝐥𝐬 (𝐕𝐀𝐄𝐬): Introduce a true latent variable z (sampled from a Gaussian prior). The variance of the posterior q(z|x) gives you the actual epistemic uncertainty.
𝐓𝐡𝐞 𝐀𝐧𝐬𝐰𝐞𝐫 𝐓𝐡𝐚𝐭 𝐆𝐞𝐭𝐬 𝐘𝐨𝐮 𝐇𝐢𝐫𝐞𝐝 "Attention weights sum to 1, but they are deterministic mixing coefficients, not probabilities. To measure uncertainty, I would not look at the weights themselves. I would measure the variance of the weights across multiple stochastic forward passes (MC Dropout) or use a VAE architecture." #NLP #Transformers #AttentionMechanism #UncertaintyEstimation #DeepLearning #MachineLearning #LLM
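The MC Dropout option above can be framed model-agnostically: run N stochastic passes and report the spread. A minimal sketch, where `forward` is a hypothetical model call with dropout left enabled at inference:

```python
import statistics

def mc_dropout_uncertainty(forward, x, n_passes=10):
    """Run n stochastic forward passes (dropout left ON at inference)
    and report the mean prediction plus its variance as uncertainty.
    `forward(x)` is a hypothetical stochastic model call returning a scalar."""
    outputs = [forward(x) for _ in range(n_passes)]
    return statistics.mean(outputs), statistics.pvariance(outputs)
```

High variance across passes signals low confidence, which is exactly the stochastic signal that deterministic attention weights cannot provide.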