It's risky to trust LLMs. I just had an exchange with OpenAI GPT-4o about probability theory. My conclusion: finding bugs in programs that rely on LLM statements will be an absolute nightmare 😅 Let's dive in:

1. I asked it to define "Expected Value," which worked just fine 👍🏼

2. I then asked it to propose ways to maximize "the worst value for the 10% best outcomes" instead. It worked, but it proposed that I directly optimize the 90th percentile instead, because it is "smoother and less sensitive to outliers" 🤔🤨

3. That seemed odd, so I asked it to prove the statement. It gave a very hand-wavy answer and examples that made no sense. I pushed it 3-4 times. It remained convinced that optimizing the 90th percentile is "smoother and less sensitive" but repeatedly failed to give a proof that made sense.

4. I switched over to o1-pro-mode and asked: "I don't understand the conclusion that they have different sensitivities. Are you sure about that? In your examples, the 90th percentile behaves the same as 'the worst outcome in the top 10%'. Please carefully evaluate your conclusion and prove why it's true or false."

➡️ Voilà! 💡 It figured out that it was wrong after a couple of minutes:

✅ "In a continuous distribution, the statement 'the worst (smallest) outcome in the top 10%' is the same number that defines the 90th percentile."

5. The grand finale: "Based on what you know about how LLMs work, why do you think GPT-4o was initially convinced their sensitivity is different?"

Reply: ✅ Large language models (LLMs) like GPT-4 generate text by predicting the most likely continuation of a conversation based on patterns in their training data. They do not have an inherent, stable "concept map" in the way humans might.
Here are some reasons an LLM might initially (and incorrectly) conclude that the 90th percentile is "less sensitive" than the "worst outcome in the top 10%," even though they are the same boundary under most standard definitions:

➡️ In finance, for instance, "VaR at 90%" is one boundary, but "CVaR at 90%" is the average of the worst 10%. CVaR does have a different sensitivity to outliers than VaR. If the LLM sees many texts emphasizing that "the average of the worst 10%" (CVaR) can be more sensitive to outliers than "the cutoff for the worst 10%" (VaR), it might misapply that idea to the comparison of "90th percentile" vs. "worst value in the top 10%."

➡️ In a discrete or finite-sample scenario, there can be small differences in how you pick the "90th percentile" (e.g., via interpolation) versus how you identify "the worst in the top 10%" (that set may be just one or two data points).

➡️ People often informally say that "percentiles are robust" and that "extreme values can drastically affect the minimum/maximum of a subset." That is true if by "subset" we mean the subset average, or if we are focusing on a single extreme observation that drags the boundary.
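The equivalence o1 landed on is easy to check yourself. A minimal sketch in plain Python (simulated sample; variable names are my own): on a sorted sample, "the worst outcome in the top 10%" and "the 90th-percentile order statistic" are literally the same element.

```python
import random

random.seed(0)
data = sorted(random.gauss(0, 1) for _ in range(100_000))
n = len(data)

# Empirical 90th percentile: the order statistic at the 90% cutoff
p90 = data[int(0.9 * n)]

# "The worst (smallest) outcome in the top 10%": the minimum of the largest 10%
worst_of_top = min(data[-n // 10:])

# Same element of the sorted sample, so identical — no difference in
# "sensitivity" is possible between the two
print(p90 == worst_of_top)
```

(With interpolation-based percentile estimators the two can differ by a tiny amount between adjacent order statistics, which is the finite-sample caveat mentioned above — but they track the same boundary.)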
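The VaR/CVaR distinction the model likely pattern-matched on is also easy to demonstrate. A sketch in plain Python (`var_90`/`cvar_90` are my own illustrative helpers, not a library API): injecting one extreme outlier barely moves the cutoff (VaR) but drags the tail average (CVaR).

```python
import random

random.seed(1)
losses = [random.gauss(0, 1) for _ in range(10_000)]

def var_90(xs):
    """VaR at 90%: the cutoff below which 90% of losses fall."""
    s = sorted(xs)
    return s[int(0.9 * len(s))]

def cvar_90(xs):
    """CVaR at 90%: the average of the worst 10% of losses."""
    s = sorted(xs)
    tail = s[int(0.9 * len(s)):]
    return sum(tail) / len(tail)

base_var, base_cvar = var_90(losses), cvar_90(losses)

# Inject a single extreme outlier into the tail
shocked = losses + [1_000.0]
new_var, new_cvar = var_90(shocked), cvar_90(shocked)

# The cutoff (VaR) is unchanged; the tail average (CVaR) jumps substantially
print(f"VaR  moved by {new_var - base_var:+.4f}")
print(f"CVaR moved by {new_cvar - base_cvar:+.4f}")
```

This is exactly the asymmetry that makes "average of the worst 10%" outlier-sensitive while the percentile boundary itself is robust — and conflating the two is a plausible source of GPT-4o's confusion.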