Research shows adding fun facts like “cats sleep most of their lives” to an LLM input can multiply error rates 2-5x. So it turns out cat trivia can be an effective attack on reasoning LLMs. Benchmarks usually test the happy path... but there are so many weird adversarial/edge case scenarios that go unexplored 🙀 Which means the real challenge isn’t “Can it reason?” - it’s “Can it reason when the prompt gets weird?” Link to details below
Anna Sena
4d
Love this insight. Edge cases like these highlight why adversarial evaluation is so critical in AI.
LLMs losing accuracy over cat facts might be my favorite adversarial attack to date
https://promptfoo.dev/lm-security-db/vuln/cat-triggered-reasoning-error-7832f185 😺