A coding eval has a test suite. A math eval has an answer key. A data eval has neither.
That's the part of building AI for data work that gets dramatically underestimated.
Ask an agent "what was our revenue last quarter?" and there are five plausibly correct answers, depending on which column you trust, whether you count refunds, and how you handle pricing changes from two years ago.
Pick the wrong one and you get a confident, beautiful, completely wrong number. It looks identical to the right one.
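To make that concrete, here is a minimal sketch in Python on a made-up orders table. The column names (gross_amount, net_amount, refunded) and every number are hypothetical, not taken from any real warehouse or from the piece discussed below. The only point is that each definition is defensible and each produces a different total.

```python
# Hypothetical toy data: four orders from one quarter.
orders = [
    {"id": 1, "gross_amount": 100.0, "net_amount": 90.0, "refunded": False},
    {"id": 2, "gross_amount": 200.0, "net_amount": 180.0, "refunded": True},
    {"id": 3, "gross_amount": 50.0, "net_amount": 45.0, "refunded": False},
    {"id": 4, "gross_amount": 150.0, "net_amount": 135.0, "refunded": False},
]


def revenue(rows, column, include_refunds):
    """Sum one amount column, optionally dropping refunded orders."""
    return sum(r[column] for r in rows if include_refunds or not r["refunded"])


# Four defensible readings of "revenue last quarter".
answers = {
    "gross, refunds counted": revenue(orders, "gross_amount", True),
    "gross, refunds excluded": revenue(orders, "gross_amount", False),
    "net, refunds counted": revenue(orders, "net_amount", True),
    "net, refunds excluded": revenue(orders, "net_amount", False),
}

for definition, total in answers.items():
    print(f"{definition}: {total:.2f}")
```

Nothing in the output tells you which of the four totals the person asking actually meant.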
Izzy Miller from Hex just published a great piece on evaluating data agents that articulates this better than anything I've read. They built a fake business called Shorelane Commerce to evaluate against, and the details are painfully familiar to anyone who's worked in a real warehouse.
In their setup: a sales channel got renamed in 2022 and never backfilled. Every customer has at least two IDs, sometimes four. There's a legacy Shopify that's mostly a red herring but one team still uses it for one report. Subscription plans got restructured in 2023 with enough customers grandfathered that three pricing worlds are still in circulation.
Their argument: public benchmarks run on demo-shaped data, with clean schemas and one source of truth. That is not the world your agent actually has to work in.
You cannot test an agent on clean data and learn anything useful about how it behaves on real data.
The other thing from the piece that stuck with me: at this layer, you're not really evaluating "the agent." You're evaluating the entire context flywheel around it: the workspace, the docs, the semantic layer, and the memory between turns.
Change one and the same model behaves like a completely different agent.
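One way to test that claim is to hold the model fixed and toggle the context pieces one at a time. The sketch below is my own illustration, not the harness from the piece: stub_agent is a hypothetical stand-in for a real agent call, the component names are just labels, and the numbers 270.0 and 500.0 are arbitrary placeholders.

```python
from itertools import product

# Hypothetical context components the agent may or may not be given.
COMPONENTS = ("workspace", "docs", "semantic_layer", "memory")


def stub_agent(question, context):
    """Stand-in for a real agent call (assumption for illustration).

    It returns the expected number only when the semantic layer is present.
    A real harness would call the actual agent here.
    """
    return 270.0 if "semantic_layer" in context else 500.0


def run_grid(agent, question, expected):
    """Ask the same question under every on/off combination of components."""
    results = {}
    for flags in product([False, True], repeat=len(COMPONENTS)):
        context = frozenset(c for c, on in zip(COMPONENTS, flags) if on)
        results[context] = agent(question, context) == expected
    return results


grid = run_grid(stub_agent, "What was our revenue last quarter?", expected=270.0)

for context, passed in grid.items():
    label = ", ".join(sorted(context)) or "(no extra context)"
    print(f"{label}: {'pass' if passed else 'fail'}")
```

With a real agent in place of the stub, the pass/fail pattern across the grid shows which piece of context the answer actually depends on.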
The model matters less than you think. The prompt matters less than you think. The context around it matters way more than you think.