The Experimentation Gap
Evaluating an agent is not the same problem as evaluating a model's output. One is a fixed input-output mapping; the other is behavior that emerges from interaction with an environment that responds. Most tooling solves the first problem — agents pose the second.