Testing deterministic code is straightforward: give it an input, check the output. Testing an autonomous agent is a different problem. When agents graded their own work, they were wrong nearly 1 in 5… | GitHub

Yehuda Levy 6h

Great insight! Trust begins where self-assessment ends and consequence can enter from outside the system.

Iliyan Germanov 1d

LLMs are inherently probabilistic engines; they naturally optimize for short-term statistical gains. If an AI can make "Feature X" work by taking a shortcut that violates your AGENTS.md architecture rules, it will. Without hard guardrails in your CI, that AI slop will merge into main. Once it's there, it becomes the new baseline and the justification for future tech debt. If you're building in TypeScript, tools like deslop.dev are becoming the new norm. If a rule isn't enforced in CI, it means nothing.
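To make the "hard guardrails in CI" idea concrete, here is a minimal sketch of an architecture check. It is written in Python for brevity rather than TypeScript, and everything in it is an assumption for illustration: the `FORBIDDEN` rule table, the layer names `ui` and `db`, and the `violations` helper are hypothetical, not taken from AGENTS.md or from any tool named above.

```python
import ast

# Hypothetical rule table: code in the "ui" layer must not import from "db".
FORBIDDEN = {"ui": {"db"}}

def violations(source, layer, rules=FORBIDDEN):
    """Return sorted (line, module) pairs for imports that break the layer rule."""
    banned = rules.get(layer, set())
    found = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            names = [alias.name for alias in node.names]
        elif isinstance(node, ast.ImportFrom):
            names = [node.module or ""]
        else:
            continue
        for name in names:
            # Compare only the top-level package against the banned set.
            if name.split(".")[0] in banned:
                found.append((node.lineno, name))
    return sorted(found)

sample = "import os\nfrom db.models import User\n"
print(violations(sample, "ui"))
```

A CI job could run a check like this over changed files and exit non-zero when the list is non-empty, so the rule is enforced mechanically instead of by convention.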

Antoine Sirianni 19h

You introduce your post by saying: "Testing deterministic code is straightforward". To me, that claim glosses over a lot. Testing is a discipline of its own, and errors themselves can be non-deterministic. Engineers have been struggling with non-determinism for decades. Think of Richard Hamming in the early days of electronic computation: he developed error-correcting codes built on parity bits to mitigate the errors made by his computer. So maybe the point is not determinism vs. non-determinism. Maybe the point is: "Is AI efficient enough to solve my problem?", where efficient means fast, dependable, and affordable. But wait a minute! It's only been two and a half years since the birth of ChatGPT! The progress made since then is astounding and, IMO, there is still a lot of room for improvement.
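As a small illustration of the parity idea mentioned above, the sketch below adds a single even-parity bit to a word and uses it to detect a one-bit error. This is only the simplest form of the technique, not Hamming's full error-correcting code, and the function names `add_parity` and `parity_ok` are made up for the example.

```python
def add_parity(bits):
    """Append one even-parity bit so the total number of 1s is even."""
    return bits + [sum(bits) % 2]

def parity_ok(word):
    """Return True if the word (data bits plus parity bit) has even parity."""
    return sum(word) % 2 == 0

word = add_parity([1, 0, 1, 1])   # three 1s, so the parity bit is 1
corrupted = word.copy()
corrupted[2] ^= 1                 # simulate a single flipped bit

print(parity_ok(word))       # True
print(parity_ok(corrupted))  # False
```

A single parity bit can only detect an odd number of flipped bits; it cannot say which bit flipped, which is the gap that error-correcting codes close.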

Alexander Kell 1d

The "you can't be probabilistic and 100% accurate" point misses what the paper does. The agent guesses. The checker does not. The dominator tree asks one thing: did the run hit the steps required for success, yes or no. It's not a model. It does not guess. That's the idea. You stop using one guessing system to grade another. A yes-or-no check on a random process can be 100% accurate on a fixed test set, because the check never guesses. That's also why "use a second model to verify" is weaker: a second model still guesses. Better to take the model out of the check.
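One way to picture such a check: in a workflow graph, the steps that every path to the goal must pass through are the goal's dominators, and the checker only asks whether the run's trace contains all of them. The sketch below is an assumption-laden illustration, not the paper's implementation; the workflow graph, the step names, and the `run_passes` helper are all hypothetical.

```python
def dominators(graph, entry):
    """Map each node to the set of nodes that dominate it (iterative dataflow)."""
    nodes = set(graph) | {s for succs in graph.values() for s in succs}
    preds = {n: set() for n in nodes}
    for src, succs in graph.items():
        for dst in succs:
            preds[dst].add(src)
    dom = {n: set(nodes) for n in nodes}
    dom[entry] = {entry}
    changed = True
    while changed:
        changed = False
        for n in nodes - {entry}:
            if not preds[n]:
                continue  # unreachable node, leave as-is
            new = {n} | set.intersection(*(dom[p] for p in preds[n]))
            if new != dom[n]:
                dom[n] = new
                changed = True
    return dom

def run_passes(trace, graph, entry, goal):
    """Deterministic yes/no: did the trace hit every step that dominates the goal?"""
    return dominators(graph, entry)[goal].issubset(trace)

# Hypothetical workflow: linting is optional, running tests is not.
WORKFLOW = {
    "start": ["read_issue"],
    "read_issue": ["edit_code"],
    "edit_code": ["run_tests", "lint"],
    "lint": ["run_tests"],
    "run_tests": ["open_pr"],
    "open_pr": [],
}

honest = ["start", "read_issue", "edit_code", "run_tests", "open_pr"]
skipped = ["start", "read_issue", "edit_code", "open_pr"]  # claims success, never ran tests
print(run_passes(honest, WORKFLOW, "start", "open_pr"))   # True
print(run_passes(skipped, WORKFLOW, "start", "open_pr"))  # False
```

The same trace always gets the same verdict, whatever the agent reports about itself, which is the property the comment is pointing at.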

Adriana Garcia 1d

"We don’t need black-box models to judge other black-box models." 👆 You just described human judgement. We are literally building agents in our image. How do we validate human output?

Theo Valmis 17h

Interesting inflection point: the industry is slowly realizing agent reliability is less about making models deterministic and more about building independent verification layers around nondeterministic systems. The trust boundary is moving outside the agent itself.

Jaci Turner 1d

One of the hardest shifts in AI is realizing that generation and governance are different problems. Models can produce an answer. Determining whether that answer should be trusted, acted on, or escalated is a separate layer entirely. As systems become more autonomous, validation may become more important than generation.

Context Studios - AI Development Studio & Agency Berlin 1d

This is the right direction. The harder failure mode is reviewer correlation, where generator and grader share the same blind spot and wrong code passes twice. A useful metric to publish would be review-escape rate by workflow step.
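A rough sketch of how such a metric might be computed, assuming "review-escape rate" means the share of review-approved changes at a step that were later found defective. That definition, the record layout, and the sample data are assumptions made for illustration only, not measurements.

```python
from collections import defaultdict

def review_escape_rate(records):
    """records: iterable of (step, passed_review, later_found_defective) tuples."""
    approved = defaultdict(int)
    escaped = defaultdict(int)
    for step, passed_review, later_found_defective in records:
        if passed_review:
            approved[step] += 1
            if later_found_defective:
                escaped[step] += 1
    # Rate is defined only for steps with at least one approved change.
    return {step: escaped[step] / approved[step] for step in approved}

# Made-up records, purely illustrative.
records = [
    ("codegen", True, True),
    ("codegen", True, False),
    ("codegen", False, True),       # caught by review, so not an escape
    ("test_writing", True, False),
    ("test_writing", True, False),
]
print(review_escape_rate(records))
```

Breaking the rate out by step is what would expose reviewer correlation: a step where generator and grader share a blind spot shows a high escape rate even when its approval rate looks healthy.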

Robert Lazaravitch 1d

GitHub’s latest Trust Layer work confirms the exact problem Keystone was built to solve: AI and automation cannot be trusted by self-report. Keystone attacks the pipeline-proof side of that problem with deterministic, replayable, tamper-evident verification.
