Testing deterministic code is straightforward: give it an input, check the output. Testing an autonomous agent is a different problem. When agents graded their own work, they were wrong nearly 1 in 5 times and missed 40% of actual failures. 👀 Here's how we built an independent "Trust Layer" that can hit 100% accuracy, without brittle scripts or black-box judgments. 🔍 https://lnkd.in/dKqAhp7P
LLMs are inherently probabilistic engines; they naturally optimize for short-term statistical gains. If an AI can make "Feature X" work by taking a shortcut that violates your AGENTS.md architecture rules, it will. Without hard guardrails in your CI, that AI slop will merge into main. Once it's there, it becomes the new baseline and the justification for future tech debt. If you're building in TypeScript, tools like deslop.dev are becoming the new norm. If a rule isn't enforced in CI, it means nothing.
You introduce your post by saying: "Testing deterministic code is straightforward". To me, that overlooks a lot. Testing is a discipline of its own, and errors themselves can be non-deterministic. Engineers have been struggling with non-determinism for decades. Think of Richard Hamming, for instance, in the early days of electronic computation. He built error-correcting codes out of parity bits to cope with the errors made by his computer. So maybe the point is not determinism vs non-determinism. Maybe the point is: "Is AI efficient enough to solve my problem?", efficient meaning fast, dependable, and affordable here. But wait a minute! It has been only two and a half years since the birth of ChatGPT! The progress made since then is astounding and IMO, there is still a lot of room for improvement.
The "you can't be probabilistic and 100% accurate" point misses what the paper does. The agent guesses. The checker does not. The dominator tree asks one thing: did the run hit the steps required for success, yes or no. It's not a model. It does not guess. That's the idea. You stop using one guessing system to grade another. A yes-or-no check on a random process can be 100% accurate on a fixed test set, because the check never guesses. Same reason "use a second model to verify" is weaker. A second model still guesses. Better to take the model out of the check.
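To make the yes-or-no idea concrete, here is a minimal sketch of a dominator-based check. It is not the paper's implementation: the workflow graph, the step names, and the helper `run_passes` are all hypothetical, and this version only tests whether each required step appears in the trace, not the order in which it appears.

```python
from typing import Dict, Iterable, List, Set


def dominators(graph: Dict[str, List[str]], entry: str) -> Dict[str, Set[str]]:
    """Return, for each node, the set of nodes that lie on every path from entry to it."""
    nodes = set(graph) | {s for succs in graph.values() for s in succs}
    preds: Dict[str, Set[str]] = {n: set() for n in nodes}
    for n, succs in graph.items():
        for s in succs:
            preds[s].add(n)

    # Standard iterative data-flow formulation: start from "everything
    # dominates everything" and shrink until nothing changes.
    dom = {n: set(nodes) for n in nodes}
    dom[entry] = {entry}
    changed = True
    while changed:
        changed = False
        for n in nodes - {entry}:
            if not preds[n]:
                continue  # no incoming edges: leave the initial set untouched
            new = {n} | set.intersection(*(dom[p] for p in preds[n]))
            if new != dom[n]:
                dom[n] = new
                changed = True
    return dom


def run_passes(trace: Iterable[str], graph: Dict[str, List[str]],
               entry: str, goal: str) -> bool:
    """Deterministic check: did the run hit every step that dominates the goal?"""
    required = dominators(graph, entry)[goal]
    return required.issubset(set(trace))


# Hypothetical workflow: two interchangeable ways to fill a form,
# but every successful run must open the form and submit it.
WORKFLOW = {
    "start": ["open_form"],
    "open_form": ["fill_manually", "paste_from_clipboard"],
    "fill_manually": ["submit"],
    "paste_from_clipboard": ["submit"],
    "submit": ["success"],
}

good_trace = ["start", "open_form", "paste_from_clipboard", "submit", "success"]
bad_trace = ["start", "open_form", "fill_manually", "success"]  # claims success, never submitted
```

Under these assumptions, `run_passes(good_trace, WORKFLOW, "start", "success")` is true and the same call on `bad_trace` is false, no matter how confident the agent's own report was. The optional steps (`fill_manually`, `paste_from_clipboard`) are not required because neither lies on every path to `success`.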
"We don’t need black-box models to judge other black-box models." 👆 You just described human judgement. We are literally building agents in our image. How do we validate human output?
Interesting inflection point: the industry is slowly realizing agent reliability is less about making models deterministic and more about building independent verification layers around nondeterministic systems. The trust boundary is moving outside the agent itself.
One of the hardest shifts in AI is realizing that generation and governance are different problems. Models can produce an answer. Determining whether that answer should be trusted, acted on, or escalated is a separate layer entirely. As systems become more autonomous, validation may become more important than generation.
This is the right direction. The harder failure mode is reviewer correlation, where generator and grader share the same blind spot and wrong code passes twice. A useful metric to publish would be review-escape rate by workflow step.
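One way such a metric could be computed, as a rough sketch: the definition used here (share of genuinely defective outputs at a step that still passed review) is an assumption, and the `ReviewRecord` fields and step names are hypothetical.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, Iterable


@dataclass
class ReviewRecord:
    step: str                 # workflow step that produced the output (hypothetical names)
    passed_review: bool       # what the reviewer or grader decided
    actually_defective: bool  # ground truth, established independently of the reviewer


def escape_rate_by_step(records: Iterable[ReviewRecord]) -> Dict[str, float]:
    """Per step: defective outputs that passed review / all defective outputs."""
    defective: Dict[str, int] = defaultdict(int)
    escaped: Dict[str, int] = defaultdict(int)
    for r in records:
        if r.actually_defective:
            defective[r.step] += 1
            if r.passed_review:
                escaped[r.step] += 1
    # Steps with no known defects are omitted rather than reported as 0.0.
    return {step: escaped[step] / total for step, total in defective.items()}


# Illustrative records only, not real measurements.
records = [
    ReviewRecord("plan", passed_review=True, actually_defective=True),
    ReviewRecord("plan", passed_review=False, actually_defective=True),
    ReviewRecord("codegen", passed_review=False, actually_defective=True),
    ReviewRecord("codegen", passed_review=True, actually_defective=False),
]
rates = escape_rate_by_step(records)
```

The ground-truth column has to come from somewhere other than the generator or the grader, otherwise the shared blind spot hides the escapes this metric is meant to expose.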
GitHub’s latest Trust Layer work confirms the exact problem Keystone was built to solve: AI and automation cannot be trusted by self-report. Keystone attacks the pipeline-proof side of that problem with deterministic, replayable, tamper-evident verification.
Great insight! Trust begins where self-assessment ends and consequence can enter from outside the system.