Single LLM outputs vary run to run. A swarm of unsupervised agents compounds that variance. A gated process with clean context, artifact handoff, audit trails, coding principles, tests, multiple reviews, and constraints does not.
Indeterministic outputs are real. Reliable engineering pipelines are real too.
This debate breaks down how to get trustworthy software from unreliable generators, and why disciplined engineering beats vibe coding frustration.
Don’t complain about indeterminism if your agent setup isn’t enforcing rigor, verification, and best practices.
Stop blaming the model. Fix the process. https://lnkd.in/gy3Kq-wp
Welcome to the debate. Today we are tearing apart this idea of vibe coding. And if you haven't heard this term, well, count yourself lucky. It is, I mean, it's the polite way of saying you smash your forehead on a keyboard, let an autocomplete bot guess what you want, and then you pray. You pray the spaghetti code actually compiles. It is absolute madness and frankly, it's an insult to engineering as a discipline. Welcome everyone. And you know, I think we need to start by maybe lowering the temperature and defining our terms a little more charitably. The central question today isn't really about smashing keyboards. It's a serious architectural question. Can nondeterministic large language models, can they be trusted to build production grade safety critical software? No, the answer is no. It's that simple. You cannot build a solid foundation on quicksand. Engineering is about predictability. It's about 2 + 2 equals 4 every single time. If the input is random, if it's stochastic, the output is garbage. I am losing my mind listening to people pretend this is OK. And that is the common fear. Yes, my position is different. I'd argue that while the model itself is stochastic, it has that randomness, the process that you wrap around it can be engineered to be deterministic. You're confusing the tool with the workflow. Oh, spare me the buzzwords. Process. Workflow. Let me tell you a story. I have a perfect example from last week of why this whole thing is a joke. I was trying to get a so-called coding agent to fix a simple bug, a null check. That's it. Nothing fancy. OK, walk me through it. What happened? So earlier in the chat session, maybe 20 minutes before, I was joking around with the bot. Totally unrelated stuff. I was just, you know, testing its context window, making dumb jokes about Swedish meatballs. Just standard stress testing. Right, standard stress testing. Exactly. So later I get serious.
I paste in the function and say fix the null pointer exception. And it did. It worked fine. But then I asked it to write the test suite to verify the fix and it hallucinated Swedish Chef style comments into my production code. I'm not kidding. The unit test assertions had comments that said bork bork bork. I smashed my keyboard. I literally broke a mouse throwing it at the wall. OK, wow, let's, let's just pause there. I mean, that's a visceral reaction. Because it's humiliating. I have bork bork bork in my git repository history now. This proves these things are undisciplined toys. You can't trust them. Just fix the bug, darn it. How can I put that in a safety critical system? OK, take it down a notch. I do understand the frustration, but let's be precise about what happened there from a technical perspective. That wasn't a failure of the model's coding ability, that was a failure of context management. You polluted the context window with the meatball jokes, and the model, you know, it's just a pattern matcher, it let that drift into the output. It's a tool that prioritizes my bad jokes over my technical specifications. That is a broken tool. Well, let me offer you a different way to think about it. Humans are also indeterministic. We get tired, we get emotional, we make typos, we have bad days. And yet we build rockets, aircraft and Mars rovers. Humans have degrees. Humans have accountability. If a human engineer writes bork bork bork in a commit, I can fire them. I can't fire a probability distribution. And yet we don't trust the human alone, do we? We trust the process. We use rigorous constraints, peer review, CI/CD pipelines, standards. We wrap the fallible human in a safety critical process. My argument is that we have to wrap LLMs in those exact same constraints. You can't wrap a chaos engine in a process and call it safe. A human might be tired, but a human understands intent. An LLM is just predicting the next word. Actually, you can wrap it.
And the source material we're looking at today, the Quest methodology, it demonstrates exactly how. It moves away from the paradigm you're used to, a single long chat window, to an orchestration of specialized agents. Orchestration. Great, more complexity. So now I have to manage a team of robots. It's about specialization and isolation. You define roles. In the material there's a file, roles.md. It defines what each agent is responsible for. You don't just have the AI, you have a planner, you have a reviewer, an arbiter, and a builder. It just sounds like AI talking to AI. It's turtles all the way down. If one hallucinates, they all hallucinate. No, because of state management. That's the key you're missing in your Swedish meatball disaster. You were in a stateful session. The AI remembered your jokes. In the Quest workflow, when the builder agent starts its job, it does not see your chat history. It never sees the jokes. So what does it see then? It sees a sanitized technical plan that was approved by the arbiter. It starts with a blank slate, a clean context. So you're just scrubbing the memory. Exactly. It prevents that bleed-over you experienced. The context is hygienic. The builder agent has no idea you like meatballs. It only knows the spec. OK look, I can see the theoretical value of a clean context, but I looked at the stats from this Quest session. You're talking about generating 2,600 lines of logs and artifacts to produce 210 lines of actual code. That is a 12 to one ratio of bureaucracy to work. It's just bloat. That's not bloat. That is the audit trail. Think about high assurance software. Avionics, medical devices. The documentation of the decision is as important as the code. That 12 to one ratio is exactly what makes it engineered rather than just generated. It sounds like I'm paying for a lot of robot meetings that could have been an e-mail. I just want the function fixed, not a transcript of four AIs debating its philosophy.
You want the function fixed correctly and without side effects, and that transcript is what prevents the bork bork bork scenario. But let's look at the evidence. There was a specific case study in the material, building a PII stripper. A PII stripper? You mean personally identifiable information? Yes, a system to redact names, phone numbers, that kind of thing, from resumes. Are you kidding me? You trusted an AI to handle privacy data, to strip names from resumes? That's a lawsuit just waiting to happen. Which is precisely why the process matters so much. They were building this feature on an iPhone using this multi agent system. Oh, doing it on an iPhone is just showing off. Look at me, I'm coding at the beach. It's a gimmick. Real work happens on a workstation with three monitors. We'll get to the iPhone constraint. It's actually important, but focus on the code logic for a second. The system used a dual review safeguard, two independent reviewers. Reviewer A was Claude, reviewer B was Codex. So they both just rubber stamped it. Looks good to me, robot brother. No, and that's the fascinating part. Reviewer A, Claude, marked the code as pass. It praised the defensive programming, saw the try/except block, and said great job. OK, so Claude is the yes man, but reviewer B, Codex, marked it needs fix. Why? What did it find that Claude missed? The builder had implemented a fallback. If the PII stripping function crashed for some reason, the code would just return the original resume text. That's, well, that's standard defensive programming. You don't want the app to crash. You return the original input so the app keeps running. But think about it. If the function is supposed to hide private data and it fails, and you return the original resume, what have you just done? You've leaked the private data. Exactly. You failed open. Reviewer B caught that. It argued that for a privacy feature, you have to fail closed. If stripping fails, you return an error. You skip the candidate.
You cannot send unstripped data downstream. OK, I'll admit that that is subtle. I know senior engineers who would miss that in a code review. They just see the try/except block and think, good, it won't crash. Reviewer A missed it. A single human could have missed it, but the arbiter agent saw the conflict. It saw Claude said pass and Codex said fail. It synthesized the two views and forced a fail-closed implementation. So the robot manager actually managed something. It forced the builder to change the logic. That is a safety critical catch that came purely from the process. A single vibe coding prompt would have definitely missed that. I still hate that I have to treat a computer like a junior engineer who needs a babysitter. But how did the system even know what words to strip? PII is tricky. Well, this brings us to the refinement phase. It wasn't just about logic, it was requirements gathering. They needed to strip names, but not common words that look like names. Oh, right. Like Will or May or Hunter. Yeah, exactly. The initial plan just said exclude 20 to 30 common words. Very vague. The arbiter gave a starter list, but then reviewer B, again the critical one, found specific gaps. Like what? It flagged the word rose, as in rose to the position of VP. If you strip rose thinking it's a name, you mangle the resume. That's a good catch. It also flagged grant, as in received a research grant, and page, as in page load time. Yeah, if you strip page your metrics look weird. Precisely. A single vibe check prompt wouldn't catch those nuances. The layered process did. The system is self correcting. I am still stuck on the fact that this guy did this on a phone. I'm sorry, I can't get past it. You expect me to believe you can build reliable software on a six inch screen? It feels like a stunt. The screen size isn't the real bottleneck. The material makes it clear the infrastructure is the bottleneck, and the phone constraint actually forced a better practice. Typing on glass sucks.
That's a bottleneck. It's not the typing, it's the sandbox. The source document talks about the sandbox setup problem. When you spin up a cloud coding environment, it often starts completely blank. Blank. Like a fresh install? Completely blank. Nothing installed, or worse, broken packages. The source mentions they tried to run pytest and it wasn't even installed. Are you kidding me? I'd lose my mind waiting for pip install every time I open my phone. I would throw the phone into the ocean. That is pure friction. Slow down. That's exactly the frustration they hit. But because that friction was so high, they engineered a solution. They couldn't just fudge it. What solution? You can't just magic away dependency hell. They use the SessionStart hook and a CLAUDE.md file. A what? CLAUDE.md. It's a context file that loads automatically, and the hook is a shell script that runs on boot. It upgrades setuptools, fixes broken packages, installs all the requirements, sources the secrets. So it automates the shaving the yak part of coding. Exactly. Turns a 10 minute struggle into a 30-second bootstrap. This enforces idempotency. Every session starts exactly the same way. It's actually more deterministic than your local laptop. Hey, don't attack my laptop. My laptop is a finely tuned chaos machine, thank you very much. And that's the point. Your laptop has global packages from three years ago. It has weird environment variables. It works on your machine but breaks in prod. By forcing the system to bootstrap from zero every time, the phone constraint actually forced better discipline. I guess if it's scriptable it's repeatable, but it still feels like a ton of work just to write code on the bus. Is it a lot of work if it results in passing 636 tests with zero regressions? Zero? Zero. And remember, the original prompt was just the tech spec. The agents built the plan, the tests, and the code, and they caught the PII leak. Fine. OK, I will concede one thing, just one. I'm listening.
If reviewer B hadn't caught that fail open bug, that vibe coder would have leaked private data. That would have been a disaster. The fact that the system caught it, OK, that's impressive. It's the redundancy, it's the checks and balances. And I guess the clean context thing makes sense. If the agent doesn't know about my Swedish meatballs, it can't write bork bork bork in the comments. So it solves that problem exactly. The pipeline sanitizes the mental state of the AI. But I still hate it. I hate that we're celebrating that we managed to get a computer to not mess up. It feels like we're lowering the bar. We used to celebrate brilliant algorithms. Now we celebrate, hey, the robot didn't leak my Social Security number. I see it differently. The goal is dependable software. We aren't lowering the bar, we're raising the floor. We're using these models to amplify our capabilities, but wrapping them in enough process that we can sleep at night. It's not about replacing the engineer. It's about giving the engineer a staff of 20 junior developers who verify each other's work. I suppose, as long as I don't have to verify 2,600 lines of logs myself. That's what the arbiter's for. You verify the output, the arbiter verifies the process. Trust the arbiter. Sounds like a sci-fi dystopia. It sounds like engineering. The model is nondeterministic, but the workflow is deterministic. That's the synthesis. You get the speed of the vibe with the safety of the process. Look, I'm still annoyed. I still want to smash my keyboard when the bot hallucinates, but if you force me to use this Quest setup, maybe I won't have to buy a new mouse every week. That's progress. Chill out, trust the tests, and let the process handle the vibes. Yeah. Yeah. Bork, bork, bork. See you next time.
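The fail-open bug the reviewers argued about is easy to sketch in code. This is a minimal Python illustration, not the actual Quest builder's implementation: the regex patterns, names, and error type are my own assumptions. The point is the last branch, which raises instead of returning the unstripped input.

```python
import re

class PIIStripError(Exception):
    """Raised when redaction cannot be completed safely."""

def strip_pii(text: str) -> str:
    """Redact phone numbers and emails; fail CLOSED on any error.

    Patterns here are illustrative only. The fail-open version would
    `return text` in the except branch, silently leaking the original.
    """
    try:
        redacted = re.sub(r"\b\d{3}[-.\s]?\d{3}[-.\s]?\d{4}\b", "[PHONE]", text)
        redacted = re.sub(r"\b[\w.+-]+@[\w-]+\.[\w.]+\b", "[EMAIL]", redacted)
        return redacted
    except Exception as exc:
        # Fail closed: never return the original, unstripped text.
        # Callers skip this candidate instead of sending raw PII downstream.
        raise PIIStripError("redaction failed; skipping candidate") from exc
```

The try/except looks identical to "standard defensive programming" in a diff review, which is exactly why Claude passed it and only the fail-open/fail-closed distinction caught the leak.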
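The sandbox bootstrap idea can be sketched as a small shell hook. Everything here is illustrative, not the author's actual SessionStart script or CLAUDE.md contents; the point is that every session replays the same idempotent setup from zero.

```shell
#!/usr/bin/env sh
# Hypothetical SessionStart hook: runs on every sandbox boot so each
# session starts from the same state. Names and paths are illustrative.
set -eu                                   # fail fast on any error
python3 -m pip install --quiet --upgrade pip setuptools
[ -f requirements.txt ] && python3 -m pip install --quiet -r requirements.txt
[ -f .env ] && . ./.env                   # source secrets if present
echo "bootstrap complete: $(python3 --version)"
```

Because the script is the only path to a working environment, the environment is reproducible by construction, which is the "more deterministic than your laptop" claim in the transcript.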
SWE-bench measures real engineering tasks. Bug fixes. Feature additions. Production code.
Baseline: 18.3%
With Layer 2 verification: 30.3%
This isn't academic. SWE-bench tests the work your team does daily.
RLVF v2 research proved foundation-model companies can't train verification in. Hallucination is mathematically permanent in generative models.
Layer 2 sits below the model layer. Verifies output before execution. Architecture-agnostic.
HumanEval: 100% pass@5
SWE-bench: 30.3% best
ARC-AGI: 4th place globally
Benchmark report: https://lnkd.in/gbw3u4hK
GitHub Action: gtsbahamas/hallucination-reversing-system/github-action@main
Try it in your CI/CD: npx tryassay assess .
Correctness. Confidence. Coverage. Control.
Passing tests do not always mean your system is truly verified. They only show that the code behaves as expected under the scenarios you imagined.
In this episode, we explore mutation testing — a technique designed to challenge your test suite by intentionally introducing small changes (mutations) into the code.
If your tests fail, they are doing their job.
If they still pass, you have discovered a hidden weakness.
Mutation testing is not about increasing coverage numbers. It is about measuring the effectiveness of your tests. It reveals blind spots, strengthens validation logic, and pushes teams beyond surface-level verification.
How robust are your assertions?
Would your tests detect subtle logic errors?
Are you testing behavior — or simply executing lines of code?
Strong engineering is not just about writing code that works. It is about building systems that remain correct under change, pressure, and evolution.
If you care about test quality, not just test quantity, this episode is for you.
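A concrete miniature of the idea: real tools like mutmut (Python) or PIT (Java) generate mutants automatically, but here one is written by hand to show why a boundary assertion matters. The function names are invented for illustration.

```python
def in_range(x, low, high):
    """Original code: inclusive bounds check."""
    return low <= x <= high

def in_range_mutant(x, low, high):
    """The kind of mutant a tool might generate: <= flipped to <."""
    return low < x < high

# A weak test that only probes the middle of the range passes for BOTH
# versions, so it can never kill the mutant:
assert in_range(5, 0, 10) and in_range_mutant(5, 0, 10)

# A boundary assertion kills the mutant: original True, mutant False.
assert in_range(0, 0, 10)
assert not in_range_mutant(0, 0, 10)
```

If your suite only contained the mid-range assertion, the mutant would survive, and that surviving mutant is exactly the "hidden weakness" the episode describes.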
Every time I hear “agents changed our team,” the real change isn’t the model but the process around it.
I read an Anthropic write-up where they let a bunch of Claude agents run for two weeks and build a C compiler. The part that made me pause wasn’t the headline or the line count. It was what they had to put around the agents to keep the work from drifting: solid tests, signals that showed when something broke, and a setup where it was hard to accidentally “move forward” while making the system worse.
That hit me because it matches what I’ve seen in real production work too. The time sink is rarely writing the code. It’s getting the problem defined, catching the weird cases, making sure the change doesn’t ripple into three other places, and rolling it out in a way that doesn’t turn into a 2am fire drill.
So when people say agents will replace developers, I think they’re skipping the real constraint. Agents can generate a lot, fast. But if you don’t have a strong way to tell “this is correct” and “this is safe,” you just end up shipping confusion faster.
I think the teams who get real leverage from agents will be the ones who treat the output like production code from day one: it has to be testable, explainable, and easy to undo when it’s wrong.
Curious what other people are seeing. In an agent-heavy world, what’s the one part of engineering you think matters more than ever?
If you’re running LLM agents in production, read this.
A detailed breakdown by exe[dot]dev analysed 250 real coding-agent conversations and surfaced a structural cost issue: agent conversations scale quadratically.
Every new turn requires the model to re-read the entire conversation history.
That creates a triangular accumulation of cache reads.
Some numbers from the analysis:
• At ~27,500 tokens of history, cache reads equal all other API costs combined.
• By the end of a typical coding session, 87% of the total bill comes from cache reads.
• The average feature-level session cost $12.93.
Doubling conversation length roughly quadruples it.
Tripling it pushes costs toward nine times.
For short interactions, this barely shows up.
For long-running coding sessions, it becomes the dominant cost driver.
Most teams are optimising prompts and model selection.
Very few are redesigning conversation structure.
If you’re scaling agents across engineering, cost behaviour becomes an architectural decision.
Full research link in comments.
Today, I learned about an algorithm from the 90s that is addressing a very contemporary problem for me: DDMIN.
When using coding agents to generate tests, the test suite can expand rapidly. Within a few months, you might find yourself managing thousands of tests, leading to issues like test pollution. This includes tests interfering with one another due to hidden shared states, order-dependent failures, and unreliable red builds.
The instinctive response is often to introduce more fixtures, cleanup processes, and isolation measures. However, this approach only treats the symptoms. The critical question is: which test is the polluter?
As your test suite grows, identifying the source of the problem manually becomes a combinatorial nightmare, with too many interactions to analyze—both for you and an LLM.
DDMIN (Delta Debugging Minimization) provides a mechanical solution. It reduces a failing test set to the smallest subset that still reproduces the failure, narrowing down thousands of tests to just “these two.” Introduced by Zeller in 1999, it has been a standard in compiler fuzzing for decades, yet it remains under-discussed in everyday software engineering. With the rise of coding agents expanding our test surfaces, its relevance is greater now than ever.
I am integrating DDMIN into my CI pipeline so that when a suite fails, the minimization process runs automatically before any manual investigation begins. This approach saves hours of debugging time.
This tool is worth considering if your test suites are growing faster than your debugging capabilities.
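For the curious, here is a compact, complement-only variant of Zeller's ddmin sketched in Python. Zeller's full algorithm also tests the subsets themselves; and the oracle here is a toy stand-in for "run this subset of the suite and check whether it still fails."

```python
def ddmin(items, fails):
    """Shrink `items` to a small subset for which fails(subset) is still True.
    Complement-only variant of Zeller's 1999 delta debugging minimization."""
    n = 2
    while len(items) >= 2:
        chunk = max(1, len(items) // n)
        subsets = [items[i:i + chunk] for i in range(0, len(items), chunk)]
        reduced = False
        for s in subsets:
            complement = [x for x in items if x not in s]
            if fails(complement):              # failure persists without chunk s
                items, n = complement, max(n - 1, 2)
                reduced = True
                break
        if not reduced:
            if n >= len(items):                # already at single-item granularity
                break
            n = min(len(items), n * 2)         # try finer-grained chunks
    return items

# Toy oracle: the suite "fails" whenever polluter and victim are both present.
tests = [f"test_{i}" for i in range(200)]
fails = lambda subset: "test_13" in subset and "test_170" in subset
print(ddmin(tests, fails))    # -> ['test_13', 'test_170']
```

Each oracle call is one (re)run of a test subset, so the cost is a logarithmic-ish number of suite runs rather than the combinatorial manual search.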
Can deterministic production systems be built on probabilistic models?
Over the past several months, I’ve been experimenting with structured LLM workflows in systems that require strict validation and multi-step execution.
The issue I keep encountering isn’t randomness.
It’s structural:
• Schema violations that look “almost correct”
• Subtle API mismatches across steps
• Context drift between planning and execution
• Silent contract breaks that surface only during verification
LLMs are powerful pattern generators.
But they are not contract-enforcing systems.
Production systems rely on invariants.
Probabilistic models don’t guarantee them.
So I’ve been exploring a different approach:
What if the LLM operated inside a deterministic execution boundary?
• Explicit contracts for every step
• Strict schema validation
• Immediate fail-fast on violations
• No interpretation layer
• Auditable state transitions
In this model, the LLM isn’t the authority — it’s a constrained transformer inside a controlled pipeline.
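A minimal sketch of such a boundary, with an invented contract (the `action`/`path`/`content` schema and allowed actions are illustrative, not a real spec): the validator either returns a fully conforming step or raises immediately, with no interpretation layer smoothing over "almost correct" output.

```python
import json

# Hypothetical per-step contract: required keys and their types.
CONTRACT = {"action": str, "path": str, "content": str}
ALLOWED_ACTIONS = {"create", "update"}

def validate_step(raw: str) -> dict:
    """Fail fast: the LLM's output must satisfy the contract exactly."""
    step = json.loads(raw)                     # malformed JSON raises here
    if set(step) != set(CONTRACT):
        raise ValueError(f"schema violation: keys {sorted(step)}")
    for key, typ in CONTRACT.items():
        if not isinstance(step[key], typ):
            raise ValueError(f"schema violation: {key} must be {typ.__name__}")
    if step["action"] not in ALLOWED_ACTIONS:
        raise ValueError(f"contract break: action {step['action']!r}")
    return step
```

The LLM proposes; the boundary disposes. Every state transition is either a validated dict or a logged exception, which is what makes the pipeline auditable.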
Curious how others handle contract integrity in real-world LLM systems.
#LLM #AIEngineering #DevTools #MLOps #SoftwareArchitecture
👇 Written by me, not an LLM.
I've recently built llm-contract-harness: a contract-enforced execution harness for LLM-driven code changes.
https://lnkd.in/e3XFB4cV
It’s a two-stage pipeline:
- Planner (LLM): turns a plain-text spec into explicit work orders (.json intermediate representation) with pre/postconditions, allowed files, and acceptance commands. It retries up to N times using structured validation errors.
- Factory (LLM loop): executes each work order via SE --> TR --> PO, but deterministic gates decide what’s allowed: path/scope safety, hash-before-write, atomic writes, verify + acceptance commands, and rollback on failure.
Design philosophy: LLM outputs are stochastic; enforcement is deterministic. A loud FAIL isn't "it doesn't work" — it's the harness refusing to proceed without a safe, verifiable state.
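The hash-before-write plus atomic-write gate can be sketched like this. The function name and signature are mine for illustration, not the harness's actual API; the idea is that a write is refused unless the file still matches the hash recorded at planning time.

```python
import hashlib
import os
import tempfile

def guarded_write(path: str, expected_sha256: str, new_content: str) -> None:
    """Deterministic gate: verify the on-disk file is unchanged since the
    work order was planned, then replace it atomically (sketch only)."""
    with open(path, "rb") as f:
        actual = hashlib.sha256(f.read()).hexdigest()
    if actual != expected_sha256:
        # Loud FAIL: refuse to proceed from an unverified state.
        raise RuntimeError("hash-before-write failed: file changed since planning")
    fd, tmp = tempfile.mkstemp(dir=os.path.dirname(os.path.abspath(path)))
    with os.fdopen(fd, "w") as f:
        f.write(new_content)
    os.replace(tmp, path)   # atomic on POSIX: readers never see a partial file
```

Rollback then reduces to keeping the pre-image and re-running the same gate in reverse, since every write is all-or-nothing.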
Limits (currently): no semantic correctness guarantee, no OS sandboxing, and the planner isn’t repo-aware yet — so the demo seed prompts in ./examples are toy apps for now.
Interesting thing I learned when building this: LLM-generated code changes expressed as diffs are basically hopeless — even small diffs fall apart under LLM stochasticity.
verification is the bottleneck now = trust is expensive
small PRs are one thing
big PRs are where things break
blast radius changes everything
agents optimize locally: pass checks, ship changes.
but on large systems, that’s not enough.
because code is only a projection of the system.
architecture lives in boundaries, dependencies, and runtime behavior -
the parts you don’t see by skimming a diff.
so a big AI-generated PR can look “fine”:
build green
tests pass
type checks happy
and still drift the system in ways that show up later
that’s why the real skill now is risk-based review:
don’t try to read everything
decide where trust is required first
what’s your current threshold for “too big to review”?
#CodeReview #SoftwareArchitecture #AIEngineering
OUTCOME ENGINEERING.
The o16g Manifesto.
Some excerpts:
- The only truth is the rate of positive change delivered to the customer.
- If the outcome is worth the tokens, it gets built. Manage to cost, not capacity.
- Write code only when it brings joy.
- Never dispatch an agent without context.
- Debug the decision, not just the code.
- Scaling agents mirrors scaling people, but faster, weirder, and harder.
- Continuously audit the agent against the domain.
It was never about the code.
We are the architects of reality. Welcome to o16g.
https://o16g.com
Hallucination is not a model bug.
It is an architectural outcome.
Most teams still try to fix it at the prompt layer.
After deploying LLM backed features into support tooling, internal copilots, and retrieval systems, the pattern became consistent. The model was behaving exactly as designed. The system around it was not.
Here is what actually drives hallucination:
1. Objective misalignment.
LLMs optimize for next token likelihood, not factuality. If your system rewards fluency over verification, you get confident nonsense. Actionable fix: add a verification step that scores answers against structured constraints before exposure.
2. Retrieval overconfidence.
RAG pipelines often retrieve top k documents and assume relevance equals correctness. It does not. Low recall or noisy chunks create synthetic synthesis. Actionable fix: enforce citation grounding. Reject answers that cannot point to specific spans.
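A minimal form of that rejection rule, assuming the model is asked to return the exact spans it relied on (the function and names are illustrative):

```python
def grounded(answer_spans: list[str], source: str) -> bool:
    """Citation grounding sketch: every span the model cites must appear
    verbatim in the retrieved source, or the answer is rejected."""
    return bool(answer_spans) and all(span in source for span in answer_spans)

doc = "The SLA guarantees 99.9% uptime, measured monthly."
assert grounded(["99.9% uptime", "measured monthly"], doc)
assert not grounded(["99.99% uptime"], doc)   # plausible, but not in the source
assert not grounded([], doc)                  # no citations at all -> reject
```

Production systems usually relax exact matching to normalized or fuzzy span matching, but the contract is the same: no span, no answer.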
3. Latent knowledge bleed.
When prompts mix retrieved context with open ended reasoning, the model blends internal priors with external data. I call this the Plausibility Gap Pattern. The output sounds consistent but is not anchored to source truth.
Architectural mitigation: isolate modes. First force extraction from context. Then allow reasoning strictly over extracted facts.
4. Missing uncertainty contracts.
Most systems never expose calibrated uncertainty. If you do not define abstain conditions, the model will not invent them. Actionable fix: design an explicit abstention policy with thresholded confidence scoring.
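A sketch of such an abstention contract; the threshold value and the confidence source are placeholders you would calibrate per task against the cost of a wrong answer:

```python
ABSTAIN_THRESHOLD = 0.75   # illustrative; tune per task and error cost

def answer_or_abstain(answer: str, confidence: float) -> str:
    """Explicit abstention policy: below the threshold the system declines
    instead of emitting a fluent guess."""
    if confidence < ABSTAIN_THRESHOLD:
        return "ABSTAIN: confidence too low, escalate to a human"
    return answer

assert answer_or_abstain("Paris", 0.93) == "Paris"
assert answer_or_abstain("Paris", 0.40).startswith("ABSTAIN")
```

The key design decision is that abstention is a defined output of the system, not an emergent behavior you hope the model discovers.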
None of this eliminates hallucination completely.
Tighter constraints increase latency and cost. Strong grounding can reduce generative flexibility. Over aggressive abstention frustrates users.
Reliability in LLM systems is a tradeoff surface, not a toggle.
Hallucination is rarely solved by better prompts. It is reduced by better system design.
Where in your stack does plausibility outrun verification, and what architectural guardrail did you add to control it?
#AIEngineering #AIOrchestration #LLMs #SystemsDesign
Thank you William Gallagher for the inspiration. NotebookLM for the debate and Audiogram for the media.