Agentic Optimization and the Local Minimum Problem
TLDR: I gave agents a GPU optimization loop with real benchmarks and correctness gates. In 11 passes they took the CUDA backend from roughly 11× slower than the CPU baseline to a 2.107× speedup. Then architecture set the ceiling: at large grids, CUDA kernels were 2% of total runtime, and the extension interface assumed CPU-visible fields. The agents also hit their own ceiling: incoherent proposals and brittle implementations once the transforms crossed enough files. Local optimization is cheap now. Architecture is still a human decision.
---
I gave agents an optimization loop, real GPU hardware, and a benchmark that could not lie.
The early results were surprisingly good.
Agents found wins in thread mapping, memory access patterns, synchronization, and coefficient layout. These weren't toy examples; they were CUDA changes running through correctness gates and benchmark jobs on a university GPU cluster.
The speedup curve bent upward.
Then two walls appeared.
The first was the application architecture. At small grids, the CUDA backend reached a 2.107× speedup over the CPU baseline. At large grids, the GPU kernels were doing their job, but the openEMS extension interface was still shaped around CPU-visible field access. At 500³, CUDA kernels were about 2% of the measured timestep. Amdahl's law, in its most practical form.
The second wall was the agents themselves. As the easy wins ran out, the proposals got more architectural: often incoherent, increasingly brittle to implement.
Both walls taught the same lesson:
Agentic optimization is genuinely useful inside a bounded, verifiable design space. It finds the local best point quickly. It cannot move you to the globally correct architecture, and it has a comprehension ceiling for deep multi-file transformations.
Local optimization is cheap now. Architecture is not.
The Problem and the Bet
The project started from a radar cross-section simulation problem. I had previously evaluated openEMS as a radar cross-section solver for whole aircraft scale simulations at realistic wavelength proportions. See: Report — RCS Simulations with openEMS
I found that, to be a useful solution, it needed about a 20× speedup in computational efficiency plus multi-node scalability. I thought a GPU kernel for it might be a good way to achieve that speedup.
Electromagnetic simulation at scale gets expensive fast. Finite-difference time-domain methods (FDTD) are computationally regular enough to look attractive for GPU acceleration, but real solvers aren't just kernels. They're application architectures, extension interfaces, data ownership models, boundary conditions, probes, and years of assumptions about where memory lives.
openEMS is an open-source FDTD solver. The bet was straightforward:
Implement a CUDA backend as a drop-in engine, wire an agentic optimization loop around it, and see how far disciplined iteration could go within the existing architecture.
The goal wasn't just "make it faster." It was to learn what kind of problem agentic engineering is actually good at. If the agent has a source of truth, an action surface, correctness checks, and benchmark feedback, can it autonomously search the optimization space?
The answer was yes.
And then the more interesting answer was: only up to the boundary of the design space I gave it.
The Optimization Loop
This was not vibe coding.
The piece Beyond Vibe Coding into Agentic Engineering captures the distinction well: vibe coding is surrendering to the flow; agentic engineering is orchestrating agents with accountability at every handoff. The loop here had that structure:
1. The agent wrote a proposal: hypothesis, approach, expected impact, risks, success criteria.
2. I reviewed the proposal and approved or rejected the scope. Other agents, running on different models, also reviewed it and contributed their input.
3. If approved, the agent entered a freerun: implement, build, test, collect evidence within the approved scope.
4. Benchmarks ran through a controlled Euler gateway. Correctness gate first, performance gate second.
5. The branch was merged or rejected based on evidence.
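The ordering of the gates in the steps above can be sketched in a few lines. This is a minimal illustration, not the project's actual tooling: the `Evidence` fields, the `Decision` names, and the `min_gain` threshold are all hypothetical.

```python
from dataclasses import dataclass
from enum import Enum


class Decision(Enum):
    REJECT_SCOPE = "reject_scope"
    REJECT_CORRECTNESS = "reject_correctness"
    REJECT_PERFORMANCE = "reject_performance"
    MERGE = "merge"


@dataclass
class Evidence:
    scope_approved: bool      # outcome of the human review of the proposal
    correctness_passed: bool  # outcome of the correctness gate
    baseline_ms: float        # benchmark time before the change
    candidate_ms: float       # benchmark time after the change


def decide(evidence: Evidence, min_gain: float = 0.0) -> Decision:
    """Apply the gates in order: scope, then correctness, then performance."""
    if not evidence.scope_approved:
        return Decision.REJECT_SCOPE
    if not evidence.correctness_passed:
        # Timing is never consulted for an incorrect result.
        return Decision.REJECT_CORRECTNESS
    gain = evidence.baseline_ms / evidence.candidate_ms - 1.0
    if gain <= min_gain:
        return Decision.REJECT_PERFORMANCE
    return Decision.MERGE
```

The point of the shape is that a candidate that is ten times faster but fails correctness never reaches the timing comparison at all.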
That structure mattered because performance work lies constantly.
A faster benchmark means nothing if the field update is wrong. A clever kernel means nothing if it changes a boundary condition. A pretty speedup plot means nothing if the test no longer exercises the same simulation. Correctness had to come before timing.
CodeRabbit's State of AI vs Human Code Generation report found that AI-generated pull requests contained roughly 1.7× more issues overall than human-authored ones, with correctness issues the most common category. That finding was a useful anchor. If correctness isn't gated independently of the agent's own judgment, benchmark numbers don't mean what they appear to.
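To make "gated independently of the agent's own judgment" concrete, here is a sketch of the kind of check a correctness gate could run: compare field samples from a trusted reference run against the candidate run. The function name and the tolerance values are assumptions for illustration; the real gate's criteria are not described here.

```python
import math
from typing import Sequence


def fields_match(reference: Sequence[float], candidate: Sequence[float],
                 rel_tol: float = 1e-5, abs_tol: float = 1e-12) -> bool:
    """Return True only if every candidate sample agrees with the reference."""
    if len(reference) != len(candidate):
        return False  # a different number of samples is a different simulation
    return all(
        # Reject NaN/inf outright, then compare within the (hypothetical) tolerances.
        math.isfinite(c) and math.isclose(r, c, rel_tol=rel_tol, abs_tol=abs_tol)
        for r, c in zip(reference, candidate)
    )
```

The reference values come from a run the agent cannot modify, so the agent's opinion of its own change never enters the comparison.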
The agent was allowed to move quickly, but only inside a loop that could catch regressions.
The interesting pattern isn't "an AI wrote CUDA." It's an agentic engineering loop where proposal, action, correctness, benchmark, and merge decision are separate surfaces with separate owners. Deloitte's 2026 analysis of AI agent orchestration named exactly this multi-agent structure (human-approved gates, role-specific agents, coordinated workflows) as one of the core patterns emerging in enterprise autonomous systems. What I built for a university GPU project fit the same shape.
Autonomous Agents on a University GPU Cluster
At one point I was watching an agent write CUDA code and submit it to the university's compute cluster without me reading every generated line first.
That was a deliberate choice.
Reading every file before every benchmark would have killed the iteration speed. But blindly trusting the agent would have been irresponsible on shared HPC infrastructure.
So the trust boundary moved into the system design.
The agents couldn't run arbitrary cluster commands. They couldn't SSH in and improvise. Jobs went through a gateway that accepted only approved job shapes, enforced time and resource limits, and blocked anything resembling a login-shell command. The cluster interaction was Slurm-shaped by construction.
The agent could propose work. The gateway decided what could execute.
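A gateway of this kind can be sketched as an allowlist validator. Everything named below is hypothetical: the job kinds, the limits, and the argument pattern are illustrative stand-ins, not the values the real gateway used.

```python
import re
from dataclasses import dataclass

# Hypothetical allowlist and limits, chosen only for illustration.
ALLOWED_JOB_KINDS = {"build", "correctness_test", "benchmark"}
MAX_MINUTES = 30
MAX_GPUS = 1
# Arguments may contain only these characters: no spaces, quotes, or shell operators.
SAFE_ARG = re.compile(r"[A-Za-z0-9_.=-]+")


@dataclass(frozen=True)
class JobRequest:
    kind: str
    minutes: int
    gpus: int
    args: tuple = ()


def validate(job: JobRequest) -> list:
    """Return rejection reasons; an empty list means the job may be submitted."""
    reasons = []
    if job.kind not in ALLOWED_JOB_KINDS:
        reasons.append(f"job kind not allowed: {job.kind}")
    if not 0 < job.minutes <= MAX_MINUTES:
        reasons.append("time limit out of range")
    if not 0 < job.gpus <= MAX_GPUS:
        reasons.append("GPU count out of range")
    for arg in job.args:
        if not SAFE_ARG.fullmatch(arg):
            reasons.append(f"argument rejected: {arg!r}")
    return reasons
```

The design choice is to describe what is allowed rather than to enumerate what is forbidden: an argument such as `"x; ssh node"` fails the pattern without the validator needing to know what `ssh` is.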
This is the same principle for agents touching any serious system: don't give the model an open-ended action surface and hope it behaves. Build a channel where the allowed actions are the ones you're prepared to support.
Constrained action surfaces aren't a limitation on agentic engineering. They're what make it possible.
The Arc: Easy Wins, Diminishing Returns, Two Walls
The initial CUDA backend was much slower than the CPU baseline, which was not surprising. A direct port often preserves the wrong assumptions. The first version mostly proves the code path can exist.
Then the agents started finding real improvements.
Thread/block mappings. Memory access behavior. Coefficient layouts. Synchronization and data movement costs. Some proposals worked. Some regressed. When a proposal regressed, the benchmark caught it and the branch was rejected.
That rejection loop wasn't wasted effort. It was search. Every failed direction made the local design space smaller.
Some regressions I forced through anyway, like a transition to Unified Virtual Memory (UVM), because my understanding of the broader architecture told me there was probably a better local minimum at the end of that path.
The system moved from roughly 11× slower than the CPU baseline to a 2.107× speedup on the small-grid benchmark. That's a meaningful result: it showed that agents could abstract away much of the implementation grind that normally makes GPU optimization slow.
But after the easy wins, the proposal distribution changed.
Instead of local CUDA improvements, the next promising ideas were architectural: sparse field-transfer schemes, finer-grained host/device synchronization, different ownership boundaries between CPU extension code and GPU field buffers. The agents could describe these changes. They could produce plausible plans. The implementations were brittle.
The two walls appeared together.
Wall 1: Application Architecture
The scaling data made the first wall obvious.
At 500³, kernels took about 67 ms per step. The full CUDA timestep was about 3,284 ms. Kernel fraction: about 2%.
If only 2% of the measured timestep is kernel work, infinitely fast kernels improve the full timestep by about 2%. That's a brutal result but it's clarifying.
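The arithmetic behind that statement is Amdahl's law: if a fraction p of the runtime is sped up by a factor s, the overall speedup is 1 / ((1 - p) + p / s), which approaches 1 / (1 - p) as s grows without bound. A short check using the two 500³ measurements quoted above (the function and variable names are mine; the factor of 10 is just an example):

```python
def amdahl_speedup(accelerated_fraction: float, factor: float) -> float:
    """Overall speedup when `accelerated_fraction` of the runtime gets `factor` times faster."""
    return 1.0 / ((1.0 - accelerated_fraction) + accelerated_fraction / factor)


kernel_ms = 67.0      # kernel time per step at 500^3, from the text
timestep_ms = 3284.0  # full CUDA timestep at 500^3, from the text

kernel_fraction = kernel_ms / timestep_ms  # roughly 0.02
ceiling = 1.0 / (1.0 - kernel_fraction)    # limit as the kernel speedup goes to infinity

print(f"kernel fraction: {kernel_fraction:.1%}")
print(f"overall speedup with 10x faster kernels: {amdahl_speedup(kernel_fraction, 10):.4f}x")
print(f"overall speedup ceiling with free kernels: {ceiling:.4f}x")
```

Even the ceiling stays near 1.02×, which is why further kernel tuning at this grid size cannot move the full timestep in any meaningful way.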
The GPU was doing useful work. The kernel wasn't the dominant bottleneck. The application interface around the kernel was.
openEMS wasn't designed around GPU-resident fields. The extension interface assumes CPU-visible field access. That means the CUDA backend can make the Yee update fast and still lose at the application level if the surrounding system keeps forcing synchronization and host-visible access patterns.
At that point, kernel optimization stops being the main lever.
The architecture is the lever.
Wall 2: Agent Comprehension
The second wall wasn't in openEMS. It was in the agents.
The agents were good at bounded, measurable changes: memory coalescing, thread mapping, kernel launch configuration, data layout tweaks. Hard enough to matter, local enough to verify.
They struggled with transformations that required holding a large application in mind across several files and invariants.
The hard proposals weren't nonsense, and that was part of what made the wall interesting. The written plans often sounded reasonable. The agent could explain the synchronization problem. It could identify why CPU extension callbacks were expensive. It could propose a new field-transfer strategy.
But implementation is where the abstraction broke down.
Changes crossing the CPU extension dispatcher, GPU buffers, correctness invariants, and openEMS extension protocol tended to fail in subtle ways. The correctness gate caught those failures. The agent often couldn't repair them reliably within a single freerun.
The agent didn't fail because it was useless. It failed because the task had moved from local optimization to architectural redesign. The context graph was larger, the invariants more implicit, the cost of a wrong assumption much higher.
Both Walls Point to the Same Thing: A Local Minimum Problem
In optimization, a local minimum is a point that looks optimal from where you're standing (better than every neighboring position) but isn't the best point that exists. The problem isn't that you searched badly. The problem is that you started in the wrong basin.
The graphic above shows what happened in this project. Start A, the existing openEMS CUDA backend, sits at a moderate position in the design space. It's a reasonable place to begin: there's an existing solver, it mostly works, it has a known performance profile. Agentic search descends from there, finding improvements at each step. It reaches the local optimum: 2.107×.
But the global optimum (a GPU-native solver built from scratch, with accelerator-resident fields and distributed memory from the start) sits in a completely different region. Start B looks worse at first: higher initial cost, more implementation work, no existing architecture to build on. That's exactly why it doesn't get chosen. But the basin it sits in leads somewhere the current design space can never reach.
The architecture wall in the graphic is real. It's not a matter of searching harder or running more iterations. The agents were given a design space to search. The global optimum was outside it.
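The Start A / Start B picture can be reproduced with a toy model. The cost function below is an arbitrary double-well curve I picked for illustration (lower cost is better); it has nothing to do with the real solver, and the starting points are hypothetical.

```python
def cost(x: float) -> float:
    """Toy double-well cost: a shallow basin near x = +1, a deeper one near x = -1."""
    return (x * x - 1.0) ** 2 + 0.3 * x


def local_search(x: float, step: float = 0.01, max_iters: int = 10_000) -> float:
    """Greedy descent: move to a neighboring point only if it lowers the cost."""
    for _ in range(max_iters):
        best = min((x - step, x + step), key=cost)
        if cost(best) >= cost(x):
            return x  # no neighboring move improves: a local minimum
        x = best
    return x


start_a, start_b = 1.4, -1.8  # hypothetical starts: A looks better than B at first
end_a, end_b = local_search(start_a), local_search(start_b)

print(f"Start A: cost {cost(start_a):.3f} -> {cost(end_a):.3f} at x = {end_a:.2f}")
print(f"Start B: cost {cost(start_b):.3f} -> {cost(end_b):.3f} at x = {end_b:.2f}")
```

Start B begins with the higher cost and still ends lower, because the same search procedure was run in a different basin. Running more iterations from Start A changes nothing: the search stops where no neighboring move helps.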
This is the deeper reason why the second wall exists at all. An agentic loop optimizing code is more capable than gradient descent: it can reason about structure, form hypotheses, propose non-local changes, understand why certain paths are expensive. But even that extra capability is bounded by the design space it was given to explore. Better reasoning about local moves doesn't help when the problem is which basin you started in.
The two walls aren't separate failures. They're two symptoms of a single root cause: the search was well-executed inside a design space that didn't contain the answer.
The Architectural Conclusion
The agentic loop earned its place.
It found local improvements faster than manual iteration would have. It created evidence. It made the ceiling visible. Running a dozen experiments in the time it would take to manually write and tune a few is a real advantage.
But the global result was bounded by a constraint outside the optimization space.
The system was trying to fit a GPU backend into an application interface still assuming CPU-visible field access. Once that became the dominant cost, the correct next step wasn't "write a better kernel." It was "change the architecture around the kernel."
Three possible directions:
1. Keep patching the openEMS extension interface: incremental wins are possible, but the access-pattern assumption stays the same. Not great, because I'm already close to the limit Amdahl's law allows.
2. Move to unified memory platforms (Grace Blackwell/NVLink-class): this reduces transfer pain but doesn't remove the deeper synchronization pattern. It would come back to bite me once my problems scaled to the point of needing multi-node execution.
3. Build a GPU-native EM/RCS solver designed around accelerator-resident fields and distributed memory from the start. Probably the right solution, but an EM solver architected for distributed compute is exceedingly difficult to build from scratch.
The third path is probably right. It's also a different project.
The uncomfortable lesson:
In a world where porting code is becoming cheap, architecture choice becomes the scarce skill.
Agents can make it much cheaper to explore a local implementation space. That doesn't mean the local space is the right one.
What This Means for Agentic Engineering
The useful version isn't "let the model do everything."
It's:
- define the design space
- define the truth sources
- constrain the action surface
- gate correctness before performance
- reject regressions without drama
- use the data to decide whether the current architecture still deserves more optimization
That last step matters most.
Agents can make local optimization so cheap that it becomes tempting to keep optimizing long after the architecture has stopped deserving it. The benchmark has to be used not only to find improvements, but to decide when the search space itself is wrong.
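One way to turn that into a routine check is to ask, after every benchmark, what the best case for more kernel work would be. The sketch below assumes only a kernel time and a full-timestep time are available; the function name and the `min_ceiling` threshold are hypothetical.

```python
def next_lever(kernel_ms: float, timestep_ms: float, min_ceiling: float = 1.5) -> str:
    """Pick the next lever from a measured timing breakdown.

    `min_ceiling` is a hypothetical threshold: the smallest best-case overall
    speedup that would still justify more kernel-level work.
    """
    if not 0.0 <= kernel_ms < timestep_ms:
        raise ValueError("kernel time must be non-negative and below the full timestep")
    # Best case for kernel work: kernel time drops to zero, everything else stays.
    ceiling = timestep_ms / (timestep_ms - kernel_ms)
    if ceiling >= min_ceiling:
        return "keep optimizing kernels"
    return "revisit the architecture"
```

With the 500³ measurements from this project, `next_lever(67.0, 3284.0)` returns `"revisit the architecture"`, since even free kernels would leave the timestep almost unchanged.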
In this project, the benchmark told a clear story: the CUDA backend worked, the agentic loop worked, the kernel optimization worked. The architecture didn't scale the way a GPU-native solver would need to scale.
That's not a failed project. That's a useful negative result.
Closing
Three things I'd take from this to any agent-driven optimization project:
Gate on correctness before you gate on performance. Otherwise the agent will happily optimize the wrong thing.
Design the action surface before you need it. Especially if agents can touch shared infrastructure, CI, cloud resources, or production systems. Don't give agents too much authority.
Know the difference between the optimization space you're searching and the architecture that defines its ceiling. The agents will find the local best point. They may not tell you when the space itself is wrong.
The RCS problem is still open. The evidence points toward a GPU-native solver designed around accelerator-resident fields and distributed parallelism, not a CPU-oriented extension model with CUDA attached after the fact.
For the workflow side of this story (the course policy, attribution discipline, scaffolded agent workspace, and learning-boundary problem) there's a companion education article.
Question: Has your team hit a wall where kernel speed stopped becoming application speed? What was the architectural constraint, and did you discover it before or after optimizing into it?
---
## References
- [CodeRabbit State of AI vs Human Code Generation report (Dec 2025)](https://www.coderabbit.ai/blog/state-of-ai-vs-human-code-generation-report)
- [Beyond Vibe Coding into Agentic Engineering, DEV Community](https://dev.to/aws-builders/beyond-vibe-coding-into-agentic-engineering-5fef)
- [Unlocking exponential value with AI agent orchestration, Deloitte Insights (2026)](https://www.deloitte.com/us/en/insights/industry/technology/technology-media-and-telecom-predictions/2026/ai-agent-orchestration.html)
- [openEMS FDTD project repo](https://github.com/lpurdy01/repo759_public)
- [Full technical report: RCS / GPU openEMS](https://lpurdy01.github.io/rcs/gpu_openems_report.html)
- [FinalProject directory in the project repo](https://github.com/lpurdy01/repo759_public/tree/main/FinalProject)
- [Companion education article](https://www.linkedin.com/pulse/agentic-learning-real-course-levi-purdy-5gs3c)