Coding Challenges - Does AI Write Good Code? Let's Find Out.
AI is changing software engineering. AI can write code faster than you or I can. That's exciting, but it creates a new problem: just because code works doesn't mean it's good. How do you know if what your LLM generated is secure, maintainable, and ready for production?
There are two things you can do. Firstly, follow industry research, like Sonar’s LLM Leaderboard, which looks at the quality, security, complexity, and maintainability of the code created using the leading LLMs. It’s well worth a read to understand the strengths and weaknesses of the models. I found it particularly eye-opening to see that GPT-5.2 High generates around 50% more code than Opus 4.5 for the same tasks, and Opus 4.5 still generates more than twice as much code as Gemini 3 Pro! I know which codebase I’d rather be responsible for!
Secondly, there are many tools we can leverage to evaluate aspects of code quality, maintainability and security. They include compilers, type checkers, linters, and automated code review tools like SonarQube. In today’s Coding Challenge we’re going to look at how we can leverage them to guide and evaluate AI when building software.
Step Zero
In this step your goal is to pick a Coding Challenge, a technology stack, and an AI coding agent of your choice. If you primarily use Copilot at work, consider trying Amp Code; if you mainly use Claude, try Copilot. In short, try a different coding agent and learn something new.
Step 1
In this step your goal is to build a solution to one of the Coding Challenges using your favourite agent / LLM. I’ll go into more detail on how to leverage AI agents in a future newsletter, but for now I suggest prompting the agent to tackle one step of the Coding Challenge at a time. Clear the context window between steps, and whenever it starts to fill up or the agent begins to hallucinate.
Once your solution is complete, head to step 2 to start leveraging tools to assess the quality and security of the code produced by your AI.
Step 2
In this step your goal is to prompt your agent to review the code quality using the compiler, code formatter and linter appropriate to your programming language and stack.
For example, if you’re using Python, run checks with ruff, ty, pyrefly or pyright. If you’re using JavaScript, switch to TypeScript 😇. If you’re using Rust, use clippy; for Go, check out golangci-lint. You get the idea.
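If you take the Go route, a minimal golangci-lint configuration covering the kinds of issues this challenge surfaces might look like the sketch below (a `.golangci.yml` in the repo root; these linter names exist in current golangci-lint releases, but check its docs for your version):

```yaml
linters:
  enable:
    - gocyclo   # flag functions whose cyclomatic complexity is too high
    - goconst   # find repeated string literals that could be constants
    - gocritic  # a broad set of style and correctness checks
```

With this in place, `golangci-lint run` will report the complexity and duplicated-literal issues before SonarQube ever sees the code.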
Step 3
In this step your goal is to install the SonarQube MCP Server and use it to evaluate the code created by your agent.
SonarQube is an automated code review solution that performs deep code analysis to detect bugs, security vulnerabilities, and quality issues that simple linters miss. It can also be used to enforce rules for maintainability, reliability, and security before merging. It’s been widely used in the organisations I’ve worked for, and I’ve often been grateful for the feedback it gave my teams. So for this step you’re going to use it, via the MCP server, to analyse the code your agent has produced.
To get started, see Sonar’s instructions on how to install their MCP server. I found the quickstart guide particularly useful and chose to run the server as a container in Docker. I then configured my agent to use SonarQube via the MCP server.
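For reference, agent-side MCP configuration usually looks something like the JSON below. The `mcpServers` / `command` / `args` shape is the standard MCP client format, but the image name, arguments, and environment variable here are placeholders — take the real values from Sonar’s quickstart guide:

```json
{
  "mcpServers": {
    "sonarqube": {
      "command": "docker",
      "args": ["run", "-i", "--rm", "-e", "SONARQUBE_TOKEN", "mcp/sonarqube"],
      "env": { "SONARQUBE_TOKEN": "<your-token>" }
    }
  }
}
```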
Once you have it installed, check with your agent to verify the MCP server is configured. For example, with Copilot run:
/mcp show
You should then see some output that includes something like this:
Configured MCP servers: sonarqube
MCP Server Configuration:
sonarqube (local): Command: docker
Once you have that done, ask your agent to review the project with SonarQube, for example:
Review this code with SonarQube
I tried it on the Monkey compiler project I did last year and got the following output:
Total Issues Found: 40
Issue Breakdown by Severity:
- CRITICAL: 37 issues
- MAJOR: 1 issue
- MINOR: 2 issues
Key Issues Identified:
1. High Cognitive Complexity (6 issues):
- compiler/compiler.go:72 - Complexity: 96 (allowed: 15)
- vm/vm.go:72 - Complexity: 69 (allowed: 15)
- vm/vm_test.go:66 - Complexity: 34 (allowed: 15)
- evaluator/evaluator.go:15 - Complexity: 25 (allowed: 15)
- compiler/symbol_table_test.go:166 - Complexity: 16 (allowed: 15)
- compiler/compiler_test.go:649 - Complexity: 16 (allowed: 15)
2. Duplicated String Literals (14 issues): Various test files contain repeated error message strings that should be defined as constants.
3. Empty Functions (13 issues): Multiple empty functions in ast/ast.go lack explanatory comments about why they're intentionally empty.
4. Other Issues:
- Empty code block in compiler/compiler_test.go:661
- Interface naming convention in object/object.go:72
- Parameter grouping in compiler/compiler.go:353
The most critical issues are the high cognitive complexity in the compiler and VM modules, which could impact maintainability.
You can then use the agent to explore the issues in more depth.
Step 4
In this step your goal is to pick out three of the points that SonarQube highlights and investigate them. Review the code yourself and either fix it, or prompt your agent to fix the code, including the output from SonarQube in your prompt.
I did this by prompting the agent for the highest priority issues:
Show me the three highest priority issues
Which gave me:
Top 3 Highest Priority Issues
1. compiler/compiler.go:72 - Cognitive Complexity: 96 ⚠️ CRITICAL
Function: Compile(node ast.Node)
Problem: This massive switch statement handles all AST node compilation with deeply nested conditionals and error handling. The complexity is 6.4x over the allowed limit (96 vs 15).
Why it matters: This is the core compiler function - extremely difficult to understand, test, and modify. Each new language feature adds more nesting.
Recommendation: Refactor using the strategy pattern - extract each case into separate compilation methods:
func (c *Compiler) compileProgram(node *ast.Program) error { ... }
func (c *Compiler) compileInfixExpression(node *ast.InfixExpression) error { ... }
func (c *Compiler) compileIfExpression(node *ast.IfExpression) error { ... }
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------
2. vm/vm.go:72 - Cognitive Complexity: 69 ⚠️ CRITICAL
Function: Run()
Problem: Main VM execution loop with a giant switch statement for all opcodes. Complexity is 4.6x over the limit (69 vs 15).
Why it matters: This is the VM's heart - any bug here affects runtime execution. The nested switch cases, error checks, and instruction pointer manipulation make it hard to reason about.
Recommendation: Extract opcode handlers into separate methods:
func (vm *VM) handleOpConstant(ins code.Instructions, ip int) error { ... }
func (vm *VM) handleOpJump(ins code.Instructions, ip int) error { ... }
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------
3. vm/vm_test.go:66 - Cognitive Complexity: 34 ⚠️ CRITICAL
Function: Test helper function
Problem: Test code with complexity 2.3x over the limit (34 vs 15), likely due to nested loops validating complex test cases.
Why it matters: Complex test code is harder to maintain and can hide bugs in the tests themselves.
Recommendation: Break into smaller helper functions for specific validation types (integers, booleans, arrays, etc.).
Impact: These three functions represent the core of your interpreter - the compiler, VM, and tests. High complexity makes maintenance risky and onboarding difficult.
Look at the issues you find and either fix them by hand or have the agent fix them. Then use the agent and SonarQube to review the fixes; it’s a great way to improve your code quality.
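The typed test-helper refactor recommended for vm_test.go might look like this sketch. The Integer and Boolean types here are stand-ins for Monkey's object package; each helper validates one expectation and returns a descriptive error instead of growing the parent test's nesting.

```go
package main

import "fmt"

type Object interface{}

// Toy stand-ins for object.Integer and object.Boolean.
type Integer struct{ Value int64 }
type Boolean struct{ Value bool }

// testIntegerObject checks a single integer expectation; the parent
// test just calls it and reports any error.
func testIntegerObject(expected int64, actual Object) error {
	result, ok := actual.(*Integer)
	if !ok {
		return fmt.Errorf("object is not Integer. got=%T", actual)
	}
	if result.Value != expected {
		return fmt.Errorf("wrong value. got=%d, want=%d", result.Value, expected)
	}
	return nil
}

func testBooleanObject(expected bool, actual Object) error {
	result, ok := actual.(*Boolean)
	if !ok {
		return fmt.Errorf("object is not Boolean. got=%T", actual)
	}
	if result.Value != expected {
		return fmt.Errorf("wrong value. got=%t, want=%t", result.Value, expected)
	}
	return nil
}

func main() {
	fmt.Println(testIntegerObject(5, &Integer{Value: 5})) // prints: <nil>
	fmt.Println(testIntegerObject(5, &Boolean{Value: true}))
}
```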
Step 5
In this step your goal is to reflect on how to use LLMs and how tools like compilers, linters and SonarQube can help you improve the quality of the software you build.
I recommend adding them to your normal agentic coding process. For me that means incorporating them into the prompt for each task, with something like this:
After reading the specification:
1. Create a set of tests to verify the implementation behaves correctly.
2. Then create the code required to meet the specification.
3. Verify the functionality is correct using the tests.
4. Verify the code lints and passes quality checks with no warnings or errors.
My AGENTS.md usually defines how to run the linter and quality checks for the project.
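As an illustration, the relevant section of such an AGENTS.md for a Go project might read as follows (the commands assume standard Go tooling plus golangci-lint; adapt them to your own stack):

```markdown
## Quality checks

Run all of these after every change; each must pass with no warnings:

- `go build ./...`    — the project must compile
- `go test ./...`     — all tests must pass
- `gofmt -l .`        — must print nothing (no unformatted files)
- `golangci-lint run` — no lint findings allowed
```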
Going Further
Review the LLM Leaderboard that Sonar created to provide transparency into how models build code, not just what they build. By running thousands of AI-generated solutions through SonarQube, they evaluated the models on the metrics that matter to engineering leaders: security, reliability, maintainability, and complexity.
To generate the leaderboard, Sonar analysed code quality from leading AI models (GPT-5.2 High, GPT-5.1 High, Gemini 3 Pro, Opus 4.5 Thinking, and Claude Sonnet 4.5).
It was interesting to see that while these models pass functional benchmarks well, they have significant differences in code quality, security, and maintainability.
Higher-performing models tend to generate more verbose and complex code. For example:
- Opus 4.5 Thinking leads with an 83.62% pass rate but generates 639,465 lines of code (more than double that of the less verbose models).
- Gemini 3 Pro achieves similar performance (81.72%) with much lower complexity and verbosity.
- GPT-5.2 High hits 80.66% pass rate but produces the most code (974,379 lines) and shows worse maintainability than GPT-5.1.
I found it particularly interesting to see that Gemini produced only 289k lines. That’s a lot less code to review and maintain!
Many thanks to Sonar for sponsoring this issue of Coding Challenges.