Coding Challenges - Does AI Write Good Code? Let's Find Out.

AI is changing software engineering. AI can write code faster than you or I can. That's exciting, but it creates a new problem: just because code works doesn't mean it's good. How do you know if what your LLM generated is secure, maintainable, and ready for production?

There are two things you can do. Firstly, follow industry research, like Sonar’s LLM Leaderboard, which looks at the quality, security, complexity, and maintainability of the code created using the leading LLMs. It’s well worth a read to understand the strengths and weaknesses of the models. I found it particularly eye-opening to see that GPT-5.2 High generates around 50% more code than Opus 4.5 for the same tasks, and Opus 4.5 still generates more than twice as much code as Gemini 3 Pro! I know which codebase I’d rather be responsible for!

Secondly, there are many tools we can leverage to evaluate aspects of code quality, maintainability and security. They include compilers, type checkers, linters, and automated code review tools like SonarQube. In today’s Coding Challenge we’re going to look at how we can leverage them to guide and evaluate AI when building software.

Step Zero

In this step your goal is to pick a Coding Challenge, technology stack and AI coding agent of your choice. If you primarily use Copilot at work, consider trying Amp Code; if you mainly use Claude, try Copilot. In short, try a different coding agent and learn something new.

Step 1

In this step your goal is to build a solution to one of the Coding Challenges using your favourite agent / LLM. I’ll go into more detail on how to leverage AI agents in a future newsletter, but for now I suggest prompting the agent to tackle one step of the Coding Challenge at a time. Clear the context window between steps, and do the same whenever the context starts to fill up or the agent begins to hallucinate.

Once your solution is complete, head to step 2 to start leveraging tools to assess the quality and security of the code produced by your AI.

Step 2

In this step your goal is to prompt your agent to review the code quality using the compiler, code formatter and linter appropriate to your programming language and stack.

For example, if you’re using Python, run checks with ruff, ty, pyrefly or pyright. If you’re using JavaScript, switch to TypeScript 😇. If you’re using Rust, use clippy; for Go, check out golangci-lint. You get the idea.
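Whichever stack you pick, it helps to wrap all the checks in one command the agent can run repeatedly. As an illustration only, a minimal Makefile for a Go project might look like this (the targets and tool names here are my own sketch — swap in ruff, tsc, clippy, or whatever fits your stack):

```makefile
# Hypothetical Makefile sketch for a Go project.
# One entry point the agent can run after every change.
check: lint vet test

lint:
	golangci-lint run ./...

vet:
	go vet ./...

test:
	go test ./...
```

With this in place, "run make check and fix everything it reports" becomes a reusable line in your prompts.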

Step 3

In this step your goal is to install the SonarQube MCP Server and use it to evaluate the code created by your agent.

SonarQube is an automated code review solution that performs deep code analysis to detect bugs, security vulnerabilities, and quality issues that simple linters miss. It can also be used to enforce rules for maintainability, reliability, and security before merging. It’s been widely used in the organisations I’ve worked for, and I’ve often been grateful for the feedback it provided my teams. So for this step you’re going to use it, via the MCP server, to analyse the code your agent has produced.

To get started, see Sonar’s instructions on how to install their MCP server. I found the quickstart guide particularly useful and chose to run the container in Docker. I then configured my agent to use SonarQube via the MCP server.
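For reference, an MCP server entry in an agent’s configuration typically looks something like the sketch below. Treat this as illustrative only — the exact image name, environment variables, and config file location depend on your agent and on Sonar’s current documentation, so follow their quickstart rather than copying this verbatim:

```json
{
  "mcpServers": {
    "sonarqube": {
      "command": "docker",
      "args": [
        "run", "-i", "--rm",
        "-e", "SONARQUBE_TOKEN",
        "-e", "SONARQUBE_URL",
        "mcp/sonarqube"
      ],
      "env": {
        "SONARQUBE_TOKEN": "<your token>",
        "SONARQUBE_URL": "http://localhost:9000"
      }
    }
  }
}
```

The key idea is that the agent launches the server as a subprocess (here via Docker) and talks to it over stdio, which is why the agent later reports the command as docker.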

Once you have it installed, check with your agent to verify the MCP server is configured. For example, with Copilot run:

/mcp show        

You should then see some output that includes something like this:

Configured MCP servers: sonarqube

MCP Server Configuration:
  sonarqube (local): Command: docker        

Once you have that done, ask your agent to review the project with SonarQube, for example:

Review this code with SonarQube        

I tried it on the Monkey compiler project I did last year and got the following output:

Total Issues Found: 40

  Issue Breakdown by Severity:

   - CRITICAL: 37 issues
   - MAJOR: 1 issue
   - MINOR: 2 issues

  Key Issues Identified:

  1. High Cognitive Complexity (6 issues):

   - compiler/compiler.go:72 - Complexity: 96 (allowed: 15)
   - vm/vm.go:72 - Complexity: 69 (allowed: 15)
   - vm/vm_test.go:66 - Complexity: 34 (allowed: 15)
   - evaluator/evaluator.go:15 - Complexity: 25 (allowed: 15)
   - compiler/symbol_table_test.go:166 - Complexity: 16 (allowed: 15)
   - compiler/compiler_test.go:649 - Complexity: 16 (allowed: 15)

  2. Duplicated String Literals (14 issues): Various test files contain repeated error message strings that should be defined as
  constants.

  3. Empty Functions (13 issues): Multiple empty functions in ast/ast.go lack explanatory comments about why they're intentionally
  empty.

  4. Other Issues:

   - Empty code block in compiler/compiler_test.go:661
   - Interface naming convention in object/object.go:72
   - Parameter grouping in compiler/compiler.go:353

  The most critical issues are the high cognitive complexity in the compiler and VM modules, which could impact maintainability.        

You can then use the agent to explore the issues in more depth.

Step 4

In this step your goal is to pick out three of the points that SonarQube highlights and investigate them. Review the code yourself and either fix it, or prompt your agent to fix the code, including the output from SonarQube in your prompt.

I did this by prompting the agent for the highest priority issues:

Show me the three highest priority issues        

Which gave me:

Top 3 Highest Priority Issues

  1. compiler/compiler.go:72 - Cognitive Complexity: 96 ⚠️ CRITICAL

  Function: Compile(node ast.Node)

  Problem: This massive switch statement handles all AST node compilation with deeply nested conditionals and error handling. The complexity is 6.4x over the allowed limit (96
  vs 15).

  Why it matters: This is the core compiler function - extremely difficult to understand, test, and modify. Each new language feature adds more nesting.

  Recommendation: Refactor using the strategy pattern - extract each case into separate compilation methods:

   func (c *Compiler) compileProgram(node *ast.Program) error { ... }
   func (c *Compiler) compileInfixExpression(node *ast.InfixExpression) error { ... }
   func (c *Compiler) compileIfExpression(node *ast.IfExpression) error { ... }

  ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------

  2. vm/vm.go:72 - Cognitive Complexity: 69 ⚠️ CRITICAL

  Function: Run()

  Problem: Main VM execution loop with a giant switch statement for all opcodes. Complexity is 4.6x over the limit (69 vs 15).

  Why it matters: This is the VM's heart - any bug here affects runtime execution. The nested switch cases, error checks, and instruction pointer manipulation make it hard to
  reason about.

  Recommendation: Extract opcode handlers into separate methods:

   func (vm *VM) handleOpConstant(ins code.Instructions, ip int) error { ... }
   func (vm *VM) handleOpJump(ins code.Instructions, ip int) error { ... }

  ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------

  3. vm/vm_test.go:66 - Cognitive Complexity: 34 ⚠️ CRITICAL

  Function: Test helper function

  Problem: Test code with complexity 2.3x over the limit (34 vs 15), likely due to nested loops validating complex test cases.

  Why it matters: Complex test code is harder to maintain and can hide bugs in the tests themselves.

  Recommendation: Break into smaller helper functions for specific validation types (integers, booleans, arrays, etc.).

Impact: These three functions represent the core of your interpreter - the compiler, VM, and tests. High complexity makes maintenance risky and onboarding difficult.        

Look at the issues you find and either fix them by hand or have the agent fix them. Then use the agent and SonarQube to review the fixes; it’s a great way to improve your code quality.
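To make the suggested refactoring concrete, here’s a sketch of the approach the agent recommended for the Compile function: replace one giant, deeply nested switch with a flat dispatch to small, individually testable methods. The node types and "instructions" below are invented for illustration — the real Monkey compiler’s AST and bytecode differ — but the shape of the refactor is the same:

```go
package main

import "fmt"

// Hypothetical stand-ins for AST nodes; the real Monkey compiler's
// types and opcodes differ — this only illustrates the shape.
type Node interface{ nodeName() string }

type IntegerLiteral struct{ Value int }
type InfixExpression struct {
	Left, Right Node
	Operator    string
}

func (IntegerLiteral) nodeName() string  { return "IntegerLiteral" }
func (InfixExpression) nodeName() string { return "InfixExpression" }

type Compiler struct{ instructions []string }

// Compile is reduced to a flat type switch that delegates to small
// methods, instead of holding all the logic in one giant body.
func (c *Compiler) Compile(node Node) error {
	switch n := node.(type) {
	case *IntegerLiteral:
		return c.compileIntegerLiteral(n)
	case *InfixExpression:
		return c.compileInfixExpression(n)
	default:
		return fmt.Errorf("unknown node type: %s", node.nodeName())
	}
}

func (c *Compiler) compileIntegerLiteral(n *IntegerLiteral) error {
	c.instructions = append(c.instructions, fmt.Sprintf("OpConstant %d", n.Value))
	return nil
}

func (c *Compiler) compileInfixExpression(n *InfixExpression) error {
	if err := c.Compile(n.Left); err != nil {
		return err
	}
	if err := c.Compile(n.Right); err != nil {
		return err
	}
	c.instructions = append(c.instructions, "Op"+n.Operator)
	return nil
}

// demo compiles the expression 1 + 2 and returns the emitted instructions.
func demo() []string {
	c := &Compiler{}
	expr := &InfixExpression{
		Left:     &IntegerLiteral{Value: 1},
		Right:    &IntegerLiteral{Value: 2},
		Operator: "Add",
	}
	if err := c.Compile(expr); err != nil {
		panic(err)
	}
	return c.instructions
}

func main() {
	fmt.Println(demo())
}
```

Each compileXxx method now has a cognitive complexity of its own, so SonarQube measures them individually, and adding a new language feature means adding one case line and one small method rather than growing a single monolithic function.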

Step 5

In this step your goal is to reflect on how to use LLMs and how tools like compilers, linters and SonarQube can help you improve the quality of the software you build.

I recommend adding them to your normal agentic coding process. For me, that means incorporating something like this into the prompt for each task:

After reading the specification: 
1. Create a set of tests to verify the implementation behaves correctly. 
2. Then create the code required to meet the specification. 
3. Verify the functionality is correct using the tests.
4. Verify the code lints and passes quality checks with no warnings or errors.        

My AGENTS.md usually defines how to run the linter and quality checks for the project.
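As an illustration, the relevant section of an AGENTS.md for a Go project might look something like this — a hypothetical sketch; yours should name your own project’s actual commands:

```markdown
## Quality checks

Run these before declaring any task complete:

- `go build ./...` — the code must compile with no errors.
- `go test ./...` — all tests must pass.
- `golangci-lint run ./...` — fix every warning; do not suppress rules.

If the SonarQube MCP server is configured, review the changed files with
it and address any CRITICAL and MAJOR issues before finishing.
```

Keeping this in AGENTS.md means you don’t have to repeat the tool invocations in every prompt — the agent picks them up automatically.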

Going Further

Review the LLM Leaderboard that Sonar created to provide transparency into how models build code, not just what they build. By running thousands of AI-generated solutions through SonarQube, they evaluated the models on the metrics that matter to engineering leaders: security, reliability, maintainability, and complexity.

To generate the leaderboard, Sonar analysed code quality from leading AI models (GPT-5.2 High, GPT-5.1 High, Gemini 3 Pro, Opus 4.5 Thinking, and Claude Sonnet 4.5).

It was interesting to see that while these models pass functional benchmarks well, they have significant differences in code quality, security, and maintainability.

Higher-performing models tend to generate more verbose and complex code. For example:

  • Opus 4.5 Thinking leads with 83.62% pass rate but generates 639,465 lines of code (more than double the less verbose models).
  • Gemini 3 Pro achieves similar performance (81.72%) with much lower complexity and verbosity.
  • GPT-5.2 High hits 80.66% pass rate but produces the most code (974,379 lines) and shows worse maintainability than GPT-5.1.

I found it particularly interesting to see that Gemini produced only 289k lines. That’s a lot less code to review and maintain!

Many thanks to Sonar for sponsoring this issue of Coding Challenges.

Comments

Atharva Shah (AccuKnox)

Suggesting AI to "build then evaluate and improve" is a smart approach. However, for genuinely complex systems like a Redis clone, the initial AI output often serves as a starting point. The deeper learning and actual value come from fixing the AI's flaws and making it robust enough for real use. Mastering human-guided iteration is how AI truly aids product development.
Nick Woodhead (TranslateTech)

Most teams will feel the productivity boost, then quietly inherit the maintenance bill. The only sustainable pattern is “AI writes, tools decide”: tests first, lint/type checks, then deep static analysis as a hard gate. Sonar’s data makes the tradeoff visible too, higher functional scores often come with higher verbosity and complexity, which is just future review and refactor cost. Curious how you’re thinking about this as a team standard, do you set explicit budgets (max complexity, duplication, security hotspots) the agent must hit before code is allowed to merge?

John Walters (FiscalNote)

That Sonar leaderboard is very interesting!

AUROBINDA MONDAL

Totally get it. Building apps > theoretical coding. Real-world experience unlocks skills.

