
How to Test GenAI Code

Now that your developers are using GenAI programming tools, it’s time to get your infrastructure and test suite ready. Here are some best practices that are more relevant than ever in the age of LLMs.



According to GitHub’s developer survey, over 92% of developers say they use generative AI tools to help them write software. We’ve all heard plenty about the benefits of generative AI coding assistants, namely writing code faster (and sometimes cleaner), but are we talking enough about the practices needed to accommodate non-human code?

Automation bias and programming

People are more likely to blindly defer to the output of automated systems, assuming it’s correct simply because a human being didn’t produce it. This is known as “automation bias”, and it’s only becoming more relevant in the age of GenAI. Our trust in large language models (LLMs) is no exception. These models are a “black box” to most of us: there’s rarely any logging or observability into how they arrive at a particular output. If token generation were more transparent, it would be obvious how fallible these systems can be, and perhaps our trust in them would be calibrated accordingly.

Automation bias can get the best of us when it comes to GenAI programming tools. Developers are more likely to accept AI code suggestions and revisions at face value. This can compromise your codebase in a few ways:

  • Introducing outdated versions and APIs, including deprecated libraries and functions (see the sketch after this list).
  • Using suboptimal algorithms. Sometimes an LLM’s suggestion isn’t the best way to do things; you, as the developer, have a lot of (unwritten) context that an LLM misses.
  • Solving the wrong problem. Did you prompt the LLM with the full context of what you’re trying to solve? Even then, more obscure tasks are far outside its comfort zone.
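
To make that first bullet concrete: an assistant trained largely on older code can confidently reach for APIs the language has since moved away from. Below is a minimal Python sketch using one well-known case, `datetime.utcnow()`, which is deprecated as of Python 3.12; the surrounding function names are made up for illustration.

```python
from datetime import datetime, timezone

# What an assistant trained on older code might suggest:
# datetime.utcnow() is deprecated as of Python 3.12 and returns a
# naive (timezone-unaware) timestamp.
def created_at_legacy() -> datetime:
    return datetime.utcnow()

# The currently recommended, timezone-aware equivalent.
def created_at() -> datetime:
    return datetime.now(timezone.utc)

if __name__ == "__main__":
    print(created_at_legacy().tzinfo)  # None -- easy to miss in review
    print(created_at().tzinfo)         # UTC
```

Both versions “work”, which is exactly why this kind of suggestion sails through review when everyone assumes the AI knows best.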

Getting the most out of a GenAI coding assistant means understanding its limitations, and second-guessing every output.

Adapting to accommodate GenAI code

When your engineers are using GenAI tools to aid development, they may be writing code faster and at a higher volume. However, this code might not be up to par on quality. There are a few ways to extend your org’s testing practices to account for these new development patterns.

Human-written automated tests

When using a GenAI pair programming tool, automated tests may seem like the most obvious thing to outsource. But orgs that release reliable software tend to task developers with writing their own tests (instead of relying on the QA team). This all comes down to context: nobody understands the functionality of a unit of code better than the person who wrote it. An AI agent doesn’t have the same context on your code’s intended behavior, and this only gets worse as complexity increases. While an agent can often produce tests that cover your code’s happy paths reasonably well, it might miss crucial edge and corner cases, or the “big picture” goals of the change.

Test-driven development is incredibly valuable to a developer’s understanding of their own code. TDD prompts devs to review what they’ve written, and think critically about anything and everything that can go wrong. These thought processes lead not only to better quality tests, but also to a better quality codebase.

In short: yes, you can have an AI agent assist with test generation. However, you, as the developer, should be architecting your own test cases, as both an exercise in test quality and one in understanding your codebase. The AI agent can fill in the syntactical and stylistic gaps, and pick up obvious base cases here and there.
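
As a rough sketch of that division of labor, imagine a hypothetical `parse_duration()` helper that turns strings like "1h30m" into seconds (the module path below is invented for the example). An assistant will happily generate the happy-path assertions; the rejection cases in the second test encode intended behavior that only the author really knows.

```python
import pytest

from myapp.time_utils import parse_duration  # hypothetical module under test


# Happy-path cases: the kind of tests an AI assistant usually covers well.
@pytest.mark.parametrize("text,expected", [
    ("90s", 90),
    ("2m", 120),
    ("1h30m", 5400),
])
def test_parse_duration_happy_path(text, expected):
    assert parse_duration(text) == expected


# Edge and corner cases driven by intended behavior the assistant can't
# infer: these come from the developer who wrote the code.
@pytest.mark.parametrize("text", ["", "   ", "-5m", "90", "h30m", "1h30"])
def test_parse_duration_rejects_malformed_input(text):
    with pytest.raises(ValueError):
        parse_duration(text)
```

With the case list pinned down by a human, the agent can fill in fixtures, parametrization boilerplate, and style.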

Human code review

Code review should rest with human developers, especially when an AI copilot is helping write features. Like architecting tests, this comes down to context: does your LLM actually understand the intent and purpose of the code changes, and their place within the codebase? For GenAI code, reviews must be taken seriously, potentially involving:

  • The author and the reviewer hopping on a call; the author should explain the purpose of the code changes and their expected behavior.
  • The reviewer testing the code changes, either locally or in a cloud environment.
  • Pair programming for more complex features or when the author and reviewer can’t reach a conclusion on the current PR/MR.

Representative test environments

Before deploying, you should always test and preview your code changes in an environment whose infrastructure closely mirrors production. For many orgs, a staging environment serves this purpose. Unfortunately, most orgs don’t have more than a couple of static staging environments, which means many less-critical features either get tested in production or get caught in staging bottlenecks.

When pushing a high volume of features, you’re only able to maintain as much velocity as your test environments allow. Investing in ephemeral testing infrastructure means each feature now gets reviewed and tested in its own isolated environment, reducing the chances of bugs or regressions slipping through.
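
As a minimal sketch of the idea (not how any particular platform, Shipyard included, implements it), a CI job could bring up an isolated copy of the stack per pull request by giving each one its own Docker Compose project name. The PR number and compose file path below are placeholders.

```python
import subprocess


def up_ephemeral_env(pr_number: int, compose_file: str = "docker-compose.yml") -> str:
    """Start an isolated stack for a single PR and return its project name."""
    project = f"pr-{pr_number}"
    subprocess.run(
        ["docker", "compose", "-f", compose_file, "-p", project, "up", "-d", "--build"],
        check=True,
    )
    return project


def down_ephemeral_env(project: str) -> None:
    """Tear the environment down once the PR is merged or closed."""
    subprocess.run(["docker", "compose", "-p", project, "down", "--volumes"], check=True)


if __name__ == "__main__":
    name = up_ephemeral_env(pr_number=1234)  # placeholder PR number
    print(f"Ephemeral environment '{name}' is up; point your E2E suite at it.")
```

Keying the project name to the PR keeps containers, networks, and volumes from colliding, which is what lets features be tested in parallel instead of queuing for one shared staging environment.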

CI/CD practices and pipelines

Your CI/CD pipelines go hand-in-hand with your automated tests and your preview environments. When deploying GenAI code, teams should aim to test and validate individual code changes in isolation, ideally following the trunk-based development pattern. CI/CD pipelines are most useful when you’re able to run them frequently. When you have access to representative test environments and a high-quality test suite, you can run your pipelines against individual commits and PRs/MRs with little to no waiting.

It’s incredibly important that each feature is tested both in isolation and in staging alongside other features, and only deployed when QA has full confidence. As a best practice, orgs can employ canary or blue/green deployments and monitor releases in case a rollback is needed.

Future-proofing your org

Software development practices have shifted quickly with the general availability of numerous LLMs. You can support your developers’ use of GenAI programming tools and still preserve your software’s reliability, but you’ll likely have to take your infrastructure to the next level.

If you’re looking for a way to run your E2E suite against every PR/MR to catch LLM-introduced hallucinations, bugs, and regressions, talk to us. Shipyard can help you test earlier and more often.

Try Shipyard today

Get isolated, full-stack ephemeral environments on every PR.
