
It turns out androids do dream, and their dreams are often strange. In the early days of generative AI, we got human hands with eight fingers and pizza recipes that called for glue in the sauce. Now, developers working with AI-assisted coding tools are also finding AI hallucinations in their code.
“AI hallucinations in coding tools occur due to the probabilistic nature of AI models, which generate outputs based on statistical likelihoods rather than deterministic logic,” explains Mithilesh Ramaswamy, a senior engineer at Microsoft. And just like that glue pizza recipe, sometimes these hallucinations escape containment.
AI coding assistants are nearly everywhere, and usage keeps growing: in the May 2024 Stack Overflow developer survey, 62% of respondents said they were using AI coding tools. So how can you prevent AI hallucinations from ruining your code? We asked developers and tech leaders experienced with AI coding assistants for their tips.
How AI hallucinations infect code
Microsoft’s Ramaswamy, who works every day with AI tools, keeps a list of the sorts of AI hallucinations he encounters: “Generated code that doesn’t compile; code that is overly convoluted or inefficient; and functions or algorithms that contradict themselves or produce ambiguous behavior.” Additionally, he says, “AI hallucinations sometimes just make up nonexistent functions” and “generated code may reference documentation, but the described behavior doesn’t match what the code does.”
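The made-up function variety is easy to picture. Here is a deliberately fake illustration in Python (the requests library has no such function, which is the point; the URL is a placeholder):

```python
import requests

# Hallucinated: requests has no get_json() function, so this line
# would raise AttributeError the moment it runs.
# data = requests.get_json("https://api.example.com/users/42")

# What was presumably intended:
data = requests.get("https://api.example.com/users/42", timeout=10).json()
print(data)
```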
Komninos Chatzipapas, founder of HeraHaven.ai, offers a specific example of this kind of problem. “On our JavaScript back-end, we had a function to deduct credit from a user based on their ID,” he says. “The function expected an object containing an ID value as its parameter, but the coding assistant just put the ID as the parameter.” In loosely typed languages like JavaScript, he notes, problems like these are more likely to slip past language parsers. The error “crashed our staging environment,” Chatzipapas says, but was fortunately caught before it reached production.
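Here is a minimal Python sketch of the same shape of mistake (the original was JavaScript, and these names are hypothetical); in a dynamically typed language, nothing complains until the code actually runs:

```python
def deduct_credits(user: dict, amount: int) -> None:
    """Expects an object containing the user's ID, e.g. {"id": "u_123"}."""
    user_id = user["id"]  # raises TypeError if user is a bare string
    print(f"Deducting {amount} credits from user {user_id}")

# The call the function expects:
deduct_credits({"id": "u_123"}, 5)

# What the assistant generated: the raw ID instead of an object.
# This passes every syntax check and only blows up at runtime.
deduct_credits("u_123", 5)  # TypeError: string indices must be integers
```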
How does code like this slip into production? Monojit Banerjee, a lead in the AI platform organization at Salesforce, describes the code output by many AI assistants as “plausible but incorrect or non-functional.” Brett Smith, distinguished software developer at SAS, notes that less experienced developers are especially likely to be misled by the AI tool’s confidence, “leading to flawed code.”
The consequences of flawed AI code can be significant. Security holes and compliance issues are top of mind for many software companies, but some issues are less immediately obvious. Faulty AI-generated code adds to overall technical debt, and it can erode the very efficiency gains code assistants are meant to deliver. “Hallucinated code often leads to inefficient designs or hacks that require rework, increasing long-term maintenance costs,” says Microsoft’s Ramaswamy.
Fortunately, the developers we spoke with had plenty of advice about how to ensure AI-generated code is correct and secure. There were two categories of tips: how to minimize the chance of code hallucinations, and how to catch hallucinations after the fact.
Reducing AI hallucinations in your code
The ideal, of course, would be to never encounter AI hallucinations at all. That’s unlikely with the current state of the art, but the following precautions can help reduce issues in AI-generated code.
Write clear and detailed prompts
The adage “garbage in, garbage out” is as old as computer science—and it applies to LLMs, as well, especially when you’re generating code by prompting rather than using an autocomplete assistant. Many of the experts we spoke to urged developers to get their prompt engineering game on point. “It’s best to ask bounded questions and critically examine the results,” says Andrew Sellers, head of technology strategy at Confluent. “Usage data from these tools suggest that outputs tend to be more accurate for questions with a smaller scope, and most developers will be better at catching errors by frequently examining small blocks of code.”
Ask for references
LLMs like ChatGPT are notorious for making up citations in school papers and legal briefs. But code-specific tools have made great strides in that area. “Many models are supporting citation features,” says Salesforce’s Banerjee. “A developer should ask for citations or API reference wherever possible to minimize hallucinations.”
Make sure your AI tool has trained on the latest software
Most genAI chatbots can’t tell you who won your home team’s baseball game last night, and they have limitations keeping up with software tools and updates as well. “One of the ways you can predict whether a tool will hallucinate or provide biased outputs is by checking its knowledge cut-offs,” says Stoyan Mitov, CEO of Dreamix and co-founder of the Citizens app. “If you plan on using the latest libraries or frameworks that the tool doesn’t know about, the chances that the output will be flawed are high.”
Train your model to do things your way
Travis Rehl, CTO at Innovative Solutions, says what generative AI tools need to work well is “context, context, context.” You need to provide good examples of what you want and how you want it done, he says. “You should tell the LLM to maintain a certain pattern, or remind it to use a consistent method so it doesn’t create something new or different.” If you fail to do so, you can run into a subtle type of hallucination that injects anti-patterns into your code. “Maybe you always make an API call a particular way, but the LLM chooses a different method,” he says. “While technically correct, it did not follow your pattern and thus deviated from what the norm needs to be.”
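As a hypothetical illustration of that kind of drift, suppose the team’s convention is to route every outbound call through a shared wrapper, but the assistant emits a bare call instead. Both work; only one follows the house pattern:

```python
import requests

# House convention (hypothetical): all outbound calls go through one wrapper
# so auth, timeouts, and error handling live in a single place.
def post_json(path: str, payload: dict) -> dict:
    resp = requests.post(
        f"https://internal.example.com{path}",
        json=payload,
        headers={"Authorization": "Bearer <token>"},
        timeout=10,
    )
    resp.raise_for_status()
    return resp.json()

# What the team writes:
post_json("/billing/charge", {"user_id": "u_123", "amount": 5})

# What an assistant might generate instead: technically correct, but it
# bypasses the wrapper and silently drops the shared auth and timeout.
requests.post(
    "https://internal.example.com/billing/charge",
    json={"user_id": "u_123", "amount": 5},
)
```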
A concept that takes this idea to its logical conclusion is retrieval augmented generation, or RAG, in which the model uses one or more designated “sources of truth” that contain code either specific to the user or at least vetted by them. “Grounding compares the AI’s output to reliable data sources, reducing the likelihood of generating false information,” says Mitov. RAG is “one of the most effective grounding methods,” he says. “It improves LLM outputs by utilizing data from external sources, internal codebases, or API references in real time.”
Many available coding assistants already integrate RAG features—the one in Cursor is called @codebase, for instance. If you want to create your own internal codebase for an LLM to draw from, you would need to store it in a vector database; Banerjee points to Chroma as one of the most popular options.
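As a rough sketch of how that looks with Chroma’s Python client (the collection name and snippets are made up, and the exact API may vary by version), you index vetted snippets once, then pull the closest matches into the prompt at generation time:

```python
import chromadb

client = chromadb.Client()  # in-memory for illustration; use a persistent client in practice
collection = client.get_or_create_collection("vetted-internal-code")

# Index a handful of vetted, team-approved snippets (contents are hypothetical).
collection.add(
    ids=["http-wrapper", "retry-helper"],
    documents=[
        "def post_json(path, payload): ...  # canonical outbound HTTP wrapper",
        "def with_retries(fn, attempts=3): ...  # standard retry helper",
    ],
)

# At generation time, retrieve the closest matches and ground the prompt with them.
results = collection.query(query_texts=["How do we make outbound HTTP calls?"], n_results=2)
context = "\n".join(results["documents"][0])
prompt = (
    "Follow these internal conventions:\n"
    f"{context}\n\n"
    "Write a function that charges a user 5 credits."
)
```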
Catching AI hallucinations in your code
Even with all of these protective measures, AI coding assistants will sometimes make mistakes. The good news is that hallucinations are often easier to catch in code than in applications where the LLM is writing plain text. The difference is that code is executable and can be tested. “Coding is not subjective,” as Innovative Solutions’ Rehl points out. “Code simply won’t work when it’s wrong.” Experts offered a few ways to spot mistakes in generated code.
Use AI to evaluate AI-generated code
Believe it or not, AI assistants can evaluate AI-generated code for hallucinations—often to good effect. For instance, Daniel Lynch, CEO of Empathy First Media, suggests “writing supporting documentation on the code so that you can have the AI evaluate the provided code in a new instance and determine if it satisfies the requirements of the intended use case.”
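In practice, that can be as simple as handing a fresh session the requirements document and the generated code and asking for a verdict. A minimal sketch with the OpenAI Python client (the model name and file paths are placeholders):

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

requirements = open("docs/charge_user.md").read()    # the supporting documentation
generated_code = open("src/charge_user.py").read()   # the AI-generated code under review

review = client.chat.completions.create(
    model="gpt-4o",  # placeholder; use whatever model your team has approved
    messages=[
        {"role": "system", "content": "You review code against its written requirements."},
        {
            "role": "user",
            "content": (
                f"Requirements:\n{requirements}\n\nCode:\n{generated_code}\n\n"
                "Does the code satisfy every requirement? "
                "List any mismatches, invented APIs, or unstated assumptions."
            ),
        },
    ],
)
print(review.choices[0].message.content)
```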
HeraHaven’s Chatzipapas suggests AI tools can go even further in judging the output of other models. “Scaling test-time compute deals with the issue where, for the same input, an LLM can generate a variety of responses, all with different levels of quality,” he explains. “There are many ways to make it work but the simplest one is to query the LLM multiple times and then use a smaller ‘verifier’ AI model to pick which answer is better to present to the end user. There are also more sophisticated ways where you can cluster the different answers you get and pick one from the largest cluster (since that one has received more implied ‘votes’).”
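Here is a minimal sketch of the simplest variant Chatzipapas describes: sample the same prompt several times, then let a smaller “verifier” model pick the strongest candidate (the model names and prompts are assumptions, not a prescribed setup):

```python
from openai import OpenAI

client = OpenAI()
TASK = "Write a Python function that validates an ISO 8601 date string."

# 1. Sample several candidates from the main model, with some temperature for variety.
candidates = [
    client.chat.completions.create(
        model="gpt-4o",  # placeholder "generator" model
        messages=[{"role": "user", "content": TASK}],
        temperature=0.8,
    ).choices[0].message.content
    for _ in range(3)
]

# 2. Ask a smaller, cheaper "verifier" model to pick the best one.
numbered = "\n\n".join(f"Candidate {i + 1}:\n{c}" for i, c in enumerate(candidates))
verdict = client.chat.completions.create(
    model="gpt-4o-mini",  # placeholder "verifier" model
    messages=[{
        "role": "user",
        "content": f"Task: {TASK}\n\n{numbered}\n\nWhich candidate solves the task best, and why?",
    }],
)
print(verdict.choices[0].message.content)
```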
Maintain human involvement and expertise
Even with machine assistance, most people we spoke to saw human beings as the last line of defense against AI hallucinations, and they expect human involvement to remain crucial to the coding process for the foreseeable future. “Always use AI as a guide, not a source of truth,” says Microsoft’s Ramaswamy. “Treat AI-generated code as a suggestion, not a replacement for human expertise.”
That expertise shouldn’t be limited to programming in general; you should stay intimately acquainted with the code that powers your applications. “It can sometimes be hard to spot a hallucination if you’re unfamiliar with a codebase,” says Rehl. Hands-on experience with the codebase is critical to spotting deviations in a specific method or in the overall code pattern.
Test and review your code
Fortunately, the tools and techniques most well-run shops use to catch human errors, from IDE tools to unit tests, can also catch AI hallucinations. “Teams should continue doing pull requests and code reviews just as if the code were written by humans,” says Confluent’s Sellers. “It’s tempting for developers to use these tools to automate more in achieving continuous delivery. While laudable, it’s incredibly important for developers to prioritize QA controls when increasing automation.”
“I cannot stress enough the need to use good linting tools and SAST scanners throughout the development cycle,” says SAS’s Smith. “IDE plugins, integration into the CI, and pull requests are the bare minimum to ensure hallucinations do not make it to production.”
“A mature devops pipeline is essential, where each line of code will be unit tested during the development lifecycle,” adds Salesforce’s Banerjee. “The pipeline will only promote the code to staging and production after tests and builds are passed. Moreover, continuous deployment is essential to roll back code as soon as possible to avoid a long tail of any outage.”
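To make that concrete, here is a minimal pytest sketch of the kind of check such a pipeline would run; charge_user is a hypothetical AI-generated function along the lines of the deduct-credits example earlier:

```python
# test_billing.py -- run with: pytest
import pytest
from billing import charge_user  # hypothetical module containing AI-generated code

def test_charge_user_happy_path():
    # Simply exercising generated code surfaces many hallucinations:
    # invented helpers raise ImportError or AttributeError, and wrong
    # return shapes fail the assertion before anything ships.
    result = charge_user({"id": "u_123"}, amount=5)
    assert result["remaining_credits"] >= 0

def test_charge_user_rejects_a_bare_id():
    # Guards against the call-site mistake described earlier: passing a
    # raw ID where an object containing the ID is expected.
    with pytest.raises(TypeError):
        charge_user("u_123", amount=5)
```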
Highlight AI-generated code
Devansh Agarwal, a machine learning engineer at Amazon Web Services, recommends a technique that he calls “a little experiment of mine”: Use the code review UI to call out parts of the codebase that are AI-generated. “I often see hundreds of lines of unit test code being approved without any comments from the reviewer,” he says, “and these unit tests are one of the use cases where I and others often use AI. Once you mark that these are AI-generated, then people take more time in reviewing them.”
This doesn’t just help catch hallucinations, he says. “It’s a great learning opportunity for everyone in the team. Sometimes it does an amazing job and we as humans want to replicate it!”
Keep the developer in the driver’s seat
Generative AI is ultimately a tool, nothing more and nothing less. Like all other tools, it has quirks. While using AI changes some aspects of programming and makes individual programmers more productive, its tendency to hallucinate means that human developers must remain in the driver’s seat for the foreseeable future. “I’m finding that coding will slowly become a QA- and product definition-heavy job,” says Rehl. As a developer, “your goal will be to understand patterns, understand testing methods, and be able to articulate the business goal you want the code to achieve.”