How to Verify AI-Generated Code Before You Ship

Last quarter I asked a junior developer on a contract team to add a CSV export feature to an internal dashboard. He shipped it in under an hour. It worked beautifully in the demo. Three weeks later, a finance user exported a 90,000-row report and the server fell over because the AI-generated code he'd pasted in loaded the entire dataset into memory before streaming a single byte. Nobody read the code. It looked plausible, so it went out.

That story is now extremely common. GitHub reports that Copilot users accept roughly 30% of its suggestions, and a 2023 Stanford study found developers using AI assistants wrote less secure code while being more confident it was secure. That combination, lower quality and higher confidence, is exactly how bugs reach production. AI doesn't make mistakes that look like mistakes. It makes mistakes that look like working code.

This article is a practical playbook for how to verify AI-generated code before you ship it. We'll cover the specific failure modes that LLMs produce, a step-by-step verification workflow you can run in minutes, a comparison of review approaches, and a worked example with real numbers. The goal is simple: keep the speed AI gives you without inheriting the risk.

Key Takeaways

Treat AI code like a pull request from a stranger who is confident, fast, and occasionally wrong in dangerous ways.

Verify in four layers: read it, test it, scan it, and run it in isolation before it touches production data.

Hallucinated dependencies are real. AI invents package names that attackers then register, a tactic called "slopsquatting."

Security bugs hide in the boring parts: input validation, auth checks, SQL queries, and file handling.

Automate the repetitive checks (linting, SAST, dependency audits) so human review focuses on logic and intent.

Never paste secrets into a prompt and never trust code that references credentials, internal hostnames, or paths you didn't provide.

Why AI-Generated Code Needs Verification at All

Large language models predict the next plausible token. They do not understand your codebase, your threat model, or your data volumes. They produce code that statistically resembles correct code, which is a very different thing from correct code.

Here are the failure modes I see most often when reviewing AI output:

Confident wrongness. The code compiles, passes a happy-path test, and quietly mishandles edge cases like empty arrays, Unicode, timezones, or large inputs.
Hallucinated dependencies. The model imports a package that doesn't exist or suggests an API method that was deprecated two major versions ago.
Outdated patterns. Training data has a cutoff. You'll get code using libraries and idioms that were standard 18 months ago but are now insecure or removed.
Security blind spots. AI happily concatenates user input into SQL strings, skips CSRF tokens, and logs sensitive data because those patterns appear constantly in its training corpus.
Subtle license contamination. Generated code can closely mirror GPL-licensed source, which matters if you ship proprietary software.

None of this means you should stop using AI assistants. It means you need a repeatable verification process, the kind you'd apply to any third-party code. The caution you use when vetting open source software before you install it applies here too, except the "author" is a model that can't be held accountable.

The Four-Layer Verification Workflow

I run every meaningful chunk of AI-generated code through four layers. The whole thing takes a few minutes for small snippets and scales up for larger features. Skip a layer and you're gambling.

Layer 1: Read it line by line

This sounds obvious and it's the step everyone skips. If you can't explain what each line does, you can't ship it. Read with these questions in mind:

Does every imported package actually exist, and is it the one I think it is?
Where does user input enter, and is it validated before use?
Are there any hardcoded credentials, URLs, or file paths I didn't provide?
What happens with empty input, null values, or a payload 1000x larger than the demo?
Are errors handled, or silently swallowed?
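
The first question on that list is easier to answer with an inventory in hand. Below is a minimal sketch, not a real parser: listExternalPackages is a hypothetical helper that uses regular expressions to pull third-party package names out of a snippet so each one can be looked up by hand. It will miss dynamic import() calls and anything built from variables.

```javascript
// Sketch: list the third-party packages a snippet imports so each one can be
// checked by hand. The regexes are a simplification, not a full parser.
const { builtinModules } = require('node:module');

function listExternalPackages(source) {
  const patterns = [
    /\brequire\(\s*['"]([^'"]+)['"]\s*\)/g, // require('pkg')
    /\bfrom\s+['"]([^'"]+)['"]/g,           // import x from 'pkg'
    /\bimport\s+['"]([^'"]+)['"]/g,         // import 'pkg'
  ];
  const found = new Set();
  for (const pattern of patterns) {
    for (const match of source.matchAll(pattern)) {
      const specifier = match[1];
      // Skip relative or absolute paths and explicitly prefixed built-ins.
      if (specifier.startsWith('.') || specifier.startsWith('/')) continue;
      if (specifier.startsWith('node:')) continue;
      // Reduce 'pkg/sub/path' to 'pkg' and '@scope/pkg/sub' to '@scope/pkg'.
      const parts = specifier.split('/');
      const name = specifier.startsWith('@') ? parts.slice(0, 2).join('/') : parts[0];
      // Skip unprefixed Node built-ins such as 'fs' or 'path'.
      if (builtinModules.includes(name)) continue;
      found.add(name);
    }
  }
  return [...found].sort();
}

const snippet = `
const express = require('express');
const fs = require('fs');
import { parse } from 'csv-parse/sync';
import helper from './helper.js';
`;
console.log(listExternalPackages(snippet)); // [ 'csv-parse', 'express' ]
```

The output is only a to-do list. Each name still has to be checked against the registry by a person.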

Layer 2: Test it against edge cases

AI writes code for the example you described. Write tests for the examples you didn't. At minimum, test the empty case, the maximum case, the malformed case, and one realistic production-scale case. The CSV bug I opened with would have surfaced instantly with a single 100,000-row test fixture.
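
Here is a rough sketch of what those tests can look like for a CSV export. Everything in it is an assumption made for illustration: toCsvField and csvLines are hypothetical stand-ins for the export code, not the code from the story. It covers the empty, malformed, and production-scale cases, and uses a generator so the large case never holds every row in memory at once.

```javascript
// Hypothetical CSV encoder used only for this illustration.
function toCsvField(value) {
  if (value === null || value === undefined) return '';
  const text = String(value);
  // Quote fields containing a comma, quote, or line break; double inner quotes.
  return /[",\r\n]/.test(text) ? `"${text.replace(/"/g, '""')}"` : text;
}

// Generator: yields one CSV line at a time instead of building the whole file.
function* csvLines(rows) {
  for (const row of rows) {
    yield row.map(toCsvField).join(',') + '\n';
  }
}

function runEdgeCaseChecks() {
  const assert = require('node:assert');

  // Empty case: no rows, no output.
  assert.deepStrictEqual([...csvLines([])], []);

  // Malformed case: embedded comma, quote, newline, and a missing value.
  assert.strictEqual(
    [...csvLines([['a,b', 'say "hi"', 'two\nlines', null]])][0],
    '"a,b","say ""hi""","two\nlines",\n'
  );

  // Production-scale case: 100,000 rows produced and consumed lazily.
  function* manyRows(count) {
    for (let i = 0; i < count; i++) yield [i, `user${i}@example.com`];
  }
  let lineCount = 0;
  for (const line of csvLines(manyRows(100000))) {
    if (lineCount === 0) assert.strictEqual(line, '0,user0@example.com\n');
    lineCount++;
  }
  assert.strictEqual(lineCount, 100000);
  return lineCount;
}

console.log(runEdgeCaseChecks(), 'rows checked');
```

A row-count check alone will not prove memory stays flat. For that, run the large case against the real export path and watch the process's memory while it streams.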

Layer 3: Scan it automatically

Static analysis catches the boring, repeatable problems so your brain can focus on logic. Run a linter, a static application security testing (SAST) tool, and a dependency audit. For JavaScript that's ESLint, npm audit, and a SAST scanner like Semgrep. For Python, Bandit and pip-audit cover the security scan and the dependency audit. These take seconds and catch a surprising share of the vulnerabilities AI output introduces.
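
To give a feel for what a scanner rule looks for, here is a deliberately crude sketch. It is not how Semgrep or any real SAST tool works internally; those track data flow, while this hypothetical findSqlConcatenation function only pattern-matches single lines where a SQL keyword in a string literal is followed by concatenation or template interpolation. Treat it as an illustration of the idea, not a replacement for a real scanner.

```javascript
// Toy rule: a string literal containing a SQL keyword, followed on the same
// line by `" +` style concatenation or `${...}` interpolation.
const SQL_CONCAT = /["'`][^"'`]*\b(SELECT|INSERT|UPDATE|DELETE)\b[^"'`]*(["']\s*\+|\$\{)/i;

function findSqlConcatenation(source) {
  return source
    .split('\n')
    .map((text, index) => ({ line: index + 1, text: text.trim() }))
    .filter(({ text }) => SQL_CONCAT.test(text));
}

const sample = [
  'db.query("SELECT * FROM orders WHERE user_id = " + req.params.userId, cb);',
  'db.query("SELECT * FROM orders WHERE user_id = ?", [userId], cb);',
].join('\n');

// Only the concatenated query on line 1 is flagged.
console.log(findSqlConcatenation(sample).map((hit) => hit.line)); // [ 1 ]
```

A regex like this produces both false positives and false negatives, which is the reason to run a maintained tool in CI instead of a home-grown pattern.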

Layer 4: Run it in isolation first

Never let unverified code's first real execution be against production. Run it in a container, a sandbox VM, or a disposable environment with no production credentials. On Windows, developers often need to mirror a real directory structure into a sandbox without copying gigabytes of files; a tool like Windows Symlink Creator Pro makes that trivial by linking directories so the sandbox sees the same layout without the duplication.
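
One small piece of that isolation, keeping credentials out of reach, can be sketched in a few lines. The example below is an assumption-laden illustration: runInCleanEnv is a hypothetical helper that starts a child Node process with an allowlisted environment so variables like a database password are not inherited. It does nothing to restrict filesystem or network access, so it complements a container or VM and does not replace one. Some platforms may need more variables on the allowlist for the child process to start.

```javascript
const { spawnSync } = require('node:child_process');

// Environment variables the child is allowed to see. Everything else,
// including any secrets in the parent's environment, is withheld.
const ALLOWED_ENV = ['PATH'];

function runInCleanEnv(scriptSource, timeoutMs = 5000) {
  const env = {};
  for (const name of ALLOWED_ENV) {
    if (process.env[name] !== undefined) env[name] = process.env[name];
  }
  const result = spawnSync(process.execPath, ['-e', scriptSource], {
    env,                // replaces process.env instead of extending it
    timeout: timeoutMs, // stop runaway code
    encoding: 'utf8',
  });
  return { status: result.status, stdout: result.stdout, stderr: result.stderr };
}

// Hypothetical secret in the parent process.
process.env.DB_PASSWORD = 'hypothetical-secret';
const outcome = runInCleanEnv(
  'console.log(process.env.DB_PASSWORD === undefined ? "not visible" : "LEAKED")'
);
console.log(outcome.stdout.trim());
```

If the untrusted code prints "LEAKED" in a setup like this, the allowlist is too generous.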

A Worked Example: Verifying an AI-Written API Endpoint

Let's make this concrete. Say you asked an AI assistant to write a Node.js endpoint that fetches a user's orders by ID. It returns this:

app.get('/orders/:userId', (req, res) => {
  db.query("SELECT * FROM orders WHERE user_id = " + req.params.userId, (e, rows) => res.json(rows));
});

It looks fine. It would pass a demo where you hit /orders/42. Now run the four layers:

Read it. The userId is concatenated directly into the SQL string. That's a classic SQL injection. There's also no authentication check, so any caller can read any user's orders by guessing IDs.
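Test it. Send /orders/1%20OR%201=1 (the URL-encoded form of 1 OR 1=1). The endpoint returns every order in the table. Send /orders/abc and the query fails; the callback never checks its error argument, so the failure goes unhandled and, depending on the driver and error middleware, the raw database error can leak the query in the response.
Scan it. Semgrep flags the string-concatenated query as a tainted-input sink in under a second. npm audit reports no known vulnerabilities in the database driver.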
Isolate it. You run it against a seeded test database, never the real one, so the injection test does no damage.
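
To see why the injection test returns the whole table, it helps to print the SQL the endpoint actually builds. The sketch below reproduces only the string concatenation, with no database involved; buildQueryUnsafe is a hypothetical name for that one expression.

```javascript
// Mirrors the string concatenation in the endpoint above.
function buildQueryUnsafe(userId) {
  return 'SELECT * FROM orders WHERE user_id = ' + userId;
}

// Express URL-decodes the path segment before exposing it as req.params.userId.
const attackPath = '/orders/1%20OR%201=1';
const injectedId = decodeURIComponent(attackPath.split('/')[2]);

console.log(buildQueryUnsafe('42'));
// SELECT * FROM orders WHERE user_id = 42
console.log(buildQueryUnsafe(injectedId));
// SELECT * FROM orders WHERE user_id = 1 OR 1=1
```

Because OR 1=1 is true for every row, the WHERE clause no longer filters anything.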

The fixed version uses a parameterized query (WHERE user_id = ?), validates that userId is numeric, and checks that the authenticated user owns the resource. The AI gave you a starting point. Verification turned it into shippable code. Here's the rough cost-benefit, with real numbers from this exact case:

Stage	Time spent	Bugs caught
Accepting AI output as-is	0 minutes	0 of 4
Reading line by line	3 minutes	2 of 4 (SQLi, missing auth)
Edge-case testing	8 minutes	1 of 4 (error leak)
Automated scanning	2 minutes	0 new (re-confirms SQLi)

Thirteen minutes of verification versus a potential data breach. That's the trade you're making every time you skip review.
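
For reference, the fixed version described above might look roughly like this. It is a sketch under stated assumptions, not the one correct implementation: db.query(sql, params, callback) is assumed to accept ? placeholders the way common MySQL drivers do, req.user is assumed to be populated by authentication middleware that is not shown, and createOrdersHandler and requireAuth are hypothetical names.

```javascript
// Sketch of the fixed handler. Assumptions: `db.query(sql, params, cb)` supports
// `?` placeholders, and `req.user` is set by auth middleware not shown here.
function createOrdersHandler(db) {
  return function getOrders(req, res) {
    const { userId } = req.params;

    // 1. Authenticate: reject callers with no identity.
    if (!req.user) {
      return res.status(401).json({ error: 'authentication required' });
    }
    // 2. Validate: accept only a plain positive integer.
    if (!/^[1-9]\d*$/.test(userId)) {
      return res.status(400).json({ error: 'userId must be a positive integer' });
    }
    // 3. Authorize: callers may read only their own orders.
    if (String(req.user.id) !== userId) {
      return res.status(403).json({ error: 'forbidden' });
    }
    // 4. Parameterized query: the value is passed separately and bound or
    //    escaped by the driver, never spliced into the SQL text.
    db.query('SELECT * FROM orders WHERE user_id = ?', [Number(userId)], (err, rows) => {
      if (err) {
        // Log details server-side; never echo the query or driver error.
        console.error('orders query failed:', err.message);
        return res.status(500).json({ error: 'internal error' });
      }
      res.json(rows);
    });
  };
}

// With Express this would be mounted along the lines of:
// app.get('/orders/:userId', requireAuth, createOrdersHandler(db));
```

Taking db as an argument keeps the handler testable against a stub or a seeded test database, which is what the isolation step calls for.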

Verification Approaches Compared

Not all verification is equal. Here's how the common approaches stack up across the criteria that actually matter when you're deciding how much rigor a given change deserves.

Approach	Catches logic bugs	Catches security flaws	Speed	Best for
Manual code review	Excellent	Good	Slow	Complex business logic
Unit & integration tests	Excellent	Partial	Medium	Regression safety
SAST scanners	Poor	Excellent	Fast	Injection, secrets, unsafe APIs
Dependency audit	None	Excellent	Fast	Hallucinated/vulnerable packages
Sandboxed execution	Good	Good	Medium	Untrusted or destructive code

The lesson: no single approach is enough. SAST is fast but won't catch a business-logic error like charging the wrong customer. Manual review catches logic but humans miss the same injection patterns repeatedly. Layer them.

The Dependency Problem: Slopsquatting and Hallucinated Packages

This deserves its own section because it's the newest and nastiest risk. AI models routinely invent package names. A researcher in 2024 found that a meaningful percentage of AI-suggested packages didn't exist. Attackers noticed. They now register those hallucinated names with malicious payloads, betting that developers will copy-paste the import without checking. The community calls this slopsquatting.

Before installing any AI-suggested dependency, do three checks:

Verify it exists on the official registry and look at its real download count and publish date. A package that appeared last week with 12 downloads is a red flag.
Check the maintainer and repository. Does it link to a real GitHub repo with history, issues, and contributors?
Read the install scripts. Malicious packages often run code on install via lifecycle hooks. Audit before you run.
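
The three checks above can be turned into a rough triage function. This is a sketch with invented inputs: the meta object is a hypothetical, simplified shape you would fill in by hand or from registry data, and the 30-day and 100-download thresholds are arbitrary assumptions, not established cutoffs. The hook names preinstall, install, and postinstall are the npm lifecycle scripts that run during installation.

```javascript
// npm lifecycle scripts that execute automatically when a package is installed.
const INSTALL_HOOKS = ['preinstall', 'install', 'postinstall'];
const MS_PER_DAY = 24 * 60 * 60 * 1000;

// `meta` is a hypothetical shape:
// { publishedAt, weeklyDownloads, repositoryUrl, scripts }
function assessPackage(meta, now = new Date()) {
  const flags = [];
  const ageDays = (now - new Date(meta.publishedAt)) / MS_PER_DAY;
  if (ageDays < 30) flags.push('first published less than 30 days ago');
  if (meta.weeklyDownloads < 100) flags.push('very low download count');
  if (!meta.repositoryUrl) flags.push('no linked source repository');
  for (const hook of INSTALL_HOOKS) {
    if (meta.scripts && meta.scripts[hook]) {
      flags.push(`runs a ${hook} script: ${meta.scripts[hook]}`);
    }
  }
  return flags;
}

// Invented example data for a package that appeared last week.
const suspicious = {
  publishedAt: '2025-01-08T00:00:00Z',
  weeklyDownloads: 12,
  repositoryUrl: '',
  scripts: { postinstall: 'node setup.js' },
};
console.log(assessPackage(suspicious, new Date('2025-01-15T00:00:00Z')));
```

An empty result does not mean a package is safe. It only means none of these particular red flags fired, so the maintainer and source review still apply.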

This mindset extends well beyond code packages. The

Cover image: The Torch Graduate circuit board (bottom) by Chris Whytehead, licensed under BY-SA 3.0 via Openverse.