The senior engineer's job has shifted. The split used to be roughly 60% writing code, 30% reviewing it, 10% direction and design. The split now runs closer to 80% review and 20% direction. The agent writes the bulk of the new code, and the engineer reads it before it ships. That shift sounds easy until you sit through the first month of it. The review has to be sharper than it ever was for human code, because the agent will produce diffs that compile, pass tests, and ship bugs that no compiler or runner will catch.
The reason the bar moved is that AI output passes the surface tests too easily. A junior engineer writing a buggy auth check will often produce code that looks broken at a glance: mismatched types, weird variable names, dead branches. An agent writing the same buggy auth check produces code that looks production-ready. The variable names match the codebase. The types check. The tests it wrote pass. The bug is in the logic of the check itself, and the only person who will spot it is the reviewer who actually reads what the function does instead of trusting that something which compiles is something which works.
This piece is about how to do that reading. It covers what to look for, in what order, with what tools, and how to build the habit so a 200-line diff takes 90 seconds to triage and 5 minutes to deeply review instead of an hour of confused scrolling. The premise is that review is the engineering work that survived the transition intact. Everything else got reshuffled. The engineers who can read AI output fast and find the three things that matter are the ones who ship reliably. The ones who skim and trust are the ones who ship incidents.
What changes when you review AI code instead of human code
Reviewing human code and reviewing AI code feel similar on the surface. Same diff viewer, same comment threads, same merge button. The work underneath is different. Human code review evolved over thirty years to address the failure modes humans actually have, which are not the failure modes models have.
When a person writes a function, the typical bugs are typos, off-by-one errors, forgotten edge cases they did not think about, code that works for the case they tested and breaks for the rest, and design choices that fit their head but not the codebase's existing patterns. Reviewers learn to look for those. The classic checklist runs: does the style match, is the naming clear, is the design appropriate, are tests present, are edge cases handled, is the commit message accurate. Most of the value is in the design conversation. A senior reviewer looking at a junior's PR is teaching as much as policing.
When a model writes a function, the typical bugs are different in kind. Style is rarely a problem because the model has read more of the codebase than the reviewer has, and it tends to mimic local patterns well. Naming is usually fine. Tests are often present, sometimes too many. The design fits the prompt the model was given, not necessarily the design the human intended. The bugs cluster in places humans rarely worry about: the model invented an API method that does not exist, the model handled the case the reviewer asked about but skipped one the reviewer mentioned in passing, the model touched a file that the prompt did not authorize, the model added a dependency that solves the problem at hand and creates two new ones.
The reviewer's eye has to retrain. The questions worth asking change.
For human code: style and readability; naming and clarity; design appropriateness; edge cases the author may not have considered; test coverage; commit message accuracy; mentoring opportunities. The review is partly quality control and partly teaching. Reviewers assume the author understood the task; the bugs are usually in execution, not interpretation.
For AI code: scope adherence (did it touch only what was asked); hallucinations (are imports, methods, and config keys real); logic correctness (does it do what the prompt said, not just what compiles); missing pieces (what did it skip); security regressions (did it remove a check while refactoring). Style is rarely the issue; comprehension and completeness are. Reviewers cannot assume the model understood; the bugs cluster in interpretation.
The other shift is volume. A reviewer who used to see 5 PRs a day now sees 25, because the human is the bottleneck and everyone is sending AI-generated diffs to that bottleneck. The review has to be faster per item without becoming shallower. That is the whole problem. You cannot give every diff the same hour-long attention you used to. You have to triage hard, spot the dangerous diffs in the first 30 seconds, and apply real depth only to the changes that warrant it.
What follows is a way to do that. It is not a checklist you run through end to end on every PR. It is a set of reading habits that let you see the dangerous parts of a diff fast and skip the safe parts faster.
One last framing point before the substance. The relationship between the human reviewer and the AI author is not adversarial; it is collaborative with a clear division of responsibility. The agent has more knowledge of syntax, libraries, and local style than the reviewer does. The reviewer has more knowledge of the system as a whole, the team's history with similar features, and the things the prompt left implicit. Each compensates for the other's blind spots. A reviewer who treats the agent as an unreliable junior to be policed will miss the help it actually offers; a reviewer who treats the agent as a trusted senior will ship its mistakes. The middle posture, treating the agent as a fast pair-programmer whose work needs verification, is the one that produces good code.
Reading for correctness
The first pass on any AI-generated diff is correctness. Not "does it pass tests" correctness, which the CI will tell you in a minute, but "does this code do what the prompt actually asked for" correctness, which only the reviewer can check.
The framing question is simple: what was the prompt, and does this diff fulfill it. If the prompt said "add a rate limiter to the login endpoint that allows 5 attempts per IP per minute," the reviewer's job is to confirm that the diff does exactly that. Not 5 attempts per session, which is different. Not 5 attempts globally, which is different again. Not 5 attempts per IP per hour because the model misread the unit. The reviewer matches the prompt against the implementation and looks for drift.
The second question is whether the obvious edge cases are handled. Obvious here means the cases any working engineer would think of in the first 60 seconds. For a rate limiter that means: what happens at the exact 6th attempt, what happens when the clock rolls over, what happens with concurrent requests at the boundary, what happens when the IP is missing or spoofed. The model may have handled all of these. It often handles some and skips others. The reviewer's job is to spot the skip.
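To make the matching concrete, here is a minimal sketch of the rate limiter that prompt describes, with the boundary cases called out. This is an illustration, not the implementation under review: the in-process store is a stand-in for whatever the codebase actually uses (Redis or similar in production), and allow_login_attempt is a hypothetical name.

```python
import time
from collections import defaultdict

# Hypothetical in-process store; production would use Redis or similar
# so the limit holds across processes and restarts.
_attempts: dict[str, list[float]] = defaultdict(list)

WINDOW_SECONDS = 60   # per minute, as the prompt specified
MAX_ATTEMPTS = 5      # per IP, as the prompt specified

def allow_login_attempt(ip: str | None, now: float | None = None) -> bool:
    """Return True if this attempt is within the limit: 5 per IP per minute."""
    if not ip:
        # Missing or unparseable IP: fail closed rather than skipping the limit.
        return False
    now = time.monotonic() if now is None else now
    window_start = now - WINDOW_SECONDS
    # Drop attempts that have aged out of the sliding window.
    _attempts[ip] = [t for t in _attempts[ip] if t > window_start]
    if len(_attempts[ip]) >= MAX_ATTEMPTS:
        # The exact 6th attempt inside the window is rejected.
        return False
    _attempts[ip].append(now)
    return True
```

The review then reduces to checking the diff against exactly these choices: per IP, per minute, reject at the 6th attempt, fail closed on a missing IP. Note that this sketch still races under concurrent requests at the boundary, which is precisely the kind of gap worth flagging.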
The third question is whether the diff stayed in scope. AI agents have a tendency to fix things they were not asked to fix. They see a slightly off naming convention three lines away from where they are working and they "improve" it. They notice an unused import and remove it. They reformat a block to match what they consider better. Each individual change is harmless. The aggregate is a diff that touches 14 files when the prompt was about 1, and the reviewer now has to check 14 files instead of 1. The fastest way to catch scope creep is to look at the changed-files list before reading any code. If the count is more than the prompt implied, that is a flag.
1. Re-read the prompt. Before opening the diff, re-read what was asked. Hold the spec in your head as you read the code. Without this anchor you cannot tell scope drift from intentional work.
2. Check the diff stats. How many files changed, how many lines, and does the count match what the prompt implied. A one-line bug fix that touched 9 files is almost always wrong.
3. Skim the structure. What new functions, what changed signatures, what new modules. You are looking at shape, not detail. This 30-second pass tells you whether the architecture matches your mental model.
4. Read the logic. Now go line by line in the parts that matter. Compare the logic to what the prompt asked for. Note any branches that look suspicious, any conditions that read backward, any constants that look made up.
5. Walk the edge cases. Run a few cases through your head: empty input, boundary input, concurrent input, malformed input. Were they handled or ignored.
6. Run it. Pull the branch, run the tests, run the actual code path with a real request. Most subtle bugs reveal themselves within 5 minutes of running the thing.
The reading order matters. Start high, go low. Structure first, logic second, syntax last. The reverse, syntax-first reading, is what most reviewers learned to do for human code, where syntax-level mistakes were the common bug. For AI code the syntax is almost always clean. The structural and logical mistakes are the ones that ship incidents.
The other habit is to read the test file alongside the implementation. Models often write tests that pass for the wrong reasons. A test that calls the function with input 5 and asserts the output is 10 may be passing because the function correctly multiplies by 2, or because the function happens to return 10 for that one input through some other path. Reading the test in isolation cannot tell you which. Reading the implementation alongside the test, and asking whether the test would catch a regression, can.
A practical way to pressure-test correctness without spending the full hour is the "perturbation question." Pick the most important branch in the new function. Imagine breaking it in the dumbest possible way: invert the condition, off-by-one the loop, swap two variables. Would any test in the diff fail. If yes, the test is real and the function is probably right. If no, the test is decorative and you have no evidence the function works. The perturbation question takes 15 seconds and reveals more than any number of green checkmarks.
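Here is the perturbation question in miniature, with hypothetical names. The first test is decorative: invert the condition in apply_discount and it still passes. The second is real: any of the dumb breakages makes it fail.

```python
def apply_discount(price: float, is_member: bool) -> float:
    """Members get 10% off; everyone else pays full price."""
    return price * 0.9 if is_member else price

def test_decorative():
    # Passes even if the condition is inverted, because 0.0 * anything == 0.0.
    assert apply_discount(0.0, is_member=True) == 0.0

def test_real():
    # Fails if the condition is inverted, the multiplier drifts,
    # or the branches are swapped.
    assert apply_discount(100.0, is_member=True) == 90.0
    assert apply_discount(100.0, is_member=False) == 100.0
```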
Beyond perturbation, there is the contract question. Most functions have an implicit contract: given inputs of these types, return an output of this type, with these guarantees, having these side effects. The reviewer asks "what is this function's contract" and then "does the body honor it." Models can produce code that does the work but violates the contract in subtle ways: returns null when the contract said it would never return null, mutates an input the caller assumed was immutable, performs a side effect that the function name does not imply. These are the bugs that work for the immediate caller and break for the next one.
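A small illustration of a contract break, with hypothetical names: the function returns the right answer for its first caller while silently mutating the input the next caller depends on.

```python
def top_scores(scores: list[int], n: int) -> list[int]:
    """Return the n highest scores. Implicit contract: do not mutate the input."""
    scores.sort(reverse=True)   # contract violation: reorders the caller's list
    return scores[:n]

def top_scores_honest(scores: list[int], n: int) -> list[int]:
    """Same result, no side effect."""
    return sorted(scores, reverse=True)[:n]

leaderboard = [70, 95, 80]
top_scores(leaderboard, 2)
# leaderboard is now [95, 80, 70]; any caller relying on insertion order breaks.
```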
Reading for security
Security is where AI output goes wrong in subtle and dangerous ways. The model is trained on a vast amount of code, much of which has security issues. It has also read a vast amount of secure code. Whether it produces secure or insecure code on a given task depends on the prompt, the surrounding context, and a fair amount of luck. The reviewer cannot trust luck.
The most common security failure modes in AI-generated code cluster in a few areas. Auth flows where a check got removed during a refactor. Input handling where the model used string concatenation for SQL instead of parameterization. Secret handling where a key got inlined because the model did not know about the project's environment variable convention. File system access where a user-supplied path got passed to a file API without validation. Dependency choices where the model picked a package that has known vulnerabilities or is unmaintained.
Auth flows deserve the most attention. The pattern to watch is a refactor that touches an authenticated endpoint. The model takes the function, restructures it, moves logic around, and at some point the auth check gets reordered or omitted. The bug is invisible if you read only the diff, because the diff shows the function as it is now, not as it relates to the rest of the system. The fix is to read the function in full after the refactor and ask "where is the auth check, what does it check, and is it still required for this path." If the answer is "I cannot find the check," that is the bug.
Input validation is the second hot spot. SQL injection in 2026 is rare in greenfield code because most ORMs handle it. It still appears in older codebases when the model is asked to write a query that the existing helper does not support, and it falls back to string concatenation. XSS is similarly rare in modern frontend frameworks but appears whenever the model uses dangerouslySetInnerHTML or a raw template injection because it could not figure out a cleaner way. CSRF appears whenever an endpoint is added without going through the project's existing auth middleware. Each of these is easy to spot once you know the pattern. The reviewer asks "where does user input enter this code path, and what happens to it before it reaches anything dangerous."
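The concatenation fallback and its fix, reduced to a runnable miniature with Python's sqlite3 and a hypothetical users table:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER, email TEXT)")

user_input = "x' OR '1'='1"

# Injectable: user input is concatenated straight into the SQL string.
query = "SELECT id FROM users WHERE email = '" + user_input + "'"
conn.execute(query)  # the OR '1'='1' clause now matches every row

# Parameterized: the driver passes the value separately from the statement.
conn.execute("SELECT id FROM users WHERE email = ?", (user_input,))
```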
Secrets handling has its own failure mode. The model, given a prompt like "add Stripe integration," will sometimes inline an example API key from training data. The key is fake, but the pattern of inlining is real, and once the diff is merged the next refactor may not fix it. Some models also write keys into config files that get checked in. The reviewer's habit is to grep the diff for anything that looks like a key or secret. Tools help here: tools like git-secrets, gitleaks, and TruffleHog will catch obvious patterns automatically. The reviewer still has to look for the non-obvious cases, like a hardcoded webhook URL that includes an auth token in the path.
Path traversal and file system access show up in any code that accepts a user-supplied filename or path. The pattern is fs.readFile(userInput) with no validation. The fix is path normalization plus a check that the result lives inside an allowlist directory. Models often handle this when prompted explicitly and skip it when not.
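A minimal sketch of normalize-plus-allowlist in Python, assuming a hypothetical uploads directory as the allowed root:

```python
from pathlib import Path

UPLOAD_DIR = Path("/srv/app/uploads").resolve()  # hypothetical allowlist root

def safe_read(filename: str) -> bytes:
    # Normalize first: resolve() collapses ../ segments and symlinks.
    target = (UPLOAD_DIR / filename).resolve()
    # Then check containment: reject anything that escaped the root.
    if not target.is_relative_to(UPLOAD_DIR):
        raise ValueError(f"path escapes upload directory: {filename!r}")
    return target.read_bytes()

# safe_read("report.pdf")        -> allowed
# safe_read("../../etc/passwd")  -> raises ValueError
```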
Dependencies are a separate category. When the model adds a package to package.json or requirements.txt, the reviewer should ask: is this package real, is it maintained, is it the right size for the job, is it a known compromised package. Tools handle the first three quickly: npm audit, pip-audit, and Snyk will flag known vulnerabilities. GitHub's dependency scanner does the same on push. Semgrep can be configured to flag the addition of any package not on an allowlist. None of this catches the case where the package is fine but unnecessary, which is the design issue and is the reviewer's job.
For deeper static analysis, Semgrep is the best general tool because it lets you write custom rules as plain syntax patterns. CodeQL is more powerful but harder to write rules for. Snyk is the strongest commercial option for dependency scanning. GitHub Advanced Security combines several of these into a single integrated experience if you are already on GitHub. The choice between them is mostly about ergonomics; the bigger choice is to actually wire one of them into CI so it runs on every PR.
One pattern worth highlighting because it surfaces in almost every codebase that uses agents heavily: the silent permission downgrade. A function originally checked if user.is_admin and user.org_id == target.org_id. The agent, refactoring for clarity, splits these into two helpers and somewhere the second condition gets dropped. The function still rejects non-admins, so the unit test passes. It now lets admins of organization A modify resources in organization B, which is exactly the cross-tenant breach you spend the most engineering hours preventing. The reviewer's check is to take any auth predicate and ask "what cases did the original check reject that this new check accepts." The answer should be "none." If the reviewer cannot tell, the refactor needs more thought.
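Reduced to its skeleton, with hypothetical names, the downgrade looks like this:

```python
# Before the refactor: both conditions in one predicate.
def can_modify(user, target) -> bool:
    return user.is_admin and user.org_id == target.org_id

# After the refactor: split into helpers, and the tenancy check fell out.
def _is_admin(user) -> bool:
    return user.is_admin

def can_modify_refactored(user, target) -> bool:
    # Still rejects non-admins, so the unit test passes, but admins of
    # org A can now modify org B's resources: the cross-tenant breach.
    return _is_admin(user)
```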
Secrets handling deserves an extended note because the failure mode goes beyond inlined keys. Models frequently log objects in full when a more careful logger would have stripped sensitive fields. The line logger.info("user authenticated", user=user) looks fine and may be writing the user's password hash or session token to the log file, depending on what the user object holds. The reviewer's habit is to scan all new log calls and ask whether the things being logged could contain secrets. Structured loggers with field-level redaction help; the reviewer still has to check that the new logging code uses them.
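A sketch of the redaction habit, using Python's standard logging and a plain dict standing in for the user object; the field names are hypothetical:

```python
import logging

logger = logging.getLogger("auth")

SENSITIVE_FIELDS = {"password_hash", "session_token", "api_key"}

def redact(obj: dict) -> dict:
    """Strip known-sensitive fields before anything reaches a log line."""
    return {k: "[REDACTED]" if k in SENSITIVE_FIELDS else v
            for k, v in obj.items()}

user = {"id": 42, "email": "a@example.com", "session_token": "tok_abc123"}

logger.info("user authenticated: %s", user)          # leaks the token
logger.info("user authenticated: %s", redact(user))  # what review should insist on
```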
Reading for performance
Performance regressions are the second most common AI-introduced bug after correctness errors, and they are sneakier because they often pass tests. The test runs against a database with 50 rows. The query is fine at 50 rows. In production with 50,000 rows the query is a 4-second N+1 disaster. The CI never tells you because the CI never had 50,000 rows.
The most common performance bug in AI-generated code is the N+1 query. The pattern is a loop that calls a database operation inside it. for user in users: posts = get_posts(user.id) looks innocent, and at test scale it is. With 10 users it makes 11 queries. With 10,000 users it makes 10,001. The fix is a join or a batched WHERE id IN (...) query. Models know the pattern and apply it when prompted, but a default request to "list users with their posts" will produce N+1 code more often than it produces a joined query. The reviewer's habit is to look for any loop that calls a function with database, network, or file access inside it.
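Both shapes side by side, with a hypothetical db.query helper standing in for the ORM:

```python
# N+1: one query for the users, then one query per user.
def users_with_posts_n_plus_1(db):
    users = db.query("SELECT id, name FROM users")
    return [(u, db.query("SELECT * FROM posts WHERE user_id = ?", [u["id"]]))
            for u in users]

# Batched: two queries total, regardless of user count.
def users_with_posts_batched(db):
    users = db.query("SELECT id, name FROM users")
    ids = [u["id"] for u in users]
    placeholders = ",".join("?" * len(ids))
    posts = db.query(f"SELECT * FROM posts WHERE user_id IN ({placeholders})", ids)
    by_user = {}
    for p in posts:
        by_user.setdefault(p["user_id"], []).append(p)
    return [(u, by_user.get(u["id"], [])) for u in users]
```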
The second pattern is missing indexes. The model writes a query that filters on a field, and the field is not indexed. The query works, slowly. Test data does not reveal it. The fix is to add an index in a migration. The reviewer's habit is to look at WHERE and JOIN clauses in any new query and ask whether the filtered fields have indexes. If a new column is being added that will be queried, the migration should create the index in the same migration.
The third pattern is synchronous I/O in hot paths. The model writes a request handler that calls a slow external API synchronously, blocking the response. The fix is to make it async, or queue the work, or cache the result. The reviewer's habit is to look at any new request handler and ask "what does this block on, and is the block tolerable at request time."
The fourth pattern is over-fetching. The model writes SELECT * FROM users when the caller needs only the email. The model fetches all rows when pagination would have worked. The model joins tables that did not need to be joined. The fix is to ask for only what you need. The reviewer's habit is to look at any data access and ask "is this the minimum amount of data this function needs."
The fifth pattern is bundle size bloat in frontend code. The model imports a whole library to use one function. import _ from 'lodash' to use _.debounce adds 70KB to the bundle. import debounce from 'lodash.debounce' adds 2KB. Models are usually fine when prompted to consider bundle size and often forget when not. The reviewer's habit is to scan new import statements and flag anything that pulls in a heavy library for a small purpose.
The hard truth about performance review is that you cannot catch most regressions by reading code. You catch them by running the code against realistic data. CI usually does not have realistic data. The compromise is a small set of performance smoke tests, run nightly, that catch the worst regressions. Plus a habit of reading new database queries and frontend bundles with a skeptical eye.
One trick that pays back consistently: when reviewing a database query the agent introduced, run EXPLAIN on it locally with a copy of the production schema. The plan tells you whether the query is using indexes, doing sequential scans, or fanning out into nested loops. Postgres and MySQL both make this cheap. The agent does not run EXPLAIN by default; the reviewer who does catches inefficient queries before they hit production. For frontend, the equivalent is running the bundle analyzer (webpack-bundle-analyzer, source-map-explorer, or the built-in tools in modern bundlers) on a feature branch and checking whether the bundle grew unexpectedly. Both checks take under 5 minutes per PR and catch most of the performance regressions that would otherwise ship.
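Running the check from Python, assuming psycopg2, a local database with a production-like schema, and a hypothetical orders query; the same EXPLAIN statement works verbatim in psql:

```python
import psycopg2

conn = psycopg2.connect("dbname=app_copy")  # a copy with production-like schema
cur = conn.cursor()

# Prefix the agent's query with EXPLAIN (or EXPLAIN ANALYZE to actually run it).
cur.execute("EXPLAIN SELECT * FROM orders WHERE customer_email = %s",
            ("a@example.com",))
for (line,) in cur.fetchall():
    print(line)
# "Seq Scan on orders" in the plan means the filter column has no usable
# index; "Index Scan using ..." means it does.
```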
The other underrated performance category is React rendering. Models adding a new component often miss the memoization story: they pass new objects or new functions as props on every render, which busts memoization downstream and produces invisible re-render storms. The diff looks fine because everything still works; the regression shows up only when the page is interacted with at scale. Tools like React Devtools' Profiler, the why-did-you-render library, or the React Compiler can catch these. Without tools, the reviewer's habit is to scan new component code for inline objects and inline functions in props and ask "is this component memoized, and is the memoization actually working."
Reading for style and consistency
Style sounds like the lowest-stakes review category, and on a one-off PR it is. On a codebase that will be touched by AI agents for the next five years, style is one of the highest-stakes categories, because every inconsistency the agent introduces makes the next agent session slower.
The reason is that agents read the codebase to figure out what to do. If the codebase has three different ways of handling errors, the agent picks one of them roughly at random for the next change, sometimes one and sometimes another, and over time the inconsistency multiplies. Same for naming. Same for logging. Same for the shape of API responses. The codebase that started with three patterns ends up with eight.
The reviewer's job in style review is therefore not "is this elegant" but "does this match the existing pattern." The question is whether the new code reads as if it was written by the same hand as the surrounding code. The answer should be yes by default. If the answer is no, the reviewer pushes back, the model rewrites it to match, and the codebase stays coherent.
The patterns to watch are the ones that compound. Error handling is the biggest. If the codebase throws custom error classes and the new code returns error objects, that is a problem because the next caller has to know which to expect. If the codebase logs errors with a structured logger and the new code uses console.log, that is a problem because the logging dashboard will miss the new errors. If the codebase uses a specific HTTP response shape and the new code invents its own, every consumer of the API has to handle two shapes.
Naming is the next one. If functions in the codebase follow verbNoun ordering and the new function is named userGet instead of getUser, that is a small thing, but the next agent reading the code will pick the new style as much as the old, and the inconsistency spreads. Same for variable casing, file naming, and module structure.
The mechanical part of consistency is handled by Prettier, ESLint, Biome, mypy, and equivalents. Format and surface lint should not be a manual review concern; the linter catches it. The judgment part of consistency, which is whether the structural pattern matches, is what the reviewer reads for. A pre-commit hook plus a CI step running the linter takes care of 80% of style issues without anyone looking. The remaining 20% is what review is for.
The consistency tax compounds. Every PR that introduces a new pattern adds a few minutes to every future PR that has to deal with both patterns. Reviewers who push back on inconsistency are doing maintenance work for the entire team's future. Reviewers who let it slide because "it works" are paying the tax forever.
Reading for what's missing
The hardest review skill for AI output is noticing what the agent did not do. Diffs show what changed. They do not show what should have changed and did not. The reviewer who only reads the diff will miss a category of bug that the prompt asked for and the implementation skipped.
The pattern is consistent. The prompt has 4 things in it. The implementation handles 3 of them well and forgets the 4th. The 3 that were handled show in the diff, look correct, pass tests. The 4th, which never got written, leaves no trace. The reviewer reading the diff sees correct code and approves. The bug ships.
The detection is to read the prompt with a checklist mindset. Make a mental list of what was asked. Then check each item against the diff. Anything on the list that does not appear in the diff is suspicious. Either the agent decided it was unnecessary, in which case there should be a comment explaining why, or the agent forgot, in which case the reviewer's job is to flag it.
The categories of common omissions:
Error handling. The happy path is implemented; the failure paths are not. The function works when the network call succeeds and crashes when it fails. The reviewer's check is to ask "what happens when this network call returns an error" for every external call in the diff (a concrete sketch follows this list).
Tests. The agent wrote some tests; it did not write the test that would have caught the bug. Tests for happy paths are easy to generate; tests for edge cases require thinking about edge cases. The reviewer's check is to ask "if I broke this function in the obvious way, would the tests catch it." If no, more tests.
Edge cases the prompt implied but did not state. The prompt said "rate limit by IP." The reviewer mentioned in passing that some users come through proxies. The agent did not implement proxy header parsing. That is a missing piece, and only the reviewer who remembers the conversation will catch it.
Cleanup of temporary scaffolding. The agent left in debug logs, hardcoded test values, commented-out alternative implementations, or TODO comments that mark work that should have been finished in this PR. The reviewer's check is to grep the diff for TODO, console.log, print, and FIXME.
Documentation updates. The function signature changed and the docstring did not. The README mentions an option that no longer exists. The OpenAPI schema is out of sync with the new endpoint. These do not break anything immediately but rot over time.
Migration scripts for schema changes. The agent added a new column to the model and forgot the migration file. The local dev environment, where the agent ran tests, has the column because the agent created it manually. Production does not.
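The error-handling omission from the first category above, in miniature; the endpoint is hypothetical, and the second version is what the prompt implied:

```python
import requests

# What the agent ships: the happy path only.
def fetch_profile(user_id: int) -> dict:
    resp = requests.get(f"https://api.example.com/users/{user_id}")
    return resp.json()  # crashes on timeouts, 5xx responses, and non-JSON bodies

# What the prompt implied: the failure paths too.
def fetch_profile_handled(user_id: int) -> dict | None:
    try:
        resp = requests.get(f"https://api.example.com/users/{user_id}", timeout=5)
        resp.raise_for_status()
        return resp.json()
    except (requests.RequestException, ValueError):
        # Network failure, HTTP error, or malformed body: report, don't crash.
        return None
```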
A diff that is safe to merge: implements all 4 items from the prompt; includes error handling for both paths the function calls; tests cover the happy path plus 2 failure modes; migration file present; no debug logs or TODO markers; documentation in the public API matches the new signature; the diff stayed in scope, and only the files that needed to change actually changed.
A diff that looks safe and is not: implements 3 of 4 items, with the 4th acknowledged in the commit message and silently dropped; error handling on the primary path while the secondary path crashes; tests that cover the happy path while the first realistic failure mode goes uncaught; a schema change with no migration; two console.log calls left in; a README that still describes the old API. It looks clean in the viewer and ships an incident.
The reviewer's habit for catching missing pieces is to read with a list, not just with eyes. Hold the spec in your head and tick items off as you find them. Anything unchecked at the end is a thing to ask about.
Reading for hallucinations
Hallucination is the AI-specific failure mode that has no analog in human code review. A human writing a function may make a typo or pick the wrong API method out of laziness, but they rarely invent an API method that does not exist. A model can do it casually. The model has read enough Python to know what Python code looks like, and it can produce code that looks like Python and uses methods Python does not have, and the code is syntactically clean and the bug only surfaces when you run it.
The most common hallucinations:
Made-up imports. The function uses from somelib import some_thing and some_thing does not exist in somelib. The import fails at runtime. CI catches this immediately if there is a CI; if there is no CI it ships and breaks on first execution.
Made-up API methods. The function calls obj.some_method() and obj does not have that method. Often the model is mixing up two libraries: it remembers the method from one library and is using it on an object from another. The check is to grep the diff for any method call on an external library and verify the method exists.
Made-up config keys. The code reads config.get('SOME_VAR') and SOME_VAR is never set in the config files. Or the code reads an environment variable that the deployment scripts do not provide. The check is to grep for any new env var or config key and verify it is plumbed through.
Made-up file paths. The code references ./templates/email.html and the file does not exist. The model assumed a conventional file structure that this codebase does not use. The check is to verify any new file path actually points to a real file or to a file the same diff creates.
Made-up error codes or status codes. The function returns HTTP 418 to indicate "rate limited," which is wrong; the correct code is 429. Or the function throws an error class that the surrounding code does not catch.
Made-up version numbers. The dependency entry says "some-package": "^4.2.1" and version 4.2.1 of that package does not exist. The install fails. This is a particular risk when models are slightly out of date and confidently quote versions that have been deprecated or never released.
The single fastest hallucination check. Pull the branch, run the actual code path with a real input. Half of all hallucinations show up as ImportError, AttributeError, or KeyError within the first 30 seconds.
The other half often surfaces under test. If the agent wrote tests that pass, run them. If they fail, the agent missed something. If they pass without exercising the new code path, the test is not testing what it claims to test.
Check the imports. Every import at the top of every changed file: are they real packages, real submodules, real names. A 30-second scan catches most made-up imports, and the check automates well (see the script after this list).
Check the method calls. For any method call on an external object, hover in your IDE or check the docs. If the IDE shows no completion for the method, the method does not exist.
Check the config keys. Grep the rest of the codebase for every new config key or env var the diff introduces. If it appears only in the diff, the diff did not finish; something else has to be updated.
Check the versions. For any new dependency or version bump, verify the version exists. npm view package@version or pip index versions package takes 5 seconds.
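A hedged sketch of the automated import check: parse each changed file with Python's ast module and try to import every top-level module it names. It assumes the changed files are Python and that actually importing those modules is safe in your environment (imports execute module-level code).

```python
import ast
import importlib
import sys

def check_imports(path: str) -> list[str]:
    """Return the imports in `path` that fail to resolve."""
    tree = ast.parse(open(path).read(), filename=path)
    modules = set()
    for node in ast.walk(tree):
        if isinstance(node, ast.Import):
            modules.update(alias.name for alias in node.names)
        elif isinstance(node, ast.ImportFrom) and node.module and node.level == 0:
            modules.add(node.module)  # skip relative imports (level > 0)
    failures = []
    for mod in sorted(modules):
        try:
            importlib.import_module(mod)
        except ImportError as exc:
            failures.append(f"{mod}: {exc}")
    return failures

if __name__ == "__main__":
    for path in sys.argv[1:]:  # e.g. the files from `git diff --name-only`
        for failure in check_imports(path):
            print(f"{path}: {failure}")
```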
The good news about hallucinations is they are mostly cheap to catch once you know the patterns. The bad news is they are the easiest category to skip in a hurried review, because they live in the parts of the diff that look most boring. Imports look the same whether they are real or not. Method calls look the same. Config keys look the same. The reviewer who reads only the "interesting" parts of the diff and trusts the boilerplate will ship hallucinations.
Tools that help
Tools do not replace review. They amplify it. A tool catches the things tools are good at catching, freeing the reviewer to focus on the things that require judgment. The right tool stack for an AI-heavy codebase has more layers than the equivalent stack for human code, because more bugs slip through that need automated detection.
The base layer is formatters and linters. Prettier and Biome handle JavaScript and TypeScript formatting. ESLint handles JavaScript linting; typescript-eslint extends it to TypeScript. For Python, Ruff has largely replaced flake8 and isort because it is faster and combines their functionality. Black handles Python formatting. These tools should run on save in the editor and as a pre-commit hook. They should never be a manual review concern.
The next layer is type checking. TypeScript in strict mode catches a substantial fraction of AI hallucinations because made-up methods and missing fields show as type errors. The strict flags worth enabling are strict: true, noUncheckedIndexedAccess, exactOptionalPropertyTypes, and noImplicitOverride. For Python, mypy or Pyright in strict mode does similar work. mypy is slower but more configurable; Pyright is faster and integrates better with VS Code. Pick one and run it in CI.
The next layer is static analysis for security. Semgrep is the best general tool here because rule writing is straightforward and the existing rule packs cover common issues across many languages. CodeQL is more powerful and is built into GitHub Advanced Security, but writing custom rules is harder. Snyk is the strongest commercial option for dependency scanning specifically. npm audit and pip-audit are the free baseline for dependency vulnerability scanning and should be wired into CI.
The next layer is test coverage. Coverage tools like c8 for Node, coverage.py for Python, and built-in coverage in many test runners give you a number for how much of the code is exercised by tests. The number is a flawed metric: 90% line coverage with bad tests is worse than 60% with good ones. But the trend matters. If a PR drops coverage by 5%, that is a flag worth looking at.
The advanced layer is mutation testing. Tools like Stryker for JavaScript, mutmut for Python, and PIT for Java mutate the source code in small ways and check whether the tests catch the mutations. Tests that pass against unmutated and mutated code are tests that are not actually testing the logic; they are passing for the wrong reason. Mutation testing is slow and not for every CI run, but it is the strongest way to evaluate whether your test suite is real.
For agent-generated code specifically, the layer that matters most is integration testing. Unit tests pass too easily because the agent wrote them around its own implementation. Integration tests, which exercise the actual code paths against real or realistic dependencies, catch a different and more important class of bug. A short suite of integration tests that runs in CI is worth more than a 10x larger suite of unit tests.
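What one such integration test looks like, assuming a Flask app behind a hypothetical create_app factory and the login rate limiter from earlier; pytest drives it:

```python
import pytest
from myapp import create_app  # hypothetical application factory

@pytest.fixture
def client():
    app = create_app(testing=True)
    with app.test_client() as client:
        yield client

def test_login_rate_limit_enforced(client):
    # Exercise the real request path: middleware, limiter, handler.
    for _ in range(5):
        resp = client.post("/login", json={"email": "a@example.com", "password": "x"})
        assert resp.status_code in (200, 401)  # within the limit: processed
    sixth = client.post("/login", json={"email": "a@example.com", "password": "x"})
    assert sixth.status_code == 429  # the 6th attempt in a minute is rejected
```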
The stack, layer by layer:
Formatters and linters. Prettier or Biome plus ESLint or Ruff, configured as a pre-commit hook. Catches surface issues before they reach review.
Type checking. TypeScript strict, mypy strict, or Pyright strict. Catches most hallucinated method calls and missing fields automatically.
Unit tests. The basic CI gate. Should run in under 2 minutes for a healthy codebase; longer than that and developers stop running them locally.
Security scanning. Semgrep or CodeQL with default rule packs, plus npm audit or pip-audit for dependency vulnerabilities. Wire into CI; do not let warnings accumulate.
Integration tests. A smaller suite than the unit tests, exercising real code paths. The single highest-value gate for catching agent-generated bugs.
Performance smoke tests. A short suite of representative queries and request paths run against realistic data. Catches the regressions CI cannot.
Mutation testing. If your codebase has critical logic, run mutation testing on those modules to validate that the tests are real. Not for every PR; for the parts of the code where bugs hurt most.
The point of the toolchain is to absorb the rote part of review so the human can spend their attention on judgment. A reviewer who is reading every diff for missing semicolons is wasting attention. A reviewer who is reading every diff for hallucinated method calls is also wasting attention because the type checker would catch most of them. The reviewer's job is the part the machine cannot do: did the agent understand the prompt, did it cover the cases that matter, did it stay in scope. Everything else should be automated.
The review checklist for AI output
The pattern that works across teams is a short checklist the reviewer runs through on every AI-generated PR. Not exhaustive; not a substitute for reading the code; but a backstop that catches the categories of mistake that come up over and over. The checklist should fit on one screen and take 2 minutes to apply once the reviewer has internalized it.
Below is a 12-item version. It compresses every category covered in this piece into a single pass.
1. Scope. Does the changed-files list match the prompt. If the prompt was about one feature and the diff touches 12 unrelated files, push back before reading any code.
2. Completeness. Read the prompt with a list mindset. Does the diff implement every item. If 4 things were asked and only 3 are in the diff, ask about the 4th.
3. Hallucinations. Read every new import, every external method call, every config key, every env var. Are they all real and plumbed through. Run the code to confirm.
4. Security. Auth checks present where required. User input validated. Secrets in env vars, not inline. New endpoints behind the correct middleware. Dependency choices reasonable.
5. Performance. No N+1 queries in the new code. Indexed fields for new query filters. Async work where blocking would hurt. No bundle bloat from full-library imports.
6. Tests. Present and meaningful. Edge cases covered. Tests would actually fail if the function broke. Coverage trend not regressing significantly.
7. Consistency. Matches the codebase's existing patterns. Error handling shape consistent. Logging consistent. Naming consistent. Linter and formatter happy.
8. Missing pieces. Migration files present for schema changes. Documentation updated for API changes. No leftover debug logs or TODO markers. Scaffolding cleaned up.
9. Dependencies. New dependencies are real, maintained, and appropriately sized. npm audit and equivalents pass. No abandoned or compromised packages.
10. Scope creep. The diff did not "improve" code unrelated to the task. If it did, those changes need their own PR or a clear note.
11. Commit message. Describes what changed and why, not just the file list. Future archaeologists can understand the intent without reading the diff.
12. Run it. Pull the branch. Run the tests. Hit the actual code path with a real request. The single highest-value activity in the whole list.
The order is not strict. In practice the scope check and the run-it step bracket the others. Scope check first, because if the diff is in the wrong shape there is no point reading the rest. Run-it last, because most of the bugs that survive the read fall out the moment the code executes.
The checklist works because it is short enough to remember and broad enough to cover the failure modes. Teams that use it report catching about 90% of agent-introduced bugs before merge. The remaining 10% are the ones that require deeper understanding of the system, the kind of bug a checklist cannot encode.
Closing
Review is the engineering work that survived the AI revolution intact. Coding got reshuffled, design got reshuffled, testing got reshuffled. The agent does most of the writing now. What still belongs to the human is the judgment about whether the writing is correct, and that judgment lives in the review.
The 200-line diff lands in the queue. The agent took 90 seconds to write it. The reviewer has 90 seconds to triage it and another 5 minutes to look at the parts that matter. In those 7 minutes the reviewer asks: did the agent understand the prompt, did it stay in scope, did it cover the cases that count, did it ship a hallucination, did it leave a security hole. The answers to those questions, multiplied by every PR a team merges, determine whether the team ships software or incidents.
The engineer who looks at a diff and finds the three things that matter in 30 seconds is the one whose codebase stays healthy. The engineer who skims and trusts because the tests pass and the linter is green is the one whose codebase rots. Both are equally productive on the surface; they differ in the half-life of what they ship. One ships things that are still working a year later. The other ships incidents that bleed slowly until the on-call rotation becomes unbearable.
The trick to becoming the first kind of reviewer is practice plus pattern recognition. The patterns are in this piece. Scope creep, hallucinations, missing pieces, performance pitfalls, security regressions, style drift. After a hundred PRs you can spot most of them in seconds. After a thousand you stop missing them. The tools amplify the practice; the checklist backstops the practice; the practice itself is what makes a reviewer reliable.
The agent is fast at writing. The human is slow at reading and good at judging. That asymmetry is the new shape of the work. Lean into it. Read carefully, judge sharply, ship clean.
