Testing AI-Generated Code

The test suite is the contract that lets AI iterate without breaking what already works. Without tests, you are debugging by inspection. With tests, the agent can refactor freely and you find out immediately when it broke something. That sentence sounds like a slogan and it would be a slogan, except the consequences of believing it or not believing it are the difference between shipping reliably and shipping incidents. The teams that have integrated agents into real workflows have one thing in common, and it is not their model choice or their prompt library or their IDE. It is that they have tests, and the tests run, and the agent operates inside the perimeter the tests define.

The shape of the work has changed. The agent makes twelve file changes per turn. The agent rewrites a function and three of its callers in one diff. The agent refactors a module and adds two new dependencies along the way. None of this is anomalous; this is the ordinary flow when you delegate at the granularity that agents actually work at. In the world without tests, every one of those turns is a roll of the dice. You read the diff, you decide whether it looks right, you commit, and you find out two days later that something downstream broke because the agent changed a function signature you did not notice. In the world with tests, the dice stop rolling. The agent commits, the suite runs, the failures surface, and you fix what broke before it ships. The difference is hours saved per week and incidents avoided per quarter.

This piece is about how to build that perimeter. It covers why TDD matters more under AI assistance than it did before, what kinds of tests AI writes well and what kinds it does not, how to use property-based testing to catch the bugs unit tests miss, the snapshot antipattern that will eat your codebase if you let it, the CI loop that turns the test suite into a real safety net, and the post-edit verification command that should be one keystroke away. The premise is that testing was always undervalued and AI assistance has made it priceless. The developers who ship with tests sleep. The developers who ship without tests wait for the user to find the bug.

Why TDD Matters More, Not Less, With AI Assistance

The argument against TDD pre-AI was about cost. Writing the tests first slows you down on the way to working code. The payback comes later, during refactoring, when the tests catch the regressions you would otherwise ship. The trade-off was real and the argument was reasonable. Some teams adopted TDD wholesale, some adopted it for critical paths only, some skipped it and shipped fine for years. The math depended on the project, the team, and how much refactoring the codebase actually got.

The math is different now. The cost of writing tests dropped to near zero, because the agent writes them. You ask for a function and you ask for the tests at the same time, and you get both in one turn. The marginal cost of "and write tests for this" added to a prompt is approximately the same as the cost of not adding it. The benefit, meanwhile, exploded. Refactoring used to be an occasional activity; now an agent does it on every other turn. The agent rewrites a helper, the agent restructures a module, the agent renames a variable across fifteen files. Each of those is a refactor and each of them needs the tests to validate that nothing broke. Without tests, you are reading diffs and hoping. With tests, the validation happens in seconds.

The compounding factor is that the agent's iteration speed makes test-driven workflows actually viable. The classic TDD cycle is red, green, refactor: write a failing test, write the minimum code to pass it, refactor, repeat. The bottleneck in that cycle was always the human typing. With the agent, the bottleneck moves to thinking, which is where you wanted it. You describe the behavior, the agent writes the test, you confirm the test captures the behavior, the agent writes the code, the test passes, the agent refactors, the test still passes, you ship. Each cycle is minutes instead of an hour. The discipline that was theoretically right but practically expensive is now both right and cheap.

TDD without AI assistance

Cost: 30-50% more time on the initial implementation. The tests have to be written by the same human who will write the code, and writing the test requires holding the same model of the system in your head twice. Benefit: regressions caught during the rare refactor. Net for most projects: positive but marginal. Adoption rate: high in some teams, optional in many, skipped in plenty without disaster. The trade-off was contested for thirty years for good reasons.

TDD with AI assistance

Cost: a few extra seconds per prompt to ask for tests alongside code. The agent writes both in one turn. Benefit: regression coverage that catches breakage on every multi-file edit, of which there are now many per session. Net: massively positive. Adoption rate: should be near universal because the cost dropped while the benefit grew. The teams that still skip tests are paying with their nights and weekends.

The other shift is the agent's twelve-file-change reality. A human PR usually touches one to three files. A human can read three files carefully and reason about the change. An agent PR routinely touches eight or twelve or fifteen, because the agent does not feel the cost of breadth the way a human does. Asking the agent to add a feature pulls in changes to the feature module, the model, the migration, the API, the controller, the tests, the type definitions, and the documentation. That is reasonable scope for the feature; it is too much scope to read by hand on every iteration. The test suite is what makes it tractable. You read the diff for design and intent. The tests verify that the eight or twelve files still compose into a working system. Without tests, you cannot read fast enough to keep up with the agent. With tests, you do not have to.

The argument that "I will just be careful" does not survive contact with the volume. The volume is too high and the cognitive load is too sustained. Careful reading was a defensible strategy when you produced one PR per day and had an hour to read each. When you produce ten PRs per day and have minutes per each, careful reading is not a strategy; it is wishful thinking. The discipline that scales is mechanical: the suite runs, the suite catches the breakage, you respond to the failures. Careful reading becomes a layer on top of mechanical verification, not a replacement for it.

The other thing TDD does, which becomes more valuable under AI, is force you to specify behavior before implementation. The agent will produce code that compiles for any prompt. Whether that code does what you intended is a different question. Writing the test first is the cheapest way to pin down what "what you intended" means in code. The test is the spec, in code, runnable. The agent's implementation is graded against it. Without the test, the agent's interpretation of the prompt is the spec, and you find out later whether your interpretation matched. With the test, the spec is explicit before any implementation lands.
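
Concretely, the spec-first test is a few lines. A minimal sketch with Vitest, using a hypothetical formatCurrency function (the name and its cents-based contract are illustrative): the assertions exist before any implementation does, and the agent's code is graded against them.

// A hypothetical spec-first test, written before the implementation exists.
// formatCurrency and its cents-based contract are illustrative assumptions.
import { describe, it, expect } from 'vitest'
import { formatCurrency } from './formatCurrency'

describe('formatCurrency', () => {
  it('formats whole cents as dollars with two decimal places', () => {
    expect(formatCurrency(1500)).toBe('$15.00') // input is cents
  })

  it('formats zero without a stray sign', () => {
    expect(formatCurrency(0)).toBe('$0.00')
  })
})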

The pushback you sometimes hear is "but I do not know what I want to build until I see it." That is fair, and it is a real workflow for exploratory code. The answer is that exploratory code is exempt from TDD by design. You sketch, you throw it away, you sketch again, you find the shape, and then you write tests for the shape you found. TDD applies to code that is about to ship. For prototypes, write the tests after the prototype stabilizes. For production code, write the tests before or alongside. The distinction matters because some teams use "I am exploring" as a permanent excuse to never write tests. That is a different problem, and it is one the agent does not solve for you.

Unit, Integration, End-to-End: The Pyramid Still Applies

The test pyramid predates AI by twenty years and survives the transition unchanged. Many fast unit tests at the base, fewer integration tests in the middle, a small number of end-to-end tests at the top. The reasoning has not moved: unit tests catch regressions cheaply because they run in milliseconds and isolate the behavior being tested. Integration tests catch the bugs that come from real component interactions but cost hundreds of milliseconds each. End-to-end tests catch the bugs that come from the whole system but cost seconds or minutes each and are flakier than the lower layers. The pyramid is the cost-benefit gradient drawn as a triangle.

The AI angle is that the agent is happy to write at any layer of the pyramid. Ask for unit tests and you get unit tests. Ask for integration tests and you get integration tests. Ask for E2E and you get E2E. The risk is that the agent's defaults skew toward whatever layer is most common in its training data, which is unit tests for libraries and end-to-end for product code. Without explicit guidance, you may end up with a suite that is heavy on the layer the agent saw most and light on the layers your project needs. The fix is being explicit about which layer you want at any given moment, and reviewing the resulting suite to confirm the pyramid shape is what you intended.

Unit tests are the base because they pay back fastest. A unit test runs in five to fifty milliseconds, depending on the framework and the test. A suite of a thousand unit tests runs in five to fifty seconds. That speed is what makes the inner-loop verification feasible. The agent makes a change, you run the unit tests, you know within a minute whether the change broke something. Vitest is the modern default for JavaScript projects because of its Vite-based speed; Jest is still common in older codebases and is fine. Bun's built-in test runner is among the fastest in the JavaScript ecosystem if you are already on Bun. For Python, pytest is the unambiguous default; nose and unittest still appear in legacy codebases but pytest is what new projects pick. Go has its built-in testing package, which is good enough that very few projects bother with anything else.

Integration tests are the middle layer because they catch the bugs that show up when components meet but unit tests cannot see. A unit test mocks the database. An integration test uses a real database, possibly an in-memory one for speed, possibly a containerized one for fidelity. The cost goes up to hundreds of milliseconds per test because there is real I/O. The benefit is that bugs in SQL queries, schema mismatches, and ORM behavior surface here instead of in production. For Node projects, Testcontainers and the test database adapters from your ORM (Prisma, Drizzle, TypeORM) handle the setup. For Python, pytest-postgresql and pytest-mysql let you run real databases inside the test process. The pattern is the same regardless of language: spin up a real dependency, run the test against it, tear down.
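
The pattern, sketched for Node with Testcontainers and the pg client. The package names are real; the schema, query, and timeout are illustrative assumptions:

// An illustrative integration test against a real Postgres container.
// Assumes @testcontainers/postgresql and pg are installed; the schema is hypothetical.
import { describe, it, expect, beforeAll, afterAll } from 'vitest'
import { PostgreSqlContainer, StartedPostgreSqlContainer } from '@testcontainers/postgresql'
import { Client } from 'pg'

let container: StartedPostgreSqlContainer
let client: Client

beforeAll(async () => {
  container = await new PostgreSqlContainer().start() // spin up a real database
  client = new Client({ connectionString: container.getConnectionUri() })
  await client.connect()
  await client.query('CREATE TABLE users (id serial PRIMARY KEY, email text UNIQUE)')
}, 60_000) // first run pulls the image, so allow a generous timeout

afterAll(async () => {
  await client.end()
  await container.stop() // tear down
})

describe('users table', () => {
  it('rejects duplicate emails at the schema level', async () => {
    await client.query("INSERT INTO users (email) VALUES ('a@example.com')")
    await expect(
      client.query("INSERT INTO users (email) VALUES ('a@example.com')")
    ).rejects.toThrow() // the unique constraint is real, not mocked
  })
})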

End-to-end tests are the top because they cost the most and catch the most user-visible bugs. Playwright is the modern leader for browser automation and has matured into a tool that handles flakiness better than its predecessors. Cypress remains popular and has its strengths in the developer experience around test debugging. Both run a real browser against a real (usually local) instance of the app, click things, fill forms, and assert on the resulting state. The cost is seconds to tens of seconds per test. The benefit is that the tests catch the bugs that only appear when the whole system is running together: routing issues, state management bugs, frontend-backend contract violations. The pyramid says "few" of these because you cannot afford to have many. A hundred E2E tests at ten seconds each is a seventeen-minute suite, which is too slow for the inner loop. Forty good E2E tests is plenty if they cover the critical paths.
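
A critical-path E2E test in Playwright is short. The sketch below assumes a baseURL configured in playwright.config.ts; the routes, labels, and credentials are illustrative:

// An illustrative critical-path E2E test. Routes, labels, and credentials
// are hypothetical; assumes baseURL is set in playwright.config.ts.
import { test, expect } from '@playwright/test'

test('user can log in and reach the dashboard', async ({ page }) => {
  await page.goto('/login')
  await page.getByLabel('Email').fill('user@example.com')
  await page.getByLabel('Password').fill('correct horse battery staple')
  await page.getByRole('button', { name: 'Sign in' }).click()
  await expect(page).toHaveURL(/\/dashboard/) // assert on the assembled system's state
  await expect(page.getByRole('heading', { name: 'Dashboard' })).toBeVisible()
})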

The pyramid at a glance: many unit tests (milliseconds each) at the base, some integration tests (hundreds of milliseconds each) in the middle, few E2E tests (seconds each) at the top, all aimed at critical-path coverage.

The cost gradient is also the speed gradient, which is why the pyramid is the right shape. Unit tests are cheap to run, so you run them constantly. Integration tests are medium, so you run them on commit. E2E tests are expensive, so you run them on PR or on merge. The shape lets you maintain fast feedback for the cases where speed matters and accept slow feedback for the cases where speed is impossible. Reversing the shape (the so-called "ice cream cone" antipattern, with many E2E tests and few unit tests) produces a suite that runs slowly, fails flakily, and provides poor signal because failures point at the whole system rather than the specific component that broke.

Mocking is its own art. The principle is to mock what you do not own and use real implementations of what you do own. The HTTP API of a third-party service is mocked because you do not control it and cannot reliably test against it. Your own database is real (in integration tests) because you control it and want the tests to catch schema drift. The agent will sometimes mock things it should not, like your own internal modules, because mocking is a generic pattern in its training data and it does not know which dependencies are owned vs not. The reviewer's habit is to scan new test files for over-mocking and push back when the mocks have replaced the actual logic being tested.

For HTTP mocking, MSW (Mock Service Worker) is the modern choice because it intercepts at the network layer rather than at the fetch-call layer, which means the same mocks work across unit, integration, and E2E tests. Nock is older and simpler. For Python, responses or pytest-httpx. The advantage of network-layer mocking is that you do not have to refactor your code to inject mocks; the mocks attach themselves to the network and your code runs normally.
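
A minimal sketch, assuming MSW's 2.x API; the endpoint and payload are illustrative. The code under test calls fetch normally and never learns the mock exists:

// Illustrative MSW setup: interception happens at the network layer, so the
// code under test needs no injected mocks. Endpoint and payload are hypothetical.
import { beforeAll, afterEach, afterAll, it, expect } from 'vitest'
import { setupServer } from 'msw/node'
import { http, HttpResponse } from 'msw'

const server = setupServer(
  http.get('https://api.example.com/users/:id', ({ params }) =>
    HttpResponse.json({ id: params.id, name: 'Ada Lovelace' })
  )
)

beforeAll(() => server.listen())
afterEach(() => server.resetHandlers()) // keep tests independent
afterAll(() => server.close())

it('fetches a user through the mocked network', async () => {
  const res = await fetch('https://api.example.com/users/1')
  const user = await res.json()
  expect(user.name).toBe('Ada Lovelace')
})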

One specific pattern that helps with the pyramid is naming the layer in the test file. auth.unit.test.ts, auth.integration.test.ts, auth.e2e.test.ts. The naming makes it possible to run only unit tests during the inner loop and reserve integration and E2E for slower passes. The agent will pick this up if you use it consistently in the codebase. Without the naming convention, the agent dumps all tests into one folder and you cannot easily filter.
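
One way to wire the convention into the runner, sketched for Vitest: a per-layer config whose include glob matches only one suffix, selected with the --config flag the way the verify script later in this piece selects its integration config. The globs are illustrative:

// vitest.unit.config.ts: an illustrative per-layer config. A matching
// vitest.integration.config.ts would include '**/*.integration.test.ts'.
import { defineConfig } from 'vitest/config'

export default defineConfig({
  test: {
    include: ['src/**/*.unit.test.ts'], // only the fast layer runs here
  },
})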

Tests AI Generates Well vs Poorly

The agent has clear strengths and clear weaknesses in test generation. Knowing which is which lets you delegate to its strengths and reserve your own attention for its weaknesses. The mistake is treating "the agent generated tests" as a uniform claim. It generated some tests well and some tests poorly, and the breakdown is predictable.

The agent is good at parametric edge cases. Empty input, null input, undefined input, max-length input, zero, negative, very large, unicode, whitespace-only, leading/trailing whitespace. Ask for "tests for this function including edge cases" and you get a respectable list of cases the function should handle. The cases are real, the agent enumerates them well, and the tests genuinely catch off-by-one errors and null-handling bugs. This category alone is worth the cost of asking, because humans tend to forget at least one edge case per function and the agent does not get tired.
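
In Vitest, that enumeration usually lands as one parametric table. A sketch against a hypothetical slugify function, with the kind of cases the agent reliably produces:

// Illustrative edge-case enumeration with it.each; slugify is hypothetical.
import { describe, it, expect } from 'vitest'
import { slugify } from './slugify'

describe('slugify edge cases', () => {
  it.each([
    ['', ''], // empty input
    ['   ', ''], // whitespace-only
    ['  Hello World  ', 'hello-world'], // leading/trailing whitespace
    ['Crème Brûlée', 'creme-brulee'], // unicode
    ['a'.repeat(300), 'a'.repeat(300)], // very large input
  ])('slugify(%j) === %j', (input, expected) => {
    expect(slugify(input)).toBe(expected)
  })
})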

The agent is good at happy-path coverage. Given a function with five branches, the agent will write a test for each branch, hitting the typical input that exercises it. The tests are not always deep, but they cover the surface, which means a code change that breaks one of the branches will produce a failing test rather than a silent regression. Combined with edge cases, this gets you the bulk of unit test coverage without much effort.

The agent is good at mock setup. Given an external dependency, the agent writes the mock, the test that uses the mock, and the assertions that verify the mock was called correctly. The boilerplate of mocking is annoying for humans and the agent does it cheerfully. Especially in JavaScript and TypeScript, where mock setup involves multiple lines of imports, jest-mock-extended or vitest's mock helpers, and careful typing, the agent saves real time.

The agent is good at snapshot tests. The pattern is mechanical: render the component, capture the output, assert it matches the saved snapshot. The agent writes these correctly and quickly. The caveat is that snapshot tests have a specific failure mode the agent will exploit; that is the next section.

The agent is bad at tests that capture properties the code should not satisfy. The negative space. "This function should never return null when given valid input." "This endpoint should never expose another user's data." "This query should never return more rows than the user has permission to see." These are properties about what the system must not do, and they are the most important properties for security and correctness. The agent does not naturally generate these because they require understanding the threat model, which is your domain knowledge, not the agent's. You have to write these properties explicitly or prompt the agent with the threat model.

Tests AI generates well

Edge case enumeration: empty, null, max length, unicode, zero, negative. Happy path branch coverage. Mock setup boilerplate. Snapshot tests for component output. Type tests in TypeScript. Parametric tests where the same logic runs against many inputs. The pattern: mechanical, enumerable, locally derivable from the function signature and code. The agent does these faster and more thoroughly than a tired human.

Tests AI generates poorly

Negative space: properties the code must not satisfy. Cross-tenant data leak checks. Authorization bypass tests. Concurrency and race condition tests. Performance assertions. Business invariants ("this number should never be negative"). Property-based tests where the property has to be designed. Anything requiring a threat model or domain knowledge the agent does not have. These need human design first; agent implements after.

The agent is bad at tests that catch business-logic violations. "An order total must equal the sum of its line items." "A user's role can never be elevated by a non-admin." "A scheduled event must not be deleted while attendees are confirmed." These rules come from the product's domain, and the agent has not necessarily read your domain. It can produce them if you describe the rules, but it will not invent them from the code alone. The reviewer's job is to articulate the invariants and ask for tests that check them.
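
Once you articulate the rule, the test itself is mechanical. A sketch with fast-check, using a hypothetical computeOrderTotal and an illustrative line-item shape:

// Illustrative business-invariant property. The line-item shape and
// computeOrderTotal are hypothetical stand-ins for your domain code.
import { it } from 'vitest'
import * as fc from 'fast-check'
import { computeOrderTotal } from './orders'

const lineItem = fc.record({
  priceCents: fc.integer({ min: 0, max: 1_000_000 }),
  quantity: fc.integer({ min: 1, max: 100 }),
})

it('order total always equals the sum of its line items', () => {
  fc.assert(
    fc.property(fc.array(lineItem), (items) => {
      const expected = items.reduce((sum, i) => sum + i.priceCents * i.quantity, 0)
      return computeOrderTotal(items) === expected
    })
  )
})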

The agent is bad at performance assertions. "This query should complete in under 100ms with 10,000 rows." "This API endpoint should handle 100 requests per second." Performance tests require running against realistic data with realistic load, and most CI environments do not provide either. The agent will write a performance test if asked, but the test will run against the same fixture data as the unit tests, which means it will pass at scales the production system never sees. Real performance testing belongs in a separate suite, against a separate environment, with realistic data. That is a manual setup the agent cannot do for you.

The agent is bad at concurrency tests. Race conditions are notoriously hard to test for; even humans rarely write good concurrency tests. The agent will produce a test that calls a function twice and asserts on the result, which catches nothing because the calls are not actually concurrent. Real concurrency testing involves running many threads or processes against the same resource and observing the failure modes. Tools like k6 for load generation, or custom stress and fuzzing harnesses, are needed; the agent's default test will not get you there.

The agent is mixed on integration tests. The framework code (spin up a database, seed it, run the test, tear down) is mechanical and the agent handles it. The test design (which scenarios to cover, what dependencies to include, where to draw the boundary) requires judgment the agent does not have without prompting. Ask for integration tests and review them against your sense of what the integration boundary actually is.

The discipline this implies is unsurprising: delegate enumeration to the agent, reserve invariants and threat modeling for yourself. The agent gives you broad coverage cheaply. You add the deep checks that come from domain knowledge. The combination is more thorough than either alone, and it is the division of labor that uses both parties' strengths.

Property-Based Testing

Property-based testing is the technique that closes some of the gap the agent leaves. The shape is different from unit testing. Instead of writing test cases (specific inputs and expected outputs), you write properties (statements about the function that should be true for all valid inputs). The framework generates random inputs and checks the property. When the property fails, the framework shrinks the failing input down to its minimal form and reports it. You get coverage that no human or agent could enumerate, and you get it from a few lines of code.

The classic example is the property "reversing a list twice gives the original list." A unit test for reverse would write three or four cases. A property test asserts the property and lets the framework throw thousands of random lists at the function. If reverse has a bug for any of them, the framework finds it and reports the smallest failing case. The same property test catches bugs you would never have written a unit test for, because you would not have thought of those specific inputs.

For JavaScript and TypeScript, fast-check is the established library. It generates inputs based on type-aware arbitraries: integers, strings, arrays, objects, custom shapes you define. The integration with Vitest and Jest is direct: you wrap the property in fc.assert(fc.property(...)) and the test runs as part of the normal suite. The shrinking is good, which means a failing test reports a minimal failing input you can paste into a regression test.

// Example: property test for a sort function
import { describe, it } from 'vitest'
import * as fc from 'fast-check'
import { sortAsc } from './sort'

describe('sortAsc', () => {
  it('output is the same length as input', () => {
    fc.assert(
      fc.property(fc.array(fc.integer()), (arr) => {
        return sortAsc(arr).length === arr.length
      })
    )
  })

  it('output is in non-decreasing order', () => {
    fc.assert(
      fc.property(fc.array(fc.integer()), (arr) => {
        const sorted = sortAsc(arr)
        for (let i = 1; i < sorted.length; i++) {
          if (sorted[i] < sorted[i - 1]) return false
        }
        return true
      })
    )
  })

  it('output contains the same multiset of elements as input', () => {
    fc.assert(
      fc.property(fc.array(fc.integer()), (arr) => {
        // Sort both sides with a known-good comparator so the comparison
        // is order-independent: no values lost, gained, or duplicated.
        const a = [...arr].sort((x, y) => x - y)
        const b = [...sortAsc(arr)].sort((x, y) => x - y)
        return JSON.stringify(a) === JSON.stringify(b)
      })
    )
  })
})

Three properties together specify a sort function more completely than a dozen example-based tests. The first asserts the length is preserved. The second asserts the order is correct. The third asserts the elements are the same (no values lost or gained). Any sort function passing all three is correct. Any bug in the sort function breaks at least one property. The framework checks them against thousands of random inputs and finds bugs you would never have hand-written tests for.

For Python, Hypothesis is the equivalent and is one of the most polished property-based testing libraries in any language. The Hypothesis API is similar: you decorate a function with @given and a strategy describing the input shape, and the framework runs the function against many randomly-generated inputs. The shrinking is excellent. The integration with pytest is seamless. If you write Python and have not used Hypothesis, this is the highest-leverage testing tool you are not using.

The reason the agent struggles with property-based testing is that the properties have to be designed, not enumerated. The agent is good at "list the cases that might break this function." It is bad at "what invariant must this function maintain." The first is enumeration; the second is design. Design requires understanding what the function is for, what it must guarantee, and what it must not violate. That understanding is your domain knowledge, expressed as a mathematical property.

This is where the human earns the money. Identifying the invariants is the work the agent cannot do for you, because the agent does not have the model of the system that lets it know what must always be true. You write the property; the agent writes the test scaffold around it. The collaboration is clean: design from you, mechanical writing from the agent, comprehensive coverage from the framework.

Common invariants that pay back when expressed as properties: round-trip operations (encode then decode equals the original), idempotent operations (running twice equals running once), commutative operations (order does not matter), monotonic operations (output grows with input), conservation properties (sum of parts equals total). Each of these can be expressed as a one-line property and tested against thousands of random inputs. Each of them catches a class of bugs that example-based tests miss.
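
Two of those, sketched with fast-check. Both properties hold by construction (decodeURIComponent inverts encodeURIComponent, and trim is idempotent), so the sketch shows the shape rather than a live bug hunt:

// Round-trip: decode(encode(x)) === x for every generated string.
import { it } from 'vitest'
import * as fc from 'fast-check'

it('encodeURIComponent round-trips through decodeURIComponent', () => {
  fc.assert(
    fc.property(fc.string(), (s) => decodeURIComponent(encodeURIComponent(s)) === s)
  )
})

// Idempotence: applying the operation twice equals applying it once.
it('trim is idempotent', () => {
  fc.assert(
    fc.property(fc.string(), (s) => s.trim().trim() === s.trim())
  )
})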

The other use of property-based testing is regression confirmation. When a property test finds a bug and the framework reports a minimal failing input, you save that input as a regression test. The next time the test suite runs, it explicitly tests against that input as well as new random ones. The combination of "random property check" and "saved regression cases" gives you both broad coverage and assurance that specific past bugs do not return.
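
fast-check supports this directly through the examples parameter on fc.assert, which replays pinned inputs before generating fresh random ones. The pinned inputs below are illustrative:

// Illustrative regression pinning: examples replay known failing inputs
// on every run, alongside the fresh random ones.
import { it } from 'vitest'
import * as fc from 'fast-check'
import { sortAsc } from './sort'

it('stays sorted, including past regressions', () => {
  fc.assert(
    fc.property(fc.array(fc.integer()), (arr) => {
      const sorted = sortAsc(arr)
      return sorted.every((v, i) => i === 0 || sorted[i - 1] <= v)
    }),
    { examples: [[[2, 1]], [[0, -1, 0]]] } // each entry is one argument list for the property
  )
})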

The cost of property-based testing is real. The tests run slower because they execute the property against many inputs. Generating arbitraries for complex types takes thought. Debugging a property failure can be harder than debugging a unit test failure because the failing input may be unfamiliar. The benefit is the bugs you catch that no other technique would have caught. For high-stakes code (parsers, serializers, financial calculations, cryptographic primitives, anything where correctness is non-negotiable), the cost is worth paying.

Snapshot Testing (And the Antipattern)

Snapshot testing is the technique of capturing the output of a function or component, saving it to a file, and asserting on subsequent runs that the output still matches the saved file. For UI components, the snapshot is the rendered HTML or a serialized representation. For functions, the snapshot is the return value. For API endpoints, the snapshot is the response body. The pattern is simple: run, capture, compare on every future run.

The strengths are real. Snapshot tests catch unintended changes to output. They are cheap to write because the developer does not have to enumerate expected output; the framework records it on the first run. They produce readable diffs when output changes, so the reviewer can see exactly what changed and decide whether the change was intended. For component libraries, design systems, and code generators, snapshot tests are one of the most efficient ways to lock in correct output.

The antipattern is what happens when the agent is allowed to update snapshots automatically. The flow looks innocent: the agent makes a change, a snapshot test fails because the output changed, the agent runs the snapshot update command, the test passes. The output that was previously expected is now considered the new expected output. The change has been recorded but not reviewed. The next time the agent makes a change that breaks the same snapshot, the same flow repeats. After a dozen iterations, the snapshots in the repo have drifted from any human-reviewed reference and now record whatever the agent has been producing, regardless of whether that output is correct.

This failure mode is specific to AI workflows because humans rarely run the snapshot update command without thinking. The cost of typing the command is enough friction that a developer pauses, considers whether the change was intentional, and updates only when it was. The agent does not pause. The agent sees a failed test, sees that the test is a snapshot test, sees the conventional "fix" for a snapshot test failure (update the snapshot), and runs the command. The agent has "fixed" the test in the same way a human cleaning a windshield with a sledgehammer has "fixed" the bugs.

Takeaway

Never let the agent run snapshot update commands automatically. The whole point of snapshot tests is that a human approves the recorded output. If the agent regenerates the snapshot to "fix" a test, the test now records whatever the agent decided to produce, which is exactly the thing the test was supposed to verify. The defense is procedural: snapshot updates are human-only operations, and any commit message or PR that says "updated snapshots" gets reviewed for whether the new snapshot is actually correct. Automate the prevention by removing the update command from the agent's allowed commands list.

The defense has two layers. The first layer is access control. Whatever harness runs the agent (Claude Code, Cursor, an in-house tool) usually has a way to specify which commands are allowed and which are not. vitest -u, jest --updateSnapshot, pytest --snapshot-update should all be denied. If the agent cannot run them, it cannot trigger the antipattern. The second layer is review. When a PR includes changes to snapshot files, the reviewer reads the snapshot diff and confirms that the new output is correct. This is an extra step but it is the step that prevents drift.
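
What the access-control layer can look like in practice, sketched for Claude Code's settings file. The deny-rule syntax varies across harnesses and versions, so treat the patterns as illustrative and check your tool's documentation:

// .claude/settings.json: an illustrative deny list for snapshot updates.
// The exact rule syntax is harness-specific; this is a sketch, not gospel.
{
  "permissions": {
    "deny": [
      "Bash(vitest -u:*)",
      "Bash(jest --updateSnapshot:*)",
      "Bash(pytest --snapshot-update:*)"
    ]
  }
}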

The third layer is visibility. Some teams put snapshot files in a dedicated directory and configure the diff viewer to expand them by default. A snapshot change that is buried inside a file diff next to a hundred lines of code change is easy to skim past. A snapshot change that takes up its own panel is hard to miss.

The other thing to watch is snapshot test bloat. The agent will generate snapshot tests cheerfully, sometimes too cheerfully. A component library can end up with a snapshot test for every component variant, which is hundreds of files, none of which gets read carefully. The maintenance cost goes up because every visual change requires updating dozens of snapshots, and the value goes down because the snapshots are no longer human-reviewed in any meaningful sense. Keep the snapshot tests targeted to the cases where the snapshot meaningfully captures the expected output.

The healthy use of snapshot tests is for output that should rarely change and that has a clear "expected" form. A code generator's output. A stable API response. A design system component's HTML. For these, snapshots lock in correctness and catch regressions efficiently. The unhealthy use is using snapshots as a substitute for thinking about what the test should assert. If you do not know what the function should return, a snapshot does not tell you; it just records whatever the function returns and hopes that is the same as correct.

Continuous Integration for AI Workflows

The CI loop is the perimeter check. The agent commits, CI runs, failures surface within minutes, you fix them before they ship. Without CI, the test suite is a request: please run me. With CI, the test suite is a gate: nothing merges without passing. The difference matters more under AI than it ever did, because the volume of commits is higher and the human cannot manually run the full suite on every iteration.

The CI rule is simple and absolute: every PR runs the full test suite, including AI-generated PRs. The temptation under AI is to skip CI for "small" changes, because the agent's changes are often small in line count and feel low-risk. This is a trap. The agent's small changes have a higher probability of subtle breakage than human small changes, because the agent has less context about what else might be affected. CI is the correction for that information asymmetry. Run it on everything, no exceptions.

The CI providers are well-known and roughly interchangeable for the purpose of running tests. GitHub Actions is the default for projects on GitHub because the integration is seamless and the free tier is generous. GitLab CI is the equivalent for GitLab. CircleCI is the strong third-party choice with good caching and parallelism. Jenkins is still around for self-hosted needs. The choice between them rarely affects the technical work; the work is in writing the workflow file that runs the right tests in the right order with the right caching.

The pre-commit layer is the fast-feedback layer that runs locally before CI ever sees the change. Husky is the standard for setting up git hooks in JavaScript projects; lint-staged runs commands only on the files being committed. The hooks should run lint, type check, and the unit tests. They should not run integration or E2E tests because those are too slow for the commit phase. The point is fast feedback at minimal cost. The developer (or the agent) commits, the pre-commit hook runs in five to thirty seconds, and the obvious problems surface immediately.
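
A minimal sketch of that layer, assuming Husky-managed hooks and a lint-staged block in package.json; the globs and script names are illustrative:

# .husky/pre-commit (a sketch): runs before every commit
npx lint-staged && npm run typecheck && npm run test:unit

// package.json (fragment): lint-staged runs its commands on staged files only
{
  "lint-staged": {
    "*.{ts,tsx}": "eslint --max-warnings 0"
  }
}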

1. Pre-commit: lint, typecheck, unit tests. Husky + lint-staged. Runs in under 30 seconds. Catches the obvious mistakes before they leave the developer's machine. Fails the commit if anything is wrong.

2. CI on push: full unit suite, lint, typecheck, build. GitHub Actions or equivalent. Runs in 1-3 minutes. Catches anything pre-commit missed (different OS, different Node version, missing dependency).

3. CI on PR: integration tests, security scans. Heavier checks that need real services or take longer. Runs in 3-10 minutes. Catches database, API, and cross-component bugs.

4. CI on merge: E2E tests, deploy preview. The slowest tests run last because they are the most expensive. Runs in 5-20 minutes. Catches the bugs that only appear in the assembled system.

5. Production deploy: smoke tests, monitoring. After deploy, lightweight checks that the key paths work in production. Pages someone if they fail. Catches the bugs that only appear with real traffic.

The order matters. Cheap and fast checks run first, expensive checks run later. The pyramid logic from the test layers applies again here: catch as many bugs as possible at each layer before moving to the next, because each subsequent layer is more expensive and slower. A bug caught by the linter is found in seconds. The same bug caught by E2E is found in minutes. The same bug caught in production is found in hours and may have already affected users.

For the agent specifically, CI is the contract that lets you delegate with confidence. The agent says it implemented a feature; CI confirms or denies. Without CI, the agent's claim is just text. With CI, the claim is verified or rejected by a process that does not lie. The agent cannot "convince" CI the way it can sometimes convince a tired reviewer. The tests pass or they do not.

One specific pattern that helps is requiring CI to pass before the agent can merge. In GitHub, this is "require status checks to pass before merging" plus "require branches to be up to date before merging." The combination means that even if the agent has push access and merges its own PR (which is sometimes the workflow), it cannot do so without green checks. The discipline is enforced by the platform rather than by individual willpower.

Caching is the other thing that matters in CI. Without caching, every CI run reinstalls dependencies from scratch, which adds minutes to every run. With caching, the cache hits and the install takes seconds. GitHub Actions has the actions/cache action; CircleCI has its caching primitives; all the major providers handle this. The configuration is fiddly the first time and saves hours per week once it works. The agent can write the cache configuration if you ask it to, and the configuration usually transfers across projects with minor adjustments.

Parallelism matters at scale. A unit suite of two thousand tests that runs sequentially in five minutes runs in one minute on five workers. Most providers support sharding the test suite across workers; Vitest and Jest both have built-in support. Playwright has it for E2E. The configuration is a few lines and the speedup is real. For projects with large suites, the difference between one-minute and five-minute CI is the difference between fast iteration and the kind of waiting that breaks flow.

The other CI consideration is flake management. Flaky tests pass sometimes and fail sometimes for reasons unrelated to the code change. Under AI workflows, flakes are particularly damaging because the agent will see a flaky failure and propose a "fix" that is unrelated to the actual cause, sometimes making the flake worse. The discipline is to identify flakes quickly (track which tests fail intermittently across runs) and fix them at the source rather than retrying them. Tools like Buildkite Test Analytics, GitHub's flaky test reports, or custom tracking can flag the offenders. Once flagged, the test is either fixed (if the underlying race condition can be eliminated) or quarantined (moved to a separate suite that does not block CI). Letting flakes accumulate is the surest way to make CI useless: when failures are 30% noise, developers learn to ignore failures, and real bugs slip through.

Coverage as a Heuristic, Not a Goal

Coverage metrics measure how much of the code is executed by the test suite. The standard measurement is line coverage: of the lines of code in the project, what percentage runs during tests. Branch coverage and function coverage are variants. Coverage tools (Istanbul/c8 for JavaScript, coverage.py for Python) integrate with the test runners and produce reports.

The measurement is useful as a heuristic. A file with 0% coverage has no tests at all and is a clear gap. A file with 50% coverage is being tested unevenly and probably has untested branches. A file with 90% coverage is well-covered. The numbers correlate roughly with how much the suite catches in that file.

The measurement is not useful as a goal. Coverage targets ("we must hit 90% coverage") produce tests that are written to satisfy the metric rather than to verify behavior. Under AI, the failure mode is acute: the agent will happily write tests that exercise code without asserting anything meaningful about it, because that is the easiest way to raise the coverage number. A test that calls every function and asserts expect(result).toBeDefined() hits 100% coverage and catches almost nothing. The metric goes up; the suite's actual quality stays flat.

The principled stance is to use coverage as a diagnostic, not a target. Look at the report, find the files with low coverage, ask whether those files actually need tests, and write the tests if so. The number itself is a means, not an end. Aiming for 100% coverage is vanity. Aiming for "the critical paths are covered well" is engineering.

0-30% coverage: untested, large risk.
30-60% coverage: spotty, branch gaps.
60-80% coverage: solid for most projects.
80-95% coverage: thorough, reasonable for high-stakes code.
95-100% coverage: vanity territory, often gamed.

Better metrics exist for measuring suite quality. Critical-path coverage asks whether the user-facing paths that matter most are tested. The list of critical paths is usually short (login, checkout, the three or four operations that drive the business). Coverage for these paths should be high. Coverage for utility code that is rarely touched can be lower without much risk.

Mutation testing is the deeper technique. The tool (Stryker for JavaScript, mutmut for Python) modifies the code in small ways (flip a comparison, change an operator, replace a constant) and runs the tests against the mutated version. If the tests pass with the mutation, the mutation "survived" and the tests do not actually catch that change. A high mutation score (the percentage of mutations the tests kill) means the tests are sensitive to the kinds of changes that matter. A low mutation score means the tests pass even when the code is broken in plausible ways, which is exactly the failure mode the suite is supposed to prevent.
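
A minimal Stryker configuration for a Vitest project might look like the sketch below. It assumes the @stryker-mutator/vitest-runner plugin; the globs and thresholds are illustrative:

// stryker.config.json: an illustrative mutation testing config.
// Assumes @stryker-mutator/core plus @stryker-mutator/vitest-runner.
{
  "testRunner": "vitest",
  "mutate": ["src/**/*.ts", "!src/**/*.test.ts"],
  "thresholds": { "high": 80, "low": 60, "break": 50 }
}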

Mutation testing is slower than coverage measurement (every mutation requires a full test run) but produces a more honest signal. A suite with 95% line coverage and a 30% mutation score is a suite that runs through the code without checking the code does anything. A suite with 80% line coverage and an 80% mutation score is a suite that actually catches breakage. The mutation score is the better number when you want to know whether the tests are doing their job.

The "what would breaking this look like" mental check is the human-side equivalent. For each test in the suite, ask: if I introduced a bug here, would this test catch it? If the answer is no for many tests, the suite has a quality problem regardless of what the coverage metric says. The check is informal but it surfaces the same issue mutation testing surfaces, faster, on the tests you are actively reading.

The agent will sometimes write tests that fail this check. A test that asserts only that a function returns truthy. A test that mocks the function under test and verifies the mock. A test whose assertion is unrelated to the function's actual purpose. The reviewer's habit is to look at each test and ask the breakage question. Tests that fail the question are deleted or rewritten.

The other coverage trap is the imbalance between unit and integration. A codebase with 95% unit coverage and no integration tests has no information about whether the components work together. Coverage at the file level can hide gaps at the integration level. The check is to look at coverage by test type, not just by file: what percentage of integration paths are covered, what percentage of E2E paths. The pyramid story applies to coverage as much as to test count.

The Post-AI-Edit Verification Pattern

After every multi-file change the agent makes, you run a verification command. The command does several things in sequence: type check, lint, unit tests, smoke integration tests, build. If any step fails, the change is treated as not-yet-done and the agent fixes the failure before moving on. If all steps pass, the change is considered safe to proceed from. The pattern turns "agent made changes" into a binary: verified or not.

The command should be one keystroke. Anything more and you will skip it under pressure, which means it does not actually run, which means the verification does not happen, which means you ship breakage. The convention in JavaScript projects is npm run verify or pnpm verify or bun verify; in Python it is often make check or ./verify.sh. The exact name does not matter. What matters is that there is one canonical command and that it does the full check.

// Example verify script in package.json
{
  "scripts": {
    "verify": "npm run typecheck && npm run lint && npm run test:unit && npm run build",
    "typecheck": "tsc --noEmit",
    "lint": "eslint . --max-warnings 0",
    "test:unit": "vitest run",
    "test:integration": "vitest run --config vitest.integration.config.ts",
    "test:e2e": "playwright test",
    "build": "next build"
  }
}

The contents of the verify script depend on the project. The principle is to include everything that is fast enough to run on every change. Type check, lint, unit tests, build, in that order. Type check first because it fails fastest. Lint second because it catches style and obvious bugs. Unit tests third because they are the deepest correctness check that is still fast. Build last because it confirms the whole thing actually compiles together. Integration and E2E tests are not in the default verify because they are slower, but they should be a separate command (npm run verify:full) that runs before merging.

The "if verify fails, the change did not happen" stance is the cultural anchor. Without it, a failing verify is just information; with it, a failing verify is a blocker. The agent has produced something that does not pass verification; the something does not exist as far as the workflow is concerned. The agent fixes it, runs verify again, and moves on only after verify passes. The discipline is mechanical and prevents the slow drift where small breakages accumulate into a big problem.

For agentic harnesses (Claude Code, Cursor, etc.), the verify command should be in the agent's allowed commands list. The agent runs it after each multi-file change. If it fails, the agent reads the failure output and fixes the problem. If it passes, the agent reports the change as complete. The harness is the enforcement layer; the verify command is the substance.

One specific pattern that helps is having the verify command also fail on warnings, not just errors. Many lint tools have a "max warnings" flag that causes the command to exit non-zero if any warnings are present. Setting this to zero turns warnings into errors for the verify pass, which prevents warning-bloat over time. The agent will sometimes generate code that produces a warning rather than an error; without the strict flag, the warnings accumulate. With the strict flag, the agent fixes them as they appear.

The verify command should also be fast. If it takes ten minutes, you will skip it. If it takes ninety seconds, you will run it every time. The optimization is to prioritize speed in the verify command and reserve the slower checks for the merge gate. Type check should be incremental (TypeScript's --incremental flag, or its watch mode). Lint should run only on changed files (lint-staged-style filtering). Unit tests should be fast enough that running the full suite takes seconds, not minutes. Build should use whatever caching the bundler provides.

The other thing the verify command does is provide a single source of truth for "is the project in a working state." When something goes wrong and you are debugging, you can run verify to confirm whether the issue is in the code you just changed or somewhere else. When you are about to start work, you can run verify to confirm the project is starting from a clean baseline. The command becomes a touchstone, and its existence makes the workflow more reliable.

For larger codebases, the verify command may need scoping. Running every test on every change is wasteful when only a small part of the codebase was touched. Tools like Nx, Turborepo, Rush, and Bazel can determine which tests are affected by a given change and run only those. The configuration is more involved but the speedup is dramatic for monorepos. The agent can navigate the affected-tests workflow if you wire it in; without the tooling, the agent runs everything and the loop slows down.

The closing test for whether your verify setup is working: ask, after a typical agent turn, whether you trust the project is still working. If the answer is yes because verify passed, the setup is good. If the answer is no, or if you are unsure, the setup needs more in it. The point of verify is to give you the confidence to keep going. If it does not give you that, fix it until it does.

Closing

The discipline of testing was always undervalued. Teams that wrote tests shipped fewer incidents and spent less time on regressions. Teams that did not write tests sometimes got away with it for years, and the costs were diffuse enough that the trade-off looked reasonable in the short term. The argument for testing was real but felt theoretical, especially for fast-moving projects where the next feature was always more urgent than the test for the last one.

AI assistance does not just shift the math; it inverts it. The cost of writing tests dropped to near zero because the agent writes them. The benefit of having tests rose dramatically because the agent makes more changes per session and you cannot review them all by hand. The combination means that not having tests is now strictly worse, on every axis. The teams that ship quickly and reliably under AI are the teams whose test suites catch the agent's mistakes before they become anyone else's mistakes.

The high-level moves are simple. Write tests alongside code, not after, and let the agent do most of the writing. Maintain the pyramid: many fast unit tests, fewer integration tests, a small number of E2E tests for critical paths. Use property-based testing for invariants you can articulate and the agent cannot. Treat snapshot tests as human-approval gates, not auto-update conveniences. Run CI on every PR, including AI-generated ones. Use coverage as a diagnostic, not a goal. Verify after every multi-file change with a one-keystroke command. None of these are new ideas. The newness is the urgency.

The closing thought is the one the section title implied. The developer who ships with tests sleeps. They sleep because the suite catches the bugs before the users do, and because the suite catches the regressions before the on-call rotation does, and because the agent's iterations land cleanly when they should and bounce back loudly when they should not. The developer who ships without tests waits for the user to find the bug. They wait for the support ticket, the angry email, the late-night page, the postmortem. The waiting is not free. It is paid in time, in trust, and in the slow erosion of the codebase as fixes pile on top of fixes without verification. The choice is which side of that line to be on. The tools are sitting there. The agent will write the tests. All you have to do is ask.