AI bugs are not the bugs you grew up debugging. The shape is different. The location is different. The cause is different. The reflexes that worked on a tired junior dev's code do not transfer cleanly to the output of a model that has read most of GitHub. If you walk into an AI bug expecting it to look like a human bug, you will spend an hour reading lines that compile and run and produce wrong output for reasons you would not invent on your worst day. The first habit of debugging AI-generated code is recognizing that the failure modes have moved.
This is not pessimism about agents. Agents produce working code at a rate humans cannot match. The point is that when they fail, they fail in patterns that come from how they were trained, not from how a person thinks. A human writes a bug because they were tired, distracted, or working at the edge of their understanding. An agent writes a bug because the statistical center of its training data did not match your specific codebase, your specific library version, or your specific intent. The bug is structural. It is a mismatch between the median of what the model has seen and the particular truth of where you are working. That is a different kind of bug, and the diagnostic moves that catch it are different too.
How AI Bugs Differ From Human Bugs
Human bugs cluster around fatigue, complexity, and the edges the developer forgot. The classic shape is a person who handled the happy path well, ran out of energy on the third edge case, and shipped a function that works for ninety percent of inputs and explodes on the rest. The other classic shape is a person who held a complicated system in their head and missed an interaction between two parts that touched each other only on a particular code path. Both shapes are about the limits of human cognition and attention. Debug them by checking edges, by isolating the path, by walking through the code line by line until the moment of confusion becomes visible.
AI bugs cluster somewhere else. The agent does not get tired. The agent does not forget edge cases the way a human does, because the edge case is just another token sequence in its training data. What the agent does is average. It pulls toward the statistical center of patterns that look like the problem you described. That center is sometimes correct for your problem. Often it is correct for a similar problem and wrong for yours by a margin small enough that the code compiles, runs, and produces output that looks plausible. The bug is in the distance between the median and your specific truth. That distance is invisible until you look for it.
The human bug: an edge case forgotten because the dev was on hour seven of debugging something else. An off-by-one because they were tired. A missing null check because they assumed the upstream service would always return a value. The bug is loud, located in the lines the dev wrote last, and shaped like the limits of human attention. It is found by tracing the path the dev was thinking about and noticing where the thinking ran out.
The AI bug: a hallucinated method that looks like it should exist on this library because it exists on three similar libraries. A wrong submodule path that matches the median import pattern across the training set. Mixed-version syntax because the model averaged across two major versions both present in its data. The bug is quiet, structurally plausible, and shaped like the average of what code like this usually looks like. It is found by checking that the code matches your specific reality, not just plausible reality.
The difference matters because the diagnostic strategy that catches a tired-human bug misses an AI bug routinely. A tired-human bug yields to careful line-by-line reading. You walk the code, you check each variable, you find the place where the assumption broke. The AI bug does not yield to that approach because each line is locally plausible. The line is wrong only in relation to your codebase, your library version, or your intent, and that wrongness does not show up in line-level reading. You have to compare the code to external truth: the actual library docs, the actual function signatures, the actual environment. Line-level reading does not reach that.
The other thing that changes is where bugs hide. In human code, bugs cluster in the parts the dev wrote at the end of a long session, in the new code, in the boundary between a refactor and the rest of the system. In AI code, bugs cluster differently. They show up in API calls because the model averaged libraries. They show up in imports because the model conflated submodule paths. They show up in version-specific syntax because the model's training data spanned multiple versions. They show up in config because the model invented a plausible env var name that is not yours. The geography of the bugs has moved. The places you used to look first are not the places to look first anymore.
Common AI Bug Patterns
The patterns are recognizable once you have seen them a few times. The first time, you spend an hour debugging because you are looking with human-bug eyes. The fifth time, you spot the pattern in thirty seconds because you know to check the API surface, the import path, or the version-specific syntax before reading the logic. Building the catalog of AI-specific patterns is one of the highest-payoff skills in agentic dev, because it turns hour-long debugging sessions into thirty-second pattern matches.
The first pattern is the hallucinated API. The agent calls a method that does not exist on the library you are using. The method exists in spirit because three similar libraries have it. The agent averaged across them and produced a method name that sounds right. In typed languages the call fails at compile time and you catch it immediately. In interpreted languages nothing resolves the call until runtime, so in Python or JavaScript or Ruby it might fail with an unhelpful error, silently do nothing, or slip past a duck-typing check that passes for the wrong reason. You find this bug by reading the actual library documentation, not by reading the calling code.
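A minimal sketch of that check, run from a REPL or scratch script rather than from memory. Here `requests.Session` stands in for whatever client the agent called into, and `fetch_all` is a made-up method of the kind the agent invents:

```python
# Check the real API surface of the installed library, not the agent's memory of it.
import inspect
import requests

cls = requests.Session

print(hasattr(cls, "fetch_all"))   # False: the method the agent called is not there
print(sorted(n for n in dir(cls) if not n.startswith("_")))  # what actually exists

# If a method does exist, confirm its signature matches how the agent called it.
print(inspect.signature(cls.get))
```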
The second pattern is the wrong-but-plausible import. The package name is right. The submodule path is wrong. The model has seen many imports across versions and projects, and the median path is sometimes not the path your version uses. You import from `package.submodule.thing` when the actual path is `package.thing` because the submodule was deprecated two versions ago. The error is loud (ImportError, ModuleNotFound) and easy to fix once you see it. Catching it costs you the time between writing the code and running it, which can be longer than you expect if the import is conditional or guarded.
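One cheap guard is an import smoke test that exercises every path the new code relies on, including the conditional ones, before anything else runs. A sketch with pytest; the module paths are placeholders for your own:

```python
import importlib
import pytest

# Every import path the new code relies on, including the ones hidden behind
# conditionals or try/except guards. Paths here are placeholders.
REQUIRED_MODULES = [
    "package.thing",
    "package.other.helper",
]

@pytest.mark.parametrize("module_path", REQUIRED_MODULES)
def test_import_resolves(module_path):
    # Fails with ModuleNotFoundError immediately, instead of at whatever moment
    # the guarded import finally executes.
    importlib.import_module(module_path)
```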
The third pattern is mixed-version syntax. The model has seen v3 and v5 of a library in its training data. It picks v3 for one part of the code and v5 for another, because both look correct in isolation. Together they are inconsistent. The code might run because the library accepts both signatures during a transition period, or it might fail with type errors that point at the version mismatch. Either way, the diagnostic move is to pin the version and verify that all the code matches that version's idioms. The agent will do this if you tell it the version. Without the version, it will average.
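A sketch of the version guard in Python; the library name and version bound are hypothetical:

```python
from importlib.metadata import version

def assert_supported_version() -> None:
    installed = version("somelib")            # hypothetical library name
    major = int(installed.split(".")[0])
    if major != 5:
        raise RuntimeError(
            f"somelib {installed} is installed, but this code targets v5.x; "
            "look for v3-style calls before debugging anything else"
        )
```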
The fourth pattern is confidently invented config. The agent knows your code needs an env var or a config file path. It does not know what your env var is named. So it picks one. The name is plausible. It might be `DATABASE_URL` or `API_KEY` or `REDIS_HOST`. Sometimes it is exactly your name and sometimes it is close-but-different. `DATABASE_URL` versus `DB_URL` versus `POSTGRES_URL` are all plausible names for the same thing, and the agent picks one without knowing which is yours. The bug surfaces when the code tries to read a config that is not set, or worse, when it reads a config you set for a different purpose and gets a value that is structurally valid but semantically wrong.
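The cheapest defense is to fail loudly at startup when a required variable is missing, so an invented name surfaces on the first run instead of three calls deep. A sketch; the variable names are illustrative:

```python
import os

# The env vars this service actually uses -- the single source of truth the
# agent's code must match. Names are illustrative.
REQUIRED_ENV_VARS = ["DATABASE_URL", "REDIS_HOST", "API_KEY"]

def load_config() -> dict:
    missing = [name for name in REQUIRED_ENV_VARS if name not in os.environ]
    if missing:
        # Fail at startup with the real names, instead of failing later behind a
        # plausible-but-wrong name buried three calls deep.
        raise RuntimeError(f"Missing required env vars: {', '.join(missing)}")
    return {name: os.environ[name] for name in REQUIRED_ENV_VARS}
```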
The fifth pattern is silent type coercion. The agent assumes a string where you have a number, or a number where you have a string, and the language quietly accepts both because of duck typing or implicit conversion. JavaScript is the worst offender because so much of the language coerces silently. Python is better because operations between strings and numbers usually fail loudly, but Python still has corners (boolean evaluation of zero versus empty string, for instance) where coercion produces wrong-but-plausible results. The bug surfaces when downstream code makes the wrong assumption about the type. The fix is type assertions or, in TypeScript and typed Python, actual type annotations the agent has to satisfy.
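A sketch of that Python corner: the truthiness guard the agent tends to write next to an explicit version that separates "absent" from "zero" and rejects the wrong type. Names and defaults are illustrative:

```python
DEFAULT_LIMIT = 50

def clamp_limit_buggy(limit) -> int:
    # 0 is a legitimate value ("return nothing"), but the falsy check maps it
    # to the default; and a string "25" from a query param passes the check
    # untouched and breaks arithmetic or slicing downstream.
    if not limit:
        return DEFAULT_LIMIT
    return limit

def clamp_limit(limit: int | None) -> int:
    # Distinguish "absent" from "zero", and reject the wrong type here instead
    # of letting it travel downstream.
    if limit is None:
        return DEFAULT_LIMIT
    if not isinstance(limit, int):
        raise TypeError(f"limit must be an int, got {type(limit).__name__}")
    return limit
```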
Beyond these five, there are smaller patterns: variable shadowing where the agent reused a name from outer scope, off-by-one in indexes when the agent translated between zero-indexed and one-indexed contexts, mishandled async where the agent forgot to await a promise. These are the same bugs humans write, just at different rates. The big five are the patterns that are distinctive to AI output. Build the catalog of those first because they are where the agent's failure modes diverge most sharply from human failure modes.
The Diagnostic Mindset for AI Output
The diagnostic mindset for AI code starts with three questions, asked in order, before any line-by-line reading. The questions are simple. The discipline is asking them every time, even when the code looks fine, because looking fine is the default state of AI output and is not evidence that the code works.
The first question is: did this run at all, or did it just compile? Compiling is a very low bar. A program can compile and crash on the first call. A program can compile and run for ten seconds before hitting the code path that breaks. A program can compile and silently do nothing because the function it was supposed to call returned early on a condition that was always true. None of those count as the code working. The minimum check is: did the code execute the path it was supposed to, did the expected side effects happen, and did the output match expectations? That is more than compilation and more than tests passing on a single input. It is the actual run of the actual code on actual data.
The second question is: does the test pass for the right reason? Tests are great. Tests passing is not. A test can pass because the code is correct, or because the test is testing something other than what its name suggests, or because the test is mocking the very thing that broke and so the breakage is invisible. The test passing only tells you that some condition was met. You still have to check that the condition is the condition you cared about. This is where the agent will sometimes write a test that is structurally a test but is not actually testing what you wanted, because the agent's mental model of what to test diverged from yours.
The third question is: does the code touch only the files it claimed to touch? An agent that says "I added a function to file X" and you check the diff and the diff shows changes to files X, Y, and Z is an agent that did not do what it said it did. Sometimes the changes to Y and Z are correct and the agent just did not mention them. Sometimes they are wrong and the agent did not notice. Either way, the gap between the narrative and the diff is information you need before you commit. Read the file list before you read the lines.
The working checklist, in order:

1. Run it. Compile is not run. Read is not run. The path you care about has to actually execute, on actual data, and the output has to match expectations. Until that has happened, you are guessing about whether the code works.
2. Read the test. Open the test. Read what it asserts. Confirm the assertion matches what you wanted to verify. Tests passing is information; tests testing the right thing is the information you actually need.
3. Match the diff to the narrative. The agent says it changed X. The diff shows X plus Y plus Z. Either the agent is being thorough in ways you should approve, or it is doing things off-script. Find out which before you commit.
4. Verify names against reality. Method names, import paths, function signatures, env var names. Do not trust that the agent got these right because they look plausible. Cross-check against real documentation, real type definitions, or a real running environment.
5. Read the paths that decide outcomes. Once the structural checks pass, read the lines on the paths that matter. Branching, error handling, state changes. Skip the boilerplate. Spend your attention on the parts that decide outcomes.
The "scan the diff before running" reflex is the cornerstone of all of this. Every multi-file change gets a quick diff scan before you do anything else. The scan is not a deep code review. It is a thirty-second pass to confirm three things: the file list matches expectations, the structural changes are what you asked for, and there are no surprise changes in unrelated files. If those three things are true, you proceed. If any of them is in doubt, you slow down. The scan costs you thirty seconds. Skipping it costs you an hour when you are debugging output and the bug is in a file you did not realize had been changed.
The mental shift is from reading-as-evaluation to reading-as-verification. Reading-as-evaluation is "I am going to read this code and decide if it is good." Reading-as-verification is "I have a hypothesis about what this code should do, and I am reading to confirm whether it does that." The second mode is faster and more accurate because it has a target. The first mode is slow and tends to drift toward "this looks fine" when you are tired. Build the habit of reading with a hypothesis. The hypothesis is the agent's stated plan or your own intent. The reading checks whether the code matches the hypothesis.
Debugging With AI vs Debugging By Yourself
Once you have identified a bug, you have a choice: ask the agent to fix it, or fix it yourself. Both options are real and both have failure modes. Knowing which to pick saves time on the cases where the agent is well-suited to debug, and saves time on the cases where the agent will keep proposing the same wrong fix in different costumes.
The agent is good at debugging when the bug has a clear stack trace, the error message is specific, and the symptom is isolated. A NullPointerException at a specific line in a function the agent recently wrote is a great candidate for delegation. The agent can read the line, infer what variable is null, and propose a check or a fix. A test failure with a clear assertion message is similar. The agent has all the information it needs to make a correct guess, and the verification is cheap because you can re-run the test.
The agent is bad at debugging when the bug is in its mental model of the system rather than in the code. If the agent thinks function X returns a number and it actually returns a string, all of the agent's fixes will assume a number. The fixes will not work because they are downstream of the wrong model. You can keep asking and the agent will keep producing fixes that miss because the underlying mental model has not been corrected. The pattern is recognizable: each fix changes a different line, none of them work, and the bug persists. That is the moment to take over and debug yourself, because no number of additional turns will produce a fix until the model is corrected, and you are the only one who can correct it.
Delegate when: there is a clear stack trace pointing at a specific line, an error message that names the actual problem, a failing test with a readable assertion, a bug isolated to code the agent recently wrote, a failure reproducible from a small input. The agent has enough information to make a correct guess, and you have a cheap way to verify the fix worked. Cost of trying: low. Cost of the agent failing: low, because you will know quickly.
Take over when: the symptom is in one place and the cause is in another, the bug depends on production state you cannot reproduce locally, or the agent has proposed two fixes that did not work and is about to propose a third in the same shape. The agent's mental model of how the system works is wrong, and no leaf-level fix will repair the bad model. You take over because you are the only one who can correct the model. Trying again with the agent is throwing turns at a problem the turns cannot solve.
The "explain it back" pattern is the cheapest way to find out which kind of bug you are dealing with. Before asking the agent to fix the bug, ask the agent to describe what it thinks is happening. If the description matches your understanding, the agent is in a good position to propose a fix. If the description diverges, you have learned something: either the agent is wrong about the system, or you are. Either case is worth knowing before you ask for a fix. If the agent is wrong, you correct the model. If you are wrong, you reconsider what you were debugging. Both outcomes are better than asking for a fix on top of a misunderstanding.
The escalation pattern is simple. First fix attempt: let the agent try. If it works, ship. If it fails in a way that gives new information, let the agent try again with the new information. If the second attempt fails in the same shape as the first, take over. Two failures with the same shape is a signal that the agent's mental model is the bug, and more turns will not break the loop. You can resume delegating after you have corrected the model and the agent has new context to work with.
The trap is the third attempt. People try the agent twice, fail twice, and then try a third time because the agent's third proposal sounds different. It is not different. It is a different costume on the same misunderstanding. The third attempt is a tax on your patience and a poor use of your debugging window. Set the rule at "two attempts, then I take over." The rule is not because the agent is bad. It is because the cost-benefit on the third attempt is worse than the cost-benefit on you stepping in.
One specific pattern that helps: when you take over, do not just write the fix. Tell the agent what was wrong with its understanding. The next time you ask it to debug similar code, the corrected understanding may carry forward in the conversation, or you may have to provide it again. Either way, capturing the correction is more valuable than just fixing the bug, because the same misunderstanding will produce the same kind of bug next time, and pre-empting it is cheaper than re-debugging it.
Root Cause vs Symptom Fix
The agent will default to symptom fixes if you let it. The training data is full of "make this error go away" patches. Stack Overflow answers, GitHub issues that closed with a one-line workaround, blog posts about silencing warnings. The agent has internalized this pattern: error appears, error gets caught or wrapped or ignored, error stops appearing. The error not appearing is treated as success. It is not.
The discipline is asking "what is actually wrong" before asking "how do I make this error stop." The error is information about a deeper truth. Silencing the error throws away the information without addressing the truth. Sometimes the truth is "the data is in an unexpected format" and the symptom fix is "wrap the parse in try-catch." The error stops. The data is still in the unexpected format, and now the system silently produces wrong output instead of loudly failing. That is worse. The loud failure was a feature. The silent wrong output is the bug, and you just shipped it.
Detecting a symptom fix takes practice but the tells are consistent. The fix is narrower than the actual issue. It addresses one specific input, one specific error path, one specific case. The fix adds a special case for the failing scenario. The fix wraps something in a try/catch that does not actually do anything in the catch block. The fix uses a default value that hides the underlying issue. All of these are signs that the agent has matched the surface of the problem to a pattern of "make this error stop" and is producing a fix that matches the pattern without addressing the cause.
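A hypothetical side-by-side makes the tell concrete. Suppose the real cause is that an upstream system sometimes sends European-formatted numbers:

```python
# Symptom fix: the error stops appearing, the bad data keeps flowing.
def parse_amount_symptom(raw: str) -> float:
    try:
        return float(raw)
    except ValueError:
        return 0.0  # hides the real problem and ships wrong totals downstream

# Root-cause fix: handle the actual format, and still fail loudly on anything new.
def parse_amount(raw: str) -> float:
    cleaned = raw.strip()
    if "," in cleaned and "." in cleaned:
        # European-style "1.234,56": dots are thousands separators
        cleaned = cleaned.replace(".", "").replace(",", ".")
    elif "," in cleaned:
        cleaned = cleaned.replace(",", ".")
    try:
        return float(cleaned)
    except ValueError as exc:
        raise ValueError(f"unparseable amount from upstream: {raw!r}") from exc
```

The first version makes the error disappear and quietly produces wrong output. The second addresses the format that actually arrives and still fails loudly on anything genuinely new.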
The asymmetry of the symptom fix: it costs nothing to make the error stop, and it costs the rest of the project to discover that the error was real and now hides behind the fix. Default to root cause. The cost is minutes of investigation. The benefit is shipping code that fails when it should and works when it should, instead of code that hides problems until they are expensive.
The way to push the agent toward root cause is to ask it. Specifically, the question is "what is actually wrong here, before we talk about how to fix it." The agent can usually produce a real diagnosis if asked. Without the prompt, it defaults to fix-mode and you get a patch. With the prompt, you get the diagnosis first, and then you can decide whether the right response is a patch, a refactor, or a bigger change. The prompt costs nothing. The default behavior is biased toward symptom fixes. Override the default explicitly.
The other thing that helps is requiring the agent to explain why the bug is happening before proposing the fix. If the explanation is "the function returned null and we are dereferencing null," that is a real explanation. The fix can be informed by the explanation: maybe the function should not return null, maybe the caller should handle null gracefully, maybe the upstream service is broken and we should fix that. If the explanation is "this code path sometimes fails," that is not a real explanation. That is the symptom restated. Push for the actual cause and the fix follows from it.
There is a category of fix that is correctly a symptom fix. If the cause is genuinely outside your control (a third-party service that is sometimes down, a hardware glitch, a race condition you cannot eliminate cheaply), then a defensive wrapper is the right answer. The discipline is making sure you reached for the wrapper because you understood the cause, not because the wrapper was the agent's first suggestion. Defensive code is fine when chosen. It is bad when defaulted into.
The signal that you have a real root-cause fix is that the fix changes behavior across more than just the failing input. A real fix usually generalizes. If your fix only addresses the one input that produced the error, you have probably found the symptom and not the cause. The cause-fix tends to fix a class of inputs, including some you have not tested yet. The symptom-fix tends to be narrow and brittle and to break again when a slightly different input arrives. Notice which kind you have written. The narrow fix is sometimes the right call, but it should be a deliberate call, not the default.
Logging and Observation Strategies
Logging matters more in AI-built systems than in hand-written ones. The reason is that you cannot always trust the agent's mental model of execution flow. The agent says the code does X then Y then Z. The diff confirms X, Y, and Z exist. Whether they execute in the order the agent thinks, with the values the agent expects, on the inputs the system actually receives, is a different question. The only way to know is to observe the running code. Logs are how you observe.
The pattern that pays back in the early review of AI-written code is logging entry and exit of every non-trivial function. Entry log: function name, arguments. Exit log: return value, time taken. This is overkill for production but it is exactly the right amount for the first run-through of a new piece of code. You read the logs and you can see whether the execution flow matches the agent's narrative. Misalignment shows up as functions that ran when they should not have, functions that did not run when they should have, or values that were wrong when they should have been right. All of those are bugs you would not have caught from reading the code alone.
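A sketch of that instrumentation as a decorator, using the standard logging module; the decorated function is just an example target:

```python
import functools
import logging
import time

logger = logging.getLogger(__name__)

def traced(func):
    """Log entry (name, args) and exit (return value, duration) of a function.
    Review-time instrumentation, not production logging."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        logger.debug("ENTER %s args=%r kwargs=%r", func.__name__, args, kwargs)
        start = time.perf_counter()
        result = func(*args, **kwargs)
        elapsed_ms = (time.perf_counter() - start) * 1000
        logger.debug("EXIT  %s -> %r (%.1f ms)", func.__name__, result, elapsed_ms)
        return result
    return wrapper

@traced
def normalize_email(raw: str) -> str:  # example target function
    return raw.strip().lower()
```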
Once the code is past the initial review, the logging strategy shifts. You do not want entry/exit logs in production for every function. The volume is too high and the signal-to-noise drops. Production logging should be targeted: log at the boundaries between systems (HTTP requests, database queries, queue pushes), log on errors and unusual conditions, log enough state to reproduce a bug from the logs alone. The boundaries are where things go wrong and the boundaries are where you have the cleanest signal about what crossed them.
Structured logs are non-negotiable at any reasonable scale. Logs with consistent fields (timestamp, request ID, user ID, function name, level) are queryable and aggregable. Logs with freeform strings are not. The agent will write either kind depending on what it has seen most. If your codebase has a logging library and you tell the agent to use it, it will. If you do not specify, you get whatever the default is, which is sometimes structured and sometimes not. Specifying the logging library and the conventions in the prompt is one of those small upfront investments that pays back across hundreds of debugging sessions.
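A minimal structured-logging sketch using only the standard library; real projects usually reach for a dedicated logging library, and the field names here are illustrative:

```python
import json
import logging

class JsonFormatter(logging.Formatter):
    """Every record becomes one JSON line with consistent fields, so logs stay
    grep-able and aggregable."""
    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "ts": self.formatTime(record),
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
            # request_id / user_id are attached via `extra=` at the call site
            "request_id": getattr(record, "request_id", None),
            "user_id": getattr(record, "user_id", None),
        }
        return json.dumps(payload)

handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
logging.basicConfig(level=logging.INFO, handlers=[handler])

logging.getLogger("checkout").info(
    "payment captured", extra={"request_id": "req-123", "user_id": 42}
)
```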
The tooling stack for observation has matured to the point where small projects can have production-grade observability without much effort. Sentry catches errors and groups them into manageable buckets, with stack traces and context. Datadog and New Relic are the bigger end of the spectrum, with metrics, logs, and traces in one place. Honeycomb is the option people pick when they care about high-cardinality query patterns. Grafana plus Loki plus Tempo is the open-source equivalent if you want to host it yourself. For backend services in Node or Python, the OpenTelemetry libraries are the standard for emitting traces in a vendor-neutral way.
For local debugging, console.log is still fine. Pretending it is not fine is performative. The cost of adding a print statement is zero, the value when you are stuck is high, and you remove it before committing. The mistake is leaving prints in code that ships. The agent will sometimes scatter prints in production code while debugging, and the diff scan should catch them before merge. If you are using a structured logger, use the debug or trace level for these temporary prints; that way they are easy to remove and easy to disable in production even if you forget.
Request IDs and trace IDs are the unsung heroes of distributed-system debugging. A bug that crosses three services is impossible to reconstruct from logs unless you can connect the events from each service to the same originating request. The convention is to generate a request ID at the entry point (the load balancer, the HTTP server, the queue consumer) and pass it through every downstream call as a header or context value. Every log entry includes the request ID. When something breaks, you grep for the ID across all services and you have the full story. The agent will do this if you tell it to. Without the instruction, it will sometimes produce the right pattern and sometimes not, depending on what it has seen most.
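A sketch of the propagation using a context variable and a logging filter, assuming a Python service; how the ID arrives (header, message attribute) depends on your entry point:

```python
import contextvars
import logging
import uuid

# One request ID per logical request, visible to every log line on that path.
request_id_var = contextvars.ContextVar("request_id", default="-")

class RequestIdFilter(logging.Filter):
    def filter(self, record: logging.LogRecord) -> bool:
        record.request_id = request_id_var.get()
        return True

handler = logging.StreamHandler()
handler.setFormatter(
    logging.Formatter("%(asctime)s %(levelname)s [%(request_id)s] %(message)s")
)
handler.addFilter(RequestIdFilter())
logging.basicConfig(level=logging.INFO, handlers=[handler])

def handle_request(incoming_id: str | None = None) -> None:
    # Reuse the upstream ID if the caller sent one, otherwise mint a new one.
    request_id_var.set(incoming_id or uuid.uuid4().hex[:12])
    logging.info("request started")  # every log on this path carries the same ID
```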
The honest assessment of agent-written observability is that the agent will write the calls correctly if you have an existing pattern to follow. If you are starting fresh, the agent will produce a reasonable default, but the default is not optimized for the specific shape of bugs your project will hit. It is worth investing the time to set up the observation stack manually or with strong guidance, then having the agent write code that follows the pattern. The pattern is the payoff. The individual log lines are downstream of the pattern.
The "Ask the AI to Debug Its Own Bug" Pattern (And Its Limits)
There is a pattern that works often enough to be worth the first attempt: hand the agent the error, the relevant code, and your actual goal, and ask it to debug. The agent has more context than you do about what it just wrote. The error gives it a target. The goal gives it a constraint. With those three inputs, the agent can produce a real fix on the first try in a meaningful fraction of cases. The skill is knowing which cases.
It works when the error is clear, the code is recent, and the goal is unambiguous. A NullPointerException at line 47 of a function the agent wrote five minutes ago, where the goal is "this function should never crash on valid input," is a setup the agent can handle. The agent can read line 47, see what variable was null, infer how it got null, and propose a check or a guard. The proposal is correct often enough that delegating saves time even when you have to discard a few wrong attempts.
It fails when the error is ambiguous, when the code spans more than the agent has in context, or when the goal has not been stated. An error message like "something went wrong" is useless. The agent will guess and the guess will be wrong. A bug that involves three files where the agent only has one in context will produce fixes that ignore the other two. A goal that has not been stated will lead the agent to optimize for the wrong outcome (making the error stop, which is the symptom-fix trap from earlier).
The deeper limitation is when the bug is in the agent's mental model. If the agent thinks the API works one way and it works another way, the agent's debugging will keep producing fixes that assume the wrong model. You can ask three times and get three slightly different wrong fixes. None of them work. The pattern is the same: the bug is upstream of the code, in the agent's understanding, and no fix at the leaf will repair the broken trunk.
Step 1, hand it the bug: stack trace, relevant code, your intent. Ask for a fix. Let it try. Cost: a few minutes. Benefit: the agent often produces a correct fix on the first attempt for clear bugs.
Step 2, read the explanation: the explanation is diagnostic. If the explanation matches your understanding, the agent has the right model and the fix attempt is just iteration. If the explanation diverges, you have found the real bug: the agent's model is wrong.
Step 3, correct the model: tell the agent what is actually true. "The function returns a string, not a number. The library is version 3, not version 5. The env var is named X, not Y." With the correct model, the next attempt is informed. Without the correction, the next attempt is another costume on the same wrong model.
Step 4, take over after two failures: if the second fix has the same problem as the first, the agent is not converging. More turns will not converge it. You step in, debug yourself, and produce the fix or the corrected understanding. Resume delegating after the model is repaired.
Step 5, capture the lesson: before moving on, write down the misunderstanding. Add it to the project's notes file or to the agent instruction file. The same misunderstanding will produce the same bug next time. Pre-empting it costs minutes. Re-debugging it costs hours.
The escalation ladder is the practical version of all of this. Step 1: hand it the bug. Step 2: read the explanation if step 1 fails. Step 3: correct the model and try again. Step 4: take over after two failures. Step 5: capture the lesson. The ladder is short. The discipline is following the rungs in order instead of jumping straight to "I will just fix it myself" the first time the agent's attempt fails. The first agent attempt is cheap and useful. The third agent attempt is expensive and rarely useful. The middle is where the judgment lives.
One specific tactic for step 3 is to ask the agent to write a minimal reproduction before fixing. A minimal repro forces the agent to articulate exactly what the bug is, with what input, in what context. The act of writing the repro often reveals the wrong assumption. If the repro fails to reproduce the bug, the agent's model is wrong about what the bug is, and you have learned something. If the repro reproduces the bug, the fix is now anchored to a concrete failure case rather than a vague description. Either outcome is better than skipping the repro.
When to Throw Out the AI's Code and Start Over
Sometimes the right answer is not to debug the code. The right answer is to delete it and have the agent regenerate from scratch with a better prompt. This is the option people resist because it feels like wasted work. It is not wasted. The original code was an experiment, the experiment told you what was wrong with the prompt, and starting over is cheaper than debugging if the architecture is rotten at the root.
The smell that you have hit this point is consistent. You have been debugging for thirty minutes. The agent's last three fixes have made it slightly worse, or have shifted the bug from one place to another without resolving it. Each fix takes longer to verify than the last because the code is getting more tangled. The path forward looks like more debugging and the debugging is not converging. That is the moment to step back and ask whether the architecture is wrong.
The cause, when this happens, is usually that the agent's first attempt set up a structure that does not match the problem. Maybe the function signatures are wrong. Maybe the data flow assumed something that turned out not to be true. Maybe the abstraction is at the wrong level. The leaf-level fixes do not repair these because they are upstream of the leaves. You can try to refactor your way out of it, but the refactor is harder than the regenerate. The regenerate is one prompt and a fresh start. The refactor is hours of unwinding choices that should not have been made.
The sunk cost trap in vibe coding: thirty minutes of debugging feels invested. It is not. It is signal that the prompt was wrong. Throw the code out, fix the prompt, regenerate. The thirty minutes are sunk, the next thirty minutes do not have to be. People who do not learn this pattern keep debugging for another two hours and end up with code that almost works. People who do learn it ship the right thing in twenty.
The fix is mechanical: revert the changes, rewrite the prompt with the constraint that the original prompt missed, generate fresh. The constraint is the thing you have learned during the debugging session. Maybe it is the library version. Maybe it is the actual API surface. Maybe it is the goal of the function in plain language. Whatever the missing constraint was, you now know it, and putting it in the new prompt prevents the agent from falling into the same trap.
The objection people have to this is that the regeneration "loses progress." It does not. The progress was learning what the prompt should have said. The code itself was a means to that end and is now disposable. The regenerated code, with the better prompt, will be closer to right than the patched-up version of the original. The patched version is a Frankenstein of the agent's first guess plus your fixes plus the agent's fixes for your fixes. It will not be cleaner than a fresh generation with the corrected prompt. It will only be older.
The exception is when the bug is genuinely small and isolated. If you have spent ten minutes debugging and the issue is a single wrong line, fix the line and move on. The throw-it-out heuristic is for the case where you have spent thirty minutes and the bug is structural. The signal is the duration plus the lack of convergence, not just the existence of bugs. Bugs are normal. Bugs that resist multiple fix attempts and keep mutating are the ones that say "the architecture is wrong, start over."
One specific case where regeneration is almost always the right call: the agent has produced something that mixes versions. v3 syntax in one part, v5 syntax in another. Trying to debug your way to consistency means hunting through the code for every version-specific idiom. The regenerate, with the version stated explicitly in the prompt, will produce coherent code in one pass. The cost of regenerating is low and the cost of hunting is high. The asymmetry favors regenerating.
Another case: the agent has produced code that does not match the codebase's conventions. Different naming, different file structure, different testing pattern. You can patch it to match the conventions, but the patches are surface-level and the deeper structure stays foreign. Better to regenerate with the conventions stated in the prompt or in the agent instruction file. The pattern lives in the prompt. Fixing it once in the prompt is more durable than fixing it case by case in the code.
The Patience Loop and Its Costs
Debugging AI code is sometimes a patience exercise. The agent produces a fix. You verify it. It fails. You ask again. The cycle is short, two or three minutes per iteration, and feels productive because work is happening. The trap is that productive-feeling iterations can stack into a half-day of debugging that produces nothing because the loop never converges. The patience loop has a cost and the cost is real, even though it is not visible in any single iteration.
The numbers, roughly: a fix-and-verify loop with the agent costs three to five minutes per iteration. Two iterations is fine. Five iterations is fifteen to twenty-five minutes on a problem that you might have solved in five minutes by yourself. The cost of the loop only pays off when the agent converges. When it does not, you pay the loop cost without getting the result, and you then pay the cost of solving it yourself anyway. The combined cost is worse than just solving it yourself in the first place.
The discipline is keeping count. Two failed attempts triggers the takeover. Thirty minutes triggers the regenerate question. These are not laws but they are useful defaults that prevent the loop from running indefinitely. People who do not have these triggers tend to stay in the loop for hours because each individual iteration feels close to working. The cumulative cost is invisible until the half-day is gone and the bug is still there.
The other cost of the patience loop is mental load. Each iteration is a small context switch: read the new attempt, run it, observe the failure, formulate the next prompt. The switches are exhausting in a different way than focused debugging. You finish a session feeling tired but not feeling like you accomplished much, because the actual debugging value of the session was low even though the activity was high. Recognize this state when you are in it. The fix is to step out of the loop, take a break, and come back with a clearer head and a fresh decision about whether to continue, take over, or regenerate.
The honest version is that some days the agent is on its game and the loop converges quickly. Other days it does not. Both are fine. The discipline is recognizing which kind of day you are having and adjusting. If you have already had two non-converging loops in a session, the third is unlikely to be different. Switch modes. Either take more of the work yourself, or rebuild the prompts and instructions, or stop for the day. Pushing through a non-converging session usually produces lower-quality output than coming back fresh tomorrow.
Production Bugs vs Development Bugs
Bugs in development and bugs in production have different costs and require different responses. Both can come from AI-generated code. The shape of how you debug each is different because the constraints are different.
Development bugs are cheap. You can iterate freely, regenerate code, throw out approaches. The local environment has full visibility. You can add prints, run debuggers, inspect state. The cost of a wrong iteration is seconds. The discipline is using that freedom: try the fix, observe, iterate. Do not be precious about the code. It can be regenerated. The objective is reaching the right output, not preserving the path you took to get there.
Production bugs are expensive. Users are affected. The cost of a wrong fix is real. You cannot just throw out the code and regenerate without a deploy that affects everyone. Every iteration has overhead: write the fix, get it through code review, get it through CI, deploy, observe. A loop that takes five minutes in development takes thirty minutes or an hour in production. The math changes. You spend more time on the diagnosis before attempting the fix, because attempts are expensive.
The diagnostic moves shift accordingly. In production, the first move is to capture state. Logs, traces, metrics, customer reports. Whatever you can grab from the running system before it changes. The state is the evidence and you cannot recreate it after the fact if conditions change. In development, you can re-run with new logging. In production, the bug may be intermittent and you may only have one shot at observing it before it goes away.
The agent's role shifts too. In development, the agent is a fast collaborator on iterations. In production, the agent is a sanity check and a generator of hypotheses, not a deployer of fixes. You ask it to read the logs and tell you what it sees. You ask it to look at the code and propose what might be wrong. You verify those hypotheses against the production data. You write the fix. You deploy with care. The agent is in the loop but not in the control seat. The control seat is yours because the cost of a wrong call is too high to delegate.
One specific pattern that helps with production debugging: the agent is excellent at writing one-off scripts to query logs, parse traces, or aggregate metrics. The script does not run in production. It runs locally against production data you have exported. The script is cheap to write, cheap to run, and produces structured output that helps you reason about the bug. The agent's strength is writing this kind of glue code quickly. The combination of "agent writes the analysis script, you read the output, you decide the fix" is high-payoff and low-risk because the agent is not touching production directly.
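A sketch of that kind of script, run locally against an exported JSON-lines log file; the file format and field names are assumptions to adjust to your own export:

```python
# One-off analysis script: runs locally against exported log data, never
# against production. Field names ("level", "duration_ms", "request_id") are
# hypothetical.
import json
import sys
from collections import Counter

def summarize(log_path: str) -> None:
    errors = Counter()
    slow_requests = []
    with open(log_path) as fh:
        for line in fh:
            event = json.loads(line)
            if event.get("level") == "ERROR":
                errors[event.get("message", "<no message>")] += 1
            if event.get("duration_ms", 0) > 2000:
                slow_requests.append(event.get("request_id"))
    for message, count in errors.most_common(10):
        print(f"{count:6d}  {message}")
    print(f"slow requests (>2s): {len(slow_requests)}")

if __name__ == "__main__":
    summarize(sys.argv[1])
```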
Tests as Debugging Tools
Tests are not just verification. They are a debugging tool with specific properties that hand-tracing does not have. The agent can write tests faster than it can debug, and the tests it writes can illuminate where the bug is even when the agent cannot fix it directly. Using tests this way is a technique that pays back across hundreds of bugs.
The pattern is: when you have a bug, ask the agent to write a failing test that captures the bug. The test should be minimal: the smallest input that produces the wrong output, asserted against the expected output. The agent will produce this faster than you can. Once the test exists, the bug has a concrete shape. You can run the test repeatedly while iterating on fixes, and you have certainty that the fix worked because the test that was failing now passes.
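A sketch of what that minimal test looks like, using pytest and a hypothetical module and parsing bug:

```python
from myproject.parsing import parse_amount  # hypothetical module and function

def test_european_formatted_amount_parses():
    # Smallest input that reproduces the bug: upstream sometimes sends
    # European-formatted amounts and the current code returns 0.0 for them.
    # The test fails until the cause is fixed, then stays as a regression guard.
    assert parse_amount("1.234,56") == 1234.56
```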
The advantage over ad-hoc debugging is that the test is durable. After the bug is fixed, the test stays in the suite. It guards against regressions. The agent has produced something that has value beyond the immediate fix. Compare this with print-debugging, where the prints are removed and there is no lasting artifact. Test-first debugging produces a fix and a regression guard in the same time it would have taken to produce just a fix without the test.
The agent will sometimes resist this pattern by jumping straight to a fix without writing a test. Push back. Ask for the test first. The reason is that the test forces the agent to articulate exactly what the bug is, in code, in a way that can be verified. Without the test, the agent's understanding of the bug is in natural language and may diverge from yours. With the test, the bug is pinned. The fix can then be verified against a concrete target instead of a vague description.
The other use of tests as debugging tools is bisection. If you have a bug and you do not know which recent change introduced it, use git bisect plus the failing test. The agent can drive the bisect: at each step, run the test, mark good or bad, move on. The agent is faster at this than you are because it does not get bored. You verify the result and pick up from there. The combination of bisect plus failing test plus agent driving is one of the best ways to find regressions in a codebase that has many recent changes.
Specific Tools and Where They Help
The tooling for debugging AI code overlaps with the tooling for debugging in general, but some tools are particularly useful for the AI-specific failure modes. Knowing which tool catches which class of bug saves time on debugging because you reach for the right one without having to triage first.
For typed languages, the type checker is the first line of defense against hallucinated APIs and wrong-but-plausible imports. TypeScript will catch most of these at compile time. Mypy and Pyright catch them in Python, with the caveat that you have to actually have type annotations. If you are working in a typed codebase and the agent has generated code that fails type-check, do not silence the error. The error is information. The agent has called something that does not exist or imported from a path that does not match the type definitions. Fix it at the type level.
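A small illustration of the difference annotations make, with names that are purely illustrative:

```python
from dataclasses import dataclass

@dataclass
class User:
    id: int
    email: str

def send_email(address: str) -> None:
    ...  # stand-in for the real delivery code

def notify(user: User) -> None:
    # The agent's first draft called user.primary_email, a field several similar
    # models have but this one does not. With the annotation on `user`, mypy and
    # pyright reject that draft at check time with something like:
    #     error: "User" has no attribute "primary_email"
    send_email(user.email)
```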
For untyped languages, the linter and the runtime are your tools. Pylint, ESLint, Rubocop. They will catch some of the AI-specific patterns: unused imports, undefined variables, calls to functions that the linter cannot resolve. They miss the truly AI-specific patterns where the call resolves to a real function but does not do what the agent thought it would. For those, the runtime is your friend. Run the code on actual data. The bugs surface fast.
For library mismatch issues, the package manager is the tool. `pip show`, `npm ls`, `cargo tree`. They tell you exactly what version of what library is installed. The agent's bug often manifests as code that assumes a different version. Confirming the version is the cheapest first step in those bugs. Once you have the version, the agent can be told it explicitly and produce code that matches.
For environment and config bugs, the shell and the running process are your tools. `env`, `printenv`, `ps`, the actual logs from the actual running process. The agent's invented config will not match what the process actually has. Comparing the two reveals the gap immediately. Most config bugs in AI code are five-second bugs once you know to compare the assumed config to the real one. They become hour-long bugs when you assume the agent got the config name right and start debugging downstream of it.
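A five-second version of that comparison, scripted rather than eyeballed; the assumed names are whatever the generated code actually reads:

```python
import os

# Collect these by grepping the new code for os.environ / os.getenv.
# Names here are illustrative.
ASSUMED = {"DB_URL", "POSTGRES_HOST", "API_TOKEN"}

actual = set(os.environ)
missing = sorted(ASSUMED - actual)
print("assumed by the code but not set:", missing)

# The value often exists under a slightly different name -- list the candidates.
for name in missing:
    prefix = name.split("_")[0]
    candidates = sorted(v for v in actual if prefix in v)
    if candidates:
        print(f"  {name}: perhaps one of {candidates}")
```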
For runtime mysteries that compile and run but produce wrong output, the debugger is the tool. Step through the code. Watch the variables. See where the values diverge from your expectations. The agent can sometimes help interpret what you see, but it cannot do the stepping for you. The debugger is irreducibly a human-driven tool. The agent is a collaborator while you drive.
Building a Debugging Library Over Time
Every AI-bug debugging session leaves a residue. The bug existed for a reason. The reason was usually that the agent's training data did not match your specific reality on some axis. Capturing that mismatch is the difference between debugging the same class of bug over and over and debugging each class once. The capture is how you build compounding speed across hundreds of bugs.
The capture takes a specific form. Project notes file or agent instruction file: "When working with library X, always specify version Y." "Our env var for the database is named DATABASE_URL, not DB_URL." "The auth middleware expects the token in header X, not Authorization." Each entry is a thirty-second writeup that prevents future bugs in the same shape. The notes accumulate over weeks and months. After a quarter, you have a knowledge base that captures the specific reality of your project, and the agent stops making the same mistakes because the constraints are in its context.
The discipline is writing the note in the moment, not later. Later does not happen. The bug is solved, the relief of solving it is real, and the notes file feels like overhead. It is not. It is the highest-payoff minute you will spend that day, because the next time the same shape of bug appears, the note will save you the time you just spent debugging. Compounded across the project, this is the difference between a codebase that gets harder to work in over time and one that gets easier.
The format of the notes can be informal. A bullet list in a markdown file is fine. A structured agent instruction file is better if you want the agent to read it on every turn. The Claude Code convention is the CLAUDE.md file at the project root, which the agent reads automatically. Other agents have similar conventions. Pick one and use it. The act of writing forces the lesson into a durable form. The form does not have to be perfect. It has to exist.
One pattern worth specific mention: capture the prompts that worked. When you finally got the agent to produce correct code on a tricky task, save the prompt. Not the code, the prompt. The next time you face a similar task, you start from the working prompt and adjust, rather than starting from scratch. Prompts that work are intellectual property. They are the result of iteration, and the cost of recovering them by re-iterating is much higher than the cost of saving them once.
The notes file is not a journal. It is not a place to write narratives or feelings. It is a place to write specific constraints, library versions, naming conventions, and gotchas that the agent has hit. Keep it terse. Make it queryable. The value is in being able to grep for "auth" or "database" and get the relevant constraints in five seconds. A long-form journal cannot deliver that. A bullet list of constraints can.
Closing
Debugging AI-generated code is not harder than debugging human code. It is a different shape of work. The tells are different. The recovery patterns are different. The places to look first are different. Developers who learn the new shape ship faster than the ones who try to debug AI output the same way they debugged junior pull requests in 2018. The new shape is learnable. It just takes attention to the patterns that are specific to AI failure modes, and the willingness to update reflexes that worked on a different generation of bugs.
The high-level moves are simple to state. Recognize that AI bugs cluster around statistical-median patterns: hallucinated APIs, wrong-but-plausible imports, mixed-version syntax, invented config, silent type coercion. Diagnose with three questions: did this run at all, does the test pass for the right reason, does the diff match the narrative. Delegate to the agent on clear bugs with clear stack traces, take over when the agent's mental model is the bug. Default to root cause over symptom fixes. Use logs, types, and tests as the verification layer the agent cannot provide. Capture the lessons in a notes file or instruction file so the same bug does not have to be debugged twice.
None of these are hard individually. The discipline is doing them consistently across hundreds of debugging sessions, when the temptation to skip the verification step or accept the symptom fix is highest. The discipline is the thing that turns the productivity claims of agentic dev from a marketing slogan into actual hours saved. Without it, you ship bugs faster than you used to. With it, you ship working code at a rate that makes the old workflow look slow.
The closing thought is that debugging is not a punishment for using AI. It is part of the work, and it is changing shape because the work is changing shape. The developers who are productive in this era are the ones who treat debugging as a first-class skill that gets sharper with practice, not as an annoyance to be minimized. Sharpen the skill. The bugs are getting more interesting and the developers who can read them are getting more valuable. That is the real shift, and it is happening whether anyone names it or not.
