AI builds. You decide. The discipline of vibe coding is staying in the decision seat without losing speed, and it is harder than it sounds because the temptation to rubber-stamp accumulates with every successful turn. The first ten approvals feel earned. The eleventh feels routine. By the fiftieth, you are clicking accept while looking at a different tab, and the agent has just renamed three exported functions in a file you did not open. The bug ships on Friday. You find it on Monday. You are not lazy. You are not careless. You are human, and humans habituate. The job of staying in the loop is the job of fighting that habituation while still moving fast enough to justify having an agent in the first place. This page is about that fight.
The phrase "human in the loop" gets used carelessly. It shows up in product brochures and risk-management slide decks and means almost nothing in those contexts. For a developer working with Claude Code, Cursor, Aider, or any other agentic tool, it has a specific meaning: you are the loop. You are not an observer of the loop. You are not a backstop for the loop. You are the part of the loop that decides what gets done, what gets shipped, and what gets rejected. Without you in that seat, the agent is a fast typist with no taste. With you in that seat but checked out, the agent is a fast typist whose mistakes you will own. The discipline is everything that happens between those two failure modes.
What "In The Loop" Actually Means
The first thing to clear up is what the discipline is not. It is not "review every line." If you review every line, you are typing through a slower interface, and you have given up the productivity gain that justified using an agent in the first place. A 600-line refactor takes ten minutes for an agent and ninety minutes for you to read carefully. If you read it carefully every time, you are paying ninety minutes for a ten-minute job. Nobody works like that for long. They give up and either write everything themselves or check out and accept everything. Both are failure modes.
It is also not "trust the agent and look later." The "look later" review never happens with the same depth as a review at the moment of generation, because by the time you look later, the bug has compounded with three other commits that built on the broken code. Untangling a four-commit-deep error costs more than catching the original. And "look later" is the road to never looking, because the next thing on the queue always feels more urgent than checking last hour's output.
The middle path is the only path that works long-term. You review the right things at the right time, with the right depth, and you have a model in your head of what counts as "right" for each axis. The model is not perfect on day one. It builds with experience. The discipline is committing to building it deliberately rather than letting it form by accident.
The supervisor model is the cleanest way to think about it. You decide what to build. The agent figures out how. You check the work that matters. The agent does not care which functions get extracted or which file holds the helper. You care about what ships and what does not. Those are different jobs and they live with different parties. When you blur the line, the model breaks. You start telling the agent how to write the helper, the agent starts deciding what ships, and the loop has lost both its strengths.
The honest version of this discipline acknowledges that some review is going to be shallow and some is going to be deep, and the trick is making sure the deep review lands on the changes that matter. A formatting pass on three files does not need a deep read. A migration that touches the auth flow does. The depth of review should match the blast radius of the change, not the size of the diff. Small diffs in load-bearing code deserve more attention than big diffs in scaffolding.
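To make the blast-radius rule concrete, here is a minimal sketch of that lookup in Python. The path patterns and depth labels are placeholders, not recommendations; the point is that the riskiest file in the diff, not the line count, sets the review depth.

```python
from fnmatch import fnmatch

# Hypothetical path patterns -- replace with your own load-bearing areas.
# The deepest-risk match wins; diff size never enters the decision.
BLAST_RADIUS = [
    ("migrations/*", "deep"),    # schema changes
    ("src/auth/*", "deep"),      # auth flow
    ("src/payments/*", "deep"),
    ("src/*", "scan"),           # ordinary application code
    ("tests/*", "skim"),
    ("*.md", "skim"),            # docs and scaffolding
]

def review_depth(changed_files: list[str]) -> str:
    """Return the deepest review level any changed file demands."""
    order = {"skim": 0, "scan": 1, "deep": 2}
    depth = "skim"
    for path in changed_files:
        for pattern, level in BLAST_RADIUS:
            if fnmatch(path, pattern):      # fnmatch's * also crosses slashes
                depth = max(depth, level, key=order.__getitem__)
                break                       # first matching pattern wins
    return depth

print(review_depth(["src/auth/tokens.py", "tests/test_tokens.py"]))  # -> deep
```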
The Supervisor Stance vs The Operator Stance
There are two stances you can take when working with an agent, and they produce wildly different outcomes. The supervisor stance is the productive one. The operator stance is the trap most people fall into when they first try agentic dev and then conclude that "AI does not really help."
In the supervisor stance, you set objectives at the level of features, behavior, and constraints. The agent figures out the implementation. You review outputs at decision points and intervene when judgment is needed. Your time goes to what matters: the goal, the architecture, the edge cases, the production risk. The agent's time goes to typing, scaffolding, refactoring, and the mechanical work humans do badly anyway.
In the operator stance, you issue keystroke-level instructions. "Add a try-catch here." "Rename this variable." "Move this function up three lines." The agent becomes a fancy autocomplete. You are still doing every micro-decision yourself, just with a slower interface. The productivity gain disappears. You feel busy because you are typing, but you are not building any faster than you were a year ago.
Vibe coding requires the supervisor stance. The whole reason to delegate to an agent is that the agent handles execution faster and more thoroughly than you can. If you do not let it execute, you are paying for the agent without getting the benefit. People fall into operator mode for understandable reasons: they got burned once, they want control back, they do not trust the agent yet. Those are real concerns and they are not fixed by going operator. They are fixed by building trust calibration over time, by writing better tests, by using better tools, and by sharpening your review habits. Going operator is the loud answer that does not actually solve the problem.
The supervisor stance has its own failure mode, and it is the inverse one. If supervision becomes lax, you stop reviewing real outputs and just nod along. You give up the operator's keystroke-level control and replace it with nothing. That is worse than operator mode because at least the operator is reading every line they type. The supervisor who has stopped reviewing is shipping unreviewed code at agent speed. That is how a Friday afternoon and a Monday morning become a war room.
Stay supervisor. Stay engaged. Trust at the level of "the agent can write this kind of function correctly" but verify at the level of "this specific function does the thing the spec asked for." Those are not the same thing and confusing them is what produces the bad version of either stance.
Reading AI Output Critically
The single most important habit in agentic dev is the diff scan. Every multi-file change gets a diff review before you commit. Every one. The diff scan is short. It is not a deep code review. It is a quick pass to confirm three things: the change does what was asked, it did not break anything else, and it did not modify files outside scope. If those three things hold, the diff scan ends and you move on. If any of them is in doubt, you slow down and read.
The order matters. Most humans, when they start reading a diff, jump to the line-level changes. That is the wrong order. Line-level changes are the easiest thing for an agent to get right. Structural changes are where the agent is more likely to do something surprising. Scope changes are where you discover that the agent decided to refactor a file you had no plans to touch. Read scope first, structure second, syntax third. That inverts the natural human reading order, which is why you have to do it deliberately.
Scope first: which files changed? Are they all files you expected to change? An agent that touched three files when you asked for one is either being thorough or going off the rails. Find out which before reading further.
Structure second: new functions, deleted functions, signature changes, exports added or removed. These shape how the rest of the code interacts with this change. A wrong structural choice produces ten downstream issues you will not see in the line diffs.
Syntax third: you do not need to read every line. You need to read the parts that decide outcomes: branching, error handling, boundary conditions, anything touching state. Skip pure formatting and obvious mechanical changes.
Then the tests: tests are not a substitute for review, but they are a backstop. Green tests on a structural change you did not understand do not mean the change is correct. Red tests on a change you thought was simple mean stop and investigate.
Then the decision: commit, ask for a fix, or roll back. The choice looks binary: this change makes the code better or it does not. But "I cannot tell" is a third outcome, and the answer there is "ask for a different version," not "commit anyway and hope."
The "scan high to low" pattern feels unnatural at first because it asks you to skip details. People want to feel thorough by reading every line. Reading every line is not thoroughness. It is performative reading that produces fatigue and missed structural problems. The scan-high-to-low pattern is harder to learn but it catches the issues that actually matter, because the issues that matter are usually structural and most reviewers miss structural issues by getting lost in line-level noise.
Tools like Claude Code make this easier because they show you a clean diff per turn and you can ask follow-up questions about specific files. Cursor's inline-diff view encourages line-by-line, which is fine for small changes but trains the wrong habit on big ones. Aider falls somewhere in between depending on configuration. The tool matters less than the habit. Build the habit, then pick the tool that supports it.
When to Interrupt vs When to Let It Run
Interruption is a real cost. If you stop the agent mid-turn, you lose the partial work, you have to re-explain context, and you break the rhythm of the session. If you do not stop the agent when it is heading the wrong way, you wait while it produces output you will throw away. Both costs are real. The trick is knowing which is happening at the moment you are tempted to interrupt.
The interrupt triggers are short and specific. Stop the agent when it has misread the actual problem and is solving the wrong thing. Stop when it is about to do something destructive: delete a file you care about, drop a database table, run a migration in the wrong environment. Stop when it has gone in circles for two turns and the next turn is not going to break the loop. Those are the only three categories that justify a hard interrupt. Everything else is "give it one more turn and see."
The cost asymmetry favors patience. A wrong interrupt loses thirty seconds. A missed interrupt can lose an hour, because the agent will produce output you have to either review carefully or throw away. When in doubt, let it run one more turn and intervene only if the turn confirms your suspicion.
The "give it one more turn" pattern works because the agent's plan often becomes clearer after one more action. You think it is going the wrong way. You watch one more turn. Either it confirms your suspicion and you intervene with confidence, or it does something that makes sense and you saved yourself a wrong interrupt. The first turn after your suspicion is cheap. The third turn after your suspicion is expensive. Set the rule at "give it one more, then decide."
Conversely, the let-it-run triggers are: the agent is on a clear path that you can see; the next thirty seconds of work are mechanical and you would type them yourself if you were doing it; you are not a faster typist than the agent and intervening means you will produce the same output yourself, just slower. If those three are true, stay out of the way. The agent is doing what you would do, faster. Interrupting just makes you the bottleneck.
People interrupt too often when they are nervous and not often enough when they are tired. Notice which mode you are in. Nervous-mode interruption looks like rejecting four turns in a row because each one is "not quite right" when actually each one was fine and you were chasing perfection. Tired-mode under-interruption looks like accepting six turns and only later realizing the second one was wrong and the next four built on it. Both are correctable once you see the pattern in yourself.
Trust Calibration Over Time
Trust in an agent is earned, not assumed. You do not start at full trust and you do not stay at zero trust. You build a map of what the agent gets right and what it gets wrong, and you use that map to allocate your review attention. The map is personal because your codebase is different from someone else's, your conventions are different, and the agent's failure modes will land differently on your specific project.
Day one is full review. Every change gets a careful read. You are not yet calibrated, and you have to learn what the agent does well before you can skip parts of the read. This is slow and it should be slow. The investment of slowness in week one buys you speed in month two, because you will know which changes deserve fast review and which need slow review.
By week one, you have started identifying low-risk categories. File renames, simple refactors that the agent has done correctly five times in a row, formatting passes, import reorganization. These are categories where the agent rarely fails, and where the failure mode is loud (compile error, broken test) rather than silent. You can shift these to skim review without losing much. You still glance at the diff. You do not study it.
By month one, you have a real trust map. You know that the agent is reliable on isolated function changes in pure modules. You know it sometimes gets confused on cross-cutting changes that span multiple files with shared state. You know it tends to over-extract helpers when it should leave code inline. You know it occasionally invents API endpoints that do not exist if you do not give it real type information. You have specific, concrete failure modes you watch for, and your review attention concentrates on those.
By month three, deep review is rare and targeted. Most changes get a diff scan and a test run. The deep reviews land on critical paths: auth, payments, anything that touches the database schema, anything that goes to production without a feature flag. The trust map has matured to the point where you can spend most of your review budget on the small fraction of changes that actually warrant it.
The temptation, once you have a trust map, is to forget that it was earned. You start treating it as a default. That is fine until the agent gets an update, your codebase shifts, or you start working in a domain where the agent's strengths are different. The trust map needs maintenance. Periodically pick a "trusted" category and review it deeply. Confirm the trust is still warranted. If it is, great. If not, recalibrate. This costs an hour every few weeks and saves you from the failure mode where your trust map is out of date and you do not know it.
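One way to keep the trust map honest is to write it down with a last-verified date attached, so the periodic recheck is a query rather than a memory. A minimal sketch, with made-up categories and dates:

```python
from datetime import date

# A trust map made concrete: change categories you have actually observed,
# the review depth each currently earns, and when you last verified the trust.
# Every entry here is an example -- your own map will look different.
TRUST_MAP = {
    "rename / import reorg":        {"depth": "skim", "last_verified": date(2024, 5, 3)},
    "isolated pure-function edit":  {"depth": "scan", "last_verified": date(2024, 5, 10)},
    "cross-file shared-state work": {"depth": "deep", "last_verified": None},
    "auth / schema / payments":     {"depth": "deep", "last_verified": None},
}

STALE_AFTER_DAYS = 21  # recheck a "trusted" category every few weeks

def stale_categories(today: date = date.today()) -> list[str]:
    """Categories whose earned trust is overdue for a deep re-review."""
    out = []
    for name, entry in TRUST_MAP.items():
        verified = entry["last_verified"]
        if entry["depth"] != "deep" and (verified is None or (today - verified).days > STALE_AFTER_DAYS):
            out.append(name)
    return out
```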
The Danger of Approval Drift
Approval drift is the silent killer of agentic dev. It happens slowly and you do not notice until something breaks. The pattern is simple: you approve a turn, it is fine. You approve another turn, it is fine. You approve fifty turns, all fine. By the fifty-first, you are clicking approve while looking at Slack, and the fifty-first turn was the one with the bug.
The problem is human pattern recognition. Humans tune out repeated stimuli. It is not laziness, it is biology. The seventh time you see something that looks the same, your brain processes it less carefully than the first. This is called habituation and it is the same mechanism that lets you ignore the hum of an air conditioner. The cost of habituation in agentic dev is that the bug that ships looks exactly like the fifty turns that came before, except one detail is wrong, and that detail is what breaks the production deploy.
With an engaged reviewer, each turn gets a fresh read. The reviewer notices anomalies because they are still tuned to anomalies. The diff scan is short but real. The sense of "this looks normal" is a positive judgment, not the absence of a judgment. This is what catches the subtle issues, because the reviewer is actually looking.
With a drifted reviewer, each turn gets a glance, and "looks normal" means "I have not detected any obvious problem in the half second I spent on this." The reviewer is no longer tuned to anomalies because every turn looks like the previous one. The fifty-first turn ships unreviewed and the bug compounds before anyone notices.
Defenses against approval drift exist and they work, but you have to actually use them. The first defense is forced full reviews on big changes. Any change that touches more than ten files, or any change in a critical path, gets a deliberate slow review. You do not skim it. You sit with it for five minutes minimum. This breaks the rhythm and re-engages the reviewing brain.
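Here is a sketch of that first defense as a git pre-commit hook, assuming a ten-file threshold and a couple of example critical-path prefixes. The SLOW_REVIEW_DONE variable is an illustrative convention, not a standard.

```python
#!/usr/bin/env python3
"""Pre-commit sketch: force a deliberate pause on big or critical-path changes.

Save as .git/hooks/pre-commit (executable). The ten-file threshold and the
critical-path prefixes below are examples, not recommendations.
"""
import os
import subprocess
import sys

CRITICAL_PREFIXES = ("src/auth/", "migrations/")  # adjust to your repo
MAX_FILES_WITHOUT_PAUSE = 10

staged = subprocess.run(
    ["git", "diff", "--cached", "--name-only"],
    capture_output=True, text=True, check=True,
).stdout.splitlines()

critical = [f for f in staged if f.startswith(CRITICAL_PREFIXES)]

if len(staged) > MAX_FILES_WITHOUT_PAUSE or critical:
    print(f"{len(staged)} staged files; critical paths touched: {critical or 'none'}")
    print("This change gets a slow review. Re-commit with SLOW_REVIEW_DONE=1 once you have read it.")
    if os.environ.get("SLOW_REVIEW_DONE") != "1":
        sys.exit(1)  # block the commit until the human confirms the slow review
```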
The second defense is regular deep-dive sessions. Once a week or once a sprint, you pick a recent commit and review it as if you had not seen it before. This is slow. It feels redundant. It catches issues that drift would have hidden, and more importantly, it recalibrates your sense of what "looks normal" should look like. You are training your pattern recognition by giving it fresh inputs.
The third defense is automated tests as a backstop, not a substitute. Tests do not replace review. They catch a different category of issues. A change can pass tests and still be wrong if the test coverage does not extend to the specific behavior that broke. Tests catch regressions in tested behavior. They do not catch architecture decisions that are bad but technically correct. Both kinds of failure ship with the same green-light signal, and only review catches the architectural one.
The fourth defense is pairing or rotation. If you have collaborators, rotate who does the diff scan on critical paths. A fresh pair of eyes on a familiar codebase catches things the regular reviewer has habituated to. If you are solo, the rotation is between fast and slow review modes on yourself. Schedule the slow modes. They will not happen by accident.
Managing the Loop's Tempo
The loop has a tempo, and the tempo matters. Some sessions are fast: low-stakes changes, small diffs, quick reviews, ship and move on. Other sessions are slow: structural change, careful read, multiple passes, talk it through before committing. Knowing which mode you are in matters more than the speed itself, because if you run a slow-mode session at fast-mode tempo, you ship bugs, and if you run a fast-mode session at slow-mode tempo, you waste hours on changes that did not need them.
Fast tempo is for changes you would normally write yourself in five minutes or less. The agent does it in thirty seconds. You diff-scan in another thirty seconds. Total time: a minute. Multiply that across a session and you can ship dozens of small improvements in an hour. This is where agents are pure win. The review is shallow because the change is shallow. There is no architectural decision being made. You are not hand-wringing about whether the function name is right. You are moving.
Slow tempo is for changes that affect how the system fits together. New abstractions. Schema changes. Anything in the auth path. Anything that crosses module boundaries. These get the careful read. You ask the agent to explain its choice. You read the explanation. You read the diff. You think about edge cases. You may go through three iterations before you accept. This is how you keep the system coherent. If you run slow-tempo work at fast tempo, you accumulate small bad architectural decisions that compound over months.
The mistake is not picking a tempo. People drift into one tempo for the whole session and stay there even when the work shifts. They start fast on a feature, hit a tricky boundary, and keep moving fast through the boundary because they were already moving fast. Two days later they are unwinding the bad decision they made at speed. The fix is to actively pick the tempo at each significant transition. When the work changes shape, pause and ask "is this still fast-mode work?" If not, slow down deliberately.
Energy levels matter. You will not do good slow-mode review when you are tired. The pomodoro-like rhythm of agentic dev should put the slow work in your high-energy windows, when you can focus and absorb structural changes. Save fast-mode work for lower-energy hours when shallow attention is enough. This is not a productivity hack. It is recognition that review quality is energy-dependent, and the cost of bad review at the end of a long day shows up as bugs in production three weeks later.
One specific habit that helps: write down at the start of a session what kind of work you are doing. "Today I am wiring up a new endpoint, slow-mode for the auth piece, fast-mode for the response shape." Putting it in writing forces the calibration before you are mid-flight. It also creates a record you can use later to ask "did I actually keep the auth piece in slow mode, or did I drift into fast mode when it got hard?" The answer to that question is usually instructive.
When the Human Becomes the Bottleneck
At some point, the agent gets faster than you can review. This is not a hypothetical. It happens to anyone who works seriously with agentic dev for a few months. The agent can produce ten reviewable changes per hour. You can deeply review three. The other seven sit in a queue, get shallow review, or get accepted on faith. None of those options are great. All of them are real.
The patterns that handle this scale are not new, they are just newly relevant. Batch review is the first one. Instead of reviewing each change as it lands, you queue changes from a session and review them in a block. The block review is faster per change because you are in review-mode and not switching contexts. The downside is that you introduce review latency into the loop, so the agent waits for you between batches. That delay is fine if you set up the work so the agent has the next thing queued and is not blocked by your review.
Parallel agents are the second pattern. If you can decompose a project into independent tasks, run multiple agents on those tasks in parallel and review them when each completes. This works well for things like generating tests across modules, or doing parallel refactors in unrelated parts of the codebase. It does not work for things that share state, because parallel agents will collide on state in ways that are exhausting to untangle. The discipline is recognizing which parts of the project decompose and which do not.
The third pattern is letting CI/CD catch what you skip. This is the controversial one because it sounds like "let production catch the bugs." It is not that. It is "let your test suite, your linter, your type checker, and your staging environment catch what shallow review would not." The cost of pushing more verification into automation is upfront: you have to invest in better tests, better types, better observability, faster rollback. The benefit is that you can shallow-review more changes without the bugs ending up in front of users. This is a real tradeoff with real numbers behind it. It is how teams scale beyond the single-reviewer bottleneck.
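Here is a minimal sketch of such a gate, assuming a Python stack with pytest, mypy, and ruff; swap in whatever your project actually runs. The point is that a change only earns shallow human review after the machines have done their part.

```python
import subprocess
import sys

# The gate a change must pass before shallow review is allowed to count.
# Tool choices here (ruff, mypy, pytest) are assumptions -- use your own stack.
CHECKS = [
    ["ruff", "check", "."],   # lint
    ["mypy", "src"],          # types catch a class of silent agent errors
    ["pytest", "-q"],         # behavior
]

def gate() -> int:
    for cmd in CHECKS:
        print(f"running: {' '.join(cmd)}")
        if subprocess.run(cmd).returncode != 0:
            print("gate failed -- this change gets a human deep review, not a skim")
            return 1
    return 0

if __name__ == "__main__":
    sys.exit(gate())
```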
The honest admission is that at scale, you trade some review depth for throughput, and you architect for that tradeoff. Good tests are not optional in a high-throughput agentic workflow. Good observability is not optional. Fast rollback is not optional. If your only safety net is the human reviewer, the human reviewer is going to be the limit on throughput, and at some point you will either hit that limit and stop, or you will skip review and start shipping bugs. The third option is investing in the safety net so the human reviewer can hand off some of the verification work to the system.
This is the moment a lot of solo devs underinvest. They see the agent producing fast and they think "great, more output." They do not invest in the tests because the tests feel like overhead. Three months later they have shipped a hundred changes with shallow review and the codebase is a minefield. The shape of the bug they will hit is "something subtle in module X that has been broken for two months and only manifests on certain inputs." That bug exists because the test coverage on module X did not extend to those inputs and the human review was shallow. The fix at that point is more expensive than the test would have been at the start.
The flip side is that teams that do invest in tests and observability hit a different ceiling. They can run agents at high throughput, the human stays in the loop at the right altitude, and the system catches what the human does not. Their bug rate is lower than the operator-mode dev who is reviewing every line, because the system is verifying things the human would have missed. This is the version of agentic dev that actually delivers the productivity claim. It is not free. It costs upfront work on the safety net. The work pays back across hundreds of changes.
Specific Habits Worth Building
Beyond the high-level disciplines, there are specific habits that pay back. Some of them feel small. They compound into the difference between a productive agentic workflow and one that quietly produces bugs.
First habit: read the agent's plan before it starts. Most agents will produce a plan or summary at the start of a multi-step task. Read it. If the plan is wrong, fix it now, before any code is written. The plan is the cheapest place to intervene. If you ignore the plan and only review the code, you are reviewing the symptom of a wrong plan rather than the plan itself, and you will end up rolling back hours of work.
Second habit: ask "what did you change and why" after non-trivial turns. The agent's explanation is a window into whether it actually understood what it was doing. If the explanation reads like a list of what changed without any why, the agent did not have a clear model. If the explanation matches what you expected, you can probably trust the change. If the explanation has a different shape than what you expected, that is a signal to slow down and read carefully.
Third habit: keep a notes file of "things the agent has gotten wrong." Personal to your project, your patterns, your codebase. After a few weeks you have a list of specific failure modes to watch for. This is the trust map made concrete. When you see one of those patterns starting, you intervene early instead of late.
Fourth habit: never commit something you do not understand. If you cannot explain what the agent did to a colleague, do not commit it. Ask the agent to explain. Re-read the change. Iterate until you can explain it. The exception is genuinely mechanical work where the explanation is "we renamed X to Y across these files" and the diff confirms exactly that. The rule is for substantive changes where understanding matters.
Fifth habit: separate "this works" from "this is right." Tests passing means it works. The architecture being correct means it is right. Both can be true. Either can be true alone. Tests-pass-and-architecture-is-wrong is the most common failure mode in agentic dev because the agent optimizes for tests passing without always optimizing for the architecture being right. Notice when you are about to commit something that works but does not feel right. The feeling is information.
Sixth habit: do not let the agent merge to main. The merge is the act that turns a change into shared reality. Even when you are working solo, manual merge gives you a final review point. When you are working with a team, the merge is the contract with everyone else. Setting the agent up to auto-merge is giving up the last review point you have. Do not.
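If you want that rule enforced rather than remembered, a git pre-push hook can refuse pushes to main until a human says so. The sketch below uses a hypothetical HUMAN_MERGE variable as the confirmation; branch protection on your git host works just as well.

```python
#!/usr/bin/env python3
"""Pre-push sketch: keep the merge-to-main step in human hands.

Save as .git/hooks/pre-push (executable). Git feeds it one line per ref being
pushed: "<local ref> <local sha> <remote ref> <remote sha>".
HUMAN_MERGE is an illustrative convention, not a real standard.
"""
import os
import sys

PROTECTED = {"refs/heads/main", "refs/heads/master"}

for line in sys.stdin:
    parts = line.split()
    if len(parts) == 4 and parts[2] in PROTECTED:
        if os.environ.get("HUMAN_MERGE") != "1":
            print("push to main blocked: set HUMAN_MERGE=1 after your final review", file=sys.stderr)
            sys.exit(1)
```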
When the Loop Breaks Down
The loop breaks down in specific ways and the failure modes are recognizable. Knowing them helps you catch the breakdown early instead of weeks later when the consequences have compounded.
The first breakdown is when you stop asking "why" and only check "what." You see what the agent changed, you confirm it compiles, you commit. This is fast and feels productive. It misses the architectural mistakes that pass syntactic checks. The fix is to deliberately ask "why" at least once per session, even on changes that look obvious. The why-question is calibration; it tells you whether the agent's reasoning matches yours.
The second breakdown is when you stop reading code and only read messages. The agent's narrative summary of what it did is convenient, but it is also shaped by the agent's framing. The summary may say "I refactored the auth module to be cleaner" while the diff shows the auth module is now structured around an assumption that does not match your actual auth flow. If you only read the summary, you miss the structural drift. The summary is helpful but the diff is the source of truth.
The third breakdown is when you stop running tests after every change and only run them at the end. This is fine for pure mechanical changes but not for anything substantive. A change that breaks tests three commits later is much harder to debug than a change that breaks tests immediately. Run the tests now. The cost is small. The benefit is keeping the bug close to where it was introduced.
The fourth breakdown is when you stop noticing surprises. The first time the agent does something unexpected, you flag it and ask. The tenth time, it has become "normal." But "normal" includes some things that were actually small drift you did not push back on. The accumulated drift is how the codebase ends up shaped differently than you intended. Stay surprised. When something is unexpected, even a small thing, ask the question. The answers are sometimes "no, this is correct because of X" and you learn. Sometimes they are "actually you are right, I was about to do something dumb" and you save yourself a mistake.
The fifth breakdown is when you stop having opinions. This sounds odd but it is real. After enough successful turns, you start trusting the agent's choice on small decisions because pushing back would slow you down. The trust is fine on truly trivial decisions. It is dangerous on decisions you would have had an opinion about a month ago and now do not. The loss of opinions is a signal that you have habituated. Spend an hour with a tricky change and rebuild your opinions deliberately.
Tooling That Helps the Loop
The discipline is yours to build, but the tool can either support the discipline or fight it. Some choices help. Some get in the way. Knowing which is which matters because the wrong tool will quietly push you toward the failure modes even when you are trying to do the right thing.
Claude Code is the protagonist here for a few specific reasons. The default workflow is turn-based. The agent does work, presents a diff, and waits for your acknowledgement before moving on. That structure forces a review point at every turn. You can override it with auto-accept settings if you choose, but the default biases toward the supervised path. The diff display is clean and shows the file list before the line changes, which supports the scan-high-to-low pattern. Plan mode lets you see the agent's strategy before any code is written, which is the cheapest place to intervene. These are not accidents. They are design choices that make the supervised path easier than the unsupervised one.
Cursor takes a different approach. The inline-diff experience is excellent for line-level work but encourages line-by-line reading even on big changes. Composer mode is closer to a turn-based agent and works well for multi-file work, though the review surface there is similar to Claude Code's. The lesson is to use Cursor's inline mode for small surgical changes and Composer for anything bigger, and to bring your own discipline to the diff scan because the tool will not enforce it.
Aider runs in a terminal and shows diffs as text. The minimalism is a feature for some workflows because it strips away the slick UI that can lull you into approving without reading. The downside is that the review experience is whatever your terminal makes it, and a noisy terminal makes the diff harder to scan. People who use Aider successfully tend to also have a clean terminal setup, syntax highlighting in their pager, and a habit of reading the full diff before accepting. The tool will not do those things for you.
Whatever tool you pick, the question to ask is "does this make the supervised path easier than the unsupervised path?" If the answer is yes, the tool is helping. If the answer is no, you are fighting the tool every session, and over months you will lose. Pick a tool that is on your side of the discipline.
Working With Multi-Agent Setups
Some workflows now run multiple agents in parallel or in sequence. A planning agent feeds a coding agent feeds a review agent. Or you have agents on parallel branches working on independent features. Or one agent writes tests while another writes implementation. These patterns are real and they multiply both the productivity gain and the review challenge.
The core principle is unchanged: you are still in the loop, you are just in the loop at a different altitude. Instead of reviewing every line the coding agent produces, you review the planning agent's plan, you spot-check the coding agent's output, and you watch the review agent's findings. Your attention has higher reach per minute of supervision. So does the failure mode if you check out, because now multiple agents are producing in parallel and you cannot review all of it.
The first rule with multi-agent setups is that you still merge by hand. Whatever the agents do internally, the act of bringing their work into shared reality is yours. This is the same rule as solo agentic dev, just more important because the volume of output is higher. Auto-merging in a multi-agent pipeline is how you discover three weeks later that the system shipped something nobody noticed.
The second rule is that the review agent is not a substitute for your review. It is a backstop, like tests. A review agent can catch obvious issues, missing tests, broken patterns. It cannot make architectural judgment calls. It cannot decide whether the feature should ship. The human reviewer remains the final decision point, even when the review agent says "looks good." If the review agent is consistently right, that is great. You still spot-check it. If you stop spot-checking, the review agent becomes the new approval-drift surface, and now there are two layers between you and the code.
The third rule is to limit blast radius per agent. An agent with permissions to touch the auth flow is a different risk than an agent that can only touch the test files. Multi-agent setups should have scoped permissions that match scoped responsibilities. The planning agent does not need write access. The coding agent does not need to push to main. The review agent does not need to delete branches. Constrain each agent to its actual job and the failure modes are bounded.
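What scoped permissions might look like written down, as a hypothetical config rather than any real tool's API:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class AgentScope:
    """Illustrative only: what one agent in a pipeline may touch."""
    name: str
    writable_paths: tuple[str, ...] = ()
    can_push_branches: bool = False
    can_merge_to_main: bool = False   # stays False for every agent; the merge is yours

# Permissions follow responsibilities; none of these agents can merge to main.
PIPELINE = (
    AgentScope("planner"),                                                    # read-only
    AgentScope("coder", writable_paths=("src/", "tests/"), can_push_branches=True),
    AgentScope("reviewer", writable_paths=("reviews/",)),                     # findings, never code
)

def may_write(agent: AgentScope, path: str) -> bool:
    return any(path.startswith(prefix) for prefix in agent.writable_paths)
```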
Communication Patterns That Work
How you talk to the agent shapes how the loop runs. Some communication patterns make supervision easier. Others make it harder. The patterns are not magic. They are habits that pay back across sessions.
Be specific about scope. "Refactor the auth module" is too broad. The agent will make decisions about what counts as the auth module that you may not agree with. "Refactor the token validation function in src/auth/tokens.ts to use the new format described in RFC X" is specific. The agent has a bounded job. Your review can verify the bounded job got done. Vague scope produces sprawling diffs and uncertain reviews.
State the constraints. If you do not want the agent to touch tests in this turn, say so. If you do not want it to add new dependencies, say so. If you want it to preserve existing function signatures, say so. The agent will make reasonable defaults if you do not state constraints, but reasonable defaults are not always your defaults, and the divergence is a source of unwanted changes.
Ask for the plan first. On non-trivial changes, ask the agent to outline its approach before writing code. Read the plan. Approve or correct it. Then let it code. The plan-first pattern is the cheapest review you can do because the cost of changing direction at the plan stage is near zero. The cost of changing direction after the code is written is hours.
Push back when the answer is wrong. Agents will sometimes confidently say things that are incorrect. If you let the wrong answer pass, the agent builds on it and the wrong answer becomes the foundation of the next ten turns. Pushing back early costs a few seconds. Letting the wrong answer compound costs an hour. When something feels off, say so. The agent will either explain why you are wrong or correct itself. Either outcome is better than nodding along.
Use the agent's voice as a signal. When the agent's response gets vague, hedging, or starts using lots of "perhaps" and "might," that is a signal it is not sure about something. Vague answers from the agent often correspond to thin understanding of the actual problem. Drill in when the voice goes vague. Either the agent will surface the uncertainty and you can address it, or it will firm up the answer when pushed and you will have higher confidence in the result.
The Cost of Being In The Loop
Staying in the loop is not free. It costs attention, energy, and time. The cost is the price of the productivity gain. Pretending the cost is zero leads to burnout, which leads to either dropping out of the loop entirely or losing the discipline that makes the loop work.
The honest accounting is that an agentic workflow with proper review consumes about thirty to forty percent of the time saved by the agent. So a four-hour task that the agent does in one hour might cost you another hour or so of review, leaving a net of roughly two hours saved. Not the full three hours that "the agent did it in one hour" implies, but still substantially better than four hours of solo work. The savings are real but they are smaller than the marketing claims, and that is fine because they are still significant.
The thirty-to-forty-percent figure is not universal. It is what experienced practitioners report after they have built their trust calibration and review habits. Day one users spend more like seventy or eighty percent of the saved time on review because they have not yet learned which categories of work need deep review and which do not. By month three, the figure drops. The investment in calibration pays back here.
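The arithmetic behind those two paragraphs, made explicit. The figures are the rough ones quoted above, not measurements.

```python
def net_hours_saved(solo_hours: float, agent_hours: float, review_fraction: float) -> float:
    """Hours saved after paying the review tax.

    review_fraction is the share of the raw time saved that review consumes.
    """
    raw_saved = solo_hours - agent_hours
    review_hours = review_fraction * raw_saved
    return raw_saved - review_hours

# The article's rough figures: a four-hour task the agent does in one hour.
print(net_hours_saved(4, 1, 0.35))  # calibrated practitioner, ~35% review tax -> ~2h net
print(net_hours_saved(4, 1, 0.75))  # day-one user, ~75% review tax -> ~0.75h net
```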
What does not pay back is trying to skip the review entirely. The numbers from teams that do this are not better. They are worse, because the bug rate goes up and the rollback cost on a bad change is high. You can find anecdotes about devs who "vibe coded a whole feature and shipped it without reading a line." Those anecdotes are selection bias. You do not hear from the people who did the same thing and ended up rolling back a production deploy. The win-cases are loud. The fail-cases are quiet. Average them and the no-review path is worse than the supervised path.
The other cost is mental load. Staying in the loop means you are always partly in review mode, even when the agent is doing routine work. This is exhausting in a different way than writing code yourself. Solo coding has long stretches of focus. Agentic dev has frequent shallow context switches between "agent is working" and "I am reviewing." The shallow switches add up. Practitioners report being more tired at the end of an agentic dev session than a solo session, even when the agentic session shipped more output.
The fix is not to abandon the loop. The fix is to manage the load deliberately. Take breaks between sessions. Do not do agentic dev at the end of an exhausted day. Recognize when you are tired and stop, instead of pushing through and accepting bad changes. The agent will be there tomorrow. The bugs you ship at midnight will be there tomorrow too, and they will be harder to fix.
Closing
Vibe coding gives you superpowers and they break if you stop being the human in the loop. The discipline is staying engaged across hundreds of turns without burning out and without rubber-stamping. The agent does not care if you are paying attention; the bug it produces will, and so will the user who hits it. The loop is not a feature of the tool. It is a habit you build. The tool can support it or hinder it. Your habit is what makes it work.
The thing nobody says clearly enough is that the loop is the entire job. The act of writing code has been delegated. The act of deciding what gets written, what ships, what does not, and what gets fixed when something goes wrong, that has not been delegated and cannot be. That is the human's job and it matters more than ever. A reviewer who is in the loop and on top of their game can guide the output of an agent across a project that would have taken them months to write by hand. A reviewer who has checked out is shipping unreviewed code at machine speed. The difference between those two outcomes is not the tool. It is the discipline.
Build the discipline. Stay engaged. Trust the agent at the level it has earned, verify at the level the change deserves, and remember that the moment you stop noticing whether you are paying attention is the moment the loop has broken and you do not know it yet.
