When Not to Vibe Code

Vibe coding is a tool, not an ideology. Knowing when not to use it is what separates serious practitioners from believers. The cases below are not edge cases or hypothetical worries reserved for some future audit. They are real domains, with real names and real penalties, where the AI workflow that ships SaaS in a week ships incidents in regulated industries.

Every other topic in this curriculum has covered where the workflow helps. This one closes it by covering where the workflow hurts. If you read only this page, you would think AI coding agents are dangerous and should be avoided. That is the wrong takeaway. The right takeaway is that they are dangerous in some places and useful in others, and the engineer who has memorized the difference has the most valuable skill of 2026.

The pattern across this entire page is the same: the workflow that produces a passable login form in twenty minutes also produces a passable HIPAA violation in twenty minutes, and the second one is a federal matter. The same speed that compounds in your favor on a CRUD app compounds against you in a domain you do not understand. Speed is a multiplier on judgment, and when judgment is wrong, speed is the enemy.

$1.5M
Maximum HHS civil penalty per HIPAA violation category, per year, under the 2009 HITECH tier framework
DO-178C
Aerospace software certification standard with five Design Assurance Levels and certification artifacts retained for the operational life of the aircraft
16ms
Frame budget for a 60Hz game; AI defaults that allocate and free garbage every tick eat into that budget with allocator churn and GC pauses

The Pattern: Where AI Fails Consistently

The first thing to internalize is that AI coding agents are not occasionally wrong in the high-stakes domains discussed below. They are consistently and predictably wrong, in the same shape, every time. This is not a noise problem you fix with better prompting or a smarter model next quarter. It is a structural property of how these systems work, and treating it as a tunable parameter rather than a hard limit is the most expensive mistake practitioners make.

The shape of the failure is the part that fools people. AI does not fail by producing obviously broken output. If it did, the failures would be cheap and self-correcting. AI fails by producing output that looks right, runs cleanly, passes a reasonable code review, and quietly violates an unstated domain constraint. The code compiles. The tests pass. The reviewer nods. Three months later, the regulator sends a letter, or the trader notices a one-microsecond pattern, or the security researcher finds a side-channel timing leak in the signing function. The failure was there from the first commit. It was just invisible to the reviewer because the reviewer did not have the domain knowledge to see it.

The constraints AI does not see fall into a small number of categories, and naming them out loud is the entire point of this section.

Regulation is the first one. AI does not know that the medical device standard requires every change to have a documented rationale. It does not know that SOX requires segregation of duties on financial calculations. It does not know that the FDA's 21 CFR Part 11 has specific rules about electronic signatures. These rules exist in compliance documents, not in code. The AI was trained on code. The constraint is in the wrong format for the model to see it.

Performance bounds are the second. AI defaults to readable code. Readable code allocates objects, uses garbage-collected idioms, and prefers clarity over machine sympathy. In a real-time embedded system or a high-frequency trading engine, every one of those defaults is wrong. The constraint is "this loop must complete in under 800 nanoseconds." That number does not appear in the prompt, and even if it did, the model has no internal sense of how its proposed code will execute on the actual hardware.

Security primitives are the third. AI knows what an HMAC looks like syntactically. It does not have a deep model of what makes one constant-time and another vulnerable to a timing oracle. It will produce code that is functionally correct and cryptographically broken in the same breath, and the breakage is the kind a code review will not catch unless the reviewer is themselves a cryptographer.

Novel research is the fourth. AI is a sophisticated pattern-matcher over its training data. Truly novel work is by definition not in the training data. The model will gravitate toward the closest thing it has seen, which is often the very pattern your research is trying to escape from. The output will look like prior art, because prior art is what the model has.

Deep tribal knowledge is the fifth. The twenty-year codebase has rules nobody wrote down. The senior engineer's "obvious" decision is non-obvious to anyone outside the team. The AI was not on the team. It was not in the room when the decision was made. It does not know that file X must never be touched on a Friday because the deployment system has an undocumented dependency on file Y being older.

Where AI succeeds

Conventional CRUD apps. Greenfield SaaS. Marketing sites. Internal tools. Code that has ten thousand similar examples in the training data. Domains where the dominant constraints are taste and speed, both of which the AI is happy to follow your direction on. The well-trodden middle of the engineering landscape, where the cost of being wrong is "we ship a small bug and fix it next week" and the upside of being fast is huge.

Where AI consistently fails

Regulated industries with traceability requirements. Performance-critical systems with hard deadlines. Security primitives where subtle bugs become exploits. Novel research where the right answer is not in the training data. Legacy systems with deep undocumented domain knowledge. Domains where the constraint that matters is invisible to the model, and where being fast in the wrong direction is worse than being slow in the right one.

The list above is not exhaustive, but it covers most of the real cases. The rest of this page goes through them one by one, with the actual industry names, the actual regulations, and the actual reasons why the AI workflow that works elsewhere does not work here. If you find yourself working in one of these domains, do not panic. The right move is not "never touch an AI tool." The right move is "use the AI for the parts of the work it can help with, and write the load-bearing parts by hand." The hybrid approach is the answer, and the last two sections of this page lay it out in detail.

High-Stakes Regulated Domains

Regulated industries share a property that breaks the AI workflow at its core: every decision in the codebase must be traceable to an intent, and the intent must be defensible to an auditor years after the code was written. The AI workflow does not produce traceable intent. It produces code that satisfies a prompt, and the prompt is not a regulatory artifact. By the time the code reaches review, the rationale that an auditor needs has either never existed or has dissolved into the conversation history. This is not a problem you fix by exporting your prompts. The mismatch is structural.

The four big regulated domains for software in 2026 are healthcare, finance, aerospace, and legal tech. Each has its own regime, its own penalties, and its own specific reasons why hand-rolling the load-bearing code is the right call. Walk through them one at a time.

Medical software and HIPAA

HIPAA, the Health Insurance Portability and Accountability Act, has been the dominant US healthcare data regulation since 1996. The 2009 HITECH amendment added teeth in the form of tiered civil penalties up to $1.5 million per violation category per year, which is the number that finally made HIPAA a board-level concern rather than a back-office compliance matter. The rule that applies to software is the Security Rule, which mandates technical safeguards on protected health information including access control, audit controls, integrity checks, person authentication, and transmission security. Each of those is a specific technical control with a specific implementation expectation.

An AI agent will happily generate authentication code, audit logs, and encryption wrappers. The code will look fine. It will compile. It will pass tests written against the AI's mental model of what the code does. None of that survives an OCR audit, because the auditor is not asking "does the code work" but "is the code traceably implementing the safeguards your risk analysis identified." The risk analysis is a separate document. The mapping from risk to control to code line is what an auditor wants to see, and the AI workflow produces none of that mapping.

FDA Class II and Class III medical device software adds another layer. The FDA classifies device software by the risk it poses to patients. Class II is moderate risk and generally reaches market through premarket notification under 510(k). Class III is high risk, life-sustaining or life-supporting, and requires premarket approval. A continuous glucose monitor is Class II. An implantable pacemaker is Class III. The software in either has to demonstrate, on paper, that every line of code traces to a requirement, that every requirement traces to a hazard analysis, and that every hazard has a mitigation strategy that has been tested. Engineers in this space already write code more slowly than the SaaS world by a factor of five or ten. The AI agent's productivity multiplier collides with the regulatory burden in a way that produces approximately zero net speedup, and the cost of getting it wrong is a recall.

The European version is GDPR plus the Medical Device Regulation (MDR), which entered full force in 2021. GDPR's Article 32 demands "appropriate technical and organisational measures" for personal data, and Article 25 requires "data protection by design and by default." Both demand documentation of design decisions. The MDR requires technical files that include "the design and manufacturing of the device" with enough detail for a Notified Body to assess conformity. AI-generated code does not produce design files. Humans do.

Financial systems and FINRA, SOX, MiFID II, PCI-DSS

Financial regulation in the US splits across several bodies. FINRA, the Financial Industry Regulatory Authority, oversees broker-dealers and has rules on order handling, supervision, and recordkeeping under SEC Rule 17a-4. Sarbanes-Oxley, passed in 2002 after the Enron and WorldCom collapses, requires public company executives to personally certify financial controls under Section 302, with criminal penalties for false certification. SOX Section 404 demands an annual internal control over financial reporting (ICFR) attestation by the company and an independent auditor. Software that touches financial reporting falls under that scope, which means every line of code is part of a control that the CEO is signing for under threat of jail time.

The European parallel is MiFID II, which mandates pre-trade and post-trade transparency, transaction reporting, and best execution analysis. Article 17 specifically addresses algorithmic trading and requires firms to have effective systems and risk controls in place, to maintain effective business continuity arrangements, and to ensure their trading systems are resilient, appropriately tested, and properly monitored. The phrase "appropriately tested" is doing a lot of work, and what it means in practice is full traceability of every algorithm change, including who proposed it, why, what testing was performed, and who signed off. The AI workflow does not produce this trail by default.

Payment processing under PCI-DSS, the Payment Card Industry Data Security Standard, version 4.0 (released 2022, mandatory March 2025), requires twelve high-level controls covering everything from network segmentation to cryptographic key management to vulnerability scanning. Requirement 6 specifically covers secure software development, including code review, secure coding training, and verification of vulnerabilities introduced during development. PCI-DSS is enforced by the card brands, and a confirmed breach can result in fines of $5,000 to $100,000 per month from Visa, plus the cost of forensic investigation, plus the cost of card reissuance, plus the inevitable class-action lawsuits.

HIPAA documentation: percent of effort spent on traceability artifacts ~40%
SOX-scoped software: percent of changes requiring formal change-management ticket 100%
DO-178C Level A code: ratio of certification artifacts to lines of code ~10:1
PCI-DSS in-scope code: percent that requires annual penetration testing 100%

Aerospace and DO-178C

Aerospace software is the discipline that takes regulatory pain to its logical conclusion. DO-178C, "Software Considerations in Airborne Systems and Equipment Certification," published by RTCA in 2011 as a successor to DO-178B, is the standard that civilian aviation authorities require for any software in a certified aircraft. It defines five Design Assurance Levels (DAL): A through E. DAL A is "catastrophic," meaning a software failure could cause loss of the aircraft. DAL B is "hazardous." DAL C is "major." DAL D is "minor." DAL E is "no safety effect."

The cost scaling across DALs is what makes aerospace different. DAL E code is about as expensive as good commercial code. DAL A code is roughly ten times more expensive per line, because every line has to be traced to a requirement, every requirement has to be traced to a higher-level system requirement, the test suite has to demonstrate Modified Condition/Decision Coverage (MC/DC) at the source level, and every artifact has to be retained for the operational lifetime of the aircraft, which can be thirty to fifty years.

An AI agent producing DAL A code would have to also produce: the high-level requirements traceability, the low-level requirements traceability, the design description, the source code, the verification cases and procedures, the verification results, the configuration management records, the quality assurance records, and the certification liaison records. The agent does not produce any of those by default. A human still has to author them, and the cost of authoring them is the dominant cost in the project. AI saves you almost nothing on a DAL A program because the typing was never the constraint.

Legal tech and privilege handling

Legal software is the quietest of the four, but the consequences of getting it wrong are real. Attorney-client privilege is a doctrine that protects communications between a lawyer and client from disclosure. The doctrine has specific rules: the communication must be made for the purpose of seeking legal advice, between the right parties, with no third party present, and not waived by inadvertent disclosure. Discovery, the pretrial phase where parties exchange information, requires producing all relevant non-privileged documents while withholding the privileged ones. Producing a privileged document by accident can constitute waiver, and waiver can extend to the entire subject matter of the document, not just the document itself.

Legal tech software handles privilege review, evidence chains for chain-of-custody, redaction of privileged content, and document review at scales of millions of documents per matter. An AI agent generating a privilege classifier is producing code whose failure mode is "we accidentally produced a million dollars' worth of privileged communications to the opposing party in a billion-dollar matter." That is not a P2 ticket. That is a bar complaint and a malpractice suit. The Federal Rules of Evidence, particularly Rule 502 on waiver of attorney-client privilege, govern clawback, but Rule 502(b) only protects "inadvertent" disclosure if the holder took "reasonable steps" to prevent it. An AI-generated classifier without human verification is unlikely to qualify as reasonable steps.

The shared problem

Regulation requires traceable decisions. AI-generated code has no traceability of intent. The conversation history that produced the code is not a regulatory artifact, and reconstructing intent after the fact does not satisfy auditors. In every regulated domain, the cost of compliance documentation dwarfs the cost of typing, and the AI productivity multiplier on typing is therefore close to zero. Use AI for scaffolding around regulated systems. Write the regulated parts by hand, with the audit trail captured as you go.
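
To make "audit trail captured as you go" concrete, here is a minimal sketch, in Python, of one way a team might tie code to its risk analysis by hand: a registry mapping risk-analysis IDs to the functions that implement the corresponding controls. The IDs, names, and decorator are hypothetical illustrations, not a compliance framework, and nothing about this substitutes for the actual regulatory documentation.

```python
# Hypothetical illustration: a hand-maintained registry that ties each
# control function to the risk-analysis item it implements. The IDs and
# names are invented; the point is that the mapping is authored by a human
# and lives in the repo, where an auditor can follow it.
from functools import wraps

CONTROL_REGISTRY: dict[str, dict] = {}

def implements_control(risk_id: str, description: str):
    """Mark a function as the implementation of a documented risk control."""
    def decorator(fn):
        CONTROL_REGISTRY[risk_id] = {
            "function": f"{fn.__module__}.{fn.__qualname__}",
            "description": description,
        }
        @wraps(fn)
        def wrapper(*args, **kwargs):
            return fn(*args, **kwargs)
        return wrapper
    return decorator

@implements_control("RISK-017", "Access to PHI requires an authorized role and is logged")
def fetch_patient_record(user, patient_id):
    ...  # human-written access check and audit log write go here
```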

Performance-Critical Systems

The second category where AI consistently underperforms is anything where the constraint is "this code must complete within a fixed time budget." The AI workflow optimizes for readable, conventional, garbage-collected style. Performance-critical work demands the opposite: hand-tuned, allocation-free, cache-aware, predictable latency. The default the AI reaches for is the wrong default for this entire class of work, and asking the AI to "make it faster" produces incremental improvements that miss the structural problems entirely.

Five domains belong here, each with its own time scale and its own reason why typing was never the bottleneck.

High-frequency trading

HFT firms compete on microseconds. The fastest market makers in the world target tick-to-trade latencies under one microsecond, and the engineering shape of that work has nothing in common with the engineering shape of a SaaS product. Code paths are written without dynamic memory allocation. Object pools are pre-allocated at startup. Branch prediction is hand-optimized. Cache lines are tracked at the byte level. Networking is offloaded to FPGAs or to kernel bypass libraries like Solarflare's OpenOnload or Intel's DPDK. The compiler's output is inspected at the assembly level, and a one-instruction difference can be the difference between a profitable strategy and a losing one.
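
A toy sketch of the pre-allocation pattern, in Python only for readability; real trading hot paths are hand-written C++ or FPGA logic, and the class and field names here are made up. The shape is what matters: every object exists before the hot path starts, and the critical loop only reuses them, so the allocator never runs on it.

```python
# Toy illustration of an object pool (names are hypothetical; real HFT code
# is C++/FPGA). All Order objects are created at startup; the hot path only
# acquires and releases existing slots, never allocating.
class Order:
    __slots__ = ("idx", "price", "qty", "live")

    def __init__(self, idx: int):
        self.idx = idx
        self.price = 0
        self.qty = 0
        self.live = False

class OrderPool:
    def __init__(self, size: int):
        self._pool = [Order(i) for i in range(size)]  # allocated once, up front
        self._free = list(range(size))                # indices of free slots

    def acquire(self) -> Order:
        order = self._pool[self._free.pop()]  # reuse, no allocation here
        order.live = True
        return order

    def release(self, order: Order) -> None:
        order.live = False
        self._free.append(order.idx)
```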

An AI agent does not think at the instruction level. It thinks at the source-code level, which is the wrong level for this domain. Asking it to "make this loop faster" produces source-code changes that may or may not affect the assembly. The actual optimization happens in the relationship between the source, the compiler, and the hardware, and the AI has no privileged view of any of those layers. HFT firms do use AI tooling for the engineering shell, the test infrastructure, the visualization, the configuration management, and the data pipelines that feed the trading code. But the trading code itself is hand-written by engineers who can read the disassembly and reason about the L1 cache.

Real-time embedded systems

Real-time embedded covers medical infusion pumps, automotive engine control units, industrial control systems, avionics flight controllers, and a long list of other domains where the software runs on a microcontroller with kilobytes of RAM and a deadline measured in microseconds. The dominant standards are RTCA DO-178C in aerospace (already discussed), ISO 26262 in automotive (with Automotive Safety Integrity Levels A through D), IEC 62304 in medical, and IEC 61508 as the general functional safety standard.

The defining property of real-time embedded code is that the worst-case execution time matters more than the average case. A function that runs in 5 microseconds on average and 50 microseconds in the worst case fails its budget if the budget is 20 microseconds. The AI workflow optimizes for typical-case readability and provides almost no help with worst-case analysis. Tools like AbsInt's aiT or Rapita's RVS perform static worst-case execution time analysis, and the analysis depends on the structure of the code being amenable to it. Loops have to have known iteration bounds. Recursion is usually disallowed. Dynamic allocation is forbidden. The AI's defaults violate every one of these.
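
A sketch of the structural difference, again in Python purely to show the shape; real firmware is C with static buffers, and the names and limits below are invented. The first function is the readable default an agent tends to emit; the second has a fixed iteration bound, no recursion, and an output buffer allocated once at startup, which is the kind of structure a worst-case execution time tool can actually analyze.

```python
# Illustrative contrast only; real embedded code is C with static buffers.
MAX_SAMPLES = 256   # invented bound
THRESHOLD = 100     # invented threshold

# The readable default: unbounded input, a fresh list allocated per call.
# Fine in a web service, hostile to worst-case execution time analysis.
def filter_samples_default(samples):
    return [s for s in samples if s > THRESHOLD]

# The embedded shape: known iteration bound, output buffer allocated once,
# no allocation and no recursion inside the loop.
_out_buffer = [0] * MAX_SAMPLES

def filter_samples_bounded(samples) -> int:
    count = 0
    for i in range(min(len(samples), MAX_SAMPLES)):  # bounded loop
        if samples[i] > THRESHOLD:
            _out_buffer[count] = samples[i]
            count += 1
    return count  # caller reads _out_buffer[:count]
```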

Game engines

Game engines run on a 16.67ms frame budget at 60Hz, or 8.33ms at 120Hz, and the budget has to cover input handling, AI, physics, animation, rendering, audio, networking, and the dozen other subsystems that have to land their work before the next vsync. The hot path through a game engine is one of the most carefully optimized chunks of code on any general-purpose computer, and the optimization is structural: data-oriented design, tight memory layouts, SIMD-friendly structure-of-arrays layouts, avoiding branches in inner loops, prefetching the next cache line before the current one is consumed.
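
For a sense of what the budget looks like in code, here is a minimal frame-loop sketch in Python with made-up structure; real engines do this in C++ with per-subsystem profiling markers rather than a print, but the accounting is the same: every subsystem's work has to land inside the frame before the next vsync.

```python
import time

FRAME_BUDGET_S = 1 / 60  # 16.67 ms at 60 Hz

def run_frame(subsystems):
    """Run one frame of (name, update_fn) pairs and flag budget overruns.
    Illustrative sketch only; the subsystem structure is invented."""
    timings = {}
    start = time.perf_counter()
    for name, update in subsystems:
        t0 = time.perf_counter()
        update()
        timings[name] = time.perf_counter() - t0
    elapsed = time.perf_counter() - start
    if elapsed > FRAME_BUDGET_S:
        worst = max(timings, key=timings.get)
        print(f"frame overran budget: {elapsed * 1000:.2f} ms (worst: {worst})")
```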

An AI agent has read every Unreal and Unity tutorial in existence. It can absolutely scaffold a game engine. What it cannot do is hand-tune the renderer to fit a 4ms budget on a Switch GPU. The optimization is hardware-specific, profile-driven, and dependent on the exact game and the exact target platform. The AI has access to none of that context, and the part of the engine where the budget actually lives is the part where AI's defaults work against you. Studios use AI for tooling, for asset pipelines, for editor extensions, for the gameplay scripting layer where the budget is generous. The renderer's hot loop is still hand-written by graphics engineers who have spent a decade doing exactly that.

Compilers and interpreters

Compiler engineering is one of the oldest and most mature subfields of computer science, and the work is exactly the opposite shape of vibe coding. A compiler's job is to take source code in one language and produce optimized output in another. The optimization is what the user pays for, and the optimization is the result of decades of accumulated tricks: dataflow analysis, alias analysis, escape analysis, loop transformations, register allocation, instruction selection, peephole optimization. Each of these is a deep subfield with its own literature, and the AI's training data, while including a lot of compiler textbooks, does not include the kind of intuition that comes from running benchmarks against your own changes for years.

Compilers also have correctness as a hard constraint. A compiler that produces 1% faster code 99.9% of the time and miscompiles 0.1% of programs is unusable. The bar is "always correct, ideally fast," not "fast in the common case." AI's tendency to produce plausible-looking code that handles the common path and fails on edge cases is exactly the wrong tendency for compiler work.

Honest take

AI scaffolds around performance-critical systems beautifully. Build systems, configuration, monitoring, dashboards, test harnesses, fuzzing infrastructure, debugging tools, deployment pipelines: all of these benefit from the same productivity multiplier that vibe coding gets in any other domain. The hot paths, the fast paths, the budget-bound code: those are written by humans, by hand, with the AI as a research assistant and a code reviewer rather than as a primary author. The boundary is sharp and worth respecting.

Security-Critical Primitives

The third domain is the one that scares senior engineers most, because the failure mode is not "the user notices and reports a bug" but "the attacker notices and never reports anything until it shows up in a credential dump." Security-critical primitives have a property no other code has: a subtle bug becomes an exploit, and the exploit stays silent until the consequences are catastrophic.

The list of things that count as security primitives is short and precise. Cryptographic operations: hashing, signing, encryption, key derivation, random number generation. Authentication: password hashing, token generation, session management, the actual logic of who is who and how the system knows. Authorization: the logic of what users can do and how the system enforces it. Secrets management: how API keys, database passwords, and signing keys are stored, rotated, and accessed. Each of these has decades of attack literature, and each has a battle-tested library that has already implemented it correctly under real adversarial pressure.

Why AI is dangerous here

An AI agent will produce HMAC code that compiles, runs, and fails the comparison check by performing a non-constant-time string compare. The check is functionally correct: it returns true when the values match and false when they do not. It is also a textbook timing oracle: an attacker measuring the time of the comparison can recover the secret one byte at a time, because the compare returns faster when the first byte mismatches than when it matches the first byte and mismatches the second. This is a real attack, well-documented since the early 2000s, and the fix is to use a constant-time compare like Python's hmac.compare_digest or Go's crypto/subtle.ConstantTimeCompare. The AI knows about the attack in the abstract. It still produces the vulnerable code because the prompt did not specifically ask for constant-time, and the default in most languages is the non-constant-time compare.
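
Here is the failure in miniature, as a small Python sketch: the first verifier is the kind an agent tends to emit, functionally correct and a timing oracle; the second uses the standard library's constant-time compare.

```python
import hashlib
import hmac

SECRET_KEY = b"example-key-for-illustration"  # placeholder, not a real key

def sign(message: bytes) -> bytes:
    return hmac.new(SECRET_KEY, message, hashlib.sha256).digest()

# Vulnerable: `==` on bytes short-circuits at the first mismatching byte,
# so the comparison time leaks how many leading bytes the attacker guessed.
def verify_vulnerable(message: bytes, tag: bytes) -> bool:
    return sign(message) == tag

# Correct: hmac.compare_digest takes time independent of where the mismatch
# occurs, which closes the timing oracle.
def verify_constant_time(message: bytes, tag: bytes) -> bool:
    return hmac.compare_digest(sign(message), tag)
```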

The same pattern repeats across every security primitive. Random number generation: the AI uses Math.random() or random.random(), which are predictable and unsafe for cryptographic use. The fix is crypto.randomBytes or secrets.token_bytes. Password hashing: the AI uses SHA-256 with a salt, which is fast and exactly the wrong property; the fix is bcrypt, scrypt, Argon2, or PBKDF2 with appropriate work factors. Session tokens: the AI generates them by concatenating user ID and timestamp, which is forgeable; the fix is signed tokens with proper key management or opaque tokens stored server-side with a CSPRNG-derived value.
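
The same contrast for randomness and password hashing, sticking to the standard library so the sketch stays self-contained; Argon2 or bcrypt via a maintained library are equally appropriate for the hashing step, scrypt is simply the memory-hard KDF that ships with Python.

```python
import hashlib
import os
import random
import secrets

# Session tokens: random.random() is predictable; secrets draws from the
# operating system's CSPRNG.
weak_token = str(random.random())         # forgeable, do not use
strong_token = secrets.token_urlsafe(32)  # 32 random bytes, URL-safe encoded

# Password hashing: salted SHA-256 is fast, which is exactly the wrong
# property; a memory-hard KDF (scrypt here, Argon2/bcrypt elsewhere) is the fix.
def hash_password_weak(password: str, salt: bytes) -> bytes:
    return hashlib.sha256(salt + password.encode()).digest()  # too fast to be safe

def hash_password(password: str) -> tuple[bytes, bytes]:
    salt = os.urandom(16)
    digest = hashlib.scrypt(password.encode(), salt=salt, n=2**14, r=8, p=1)
    return salt, digest  # store both; recompute and compare_digest on login
```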

Each of these is a known failure mode. Each has been documented, attacked, and exploited in the wild. The AI has read the documentation. It still produces the vulnerable code, because the security-correct option is rarely the language default and is usually a couple of imports away from the obvious one.

AI generates code → looks correct → tests pass → subtle vulnerability → silent exploit

The rule

The discipline that has worked for security-conscious engineering for the last twenty years still works in the AI era, and the AI does not change it: do not hand-roll security primitives. Use a battle-tested library. The libraries that have survived adversarial testing for years are the ones to use. Argon2 for password hashing. NaCl/libsodium for authenticated encryption. AWS KMS or HashiCorp Vault for secrets management. OAuth 2.1 with a real implementation like Auth0, Keycloak, or Authentik for authentication. WebAuthn for second-factor. Each of these has already absorbed the cost of getting it right, and the AI's job is to wire them together correctly, not to reimplement them.

The wiring is where AI is genuinely useful in security work. Configuring a Vault client. Setting up the Auth0 callback flow. Writing the middleware that validates a JWT against a JWKS endpoint. Calling libsodium with the right parameters. The AI is fine at all of this, and the productivity multiplier holds. What it is not fine at is being the one who decides which constants to use, which library is appropriate for the threat model, or whether your specific use case needs constant-time operations. Those decisions are human work, and they have to be human work.
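
As a concrete example of that wiring, here is a hedged sketch of JWT validation against a JWKS endpoint, assuming the PyJWT library; the URL, audience, and issuer are placeholders. The point is that every cryptographic decision lives inside the library, and the glue is the part the AI can reasonably draft and a human can quickly review.

```python
# Sketch assuming the PyJWT library (`pip install pyjwt[crypto]`). The URL,
# audience, and issuer below are placeholders, not real endpoints.
import jwt
from jwt import PyJWKClient

JWKS_URL = "https://auth.example.com/.well-known/jwks.json"

_jwks_client = PyJWKClient(JWKS_URL)

def validate_token(token: str) -> dict:
    signing_key = _jwks_client.get_signing_key_from_jwt(token)
    return jwt.decode(
        token,
        signing_key.key,
        algorithms=["RS256"],          # pin the algorithm; never accept "none"
        audience="https://api.example.com",
        issuer="https://auth.example.com/",
    )
```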

Takeaway

Use battle-tested libraries for the primitives. Use AI for the scaffolding around them. Never let the AI hand-roll your crypto, your auth, your session management, or your secrets handling. The cost of a subtle bug here is not "we patch it next week"; it is "we are in the news." The fix is the same fix the security-conscious engineering community has used for decades, just applied to a new tool: trust the libraries, not the tool that produces the libraries' callers.

Novel Research

The fourth domain is research, and the failure mode here is different from the others. Where regulated industries fail because of constraints invisible to the model, and security primitives fail because of subtle bugs that look fine, novel research fails because of a property of how the model works internally. The model is a sophisticated pattern-matcher over its training data. Truly novel work is by definition not in the training data. The model has nothing to match against, so it does the next-best thing, which is to match against the closest thing it has seen and propose that.

The pattern is consistent across research domains. A new neural architecture that does not look like anything in the literature: the model proposes the closest thing in the literature, which is not what you wanted. A new cryptographic protocol with a novel security property: the model proposes a known protocol with a different security property, presented as if it were the new one. A new physics simulation with a novel boundary condition: the model proposes a standard finite-element scheme that handles the wrong boundary. In each case, the output is plausible, well-formed code that solves a different problem from the one you actually have. The seductive part is that the output looks like it solves your problem, because the model is good at making things look right.

Where the model genuinely helps

Research has an engineering shell. The shell is the data loading code, the experiment tracking, the visualization, the hyperparameter sweep infrastructure, the cluster job submission, the result aggregation, the figure generation. All of this is conventional engineering, well-represented in the training data, and amenable to the same productivity multiplier any other engineering work gets. Research labs that do not use AI for their engineering shell are giving up free productivity for no reason.

The core of the research, the part that constitutes the contribution, is the part the model cannot help with. If your paper proposes a new algorithm, you wrote that algorithm. The model can transcribe it into code, possibly with bugs, but the algorithm itself came from you. If the model proposed it, the algorithm is not novel, because the model is producing patterns from its training data, and the training data is the prior literature. The boundary is sharp: the contribution is human, the implementation of the contribution is partly human and partly AI, the engineering shell is mostly AI.

Examples worth naming

New machine learning architectures. The model knows about Transformers, CNNs, RNNs, attention mechanisms, and a few hundred other published architectures. If your contribution is a new gating mechanism inside attention, you write that mechanism. The model can write the surrounding training loop and the data pipeline, but the gating itself, including the mathematical reasoning about what it should do, is yours.

New cryptographic protocols. The model knows about TLS, Signal, Noise framework, OPAQUE, and a long list of published protocols. If you are designing a new key exchange that handles a property no existing protocol handles, the model will helpfully suggest TLS or Signal as starting points. They are not your starting points. The protocol design is human work, and the implementation needs the same care that any cryptographic implementation needs.

New physics simulations. The model knows about finite-element methods, finite-volume methods, particle-based methods, lattice Boltzmann, and a long list of standard schemes. If your simulation involves a new coupling between two physical regimes that has not been done before, the model will reach for the closest standard scheme. The novel coupling is your work.

Existing Code With Deep Domain Context

The fifth category is the one that quietly costs the most money in working software organizations. It is the case of the twenty-year-old enterprise codebase, with rules nobody documented, written by people who have mostly left the company, integrated with internal systems that have no public docs, and load-bearing for a business that cannot afford an outage. AI agents do not work well in this codebase, and the failure mode is specifically that the agent is confidently wrong in ways that are expensive to discover.

The shape of tribal knowledge

Tribal knowledge is the collection of things a senior engineer knows that nobody wrote down. It includes: which file has the deceptively simple-looking function that is actually load-bearing for three other systems. Which environment variable, if changed, breaks a downstream batch job that nobody on this team owns. Which API endpoint has an undocumented rate limit because the integration partner verbally agreed to one in 2018. Which database column has a default that compensates for a bug in a different system that was never fixed because fixing it would break the customer's billing reconciliation. Which deployment time of day is safe and which is not, and the reason involves a cron job in a different timezone.

None of this is in the code. None of it is in the docs. All of it is real, and all of it is necessary to make correct changes. The AI agent has access to none of it. It will read the code, infer a reasonable model of what the code does, propose a change that is consistent with that model, and the change will violate three pieces of tribal knowledge that the senior engineer would have caught in five seconds.

Domain models that took years to get right

Beyond tribal knowledge, established codebases also tend to encode domain models that the team has refined over years of contact with the actual business. A health insurance claims system has a model of "what counts as a claim" that has been adjusted dozens of times to handle edge cases that the lawyers and adjusters discovered the hard way. A logistics routing system has a model of "what counts as a delivery" that incorporates four years of feedback from drivers, customers, and dispatchers. A trading order management system has a model of "what counts as an order" that has been refined through a dozen exchanges and a hundred regulatory changes.

The AI cannot reverse-engineer these models from the code. The code is the result of the model, not the documentation of it. Reading the code tells you what the system does. It does not tell you why, and the why is what governs whether a proposed change is safe. The senior engineer who has lived with the system for five years knows the why. The AI does not, and it cannot acquire it from reading.

Integration with internal services

Modern enterprises run dozens or hundreds of internal services, most of which have no public docs. They have internal docs, sometimes, but the internal docs are usually out of date, and the actual behavior is determined by the service team's current understanding plus the accumulated workarounds in the calling code. An AI agent reading the calling code will infer a model of the called service. The inferred model is usually wrong in subtle ways, because the calling code accumulates workarounds for service quirks, and the AI cannot tell which lines are real logic and which are workarounds.

AI reads code → infers a reasonable model → confidently proposes a change → violates an undocumented constraint → production incident
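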

The signal

The most reliable sign that a codebase has deep tribal knowledge is when a senior engineer's "obvious" decision is non-obvious to anyone outside the team. If you ask three engineers on the team why a particular thing is done a particular way, and they all give the same explanation in five seconds, that explanation is tribal knowledge. The AI does not have it. It will be confidently wrong on questions that the team finds trivial, and the wrongness is the most expensive kind because it is invisible during review.

The strategy in these codebases is the same as the strategy in regulated codebases: use AI for the engineering shell, write the load-bearing parts by hand. The shell here means: test scaffolding, build configuration, internal tooling, documentation generation, log analysis, monitoring dashboards. The load-bearing parts are anything that touches the actual domain logic. A senior engineer who has been on the team for years drives those changes, and the AI assists at the edges.

When the Speed of Typing Actually Matters

The sixth case is the smallest in scope but the most underrated, and it deserves an honest section because the discourse around vibe coding tends to skip past it. There are real moments in real engineering work where launching an AI agent costs more than the edit itself. The cases are not common in greenfield work but are extremely common in maintenance, debugging, and the late-stage refinement of a feature.

The brief, focused edit

You know the bug. You know the fix. The fix is a few characters: a missing semicolon, a wrong variable name, a flipped comparison operator. You can type the fix in two seconds. Launching an agent and prompting it to make the fix takes thirty. The agent will probably make the fix correctly, and it might also touch two adjacent lines because it noticed something it wanted to clean up, and now you have to review three lines instead of one. The round trip is expensive when the edit is cheap.

The same applies to the small refactor where you know exactly what you are doing. Renaming a variable across one file. Inlining a function that is only called once. Reordering two lines because the dependency direction was wrong. The AI agent will happily do any of these, but the prompt-and-review cycle is longer than the edit. The edit takes ten seconds. The agent round trip takes a minute. You have lost net time.

When you are debugging and need exact control

Debugging is a case where the human's mental model has to stay tight, and the agent's tendency to interpret the debug session through its own model creates friction. You add a log statement at line 47 because you specifically want to see the value of x at that point. The agent might add the log statement, but it might also add a related one at line 52 because it inferred you also wanted that. The extra noise breaks your flow, because now you have to reason about why both logs are there, when you only meant to add one.

The pattern is the same in print-debugging, in interactive debugger sessions, in stepping through code. The human mental model is precise, the agent's interpretation is approximate, and the gap between the two creates friction that is faster to avoid than to manage. Senior engineers debugging a tricky bug will often turn off their AI tools entirely for the duration of the session, then turn them back on once they understand the bug and want to write the fix and the test.

The honest admission

Yes, sometimes a senior engineer types faster than they prompt. The cases are usually short, focused, and well-understood. They are not the cases where AI's productivity multiplier is supposed to dominate, and the multiplier does not dominate in those cases. The discipline of vibe coding includes recognizing when to drop out of the workflow and just type, and the engineers who refuse to drop out lose time to the workflow they are religiously committed to.

Vibe coding wins

Greenfield features. Multi-file changes. Anything where the agent can read the existing code and produce code that fits. Boilerplate-heavy work. Test scaffolding. Documentation. Refactors that span multiple files. Tasks where the human does not yet know the exact shape of the answer. Work that has clear specifications and conventional patterns. Anything where the typing is more than a couple of minutes.

Human-typing wins

Single-character bug fixes. The three-line tweak you already know is right. The print statement at line 47 and only line 47. The variable rename you can do faster in your editor's refactor tool. The interactive debugging session where the agent's interpretation creates noise. The ten-second edit where the prompt-and-review cycle takes a minute. The case where you are deep in flow and a context switch costs more than the edit.

The Tradeoff Matrix

The decision of when to vibe-code and when not to is not binary. It is a function of three dimensions, and learning to think along all three is the difference between an engineer who applies the workflow correctly and one who applies it religiously. The dimensions are stakes, novelty, and domain. Each dimension has roughly three levels, which gives a 27-cell decision space, and the decision space breaks down into three rough zones: full vibe coding, hybrid, and hand-write.

The three dimensions

Stakes is the cost of being wrong. Low stakes: a side project, a prototype, an internal tool that one person uses. The cost of a bug is "we fix it next time." Medium stakes: a SaaS product with paying customers, where bugs cost reputation and churn but not lawsuits. High stakes: regulated industries, security primitives, performance-critical systems, anything where a bug shows up as an incident, a fine, or a recall.

Novelty is how well-trodden the work is. Well-trodden: CRUD apps, marketing sites, conventional integrations, things with thousands of similar examples in any training data. Mixed: less common patterns, unusual tech stacks, integrations with quirky internal services. Cutting-edge: novel research, new protocols, work that has not been done before in any public form.

Domain is the regulatory and operational environment. General: no special compliance requirements, conventional security posture, normal operational expectations. Specialized: domain-specific best practices but no regulator on your back, e.g. e-commerce with PCI scope mostly handled by Stripe, internal tools with normal corporate IT requirements. Regulated: HIPAA, SOX, FINRA, MiFID II, DO-178C, GDPR-covered personal data with serious operational consequences.

1. Place the work on the stakes axis

Ask: what is the cost of a bug here? "We fix it next sprint" is low stakes. "Customers churn and we lose reputation" is medium. "We get fined, sued, recalled, or breached" is high. Be honest about which it is. Most engineers underestimate stakes on their own projects because they are anchored on best-case outcomes.

2. Place the work on the novelty axis

Ask: how many similar projects exist in public form? If you can find ten thousand examples on GitHub, the work is well-trodden. If you can find a few dozen, it is mixed. If you can find none, it is cutting-edge and the AI's training data does not include the answer.

3. Place the work on the domain axis

Ask: what regulatory regime applies? If none, the domain is general. If best-practice frameworks apply but no regulator audits, the domain is specialized. If a regulator can fine, sue, or recall, the domain is regulated. Healthcare, finance, aerospace, legal tech, automotive, energy: regulated. Most of these have specific software standards.

4. Read the matrix and pick the mode

Low stakes plus general plus well-trodden: full vibe coding. Multiplier is highest, downside is lowest. Medium stakes plus general plus well-trodden: still full vibe coding, with disciplined review. High stakes or regulated or cutting-edge along any axis: hybrid mode. Two or three of those at once: heavy human authorship, AI for shell only.

5. Recheck as the work evolves

The cell can shift mid-project. A side project that catches on becomes medium stakes overnight. A research prototype that goes into production crosses domains. The mode you started in may not be the mode you should be in now, and updating the mode based on the actual situation is part of the discipline.

The simple decision rules

Three rules cover most cases. First, low stakes plus general plus well-trodden equals full vibe coding. Run it freely, ship fast, accept the occasional bug as part of the multiplier. Second, high stakes plus regulated plus cutting-edge equals do not vibe-code the load-bearing parts. The AI helps with the engineering shell, the test scaffolding, the documentation, the monitoring. The actual domain logic is human work. Third, anything in between is hybrid: read the next section.
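
The three rules are small enough to write down as code, which is worth doing only because it makes the zones explicit. A minimal sketch, using the level names from this section:

```python
def authorship_mode(stakes: str, novelty: str, domain: str) -> str:
    """Map (stakes, novelty, domain) to an authorship mode.

    Levels follow the text: stakes in {low, medium, high}, novelty in
    {well_trodden, mixed, cutting_edge}, domain in {general, specialized,
    regulated}. Illustrative sketch of the decision rules, not a policy engine.
    """
    red_flags = sum([
        stakes == "high",
        novelty == "cutting_edge",
        domain == "regulated",
    ])
    if red_flags >= 2:
        return "hand-write the core; AI for the engineering shell only"
    if red_flags == 1:
        return "hybrid: AI scaffolds, humans write the load-bearing parts"
    return "full vibe coding, with disciplined review at medium stakes"
```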

Greenfield SaaS, conventional patterns: AI productivity multiplier ~50x
Twenty-year enterprise codebase: AI productivity multiplier ~1.5x
DO-178C Level A safety code: AI productivity multiplier ~1.05x
Novel ML architecture core: AI productivity multiplier ~1.2x
Cryptographic primitive implementation: AI productivity multiplier ~1x

The numbers above are rough but directionally correct. The 50x figure for greenfield SaaS is the optimistic end of what experienced practitioners report for well-trodden work, not a guaranteed baseline. The 1.05x for DO-178C Level A is what you would expect from a domain where the typing was never the bottleneck and the documentation is the dominant cost. Anywhere the human still does most of the actual cognitive work, the multiplier collapses to roughly the productivity gain on the engineering shell, which is small relative to the total project cost.

Hybrid Approaches

The conclusion of all this is that the right strategy in the high-stakes, regulated, cutting-edge zones is not "no AI" but "AI for the parts where AI helps, humans for the parts where humans matter." The discipline is to draw the boundary precisely, in the same project, sometimes in the same file. Done well, the hybrid approach gets most of the multiplier on the parts that benefit from it and zero of the risk on the parts that do not.

The pattern

The pattern is "AI scaffolds, human writes the load-bearing parts." Scaffolding here means the engineering work around the load-bearing logic: the test infrastructure, the build configuration, the deployment pipeline, the monitoring, the dashboards, the documentation generators, the data fixtures, the seed data, the developer tooling. All of this work is conventional engineering, well-represented in the training data, and amenable to the same productivity multiplier any other conventional work gets. The AI handles it.

The load-bearing parts are the ones that fail in expensive ways. The HIPAA-relevant access control. The constant-time crypto compare. The HFT inner loop. The novel ML algorithm. The 20-year codebase's domain model. These are written by humans, with the AI as a research assistant and a code reviewer rather than as a primary author. The human reads the AI's suggestions but writes the final code by hand, line by line, with full understanding of why each line is there.

Specific hybrid patterns worth naming

AI for tests, humans for the security-critical functions. The function that signs the JWT is written by hand. The test that exercises it is generated by AI, possibly by feeding the AI the function and asking for a comprehensive test suite. The test gets reviewed and committed. The function does not get rewritten by the AI even when the AI helpfully suggests changes during the test review.
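
A small sketch of that division of labor, with hypothetical names: the signer and verifier below stand in for the hand-written, human-owned function (key handling is a placeholder; a real system pulls the key from a secrets manager), and the tests under them are the kind of thing the AI can draft and a human reviews without letting the agent touch the function itself.

```python
# Hand-written, human-owned: a minimal signer/verifier. Key handling is a
# placeholder; a real system pulls the key from a secrets manager.
import base64
import hashlib
import hmac

def sign_session_token(payload: bytes, key: bytes) -> str:
    tag = hmac.new(key, payload, hashlib.sha256).digest()
    return (base64.urlsafe_b64encode(payload).decode() + "."
            + base64.urlsafe_b64encode(tag).decode())

def verify_session_token(token: str, key: bytes) -> bytes | None:
    try:
        payload_b64, tag_b64 = token.split(".")
        payload = base64.urlsafe_b64decode(payload_b64)
        tag = base64.urlsafe_b64decode(tag_b64)
    except ValueError:
        return None
    expected = hmac.new(key, payload, hashlib.sha256).digest()
    return payload if hmac.compare_digest(expected, tag) else None

# AI-drafted, human-reviewed: tests exercise the function without rewriting it.
def test_round_trip():
    token = sign_session_token(b"user:42", b"test-key")
    assert verify_session_token(token, b"test-key") == b"user:42"

def test_tampered_token_rejected():
    token = sign_session_token(b"user:42", b"test-key")
    assert verify_session_token(token[:-2] + "AA", b"test-key") is None
```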

AI for documentation, humans for the architecture decisions. The system architecture is decided by humans, with the rationale captured in an architecture decision record. The documentation that describes the architecture for other engineers is drafted by AI and edited by humans. The decision is human; the writeup is AI-assisted.

AI for the engineering shell, humans for the research core. In a research lab, the data loading, the experiment tracking, the visualization, and the cluster scripts are AI work. The novel algorithm is human work. The implementation of the algorithm in the training loop is hybrid: human mathematical content, AI code structure.

AI for the boilerplate, humans for the regulated logic. In a healthcare app, the React components, the database migrations for non-PHI tables, the build configuration, and the marketing site are AI work. The PHI-handling code, the audit log writes, the access control checks, and the consent management are human work, with audit-trail documentation captured as the code is written.

Pure AI authorship in regulated domain

Code that compiles and tests pass, but the audit trail is missing or fabricated. Risk analyses do not match the implementation. Security primitives that look correct and have subtle vulnerabilities. Domain logic that violates undocumented constraints. The first audit fails. The first incident is silent until it is loud. The team scrambles to retrofit traceability after the fact, which is harder than capturing it during development.

Hybrid authorship in regulated domain

Engineering shell built fast by AI: tests, build, docs, monitoring, deployment. Load-bearing logic written by humans with audit trail captured inline. Risk analysis written first, then code traces back to it. Security primitives use battle-tested libraries with AI wiring around them. Domain logic written by senior engineers who know the tribal knowledge. The audit passes. The incident does not happen. The team ships faster than pure-human work, slower than pure-AI work, and survives.

The discipline of the boundary

The hybrid approach only works if the boundary is clear and the team holds it. The failure mode is "the AI started on the test, kept going into the test setup, kept going into the helper function the test setup needed, kept going into the access control utility the helper function called, and now the AI has written the access control without anyone noticing." Each step is a small drift. The cumulative drift is across the boundary, into the load-bearing logic, in a piece of code labeled "test setup helper" that is now governing who can see what.

The fix is to label the boundary explicitly in the codebase. A directory called internal/regulated/ that the AI is told not to author in. A naming convention on functions that mark them as load-bearing. A code review process that flags any AI-authored change to a marked file. The boundaries are conventions; the discipline is keeping them.
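
One way to make that boundary machine-checkable, sketched as a hypothetical CI step: the protected paths, the commit-message marker, and the git invocation are conventions a team would adapt, not a standard tool.

```python
#!/usr/bin/env python3
"""Hypothetical CI guard: fail if a change touches protected paths without
an explicit human-review marker in the commit message. Paths and the marker
are illustrative conventions, not a standard."""
import subprocess
import sys

PROTECTED_PREFIXES = ("internal/regulated/", "src/auth/", "src/crypto/")
REVIEW_MARKER = "Human-Authored: yes"

def main() -> int:
    changed = subprocess.run(
        ["git", "diff", "--name-only", "origin/main...HEAD"],
        capture_output=True, text=True, check=True,
    ).stdout.splitlines()
    message = subprocess.run(
        ["git", "log", "-1", "--format=%B"],
        capture_output=True, text=True, check=True,
    ).stdout
    touched = [f for f in changed if f.startswith(PROTECTED_PREFIXES)]
    if touched and REVIEW_MARKER not in message:
        print("Protected paths changed without the human-review marker:")
        for f in touched:
            print(f"  {f}")
        return 1
    return 0

if __name__ == "__main__":
    sys.exit(main())
```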

Closing

The most useful skill in 2026 is knowing where AI helps and where it hurts. The hype assumes AI is a universal tool, applicable everywhere, with the multiplier holding equally across every domain. It is not. The serious practitioner picks the right tool for the job, even when the right tool is "type it yourself." The future is not AI-only. It is humans who fluently choose between AI assistance and direct work, with the boundary chosen on the actual properties of the domain rather than on ideology or fashion.

This curriculum has covered when AI helps. The other topics laid out the tools, the workflows, the prompting techniques, the agent instruction files, the context engineering, the debugging strategies, the code review practices for AI output. All of that is real and useful. None of it is universal. This page closes the curriculum by covering the cases where the workflow does not apply, and the takeaway is the same one a senior engineer would give about any tool: know what it is for, know what it is not for, and pick it accordingly.

The honest practitioner reads the curriculum, internalizes the multiplier on the parts where it applies, and respects the limits on the parts where it does not. The believer reads the same curriculum, applies the multiplier everywhere, and ships incidents in regulated industries. The difference is calibration, and calibration is what this page is for. Bookmark it. Reread it when you find yourself reaching for the AI tool in a domain you have not thought hard about. The cost of being wrong about the boundary is much higher than the cost of taking thirty seconds to check which side of the boundary you are on.

Vibe coding is a tool, not an ideology. The cases above are not edge cases; they are real domains where the workflow that ships SaaS in a week ships incidents in regulated industries. Knowing the difference is the skill. The rest of the curriculum teaches you how to use the tool. This page teaches you when not to. Both halves are needed. The practitioners who internalize both are the ones who actually become good at this, not just the ones who are visible on social media talking about it.