AI agents shine on greenfield code. Empty repo, modern stack, fresh dependencies, no history to respect. The agent generates a Next.js app with Tailwind and Postgres in twenty minutes and most of it works. The demos are real, and the productivity numbers are real. Then someone asks the same agent to fix a bug in a 2010 Rails application that was written by people who left the company in 2014, layered with patches by their successors, and held together by deployment scripts that nobody has read in eight years. The agent does not shine. It struggles, and it sometimes makes things worse, and the team that thought it was about to ten-x its velocity ends up wondering whether agentic dev was overhyped.
It was not overhyped. The agent's failure on legacy is not a failure of the agent. It is a structural problem about what legacy is and what training data is. Legacy code is, by definition, code that has accumulated context the model has not seen. Tribal knowledge that lives in nobody's documentation. Idiosyncratic patterns that one team chose for reasons that were valid at the time and are now unwritten. Outdated framework versions whose idioms have drifted from the modern conventions the model has internalized. The model averages across what it has seen, and what it has seen is mostly recent open-source code in mainstream patterns. Your 2010 Rails app is not in that average, and the agent's first attempt to read it produces output that is plausible for a generic Rails 7 app and wrong for your specific Rails 4 app.
Knowing why is the lesson. Once you understand that legacy is hard for AI because legacy is by definition out of distribution, the workarounds become obvious. They are not magic. They are the same things you would do to onboard a new senior engineer who has never seen the codebase: walk them through the system, write down the tribal knowledge, give them a safe place to make small changes, build up their context over weeks of deliberate work. The discipline of onboarding the agent is the discipline of legacy work in the AI era. Teams that do it well ship steadily on legacy code with agent help. Teams that skip it watch the agent break things and conclude AI does not work on legacy. Both observations are true. Only the first one is useful.
Why Legacy Is Hard for AI
The first thing that makes legacy hard is tribal knowledge. Tribal knowledge is the set of facts about a system that exist nowhere on disk. The build only works after a specific dance involving environment variables that have to be set in a particular order. This function looks innocuous but is special-cased for one customer who pays seven figures a year. That deployment script appears to be dead code, but is actually the production hot path because nobody updated the docs when the rewrite happened. None of these facts are in the code. They are in the heads of the people who have worked on the system, and when those people leave the company, the facts leave with them unless someone wrote them down. Almost nobody writes them down.
The agent cannot see tribal knowledge. The agent can read the code. The agent can read the docs you have. The agent cannot read the email thread from 2017 where the original architect explained why the order-processing service has to retry exactly three times with exponential backoff. So the agent's mental model of the system is missing the explanations. It can describe the code accurately and still misunderstand the system, because the code does not encode the why. When the agent then proposes a change, the change can be locally correct and globally wrong. The retry count gets reduced from three to one because the agent thinks three is excessive. The customer experience breaks because a downstream service was coded against the known three-attempt retry behavior. The bug is not in the code the agent wrote. The bug is in the gap between the code and the unwritten constraints around the code.
The second thing is undocumented patterns. Every team that has worked on a codebase for years has developed its own idiosyncratic style. Maybe they always wrap database calls in a particular helper that does retry plus logging plus metrics in a custom way. Maybe they have a naming convention where service classes end in `Manager` and use-case classes end in `Service`, the opposite of the more common pattern in the wider Ruby community. Maybe they have a way of structuring tests that involves a custom DSL one engineer built in 2013 that nobody has touched since. These patterns are not wrong. They are just different from the median patterns in the agent's training data. When the agent writes new code in the codebase, it writes in the median style, and the new code does not look like the existing code. The team has to constantly correct the style or accept a codebase that grows two visual languages over time.
The third thing is dead code. A surprising share of any legacy codebase is unused but indistinguishable from used code at first read. There are functions that were called from code that has since been deleted but which still exist because nobody wanted to take the risk of removing them. There are entire modules that were part of a feature that got rolled back five years ago and never properly cleaned up. There are configuration flags that toggle code paths that are no longer reached. Dead code is invisible to the agent unless someone tells it which code is dead. The agent sees a function with a clean signature, well-named parameters, and a clear body, and it might use that function in a new feature, only to find that the function has been broken for two years because nothing was calling it and nobody noticed.
The fourth thing is outdated framework versions. The agent has seen a lot of Rails. Most of what it has seen is Rails 6 and Rails 7. Rails 4 is in the training data, but it is a smaller share, and the patterns the agent uses by default are the modern patterns. When the agent writes code for a Rails 4 app, it sometimes uses Rails 5+ idioms that did not exist yet, or it uses Rails 7 conventions that break on Rails 4 at runtime. The same thing happens with Spring Boot 1.x versus Spring Boot 3.x, with Django 2 versus Django 5, with Java 8 versus Java 21. The agent's default is the modern version. The legacy codebase is not on the modern version. Without explicit instruction, the agent will write code that does not work in your version because it does not know what your version is.
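To make the drift concrete in one language, here is a hypothetical sketch (the class and names are invented for illustration) of the shape an agent tends to produce by default, next to the shape an older toolchain will actually compile:

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Illustrative only: the same logic in the agent's default modern shape and in a
// Java 8-compatible shape. Left to its defaults, an agent often produces the first
// form even when the manifest pins the project to Java 8.
class VersionDriftExample {

    // Modern form (needs Java 16+: records, List.of, var, switch expressions):
    //
    //   record LineItem(String sku, int quantity) {}
    //   var items = List.of(new LineItem("sku-1", 2));
    //   String label = switch (items.size()) { case 0 -> "empty"; default -> "non-empty"; };

    // Java 8-compatible form of the same thing:
    static final class LineItem {
        final String sku;
        final int quantity;
        LineItem(String sku, int quantity) { this.sku = sku; this.quantity = quantity; }
    }

    static String label() {
        List<LineItem> items =
                Collections.unmodifiableList(Arrays.asList(new LineItem("sku-1", 2)));
        return items.isEmpty() ? "empty" : "non-empty";
    }
}
```

The modern form is not wrong in general. It is wrong for the pinned version, and nothing in the task description tells the agent that unless you do.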
The fifth thing is custom in-house tools. Every legacy codebase has a few of these. A homegrown ORM somebody wrote in 2009 because the available ORMs at the time were not good enough. A custom test framework that predates the language's modern testing conventions. A deployment script that wraps Capistrano in a way that does something specific to the company's infrastructure. The agent has never seen these tools because they are not in any open-source corpus. The agent will read the code, recognize that it looks like an ORM or a test framework, and assume the API is similar to a public tool it knows. The assumption is wrong. The wrong assumption produces wrong code.
The sixth thing is the COBOL problem, which deserves its own paragraph because it is a real and large category. There are still significant systems running on COBOL, on mainframes, with build pipelines and runtime environments that have nothing to do with modern stacks. The agent's COBOL knowledge is real but limited. The agent's knowledge of mainframe environments is small. When you ask an agent to help with a COBOL program that runs on z/OS and processes batch jobs from JCL, the agent can read the code but its mental model of the runtime environment is shaky. This is the extreme end of the legacy spectrum. The pattern is the same as Rails 4 versus Rails 7, but the gap between training data and reality is wider, and the consequences of the gap are more severe because the agent has fewer reference points to fall back on.
The seventh thing, and maybe the most underrated, is that legacy codebases are large. A 2010 Rails app that has been actively developed for fifteen years can easily run to a million lines of code. The agent's context window holds a fraction of that. The agent cannot read the whole codebase. It can only read the files you give it, plus what it can find by tool calls. Choosing which files to give it, and in what order, is a skill. The default of "give it the file you want changed" is not enough on legacy because the file you want changed has callers and tests and configuration and conventions that the agent cannot see if you do not surface them. The agent's productivity on legacy is bottlenecked by your ability to put the right context in front of it.
The Codebase Tour Pattern
Before any change to a legacy codebase, the agent should build a mental map. Not a complete map, because complete maps of large codebases do not fit in any context window. A working map: enough to know where things are, what the major shapes are, what depends on what. The map gets built by reading specific files in a specific order, and the order matters because each file informs how to read the next.
The first file is the manifest. In Node it is `package.json`. In Java it is `pom.xml` or `build.gradle`. In Python it is `requirements.txt` or `pyproject.toml`. In Ruby it is `Gemfile`. The manifest tells the agent the stack: which language, which framework, which versions, which dependencies. This is the highest-leverage information you can give. Everything else the agent reads will be interpreted in light of the manifest. If the manifest says Rails 4.2 with Ruby 2.3, the agent knows to expect Rails 4 idioms and to avoid Rails 5+ syntax. If the manifest says Spring Boot 1.5 with Java 8, the agent knows the application context will use XML or older Java config and not the newer functional bean style. The manifest is the disambiguator that prevents the agent from defaulting to modern idioms in an old codebase.
The second file is not a file but a directory listing. Show the agent the top-level structure. In a Rails app, the standard directories are app, config, db, lib, test or spec, vendor. In a Django project, you have settings, apps, templates, static, migrations. In a Spring Boot app, you have src/main/java with the package hierarchy. The directory listing tells the agent the layout: where models live, where controllers live, where the configuration is. The agent can then make better guesses about where to look for things, and you do not have to spell out every path.
The third thing is entry points. Where does execution start? In a web app, that means the routes file or the URL conf. In a CLI tool, the main function. In a worker service, the queue consumer registration. The entry points anchor the agent's mental model in actual execution paths instead of static file structure. Two codebases with identical directory layouts can have entirely different runtime behavior because the entry points differ. Reading the entry points first prevents the agent from constructing a mental model that is structurally accurate and dynamically wrong.
The fourth thing is the file you are about to change, plus its callers, plus its tests. The file itself is obvious. The callers tell the agent how the file is used and what assumptions other code makes about it. The tests tell the agent what behaviors are expected and which are tested versus untested. Changing a function without seeing its callers is dangerous because the agent might change the signature in a way that breaks the callers. Changing a function without seeing its tests is dangerous because the agent might change behavior the tests depend on. Both kinds of breakage are common when the tour skips these steps.
The tour, in order:

1. The manifest. `package.json`, `pom.xml`, `requirements.txt`, `Gemfile`, `build.gradle`, `pyproject.toml`, whichever applies. The manifest pins the stack and the version, which prevents the agent from defaulting to modern idioms in an old codebase.
2. The directory structure. One level deep, maybe two. The shape of the project tells the agent where to find things and what conventions to expect. `ls` or `tree` output is enough.
3. The entry points. Routes, main functions, queue handlers, scheduled jobs. Where does execution actually start? The entry points anchor the static structure in dynamic behavior.
4. The target file. The file you are about to change. Read it once end-to-end so the agent has the full local context, not just the snippet you are touching.
5. Callers and tests. Find the places that call into the file. Find the tests that exercise it. Both shape what changes are safe and what changes will break things downstream.
The tour pattern is not optional. It is what separates agent work that produces correct changes from agent work that produces plausible-looking changes that break things in production. Skipping the tour saves five minutes and costs you an hour of debugging when the change breaks a caller you did not know existed. Doing the tour costs five minutes and saves you the debugging. The math is not subtle. The discipline is making the tour the default, even when the change feels obvious.
The agent will sometimes do the tour without being asked. If you are using Claude Code with a well-configured project, the agent will read the manifest, scan the directory, and find the relevant files before making changes. This is good behavior and you should not interrupt it. If your agent is not doing this, you can add the tour pattern to the project's instruction file, telling the agent that before any non-trivial change it should read those files in order. The instruction file is the place where you encode the workflow you want the agent to follow on every task in this project.
One specific anti-pattern worth flagging: the agent reads only the file you mentioned, makes changes, and stops. The output looks correct. The change works in isolation. It breaks something three files away because the agent did not read those files. The fix is to require the agent to surface the files it read before making changes. If the file list is too short, you tell it to read more. The process feels slow at first and gets fast with practice. The slowness sits on the cheap side of the work: slow before the change, fast during and after it. Skipping the tour is fast before the change and slow during cleanup.
Incremental Documentation Generation
Legacy codebases rarely have good documentation. The docs that exist are often outdated, sometimes by years. The team that wrote them moved on, the system changed, and nobody kept the docs in sync. You cannot trust the docs you have. You also cannot rewrite all the docs at once, because the cost is enormous and the value is uncertain. The middle path is incremental documentation: every time you (or the agent) work on a part of the system, append what was learned to a notes file. Over weeks and months, the notes file becomes the documentation that should have existed all along, and the agent's effective knowledge of the codebase compounds.
The mechanic is simple. End every session by asking the agent to summarize what it learned. The summary lands in a file. The file is named something like AGENTS.md or CLAUDE.md, depending on the convention the agent supports. The next session, the agent reads the file at the start, and now it has the previous session's findings as context. The work that took two hours of exploration the first time takes ten minutes the next time, because the exploration is captured. Repeat across many sessions and the file becomes the project's effective documentation.
The format of the file matters. Long-form prose is bad. The agent has to read the whole thing every session, and prose does not surface specific facts efficiently. Bullet lists work better. Each bullet is a fact: "the User model has a custom soft-delete that lives in the deleted_at column, not in a separate table." "The build requires DATABASE_URL to be set before bundle install or it fails with a cryptic error." "The Sidekiq workers run on a separate Redis instance from the cache, named REDIS_QUEUE_URL." Each bullet is queryable by grep, by the agent reading the file, or by a human looking something up.
The "explain it back" prompt is the workhorse for generating the bullets. After the agent has done some work, you ask it to summarize what it discovered about the system. The summary surfaces the implicit knowledge the agent built up while doing the work. Most of that knowledge would otherwise be lost when the session ends. The summary captures it. Some of the summary will be obvious or already documented; that part you ignore. Some will be new, and that part lands in the notes file. The filtering is fast because you know your codebase and can spot the new bits at a glance.
The compounding payoff of the documentation flywheel: month one the agent is slow because it knows nothing about your codebase. Month two it knows a few things and is faster on tasks that touch them. Month six the file has hundreds of bullets and the agent can work in your codebase nearly as well as a senior engineer who has been there for years. The investment is fifteen minutes per session. The return is months of accumulating context that pays back on every future task.
One specific pattern that helps: when the agent encounters something surprising, capture it immediately rather than waiting for the end of the session. "Surprising" means anything that took more than a few minutes to figure out, anything that contradicted an initial assumption, anything that turned out to be a special case. The surprise is the signal that the knowledge is worth capturing because it is not obvious from the code. If you wait until the end of the session, some of these surprises will be forgotten or compressed into vague summaries. Capture them when they are fresh.
The notes file should be in the repo, committed, reviewed in pull requests like any other artifact. Keeping it outside the repo means it does not stay in sync with the code, which defeats the purpose. Keeping it in the repo means future you, future teammates, and the agent all have access to the same knowledge base. The cost of putting it in the repo is zero. The cost of not putting it in the repo is that the knowledge becomes another piece of tribal knowledge that lives only on one machine.
The structure of the file evolves as the project does. Early on, it is a flat list of bullets. Once the list has fifty entries, you start grouping by area: database, deployment, auth, billing. Once it has two hundred entries, you might split into multiple files: AGENTS.md for the index, plus per-area files. The structure does not have to be planned upfront. Let it grow organically and refactor when it gets unwieldy. The agent can help with the refactoring, because reorganizing notes is exactly the kind of mechanical work agents are good at.
The other category that belongs in the notes file is anti-patterns. "Do not call the User#destroy method directly; use User.soft_delete because there are five callbacks that depend on the soft-delete path." "Do not run migrations against production directly; the deployment script wraps them with the lock-acquisition logic." Each anti-pattern is a landmine you mark on the map. The agent reads the file and avoids the landmines. Without the file, the agent has no way to know the landmines exist until it steps on one.
Refactoring Patterns That Work With AI
The standard refactoring patterns that work in legacy code by hand also work with AI assistance. The patterns are not new. The agent's role is to do the mechanical parts of the patterns faster, while the architectural choices stay with the humans. The two patterns that pay back most consistently are the strangler fig and branch-by-abstraction. Both are designed for systems where the goal is to replace something gradually without breaking anything during the replacement, which is exactly the legacy situation.
The strangler fig pattern, named after the strangler fig tree that grows around a host tree and eventually replaces it, is the standard approach for replacing a legacy service with a new one. The idea is to put the new implementation in place alongside the old one, route traffic to the new one for a small set of cases, and gradually expand the routing until the old implementation handles nothing and can be deleted. The key property is that the system is always working: at every moment, every request either goes to the old code (which works) or the new code (which has been verified for the cases it handles). There is never a flag day where you flip a switch and hope.
The agent's role in the strangler fig is to generate the new implementation alongside the old, generate the routing logic that decides which requests go where, generate the tests for both implementations, and generate the migration scripts for any data movement. The agent is good at all of these because they are well-scoped tasks with clear definitions. The hard part of the strangler fig is the architectural decisions: where to draw the boundary between the new and old implementations, what shape the new implementation should have, when to expand the routing. These stay with humans. The agent does the typing, you do the design.
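A minimal sketch of what that routing layer can look like, assuming nothing about the real system beyond a generic request and response type; every name in it is hypothetical:

```java
import java.util.function.Function;
import java.util.function.Predicate;

// Strangler-fig routing sketch. Requests that match the rollout predicate go to the new
// implementation; everything else stays on the legacy path, so the system keeps working
// while the rollout widens one cohort at a time.
class StranglerRouter<Req, Res> {

    private final Function<Req, Res> legacy;
    private final Function<Req, Res> rewritten;
    private final Predicate<Req> routeToNew;   // e.g. an allowlist of customer ids

    StranglerRouter(Function<Req, Res> legacy,
                    Function<Req, Res> rewritten,
                    Predicate<Req> routeToNew) {
        this.legacy = legacy;
        this.rewritten = rewritten;
        this.routeToNew = routeToNew;
    }

    Res handle(Req request) {
        return routeToNew.test(request) ? rewritten.apply(request) : legacy.apply(request);
    }
}
```

Expanding the rollout is then a change to the predicate, not to either implementation, and turning it off entirely sends every request back to the code that has always worked.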
Branch-by-abstraction is the other major pattern. Where the strangler fig replaces a service, branch-by-abstraction replaces an internal implementation. The mechanic is: identify the thing you want to replace, introduce an interface or abstraction in front of it, refactor existing code to depend on the abstraction rather than the concrete implementation, then build a new implementation behind the same abstraction, then swap. The original implementation and the new one coexist behind the abstraction, with a flag or configuration choosing which one runs at any moment. Once the new one is verified, the old one is deleted.
The agent is well-suited to branch-by-abstraction because the refactoring is mechanical: introduce the interface, change the dependencies, generate the new implementation. The hardest part is choosing the right abstraction. Too narrow and you have to revisit; too broad and you are designing in advance for needs that might not exist. The agent can suggest abstractions and you can evaluate them, but the choice should be deliberate, not delegated. The choice is the leverage point. Get it right and the refactor is mechanical. Get it wrong and the refactor stalls because the abstraction does not actually carve the code at the joints.
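The mechanical half looks something like the sketch below, with a deliberately small invented example (a tax calculator) standing in for whatever internal implementation is being replaced:

```java
// Step 1: the abstraction the rest of the code will depend on.
interface TaxCalculator {
    long taxCents(long subtotalCents, String region);
}

// Step 2: the existing logic, moved behind the abstraction unchanged. Callers are
// refactored to depend on TaxCalculator instead of on this class directly.
class LegacyTaxCalculator implements TaxCalculator {
    @Override
    public long taxCents(long subtotalCents, String region) {
        // ...original implementation, preserved verbatim (placeholder body here)...
        return Math.round(subtotalCents * 0.0825);
    }
}

// Step 3: the replacement, built and verified behind the same abstraction before any swap.
class RewrittenTaxCalculator implements TaxCalculator {
    @Override
    public long taxCents(long subtotalCents, String region) {
        // ...new implementation (placeholder body here)...
        return Math.round(subtotalCents * 0.0825);
    }
}
```

Step 4, the swap, is just a question of which implementation gets constructed, which is where the feature flag sketched a few paragraphs down comes in.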
Big-bang rewrites still work in some cases. If the legacy system is small enough to rewrite in a few weeks, the strangler-fig overhead may not be worth it. The pattern is: write the new system in parallel, achieve feature parity, switch over, decommission the old. The risk is that the new system takes longer than expected (it always does) and you end up maintaining two systems for longer than planned. The pattern is also worse than strangler fig for systems with active development, because every feature the old system gains during the rewrite has to be ported to the new one, and the porting cost grows with time. For mostly-static legacy systems with bounded scope, big-bang rewrites are still viable. For active systems, the strangler is almost always better.
The agent's role across all these patterns is the same. Generate code, generate tests, generate the boilerplate that connects the pieces. The agent does not pick the pattern. The agent does not decide which functions belong on which side of the abstraction. The agent does not judge when the new implementation is ready to handle more traffic. Those are human calls. The combination of human-driven architecture and agent-driven typing is what makes large refactors tractable in a way they were not five years ago. The architecture work is unchanged. The typing work is much faster. The total time drops without the architectural quality dropping.
One specific tactic that helps on big refactors: keep a refactor journal in the repo. Every step you complete, write a one-liner about what changed and what is next. The journal serves as a recovery mechanism if you lose the thread for a week, and as a coordination mechanism if multiple people are working on the refactor. The agent can write the journal entries based on the changes you make, so the cost of maintaining it is near zero. The benefit is that the refactor becomes resumable, which is the difference between a refactor that finishes and a refactor that gets abandoned at sixty percent.
Another tactic: feature flags for new code paths. Even when the routing is internal, wrap the choice between old and new behind a flag. The flag lets you turn off the new path immediately if it misbehaves, without a deploy. The agent can generate the flag-checking code as part of the implementation. The cost is a few minutes. The value is the ability to roll back without a code change, which on a production system is the difference between a five-minute incident and a thirty-minute incident.
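Continuing the tax-calculator sketch from above, the flag-guarded selection can be this small; the flag store is a hypothetical stand-in for whatever flag system the project already uses:

```java
// Hypothetical stand-in for the project's existing flag system
// (LaunchDarkly, a config table, an environment-variable wrapper, etc.).
interface FlagStore {
    boolean isEnabled(String flagName);
}

// Flag-guarded selection between the old and new code paths. The flag is read at call
// time, so turning it off rolls the new path back immediately, without a deploy.
class TaxCalculatorSelector {

    private final TaxCalculator legacy = new LegacyTaxCalculator();
    private final TaxCalculator rewritten = new RewrittenTaxCalculator();
    private final FlagStore flags;

    TaxCalculatorSelector(FlagStore flags) {
        this.flags = flags;
    }

    TaxCalculator current() {
        return flags.isEnabled("billing.use-rewritten-tax-calculator") ? rewritten : legacy;
    }
}
```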
Test Backfill (Giving the Agent a Safety Net)
Legacy code rarely has the tests it should. Either there are no tests, or the tests that exist test the happy path and ignore the edges, or the tests are flaky and the team has stopped trusting them. None of these conditions allow the agent to refactor with confidence. Without tests, every change is a gamble: the change might work, or it might break behavior that nobody tested but that some downstream system depends on. The way out of this is to backfill tests before refactoring, using a specific pattern called characterization testing.
The characterization test does not test what the code should do. It tests what the code currently does, including any bugs. The point is to capture the current behavior so that the refactor preserves it. If the refactor changes the behavior, the test fails, and you have a decision: was the change intentional (in which case update the test) or accidental (in which case fix the refactor). Without the test, the change is invisible, and an accidental behavior change ships to production where it can break customers who depended on the buggy behavior.
The agent is excellent at writing characterization tests when given examples of the function's actual usage. The pattern is: pick a function you are about to refactor, find calls to it in the codebase or in production logs, capture the inputs and outputs of those calls, ask the agent to write tests that assert the captured outputs for the captured inputs. The agent does this fast and accurately. The tests do not have to be elegant. They can be a list of input-output pairs with assertions. Elegance is not the goal. Coverage of actual behavior is.
```java
// Characterization test pattern (Java, JUnit 5)
import static org.junit.jupiter.api.Assertions.assertEquals;
import java.util.List;
import org.junit.jupiter.api.Test;

class OrderProcessorCharacterizationTest {

    private final OrderProcessor processor = new OrderProcessor();  // hypothetical class under test

    @Test
    void processOrder_currentBehavior_capturedFromProduction() {
        Order order = new Order("cust-123", List.of(new Item("sku-1", 2)));
        OrderResult result = processor.processOrder(order);

        // Captured from prod log entry on 2024-03-15:
        assertEquals("PROCESSED", result.status());
        assertEquals(1, result.warningCount());
        assertEquals("DEPRECATED_SKU_FORMAT", result.warnings().get(0).code());
    }
}
```
The example shows the pattern. The test name says "currentBehavior_capturedFromProduction" because that is what is being asserted: the current behavior, not the ideal behavior. The assertion captures both the happy-path result (status PROCESSED) and the quirky details (a warning about deprecated SKU format) that might not be intentional but are part of the system's actual behavior today. If a refactor would change either of these, the test catches it.
The source of the test cases matters. The best source is production logs, because real production traffic represents real production behavior, including the inputs that nobody anticipated. The second-best source is existing tests, even if they are weak. The third-best source is the agent's guesses about what inputs the function should handle. The third source is the most common because it is easiest, but it is also the least valuable because the agent is producing inputs that match a typical mental model of the function, which is exactly the model that may diverge from the actual behavior.
Tools like ApprovalTests and snapshot-testing frameworks automate the characterization test pattern. You run the function, capture the output, save it to a file, and the test asserts that future runs produce the same output. When the output changes, the test fails and shows you the diff, and you decide whether to accept the new output or revert the change. The mechanic is the same as the manual pattern but faster, and the agent can drive the tooling.
The volume of tests matters more than the polish. A hundred ugly characterization tests beat ten elegant unit tests for refactoring purposes, because the ugly tests cover more of the actual behavior surface. The agent can write a hundred ugly tests in an hour. Writing them by hand takes a day. The cost-benefit favors the agent doing the bulk work, with you spot-checking the tests for obvious nonsense. The tests do not need to be beautiful. They need to exist and run.
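One cheap way to get that volume is to turn captured input-output pairs into a table. The sketch below stays in the same hypothetical domain as the earlier example; the rows are invented for illustration and would in practice be copied from logs or from running the current code:

```java
import static org.junit.jupiter.api.Assertions.assertEquals;

import java.util.List;
import org.junit.jupiter.params.ParameterizedTest;
import org.junit.jupiter.params.provider.CsvSource;

// Bulk characterization tests as a table of captured input-output pairs.
class OrderProcessorBulkCharacterizationTest {

    private final OrderProcessor processor = new OrderProcessor();  // hypothetical class under test

    @ParameterizedTest
    @CsvSource({
        // customerId, sku, quantity, expected status, expected warning count
        "cust-123,sku-1,2,PROCESSED,1",
        "cust-456,sku-9,0,REJECTED,0",
        "cust-789,LEGACY-7,1,PROCESSED,2",
    })
    void processOrder_matchesCapturedBehavior(String customerId, String sku, int quantity,
                                              String status, int warningCount) {
        OrderResult result = processor.processOrder(new Order(customerId, List.of(new Item(sku, quantity))));
        assertEquals(status, result.status());
        assertEquals(warningCount, result.warningCount());
    }
}
```

Adding coverage is then a matter of appending rows, which is exactly the kind of bulk work the agent does well.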
Once the tests are in place, the refactor proceeds with confidence. Every change is verified against the test suite. If the suite stays green, the change preserved behavior. If the suite goes red, the change altered behavior and you have to decide whether the alteration was intentional. The tests are the safety net that allows the agent to make changes without you having to read every line for fear of breaking something invisible. The safety net is what unlocks agent-driven refactoring on legacy code. Without it, every change is risky and the agent's velocity does not pay back the risk.
Migration Strategies
Migrations come in three categories: framework upgrades, language migrations, and database engine moves. Each has its own pattern, and the AI's role is similar across them: generate the mechanical changes, run the tests, isolate the failures, iterate. What does not change is the discipline of small steps. Every successful migration on a legacy codebase happens in small, verifiable increments. Every failed one tries to do too much at once.
Framework upgrades are the most common. Rails 4 to 5 to 6 to 7. Django 2 to 3 to 4 to 5. Spring Boot 1.x to 2.x to 3.x. Java 8 to 11 to 17 to 21. The pattern is: upgrade one major version at a time, run the full test suite after each, fix the breakages, repeat. The agent is helpful at each step because the breakages are usually well-documented (the framework's own upgrade guide lists deprecations and their replacements) and the agent can apply the documented replacements faster than a human can. The agent can also detect uses of deprecated APIs and propose replacements proactively, before they break in the next upgrade.
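One concrete instance of a documented replacement, from the Spring Boot 1.x-to-2.x upgrade path: `WebMvcConfigurerAdapter` was deprecated and later removed, and the documented fix is to implement `WebMvcConfigurer` directly. The config class below is illustrative, but the edit is exactly the kind of mechanical, guide-driven change the agent applies consistently across a codebase:

```java
import org.springframework.context.annotation.Configuration;
import org.springframework.web.servlet.config.annotation.CorsRegistry;
import org.springframework.web.servlet.config.annotation.WebMvcConfigurer;

// Before (Spring Boot 1.x / Spring 4):
//   @Configuration
//   public class WebConfig extends WebMvcConfigurerAdapter { ... }
//
// After (Spring Boot 2.x+ / Spring 5): implement the interface directly; the method
// bodies stay the same, only the superclass-versus-interface relationship changes.
@Configuration
public class WebConfig implements WebMvcConfigurer {

    @Override
    public void addCorsMappings(CorsRegistry registry) {
        registry.addMapping("/api/**");   // unchanged body
    }
}
```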
The trap in framework upgrades is skipping versions. "We are on Rails 4, let's go straight to Rails 7" sounds tempting because a series of single-version upgrades feels slower than one big jump. It is not slower. It feels slower but finishes faster, because each one-step upgrade has well-documented gotchas and the framework's own tooling supports it. The version-skipping jump has no documented path, no tooling, and a combinatorial explosion of incompatibilities. The pattern is small steps, run tests, never combine two migrations.
Language migrations are the second category. JavaScript to TypeScript is the most common. Python 2 to Python 3 is largely behind us but still happens in some codebases. The pattern for both is incremental: convert one file at a time, with the rest of the codebase unchanged, and rely on the build system to maintain compatibility during the migration. TypeScript supports this natively because it can consume JavaScript files alongside TypeScript files. Python 2 to 3 used the `2to3` tool plus the `__future__` imports to ease the transition. The agent is excellent at the mechanical parts: converting `var` to `let`/`const`, adding type annotations, fixing print-statement-versus-print-function differences. The architectural decisions about how strict to be (TypeScript's `strict` mode flags, Python's optional vs required type hints) are still human calls.
Database engine migrations are the third category and the riskiest. Oracle to Postgres. MySQL to Postgres. SQL Server to Postgres. The destination is usually Postgres because Postgres has matured into the default open-source choice. The migration involves schema differences (Oracle's PL/SQL versus Postgres's PL/pgSQL, MySQL's loose typing versus Postgres's strict typing), query differences (Oracle's hierarchical query syntax versus Postgres's recursive CTEs), and tooling differences (different backup tools, different replication tools, different monitoring tools). Tools like AWS DMS or pgloader handle a lot of the mechanical work, and the agent can help with the SQL translation. The hardest part is verifying that data semantics are preserved, especially around things like NULL handling, character encoding, and timezone semantics, which differ subtly between engines.
What works:

- One migration at a time.
- Run the test suite after each step. Commit after each green test run.
- Roll back at the first sign of failure that does not have a clear cause.
- Use the framework's own upgrade tools and follow the framework's documented upgrade guide.
- Capture lessons in the agent instruction file as you go, so the next migration in the same codebase is faster.
- Treat the migration as a series of small, verifiable changes, not as a single big-bang move.

What fails:

- Combining multiple migrations because "we are touching the code anyway."
- Skipping framework versions because the bigger jump feels efficient.
- Running the full migration in a single branch over six weeks, then merging to main and praying.
- Skipping tests because the test suite is slow and the migration "obviously preserves behavior."
- Treating the migration as a refactor opportunity and trying to clean up unrelated tech debt at the same time.

The second list ends with a branch that has diverged from main, conflicts everywhere, and no clear path forward.
The reason small steps work and big-bang migrations fail is risk compounding. Each step has a small probability of breaking something. If you take ten small steps and verify after each, you find each break in isolation, in a state where the rollback is one commit. If you take one big step that combines ten small ones, all ten breaks happen at once, and the rollback unwinds work you wanted to keep alongside work you wanted to discard. The cost of the big-bang failure is much higher than the cost of incremental failure, and incrementality is what lets you actually finish migrations on production systems.
The agent's contribution to migrations is steady. It generates the code changes for each step. It runs the tests and reports failures. It proposes fixes for the failures. It catches uses of deprecated APIs that you would otherwise miss. The contribution is mechanical and additive, not architectural. The architectural work, of choosing which migrations to do in what order and when to stop, stays with the humans. The combination is what makes migrations on legacy systems tractable in a way they were not before agents existed.
One specific case worth flagging: do not combine a framework migration with a feature change. The combination obscures which problem caused which break. If the migration finishes and the feature is broken, you do not know whether the migration broke the feature or the feature change itself is bad. Splitting them, with the migration first and the feature change second on the upgraded version, makes both problems debuggable in isolation. The agent will combine them by default if you ask it to do both at once. Ask it to do them separately. Doing them in sequence is faster than doing them together, even though it feels like it should be slower.
Working Around Tribal Knowledge
Tribal knowledge is the hardest part of legacy work because it does not exist on disk. The agent cannot read it, no matter how good the agent is, because there is nothing to read. The knowledge lives in the heads of the engineers who built and maintained the system. The only way to extract it is to ask them and write down what they say. The "rubber duck the codebase" pattern is the most efficient version of this extraction.
The pattern is: get a senior engineer in a room (or a video call). Have them walk through the system out loud, explaining what each major component does and why it works the way it does. Record the audio. Transcribe it (Whisper or similar tools handle this in minutes). Feed the transcript to the agent. The agent now has access to a chunk of tribal knowledge that did not exist on disk an hour ago. The transcript can be added to the agent instruction file or kept as a separate document the agent reads when working on the affected components.
The walk-through is more efficient than written documentation because the senior engineer does not have to plan it. They just narrate. The narration captures things they would not think to write down, because the act of writing forces a kind of editing that drops everything that feels obvious. In speech, the obvious things come out alongside the non-obvious things, and the obvious things are sometimes exactly the tribal knowledge you needed to capture. The senior engineer says "and of course we always restart the worker before deploying because of the connection-pool issue" and now you have a piece of tribal knowledge that would never have made it into a written doc because the engineer would not have thought it was worth mentioning.
The mechanics:

1. Pick the engineer. Someone who has worked on the system for years. Bonus if they were involved in some of the original architectural decisions. The deeper their context, the more tribal knowledge there is to extract.
2. Run the session. Sixty to ninety minutes. Have them narrate the system, component by component. You ask follow-up questions when something is unclear, but mostly let them talk. Record audio.
3. Transcribe it. Modern speech-to-text is good enough that the transcript is mostly clean. You skim it for errors, fix the technical terms that the model got wrong, and you have a usable document.
4. Feed it to the agent. Add it to the agent instruction file or keep it as a separate doc the agent reads when working in the relevant area. The agent now has tribal knowledge it could not have known otherwise.
5. Repeat with other engineers. Different engineers know different parts of the system. The combined transcripts from three or four engineers cover most of the tribal knowledge that matters. Each session takes ninety minutes. The total investment is small relative to the years of context you are capturing.
Slack, email, and wiki archaeology is the other extraction technique. Many decisions about the system live in chat threads and email exchanges from years ago. The original architect explained why the order processor uses a particular retry strategy in a Slack message in 2018. The decision to switch from MySQL to Postgres was discussed in an email thread in 2020. The choice to use a custom auth library instead of Devise was debated in a wiki page that has not been updated since 2017. All of this is tribal knowledge in archived form. Extracting it takes patience but the volume is finite, and modern search tools (especially Slack's search and email client search) make it tractable.
The output of the archaeology is a collection of decision records. Each record captures a decision the team made and the reasoning behind it. The format does not matter much; the existence does. With the records in place, the agent can answer questions like "why does the order processor retry three times" by reading the record and summarizing it, instead of guessing. The records also help future humans, because the same questions get asked over and over and answering them once durably is better than answering them every time someone new joins.
One specific anti-pattern: trying to extract all the tribal knowledge before doing any work. The extraction is open-ended and you can spend months on it without finishing. The pragmatic version is to extract the knowledge you need for the current task, do the task, and move on. Over time, the cumulative extraction covers most of the system, but you never have to halt work to do it. The extraction is interleaved with the actual work, which means you only pay the extraction cost when there is a real return on it.
Another tactic: when the senior engineer is not available, ask them asynchronously. A short Slack message or email asking "why does X work this way" often gets a clear answer in a few sentences. The answer goes into the notes file. The cost is minutes for them and minutes for you. The value is a piece of context that prevents future bugs. Most senior engineers are happy to share this knowledge if asked, because the alternative is being interrupted with the same questions every six months when someone new tries to figure out the same code. Documenting the answers is a favor to them as much as to the team.
When to Give Up and Start Fresh
Sometimes the right answer is to stop refactoring and rewrite. Knowing when to make that call is one of the harder judgments in legacy work. Make it too early and you waste the existing investment. Make it too late and you waste years on a refactor that was never going to converge. The signals that say "stop refactoring" are recognizable, but they only become obvious in hindsight unless you are watching for them.
The first signal is duration without shipping. You have been on the refactor for six months. Nothing customer-visible has shipped. The refactor itself is in progress, but no value has been delivered. Six months is a long time. If a refactor cannot ship value in six months, it is probably structurally too big to finish at the rate you are working. The alternative is to either accept that the timeline is two years (and plan accordingly) or to acknowledge that the refactor is wrong and pivot.
The second signal is widening scope. The refactor that started as "replace the auth module" became "replace auth and the user model and the session handling and the API routing." Each expansion was justified at the time because the new component depended on something that was also legacy. Eventually, the scope is so wide that the refactor is the entire system, at which point you are doing a rewrite without admitting it. The honest version is to acknowledge that the rewrite has happened in your head, plan it explicitly, and switch from refactor mode to rewrite mode.
The third signal is recurring breakage. Every time you change one component, two others break. The fixes propagate. You spend more time fixing downstream effects than making the actual change. The pattern says the abstractions are wrong: the components are coupled in ways that the existing structure does not capture, and any change ripples through the coupling. The refactor cannot fix this without changing the abstractions, which is itself a rewrite. At that point, the leaf-level refactors are working against the wrong target.
The judgment call is not "is the refactor finished" but "is the refactor converging." Convergence means each step makes the next step easier. Non-convergence means each step exposes new problems that require their own steps. If you are six months in and the next step looks harder than the first step did, the refactor is not converging, and a rewrite-plus-strangler-fig may be the cheaper path despite feeling more drastic.
The cost of a green-field rewrite is real but bounded. You can estimate it. A team of three engineers rewriting a system over six months costs the salary plus opportunity cost of those engineers for six months, which is a number you can put in a budget. The cost of an indefinite refactor is unbounded. The refactor might finish in three more months or three more years; nobody knows. The bounded cost beats the unbounded cost when the alternative is "we will figure it out as we go," because "figure it out as we go" has been true for the last six months and the figuring has not converged.
The rewrite plus strangler fig is the typical pattern. The new system is built green-field, with modern tools, modern frameworks, modern architecture choices. As pieces of the new system come online, traffic is gradually routed from the old system to the new one. The old system is not deleted until the new one has handled all the traffic for some period. The rewrite has the speed advantage of green-field. The migration has the safety advantage of incremental. The combination is the pattern that has shipped most successful legacy replacements in the last decade.
The agent's role in the rewrite is the same as in any green-field project: generate fast, verify, iterate. The agent shines on green-field, which is exactly why a rewrite leverages the agent's strengths better than a refactor of legacy code. The architectural decisions are upstream of the agent's contributions, but the generation itself is meaningfully faster than typing by hand. A team of three engineers with agent assistance can rewrite a system in months that would have taken a year by hand.
The reason teams resist this call is sunk-cost reasoning. Six months of refactor work feels like an investment that should not be discarded. It is an investment, but it is a sunk one. Continuing the refactor does not recover the sunk cost; it adds to it. The decision is between continuing to spend on a refactor that may not converge, versus starting fresh on a rewrite with a known cost and a higher probability of success. Framed that way, the rewrite often wins. Framed as "throwing away six months of work," the refactor wins by default. The framing is the leverage point. Get the framing right and the call follows.
One more nuance: the rewrite-plus-strangler is not always available. Some systems are too coupled to external constraints (regulatory requirements, hardware integrations, vendor lock-in) to rewrite cheaply. For those, the refactor is the only option, and the right answer is to commit to it with realistic expectations about the timeline. Two years is sometimes the right answer for a refactor of a complex regulated system, and the team should plan for two years rather than pretending it will be six months. The pretending is what makes the refactor feel like a failure even when the actual timeline is reasonable for the actual scope.
The Onboarding Frame
The summary of all the patterns above is a single shift in perspective. The agent on a legacy codebase is not a tool you give a task and expect a result. It is a new senior engineer you are onboarding. The same things that work for a new senior engineer work for the agent. The same things that fail for a new senior engineer fail for the agent. The frame changes how you allocate effort: not "I have an agent, the agent should know things," but "I have a new collaborator, my job is to onboard them well."
A new senior engineer needs the same things the agent needs. A tour of the codebase. Documentation about the parts that have tribal knowledge. Tests they can rely on when making changes. A senior engineer to ask questions when stuck. Time to build context before being trusted with major changes. The agent needs all of these, and the discipline is providing them as deliberately as you would for a human hire.
The teams that succeed on legacy with AI are the ones that invest in this onboarding. They spend the first weeks of an agent-assisted legacy project building the AGENTS.md file, doing the rubber-duck walks, capturing the test coverage, mapping out the module boundaries. The first weeks are not high-output. The next months are. The investment compounds because the agent's effective knowledge of the codebase grows over time, and as it grows the agent's productivity grows. By month six, the agent is operating like a senior engineer who has been on the codebase for a year. The investment pays back many times over.
The teams that fail are the ones that expect the agent to "just figure it out." They throw the agent at a legacy codebase, give it a task, and expect output. The output looks plausible because the agent always produces plausible output. The output breaks because the agent does not have the context to produce correct output. The team concludes that AI does not work on legacy. The conclusion is wrong, but it is the natural conclusion from the workflow they tried.
The frame also clarifies what the agent should not be asked to do. A new senior engineer in their first week is not asked to make architectural decisions about the system. They do not have the context. The agent does not have the context either, and asking it to make those decisions produces decisions that are locally plausible and globally bad. Architectural decisions stay with the humans who have the years of context. The agent does the typing, the running of tests, the mechanical refactors, the boilerplate generation. The combination is what works.
The other thing the frame clarifies is the expectation around mistakes. A new senior engineer makes mistakes. They misread part of the codebase. They miss a caller. They use a deprecated API. The team catches these mistakes in code review, fixes them, and the engineer learns. The agent makes mistakes too, in similar shapes, and the team catches them in the same way. The expectation is not "the agent never makes mistakes" but "the agent makes mistakes that are caught and corrected." The catching mechanism is code review plus tests plus the diagnostic patterns covered in earlier sections of the guide. The mechanism works for both humans and agents.
One nuance: the agent makes mistakes faster than a human, because it generates code faster. This is not a problem if the catching mechanism keeps up. It becomes a problem if the team accepts the agent's output without review, because the volume of unreviewed mistakes accumulates faster than humans can debug them. The discipline that prevents this is the same discipline that works for human PRs: every change reviewed, every test run, every assumption verified. The discipline does not change because the source is an agent.
Specific Tools and Where They Help
The tools for legacy work with AI overlap with the general AI dev tools, but a few are particularly useful in legacy contexts. Worth knowing where each one fits, because the right tool for a given subtask saves time on dozens of similar subtasks across a large project.
For the codebase tour, Claude Code is a strong default. The agent reads files in order, builds context, makes changes with the context informed. The instruction file is at CLAUDE.md by convention, and the agent reads it on every session. For teams that prefer a different workflow, Cursor and Aider are alternatives, with similar conventions and similar abilities. The capability gap between the major agentic dev tools is small for legacy work; the convention details (where the instruction file lives, how files get into context) matter more than the agent itself.
For characterization tests, ApprovalTests is the dedicated tool. It exists in libraries for Java, .NET, Python, Ruby, and JavaScript. The pattern it codifies is the snapshot-test pattern: run the function, capture the output, save it, compare future runs against the saved version. The agent can drive ApprovalTests as well as any other testing library, and the workflow is well-documented enough that the agent does not need extra guidance. For ad-hoc characterization tests, Jest's built-in snapshot testing in JavaScript and pytest snapshot plugins such as syrupy in Python cover the same ground.
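A sketch of what the approval-test flavor looks like in Java; the domain names are invented, and the mechanic is the library's standard verify-against-approved-file flow:

```java
import java.util.List;

import org.approvaltests.Approvals;
import org.junit.jupiter.api.Test;

// Approval-test flavor of characterization testing. Approvals.verify writes the received
// output next to the test on the first run; you inspect and approve it once, and later
// runs fail with a diff whenever the output drifts from the approved copy.
class OrderSummaryApprovalTest {

    @Test
    void renderSummary_matchesApprovedOutput() {
        Order order = new Order("cust-123", List.of(new Item("sku-1", 2)));
        String summary = new OrderSummaryRenderer().render(order);   // hypothetical renderer
        Approvals.verify(summary);   // compares against the stored *.approved.txt on later runs
    }
}
```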
For framework upgrades, the framework's own tooling is the first stop. Rails has `rails app:update` plus the official upgrade guide. Spring Boot has the migration guide for each major version. Django has `django-upgrade` and similar tools. The agent can apply the documented changes, but the documented changes themselves are the work product of the framework team and reading them is the foundation. Skipping the documentation in favor of the agent's intuition produces upgrades that miss things the framework team specifically warned about.
For language migrations, the language's own tooling is the foundation. TypeScript's `tsc` with `allowJs` enables incremental migration from JavaScript. Python had `2to3` for the Python 2 to 3 migration. Go has its `go fix` tool for some changes. The agent layers on top of these tools, applying changes that the tools cannot automate, like inferring types from usage patterns or refactoring class hierarchies that do not translate cleanly. The split is mechanical-from-tools, judgment-from-agent-or-human.
For database migrations, pgloader is the workhorse for moving data into Postgres from MySQL, MS SQL, or SQLite. AWS DMS is the cloud-managed equivalent for larger migrations or when you want managed monitoring of the migration progress. For schema translation, the source database's information_schema is the canonical source, and the agent can generate the Postgres equivalent reliably for most cases. The hard cases (stored procedures, custom types, unusual extensions) need human review.
For tribal knowledge extraction, Whisper from OpenAI or distilled equivalents are the speech-to-text foundation. Output quality is high enough that transcripts are usable with minor cleanup. For wiki and chat archaeology, the platform's own search is usually the best tool: Slack search, Confluence search, Notion search. Pulling everything into a vector database for cross-platform search is overkill for most cases, because the volume is small enough that platform-native search covers the need.
For observability across the legacy plus new system during a strangler-fig migration, the choice of tools matters less than the consistency of instrumentation. Sentry catches errors, Datadog or New Relic captures metrics and traces, OpenTelemetry is the open standard for emitting telemetry. Whichever stack you pick, the discipline is to instrument both the old and new code paths with the same tooling, so you can compare behavior across the migration boundary. A migration where only the new code is instrumented produces blind spots that catch you when the old code has a regression.
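What identical instrumentation on both sides can look like with the OpenTelemetry Java API is sketched below; the service, span, and attribute names are illustrative. Tagging each span with its code path makes old-versus-new error rates and latencies a single comparison in whichever backend receives the telemetry:

```java
import io.opentelemetry.api.GlobalOpenTelemetry;
import io.opentelemetry.api.trace.Span;
import io.opentelemetry.api.trace.Tracer;
import io.opentelemetry.context.Scope;

// Both code paths emit the same span, distinguished only by an attribute, so the old and
// new implementations can be compared side by side during the strangler-fig rollout.
class InstrumentedOrderHandler {

    private final Tracer tracer = GlobalOpenTelemetry.getTracer("order-service");
    private final OrderProcessor legacyProcessor;      // hypothetical implementations
    private final OrderProcessor rewrittenProcessor;

    InstrumentedOrderHandler(OrderProcessor legacyProcessor, OrderProcessor rewrittenProcessor) {
        this.legacyProcessor = legacyProcessor;
        this.rewrittenProcessor = rewrittenProcessor;
    }

    OrderResult handle(Order order, boolean routedToNew) {
        Span span = tracer.spanBuilder("process-order")
                .setAttribute("code.path", routedToNew ? "rewritten" : "legacy")
                .startSpan();
        try (Scope ignored = span.makeCurrent()) {
            return routedToNew ? rewrittenProcessor.processOrder(order)
                               : legacyProcessor.processOrder(order);
        } catch (RuntimeException e) {
            span.recordException(e);
            throw e;
        } finally {
            span.end();
        }
    }
}
```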
Closing
Legacy is harder than greenfield. AI does not change that. AI shifts the work, but it does not eliminate it. The teams that succeed on legacy with AI are the ones that invest in onboarding the agent the same way they would onboard a new senior engineer. They write the agent instruction file. They do the rubber-duck walks and capture the transcripts. They backfill the characterization tests. They run the framework upgrades one version at a time. They use the strangler fig instead of the big-bang rewrite when the system is actively developed. They know when to give up on a refactor and start a rewrite, and they make the call at six months instead of two years.
The teams that fail are the ones that expect the agent to figure things out without the onboarding. They throw the agent at a Rails 4 codebase and ask for changes. They get plausible-looking output that breaks in subtle ways. They conclude AI does not work on legacy. The conclusion is wrong but the workflow that produced it is real, and it is the default workflow most teams try first. Avoiding it requires deliberate effort, and the effort feels expensive in the early weeks before the compounding kicks in.
The shift in mindset is the lesson. Stop thinking of the agent as a tool that takes tasks and produces results. Start thinking of it as a collaborator that needs context. Provide the context deliberately. Update the context as the work proceeds. Let the context compound across sessions. The agent's productivity on legacy code grows with the context. The growth is slow at first and steep later. Teams that stay with the discipline through the slow phase get to the steep phase. Teams that quit during the slow phase never find out that the steep phase exists.
The closing thought is that legacy work is going to be a meaningful share of professional software engineering for the foreseeable future. The systems that run the world were built decades ago, in many cases, and they are not getting rewritten en masse. They are getting maintained, extended, migrated, and occasionally rewritten in pieces. AI changes the speed and texture of all of these activities, but it does not change the underlying nature of the work. The work is still about understanding what is there, why it is there, and how to change it without breaking it. AI helps, when used well. The using-well is the discipline this whole guide has been about. Internalize it and the legacy code stops being a productivity drag and starts being where the most valuable work happens.
