The constraint is no longer model intelligence. It is how much of your codebase fits in the model's working memory. Context engineering is the discipline of getting the right information into that memory at the right cost, and it is what separates an agent that ships real software from one that hallucinates plausible nonsense and burns through your token budget on the way down.
This page is about that discipline. Not the abstract version that shows up in research papers about retrieval-augmented generation, but the practical one a working engineer needs when they sit down to direct a coding agent through an actual project with hundreds of files, a real architecture, and a real deadline. The frame is opinionated. Anthropic's Claude family is the recommendation throughout because Claude's context window, prompt caching mechanics, and agent tooling are the most mature options in 2026. The principles generalize, but the specifics are calibrated to the tool that most people doing serious work are actually using.
The Context Window Evolution
The context window is the number of tokens a model can hold in working memory at once. It bounds what the agent can see when it produces output. The size of that window has grown by more than two orders of magnitude in roughly four years, and the growth tracks the entire arc of what kinds of work an AI agent can credibly take on. The dates matter because the dates explain why the practice of context engineering even exists as a named discipline now and did not exist meaningfully in 2022.
Circa 2022, a few thousand tokens. Roughly three pages of code. Useful for snippets, explanations, and one-off scripts. Useless for anything resembling a codebase. The mental model was "feed the model a question and a small amount of relevant code, copy the answer back into your editor." There was no agent. There was no codebase awareness. Context engineering as a discipline did not need to exist because there was almost no context to engineer.
Circa 2023, tens of thousands of tokens. An order of magnitude more space. Enough to hold a single file, sometimes a small set of related files. The phrase "AI pair programmer" started to mean something, and the first generation of autocomplete-style tooling matured. But the workflow was still snippet-by-snippet. The human had to manually feed the model the right slice of the codebase, and the model gave back a slice in return. Context engineering at this stage was almost entirely the human doing copy-paste with judgment.
Circa 2024, 200K tokens. The threshold that mattered. Two hundred thousand tokens is roughly the size of a small-to-medium application: source files, schema, config, tests. For the first time, an agent could hold an entire project in working memory and reason about it as a coherent system. This is when context engineering became a real discipline, because there was finally enough space for the engineering choices to pay off and finally enough waste possible for them to matter.
2025 to 2026, a million tokens and beyond. Claude with extended context up to 1M tokens, Gemini at 2M, GPT-5 settling around 128K with a different architectural posture. The largest windows can hold most of a real codebase including dependencies. Prompt caching matures into a primary cost lever. Sub-agent dispatch becomes a standard pattern in agent harnesses. The context window stops being the binding constraint on most projects and the cost of context becomes the new optimization target.
The number to fixate on is 200K. That is the inflection point, and it is still the floor for serious work. Below 200K, the agent is reasoning about a fragment and inventing the rest. At 200K, the agent can see most of what matters in a typical small-to-medium project. Above 200K, the question shifts from "does it fit" to "what should be in there at all," because the model that has access to a million tokens still pays for every one of them and still has to find the relevant ones inside the noise. Bigger windows do not save you from context engineering. They change which parts of it bind hardest.
What Burns Context (And Doesn't Belong)
Most agent sessions waste a lot of context. Not because the agent is dumb, but because the default behavior of "throw everything at the model and let it sort the relevance" is a tax that compounds over a long session. Naming the specific kinds of waste is the first step toward controlling them. There are six recurring offenders, and they show up in roughly the same proportions across most working agents.
Verbose log spam from previous turns. The agent runs a build, the build prints two thousand lines of output, those two thousand lines sit in the conversation forever. The relevant signal is usually three lines: the failing test, the file, the error message. Everything else is noise that the model has to skim past on every subsequent turn, paying tokens both ways. A disciplined session prunes log output to the relevant lines on the way through, and a disciplined agent harness summarizes long command output before it lands in context.
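As a sketch of what that pruning can look like in practice, here is a hypothetical harness-side filter. The function name, the signal regex, and the thresholds are all illustrative assumptions, not any real harness's API; the point is that a few lines of filtering stand between two thousand lines of build output and the three that matter:

```typescript
// Sketch: prune verbose command output to error-relevant lines before it
// lands in the agent's context. The signal patterns are illustrative.
const SIGNAL = /\b(error|fail(ed|ing)?|exception|assert)\b/i;

function pruneLogOutput(raw: string, contextLines = 2, maxLines = 40): string {
  const lines = raw.split("\n");
  const keep = new Set<number>();
  lines.forEach((line, i) => {
    if (SIGNAL.test(line)) {
      // Keep the matching line plus a little surrounding context.
      const lo = Math.max(0, i - contextLines);
      const hi = Math.min(lines.length - 1, i + contextLines);
      for (let j = lo; j <= hi; j++) keep.add(j);
    }
  });
  const pruned = lines.filter((_, i) => keep.has(i)).slice(0, maxLines);
  // If nothing matched, fall back to the tail of the output.
  return pruned.length > 0 ? pruned.join("\n") : lines.slice(-maxLines).join("\n");
}
```

A thousand-line compile log with one failing test collapses to a handful of lines, and every subsequent turn stops paying for the rest.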
Generated boilerplate. Lockfiles, build artifacts, minified bundles, generated migrations, anything that the agent does not need to read but might end up reading because of an over-broad search. A typical Node.js lockfile is tens of thousands of tokens of dependency hashes the agent will never need. The same is true of compiled assets, vendored third-party code, and any directory the build process writes into. Tell the agent which directories to ignore. If the harness has an ignore file convention, use it.
Repeated reads of the same file. The agent reads a file in turn three. It reads the same file again in turn seven, this time because it forgot it had already seen it. Twice the tokens, no new information. This is one of the easier wastes to spot in the wild because the duplicate file content is sitting right there in the transcript. A serious harness deduplicates file reads inside a session. If yours does not, prune the duplicates by hand or summarize the file once and refer to the summary on later turns.
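A minimal sketch of what in-session deduplication could look like, assuming a harness that routes file reads through a hook (the class and its stub format are hypothetical; a real implementation would also invalidate entries when a file changes on disk):

```typescript
// Sketch: deduplicate file reads within a session. The second read of the
// same path returns a short stub pointing back at the first read instead of
// repeating the full content.
class SessionReadCache {
  private seen = new Map<string, number>(); // path -> turn of first read

  read(path: string, turn: number, loader: (p: string) => string): string {
    const firstTurn = this.seen.get(path);
    if (firstTurn !== undefined) {
      return `[${path} already read in turn ${firstTurn}; content unchanged]`;
    }
    this.seen.set(path, turn);
    return loader(path);
  }
}
```

The stub costs a dozen tokens instead of a second full copy of the file, and it tells the model where in the transcript the content already lives.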
Long unstructured chat history. The session opened with a clear request. Then the user pivoted three times, asked for a refactor, changed their mind, asked to revert, asked for tests, decided to pause and talk through the architecture instead. By turn fifteen, half the context window is conversational debris that does not describe the current work. The fix is to consolidate and start a fresh session, which costs less than carrying the noise. Treat session boundaries as a tool, not as a default to be feared.
The agent's own thinking-out-loud, unsummarized. Modern agents reason verbosely. The reasoning is useful in the moment and waste afterward. If the harness preserves the entire reasoning trace from every previous turn, you are paying tokens for monologues you have already read. Some harnesses summarize older reasoning automatically. Some do not. Know which yours is, and if it does not, occasionally ask the agent to consolidate and restart.
Off-topic detours. The agent went looking for one bug, found a different one, traced through a third file, and came back to the original. The exploration was useful but the trail it left in context is a forest of file content that no longer matters. Detours are valuable when they happen and become deadweight once the answer is in hand. A good practice is to ask the agent to summarize what it found at the end of a detour and then let the older detail drop out of attention.
Consider the anatomy of a typical turn. The final answer is a few hundred tokens. Everything before it, the file reads, the tool output, the accumulated reasoning, is what fills the window. If you only optimize one thing in your session, optimize the volume of stuff in front of the final answer. Fewer file reads. Tighter tool output. Summarized reasoning. The model still produces the answer; you just do not pay for the rest of the trail forever.
What Stretches Context (Patterns That Work)
The flip side of the waste list is a set of patterns that consistently make context go further. These are not exotic techniques. They are the working set of habits that distinguish someone who has spent fifty hours debugging context-bound sessions from someone who has not. Five patterns matter most.
Prompt caching. The single largest cost lever in modern agent workflows. Anthropic's Claude API supports cache breakpoints in the prompt: you mark a stable prefix (system instructions, large reference docs, project-wide context), and on subsequent calls the cached portion is read at roughly a tenth of the input token cost. The cached content does not consume your wall-clock time on each call either; the cache hit is faster than fresh processing. For an agent workflow that hits the same project context dozens of times across a session, the cost difference between cached and uncached can be the difference between an affordable habit and a budget-blowing one. Anthropic's caching has a five-minute TTL on the default tier and longer on premium tiers; structure your prompt so the stable parts come first and the volatile parts come last, and the cache will earn its keep.
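To make the ordering concrete, here is a minimal sketch of a request body laid out for caching. The `cache_control` block format follows the shape Anthropic documents for the Messages API, but the model id is a placeholder and the builder function is hypothetical; it is shown as a plain object constructor so the structure is checkable without an API call:

```typescript
// Sketch: stable content first and marked for caching, volatile content last.
type TextBlock = {
  type: "text";
  text: string;
  cache_control?: { type: "ephemeral" };
};

function buildCachedPrompt(stableContext: string, userTurn: string) {
  return {
    model: "claude-sonnet-4-5", // assumption: substitute your actual model id
    max_tokens: 4096,
    system: [
      // Stable prefix: AGENTS.md, architecture brief, reference docs.
      // The cache_control marker ends the cacheable prefix here.
      { type: "text", text: stableContext, cache_control: { type: "ephemeral" } } as TextBlock,
    ],
    messages: [
      // Volatile content goes after the breakpoint so it never invalidates the cache.
      { role: "user", content: userTurn },
    ],
  };
}
```

Everything before the breakpoint is billed at the cached rate on a hit; everything after it changes freely without disturbing the cache.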
File summaries instead of full file dumps. A 2,000-token file usually has a 200-token summary that captures everything the agent needs to know to use it correctly. Function names, signatures, exported symbols, the broad shape of what is in there. If the agent later decides it needs the actual implementation, it can read the file then. Most of the time it does not. Train yourself to give the agent summaries as a default and full reads only when summaries are not enough. The asymmetry is large.
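A crude sketch of what a mechanical summarizer could do, assuming TypeScript source and a simple line-based pass (a real tool would use the compiler API; the regex here is a deliberate simplification to illustrate the asymmetry):

```typescript
// Sketch: summarize a TypeScript file down to its exported signatures.
// The summary carries the usable interface at a fraction of the tokens.
function summarizeExports(source: string): string {
  const sigs: string[] = [];
  for (const line of source.split("\n")) {
    const m = line.match(/^export\s+(async\s+)?(function|class|const|interface|type)\s+\w+/);
    // Keep the declaration line, dropping any trailing function body opener.
    if (m) sigs.push(line.trim().replace(/\s*\{.*$/, ""));
  }
  return sigs.join("\n");
}
```

A 2,000-token file becomes a dozen signature lines; the agent asks for the full implementation only when the signatures are not enough.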
Hierarchical memory. Not every detail needs to be at the top of the prompt. A useful pattern is a tiered structure: a short top-level brief on what the project is and how it is organized, a mid-level reference of where things live, and detail-level files that the agent reads on demand. The agent navigates the hierarchy, which costs tokens, but each level filters what is needed at the next level. Compared to dumping every file in the prompt, hierarchical memory consistently uses a fraction of the tokens for equivalent or better outputs.
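The tiers can be sketched as a data structure. Everything here is a hypothetical shape, not a real harness API: the brief is always in context, the map filters which modules matter, and only those modules get promoted to a full read:

```typescript
// Sketch: three-tier memory. Top brief always present, mid-level map read at
// session start, detail loaded on demand for the modules a task touches.
interface MemoryTiers {
  brief: string;                      // ~200 tokens, always in context
  map: Record<string, string>;        // module -> one-line description
  detail: (module: string) => string; // on-demand full read
}

function contextFor(tiers: MemoryTiers, task: string, modulesNeeded: string[]): string {
  const mapLines = Object.entries(tiers.map).map(([m, d]) => `${m}: ${d}`);
  // Only the modules the task touches get promoted past the map level.
  const details = modulesNeeded.map((m) => tiers.detail(m));
  return [tiers.brief, ...mapLines, ...details, `TASK: ${task}`].join("\n");
}
```

The map costs a line per module on every task; the detail costs a full read only when a task actually needs that module.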
AGENTS.md as context shorthand. A short, dense file in the project root that tells the agent the things it needs to know to work in this codebase. Conventions, folder layout, do-not-touch zones, project-specific patterns, the deployment target. Written once, read at session start, applies for the whole session. A 200-line AGENTS.md replaces 5K tokens of repeated explanation across sessions because the agent picks it up without you having to type the same context every time. There is a dedicated topic in this curriculum on agent instruction files; here, the relevant point is that AGENTS.md is one of the highest-payoff context engineering moves you can make on a project.
Sub-agent delegation. Send a child agent to do a context-heavy subtask with a fresh, isolated context window. The parent gets a summary back. The parent's context stays clean. This pattern is worth its own section below because the tradeoffs are subtle, but the headline is that sub-agents are the closest thing to a free lunch in context engineering: they let you do expensive exploration without paying the cost forever in the parent's window.
The undisciplined workflow: dump every file the agent might need at the start of the session. Run commands and let the full output land in the prompt. Read the same file three times across a long session because the agent did not notice it had read it. Carry every previous turn forward verbatim. Reasoning traces accumulate. Context fills to 80% by turn twelve, the model starts dropping detail, the output quality degrades, and the cost per useful turn climbs steadily. By turn twenty, you are paying full-rate input tokens for a context that is mostly stale and the model is increasingly confused about what is current.
The disciplined workflow: cache the project's stable context (AGENTS.md, architecture brief, key file summaries) at the top of the prompt where it lives at one tenth the input cost on every turn. Let the agent read files on demand rather than pre-loading them. Summarize verbose tool output before it lands. Dispatch sub-agents for exploration-heavy tasks so the parent never sees the noise. Periodically consolidate by asking the agent to summarize what is now stable and starting a clean session. Context stays under 50% even on long sessions, costs stay flat per turn, and output quality stays steady.
These two workflows describe the same project worked with the same agent on the same model. The difference is purely in how context is managed. The cost difference between the two over a working week can easily be 5x to 10x. The quality difference is the kind of thing that compounds over a project: better attention on every turn, fewer context-related mistakes, less time spent recovering from the agent confusing past and current state.
Patterns for Large Codebases
Once the codebase exceeds the context window, the rules change. You can no longer rely on "fit the whole thing in" as a strategy, even with a 1M-token Claude. The agent has to navigate selectively, and the navigation strategy becomes part of the engineering. Four patterns dominate at this scale.
File selection over file inclusion. The agent picks what to read. You do not dump everything. This is a mindset shift more than a technique. The natural impulse is to provide all potentially relevant files at the start of the task, because the human brain wants to make sure the agent has what it needs. The natural impulse is wrong on a large codebase. The agent has tools to read files. Use them. Tell the agent what you want done, point it at the entry point, and let it decide what else to read. The cost of a few extra read calls is much smaller than the cost of dumping ten files into context that the agent never needed.
Index files. A single document in the project root or in a docs folder that lists each module's purpose and entry points. Five to ten lines per module: what it does, where its main file is, what it depends on. The agent reads the index once, then has a map of where to look for what. Without an index, the agent's first move on any new task is to scan the file tree, which costs tokens for every directory it walks. With an index, that scan is replaced by a single read and a more accurate guess at where to dive in.
Architecture overview docs. A 5K-token document that describes the system at the level of "here are the major components, here is how they communicate, here are the things to know that are not obvious from reading the code." A document like this can replace 50K tokens of source reading because it gives the agent the conceptual model that the source files only reveal cumulatively. Write it once. Update it when the architecture changes. Cache it at the top of the prompt. The return on the writing time is enormous.
The "tour" pattern. Before doing any real work in a codebase the agent has not seen, ask it to do an architecture sweep. Read the key files, build an internal model, summarize what it found, save the summary somewhere reusable. Then drive the actual work from that internal map. The tour costs tokens up front and saves them on every subsequent turn because the agent now has a coherent model of the system rather than discovering it piecewise on every task. For a codebase the agent will work on repeatedly, the tour is paid back many times over.
1. Tell the agent it is being onboarded to a new codebase. Point it at the entry point (often the README, sometimes package.json, sometimes the top-level layout). Let it read those first to set the mental frame.
2. If you have a module index or a project AGENTS.md, that is the next read. The agent absorbs the structural map before it touches any source code. This step alone can cut the rest of the onboarding's token cost in half.
3. Ask the agent to read three to five key files: the main entry, the core domain model, the auth or session layer, the database access layer, the deployment config. Not everything. Just the files that explain how the system thinks about itself.
4. Have the agent produce a short architecture summary in its own words. This serves two purposes: it forces the agent to consolidate what it has learned, and it gives you a checkable artifact you can correct if the agent has misunderstood something.
5. Write the architecture summary into a file in the project. Cache that file in future sessions. The next agent that touches this codebase, including the same agent in a later session, starts from this artifact instead of having to redo the tour.
6. With the map in hand, dispatch the agent on the original task. Its file reads will be more targeted, its reasoning will be more accurate, and the output quality will be measurably higher than if you had skipped the tour and dropped it cold into the task.
The tour is overkill for a single one-shot task. It pays back many times over on any codebase the agent will work on repeatedly. The breakeven point is roughly the third or fourth task on the same codebase. After that, every task starts from a higher floor because the architecture summary is already on disk and gets cached at the top of every session prompt.
When to Use Sub-Agents (Context Isolation)
The sub-agent pattern is the most underrated tool in context engineering. The basic move: instead of doing an exploration-heavy task in the main session, dispatch a child agent with a focused task and a limited context budget, let it explore, and have it return a summary. The parent's context stays clean. The exploration happens in a window the parent never sees. The cost is one extra round trip and the price of the child agent's tokens, which is usually small compared to what the same exploration would cost in the parent's window over the rest of the session.
Why this works mechanically. The parent agent's context window is fixed and cumulative within a session. Every file it reads, every command it runs, every reasoning trace it produces, all of that stays in the window for the rest of the session, getting paid for on every subsequent turn. A sub-agent's context window is its own and is discarded when the sub-agent returns. So if the sub-agent reads twenty files to find one answer, the parent only pays for the one answer, not for the twenty file reads. The asymmetry is large for any task where the exploration cost dwarfs the answer size.
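The asymmetry is easy to put in numbers. A back-of-envelope sketch with illustrative token counts (50K of exploration, a 500-token summary, twenty remaining turns):

```typescript
// Sketch of the sub-agent asymmetry. Exploration retained in the parent's
// window is re-paid as input on every later turn; a delegated sub-agent pays
// for it once and the parent re-pays only the summary.
function retainedCost(explorationTokens: number, laterTurns: number): number {
  return explorationTokens * laterTurns;
}

function delegatedCost(explorationTokens: number, summaryTokens: number, laterTurns: number): number {
  return explorationTokens + summaryTokens * laterTurns;
}
```

With these illustrative numbers, retaining the exploration costs a million token-turns against sixty thousand for the delegated version, which is the mechanical content of "the parent only pays for the answer."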
Use cases that consistently benefit. Research tasks where the agent needs to read many files to answer a question that ends up being a paragraph. Large file reads where you want the gist but not the full content in the parent's context. Code-search sweeps across a codebase looking for all places that use a particular pattern. Cross-referencing checks where the agent has to compare implementations in multiple files. Anything where the cognitive payload of "go find this out" is large and the actual deliverable is small.
The tradeoffs. More API calls, which is more cost and a little more latency on the dispatch. The sub-agent does not have access to the full conversational context of the parent unless you give it that explicitly, so the dispatch prompt has to be more self-contained than a turn in the main session. Some kinds of work do not delegate well; if the task requires deep back-and-forth with the user, dispatching a sub-agent loses the value of the conversation. Use sub-agents for delimited tasks with a clear input and a clear output. Keep the conversational tasks in the main session.
```javascript
// Generic shape of a sub-agent dispatch
// (Claude Code's Agent tool follows this pattern)

// Parent's turn: dispatch a sub-agent
agent.dispatch({
  task:
    "Find all places in src/ that import from @legacy/auth and " +
    "summarize how each one uses it. Return only the summary, " +
    "not the file contents.",
  budget: 50000 // tokens of context the sub-agent can use
});

// Sub-agent runs in its own session, reads files,
// builds the answer, returns a summary.
// Parent receives just the summary, not the trail of reads.
// Parent's context grows by ~500 tokens instead of ~50K.
```
The pattern in code form is generic on purpose. The specifics differ across harnesses. Claude Code has the Agent tool for this. Other harnesses have similar primitives under different names. The key property is: a fresh context window, a focused task, a summarized return. Whatever your harness calls it, find it and use it on the right tasks.
One important nuance. Sub-agents are not always cheaper in raw token terms. The sub-agent has to be told the task, has to read the files itself, and has to produce output. If the same exploration would have happened in the parent's context anyway and would not be retained beyond the current turn, the sub-agent is purely overhead. The savings show up specifically when the exploration would have stayed in the parent's window and been paid for repeatedly. Use sub-agents when the alternative is "this exploration sits in the parent's context for the next twenty turns." Skip them when the exploration is already short-lived.
AGENTS.md as Context Shorthand
AGENTS.md is the file in your project root that the agent reads on session start to learn the things it needs to know about your codebase. It is the single highest-payoff piece of context engineering for a long-lived project, because it replaces a fixed amount of repeated explanation with a one-time write that pays itself back forever. There is a dedicated topic on agent instruction files in this curriculum that goes deep on what to put in one and how to write it well. Here, the relevant point is the context efficiency math.
The math. A 200-line AGENTS.md is roughly 2K to 3K tokens. With prompt caching, the cached read of those tokens is roughly a tenth of the input cost on every session. Without an AGENTS.md, the same context typically gets reconstructed in the session through a series of reads and explanations that easily total 5K tokens at full input rate, and the reconstruction has to happen on every fresh session because the agent does not retain anything across sessions. Net: the AGENTS.md saves a small but consistent amount per session, and over a hundred sessions it adds up to real savings. More importantly, it raises the floor of agent quality because the agent always knows the conventions, instead of sometimes guessing them and getting them wrong.
What to put there for context efficiency. The structural facts about the project that do not change often. Folder layout. Naming conventions. The frameworks and libraries in use. The commands for build, test, deploy. Any do-not-touch zones (generated files, vendored code, legacy modules under maintenance freeze). Project-specific patterns that the agent's defaults would get wrong (custom auth shape, non-standard data flow, opinionated component patterns). The deployment target. The database schema's broad shape. Anything you find yourself explaining to the agent more than twice belongs in AGENTS.md.
What not to put there. Detailed examples that are better suited to a separate doc. Long code blocks that bloat the file without adding structural information. Anything that changes frequently, because every update to AGENTS.md invalidates the cache for that file. Tactical instructions about a specific feature you are working on right now. Documentation that is genuinely needed but is going to be read on demand rather than every session. The discipline is to keep AGENTS.md dense and structural. If it grows past 300 lines, ask whether some of it should move to a docs folder and be referenced rather than included.
AGENTS.md is the cheapest, highest-payoff context engineering move available on any long-lived project. Write it. Keep it dense. Refresh it when the structure changes. The same agent on the same model produces measurably better output in a project with a good AGENTS.md than in a project without one, and the cost difference per session is roughly a 90% reduction on the explanatory overhead.
Measuring Context Waste
You cannot manage context efficiently if you cannot see how much you are using. Most modern agent harnesses expose a token counter somewhere, and it is the single most useful diagnostic in the practice. Knowing your context usage at any given moment is the difference between catching a bloat problem at turn five and discovering it at turn twenty when the agent starts forgetting things.
Where to find the counter. In Claude Code, context usage shows up in the status display and is queryable through the agent's introspection. Cursor shows context usage in its session UI. Most harnesses worth using have something. If yours does not, that is a strong signal to switch harnesses, because you cannot do this work blind.
The "X% used after Y turns" rule of thumb. After ten turns of normal work on a 200K-token window, expect roughly 30% utilization. Lower if the work is short-form prompting, higher if the work involves heavy file reads. If you hit 70% utilization by turn ten, something is wrong: probably oversized file dumps, probably verbose tool output, probably accumulated reasoning traces. Check the transcript, find the bloat, prune or restart.
The "halt and summarize" pattern. When context usage approaches its limit (typically 80% to 90% on most harnesses), stop adding new content and ask the agent to consolidate. A consolidation prompt looks like this: "Summarize what we have done in this session, what is currently in flight, and what files have been changed. Be detailed enough that a fresh session could continue from your summary." Save the output. Start a fresh session. Paste the summary. The fresh session has a clean window, the relevant state is captured, and the work continues without the model degrading from over-stuffed attention.
Across well-managed sessions, a rough average breakdown puts file reads as the largest chunk of context usage. The exact proportions vary by project, but file reads are also the easiest place to apply discipline. If your file-read share is creeping past 50%, you are almost certainly dumping more files than the agent needs. Tighten the file selection. Ask the agent to summarize files instead of including their full content. Move large stable references into cached prompt prefixes.
The thing to also notice: the system prompt and AGENTS.md, when cached, occupy roughly 5% of the token budget and are paid at one tenth the rate, which puts them at well under 1% of the session's input cost. The most stable, most useful context is also the cheapest. The economics of caching the right things are enormous.
Build a habit of checking the counter. Not constantly, but at natural break points. Every five turns or so. Every time the agent does something that involved a lot of reading. Every time you feel the session has gone long. The instrumentation is cheap; the visibility it gives you is what makes context engineering tractable rather than mystical.
Architectural Choices That Reduce Context Pressure
Some of the highest-payoff context engineering moves are not at the prompt level. They are at the codebase level. The shape of your code determines how easy it is for an agent to navigate, how much it has to read to do anything, and how much of what it reads is actually relevant. Four architectural choices consistently reduce context pressure.
Smaller files over giant ones. A 2,000-line file forces the agent to read 2,000 lines to know what is in it, even if the work is on one function. A 200-line file reveals its structure quickly and lets the agent skip what it does not need. Modern file conventions in good codebases tend to keep individual files in the 100 to 400 line range for this reason among others. The agent's experience tracks the human reader's: smaller files are easier to navigate, easier to reason about, and cheaper to bring into context. The codebase that is easy on humans is usually easy on agents too.
Co-located files. The component, its tests, and its types in the same folder. The route handler and its validation schema and its tests near each other. The model and its fixtures together. When the agent reads one file, it can see the related files in the same directory listing, which means a single read often reveals the whole working set for a feature. Compare that to a structure where types live in /types, tests live in /tests, and components live in /components: the agent has to do three reads to assemble what should have been one. The architectural pattern that helps human navigation also helps agent navigation, by the same mechanism.
Clear module boundaries. The agent can read one module without needing to read five others. This is just good engineering, but the agent makes it more visibly valuable. A module that exposes a clean interface and hides its internals lets the agent reason about the interface without paying the token cost of the implementation. A module with leaky abstractions, deep cross-references, and shared mutable state forces the agent to read everything to understand anything. The cost of bad abstractions used to show up only in human comprehension. Now it shows up in the model's token bill too.
Tight types. The type signature gives the agent the shape of a function without it having to read the implementation. A function with a clear input type and output type is one the agent can use correctly from its signature alone, in many cases. A function with broad types (anything that returns "any" or accepts loosely typed objects) forces the agent to read the implementation to know what it actually does. TypeScript's static types are a context engineering tool as much as they are a developer experience tool. Same for type hints in Python, generics in Go, and any other type system that makes signatures meaningful. The more the type tells, the less the agent has to read.
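The claim is easy to illustrate with a pair of hypothetical functions. The first signature tells the agent everything it needs; the second forces a read of the body to learn the same facts:

```typescript
// Illustration: a tight signature is usable without reading the implementation.
interface Invoice {
  id: string;
  totalCents: number;
  currency: "USD" | "EUR";
}

// Tight: inputs, output, and field shapes are all in the signature.
function applyDiscount(invoice: Invoice, percent: number): Invoice {
  return { ...invoice, totalCents: Math.round(invoice.totalCents * (1 - percent / 100)) };
}

// Loose: the signature reveals nothing; the agent must read the body to learn
// that it expects an object with totalCents and leaves other fields alone.
function applyDiscountLoose(invoice: any, percent: any): any {
  return { ...invoice, totalCents: Math.round(invoice.totalCents * (1 - percent / 100)) };
}
```

Same behavior, same token count in the source; very different token cost for the agent that has to decide whether it can call the function safely.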
Architectural choices compound. A codebase with small, co-located files, clear module boundaries, and tight types is a codebase the agent can work in efficiently for years without ever pressing against the context window. A codebase with the opposite properties is one where every task is an exercise in fitting the relevant code into the prompt. The same agent on the same model produces dramatically different output quality in the two codebases, and the difference is not the agent. It is the architecture.
A Note on Session Hygiene
Beyond the structural patterns, there is a set of small habits that accumulate over a long working session. They are individually unremarkable. Together, they are the difference between someone whose sessions stay productive at turn forty and someone whose sessions degrade by turn fifteen.
End sessions deliberately. Do not let a session drift into incoherence. When the work is at a natural break point, ask the agent to summarize what was done, commit the changes, and start fresh next time. The cost of a fresh session is one extra prompt at the start. The cost of a degraded session is much higher: lower output quality, missed details, wasted tokens on context the agent is already confused about.
Prefer multiple short sessions over one long one. Each short session has clean context. Each one starts with cached project context loaded at low cost. Each one ends with the agent at full attention rather than at 80% utilization. The handoff between sessions is the consolidation summary, which is cheap to produce and useful as a record. Long sessions feel productive in the moment because you are not paying the start-up cost; they cost more later in attention degradation.
Use sub-agents for anything heavy. The pattern was covered above. Worth restating in the hygiene context: any time you are about to dispatch the agent on something that involves a lot of reading or searching, ask whether a sub-agent would do it better. Most of the time the answer is yes.
Cache aggressively. Anything stable belongs in a cached portion of the prompt. AGENTS.md, architecture briefs, key reference docs, your style guide, the schema if it is stable. The cost difference between cached and uncached for the same content is roughly an order of magnitude. There is no reason not to cache the things that do not change.
Prune verbose output. When a command produces a thousand lines of output and you only need three, paste only the three. The harness will sometimes do this for you, but the discipline of doing it manually is worth developing because not every harness handles every kind of output well. The agent does not need the noise. The model does not need to skim past it. Pay tokens for signal, not for buffer.
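The pruning habit is mechanical enough to sketch. The filter below keeps only lines matching a few patterns, plus one line of surrounding context, and replaces the rest with a count. The pattern list is an illustrative assumption, not any harness's real API; tune it to what your build tools actually emit.

```python
import re

def prune_output(raw: str, keep_patterns=(r"error", r"warning", r"FAILED"), context: int = 1) -> str:
    """Keep lines matching any pattern, plus `context` neighbors; note what was dropped."""
    lines = raw.splitlines()
    keep: set[int] = set()
    for i, line in enumerate(lines):
        if any(re.search(p, line, re.IGNORECASE) for p in keep_patterns):
            # Keep the matching line and a little surrounding context.
            for j in range(max(0, i - context), min(len(lines), i + context + 1)):
                keep.add(j)
    kept = [lines[i] for i in sorted(keep)]
    dropped = len(lines) - len(kept)
    if dropped:
        kept.append(f"[... {dropped} lines pruned ...]")
    return "\n".join(kept)
```

The placeholder line matters: it tells the model that output was elided deliberately, so it does not mistake a pruned transcript for a short one.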
The Cost Side of the Equation
Context engineering is not only about quality. It is about cost, and the cost numbers in 2026 are large enough that a careless habit can run up a serious bill. Concrete shapes are useful here. The pricing structures across providers shift over time and the specific numbers will be slightly different by the time you read this, but the relative scale is stable.
Claude Sonnet input pricing is on the order of single-digit dollars per million tokens. Cached input is on the order of tens of cents per million. Output tokens are roughly 4x to 5x the input price. A typical agent session that uses 200K tokens of input and generates 20K tokens of output is in the few-dollars range. Multiply by hundreds of sessions per month and the bill is meaningful. Multiply by the difference between caching the right things and not caching them, and the bill changes by a factor of two or three.
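The arithmetic is worth doing once explicitly. This sketch uses assumed prices in the same ballpark as the text ($3 per million input tokens, roughly 10x cheaper cached, 5x more for output); check current pricing before relying on the absolute numbers.

```python
# Assumed prices, $/1M tokens. Illustrative only; verify current rates.
PRICE = {"input": 3.00, "cached_input": 0.30, "output": 15.00}

def session_cost(input_tokens: int, output_tokens: int, cached_fraction: float = 0.0) -> float:
    """Dollar cost of one session, with some fraction of input served from cache."""
    cached = input_tokens * cached_fraction
    fresh = input_tokens - cached
    return (fresh * PRICE["input"]
            + cached * PRICE["cached_input"]
            + output_tokens * PRICE["output"]) / 1_000_000

# The session shape from the text: 200K input, 20K output.
uncached = session_cost(200_000, 20_000)
mostly_cached = session_cost(200_000, 20_000, cached_fraction=0.8)
```

With these assumed prices, caching 80% of the input roughly halves the session cost, which is the "factor of two or three" the text describes once multi-turn re-reads are included.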
Where the cost concentration shows up. The prompt cache is the single largest lever. A project that caches its system prompt and AGENTS.md and architecture brief, all sitting at the top of every session prompt, pays roughly a tenth the cost on those tokens compared to a project that does not. For a stable project context of 5K tokens read on every turn over a hundred sessions, that is a measurable monthly difference.
The second largest lever is sub-agent use, because it removes high-volume exploration from the parent's context. The third is file selection discipline (read what is needed, summarize what is not). The fourth is session hygiene (end sessions before they bloat).
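The stable-prefix claim above is easy to check on paper. This sketch runs the numbers for a 5K-token prefix re-read on every turn; the prices and the turns-per-session figure are assumptions for illustration, not quoted from any provider.

```python
UNCACHED = 3.00  # $/1M input tokens (assumed)
CACHED = 0.30    # $/1M cached input tokens (~10x cheaper, assumed)

prefix_tokens = 5_000    # stable project context, re-read every turn
turns_per_session = 30   # assumption
sessions = 100

total_tokens = prefix_tokens * turns_per_session * sessions  # 15M tokens
cost_uncached = total_tokens * UNCACHED / 1_000_000  # $45.00
cost_cached = total_tokens * CACHED / 1_000_000      # ~$4.50
```

Tens of dollars per month for one 5K-token block is exactly the "measurable monthly difference" the text describes, and real project contexts are usually several times larger.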
The cost framing matters because it reframes the discipline. Context engineering is not a nice-to-have practiced by people who like optimization. It is the practice that determines whether running a coding agent at scale is affordable or not. A team that ships ten features a week with disciplined context use spends a fraction of what a team shipping the same features with sloppy context use spends, on the same model, with the same agent harness.
Common Mistakes
The list of recurring mistakes in context engineering is short and stable. Most people working with agents make most of these mistakes some of the time; the discipline is to recognize them and correct them.
Dumping every potentially relevant file at the start of a task. The natural impulse, almost always wrong, fixed by trusting the agent to read what it needs.
Not caching the system prompt. The single most expensive sin. A 3K-token system prompt that gets paid at full input rate on every session adds up faster than you would guess. Move it into a cached prefix.
Verbose tool output left in the transcript. The build system printed two thousand lines, three of which mattered, and the rest are now in the prompt forever. Summarize on the way through.
Not using sub-agents for exploration. The parent agent ends up with a context full of file reads from a search task that contributed only a paragraph of useful output. The pattern is so common it is the reason sub-agent dispatch exists.
Letting sessions run until they degrade. The agent loses attention at high context utilization. The signal is subtle: it forgets details, mixes up state, contradicts earlier outputs. The fix is to consolidate and restart, not to push through.
Skipping the architecture tour on a new codebase. The agent does its first task by reading randomly, builds a fragmented picture, and gets things wrong that a 5-minute tour would have prevented.
Treating context as free. It is not. Every token in your prompt costs money and attention. The discipline of asking "does this need to be here?" on every chunk of context is the difference between cheap, fast, accurate sessions and expensive, slow, hallucination-prone ones.
When the Model Is the Bottleneck Anyway
An honest section before closing. There are tasks where context engineering is not the bottleneck. The model itself is. If the work is genuinely beyond the model's capability, no amount of context discipline will save it. The agent that does not know how to write the right algorithm will not write it correctly even with a perfect context. The agent whose reasoning capacity is the constraint is not helped by tighter file selection.
The signal that you have hit a model bottleneck rather than a context bottleneck: the agent has all the information it needs, the prompt is clean, the AGENTS.md is solid, the right files are in scope, and the output is still wrong. In that case, the next move is not more context engineering. It is a stronger model. Move from Sonnet to Opus. Try the extended-thinking mode if your harness supports it. Break the task into smaller pieces the model can handle.
The signal that you have hit a context bottleneck rather than a model bottleneck: the model output is plausible but inconsistent, the agent contradicts itself across turns, it forgets details from earlier in the session, it mixes up file contents. In that case, the next move is the discipline this whole page is about. Tighten the prompt. Cache more. Summarize more. Use sub-agents. Restart sessions when they degrade.
Knowing which kind of failure you are looking at saves time. The two failure modes look superficially similar (both produce wrong output) but the fixes are different. Practice diagnosing which one is happening before reaching for a tool. Applying the wrong fix costs you the time of the fix and leaves the original problem in place.
Closing
Context engineering is what separates an agent that ships from an agent that hallucinates. The same model, with bad context, produces worse output than a smaller model with good context. That is not a marketing claim; it is the observable reality of working with these tools at scale. The Sonnet that cannot find the relevant file is worse than the Haiku that has the right file already in scope. The discipline of getting the right information in front of the model at the right cost is the discipline that determines what the model produces.
The shape of the practice is straightforward. Cache what is stable. Read what is needed. Summarize what is verbose. Dispatch sub-agents for heavy exploration. Keep AGENTS.md dense and structural. Watch the token counter. End sessions before they degrade. Architect the codebase so the agent can navigate it cheaply. None of this is exotic. All of it is consistent. Practiced together, the patterns compound, and the gap between a session-on-rails and a session-fighting-itself widens with every turn.
Treat the context window as the load-bearing resource it is. Not as an unlimited buffer. Not as something the agent will manage for you. As the most precious thing in the workflow, allocated by you, with the discipline you bring to any other resource that has a price and a limit. The agents are good. The models are getting better every six months. The thing that does not get better automatically is the context you give them, and the work you put into that part is the work that compounds across every session for the life of the project.
The rest of this curriculum unpacks the specifics. The dedicated topic on agent instruction files goes deep on AGENTS.md. The topic on prompt engineering covers the in-prompt patterns. The tool comparison topic discusses which harnesses do which parts of context management well. The frame is set here. The depth is in the topics that follow. Read on if you want to ship real software with these tools rather than just produce demos that fall over the moment they meet a real codebase.
